I want to train a text classification model, vectorizing the data with fastText.
This is the code so far:
import fasttext
import numpy as np
from dask.distributed import Client, LocalCluster
from dask_ml.wrappers import Incremental
from sklearn.linear_model import SGDClassifier

cluster = LocalCluster(n_workers=4, threads_per_worker=1)
client = Client(cluster)

# train unsupervised fastText embeddings on the raw text file
vectorizer = fasttext.train_unsupervised('tests/data/case_df_50k.csv')

def transform(texts):
    texts = [text.split() for text in texts]
    vectors = []
    for text in texts:
        word_vectors = []
        # get the vector for each word in the text and average them
        for word in text:
            word_vectors.append(vectorizer.get_word_vector(word))
        vectors.append(np.mean(word_vectors, axis=0))
    return np.array(vectors)

# ddf is a dask DataFrame with 'text' and 'label' columns (loaded elsewhere)
vectors = ddf.map_partitions(lambda df: transform(df.text))

model = SGDClassifier()
model = Incremental(model)
model.fit(vectors, ddf['label'], classes=ddf['label'].unique())
When I run this pipeline, I get the following error (the full traceback is quite long):

TypeError: cannot pickle 'fasttext_pybind.fasttext' object

What can I do to work around this?
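My guess is that the lambda I pass to map_partitions closes over vectorizer, so dask has to pickle the underlying fasttext_pybind object when it ships the task to the workers. Would something like the following be the right fix? It's only a minimal sketch, assuming the model round-trips through fastText's save_model/load_model and that the saved file (the 'fasttext_model.bin' path is just a placeholder) is visible to every worker:

import fasttext
import numpy as np

# persist the trained model once on the client; each worker reloads it
# from disk instead of receiving it through pickle
vectorizer.save_model('fasttext_model.bin')

_worker_model = None  # lazy per-process cache of the reloaded model

def transform(texts, model_path='fasttext_model.bin'):
    global _worker_model
    if _worker_model is None:
        _worker_model = fasttext.load_model(model_path)
    vectors = []
    for text in texts:
        word_vectors = [_worker_model.get_word_vector(w) for w in text.split()]
        vectors.append(np.mean(word_vectors, axis=0))
    return np.array(vectors)

vectors = ddf.map_partitions(lambda df: transform(df.text))

Is that the right direction, or is there a cleaner way to share the model across dask workers?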
Also, suggestions on how I can preprocess and vectorize my data memory-efficiently would be appreciated!
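For concreteness, here is the plain-pandas fallback I have in mind if the dask route stays awkward. Again just a sketch under assumptions: the CSV has 'text' and 'label' columns, chunks of ~10k rows fit comfortably in memory, and fastText's get_sentence_vector (which does the word-vector averaging internally) can replace my manual loop:

import fasttext
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

CSV_PATH = 'tests/data/case_df_50k.csv'

model = fasttext.load_model('fasttext_model.bin')  # placeholder path from above
clf = SGDClassifier()

# cheap first pass over the label column only, to collect the class set
# that the first partial_fit call requires
classes = pd.read_csv(CSV_PATH, usecols=['label'])['label'].unique()

# stream the file in chunks so only one chunk of vectors is ever in memory
for chunk in pd.read_csv(CSV_PATH, chunksize=10_000):
    X = np.vstack([model.get_sentence_vector(t) for t in chunk['text']])
    clf.partial_fit(X, chunk['label'], classes=classes)

Would that count as memory-efficient, or is there a better pattern? Thanks!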