TypeError: cannot pickle 'fasttext_pybind.fasttext' object

I want to train a text classification model and vectorize the data using fastText.

This is the code so far:

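import fasttext
import numpy as np
from dask.distributed import Client, LocalCluster
from dask_ml.wrappers import Incremental
from sklearn.linear_model import SGDClassifier

cluster = LocalCluster(n_workers=4, threads_per_worker=1)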
client = Client(cluster)

vectorizer = fasttext.train_unsupervised('tests/data/case_df_50k.csv')

def transform(texts):
    texts = [text.split() for text in texts]

    vectors = []
    for text in texts:
        word_vectors = []
        # get vector for each word in text and average
        for word in text:
            word_vectors.append(vectorizer.get_word_vector(word))

        vectors.append(np.mean(word_vectors, axis=0))

    return np.array(vectors)

vectors = ddf.map_partitions(lambda df: transform(df.text))

model = SGDClassifier()
model = Incremental(model)
model.fit(vectors, ddf['label'], classes=ddf['label'].unique())

When I run this pipeline, I get the error: TypeError: cannot pickle 'fasttext_pybind.fasttext' object (the full trace is pretty long).

What could I do?

Also, suggestions on how I can preprocess and vectorize my data memory-efficiently would be appreciated! Thanks.

Hi @daskee,

This suggests that the fasttext library uses some low-level objects that are not serializable, and thus cannot be exchanged between Client/Scheduler/Workers.

One thing you could try would be to import fasttext and create the vectorizer object inside the transform function.
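
Here is a rough sketch of what that could look like. It assumes the model has already been trained once and saved to disk with `save_model`, so each worker only loads it instead of retraining. `MODEL_PATH` and `load_vectorizer` are made-up names for illustration, and the path has to be readable by every worker. The zero vector for empty texts is also just my assumption about what you would want there.

```python
import numpy as np

# Hypothetical path: the model must have been saved here beforehand,
# and the file must be readable by every worker.
MODEL_PATH = "fasttext_model.bin"


def load_vectorizer(path=MODEL_PATH):
    # Import and load inside the function, so both happen in the worker
    # process and the fasttext object never has to be pickled.
    import fasttext
    return fasttext.load_model(path)


def transform(texts, model=None):
    # `model` is only a convenience for local testing; on the cluster it
    # stays None and the model is loaded inside the task.
    if model is None:
        model = load_vectorizer()

    dim = model.get_dimension()
    vectors = []
    for text in texts:
        words = text.split()
        if words:
            # average the word vectors of the text
            word_vectors = [model.get_word_vector(word) for word in words]
            vectors.append(np.mean(word_vectors, axis=0))
        else:
            # assumption: represent an empty text by a zero vector
            vectors.append(np.zeros(dim, dtype=np.float32))

    return np.array(vectors)
```

With this, you would call `vectorizer.save_model(MODEL_PATH)` once after training, and your existing `ddf.map_partitions(lambda df: transform(df.text))` call can stay as it is. Note that the model is loaded once per partition, so very small partitions will pay that loading cost many times.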