Cython code significantly slower with bag `.map` than sequentially

I have a module written mostly in Cython/C++. When it's called sequentially it's extremely fast, but when I call it with `.map` from a Dask bag it is extremely slow. When I scale up the number of items in the bag, the slowness seems to compound.

The code is fairly straightforward:

    # requires gzip, time, pathlib.Path and dask.bag as db imported at module level
    def extract_in_parallel(self):
        """Runs the extracts in parallel using dask."""
        sequence = []
        for path in Path('datasets/scrappinghub_aeb/html').glob('*.html.gz'):
            with, 'rt', encoding='utf8') as f:
                html =
            item_id = path.stem.split('.')[0]
            sequence.append({
                'item_id': item_id,
                'html': html
            })
        start = time.time()
        bagged = db.from_sequence(sequence)\
            .map(self.extract)  # mapped extraction method (name approximate)
        bagged = bagged.compute()
        self.elapsed_time = time.time() - start
        self.extracts = {item['item_id']: {'articleBody': item['articleBody']} for item in bagged}

The entire test suite is here: wee-benchmarking-tool/ at main · Nootka-io/wee-benchmarking-tool · GitHub. (Also interesting: I see bigger speed-ups from some libraries than others when running in parallel vs. sequentially.)

In this case the code is about 8x slower when run in parallel.

I understand there are cases where parallel processing may not help, but I'm trying to work out what is causing the slowdown and how best to profile it.
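One way to narrow this down (a rough sketch, assuming `bagged` is the mapped bag before `.compute()` is called) would be to time the same computation under Dask's different local schedulers; the synchronous scheduler runs everything in the calling thread, so it takes worker processes and serialisation out of the picture:

    # Sketch: compare the same bag under Dask's local schedulers.
    # A large gap between 'synchronous'/'threads' and 'processes' points at
    # scheduling/serialisation overhead rather than the extraction code itself.
    import time

    for scheduler in ('synchronous', 'threads', 'processes'):
        start = time.time()
        print(scheduler, time.time() - start)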

This seems to be related to the overhead of serialising the objects. Switching to Python's multiprocessing Pool solves the issue. If anyone has any deeper insights, they would be appreciated.
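For context, a minimal sketch of what the multiprocessing variant looks like; `extract_item` and the one-item `sequence` are placeholders for the real Cython-backed extraction call and the list of dicts built in `extract_in_parallel()`:

    # Minimal, self-contained sketch of the multiprocessing.Pool variant.
    # extract_item stands in for the real cython/c++ extraction function.
    from multiprocessing import Pool

    def extract_item(item):
        # placeholder: slice instead of the real articleBody extraction
        return {'item_id': item['item_id'], 'articleBody': item['html'][:100]}

    if __name__ == '__main__':
        sequence = [{'item_id': 'example', 'html': '<html>...</html>'}]  # built as above in practice
        with Pool() as pool:
            results =, sequence)
        extracts = {r['item_id']: {'articleBody': r['articleBody']} for r in results}
        print(extracts)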

@winddude if you have a profile showing this, ideally with py-spy in speedscope format, that could be interesting.
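For reference, a speedscope-format profile can be recorded with something like the following (the script name is a placeholder; `--subprocesses` also samples the worker processes):

    # placeholder script name; adjust to the actual benchmark entry point
    py-spy record --format speedscope --subprocesses -o dask_bag.speedscope.json -- python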

Python's multiprocessing pool → speedscope

Dask bag → speedscope

Speedscope is new to me, so I'm not entirely sure what to make of these.

I may need to take another look at this. It seems like the output doesn't actually include the functions that do the actual work, which is a little confusing.
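One guess (not verified here): the actual work happens inside the Cython/C++ extension, and by default py-spy only samples Python frames, so the native extraction functions would not show up. On Linux or Windows, py-spy's `--native` flag should also capture native frames, e.g.:

    # --native also samples C/C++/Cython frames; script name is a placeholder
    py-spy record --native --format speedscope -o profile.speedscope.json -- python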