Cython code significantly slower with bag `.map` than sequentially

winddude · October 31, 2022, 9:48pm

I have a module written mostly in Cython/C++. When it’s called sequentially it’s extremely fast, but when I call it with `.map` from a dask bag it is extremely slow. As I scale up the number of items in the bag, the slowdown seems to compound.

The code is fairly straightforward:

    def extract_in_parallel(self):
        """
        runs the extracts in parallel using dask.
        :return:
        """
        sequence = []
        for path in Path('datasets/scrappinghub_aeb/html').glob('*.html.gz'):
            with gzip.open(path, 'rt', encoding='utf8') as f:
                html = f.read()
            item_id = path.stem.split('.')[0]
            sequence.append({
                'item_id': item_id,
                'html': html
            })
        start = time.time()
        bagged = db.from_sequence(sequence)\
            .map(self.parallel_extract)
        bagged = bagged.compute()
        self.elapsed_time = time.time() - start
        self.extracts = {item['item_id']:{'articleBody': item['articleBody']} for item in bagged}

The entire test suite is here: wee-benchmarking-tool/__init__.py in the Nootka-io/wee-benchmarking-tool repository on GitHub. (Also interesting: I see bigger speedups from some libraries than others when running in parallel vs. sequentially.)

In this case the code is about 8x slower when run in parallel.

I understand that parallel processing is sometimes not worthwhile, but I’m trying to work out what is causing the slowdown and how best to profile it.
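
As a baseline for the profiling question, a minimal sketch with the standard-library `cProfile` over the sequential path could look like the following. The `extract` function and the synthetic `items` are hypothetical stand-ins, not the real Cython/C++ extractor or dataset, and this only measures the single-process case.

```python
import cProfile
import io
import pstats


def extract(item):
    # Hypothetical stand-in for the real Cython/C++ extractor.
    return {'item_id': item['item_id'], 'articleBody': item['html'].upper()}


def profile_sequential(items):
    """Run extract() over items under cProfile and return (results, report text)."""
    profiler = cProfile.Profile()
    profiler.enable()
    results = [extract(item) for item in items]
    profiler.disable()

    # Render the ten most expensive entries by cumulative time into a string.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats('cumulative').print_stats(10)
    return results, buffer.getvalue()


# Synthetic documents, roughly shaped like the items built in extract_in_parallel.
items = [{'item_id': str(i), 'html': '<p>hello</p>' * 1000} for i in range(100)]
results, report = profile_sequential(items)
print(report)
```

Comparing the total time reported here with the wall-clock time of the bag run gives a rough idea of how much of the parallel run is spent outside the extractor itself.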

winddude · November 1, 2022, 10:14pm

Seems to be related to the overhead of serialising the objects. Switching to Python’s multiprocessing pool solves the issue. If anyone has any deeper insights, they would be appreciated.
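
One rough way to check the serialisation theory is to time a pickle round trip over items shaped like the ones in the bag. This is only a sketch under assumptions: the `pickle_overhead` helper and the synthetic `sample` data are made up for illustration, and dask's process-based scheduler may serialise things differently. In particular, the bound method `self.parallel_extract` presumably has to be shipped to the workers along with `self`, which this does not measure.

```python
import pickle
import time


def pickle_overhead(items):
    """Return (total_bytes, seconds) for a pickle round trip of every item."""
    total_bytes = 0
    start = time.perf_counter()
    for item in items:
        payload = pickle.dumps(item, protocol=pickle.HIGHEST_PROTOCOL)
        total_bytes += len(payload)
        pickle.loads(payload)  # include the deserialisation cost as well
    return total_bytes, time.perf_counter() - start


# Hypothetical stand-in for the real HTML documents (about 100 kB each).
sample = [
    {'item_id': str(i), 'html': '<html>' + 'x' * 100_000 + '</html>'}
    for i in range(50)
]
total_bytes, seconds = pickle_overhead(sample)
print(f'{total_bytes / 1e6:.1f} MB pickled and unpickled in {seconds:.4f} s')
```

If this number is small compared with the slowdown seen in the bag run, plain payload size is probably not the whole story, and the cost of pickling the object behind the mapped method would be the next thing to look at.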

gjoseph92 · November 4, 2022, 10:14pm

@winddude if you have a profile showing this, ideally with py-spy in speedscope format, that could be interesting.

winddude · November 7, 2022, 5:57pm

Python’s multiprocessing pool → speedscope

dask bag → speedscope

Now, speedscope is new to me, so I’m not fully sure what to make of them.

winddude · November 15, 2022, 2:34am

I may need to take another look at this. Seems like the output doesn’t actually include the functions that do the actual work. A little confused.
