I have a module written in mostly cython/c++, when it’s called sequentially it’s extremely fast, when I call it with .map from a dask bag, it is extremely slow. When I scale up the number of items in the bag the slowness seems to compound.
The code is fairly straight forward,
def extract_in_parallel(self):
"""
runs the extracts in parallel using dask.
:return:
"""
sequence = []
for path in Path('datasets/scrappinghub_aeb/html').glob('*.html.gz'):
with gzip.open(path, 'rt', encoding='utf8') as f:
html = f.read()
item_id = path.stem.split('.')[0]
sequence.append({
'item_id': item_id,
'html': html
})
start = time.time()
bagged = db.from_sequence(sequence)\
.map(self.parallel_extract)
bagged = bagged.compute()
self.elapsed_time = time.time() - start
self.extracts = {item['item_id']:{'articleBody': item['articleBody']} for item in bagged}
The entire test suite is here: wee-benchmarking-tool/__init__.py at main · Nootka-io/wee-benchmarking-tool · GitHub (Also interesting, I see bigger speed ups from some libraries than others, when running in parallel vs sequential)
In this case the code is about 8x slower when run in parallel.
I understand there are sometimes when parallel processing may not be necessary, but I’m trying to sort out what the slow down may be caused by, and how to best profile this.