54k small files. Is Dask good for this?

I’d like to test out Dask for analysing thousands of files. They are small, but reading them all into memory is problematic.

I have a general question: is Dask intended to work with big files only?

I could preprocess the small files (JSON) and concatenate them, but I would like to start the data analysis without modifying the source file structure, since I’m going to get even more new files later, so it’s much better to build a data pipeline on top of the files as they are.

Hi @kvdm.dev, welcome to this community!

The short answer is: Dask can perfectly well work with a lot of small files.

However, with 54k files, you may have to be careful about two things:

  • Avoid generating a task graph that is too complicated or has too many tasks. Dask adds some overhead for each task, and the Scheduler might begin to show some latency when you go above 100k to 1m tasks. This means, for example, that if you run into performance trouble with DataFrame.read_json, you might want to try Dask Bag or write some lower-level code that reads files in batches within a single task (see the sketch after this list).
  • Reading 54k small files may be limited by IO. Again, try to read the files in batches, but at some point this will probably be much less performant than reading a few big files. In any case, this is not related to Dask.
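
To illustrate the second option above, here is a minimal sketch of reading files in batches within a single task. The data/*.json layout, the batch size and the read_batch helper are just assumptions for the example, not something Dask provides:

```python
import glob
import json

import dask.bag as db

# Hypothetical layout: the 54k small JSON files live under ./data/
paths = sorted(glob.glob("data/*.json"))

# Group the paths into batches so that each Dask task reads many files,
# which keeps the task graph small (~540 tasks for 54k files here).
batch_size = 100
batches = [paths[i:i + batch_size] for i in range(0, len(paths), batch_size)]

def read_batch(file_paths):
    """Read and parse one batch of JSON files inside a single task."""
    records = []
    for path in file_paths:
        with open(path) as f:
            records.append(json.load(f))
    return records

# One batch per partition; each partition is produced by a single read task.
bag = db.from_sequence(batches, npartitions=len(batches)).map(read_batch).flatten()
```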

but at some point this will probably be much less performant than reading a few big files.

Agreed, but I’d prefer to stay with small files (average size is 14 KB) and avoid concatenating them into bigger files, as explained above.

Again, try to read the files in batches,

Doesn’t Dask do this automatically, so that I simply set a batch size?

As far as I know, at least for DataFrame.read_json, there is no such thing as a batch size. You can configure blocksize, but it is only used to split big files, not to concatenate the content of several files.

So you mean I have to organize reading the files in batches manually?

I thought dask.bag.Bag could do it.

Yes, you are right: if you rely on read_text, you can use the kwarg files_per_partition.
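
For instance, something along these lines (the data/*.json glob and the one-JSON-document-per-line assumption are placeholders you would adapt to your layout):

```python
import json

import dask.bag as db

# Pack 100 small files into each partition so the task graph stays small.
lines = db.read_text("data/*.json", files_per_partition=100)

# Each line is assumed to be one JSON document; adapt the parsing as needed.
records = lines.map(json.loads)
print(records.count().compute())
```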

Could you briefly explain what a partition is in the Bag context?

As I see in the documentation, only DataFrame has such a property.

Does it mean that a Bag will take files_per_partition into account only when it gets converted to a DataFrame via bag.to_dataframe()?

How does the computing mechanism differ depending on files_per_partition when I don’t do such a conversion?

A brief explanation would be enough to answer these questions; anyway, I’ll have to test it out, as I’m a real beginner with Dask.

Bags are explained here:
https://docs.dask.org/en/stable/bag-creation.html#create-dask-bags

A Bag is an unordered collection of items. Dask splits this collection into several sub-collections called partitions. Each partition can be processed independently. For example, the list [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] could be split into two partitions: [0, 1, 2, 3, 4] and [5, 6, 7, 8, 9]. You can control the number of partitions during Bag creation, as mentioned above.
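
Here is a tiny illustration of that split:

```python
import dask.bag as db

# Ten items split into two partitions of five items each.
b = db.from_sequence(range(10), npartitions=2)

print(b.npartitions)                      # 2
print(b.map(lambda x: x * 10).compute())  # [0, 10, 20, ..., 90]
```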

No, a Bag is definitely a partitioned collection.
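
To make that concrete, here is a small sketch (same hypothetical data/*.json layout as above) showing that the partitioning applies to the Bag itself, before any to_dataframe() conversion:

```python
import json

import dask.bag as db

bag = db.read_text("data/*.json", files_per_partition=100).map(json.loads)

# The Bag itself is already partitioned: here each partition corresponds to
# roughly 100 files and is processed by its own task, no DataFrame involved.
print(bag.npartitions)

# map_partitions hands one whole partition to the function per task.
sizes = bag.map_partitions(lambda part: [sum(1 for _ in part)])
print(sizes.compute())
```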
