54k small files. Is Dask good for this?

I’d like to test out Dask for analysing thousands of files. They are small, but reading them all into memory is problematic.

I have a general question: is Dask intended to work with big files only?

I could preprocess the small files (JSON) and concatenate them, but I would like to start the data analysis without modifying the source file structure, since I’m going to get even more new files later, so it’s much better to build a data pipeline on top of the files as they are.

Hi @kvdm.dev, welcome to this community!

The short answer is: Dask can perfectly well work with a lot of small files.

However, with 54k files, you may have to be careful about two things:

  • Avoid generating a task graph that is too complicated or has too many tasks. Dask adds some overhead for each task, and the Scheduler might begin to show some latency when you go above 100k to 1m tasks. This means, for example, that if you run into performance trouble with DataFrame.read_json, you might want to try Dask Bag or write some lower-level code that reads files in batches within a single task (see the sketch after this list).
  • Reading 54k small files may be limited by IO. Again, try to read the files in batches, but at some point this will probably be much less performant than reading a few big files. In any case, this is not related to Dask.
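
To illustrate the second option above, here is a minimal sketch of reading files in batches within a single task. The data/*.json layout, the batch size and the read_batch helper are just assumptions for the example, not something Dask provides:

```python
import glob
import json

import dask.bag as db

# Hypothetical layout: the 54k small JSON files live under ./data/
paths = sorted(glob.glob("data/*.json"))

# Group the paths into batches so that each Dask task reads many files,
# which keeps the task graph small (~540 tasks for 54k files here).
batch_size = 100
batches = [paths[i:i + batch_size] for i in range(0, len(paths), batch_size)]

def read_batch(file_paths):
    """Read and parse one batch of JSON files inside a single task."""
    records = []
    for path in file_paths:
        with open(path) as f:
            records.append(json.load(f))
    return records

# One batch per partition; each partition is produced by a single read task.
bag = db.from_sequence(batches, npartitions=len(batches)).map(read_batch).flatten()
```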

but at some point this will probably be much less performant than reading a few big files.

Agreed, but I’d prefer to stay with small files (average size is 14 KB) and avoid concatenating them into bigger files, as explained above.

Again, try to read the files in batches,

Doesn’t Dask do this automatically, so that I simply set a batch size?

As far as I know, at least for DataFrame.read_json, there is no such thing as a batch size. You can configure blocksize, but it is only used to split big files, not to concatenate the content of several files.

So you mean I have to organize reading the files in batches manually?

I thought dask.bag.Bag could do it.

Yes, you are right: if you rely on read_text, you can use the kwarg files_per_partition.
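
For instance, something along these lines (the data/*.json glob and the one-JSON-document-per-line assumption are placeholders you would adapt to your layout):

```python
import json

import dask.bag as db

# Pack 100 small files into each partition so the task graph stays small.
lines = db.read_text("data/*.json", files_per_partition=100)

# Each line is assumed to be one JSON document; adapt the parsing as needed.
records = lines.map(json.loads)
print(records.count().compute())
```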

Could you briefly explain what a partition is in the Bag context?

As I see in the documentation, only DataFrame has such a property.

Does it mean that a Bag will take files_per_partition into account only when it gets converted to a DataFrame via bag.to_dataframe()?

How does the computing mechanism differ depending on files_per_partition when I don’t do such a conversion?

A brief explanation would be enough to answer these questions; anyway, I’ll have to test it out, as I’m a real beginner with Dask.

Bags are explained here:
https://docs.dask.org/en/stable/bag-creation.html#create-dask-bags

A Bag is an unordered collection of items. Dask splits this collection into several sub-collections called partitions. Each partition can be processed independently. For example, the list [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] could be split into two partitions: [0, 1, 2, 3, 4] and [5, 6, 7, 8, 9]. You can control the number of partitions during Bag creation, as mentioned above.
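
Here is a tiny illustration of that split:

```python
import dask.bag as db

# Ten items split into two partitions of five items each.
b = db.from_sequence(range(10), npartitions=2)

print(b.npartitions)                      # 2
print(b.map(lambda x: x * 10).compute())  # [0, 10, 20, ..., 90]
```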

No, a Bag is definitely a partitioned collection.
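
To make that concrete, here is a small sketch (same hypothetical data/*.json layout as above) showing that the partitioning applies to the Bag itself, before any to_dataframe() conversion:

```python
import json

import dask.bag as db

bag = db.read_text("data/*.json", files_per_partition=100).map(json.loads)

# The Bag itself is already partitioned: here each partition corresponds to
# roughly 100 files and is processed by its own task, no DataFrame involved.
print(bag.npartitions)

# map_partitions hands one whole partition to the function per task.
sizes = bag.map_partitions(lambda part: [sum(1 for _ in part)])
print(sizes.compute())
```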
