Reading arbitrary files within bag

Hi, I need to work with Impala's *.har files on remote HDFS storage. I can identify the paths myself and fetch the files from the shared storage inside the bag's map, but that way I am redoing work that the typical Dask read operations already do for me (e.g. globbing paths and downloading). Is there a more Dask-friendly way of approaching this problem? If I wanted to implement another bag read operation, which of the Dask-wide read operations would be closest to my use case to draw inspiration from? Thanks!


@Antymon Hi and welcome to Discourse!

Is there a more dask-friendly way of going around this problem?

Since Dask doesn’t support *.har files directly, your approach sounds good to me!
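For reference, here is a minimal sketch of that approach that lets fsspec (the library Dask itself uses for remote storage) handle the globbing and opening, so only the per-file read lives in the bag's map. The helper names `read_payload` and `read_files` are made up for illustration, and the sketch only returns raw bytes: decoding the *.har content is left to you, since I am not assuming any particular parser.

```python
import dask.bag as db
import fsspec


def read_payload(open_file):
    """Read one file fully and return (path, raw bytes).

    `open_file` is an fsspec OpenFile; it is only opened here, on the worker.
    """
    with open_file as f:
        return open_file.path, f.read()


def read_files(urlpath, **storage_options):
    """Hypothetical helper: one bag element per file matching `urlpath`."""
    # fsspec expands the glob and picks the filesystem from the URL protocol.
    files = fsspec.open_files(urlpath, mode="rb", **storage_options)
    # One file per partition, so every file is read by its own task.
    return db.from_sequence(list(files), partition_size=1).map(read_payload)
```

Usage would look something like `read_files("hdfs://namenode/some/dir/*.har").map(parse)`, where both the URL and `parse` are placeholders for your own path and decoding function.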

If I wanted to implement another bag read operation, which of the dask-wide read operations would be the closest to my use case to draw some inspiration from?

Maybe read_text (dask.bag.read_text)?
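As far as I know, read_text is put together from three pieces: fsspec's `open_files` to expand the path, one delayed task per file (or block), and `from_delayed` to assemble the bag. A custom reader in the same spirit could be sketched as below. `load_partition` and `read_binary` are hypothetical names, not part of Dask, and the sketch skips the block splitting that read_text supports, so each file becomes exactly one partition.

```python
import dask
import dask.bag as db
import fsspec


def load_partition(open_file):
    """Return the contents of one file as a single-element partition."""
    with open_file as f:
        # A partition must be an iterable of elements, hence the list.
        return [f.read()]


def read_binary(urlpath, **storage_options):
    """Hypothetical reader modelled loosely on dask.bag.read_text."""
    files = fsspec.open_files(urlpath, mode="rb", **storage_options)
    if not files:
        raise FileNotFoundError(f"no files matched {urlpath!r}")
    # Nothing is read yet: each file gets a lazy task that runs on a worker.
    parts = [dask.delayed(load_partition)(of) for of in files]
    return db.from_delayed(parts)
```

The difference from mapping over a bag of paths is mostly cosmetic, but wrapping it like this gives you a single call that accepts a glob and storage options, which is the part of the built-in readers you said you were re-implementing.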
