What I need is to compute the sum of the size column over every non-overlapping 2-microsecond window (timestamps 1 and 2 form one group, 3 and 4 the next, etc.). The start time is the timestamp of the first row. As you can see, some timestamps are missing, and the size value for those missing timestamps should be treated as 0. With Pandas I can use the reindex function together with date_range to insert rows for the missing timestamps, and after that a rolling window makes the computation easy. However, since there is no reindex function in Dask DataFrame, I have no idea how to do this in Dask. Can someone enlighten me on some ways to implement this?
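For concreteness, here is a minimal sketch of that Pandas approach on made-up data (I use resample here instead of a stepped rolling window; it produces the same non-overlapping 2 µs sums):

```python
import pandas as pd

# Made-up data: integer timestamps in microseconds, with gaps (5 and 8 missing)
df = pd.DataFrame({"size": [1, 3, 4, 2], "timestamp": [4, 6, 7, 9]})

# Turn the integer microsecond column into a DatetimeIndex
df.index = pd.to_datetime(df["timestamp"], unit="us")

# Insert rows for the missing timestamps, treating their size as 0
full = pd.date_range(df.index[0], df.index[-1], freq="1us")
df = df.reindex(full, fill_value=0)

# Sum "size" over non-overlapping 2 us windows anchored at the first row
sums = df["size"].resample("2us", origin="start").sum()
```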
I’m afraid there is no easy solution to do this with Dask currently.
However, maybe a workaround can be found? What format is your original dataset in? Couldn't you use Pandas DataFrame calls on already-ordered parts of the dataset (you could even distribute those with Dask), and once that's done use Dask DataFrame to process the whole time series?
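For instance, something along these lines (just a sketch; the file names are made up, and it assumes each file holds an ordered slice of the series with unique, non-overlapping timestamps):

```python
import dask
import dask.dataframe as dd
import pandas as pd

@dask.delayed
def densify(path):
    # Plain Pandas on one already-ordered piece of the dataset
    part = pd.read_csv(path)
    part.index = pd.to_datetime(part["timestamp"], unit="us")
    # Assumes timestamps are unique within a piece;
    # duplicates would need a groupby first
    full = pd.date_range(part.index[0], part.index[-1], freq="1us")
    return part.reindex(full, fill_value=0)

# Made-up chunk files, assumed not to overlap in time
parts = [densify(p) for p in ["part0.csv", "part1.csv"]]
ddf = dd.from_delayed(parts)  # back to a Dask DataFrame for the rest
```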
Hi @guillaumeeb, thank you for your reply, and sorry for the delayed response.
My original dataset is a CSV file analogous to the following:
```
   size  timestamp
0     1          4
1     2          4
2     3          6
3     4          7
```
It has one timestamp column, represented as an integer in microseconds (µs); the other columns are all scalar types. The timestamp column is guaranteed to be increasing (though not strictly), so there may be repeated timestamps, as in the example above.
Currently I need to use a groupby to aggregate the rows that share the same timestamp, and because the original dataset is huge, I have to do this in chunks.
My solution now (sketched in code after this list):
1. Split the data into chunks.
2. Group by the timestamp column to aggregate rows that share a timestamp.
3. Use Pandas's reindex feature to fill in the missing timestamps.
4. Concatenate the results into a temporary file (CSV or Parquet). This step also has to deal with a timestamp that straddles a chunk boundary, since step 1 cannot split the data in a way that guarantees each chunk has unique timestamps.
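A rough sketch of those steps (the file name and chunk size are made up, and the boundary handling is simplified to carrying the last timestamp of each chunk over into the next one):

```python
import pandas as pd

reader = pd.read_csv("data.csv", chunksize=1_000_000)  # step 1: chunks

carry = None   # partially-aggregated last timestamp of the previous chunk
pieces = []
for chunk in reader:
    if carry is not None:
        chunk = pd.concat([carry, chunk], ignore_index=True)
    # step 2: aggregate rows sharing the same timestamp
    agg = chunk.groupby("timestamp")["size"].sum()
    # the last timestamp may continue in the next chunk, so hold it back
    carry = agg.iloc[[-1]].reset_index()
    agg = agg.iloc[:-1]
    if len(agg) > 0:
        # step 3: fill in missing timestamps inside this piece
        # (gaps *between* pieces would still need the same treatment)
        agg = agg.reindex(range(agg.index[0], agg.index[-1] + 1), fill_value=0)
        pieces.append(agg)
pieces.append(carry.set_index("timestamp")["size"])
# step 4: concatenate (or append each piece to a temp CSV/Parquet file)
result = pd.concat(pieces)
```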
Could you produce a minimal reproducible example of your algorithm? I guess all you would need for that is some code to generate fake data. Then you could just give us the code you are using with Pandas, and what you are trying to do with Dask. It would be a lot easier to help then.
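For instance, the fake data could be as simple as this (a sketch with arbitrary numbers):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000
# non-strictly increasing integer timestamps in us, with gaps and repeats
timestamps = np.cumsum(rng.integers(0, 3, size=n))
fake = pd.DataFrame({"size": rng.integers(1, 10, size=n),
                     "timestamp": timestamps})
fake.to_csv("fake_data.csv", index=False)
```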