Hi everyone.
I am trying to convert a remote tabular datasource into a dask dataframe. The remote datasource is a niche tabular REST-API that exposes its data through a simple cursor-based pagination mechanism. The client requests the first page without a cursor and gets back a subset of the result-set together with a cursor. Subsequent pages are fetched using the cursor from the preceding page until no cursor is returned, signalling the end of the table.
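For context, the raw interaction looks roughly like this (the endpoint, parameter and field names below are just placeholders for the real API, but the cursor handshake is the same; the important constraint is that each cursor is only known after the previous page has been fetched, so pages cannot be enumerated up front):

import requests

def iter_pages(base_url):
    # Yield one page of rows (a list of dicts) at a time
    cursor = None
    while True:
        params = {"cursor": cursor} if cursor else {}
        payload = requests.get(base_url, params=params).json()
        yield payload["rows"]            # this page's subset of the result-set
        cursor = payload.get("cursor")   # missing/None once the table is exhausted
        if not cursor:
            break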
Is there a way to represent the data as a dask dataframe and have it be loaded lazily? I currently have a generator function which loads each page and yields its result-set as a pandas dataframe. This works quite well, but it means the whole dataset has to be kept in memory before the dask dataframe can be used.
Simplified mock-code of what I have now:
import pandas as pd
import dask.dataframe

client = ...  # Set up connection to Dask cluster

def read_data():
    cursor = None
    while True:
        # rows is a simple list of dictionaries; cursor is None once the last page is reached
        rows, cursor = my_table_client.read(cursor=cursor)
        # Scatter each page to the cluster so the generator only yields a future locally
        yield client.scatter(pd.DataFrame(rows))
        if not cursor:
            break

ddf = dask.dataframe.from_map(
    lambda x: x.result() if not isinstance(x, pd.DataFrame) else x,
    read_data(),
)
Does anyone know how to lazily load a paginated source like this?