How to combine lazy loading and pagination?

Hi everyone.

I am trying to convert a remote tabular data source into a Dask DataFrame. The remote source is a niche tabular REST API that uses a simple pagination mechanism: the user requests the first page without a cursor and is returned a subset of the result set along with a cursor. Subsequent pages are fetched using the cursor from the preceding page until no cursor is returned, signalling the end of the table.

Is there a way to represent the data as a Dask DataFrame and have it be loaded lazily? I currently have a generator function which loads each page and yields the result set as a pandas DataFrame. This works quite well, but it means that the whole dataset has to be kept in memory before the Dask DataFrame can be used.

Simplified mock-code of what I have now:

import pandas as pd
import dask.dataframe

client = ... # Set up connection to Dask cluster
my_table_client = ... # Client for the paginated REST API

def read_data():
  cursor = None
  while True:
    # Each call returns one page of rows plus the cursor for the next page
    rows, cursor = my_table_client.read(cursor=cursor)

    # rows is a simple list of dictionaries; scatter pushes the page into
    # cluster memory and yields a Future
    yield client.scatter(pd.DataFrame(rows))
    if not cursor:
      break

ddf = dask.dataframe.from_map(
  lambda x: x.result() if not isinstance(x, pd.DataFrame) else x,
  read_data(),
)

Does anyone know how to lazily load a paginated source?

Hi @thomafred, welcome to Dask Discourse forum!

The cursor might be a problem! Do you think you could handle this cursor object without calling read sequentially? This is the key point: Dask will try to read partitions concurrently, especially with a multiprocessing or distributed scheduler. It might work as-is with the threaded scheduler, but I would not bet on it.
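
For example, if the page cursors could be enumerated cheaply up front (the list_cursors call below is purely hypothetical, not something your client necessarily has), every page could be fetched independently and Dask would be free to schedule the reads concurrently:

import pandas as pd
import dask.dataframe

my_table_client = ... # Client for the paginated REST API, as in your snippet

def load_page(cursor):
  # Each task fetches exactly one page; no state is shared between tasks,
  # so the scheduler can run them concurrently
  rows, _ = my_table_client.read(cursor=cursor)
  return pd.DataFrame(rows)

# Hypothetical: a cheap call listing every page cursor up front;
# None stands for the first page, which is requested without a cursor
cursors = [None, *my_table_client.list_cursors()]

ddf = dask.dataframe.from_map(load_page, cursors)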

You can try it though, and I don't think the code you provided implies the data has to be kept in memory. You should just not use client.scatter: the read_data function should yield plain pandas DataFrames.
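
A minimal sketch of that suggestion, keeping your read_data generator but dropping the scatter call (my_table_client is the same client as in your snippet):

import pandas as pd
import dask.dataframe

my_table_client = ... # Client for the paginated REST API

def read_data():
  cursor = None
  while True:
    rows, cursor = my_table_client.read(cursor=cursor)

    # Yield a plain pandas DataFrame; nothing is scattered to the cluster
    yield pd.DataFrame(rows)
    if not cursor:
      break

ddf = dask.dataframe.from_map(lambda df: df, read_data())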