How to combine lazy loading and pagination?

Hi everyone.

I am trying to convert a remote tabular data source into a Dask DataFrame. The remote source is a niche tabular REST API that uses a simple pagination mechanism: the user requests the first page without a cursor and is returned a subset of the result set along with a cursor. Subsequent pages are fetched using the cursor from the preceding page until no cursor is returned, signalling the end of the table.

Is there a way to represent the data as a Dask DataFrame and have it be loaded lazily? I currently have a generator function which loads each page and yields the result set as a pandas DataFrame. This works quite well, but it means that the whole dataset has to be kept in memory before the Dask DataFrame can be used.

Simplified mock-code of what I have now:

import pandas as pd
import dask.dataframe

client = ... # Set up connection to Dask cluster
my_table_client = ... # Client for the paginated REST API

def read_data():
  cursor = None
  while True:
    # Each call returns one page of rows plus the cursor for the next page
    rows, cursor = my_table_client.read(cursor=cursor)

    # rows is a simple list of dictionaries; scatter pushes the page into
    # cluster memory and yields a Future
    yield client.scatter(pd.DataFrame(rows))
    if not cursor:
      break

ddf = dask.dataframe.from_map(
  lambda x: x.result() if not isinstance(x, pd.DataFrame) else x,
  read_data(),
)

Does anyone know how to lazily load a paginated source?

Hi @thomafred, welcome to Dask Discourse forum!

The cursor might be a problem! Do you think you could handle this cursor object without calling read sequentially? This is the key point: Dask will try to read partitions concurrently, especially with a multiprocessing or distributed scheduler. It might work as-is with the threaded scheduler, but I would not bet on it.
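
For example, if the page cursors could be enumerated cheaply up front (the list_cursors call below is purely hypothetical, not something your client necessarily has), every page could be fetched independently and Dask would be free to schedule the reads concurrently:

import pandas as pd
import dask.dataframe

my_table_client = ... # Client for the paginated REST API, as in your snippet

def load_page(cursor):
  # Each task fetches exactly one page; no state is shared between tasks,
  # so the scheduler can run them concurrently
  rows, _ = my_table_client.read(cursor=cursor)
  return pd.DataFrame(rows)

# Hypothetical: a cheap call listing every page cursor up front;
# None stands for the first page, which is requested without a cursor
cursors = [None, *my_table_client.list_cursors()]

ddf = dask.dataframe.from_map(load_page, cursors)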

You can try it though, and I don't think the code you provided implies the data has to be kept in memory. You should just not use client.scatter: the read_data function should yield plain pandas DataFrames.
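
A minimal sketch of that suggestion, keeping your read_data generator but dropping the scatter call (my_table_client is the same client as in your snippet):

import pandas as pd
import dask.dataframe

my_table_client = ... # Client for the paginated REST API

def read_data():
  cursor = None
  while True:
    rows, cursor = my_table_client.read(cursor=cursor)

    # Yield a plain pandas DataFrame; nothing is scattered to the cluster
    yield pd.DataFrame(rows)
    if not cursor:
      break

ddf = dask.dataframe.from_map(lambda df: df, read_data())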