I’m dumping tables with a SQL query to Parquet files. When the table’s id is an int everything works as expected, but when it’s a string the divisions don’t exist when I load the data back with read_parquet.
Let’s say I have two different tables corresponding to df1 and df2. df1 has an int index and df2 has a string index.
Working example:
df1 = dd.read_sql_query(df1_query, con=con_string, index_col='id', bytes_per_chunk="256 MiB")
dd.to_parquet(df1, 'data/output/df1', engine='pyarrow', compression='snappy')
...
df1 = dd.read_parquet("data/output/df1", engine='pyarrow', calculate_divisions=True)
assert df1.known_divisions  # all good!
Failing example:
import string
string_divisions = list(string.ascii_lowercase)
df2 = dd.read_sql_query(df2_query, con=con_string, index_col='id', divisions=string_divisions)
dd.to_parquet(df2, 'data/output/df2', engine='pyarrow', compression='snappy')
...
df2 = dd.read_parquet("data/output/df2", engine='pyarrow', calculate_divisions=True)
assert df2.known_divisions  # not good!
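For what it’s worth, my understanding is that calculate_divisions reconstructs divisions from the per-row-group min/max statistics of the index column, so it can only succeed if those ranges are sorted and non-overlapping across the files. A small sketch to inspect what was actually written (the path and the 'id' column name are taken from the example above):

import glob
import pyarrow.parquet as pq

# Print the min/max of the 'id' index column for every row group in every
# file; divisions can only be recovered if these ranges are sorted and
# non-overlapping across the whole dataset.
for path in sorted(glob.glob("data/output/df2/*.parquet")):
    md = pq.ParquetFile(path).metadata
    for rg in range(md.num_row_groups):
        for col in range(md.num_columns):
            chunk = md.row_group(rg).column(col)
            if chunk.path_in_schema == "id" and chunk.statistics is not None:
                print(path, rg, chunk.statistics.min, chunk.statistics.max)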
Any help appreciated
EDIT:
Could this be related to a mismatch between how my PG instance sorts strings (collation “en_US.UTF-8”) and how Python sorts strings?
If I check the first characters of the string ids according to the PG sort order, I get:
['0', '1', '2', '3', '4', '5', '7', '8', '9', 'a', '.', '-', 'b', 'c', '%', 'd', 'e', 'f', 'g', 'h', 'i', "'", 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
Trying to use that as my divisions with read_sql_query gives ValueError: divisions must be sorted.
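To check the mismatch on the Python side: Python compares strings by Unicode code point, so punctuation such as '-' and '.' sorts before digits and letters, whereas the en_US.UTF-8 collation apparently gives punctuation a lower weight and interleaves it with the letters. A minimal demonstration (using an excerpt of the PG-observed order above):

pg_order = ['9', 'a', '.', '-', 'b']  # excerpt of the PG order shown above
print(pg_order == sorted(pg_order))   # False: not sorted by Python's rules
print(sorted(pg_order))               # ['-', '.', '9', 'a', 'b']

If that is the cause, one direction might be to force a bytewise collation on the Postgres side (e.g. COLLATE "C" on the id column in the query) so the database ordering matches Python's, though I haven't verified this.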