The Apache Parquet project is looking for real-world samples and feedback

Hello all,

I know Parquet is used quite a bit in the Dask community, so I figured it could be useful to post this information here.

The Apache Parquet project, which maintains the Parquet file format, is actively discussing improvements to the format. To make more informed decisions, we want to encourage users of Parquet files to provide real-world samples of their Parquet usage. Such samples should help us measure performance on a more realistic set of use cases.

Specifically, the Parquet project is currently interested in “footers” of Parquet files, a piece of binary metadata at the end of Parquet files (hence the name :wink: ) that’s central to reading and decoding Parquet. Such footers can be extracted independently of the actual data, and even scrubbed/anonymized to avoid revealing any sensitive information.
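To illustrate why the footer can be pulled out without touching the data: a Parquet file ends with the Thrift-serialized footer, followed by a 4-byte little-endian footer length and the 4-byte magic `PAR1`. Below is a minimal sketch that reads only those trailing bytes. The function name `extract_footer_bytes` is my own invention for illustration, it only handles plaintext (unencrypted) footers, and it does no scrubbing or anonymizing, so it is not a substitute for the procedure described in the repository's README.

```python
import struct

MAGIC = b"PAR1"  # trailing magic of a Parquet file with a plaintext footer


def extract_footer_bytes(path):
    """Return the raw, still Thrift-encoded footer of a Parquet file.

    Hypothetical helper for illustration only: it does not decode or
    anonymize the metadata, it just slices it out of the file.
    """
    with open(path, "rb") as f:
        f.seek(0, 2)  # jump to the end to learn the file size
        file_size = f.tell()
        # Smallest possible file: leading magic + footer length + trailing magic
        if file_size < 12:
            raise ValueError("file is too small to be a Parquet file")

        # The last 8 bytes are: <4-byte little-endian footer length><magic>
        f.seek(file_size - 8)
        tail = f.read(8)
        (footer_len,) = struct.unpack("<I", tail[:4])
        if tail[4:] != MAGIC:
            raise ValueError("missing trailing PAR1 magic (not Parquet, or encrypted footer)")
        if footer_len > file_size - 12:
            raise ValueError("footer length is larger than the file allows")

        # The footer sits immediately before the length field
        f.seek(file_size - 8 - footer_len)
        return f.read(footer_len)
```

Because only the tail of the file is read, this works the same on a multi-gigabyte file as on a tiny one. Keep in mind that the raw footer still contains column names and statistics, so anything sensitive would need scrubbing before sharing.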

Your participation is especially useful if you have Parquet files with complex or unusual schemas (for example many columns, deeply nested types, etc.), or if you work with very large Parquet datasets. If you’re interested in helping, a repository has been set up for this: apache/parquet-benchmark on GitHub. The README there will guide you, and you can of course open issues there if you’re not sure about the specifics. If you would like to help but do not want to be associated with a particular contribution on GitHub, you can also privately message me.

More generally, the Parquet project is looking for any feedback about use cases that you feel are currently poorly handled by the Parquet format. Of course, we do not promise that we can improve all such use cases, but we are definitely looking for insights about existing pain points (please note, though, that any suggestions about Parquet reading APIs and implementations are more usefully directed to the corresponding projects – such as PyArrow, etc.).

(note for mods: I considered posting this in the Dataframe category, but ended up choosing Uncategorized; please feel free to suggest a better place!)

cc @jrbourbeau @mrocklin @fjetter @scharlottej13