Reading Hive SerDe files

Hi Team,
I was trying to load data that has been created by hive or iceberg and do some computation on them. Using hivemetastore I can find the location and Serializer/Deserializer type of file.

I can load Parquet files using read_parquet, but I am not able to load files that were serialized using org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe. There are more SerDe types as well.
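For example, this works fine for a Parquet-backed table (the bucket and path are placeholders):

```python
import dask.dataframe as dd

# The location comes from the table's metastore entry.
df = dd.read_parquet("s3://my-bucket/warehouse/mydb.db/parquet_table/")
print(df.head())
```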

When I try to read one of these LazySimpleSerDe files using read_csv, I get the error below.
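This is roughly what I am running (the bucket path is a placeholder for my real one):

```python
import dask.dataframe as dd

# Table location taken from the metastore; default read_csv settings.
df = dd.read_csv("s3://my-bucket/warehouse/mydb.db/text_table/*")
print(df.compute())
```

The traceback: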

```
Traceback (most recent call last):
  File "csv_s3.py", line 15, in <module>
    print(df.compute())
  File "/home/daskloader/venv/lib/python3.8/site-packages/dask/base.py", line 314, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/home/daskloader/venv/lib/python3.8/site-packages/dask/base.py", line 599, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/home/daskloader/venv/lib/python3.8/site-packages/dask/threaded.py", line 89, in get
    results = get_async(
  File "/home/daskloader/venv/lib/python3.8/site-packages/dask/local.py", line 511, in get_async
    raise_exception(exc, tb)
  File "/home/daskloader/venv/lib/python3.8/site-packages/dask/local.py", line 319, in reraise
    raise exc
  File "/home/daskloader/venv/lib/python3.8/site-packages/dask/local.py", line 224, in execute_task
    result = _execute_task(task, data)
  File "/home/daskloader/venv/lib/python3.8/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/home/daskloader/venv/lib/python3.8/site-packages/dask/optimization.py", line 990, in __call__
    return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
  File "/home/daskloader/venv/lib/python3.8/site-packages/dask/core.py", line 149, in get
    result = _execute_task(task, cache)
  File "/home/daskloader/venv/lib/python3.8/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/home/daskloader/venv/lib/python3.8/site-packages/dask/dataframe/io/csv.py", line 140, in __call__
    df = pandas_read_text(
  File "/home/daskloader/venv/lib/python3.8/site-packages/dask/dataframe/io/csv.py", line 193, in pandas_read_text
    df = reader(bio, **kwargs)
  File "/home/daskloader/venv/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 912, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/daskloader/venv/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 583, in _read
    return parser.read(nrows)
  File "/home/daskloader/venv/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1704, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File "/home/daskloader/venv/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 234, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas/_libs/parsers.pyx", line 814, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 875, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 850, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 861, in pandas._libs.parsers.TextReader._check_tokenize_status
  File "pandas/_libs/parsers.pyx", line 2029, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 25, saw 3
```

Is there any way I can pass the SerDe class (e.g. “org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe”) and read the files?
Any other approach would also be helpful.

Thank you!!

Hi @rvarunrathod,

I’m not really familiar with Hive, so I’m not sure I can help here, and this is probably why your problem is not really clear to me.

What format is your dataset in, Parquet? And what kind of file system is it on?

Could you post the code you are using to read your dataset?

org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe looks like a Java class to me, is that so?

Yes, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe is a Java serializer/deserializer class, and I am using read_csv to read this file.

So when Hive creates a table, it stores the table’s data in files and keeps the information about how to serialize/deserialize them in the Hive metastore.
I found several different Java classes there. I wondered if there was a way we could pass these classes and read the files, e.g. (see the sketch after this list):
org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe (parquet)
org.apache.hadoop.hive.serde2.avro.AvroSerDe (avro)
org.apache.hadoop.hive.serde2.OpenCSVSerde (csv)
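To illustrate what I mean, today I end up doing the dispatch myself; something like this hypothetical sketch (the mapping and the option handling are my own assumptions, not a real API):

```python
import dask.dataframe as dd

# Hypothetical dispatch from the metastore's serializationLib string to a
# native Dask reader. Avro could presumably go through dask.bag.read_avro
# (fastavro) and then to_dataframe().
SERDE_READERS = {
    "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe": dd.read_parquet,
    "org.apache.hadoop.hive.serde2.OpenCSVSerde": dd.read_csv,
}

def read_hive_table(location, serde_class, **kwargs):
    """Pick a Dask reader based on the table's SerDe class."""
    try:
        reader = SERDE_READERS[serde_class]
    except KeyError:
        raise NotImplementedError(f"no Dask reader for {serde_class}")
    # read_csv expects file paths or globs, so a bare directory location
    # may need "/*" appended before calling it.
    return reader(location, **kwargs)
```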

But the file you are reading is not a CSV file, is it?

I’m really not sure there is a Dask-supported way of using a Java deserializer; it would probably have to rely on an external library.

This file was created by running a CREATE TABLE command in Hive without specifying any serialization or storage options.

Hive uses org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe to serialize the data, and when the file is opened in a text editor it looks like the following.
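Something like this, where ^A stands for the non-printable \x01 control character that (as far as I can tell) LazySimpleSerDe uses as its default field delimiter. These are illustrative values, not my actual data:

```
1^Aalice^A2023-01-01
2^Abob^A2023-01-02
```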

As PyArrow uses Java libraries underneath, I assume there has to be a way to pass a serializer/deserializer class to Dask so that it can decode files that are not supported out of the box.

> But the file you are reading is not a CSV file, is it?

That assessment is correct: it is not a CSV file. It was created by running INSERT commands in Hive. Can you please help me with how to read such files?

Probably @martindurant knows better than me here.

Thanks for the ping. I have no idea how this file type works, but passing a Java class to Dask (or to any other Python function!) is not possible. You can maybe do that with Spark, but I wouldn’t know how.

LazySimpleSerDe does seem to have options for storing its output in something approaching CSV (see “What format applies to the Hive LazySimpleSerDe” on Stack Overflow), so you can maybe look at the files and come up with something that works. (This being Hadoop, there is probably some header matter detailing what this is in terms of Java classes.)
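For instance, LazySimpleSerDe’s default text layout separates fields with the \x01 (Ctrl-A) control character, encodes NULLs as the literal string \N, and writes no header row, so something along these lines might work (a sketch only: the path and column names are placeholders, and escaping and complex-typed columns are not handled):

```python
import dask.dataframe as dd

# LazySimpleSerDe defaults: \x01 between fields, newline between rows,
# the literal "\N" for NULL, and no header line in the files.
df = dd.read_csv(
    "s3://my-bucket/warehouse/mydb.db/text_table/*",
    sep="\x01",
    header=None,
    names=["id", "name", "created"],  # placeholder column names
    na_values=["\\N"],
)
```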