Hi Team,
I am trying to load data created by Hive or Iceberg and run some computations on it. Using the Hive metastore, I can find each file's location and its Serializer/Deserializer (SerDe) type.
I can load the Parquet files using read_parquet, but I am not able to load files serialized with org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe. There are other SerDe types as well.
When I try to read a LazySimpleSerDe file using read_csv, I get the error below.
Traceback (most recent call last):
File "csv_s3.py", line 15, in <module>
print(df.compute())
File "/home/daskloader/venv/lib/python3.8/site-packages/dask/base.py", line 314, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/home/daskloader/venv/lib/python3.8/site-packages/dask/base.py", line 599, in compute
results = schedule(dsk, keys, **kwargs)
File "/home/daskloader/venv/lib/python3.8/site-packages/dask/threaded.py", line 89, in get
results = get_async(
File "/home/daskloader/venv/lib/python3.8/site-packages/dask/local.py", line 511, in get_async
raise_exception(exc, tb)
File "/home/daskloader/venv/lib/python3.8/site-packages/dask/local.py", line 319, in reraise
raise exc
File "/home/daskloader/venv/lib/python3.8/site-packages/dask/local.py", line 224, in execute_task
result = _execute_task(task, data)
File "/home/daskloader/venv/lib/python3.8/site-packages/dask/core.py", line 119, in _execute_task
return func(*(_execute_task(a, cache) for a in args))
File "/home/daskloader/venv/lib/python3.8/site-packages/dask/optimization.py", line 990, in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
File "/home/daskloader/venv/lib/python3.8/site-packages/dask/core.py", line 149, in get
result = _execute_task(task, cache)
File "/home/daskloader/venv/lib/python3.8/site-packages/dask/core.py", line 119, in _execute_task
return func(*(_execute_task(a, cache) for a in args))
File "/home/daskloader/venv/lib/python3.8/site-packages/dask/dataframe/io/csv.py", line 140, in __call__
df = pandas_read_text(
File "/home/daskloader/venv/lib/python3.8/site-packages/dask/dataframe/io/csv.py", line 193, in pandas_read_text
df = reader(bio, **kwargs)
File "/home/daskloader/venv/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 912, in read_csv
return _read(filepath_or_buffer, kwds)
File "/home/daskloader/venv/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 583, in _read
return parser.read(nrows)
File "/home/daskloader/venv/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1704, in read
) = self._engine.read( # type: ignore[attr-defined]
File "/home/daskloader/venv/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 234, in read
chunks = self._reader.read_low_memory(nrows)
File "pandas/_libs/parsers.pyx", line 814, in pandas._libs.parsers.TextReader.read_low_memory
File "pandas/_libs/parsers.pyx", line 875, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 850, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 861, in pandas._libs.parsers.TextReader._check_tokenize_status
File "pandas/_libs/parsers.pyx", line 2029, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 25, saw 3
Is there any way I can pass the SerDe class (e.g. “org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe”) and read the files?
Any other approach would also be helpful.
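As a possible starting point, here is a minimal sketch of translating the SerDe parameters from the metastore into read_csv keyword arguments. It assumes the table uses LazySimpleSerDe's default text layout (fields separated by Ctrl-A, i.e. "\x01", and NULL written as \N) unless the SerDe parameters "field.delim" or "serialization.null.format" say otherwise. The helper name serde_to_read_csv_kwargs and the column names are hypothetical, not part of Dask or pandas. The error above ("Expected 1 fields in line 25, saw 3") would be consistent with read_csv splitting on commas instead of the table's real delimiter, though I have not confirmed that.

```python
import csv

# Assumed Hive defaults for LazySimpleSerDe text tables.
DEFAULT_FIELD_DELIM = "\x01"   # Ctrl-A
DEFAULT_NULL_FORMAT = "\\N"    # the two characters backslash and N


def serde_to_read_csv_kwargs(serde_params, column_names):
    """Hypothetical helper: build read_csv keyword arguments from the
    SerDe parameters and column names stored in the Hive metastore."""
    sep = serde_params.get("field.delim", DEFAULT_FIELD_DELIM)
    null = serde_params.get("serialization.null.format", DEFAULT_NULL_FORMAT)
    return {
        "sep": sep,                    # table's field delimiter, not ","
        "header": None,                # Hive text files carry no header row
        "names": list(column_names),   # column names come from the metastore
        "na_values": [null],           # treat Hive's NULL marker as missing
        "quoting": csv.QUOTE_NONE,     # LazySimpleSerDe does not quote fields
    }


if __name__ == "__main__":
    # Empty parameters: fall back to the assumed defaults.
    kwargs = serde_to_read_csv_kwargs({}, ["id", "name", "city"])
    print(kwargs)
```

The resulting dictionary could then be passed on, for example dd.read_csv(location, **kwargs) with dask.dataframe imported as dd, where location is the table path returned by the metastore. Complex column types (arrays, maps, structs) and escaped delimiters are not handled by this sketch.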
Thank you!!