Until about a week ago (07/03/2022), several of my tests read parquet files from the public s3://nyc-tlc bucket. Now, for example, the following code prints zero as the length of the DataFrame, whereas a week ago the DataFrame had over 84 million rows:
import dask.dataframe as dd

df_nyctlc = dd.read_parquet(
    "s3://nyc-tlc/trip data/yellow_tripdata_2019-*.parquet",
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
    dtype={
        "payment_type": "UInt8",
        "VendorID": "UInt8",
        "passenger_count": "UInt8",
        "RatecodeID": "UInt8",
        "store_and_fwd_flag": "category",
        "PULocationID": "UInt16",
        "DOLocationID": "UInt16",
        "tolls_amount": "float64",
    },
    storage_options={"anon": True},
    blocksize="16 MiB",
).persist()

print(len(df_nyctlc))
Does anyone know where this data went?
@bgithub1 Welcome!
We’re also trying to figure this out, ref: Access denied for NYC taxi dataset · Issue #1418 · awslabs/open-data-registry · GitHub
I’ll let you know if we have any updates!
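In the meantime, one way to tell "access denied" apart from "no files match the pattern" is to glob the bucket directly before handing the path to Dask. This is only a rough sketch, not an official diagnostic: `diagnose` and `main` are hypothetical names, it assumes `s3fs` is installed, and it assumes `s3fs` surfaces an access-denied response as a `PermissionError`.

```python
def diagnose(fs, pattern):
    """Classify a glob against an fsspec-style filesystem.

    Returns ("denied", []), ("empty", []) or ("ok", [paths...]).
    `fs` only needs a `glob(pattern)` method, e.g. an s3fs.S3FileSystem.
    """
    try:
        paths = sorted(fs.glob(pattern))
    except PermissionError:
        # Assumption: the filesystem raises PermissionError on access denied.
        return ("denied", [])
    if not paths:
        # Listing worked, but nothing matched the pattern.
        return ("empty", [])
    return ("ok", paths)


def main():
    # Imported here so that `diagnose` can be used without s3fs installed.
    import s3fs

    fs = s3fs.S3FileSystem(anon=True)  # anonymous access, as in the post above
    status, paths = diagnose(fs, "nyc-tlc/trip data/yellow_tripdata_2019-*.parquet")
    print(status, len(paths))
```

Calling `main()` runs the check against the real bucket. A "denied" result would point at bucket permissions rather than at the Dask call, while "empty" would suggest the files were moved or renamed.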
Thanks so much. I thought I was losing my mind, or doing something really dumb.
I sent a message to opendata.cityofnewyork asking if they knew anything. Also, this page
https://registry.opendata.aws/nyc-tlc-trip-records-pds/
shows information about the s3://nyc-tlc data. It has a link to nyc.gov:
http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
However, this link redirects you to a page that says:
Taxi & Limousine Commission has recently redesigned its website and this page has moved. Please update your bookmark to:
TLC Trip Record Data - TLC
You will be redirected in 5 seconds, or click on the link above.
Not sure if this helps, but it would not surprise me if that transition caused this problem.