Description
Overview
Currently, if you read one of the PUDL Parquet files that contains a date or datetime field into pandas, the dtype you get depends on how you read it in. By default our date columns get converted into objects, which then need to be converted manually using e.g. pd.to_datetime(), which is a hassle and will not be intuitive to all users.
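As a minimal sketch of the manual conversion users currently have to do (the literal column values here are illustrative, not taken from a real PUDL table):

```python
import pandas as pd

# Simulate what users currently see after a default read:
# a date column that comes back with object dtype.
df = pd.DataFrame({"report_date": ["2020-01-01", "2021-01-01"]}, dtype="object")

# The manual fix users must apply themselves:
df["report_date"] = pd.to_datetime(df["report_date"])
print(df["report_date"].dtype)  # datetime64[ns]
```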
Given that we are just distributing data, not a software package, it would be nice if the easiest ways of reading the data in the wild also reflected the dtypes that we use internally when producing it, since those are the types that we expect to work with and test against.
PyArrow provides a number of rich time dtypes, including a date type with day resolution, and pandas appears to have adopted that type as its solution to this problem.
Questions
- What are the appropriate types?
- How should we tell people to read this data to get usable time types?
- Should we change the PUDL dtypes so they work smoothly with the outside world?
- Should we switch over to using PyArrow dtypes by default throughout PUDL (maybe when pandas 3.0 lands?)
Current Behavior
report_date
- pudl.helpers.get_parquet_table() --> datetime64[s] (explicit imposition of the PUDL dtype)
- pd.read_parquet(dtype_backend="pyarrow") --> date32[day][pyarrow]
- pd.read_parquet().convert_dtypes() --> object
datetime_utc
- pudl.helpers.get_parquet_table() --> datetime64[s] (explicit imposition of the PUDL dtype)
- pd.read_parquet(dtype_backend="pyarrow") --> timestamp[ms][pyarrow]
- pd.read_parquet().convert_dtypes() --> datetime64[ms]