
Date & time columns in PUDL Parquet outputs do not always parse as native types #4326

Open
@zaneselvans

Overview

Currently, if you read one of the PUDL Parquet files that contains a date or datetime field into pandas, the dtype you get depends on how you read it in. By default our date columns are converted into objects, which then need to be converted manually using e.g. pd.to_datetime(). This is a hassle and will not be intuitive to all users.

Given that we are just distributing the data, and not a software package, it would be nice if the easiest ways of reading the data in the wild also reflected the dtypes that we use internally when producing the data, since those are the types that we expect to work and test with.

PyArrow provides a bunch of rich time dtypes, including a date with day resolution (date32[day]), and pandas seems to have adopted that type as its solution to this problem when the PyArrow dtype backend is used.

Questions

  • What are the appropriate types?
  • How should we tell people to read this data to get usable time types?
  • Should we change the PUDL dtypes so they work smoothly with the outside world?
  • Should we switch over to using PyArrow dtypes by default throughout PUDL (maybe when pandas 3.0 lands)?

Current Behavior

report_date

  • pudl.helpers.get_parquet_table() --> datetime64[s] (explicit imposition of the PUDL dtype)
  • pd.read_parquet(dtype_backend="pyarrow") --> date32[day][pyarrow]
  • pd.read_parquet().convert_dtypes() --> object

datetime_utc

  • pudl.helpers.get_parquet_table() --> datetime64[s] (explicit imposition of the PUDL dtype)
  • pd.read_parquet(dtype_backend="pyarrow") --> timestamp[ms][pyarrow]
  • pd.read_parquet().convert_dtypes() --> datetime64[ms]

Labels

  • bug: Things that are just plain broken.
  • data-types: Dtype conversions, standardization and implications of data types
  • output: Exporting data from PUDL into other platforms or interchange formats.
  • parquet: Issues related to the Apache Parquet file format which we use for long tables.
  • time: what even is time. fixing and changing the way in which PUDL data deals with time
