Time Series Data


In addition to the data types we’ve encountered so far, pandas supports datetime objects, which represent dates and times.

Let’s load an air quality dataset that includes a datetime field.

  • air_quality_no2_long provides NO2 values for three measurement stations.

import pandas as pd # import pandas
df = pd.read_csv("https://raw.githubusercontent.com/kwaldenphd/elements-of-computing/main/book/data/ch5/air-quality.csv") # load data
df = df[["date.utc", "location", "parameter", "value"]] # subset data
df.info() # get df summary
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2068 entries, 0 to 2067
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   date.utc   2068 non-null   object 
 1   location   2068 non-null   object 
 2   parameter  2068 non-null   object 
 3   value      2068 non-null   float64
dtypes: float64(1), object(3)
memory usage: 64.8+ KB

You might be wondering what exactly we are supposed to do with the convoluted string of numbers and characters in the date.utc field of the air_quality_no2_long file.

Let’s first break down the information contained in this field. A single datetime value, such as 2019-06-21 00:00:00+00:00, contains the following pieces:

  • year (2019)

  • month (6)

  • day (21)

  • hour (00 before first colon)

  • minute (00 between two colons)

  • second (00 after second colon)

  • UTC offset, or time zone (00:00 after + symbol)
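
As a minimal sketch of the breakdown above, the example value can be parsed into a single pandas Timestamp and each component read back as an attribute. The variable names ts and components are illustrative and not part of the original walkthrough.

```python
import pandas as pd

# Parse the example value from the text into a pandas Timestamp
ts = pd.Timestamp("2019-06-21 00:00:00+00:00")

# Each piece listed above is available as an attribute or method
components = {
    "year": ts.year,
    "month": ts.month,
    "day": ts.day,
    "hour": ts.hour,
    "minute": ts.minute,
    "second": ts.second,
    "utc_offset": ts.utcoffset(),  # offset from UTC as a timedelta
}

for name, value in components.items():
    print(name, value)
```

Reading the components this way is only possible once the value is a datetime object; the raw string would need manual slicing to get the same information.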

We can imagine a number of scenarios in which we would want to work with this information as a time series object, rather than just a string of characters. Initially, pandas treats the date.utc field as a character string (the object dtype in the summary above). We can use the pd.to_datetime() function to convert this field to a datetime object.

df['datetime'] = pd.to_datetime(df['date.utc']) # add new column with converted date
df.info() # check output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2068 entries, 0 to 2067
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype              
---  ------     --------------  -----              
 0   date.utc   2068 non-null   object             
 1   location   2068 non-null   object             
 2   parameter  2068 non-null   object             
 3   value      2068 non-null   float64            
 4   datetime   2068 non-null   datetime64[ns, UTC]
dtypes: datetime64[ns, UTC](1), float64(1), object(3)
memory usage: 80.9+ KB

Now that we have the datetime field as a datetime object, we can use a number of specialized operations, such as finding the earliest and latest timestamps and calculating the time span between them.

df['datetime'].min(), df['datetime'].max() # get min and max values
(Timestamp('2019-05-07 01:00:00+0000', tz='UTC'),
 Timestamp('2019-06-21 00:00:00+0000', tz='UTC'))
df['datetime'].max() - df['datetime'].min() # calculate time span
Timedelta('44 days 23:00:00')
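
Beyond min and max, a datetime column also exposes the .dt accessor and supports comparisons against timestamps. The sketch below uses a small made-up sample in the same shape as the air quality data (the values are illustrative, not taken from the file), and the names sample and june are assumptions for this example.

```python
import pandas as pd

# Small illustrative sample; the values are made up for this sketch
sample = pd.DataFrame({
    "date.utc": [
        "2019-06-21 00:00:00+00:00",
        "2019-06-20 23:00:00+00:00",
        "2019-05-07 01:00:00+00:00",
    ],
    "value": [20.0, 21.8, 25.0],
})

# Convert the string column to datetime, as in the example above
sample["datetime"] = pd.to_datetime(sample["date.utc"])

# The .dt accessor extracts components for every row at once
sample["month"] = sample["datetime"].dt.month
sample["weekday"] = sample["datetime"].dt.day_name()

# Datetime columns can be filtered with ordinary comparisons;
# the comparison value must be timezone-aware to match the column
june = sample[sample["datetime"] >= pd.Timestamp("2019-06-01", tz="UTC")]

print(sample[["datetime", "month", "weekday"]])
print(len(june), "rows on or after 2019-06-01")
```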

If we know we are loading data with a datetime-like field, we can tell pandas to parse that field as it builds the data frame by passing the parse_dates argument to read_csv().

df = pd.read_csv("https://raw.githubusercontent.com/kwaldenphd/elements-of-computing/main/book/data/ch5/air-quality.csv", parse_dates=["date.utc"]) # load data
df.info() # show info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2068 entries, 0 to 2067
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype              
---  ------     --------------  -----              
 0   city       2068 non-null   object             
 1   country    2068 non-null   object             
 2   date.utc   2068 non-null   datetime64[ns, UTC]
 3   location   2068 non-null   object             
 4   parameter  2068 non-null   object             
 5   value      2068 non-null   float64            
 6   unit       2068 non-null   object             
dtypes: datetime64[ns, UTC](1), float64(1), object(5)
memory usage: 113.2+ KB
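
To try parse_dates without downloading the full file, the same idea can be sketched on a tiny in-memory CSV. The rows, the location code, and the names csv_text, parsed, and span below are made up for illustration.

```python
import io

import pandas as pd

# Tiny in-memory CSV standing in for the real file (illustrative rows only)
csv_text = """date.utc,location,value
2019-06-21 00:00:00+00:00,FR04014,20.0
2019-06-20 23:00:00+00:00,FR04014,21.8
"""

# parse_dates converts the named column while the file is read
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["date.utc"])

# The column arrives as a datetime dtype, so datetime arithmetic works right away
print(parsed.dtypes)
span = parsed["date.utc"].max() - parsed["date.utc"].min()
print(span)
```

Parsing at load time saves the separate pd.to_datetime() step and avoids keeping both a string copy and a datetime copy of the same column.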

Additional Resources

For more on datetime objects and time series data in Python: