Time Series Data
In addition to the data types we’ve encountered so far, pandas supports datetime objects.
Let’s load an air quality dataset that includes a datetime field. The air_quality_no2_long dataset provides NO2 values for three measurement stations.
import pandas as pd # import pandas
df = pd.read_csv("https://raw.githubusercontent.com/kwaldenphd/elements-of-computing/main/book/data/ch5/air-quality.csv") # load data
df = df[["date.utc", "location", "parameter", "value"]] # subset data
df.info() # get df summary
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2068 entries, 0 to 2067
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date.utc 2068 non-null object
1 location 2068 non-null object
2 parameter 2068 non-null object
3 value 2068 non-null float64
dtypes: float64(1), object(3)
memory usage: 64.8+ KB
You might be wondering what exactly we are supposed to do with the string of numbers and characters in the datetime field (date.utc) of the air_quality_no2_long data.
Let’s first break down the information contained in this field. A single datetime value, such as 2019-06-21 00:00:00+00:00, contains the following components:
year (2019)
month (6)
day (21)
hour (00 before first colon)
minute (00 between two colons)
second (00 after second colon)
UTC offset, or time zone (00:00 after + symbol)
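The breakdown above can be checked directly in pandas. The following is a minimal sketch that parses the single example value into a Timestamp and reads each component back as an attribute; the variable name ts is only for illustration.

```python
import pandas as pd

# Parse the example value from the text into a single Timestamp
ts = pd.Timestamp("2019-06-21 00:00:00+00:00")

# Each component in the list above is available as an attribute
print(ts.year, ts.month, ts.day)      # 2019 6 21
print(ts.hour, ts.minute, ts.second)  # 0 0 0

# The UTC offset is returned as a timedelta (zero for UTC)
print(ts.utcoffset())                 # 0:00:00
```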
We can imagine a number of scenarios in which we would want to work with this information as a datetime object rather than as a string of characters. Initially, pandas treats the date.utc field as a character string (the object dtype in the summary above). We can use the pd.to_datetime() function to convert this field to a datetime object.
df['datetime'] = pd.to_datetime(df['date.utc']) # add new column with converted date
df.info() # check output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2068 entries, 0 to 2067
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date.utc 2068 non-null object
1 location 2068 non-null object
2 parameter 2068 non-null object
3 value 2068 non-null float64
4 datetime 2068 non-null datetime64[ns, UTC]
dtypes: datetime64[ns, UTC](1), float64(1), object(3)
memory usage: 80.9+ KB
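Real-world files sometimes contain values that cannot be parsed as dates. As a hedged sketch, the example below runs the same conversion on a few inline strings (made up for illustration, not taken from the air quality file) and passes errors="coerce" so that an unparseable value becomes NaT (pandas' missing datetime marker) instead of raising an error.

```python
import pandas as pd

# Hypothetical values for illustration; the last one is deliberately malformed
raw = pd.Series([
    "2019-06-21 00:00:00+00:00",
    "2019-06-21 01:00:00+00:00",
    "not a timestamp",
])

# errors="coerce" turns values that fail to parse into NaT
converted = pd.to_datetime(raw, errors="coerce")

print(converted)
print(converted.isna().sum())  # number of values that could not be parsed
```

Whether coercing is appropriate depends on the data: silently converting bad values to NaT can hide problems, so it is worth counting them as shown.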
Now that we have the datetime field stored as a datetime object, we can use a number of specialized methods and operations on it.
df['datetime'].min(), df['datetime'].max() # get min and max values
(Timestamp('2019-05-07 01:00:00+0000', tz='UTC'),
Timestamp('2019-06-21 00:00:00+0000', tz='UTC'))
df['datetime'].max() - df['datetime'].min() # calculate time span
Timedelta('44 days 23:00:00')
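Beyond min, max, and subtraction, datetime columns expose the .dt accessor, and a datetime index supports resampling. The sketch below uses a small hand-built data frame (hypothetical values in the same shape as the air quality data) to show both; the names sample and daily_mean are illustrative only.

```python
import pandas as pd

# Hypothetical mini-dataset shaped like the air quality data
sample = pd.DataFrame({
    "date.utc": [
        "2019-06-20 23:00:00+00:00",
        "2019-06-21 00:00:00+00:00",
        "2019-06-21 01:00:00+00:00",
    ],
    "value": [10.0, 20.0, 30.0],
})
sample["datetime"] = pd.to_datetime(sample["date.utc"])

# The .dt accessor exposes datetime components for a whole column
sample["hour"] = sample["datetime"].dt.hour
sample["weekday"] = sample["datetime"].dt.day_name()
print(sample[["datetime", "hour", "weekday"]])

# With a datetime index, resample() aggregates by time period (here: daily mean)
daily_mean = sample.set_index("datetime")["value"].resample("D").mean()
print(daily_mean)
```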
If we know in advance that the data we are loading has a datetime-like field, we can tell pandas to parse that field as datetimes when it creates the data frame by passing the parse_dates parameter to read_csv().
df = pd.read_csv("https://raw.githubusercontent.com/kwaldenphd/elements-of-computing/main/book/data/ch5/air-quality.csv", parse_dates=["date.utc"]) # load data
df.info() # show info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2068 entries, 0 to 2067
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 city 2068 non-null object
1 country 2068 non-null object
2 date.utc 2068 non-null datetime64[ns, UTC]
3 location 2068 non-null object
4 parameter 2068 non-null object
5 value 2068 non-null float64
6 unit 2068 non-null object
dtypes: datetime64[ns, UTC](1), float64(1), object(5)
memory usage: 113.2+ KB
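To try parse_dates without downloading the file, the following sketch reads a tiny CSV from an in-memory string. The rows and the station name are made up for illustration; only the date.utc column name matches the real dataset.

```python
import io

import pandas as pd

# Hypothetical two-row CSV; values are illustrative only
csv_text = """date.utc,location,value
2019-06-20 23:00:00+00:00,StationA,10.0
2019-06-21 00:00:00+00:00,StationA,20.0
"""

# parse_dates takes a list of column names to convert while loading
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["date.utc"])

print(parsed.dtypes)

# Confirm the column was parsed as a datetime rather than left as a string
print(pd.api.types.is_datetime64_any_dtype(parsed["date.utc"]))
```

Parsing at load time avoids keeping a second, string-typed copy of the column, which is what the earlier approach of adding a separate datetime column did.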
Additional Resources
For more on datetime objects and time series data in Python: