DataFrame#

While a Series object is a one-dimensional array, a DataFrame is a tabular data structure that “contains an ordered collection of columns, each of which can be a different value type” (McKinney, 130). A pandas DataFrame has both a row index and a column index, and we can think of its columns as Series that all share the same row index. Behold, a two-dimensional data structure!
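One way to see the "Series that share an index" idea is to build a DataFrame from a dictionary of Series. The following is a minimal sketch. The variable names and values (`year`, `pop`, `small`, and the labels `'a'` to `'c'`) are assumptions chosen for illustration.

```python
import pandas as pd

# Two Series with the same index labels (illustrative values)
year = pd.Series([2000, 2001, 2002], index=['a', 'b', 'c'])
pop = pd.Series([1.5, 1.7, 3.6], index=['a', 'b', 'c'])

# Each Series becomes a column; the shared labels become the row index
small = pd.DataFrame({'year': year, 'pop': pop})

# Each column can hold a different value type
print(small.dtypes)

# Pulling a column back out returns a Series with the same index
print(small['pop'].index.equals(small.index))
```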

Creating a DataFrame (from scratch)#

In most situations, you’ll create a DataFrame by reading in a structured data file. But we’re going to manually create a DataFrame to better understand how they work in pandas. Let’s use a state population data example, featuring a dictionary with equal-length lists:

# import pandas under its conventional alias
import pandas as pd

# create dictionary with three equal-length lists
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

# write data dictionary values to data frame
frame = pd.DataFrame(data)

# show newly-created frame
frame

We now have a two-dimensional DataFrame object with rows, columns, labels, and values. Each dictionary key becomes a column label, and the first element of each list in our data dictionary populates the first row of our frame DataFrame. This pattern continues for subsequent rows.
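To check that pattern, a short sketch like the following compares the first row of the frame against the first element of each list. It repeats the setup so that it runs on its own, and it uses `iloc` only to grab the row at position 0.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

# number of rows and columns, as a (rows, columns) tuple
print(frame.shape)

# row at position 0, returned as a Series labeled by column name
first_row = frame.iloc[0]
print(first_row)

# first element of each list in the dictionary
first_elements = [values[0] for values in data.values()]

# compare the row's values with those first elements
print(first_row.tolist() == first_elements)
```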

Interacting With a DataFrame#

We can start to explore the dimensions or overall characteristics of our DataFrame:

# shows index labels and values for first 5 rows
frame.head(5)

# shows column labels as a NumPy array
frame.columns.values

# shows basic statistical information for the data frame
frame.describe()

The last command, .describe(), returns summary statistics for the numeric columns in our dataset, including:

  • count

  • mean

  • standard deviation

  • minimum

  • 25th percentile

  • 50th percentile

  • 75th percentile

  • maximum
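As a sketch of what that summary looks like in practice, the code below rebuilds the example frame and pulls individual statistics out of the `describe()` result, which is itself a DataFrame. The `summary` variable name is an assumption, and the `loc` lookup is used here only to read one cell by its row and column labels.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

summary = frame.describe()

# row labels of the summary correspond to the statistics listed above
print(summary.index.tolist())

# non-numeric columns such as 'state' are left out by default
print(summary.columns.tolist())

# look up one statistic for one column by row label and column label
print(summary.loc['mean', 'pop'])
```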

We can select columns in our DataFrame using their column labels.

# returns the state column as a Series
frame['state']

We can also select a column by using its name as an attribute (dot notation).

# returns the year column as a Series
frame.year
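The two selection styles can be compared side by side. The sketch below rebuilds the example frame. The double-bracket selection of several columns is an addition not covered above. It also shows one caveat: attribute-style access only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute, so the `pop` column needs brackets because `frame.pop` refers to a DataFrame method.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

# bracket access and attribute access return the same Series
print(frame['year'].equals(frame.year))

# a list of labels inside the brackets returns a DataFrame
subset = frame[['state', 'pop']]
print(type(subset).__name__)

# 'pop' clashes with the DataFrame.pop method
print(callable(frame.pop))      # attribute access finds the method
print(frame['pop'].tolist())    # bracket access finds the column
```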

We can retrieve rows based on their position using the iloc (integer location) attribute. Positions start at 0.

# returns the fourth row in the dataframe
frame.iloc[3]
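Position-based and label-based lookups are easy to confuse when the index is the default 0, 1, 2, ... range, because the two coincide. The sketch below gives the example frame a hypothetical set of string labels (`'a'` through `'f'`, an assumption made only for illustration) to tell `iloc` apart from its label-based counterpart `loc`.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

# hypothetical string labels for the rows, used only for this illustration
frame = pd.DataFrame(data, index=list('abcdef'))

# iloc selects by integer position: position 3 is the fourth row
fourth = frame.iloc[3]
print(fourth['state'], fourth['year'])

# loc selects by index label: 'd' is the label of that same row
same_row = frame.loc['d']
print(fourth.equals(same_row))

# a slice of positions returns several rows as a DataFrame
first_two = frame.iloc[0:2]
print(first_two.shape)
```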

Application#

Q2: Create your own small DataFrame. Write code that accomplishes the following tasks. Your answer for these items should include a Python program + comments that document process and explain your code.

  • Select a specific column(s) using its index label or name attribute

  • Select a specific row(s) using its index label or integer position

  • Determine summary statistics for values in the DataFrame