Skip to article frontmatterSkip to article content

6. Spreadsheets with Pandas

Authors
Affiliations
Microsoft
Anaconda

In this section of the tutorial, we will learn how to work with data and spreadsheets using the pandas library in Python. The corrisponding slides for this section are available on the repo.

NB: If you are not running this in a dev container, make sure you check the readme for setup instructions!

Let’s start by importing pandas:

import pandas as pd

Reading our data into pandas

The tabular data you want to work with in pandas and then export to a spreadsheet can start in a variety of formats. Some common formats are:

  • XSLX (Excel spreadsheets)
  • ODF (Open Document Format)
  • CSV (comma-separated values)
  • JSON (JavaScript Object Notation)
# Lets start by creating a DataFrame from a csv file
acnh_fish_df = pd.read_csv('sample-data/animal-crossing/fish.csv')
acnh_fish_df
Loading...

We can also create a DataFrame by loading data from a xlsx file:

sports_df = pd.read_excel("sample-data/summer-sports/summer-sports.xlsx")
sports_df
Loading...

We need to have a bit more control over how we are reading the xlsx file, so let’s look at the options we have for the read_excel function.

# help(pd.read_excel) # or ?? pd.read_excel in jupyter cell magic syntax

The most likely ones we will need are:

General

  • sheet_name: the name of the sheet we want to read from the xlsx file
    • pandas will read the first sheet by default
  • engine: the engine to use to read the xlsx file, for the most part pandas will guess the right one!
    • openpyxl supports newer Excel file formats
    • calamine supports Excel (.xls, .xlsx, .xlsm, .xlsb) and OpenDocument (.ods) file formats
    • odf supports OpenDocument file formats (.odf, .ods, .odt)
    • pyxlsb supports Binary Excel files
    • xlrd supports old-style Excel files (.xls)

N.B. If you need a specific engine you may need to install it as they are optional dependancies.

#Choosing more than one sheet returns a dictionary of DataFrames
sports_data = pd.read_excel("sample-data/summer-sports/summer-sports.xlsx",
                            sheet_name=["README", "data"]
                            )
sports_data["data"] 
Loading...

We can test the engine option by reading an ods formatted file of the same data.

sports_info = pd.read_excel("sample-data/summer-sports/summer-sports.ods",
                            sheet_name=["README", "data"],
                            engine="odf"
                            )
sports_info["data"]
Loading...
# Our readme sheet is now a DataFrame too, but we can add it as
# metadata to the data DataFrame

sports_data = sports_info["data"]
sports_data.attrs = {"metadata" : sports_info["README"]}
sports_data.attrs
{'metadata': What's in this Dataset? Unnamed: 1 \ 0 Name Summer Sports Experience and "Kids in Motion" ... 1 URL https://data.cityofnewyork.us/Recreation/Summe... 2 Description The Kids in Motion (KIM) program provides free... 3 Rows 12.3K 4 Columns 7 5 Each row is a… Each record represents weekly attendance at at... 6 NaN NaN 7 Schema NaN 8 Column Name Description 9 Borough Borough in which Kids in Motion or Summer Spor... 10 ParkorPlayground Name of park, playground, or recreation center... 11 Date Date class occurred 12 SportsPlayed Name of sport played for Summer Sports Experience 13 KIMorSSE Is this class part of Kids in Motion or Summer... 14 Attendance Attendance for the week Unnamed: 2 Unnamed: 3 0 NaN NaN 1 NaN NaN 2 NaN NaN 3 NaN NaN 4 NaN NaN 5 NaN NaN 6 NaN NaN 7 NaN NaN 8 API Field Name Data Type 9 borough Text 10 parkorplayground Text 11 date Floating Timestamp 12 sportsplayed Text 13 kimorsse Text 14 attendance Number }

Rows

  • header: the row number to use as the column names
    • uses the first row by default
  • skiprows: the number of rows to skip at the beginning of the file
    • will not skip any rows by default
  • nrows: the number of rows to read from the file
    • reads all rows by default

Columns

  • index_col: the column number to use as the index
    • uses a default integer index
  • usecols: the columns to read from the xlsx file
    • reads all columns by default
  • dtype: the data type to use for the columns
    • tries to infer the data type from the data
# The fish alreay have an index column, and let's set the data 
# type of the color columns to category



acnh_fish_df = pd.read_csv('sample-data/animal-crossing/fish.csv', 
                           index_col=0, 
                           dtype={"Color 1":'category'}
                           )
acnh_fish_df
Loading...

Data

  • parse_dates: whether to parse dates.
    • If not specified, pandas will not parse dates by default.
  • na_values: the values to consider as missing values.
    • If not specified, pandas will use a default set of missing values.
  • converters: the functions to use to convert the data in the columns.
    • If not specified, pandas will use a default set of converters.
  • thousands: the thousands separator to use.
    • If not specified, pandas will use a default thousands separator.
  • decimal: the decimal separator to use.
    • If not specified, pandas will use a default decimal separator.

Let’s use the converters option to make the time of day values in the month columns more useable.

from enum import Enum

class TimeOfDay(Enum):
   am = 1
   pm = 2
   all_day = 3

def time_of_day(time_period:str) -> TimeOfDay:
      if time_period == "All day":
         return TimeOfDay.all_day
      elif time_period == "4 AM – 9 PM":
         return TimeOfDay.am
      elif time_period == "9 AM – 4 PM":
         return TimeOfDay.pm
      else:
         return None

acnh_fish_df = pd.read_csv('sample-data/animal-crossing/fish.csv', 
                           index_col=0, 
                           dtype={"Color 1":'category', 
                                  "Color 2":'category'},
                           
                           converters={"NH Jan": time_of_day}
                        )
acnh_fish_df["NH Jan"]
Loading...

We can continue doing all the good data cleaning we all love like:

  • missing values
  • duplicates
  • data types
  • renaming columns
  • aggregating data

... but we are here to learn about spreadsheets so let’s move on to how to export our pandas data to a spreadsheet!


How to export your DataFrame to a spreadsheet

  • If you have only one DataFrame, you can use to_excel with a path to the file you want to save to.

  • If you want to combine multiple DataFrames into one spreadsheet, you can use ExcelWriter to create a new Excel file and then use to_excel with a reference to the ExcelWriter object to write each DataFrame to the same file, usually as seperate sheets.

acnh_fish_df.to_excel(
    "sample-data/animal-crossing/acnh-fish.xlsx"
    )
acnh_bug_df = pd.read_csv('sample-data/animal-crossing/insects.csv')
acnh_fossil_df = pd.read_csv('sample-data/animal-crossing/fossils.csv')
acnh_art_df = pd.read_csv('sample-data/animal-crossing/art.csv')

# Let's combine all the item lists into one Excel file
with pd.ExcelWriter('sample-data/animal-crossing/museum.xlsx') as writer:  
    acnh_bug_df.to_excel(writer, sheet_name='bugs')
    acnh_fossil_df.to_excel(writer, sheet_name='fossils')
    acnh_art_df.to_excel(writer, sheet_name='art')

There are some similar options for the to_excel function as there are for the read_excel function.