6. Spreadsheets with Pandas - Develop Pythonic Spreadsheets

In this section of the tutorial, we will learn how to work with data and spreadsheets using the pandas library in Python. The corrisponding slides for this section are available on the repo.

NB: If you are not running this in a dev container, make sure you check the readme for setup instructions!

Let’s start by importing pandas:

import pandas as pd

Reading our data into pandas¶

The tabular data you want to work with in pandas and then export to a spreadsheet can start in a variety of formats. Some common formats are:

XSLX (Excel spreadsheets)
ODF (Open Document Format)
CSV (comma-separated values)
JSON (JavaScript Object Notation)

# Lets start by creating a DataFrame from a csv file
acnh_fish_df = pd.read_csv('sample-data/animal-crossing/fish.csv')
acnh_fish_df

We can also create a DataFrame by loading data from a xlsx file:

sports_df = pd.read_excel("sample-data/summer-sports/summer-sports.xlsx")
sports_df

We need to have a bit more control over how we are reading the xlsx file, so let’s look at the options we have for the read_excel function.

# help(pd.read_excel) # or ?? pd.read_excel in jupyter cell magic syntax

The most likely ones we will need are:

General

sheet_name: the name of the sheet we want to read from the xlsx file
- pandas will read the first sheet by default
engine: the engine to use to read the xlsx file, for the most part pandas will guess the right one!
- openpyxl supports newer Excel file formats
- calamine supports Excel (.xls, .xlsx, .xlsm, .xlsb) and OpenDocument (.ods) file formats
- odf supports OpenDocument file formats (.odf, .ods, .odt)
- pyxlsb supports Binary Excel files
- xlrd supports old-style Excel files (.xls)

N.B. If you need a specific engine you may need to install it as they are optional dependancies.

#Choosing more than one sheet returns a dictionary of DataFrames
sports_data = pd.read_excel("sample-data/summer-sports/summer-sports.xlsx",
                            sheet_name=["README", "data"]
                            )
sports_data["data"]

We can test the engine option by reading an ods formatted file of the same data.

sports_info = pd.read_excel("sample-data/summer-sports/summer-sports.ods",
                            sheet_name=["README", "data"],
                            engine="odf"
                            )
sports_info["data"]

# Our readme sheet is now a DataFrame too, but we can add it as
# metadata to the data DataFrame

sports_data = sports_info["data"]
sports_data.attrs = {"metadata" : sports_info["README"]}

sports_data.attrs

{'metadata':    What's in this Dataset?                                         Unnamed: 1  \
 0                     Name  Summer Sports Experience and "Kids in Motion" ...   
 1                      URL  https://data.cityofnewyork.us/Recreation/Summe...   
 2              Description  The Kids in Motion (KIM) program provides free...   
 3                     Rows                                              12.3K   
 4                  Columns                                                  7   
 5           Each row is a…  Each record represents weekly attendance at at...   
 6                      NaN                                                NaN   
 7                   Schema                                                NaN   
 8              Column Name                                        Description   
 9                  Borough  Borough in which Kids in Motion or Summer Spor...   
 10        ParkorPlayground  Name of park, playground, or recreation center...   
 11                    Date                                Date class occurred   
 12            SportsPlayed  Name of sport played for Summer Sports Experience   
 13                KIMorSSE  Is this class part of Kids in Motion or Summer...   
 14              Attendance                            Attendance for the week   
 
           Unnamed: 2          Unnamed: 3  
 0                NaN                 NaN  
 1                NaN                 NaN  
 2                NaN                 NaN  
 3                NaN                 NaN  
 4                NaN                 NaN  
 5                NaN                 NaN  
 6                NaN                 NaN  
 7                NaN                 NaN  
 8     API Field Name           Data Type  
 9            borough                Text  
 10  parkorplayground                Text  
 11              date  Floating Timestamp  
 12      sportsplayed                Text  
 13          kimorsse                Text  
 14        attendance              Number  }

Rows

header: the row number to use as the column names
- uses the first row by default
skiprows: the number of rows to skip at the beginning of the file
- will not skip any rows by default
nrows: the number of rows to read from the file
- reads all rows by default

Columns

index_col: the column number to use as the index
- uses a default integer index
usecols: the columns to read from the xlsx file
- reads all columns by default
dtype: the data type to use for the columns
- tries to infer the data type from the data

# The fish alreay have an index column, and let's set the data 
# type of the color columns to category



acnh_fish_df = pd.read_csv('sample-data/animal-crossing/fish.csv', 
                           index_col=0, 
                           dtype={"Color 1":'category'}
                           )
acnh_fish_df

Data

parse_dates: whether to parse dates.
- If not specified, pandas will not parse dates by default.
na_values: the values to consider as missing values.
- If not specified, pandas will use a default set of missing values.
converters: the functions to use to convert the data in the columns.
- If not specified, pandas will use a default set of converters.
thousands: the thousands separator to use.
- If not specified, pandas will use a default thousands separator.
decimal: the decimal separator to use.
- If not specified, pandas will use a default decimal separator.

Let’s use the converters option to make the time of day values in the month columns more useable.

from enum import Enum

class TimeOfDay(Enum):
   am = 1
   pm = 2
   all_day = 3

def time_of_day(time_period:str) -> TimeOfDay:
      if time_period == "All day":
         return TimeOfDay.all_day
      elif time_period == "4 AM – 9 PM":
         return TimeOfDay.am
      elif time_period == "9 AM – 4 PM":
         return TimeOfDay.pm
      else:
         return None

acnh_fish_df = pd.read_csv('sample-data/animal-crossing/fish.csv', 
                           index_col=0, 
                           dtype={"Color 1":'category', 
                                  "Color 2":'category'},
                           
                           converters={"NH Jan": time_of_day}
                        )
acnh_fish_df["NH Jan"]

We can continue doing all the good data cleaning we all love like:

missing values
duplicates
data types
renaming columns
aggregating data

... but we are here to learn about spreadsheets so let’s move on to how to export our pandas data to a spreadsheet!

How to export your DataFrame to a spreadsheet¶

If you have only one DataFrame, you can use to_excel with a path to the file you want to save to.
If you want to combine multiple DataFrames into one spreadsheet, you can use ExcelWriter to create a new Excel file and then use to_excel with a reference to the ExcelWriter object to write each DataFrame to the same file, usually as seperate sheets.

acnh_fish_df.to_excel(
    "sample-data/animal-crossing/acnh-fish.xlsx"
    )

acnh_bug_df = pd.read_csv('sample-data/animal-crossing/insects.csv')
acnh_fossil_df = pd.read_csv('sample-data/animal-crossing/fossils.csv')
acnh_art_df = pd.read_csv('sample-data/animal-crossing/art.csv')

# Let's combine all the item lists into one Excel file
with pd.ExcelWriter('sample-data/animal-crossing/museum.xlsx') as writer:  
    acnh_bug_df.to_excel(writer, sheet_name='bugs')
    acnh_fossil_df.to_excel(writer, sheet_name='fossils')
    acnh_art_df.to_excel(writer, sheet_name='art')

There are some similar options for the to_excel function as there are for the read_excel function.

Develop Pythonic Spreadsheets

5. Anaconda Toolbox and Code

Develop Pythonic Spreadsheets

7. Working with spreadsheets with OpenPyXL