In this section of the tutorial, we will learn how to work with data and spreadsheets using the pandas
library in Python. The corrisponding slides for this section are available on the repo.
NB: If you are not running this in a dev container, make sure you check the readme for setup instructions!
Let’s start by importing pandas:
import pandas as pd
Reading our data into pandas¶
The tabular data you want to work with in pandas and then export to a spreadsheet can start in a variety of formats. Some common formats are:
- XSLX (Excel spreadsheets)
- ODF (Open Document Format)
- CSV (comma-separated values)
- JSON (JavaScript Object Notation)
# Lets start by creating a DataFrame from a csv file
acnh_fish_df = pd.read_csv('sample-data/animal-crossing/fish.csv')
acnh_fish_df
We can also create a DataFrame by loading data from a xlsx file:
sports_df = pd.read_excel("sample-data/summer-sports/summer-sports.xlsx")
sports_df
We need to have a bit more control over how we are reading the xlsx file, so let’s look at the options we have for the read_excel
function.
# help(pd.read_excel) # or ?? pd.read_excel in jupyter cell magic syntax
The most likely ones we will need are:
General
sheet_name
: the name of the sheet we want to read from the xlsx file- pandas will read the first sheet by default
engine
: the engine to use to read the xlsx file, for the most part pandas will guess the right one!openpyxl
supports newer Excel file formatscalamine
supports Excel (.xls, .xlsx, .xlsm, .xlsb) and OpenDocument (.ods) file formatsodf
supports OpenDocument file formats (.odf, .ods, .odt)pyxlsb
supports Binary Excel filesxlrd
supports old-style Excel files (.xls)
N.B. If you need a specific engine you may need to install it as they are optional dependancies.
#Choosing more than one sheet returns a dictionary of DataFrames
sports_data = pd.read_excel("sample-data/summer-sports/summer-sports.xlsx",
sheet_name=["README", "data"]
)
sports_data["data"]
We can test the engine
option by reading an ods formatted file of the same data.
sports_info = pd.read_excel("sample-data/summer-sports/summer-sports.ods",
sheet_name=["README", "data"],
engine="odf"
)
sports_info["data"]
# Our readme sheet is now a DataFrame too, but we can add it as
# metadata to the data DataFrame
sports_data = sports_info["data"]
sports_data.attrs = {"metadata" : sports_info["README"]}
sports_data.attrs
{'metadata': What's in this Dataset? Unnamed: 1 \
0 Name Summer Sports Experience and "Kids in Motion" ...
1 URL https://data.cityofnewyork.us/Recreation/Summe...
2 Description The Kids in Motion (KIM) program provides free...
3 Rows 12.3K
4 Columns 7
5 Each row is a… Each record represents weekly attendance at at...
6 NaN NaN
7 Schema NaN
8 Column Name Description
9 Borough Borough in which Kids in Motion or Summer Spor...
10 ParkorPlayground Name of park, playground, or recreation center...
11 Date Date class occurred
12 SportsPlayed Name of sport played for Summer Sports Experience
13 KIMorSSE Is this class part of Kids in Motion or Summer...
14 Attendance Attendance for the week
Unnamed: 2 Unnamed: 3
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 API Field Name Data Type
9 borough Text
10 parkorplayground Text
11 date Floating Timestamp
12 sportsplayed Text
13 kimorsse Text
14 attendance Number }
Rows
header
: the row number to use as the column names- uses the first row by default
skiprows
: the number of rows to skip at the beginning of the file- will not skip any rows by default
nrows
: the number of rows to read from the file- reads all rows by default
Columns
index_col
: the column number to use as the index- uses a default integer index
usecols
: the columns to read from the xlsx file- reads all columns by default
dtype
: the data type to use for the columns- tries to infer the data type from the data
# The fish alreay have an index column, and let's set the data
# type of the color columns to category
acnh_fish_df = pd.read_csv('sample-data/animal-crossing/fish.csv',
index_col=0,
dtype={"Color 1":'category'}
)
acnh_fish_df
Data
parse_dates
: whether to parse dates.- If not specified, pandas will not parse dates by default.
na_values
: the values to consider as missing values.- If not specified, pandas will use a default set of missing values.
converters
: the functions to use to convert the data in the columns.- If not specified, pandas will use a default set of converters.
thousands
: the thousands separator to use.- If not specified, pandas will use a default thousands separator.
decimal
: the decimal separator to use.- If not specified, pandas will use a default decimal separator.
Let’s use the converters
option to make the time of day values in the month columns more useable.
from enum import Enum
class TimeOfDay(Enum):
am = 1
pm = 2
all_day = 3
def time_of_day(time_period:str) -> TimeOfDay:
if time_period == "All day":
return TimeOfDay.all_day
elif time_period == "4 AM – 9 PM":
return TimeOfDay.am
elif time_period == "9 AM – 4 PM":
return TimeOfDay.pm
else:
return None
acnh_fish_df = pd.read_csv('sample-data/animal-crossing/fish.csv',
index_col=0,
dtype={"Color 1":'category',
"Color 2":'category'},
converters={"NH Jan": time_of_day}
)
acnh_fish_df["NH Jan"]
We can continue doing all the good data cleaning we all love like:
- missing values
- duplicates
- data types
- renaming columns
- aggregating data
... but we are here to learn about spreadsheets so let’s move on to how to export our pandas data to a spreadsheet!
How to export your DataFrame to a spreadsheet¶
If you have only one DataFrame, you can use
to_excel
with a path to the file you want to save to.If you want to combine multiple DataFrames into one spreadsheet, you can use
ExcelWriter
to create a new Excel file and then useto_excel
with a reference to theExcelWriter
object to write each DataFrame to the same file, usually as seperate sheets.
acnh_fish_df.to_excel(
"sample-data/animal-crossing/acnh-fish.xlsx"
)
acnh_bug_df = pd.read_csv('sample-data/animal-crossing/insects.csv')
acnh_fossil_df = pd.read_csv('sample-data/animal-crossing/fossils.csv')
acnh_art_df = pd.read_csv('sample-data/animal-crossing/art.csv')
# Let's combine all the item lists into one Excel file
with pd.ExcelWriter('sample-data/animal-crossing/museum.xlsx') as writer:
acnh_bug_df.to_excel(writer, sheet_name='bugs')
acnh_fossil_df.to_excel(writer, sheet_name='fossils')
acnh_art_df.to_excel(writer, sheet_name='art')
There are some similar options for the to_excel
function as there are for the read_excel
function.