Download this page as a jupyter notebook at Lab 10

Laboratory 10: Databases

LAST NAME, FIRST NAME

R00000000

ENGR 1330 Laboratory 10 - In Lab

Pandas Cheat Sheet(s)

The Pandas library is a preferred tool for data scientists to perform data manipulation and analysis, next to matplotlib for data visualization and NumPy for scientific computing in Python.

The fast, flexible, and expressive Pandas data structures are designed to make real-world data analysis significantly easier, but this might not be immediately the case for those who are just getting started with it. Exactly because there is so much functionality built into this package that the options are overwhelming.

Hence summary sheets will be useful

Pandas

A data table is called a DataFrame in pandas (and other programming environments too).

The figure below from https://pandas.pydata.org/docs/getting_started/index.html illustrates a dataframe model:

Each column and each row in a dataframe is called a series, the header row, and index column are special.

To use pandas, we need to import the module, often pandas has numpy as a dependency so it also must be imported

Dataframe-structure using primative python

First lets construct a dataframe like object using python primatives. We will construct 3 lists, one for row names, one for column names, and one for the content.

The above builds a placeholder named mytable for the psuedo-dataframe. Next we populate the table, using a for loop to write the column names in the first row, row names in the first column, and the table fill for the rest of the table.

Now lets print the table out by row and we see we have a very dataframe-like structure

We can also query by row

Or by column

Or by row+column index; sort of looks like a spreadsheet syntax.

Create a proper dataframe

We will now do the same using pandas

We can also turn our table into a dataframe, notice how the constructor adds header row and index column

To get proper behavior, we can just reuse our original objects

Getting the shape of dataframes

The shape method will return the row and column rank (count) of a dataframe.

Appending new columns

To append a column simply assign a value to a new column name to the dataframe

Appending new rows

A bit trickier but we can create a copy of a row and concatenate it back into the dataframe.

Removing Rows and Columns

To remove a column is straightforward, we use the drop method

To remove a row, you really got to want to, easiest is probablty to create a new dataframe with the row removed

Indexing

We have already been indexing, but a few examples follow:

Conditional Selection

Descriptor Functions

head method

Returns the first few rows, useful to infer structure

info method

Returns the data model (data column count, names, data types)

describe method

Returns summary statistics of each numeric column.
Also returns the minimum and maximum value in each column, and the IQR (Interquartile Range).
Again useful to understand structure of the columns.

Counting and Sum methods

There are also methods for counts and sums by specific columns

The unique method returns a list of unique values (filters out duplicates in the list, underlying dataframe is preserved)

The nunique method returns a count of unique values

The value_counts() method returns the count of each unique value (kind of like a histogram, but each value is the bin)

Using functions in dataframes - symbolic apply

The power of pandas is an ability to apply a function to each element of a dataframe series (or a whole frame) by a technique called symbolic (or synthetic programming) application of the function.

Its pretty complicated but quite handy, best shown by an example

Sorts

Exercise 1

Create a prototype function to compute the cube root of a numeric object (literally two lines to define the function), recall exponentation is available in primative python.

Apply your function to column 'X' of dataframe newtable created above

Aggregating (Grouping Values) dataframe contents

Filtering out missing values

Below we drop any row that contains a NaN code.

Below we replace NaN codes with some value, in this case 0

Below we replace NaN codes with some value, in this case the mean value of of the column in which the missing value code resides.

Exercise 2

Replace the 'NaN' codes with the string 'missing' in dataframe 'df'