ENGR 1330 Computational Thinking with Data Science

Last GitHub Commit Date: 31 January 2021

Lesson 8 The Pandas module




Objectives

  1. To understand the dataframe abstraction as implemented in the Pandas library (module).
    1. To be able to access and manipulate data within a dataframe
    2. To be able to obtain basic statistical measures of data within a dataframe
  2. Read/Write from/to files
    1. MS Excel-type files (.xls, .xlsx, .csv) (LibreOffice files use the MS .xml standard)
    2. Ordinary ASCII (.txt) files
  3. Access files directly from a URL (advanced concept)
    1. Using a wget-type function
    2. Using a curl-type function
    3. Using API keys (future versions)

Pandas:

Pandas is the core library for dataframe manipulation in Python. It provides high-performance, easy-to-use tabular data structures and tools for working with them. The library’s name is derived from the term ‘Panel Data’. If you are curious about Pandas, this cheat sheet is recommended: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf

Data Structure

The primary data structure is called a dataframe. It is an abstraction where data are represented as a 2-dimensional, mutable, heterogeneous tabular structure, much like a worksheet in MS Excel. The structure itself is popular among statisticians, data scientists, and business executives.

According to the marketing department, "Pandas provides rich data structures and functions designed to make working with data fast, easy, and expressive. It is useful in data manipulation, cleaning, and analysis; Pandas excels in performance and productivity."


The Dataframe

A data table is called a DataFrame in pandas (and other programming environments too). The figure below from https://pandas.pydata.org/docs/getting_started/index.html illustrates a dataframe model:

Each column (and each row) of a dataframe is called a Series; the header row and the index column are special. As in MS Excel, we can query the dataframe for the contents of a particular cell using its row name and column name, or operate on entire rows and columns.

To use pandas, we need to import the module.
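A minimal import, using the conventional alias:

```python
# import the pandas module; pd is the community-conventional alias
import pandas as pd

print(pd.__version__)   # report which version is active
```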

Computational Thinking Concepts

The CT concepts expressed within Pandas include:

Module Set-Up

In principle, Pandas should be available in a default Anaconda install.

How to check
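One way to check is to attempt the import and watch for an error:

```python
# if this runs without a ModuleNotFoundError, pandas is installed
try:
    import pandas as pd
    print("pandas", pd.__version__, "is available")
except ModuleNotFoundError:
    print("pandas is not installed")
```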

If you do get an error, that means you will have to install using conda or pip; you are on your own here! On the content server the process is:

  1. Open a new terminal from the launcher
  2. Change to the root user with su, then enter the root password
  3. sudo -H /opt/jupyterhub/bin/python3 -m pip install pandas
  4. Wait until the install is complete; for security, user compthink is not in the sudo group
  5. Verify the install by trying to execute import pandas as above.

The process above will be similar on a Macintosh, or on Windows if you did not use an Anaconda distribution. Best is to have a successful Anaconda install, or go to the GoodJobUntilMyOrgansGetHarvested.

If you have to do this kind of install, you will have to do some reading, some references I find useful are:

  1. https://jupyterlab.readthedocs.io/en/stable/user/extensions.html
  2. https://www.pugetsystems.com/labs/hpc/Note-How-To-Install-JupyterHub-on-a-Local-Server-1673/#InstallJupyterHub
  3. https://jupyterhub.readthedocs.io/en/stable/installation-guide-hard.html (This is the approach on the content server which has a functioning JupyterHub)

Dataframe-type Structure using primitive python

First, let's construct a dataframe-like object using python primitives. We will construct 3 lists: one for row names, one for column names, and one for the content.

The above builds a placeholder named mytable for the pseudo-dataframe. Next we populate the table, using a for loop to write the column names in the first row, row names in the first column, and the table fill for the rest of the table.
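The two steps might look like this sketch; the list names and fill values here are assumptions for illustration, not from the lesson:

```python
# three primitive lists: row names, column names, and the content (assumed values)
myrowname = ['one', 'two', 'three', 'four']
mycolname = ['A', 'B', 'C']
mytablebody = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]

# placeholder: a (rows+1) x (cols+1) grid of blanks for the pseudo-dataframe
mytable = [[' ' for _ in range(len(mycolname) + 1)] for _ in range(len(myrowname) + 1)]

for j in range(len(mycolname)):          # column names across the first row
    mytable[0][j + 1] = mycolname[j]
for i in range(len(myrowname)):          # row names down the first column
    mytable[i + 1][0] = myrowname[i]
    for j in range(len(mycolname)):      # table fill for the rest
        mytable[i + 1][j + 1] = mytablebody[i][j]
```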

Now let's print the table out by row, and we see we have a very dataframe-like structure.

We can also query by row

Or by column

Or by row+column index; it sort of looks like spreadsheet syntax.
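A sketch of these print-and-query steps, using assumed example lists:

```python
# rebuild the pseudo-dataframe from assumed example lists
myrowname = ['one', 'two', 'three', 'four']
mycolname = ['A', 'B', 'C']
mytablebody = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
mytable = [[' '] + mycolname] + \
          [[myrowname[i]] + mytablebody[i] for i in range(len(myrowname))]

for row in mytable:                      # print out by row
    print(row)

print(mytable[2])                        # query by row (row 'two')
print([row[1] for row in mytable])       # query by column (column 'A')
print(mytable[2][1])                     # row+column index, spreadsheet-like
```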

Now we shall create a proper dataframe

We will now do the same using pandas

We can also turn our table into a dataframe; notice how the constructor adds its own header row and index column.

To get proper behavior, we can just reuse our original objects
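A sketch of both constructions, with assumed example lists. mydf treats our labels as ordinary data under pandas' automatic 0, 1, 2, ... header and index, while mydf2 reuses the original objects as the index and header:

```python
import pandas as pd

myrowname = ['one', 'two', 'three', 'four']
mycolname = ['A', 'B', 'C']
mytablebody = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
mytable = [[' '] + mycolname] + \
          [[myrowname[i]] + mytablebody[i] for i in range(len(myrowname))]

mydf = pd.DataFrame(mytable)   # labels become data; pandas adds its own 0,1,2,...
print(mydf)

# reuse the original objects so the labels land where they belong
mydf2 = pd.DataFrame(mytablebody, index=myrowname, columns=mycolname)
print(mydf2)
```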

Why are mydf and mydf2 different?

Getting the shape of dataframes

The shape attribute, available once the dataframe is constructed, returns the row and column rank (count) of a dataframe.
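For example (assumed example frame):

```python
import pandas as pd

mydf2 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
                     index=['one', 'two', 'three', 'four'],
                     columns=['A', 'B', 'C'])

print(mydf2.shape)   # (row count, column count)
```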

Appending new columns

To append a column, simply assign a value to a new column name in the dataframe.
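A sketch, using an assumed example frame:

```python
import pandas as pd

mydf2 = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                     index=['one', 'two'], columns=['A', 'B', 'C'])

mydf2['D'] = mydf2['A'] + mydf2['C']   # assigning to a new name appends a column
print(mydf2)
```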

Appending new rows

This is sometimes a bit trickier but here is one way:
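Two ways that work in current pandas (the append method seen in older notebooks was removed in pandas 2.x; the frame and values below are assumed examples):

```python
import pandas as pd

mydf2 = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                     index=['one', 'two'], columns=['A', 'B', 'C'])

# one way: assign a list to a new index label with .loc
mydf2.loc['three'] = [7, 8, 9]

# another way: concatenate a one-row dataframe
newrow = pd.DataFrame([[10, 11, 12]], index=['four'], columns=mydf2.columns)
mydf2 = pd.concat([mydf2, newrow])
print(mydf2)
```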

Removing Rows and Columns

To remove a column is straightforward: we use the drop method.

To remove a row you really have to want to; easiest is probably to create a new dataframe with the row removed.
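Both removals sketched on an assumed example frame; note that drop returns a new frame and leaves the original untouched:

```python
import pandas as pd

mydf2 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                     index=['one', 'two', 'three'], columns=['A', 'B', 'C'])

nocolB = mydf2.drop('B', axis=1)     # drop a column along axis 1
norow2 = mydf2.drop('two', axis=0)   # a new frame with the row removed
print(nocolB)
print(norow2)
```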

Indexing

We have already been indexing, but a few examples follow:
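A few indexing idioms on an assumed example frame:

```python
import pandas as pd

mydf2 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                     index=['one', 'two', 'three'], columns=['A', 'B', 'C'])

print(mydf2['A'])              # one column, returned as a Series
print(mydf2.loc['two'])        # one row, by label
print(mydf2.loc['two', 'A'])   # one cell, by row and column label
print(mydf2.iloc[1, 0])        # the same cell, by integer position
```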

Conditional Selection
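Rows can be selected by a boolean condition; a sketch on an assumed example frame:

```python
import pandas as pd

mydf2 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                     index=['one', 'two', 'three'], columns=['A', 'B', 'C'])

print(mydf2[mydf2['A'] > 3])   # keep only rows where column A exceeds 3
print(mydf2[mydf2 > 5])        # whole-frame test; failing cells become NaN
```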

Descriptor Functions

head method

Returns the first few rows, useful to infer structure

info method

Returns the data model (data column count, names, data types)

describe method

Returns summary statistics of each numeric column: the count, mean, standard deviation, minimum and maximum values, and the 25th, 50th, and 75th percentiles (from which the IQR, the Interquartile Range, can be computed).
Again useful to understand the structure of the columns.
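The three descriptor methods on an assumed example frame:

```python
import pandas as pd

mydf2 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
                     index=['one', 'two', 'three', 'four'],
                     columns=['A', 'B', 'C'])

print(mydf2.head(2))      # first few rows
mydf2.info()              # column count, names, and data types
print(mydf2.describe())   # count, mean, std, min, quartiles, max
```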

Counting and Sum methods

There are also methods for counts and sums by specific columns

The unique method returns a list of unique values (it filters duplicates out of the list; the underlying dataframe is preserved).

The nunique method returns a count of unique values

The value_counts() method returns the count of each unique value (kind of like a histogram, but each value is the bin)
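All five sketched on an assumed example column with deliberate duplicates:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 2, 3, 3, 3]})   # assumed example values

print(df['col1'].count())          # 6 non-missing entries
print(df['col1'].sum())            # 14
print(df['col1'].unique())         # distinct values; the frame itself is preserved
print(df['col1'].nunique())        # 3 distinct values
print(df['col1'].value_counts())   # occurrences of each value, histogram-like
```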

Using functions in dataframes - symbolic apply

The power of Pandas is the ability to apply a function to each element of a dataframe series (or a whole frame) by a technique called symbolic (or synthetic programming) application of the function.

This employs principles of pattern matching, abstraction, and algorithm development; a holy trinity of Computational Thinking.

It's somewhat complicated but quite handy; best shown by an example:
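A sketch on an assumed example frame; the function name is passed symbolically and pandas calls it on every element:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})   # assumed example

def times2(x):          # an ordinary function, never called by name below
    return x * 2

print(df['A'].apply(times2))                   # applied to every element of a series
print(df.apply(lambda s: s.max() - s.min()))   # a lambda, applied column by column
```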

Sorts
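Rows are sorted by the values of a chosen column with sort_values; an assumed example:

```python
import pandas as pd

df = pd.DataFrame({'A': [3, 1, 2], 'B': [6, 4, 5]})   # assumed example

print(df.sort_values(by='A'))                    # ascending by default
print(df.sort_values(by='B', ascending=False))   # largest B first
```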

Aggregating (Grouping Values) dataframe contents
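Grouping collapses rows that share a key and aggregates the rest; the office/sales data below are assumed for illustration:

```python
import pandas as pd

# assumed example: sales records keyed by office
df = pd.DataFrame({'office': ['North', 'North', 'South', 'South'],
                   'sales': [100, 150, 200, 50]})

grouped = df.groupby('office')['sales'].sum()   # one total per office
print(grouped)
```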

Filtering out missing values

Filtering and cleaning are often used to describe the process where data that do not support a narrative are removed (typically for maintenance-of-profit applications); if the data are actually missing, that is the common situation where cleaning is justified.

Below we drop any row that contains a NaN code.

Below we replace NaN codes with some value, in this case 0

Below we replace NaN codes with some value, in this case the mean value of the column in which the missing value code resides.
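All three treatments sketched on an assumed example frame seeded with NaN codes:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [4.0, 5.0, np.nan]})   # assumed example with NaN codes

print(df.dropna())                  # drop any row that contains a NaN
print(df.fillna(value=0))           # replace NaN with 0
print(df.fillna(value=df.mean()))   # replace NaN with its column's mean
```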


Reading a File into a Dataframe

Pandas has methods to read common file types, such as .csv, .xlsx, and .json.
Ordinary text files are also quite manageable.

On a machine you control you can write script to retrieve files from the internet and process them.

Similar to reading and writing .csv files, you can also read and write .xlsx files as below (useful to know this).
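A self-contained sketch of read_csv; the file name and contents are assumed (a small file is written first so the read has something to work with). read_excel and read_json follow the same pattern:

```python
import pandas as pd

# create a small .csv file to read (assumed name and contents)
with open('example.csv', 'w') as f:
    f.write('A,B\n1,2\n3,4\n')

df = pd.read_csv('example.csv')   # header row becomes the column names
print(df)
```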

Writing a dataframe to file
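A round-trip sketch with to_csv; the frame and file name are assumed. index=False suppresses the index column so the file holds only the data:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 3], 'B': [2, 4]})   # assumed example
df.to_csv('myoutput.csv', index=False)          # assumed file name

readback = pd.read_csv('myoutput.csv')          # verify the round trip
print(readback)
```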

