ENGR 1330 Computational Thinking with Data Science
Copyright © 2021 Theodore G. Cleveland and Farhang Forghanparast
Last GitHub Commit Date:
12: Databases and PANDAS¶
Advanced data acquisition (files from a URL)
Examples of manual download vs automated
Objectives¶
Additional practice with dataframe abstraction as implemented in the Pandas library(module).
To be able to access and manipulate data within a dataframe
To be able to obtain basic statistical measures of data within a dataframe
Access files directly from a URL (ordinary http://… )
Using a wget-type function
Using a curl-type function
Using API keys (future versions)
Access files directly from a URL (encrypted https://… CA and self signed certificates)
Using a wget-type function (future versions)
Using a curl-type function (future versions)
Using API keys (future versions)
Review¶
Pandas data abstraction is a dataframe. Once a dataframe is created there are a lot of methods attached to the dataframe structure (actually its class).
Some useful quick-reference sheets
Hence summary sheets will be useful
A summary sheet: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
A different one: http://datacamp-community-prod.s3.amazonaws.com/f04456d7-8e61-482f-9cc9-da6f7f25fc9b
Reading a File into a Dataframe¶
Pandas has methods to read common file types, such as csv
,xls
, and json
.
Ordinary text files are also quite manageable.
Specifying
engine='openpyxl'
in the read/write statement is required for the xml versions of Excel (xlsx). Default is .xls regardless of file name. If you still encounter read errors, try opening the file in Excel and saving as .xls (Excel 97-2004 Workbook) or as a CSV if the structure is appropriate.
You may have to install the packages using something likesudo -H /opt/jupyterhub/bin/python3 -m pip install xlwt openpyxl xlsxwriter xlrd
from the terminal interface (adjust the path to your system) or something likesudo -H /opt/conda/envs/python/bin/python -m pip install xlwt openpyxl xlsxwriter xlrd
Or in the anaconda powershellconda install xlwt openpyxl xlsxwriter xlrd
The files in the following examples are CSV_ReadingFile.csv, Excel_ReadingFile.xlsx,
import pandas
readfilecsv = pandas.read_csv('CSV_ReadingFile.csv') #Reading a .csv file
print(readfilecsv)
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
Similar to reading and writing .csv files, you can also read and write .xslx files as below (useful to know this)
readfileexcel = pandas.read_excel('Excel_ReadingFile.xlsx', sheet_name='Sheet1', engine='openpyxl') #Reading a .xlsx file
print(readfileexcel)
Unnamed: 0 a b c d
0 0 0 1 2 3
1 1 4 5 6 7
2 2 8 9 10 11
3 3 12 13 14 15
Writing a dataframe to file¶
#Creating and writing to a .csv file
readfilecsv = pandas.read_csv('CSV_ReadingFile.csv')
readfilecsv.to_csv('CSV_WritingFile1.csv') # write to local directory
readfilecsv = pandas.read_csv('CSV_WritingFile1.csv') # read the file back
print(readfilecsv)
Unnamed: 0 a b c d
0 0 0 1 2 3
1 1 4 5 6 7
2 2 8 9 10 11
3 3 12 13 14 15
#Creating and writing to a .csv file by excluding row labels
readfilecsv = pandas.read_csv('CSV_ReadingFile.csv')
readfilecsv.to_csv('CSV_WritingFile2.csv', index = False)
readfilecsv = pandas.read_csv('CSV_WritingFile2.csv')
print(readfilecsv)
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
#Creating and writing to a .xls file
readfileexcel = pandas.read_excel('Excel_ReadingFile.xlsx', sheet_name='Sheet1', engine='openpyxl')
readfileexcel.to_excel('Excel_WritingFile.xlsx', sheet_name='Sheet1' , index = False, engine='openpyxl')
readfileexcel = pandas.read_excel('Excel_WritingFile.xlsx', sheet_name='Sheet1', engine='openpyxl')
print(readfileexcel)
Unnamed: 0 a b c d
0 0 0 1 2 3
1 1 4 5 6 7
2 2 8 9 10 11
3 3 12 13 14 15
Access files directly from a URL (ordinary http://… )¶
This section shows how to get files from a remote computer. There are several ways to get the files, most importantly you need the FQDN to the file.
Method: Get the actual file from a remote web server (unencrypted)¶
You know the FQDN to the file it will be in structure of “http://server-name/…/filename.ext”
The server is running ordinary (unencrypted) web services, i.e.
http://...
We will need a module to interface with the remote server. Here we will use requests
, so first we load the module
You may need to install the module into your anaconda environment using the anaconda power shell using
conda install requests
or if you are running a JupyterHub using the terminal prompt and system commands. On the course content host computer the commands are:
sudo -H /opt/jupyterhub/bin/python3 -m pip install requests
Or:
sudo -H /opt/conda/envs/python/bin/python -m pip install requests
You will have to do some reading, but with any luck something similar will work for you.
import pandas
import requests # Module to process http/https requests
Now we will generate a GET
request to the remote http server. I chose to do so using a variable to store the remote URL so I can reuse code in future projects. The GET
request (an http/https method) is generated with the requests method get
and assigned to an object named rget
– the name is arbitrary. Next we extract the file from the rget
object and write it to a local file with the name of the remote file - esentially automating the download process. Then we import the pandas
module.
remote_url="http://54.243.252.9/engr-1330-webroot/4-Databases/all_quads_gross_evaporation.csv" # set the url
rget = requests.get(remote_url, allow_redirects=True) # get the remote resource, follow imbedded links
open('all_quads_gross_evaporation.csv','wb').write(rget.content) # extract from the remote the contents, assign to a local file same name
import pandas as pd # Module to process dataframes (not absolutely needed but somewhat easier than using primatives, and gives graphing tools)
Now we can read the file contents and check its structure, before proceeding.
#evapdf = pd.read_csv("all_quads_gross_evaporation.csv",parse_dates=["YYYY-MM"]) # Read the file as a .CSV assign to a dataframe evapdf
evapdf = pandas.read_csv("all_quads_gross_evaporation.csv")
evapdf.head() # check structure
YYYY-MM | 104 | 105 | 106 | 107 | 108 | 204 | 205 | 206 | 207 | ... | 911 | 912 | 1008 | 1009 | 1010 | 1011 | 1108 | 1109 | 1110 | 1210 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1954-01 | 1.80 | 1.80 | 2.02 | 2.24 | 2.24 | 2.34 | 1.89 | 1.80 | 1.99 | ... | 1.42 | 1.30 | 2.50 | 2.42 | 1.94 | 1.29 | 2.59 | 2.49 | 2.22 | 2.27 |
1 | 1954-02 | 4.27 | 4.27 | 4.13 | 3.98 | 3.90 | 4.18 | 4.26 | 4.27 | 4.26 | ... | 2.59 | 2.51 | 4.71 | 4.30 | 3.84 | 2.50 | 5.07 | 4.62 | 4.05 | 4.18 |
2 | 1954-03 | 4.98 | 4.98 | 4.62 | 4.25 | 4.20 | 5.01 | 4.98 | 4.98 | 4.68 | ... | 3.21 | 3.21 | 6.21 | 6.06 | 5.02 | 3.21 | 6.32 | 6.20 | 5.68 | 5.70 |
3 | 1954-04 | 6.09 | 5.94 | 5.94 | 6.07 | 5.27 | 6.31 | 5.98 | 5.89 | 5.72 | ... | 3.83 | 3.54 | 6.45 | 6.25 | 4.92 | 3.54 | 6.59 | 6.44 | 5.88 | 5.95 |
4 | 1954-05 | 5.41 | 5.09 | 5.14 | 4.40 | 3.61 | 5.57 | 4.56 | 4.47 | 4.18 | ... | 3.48 | 3.97 | 7.92 | 8.13 | 6.31 | 3.99 | 7.75 | 7.98 | 7.40 | 7.40 |
5 rows × 93 columns
Structure looks like a spreadsheet as expected; lets plot the time series for cell ‘911’
evapdf.plot.line(x='YYYY-MM',y='911') # Plot quadrant 911 evaporation time series
<AxesSubplot:xlabel='YYYY-MM'>
evapdf[['911','912']] # pull out columns
911 | 912 | |
---|---|---|
0 | 1.42 | 1.30 |
1 | 2.59 | 2.51 |
2 | 3.21 | 3.21 |
3 | 3.83 | 3.54 |
4 | 3.48 | 3.97 |
... | ... | ... |
787 | 5.96 | 6.06 |
788 | 5.17 | 5.39 |
789 | 4.47 | 4.39 |
790 | 2.49 | 2.40 |
791 | 2.39 | 2.31 |
792 rows × 2 columns
evapdf[evapdf['YYYY-MM'] == "1993-01"][['911','912']] # get 2 columns from 1993-01 date in YYYY-MM
911 | 912 | |
---|---|---|
468 | 1.79 | 1.81 |
Access files directly from a URL (encrypted https://… )(Future Semesters)¶
This section shows how to get files from a remote computer that is serving web requests using encryption. The key indicator is that the remote website is running the https:\\...
protocol. Most such websites use a 3-rd party certificate authority (CA) and the method is essentially the same as above. But small-time websites (like ours) use a self-signed certificate which requires us to modify the request to circumvent our computer’s security settings. Obviously only do so if you know the remote website is legitimate.
Other things to consider:
The remote is running
https:\\...
using a self-signed certificateYou know the remote is a correct source of the file you desire
The file is not executible (you want to verify the file is ASCII UTF-8 or some other similar encoding) if its binary, grab a small portion of the file from your browser and inspect using a hex-editor
References¶
Overland, B. (2018). Python Without Fear. Addison-Wesley ISBN 978-0-13-468747-6.
Grus, Joel (2015). Data Science from Scratch: First Principles with Python O’Reilly Media. Kindle Edition.
Precord, C. (2010) wxPython 2.8 Application Development Cookbook Packt Publishing Ltd. Birmingham , B27 6PA, UK ISBN 978-1-849511-78-0.
Johnson, J. (2020). Python Numpy Tutorial (with Jupyter and Colab). Retrieved September 15, 2020, from https://cs231n.github.io/python-numpy-tutorial/
Willems, K. (2019). (Tutorial) Python NUMPY Array TUTORIAL. Retrieved September 15, 2020, from https://www.datacamp.com/community/tutorials/python-numpy-tutorial?utm_source=adwords_ppc
Willems, K. (2017). NumPy Cheat Sheet: Data Analysis in Python. Retrieved September 15, 2020, from https://www.datacamp.com/community/blog/python-numpy-cheat-sheet
W3resource. (2020). NumPy: Compare two given arrays. Retrieved September 15, 2020, from https://www.w3resource.com/python-exercises/numpy/python-numpy-exercise-28.php
Sorting https://www.programiz.com/python-programming/methods/list/sort
https://www.oreilly.com/library/view/relational-theory-for/9781449365431/ch01.html
https://realpython.com/pandas-read-write-files/#using-pandas-to-write-and-read-excel-files
Laboratory 12¶
Examine (click) Laboratory 12 as a webpage at Laboratory 12.html
Download (right-click, save target as …) Laboratory 12 as a jupyterlab notebook from Laboratory 12.ipynb
Exercise Set 12¶
Examine (click) Exercise Set 11 as a webpage at Exercise 12.html
Download (right-click, save target as …) Exercise Set 12 as a jupyterlab notebook at Exercise Set 12.ipynb