LAST NAME, FIRST NAME
R00000000
ENGR 1330 Laboratory 10 - Homework
# Preamble script block to identify host, user, and kernel
import sys
! hostname
! whoami
print(sys.executable)
print(sys.version)
print(sys.version_info)
Pandas is a preferred library among data scientists for data manipulation and analysis in Python, alongside matplotlib for data visualization and NumPy for scientific computing.
The fast, flexible, and expressive Pandas data structures are designed to make real-world data analysis significantly easier, but this may not be immediately apparent to those just getting started, precisely because so much functionality is built into the package that the options can be overwhelming.
Hence, summary sheets are useful:
A summary sheet: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
A different one: http://datacamp-community-prod.s3.amazonaws.com/f04456d7-8e61-482f-9cc9-da6f7f25fc9b
import pandas
import numpy
Pandas has methods to read common file types, such as csv, xlsx, and json. Ordinary text files are also quite manageable. (We will study these more in Lesson 11.)
Here are the steps to follow:
# download the file (do this before running the script)
readfilecsv = pandas.read_csv('CSV_ReadingFile.csv') # Reading a .csv file
# print the contents of readfilecsv
print(readfilecsv)
# How many rows are in the data table?
print('Row count = ',readfilecsv.shape[0])
# How many columns?
print('Col count = ',readfilecsv.shape[1])
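If CSV_ReadingFile.csv is not yet at hand, the same calls can be tried on in-memory text. The sketch below is only an illustration: the table contents and the names `csv_text`, `json_text`, `demo_csv`, and `demo_json` are invented here and are not part of the lab files.

```python
import io
import pandas

# Hypothetical CSV and JSON content, invented for illustration only
csv_text = "x,y\n1,10\n2,20\n3,30\n"
json_text = '[{"x": 1, "y": 10}, {"x": 2, "y": 20}]'

# read_csv and read_json accept file-like objects as well as file names
demo_csv = pandas.read_csv(io.StringIO(csv_text))
demo_json = pandas.read_json(io.StringIO(json_text))

# shape is a (rows, columns) tuple
row_count, col_count = demo_csv.shape
print('Row count = ', row_count)
print('Col count = ', col_count)
print('JSON table shape = ', demo_json.shape)
```

Unpacking `shape` into two names gives the same values as indexing `shape[0]` and `shape[1]`.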
Now that you have downloaded and read a file, let's do it again, but with feeling!
Download the file named concreteData.xls to your local computer.
The file is an Excel 97-2004 Workbook; you may not be able to inspect it within Anaconda. The file size is about 130K, so we will rely on Pandas to read it.
Read the file into a dataframe object named 'concreteData'. The method call has the form:
- object_name = pandas.read_excel(filename)
- It should work as above once you replace the placeholders with the correct names
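Reading a legacy .xls workbook with pandas depends on the separate xlrd package being installed in the kernel. A minimal sketch of a pre-flight check follows; the helper name `has_module` is hypothetical and introduced only for this illustration.

```python
import importlib.util

def has_module(name):
    """Return True if the named top-level module can be imported in this kernel."""
    return importlib.util.find_spec(name) is not None

# pandas relies on the xlrd package to parse legacy .xls workbooks
if has_module('xlrd'):
    print('xlrd found; pandas.read_excel should be able to open .xls files')
else:
    print('xlrd not found; install it (for example: pip install xlrd) before reading .xls files')
```

If the check reports that xlrd is missing, pandas.read_excel on a .xls file is expected to fail with an import error until the package is installed.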
Then perform the following activities.
# Optional automated download (or just download the file manually)
# Get database -- use the Get Data From URL script
#Step 1: import needed modules to interact with the internet
import requests
#Step 2: make the connection to the remote file (equivalent to "curl -O http://fqdn/path ..." in bash)
remote_url = 'http://54.243.252.9/engr-1330-webroot/8-Labs/Lab10/concreteData.xls' # an Excel file
response = requests.get(remote_url) # Gets the file contents and puts them into an object
with open('concreteData.xls', 'wb') as output: # Prepare a local destination
    output.write(response.content) # write contents of object to named local file; the file closes when the block ends
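The download steps above write whatever the server returns, including an error page if the request fails. A hedged sketch of a more defensive variant is shown below; the function name `download_file` and the 30-second timeout are assumptions made for illustration, not part of the lab script.

```python
import requests

def download_file(url, destination, timeout=30):
    """Download url to destination and return the number of bytes written.

    Hypothetical helper: raises requests.HTTPError if the server
    answers with a 4xx or 5xx status instead of saving an error page.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # stop here on an HTTP error status
    with open(destination, 'wb') as output:  # file is closed when the block ends
        output.write(response.content)
    return len(response.content)
```

It could be called as `download_file(remote_url, 'concreteData.xls')`, after which the file is read with pandas as before.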
# code here looks like object_name = pandas.read_excel(filename)
concreteData = pandas.read_excel('concreteData.xls')
# code here looks like object_name.head()
concreteData.head()
# code here
req_col_names = ["Cement", "BlastFurnaceSlag", "FlyAsh", "Water", "Superplasticizer",
                 "CoarseAggregate", "FineAggregate", "Age", "CC_Strength"]
curr_col_names = list(concreteData.columns)
mapper = {}
for i, name in enumerate(curr_col_names):
    mapper[name] = req_col_names[i]  # pair each current name with its replacement by position
concreteData = concreteData.rename(columns=mapper)
concreteData.head()
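The renaming loop above pairs old and new column names by position. The same mapping can be sketched more compactly with `dict(zip(...))`. The small table below and the names `demo` and `new_names` are invented for illustration; they are not the concrete data.

```python
import pandas

# Hypothetical two-column table with verbose headers, invented for illustration
demo = pandas.DataFrame({'Cement (kg per m^3)': [540.0, 332.5],
                         'Water (kg per m^3)': [162.0, 228.0]})
new_names = ['Cement', 'Water']

# Guard against a silent mismatch between old and new name counts
assert len(new_names) == len(demo.columns)

# zip pairs old names with new names by position; dict turns the pairs into a mapper
mapper = dict(zip(demo.columns, new_names))
demo = demo.rename(columns=mapper)
print(list(demo.columns))
```

The length check matters because `zip` stops at the shorter sequence, so a missing name would otherwise leave a column unrenamed without any error.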
# code here
concreteData.describe()
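The object returned by `describe()` is itself a dataframe, so individual statistics can be looked up by row and column label, and the `empty` attribute gives a quick test for a table with no data. A minimal sketch on invented values (the table `demo` is hypothetical):

```python
import pandas

# Hypothetical values, invented for illustration only
demo = pandas.DataFrame({'Age': [3, 7, 28],
                         'CC_Strength': [20.0, 30.0, 40.0]})

summary = demo.describe()  # count, mean, std, min, quartiles, max for each numeric column
print(summary.loc['mean', 'CC_Strength'])  # a single statistic, selected by row and column label

print(demo.empty)                 # False: the table holds data
print(pandas.DataFrame().empty)   # True: a dataframe with no rows or columns
```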
# Run the code block below only after concreteData exists and is non-empty (how do you know?)
# It takes a while to render output, give it a minute:
import matplotlib.pyplot
import seaborn
%matplotlib inline
seaborn.pairplot(concreteData)
matplotlib.pyplot.show()
The output is a graphic containing a matrix of plots that show the relationship of each column with every other column.
The main diagonal contains histograms of the variable represented by each column. None of the individual plots indicates a clear relationship between strength and any single variable, so it is likely that multiple variables are required to build a prediction engine.
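A visual reading of the pair plot can be complemented with a numeric one. The sketch below uses the pandas `corr()` method, which is not part of the original lab steps, on a small invented table; the column values are hypothetical and chosen so that one pair is perfectly linear.

```python
import pandas

# Hypothetical values, invented for illustration only
demo = pandas.DataFrame({'Water': [1.0, 2.0, 3.0, 4.0],
                         'Strength': [8.0, 6.0, 4.0, 2.0],
                         'Age': [1.0, 3.0, 2.0, 4.0]})

corr = demo.corr()  # pairwise Pearson correlation between numeric columns
print(corr['Strength'])  # correlation of every column with Strength
```

Values near +1 or -1 suggest a strong linear relationship with a single variable; values near 0 are consistent with the observation that no single plot shows a clear trend.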