7. Logistic Regression

7. Logistic Regression#

Course Website

http://54.243.252.9/ce-5319-webroot/

So far, our focus has been on prediction engines, where the goal is to build models that predict a continuous response variable from one or more input features. These models minimize some form of squared error and aim to fit the best curve or surface to the data—an approach deeply rooted in continuous prediction.

But in many practical problems, we are not trying to predict a value; instead, we want to make a decision or assign a label. Will a system fail or not? Will a contaminant exceed a threshold? Will a storm cause flooding? These are classification problems, where the output is categorical rather than numerical.

Readings/References#

Burkov, A. (2019) The One Hundred Page Machine Learning Book Required Textbook
Rashid, Tariq. (2016) Make Your Own Neural Network. Kindle Edition. Required Textbook

Videos#

What is “Logistic Regression”?#

Despite its name, logistic regression is not about regression in the traditional sense. Instead, it models the probability that a given input belongs to a particular class—usually coded as 0 or 1. It sits right on the boundary between regression and classification: we still estimate parameters by minimizing a loss function, but the target is a probability, and the ultimate goal is a binary decision.

It bridges the familiar world of regression with the discrete nature of classification.
It uses nonlinear transformation (the logistic or sigmoid function) to map predictions to probabilities.
It lays the groundwork for more sophisticated classifiers, such as decision trees, support vector machines, and neural networks.

Civil engineers can use logistic regression to:

Predict whether a hydraulic structure is likely to fail based on load and condition data,
Classify storm events as safe or hazardous,
And explore how decision boundaries emerge from simple parameter estimates.

Let’s begin by looking at the core idea: turning a linear model into a probability estimator.

The Sigmoid Function: From Scores to Probabilities#

At the heart of logistic regression is the sigmoid (or logistic) function:

\[\sigma(z) = \frac{1}{1+e^{-z}}\]

This function takes any real-valued number $z$ and maps it to the interval $[0,1]$. In the context of logistic regression, $z$ is typically a linear combination of input variables:

\[z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots +\beta_n x_n \]

The sigmoid then transforms this linear score into a probability—interpreted as the likelihood of belonging to the positive class (usually labeled as 1).

Python Code to Visualize the Sigmoid#

%reset -f
import numpy as np
import matplotlib.pyplot as plt

# Sigmoid function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Generate values for z
z = np.linspace(-10, 10, 400)
prob = sigmoid(z)

# Plot
plt.figure(figsize=(8, 4))
plt.plot(z, prob, label='Sigmoid Function', color='blue')
plt.axvline(0, color='gray', linestyle='--', linewidth=1)
plt.axhline(0.5, color='gray', linestyle='--', linewidth=1)
plt.title("Sigmoid Function – Logistic Regression")
plt.xlabel("Linear Score (z)")
plt.ylabel("Predicted Probability")
plt.grid(True)
plt.legend()
plt.show()

../../_images/5ed6d50a5fac4e6c14d8ec3c586740b58de90210ab8ef8120f79f01938e4d2bb.png

Simple Example: Logistic Model for System Failure#

Let’s suppose you want to predict whether a system fails (1) or operates normally (0) based on a single variable: applied load.

We simulate some simple data below, then fit a logistic model.

%reset -f
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# Simulated data
np.random.seed(0)
load = np.linspace(0, 10, 100).reshape(-1, 1)
failure_prob = 1 / (1 + np.exp(-1.2 * (load - 5)))  # Centered around 5
failure = np.random.binomial(1, failure_prob.ravel())  # Simulate binary outcomes

# Fit logistic regression
model = LogisticRegression()
model.fit(load, failure)

# Predict probabilities
load_test = np.linspace(0, 10, 200).reshape(-1, 1)
predicted_prob = model.predict_proba(load_test)[:, 1]

# Plot
plt.figure(figsize=(8, 4))
plt.scatter(load, failure, label='Observed Outcome', alpha=0.5, color='black')
plt.plot(load_test, predicted_prob, label='Logistic Regression Fit', color='red')
plt.axhline(0.5, linestyle='--', color='gray')
plt.title("System Failure Prediction Using Logistic Regression")
plt.xlabel("Applied Load")
plt.ylabel("Probability of Failure")
plt.legend()
plt.grid(True)
plt.show()

../../_images/3afb32b08848398f3cc0ed05fd1cf4ef0e8cc197d99d2d22f8afecd91862b71f.png

7.1 A Logistic Regression Example#

Logistic regression is a type of model fitting exercise where the observed responses are binary encoded (0,1). The input design matrix is a collection of features much like ordinary linear regression.

The logistic function is

\[\pi_i(\beta) = \frac{e^{X_i~\beta}}{1+e^{X_i~\beta}} \]

where $X_i$ is the i-th row of the design matrix, and $\beta$ are unknown coefficients.

The associated optimization problem is to minimize some measure of error between the model values (above) and the observed values, typically a squared error is considered.

If we consider a single observatyon $Y_i$ its value is either 0 or 1. So the error is

\[\epsilon_i = Y_i - \pi_i(\beta) = Y_i - \frac{e^{X_i~\beta}}{1+e^{X_i~\beta}} \]

The function we wish to minimize is

\[\min_{\beta} (Y - \pi(\beta))^T(Y-\pi(\beta))\]

The minimization method is a matter of analyst preference - these are usually well behaved minimization problems (the examples will use Powell’s Direction Set Method, and the Broyden–Fletcher–Goldfarb–Shanno algorithm; most professional tools operate on the logarithm of the components to avoid numerical overflow during the minimization process.

Note

Powell’s method is robust and comparatively simple to program, here is some older code Powell’s Method FORTRAN source code that is fairly readable and still works today!

The model result $\pi_i$ will range between 0 and 1, and can be interpreted as a probability that the inputs predict a value of 1. Normally the decision boundary is set at 0.5, so when the model is cast into a prediction engine it will first compute a $\pi_i$ value, then apply the decision boundary rule to either snap the result to 1 or 0 as indicated by the probability.

The example below should help clarify a bit.

Success at Task vs Experience#

A systems analyst studied the effect of experience on ability to complete a complex task within a specified time. The participants had varying amounts of experience measured in months as shown in the table below, where E represents months of experience, and S is binary 0=FAIL, 1=SUCCESS

Perform a logistic regression and estimate the experience required to ensure sucess with an empirical probability of 75 percent.

%reset -f
# Load the data
experience=[14,29,6,25,18,4,18,12,22,6,30,11,30,5,20,13,9,32,24,13,19,4,28,22,8]
result=[0,0,0,1,1,0,0,0,1,0,1,0,1,0,1,0,0,1,0,1,0,0,1,1,1]

Now lets plot the data and see what we can learn, later on we will need a model and data plot on same figure, so just build it here

# Load a Plotting Tool
import matplotlib.pyplot as plt
def make1plot(listx1,listy1,strlablx,strlably,strtitle):
    mydata = plt.figure(figsize = (6,6)) # build a square drawing canvass from figure class
    plt.plot(listx1,listy1, c='red', marker='o',linewidth=0) # basic data plot
    plt.xlabel(strlablx)
    plt.ylabel(strlably)
    plt.legend(['Data','Model'])# modify for argument insertion
    plt.title(strtitle)
    plt.grid(axis='y')
    plt.show()
def make2plot(listx1,listy1,listx2,listy2,strlablx,strlably,strtitle):
    mydata = plt.figure(figsize = (6,6)) # build a square drawing canvass from figure class
    plt.plot(listx1,listy1, c='red', marker='o',linewidth=0) # basic data plot
    plt.plot(listx2,listy2, c='blue',linewidth=1) # basic model plot
    plt.xlabel(strlablx)
    plt.ylabel(strlably)
    plt.legend(['Data','Model'])# modify for argument insertion
    plt.title(strtitle)
    plt.grid(axis='y')
    plt.show()

Now plot the data

make1plot(experience,result,"Experience (months)","Success in Task","Success vs Experience")

../../_images/f0bfc6774595f797536c230569a60928ced9eb662c7f09d35d3ffeae8d7de498.png

Now a logistic model, we will need 3 prototype functions; a sigmoid function, an error function (here we use sum of squared error) and a merit function to minimize by changing $\beta_0$ and $\beta_1$.

The sigmoid function is

\[\pi_i = \frac{e^{\beta_0 + \beta_1 X_i}}{1+e^{\beta_0 + \beta_1 X_i}} \]

def pii(b0,b1,x): #sigmoidal function
    import math
    pii = math.exp(b0+b1*x)/(1+ math.exp(b0+b1*x))
    return(pii)

def sse(mod,obs): #compute sse from observations and model values
    howmany = len(mod)
    sse=0.0
    for i in range(howmany):
        sse=sse+(mod[i]-obs[i])**2
    return(sse)

def merit(beta): # merit function to minimize
    global result,experience #access lists already defined external to function
    mod=[0 for i in range(len(experience))]
    for i in range(len(experience)):
        mod[i]=pii(beta[0],beta[1],experience[i])
    merit = sse(mod,result)
    return(merit)

Now we will attempt to minimize the merit function by changing values of $\beta$, first lets apply an initial value and check that our merit function works

beta = [0,0] #initial guess of betas
merit(beta) #check that does not raise an exception

6.25

Good we obtained a value, and not an error, now an optimizer

import numpy as np
from scipy.optimize import minimize

#x0 = np.array([-3.0597,0.1615])
x0 = np.array(beta)
res = minimize(merit, x0, method='powell',options={'disp': True})

Optimization terminated successfully.
         Current function value: 4.199713
         Iterations: 5
         Function evaluations: 132

Yay! It found some answer, examine and plot the findings

fitted=[0 for i in range(40)]
xaxis =[0 for i in range(40)]
beta[0]=res.x[0]
beta[1]=res.x[1]
for i in range(40):
    xaxis[i]=float(i)
    fitted[i]=pii(res.x[0],res.x[1],float(i))
    
print(" b0 = ",res.x[0])
print(" b1 = ",res.x[1])
plottitle = 'Sucess vs Experience and\n Fitted Logistic Response Function\n'+'b0: '+ str(round(res.x[0],4))+ ' b1: ' +str(round(res.x[1],4))
make2plot(experience,result,xaxis,fitted,'Experience (months)','Sucess in Task',plottitle)

 b0 =  -2.982390266878914
 b1 =  0.15954743797099522

../../_images/3ff6af6477f68a070d67869af47d848d02298915b34fc79269ac734342a62f8b.png

Now to recover estimates of success for different experience levels

guess = 25.5
pii(res.x[0],res.x[1],guess)

0.7476408418994896

So we expect that someone with 25 to 26 months of experience would have a 75 percent chance of success. Similarily the time to median success is 18 months and 3 weeks.

guess = 18.75
pii(res.x[0],res.x[1],guess)

0.5022810329444896

Now using the same data, but a formal logistic package

# import the class
import numpy
from sklearn.linear_model import LogisticRegression
X = numpy.array(experience)
y = numpy.array(result)
# instantiate the model (using the default parameters)
logreg = LogisticRegression(solver='lbfgs',max_iter=10000);
# fit the model with data  -TRAIN the model
logreg.fit(X.reshape(-1, 1),y);

logreg.coef_

array([[0.16080665]])

xplot = numpy.array(xaxis)
y_pred=logreg.predict(xplot.reshape(-1,1))

b0 = logreg.intercept_[0]
b1 = logreg.coef_[0][0]

plottitle = 'Sucess vs Experience and\n Fitted Logistic Response Function\n Using SKLEARN\n '+'b0: '+ str(round(b0,4)) + ' b1: ' + str(round(b1,4))
make2plot(experience,result,xplot,y_pred,'Experience (months)','Sucess in Task',plottitle)

../../_images/a2a05dc254d524570b0478794c8d286a5bd32e05d6db8e09867dce9b862370cd.png

for i in range(40):
    xaxis[i]=float(i)
    fitted[i]=pii(beta[0],beta[1],float(i))
#    fitted[i]=pii(b0,b1,float(i))
    if fitted[i] > 0.5:
        fitted[i] = 1.0
    else:
        fitted[i]=0.0

plottitle = 'Sucess vs Experience and\n Fitted Logistic Response Function\n Using Homebrew and 0.5 decision boundary\n '+'b0: '+ str(round(beta[0],4)) + ' b1: ' + str(round(beta[1],4))
make2plot(experience,result,xaxis,fitted,'Experience (months)','Sucess in Task',plottitle)

../../_images/7d12c73b5aa9adc5920ec72b7a78422d90af5fa3df75695200767317304921a1.png

A proper ML approach would create a hold-out set (split) to provide some test of the model - in this example the data set is too small but to explore the syntax using ML workflow ….

Split the Data Set#

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=0)

Now perform the ML execrise in typical workflow

# instantiate the model (using the default parameters)
logreg = LogisticRegression(solver='lbfgs',max_iter=10000);
# fit the model with data  -TRAIN the model
logreg.fit(X_train.reshape(-1, 1),y_train);

xplot = numpy.array(xaxis)
y_pred=logreg.predict(X_test.reshape(-1,1))

b0 = logreg.intercept_[0]
b1 = logreg.coef_[0][0]

for i in range(40):
    xaxis[i]=float(i)
#    fitted[i]=pii(beta[0],beta[1],float(i))
    fitted[i]=pii(b0,b1,float(i))
    if fitted[i] > 0.5:
        fitted[i] = 1.0
    else:
        fitted[i]=0.0

plottitle = 'Sucess vs Experience and\n Fitted Logistic Response Function\n Using SKLEARN\n '+'b0: '+ str(round(b0,4)) + ' b1: ' + str(round(b1,4))
make2plot(X_test,y_test,xaxis,fitted,'Experience (months)','Sucess in Task',plottitle)

../../_images/2c477740b4d9eccf71ad5f4144ff03d5a8f08a957cda315fe393f4ff1e5726a3.png

How to assess the performance of logistic regression?#

Binary classification has four possible types of results:

True negatives: correctly predicted negatives (zeros)

True positives: correctly predicted positives (ones)

False negatives: incorrectly predicted negatives (zeros)

False positives: incorrectly predicted positives (ones)

We usually evaluate the performance of a classifier by comparing the actual and predicted outputs and counting the correct and incorrect predictions. Our simple graph above conveys the same information, just not as easily interpreted.

A confusion matrix is a table that is used to display the performance of a classification model.

Some other indicators of binary classifiers include the following:

The most straightforward indicator of classification accuracy is the ratio of the number of correct predictions to the total number of predictions (or observations).
The positive predictive value is the ratio of the number of true positives to the sum of the numbers of true and false positives.
The negative predictive value is the ratio of the number of true negatives to the sum of the numbers of true and false negatives.
The sensitivity (also known as recall or true positive rate) is the ratio of the number of true positives to the number of actual positives.
The precision score quantifies the ability of a classifier to not label a negative example as positive. The precision score can be interpreted as the probability that a positive prediction made by the classifier is positive.
The specificity (or true negative rate) is the ratio of the number of true negatives to the number of actual negatives.

The extent of importance of recall and precision depends on the problem. Achieving a high recall is more important than getting a high precision in cases like when we would like to detect as many heart patients as possible. For some other models, like classifying whether a bank customer is a loan defaulter or not, it is desirable to have a high precision since the bank wouldn’t want to lose customers who were denied a loan based on the model’s prediction that they would be defaulters. There are also a lot of situations where both precision and recall are equally important. Then we would aim for not only a high recall but a high precision as well. In such cases, we use something called F1-score. F1-score is the Harmonic mean of the Precision and Recall:

This is easier to work with since now, instead of balancing precision and recall, we can just aim for a good F1-score and that would be indicative of a good Precision and a good Recall value as well.

Model Evaluation using Confusion Matrix#

The confusion matrix is a table that is used to evaluate the performance of a classification model. You can also visualize the performance of an algorithm. The fundamental content of a confusion matrix is the number of correct and incorrect predictions are summed up class-wise.

# import the metrics class
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_pred, y_test)
cnf_matrix

array([[4, 1],
       [0, 2]])

However it is often more visible in a heatmap structure

import seaborn as sns
import pandas as pd
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Predicted label')
plt.xlabel('Actual label');

../../_images/151355a24ca19154bc2e3e8b7a4557f370c323a2554548b639ed9e3bbbb45000.png

Confusion Matrix Evaluation Metrics: evaluate the model using model evaluation metrics such as accuracy, precision, and recall. Its all built-in to sklearn (although we could get it entirely from our homebrew results too!)

print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("Precision:",metrics.precision_score(y_test, y_pred))
print("Recall:",metrics.recall_score(y_test, y_pred))
print("F1-score:",metrics.f1_score(y_test, y_pred))

Accuracy: 0.8571428571428571
Precision: 1.0
Recall: 0.6666666666666666
F1-score: 0.8

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.80      1.00      0.89         4
           1       1.00      0.67      0.80         3

    accuracy                           0.86         7
   macro avg       0.90      0.83      0.84         7
weighted avg       0.89      0.86      0.85         7

7.2 Selecting the Decision Boundary#

In practice, the decision boundary for logistic regression is commonly set at 0.5, reflecting the point where the model’s estimated probability of success equals that of failure.

However, this threshold is not fixed. It may be lowered to increase sensitivity (e.g., in public health to catch more positives), or raised to improve specificity (e.g., in fraud detection to reduce false positives).

Thresholds should reflect the relative cost of misclassification, the balance of class frequencies, or policy objectives — and may be determined through ROC analysis, Precision-Recall tradeoffs, or domain knowledge.

Some references#

Hastie, Tibshirani, Friedman (2009) The Elements of Statistical Learning Chapter 4.4: “Two-Class Logistic Regression” … “the class label is typically assigned based on whether the estimated probability exceeds 0.5. However, if the costs of misclassification are asymmetric, this threshold can be adjusted accordingly.”
James, Witten, Hastie, Tibshirani (2013) An Introduction to Statistical Learning Chapter 4.3: “Logistic Regression” … “If we are interested in the most accurate predictions, then the default threshold of 0.5 is often reasonable. But if false positives or false negatives are more costly, then the threshold should be modified.”
Christopher Bishop (2006) Pattern Recognition and Machine Learning Chapter 4: “Linear Models for Classification” Discusses decision boundaries in probabilistic terms, and introduces the idea that thresholds may vary based on prior class probabilities or expected costs of errors.
sklearn.linear_model.LogisticRegression guide Recommends using ROC curves or Precision-Recall analysis to set thresholds based on context, particularly for imbalanced datasets.

Tip

The 0.5 threshold is standard in binary logistic regression, it’s not a rigid or universal rule. The threshold used to define the decision boundary can and often should be adjusted based on:

Class imbalance
Cost of false positives vs false negatives
Risk preferences or operating context
ROC or Precision-Recall trade-offs

Note

Below is a modularized version generated by Chat GPT, using usual principles of AI-assisted code generation.

The actual prompt is

“Here’s my logistic example, can you examine and modularize my code. I want to maintain my multiple cell structure for teaching pace”

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def make1plot(x, y, xlabel, ylabel, title):
    plt.figure(figsize=(6, 6))
    plt.plot(x, y, 'o')
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.grid(True)
    plt.show()

# Simulate experience (months) and probability of success
np.random.seed(42)
experience = np.linspace(0, 12, 50).reshape(-1, 1)
true_params = [0.8, -4.5]
logit_score = true_params[0] * experience + true_params[1]
prob_success = sigmoid(logit_score)
result = np.random.binomial(1, prob_success.ravel())  # Binary outcome

make1plot(experience, result, "Experience (months)", "Success in Task", "Success vs Experience")

../../_images/0593d1d9748c596ca0fda3dbed1406636aff70cb06ef811945eb08156037aa6d.png

model = LogisticRegression()
model.fit(experience, result)

# Predict probability over a finer grid
x_pred = np.linspace(0, 12, 200).reshape(-1, 1)
y_pred = model.predict_proba(x_pred)[:, 1]

plt.figure(figsize=(6, 6))
plt.scatter(experience, result, label="Observed", color="black", alpha=0.5)
plt.plot(x_pred, y_pred, color="red", label="Logistic Fit")
plt.xlabel("Experience (months)")
plt.ylabel("Predicted Probability")
plt.title("Logistic Regression – Fitted Probability Curve")
plt.legend()
plt.grid(True)
plt.show()

../../_images/2fd003a4f4c9227a878aacec7c23ab2671e7a24741a6322368e9760105b60b09.png

y_pred_binary = model.predict(experience)
print(confusion_matrix(result, y_pred_binary))
print(classification_report(result, y_pred_binary))

[[22  3]
 [ 3 22]]
              precision    recall  f1-score   support

           0       0.88      0.88      0.88        25
           1       0.88      0.88      0.88        25

    accuracy                           0.88        50
   macro avg       0.88      0.88      0.88        50
weighted avg       0.88      0.88      0.88        50

7.3 Logistic Regression in Designed Experiments#

Logistic regression is a powerful tool for analyzing structured experiments, especially when the outcomes are expressed as proportions. These situations often arise in marketing studies, medical trials, and other behavioral or economic experiments.

For instance, suppose we send out 1,000 surveys and receive 300 responses. While each individual response is binary (responded or not), we’re often more interested in the proportion of responses in each category or treatment level. In such cases, we compute the observed proportion as:

\[ p_j = \frac{R_j}{n_j} \]

where:

$ p_j $ is the observed proportion of responses in treatment level $ j $,
$ R_j $ is the number of positive (1-valued) responses, and
$ n_j $ is the total number of observations in that treatment level.

We then model these proportions using the logistic (sigmoid) transformation:

\[ \pi_j(\beta) = \frac{e^{X_j \beta}}{1 + e^{X_j \beta}} \]

The residual error is:

\[ \epsilon_j = p_j - \pi_j(\beta) \]

and the objective function to minimize becomes:

\[ \min_{\beta} \left( \vec{p} - \pi(\beta) \right)^T \left( \vec{p} - \pi(\beta) \right) \]

The example below illustrates this approach using synthetic data from a coupon redemption study.

Redemption Rate vs. Rebate Value#

Consider a marketing study designed to estimate how the value of a coupon rebate affects the likelihood of customer redemption. In this experiment, 1,000 participants were randomly selected and each received a coupon along with advertising material. The coupons offered different rebate values—specifically, $5, $10, $15, $20, and $30. Each rebate level was assigned to 200 participants.

The key variables in this study are:

Predictor: Coupon rebate value ($)
Response: Number of coupons redeemed in each group

The data are summarized in the following table:

Level $ j $	Price Reduction $ X_j $	Participants $ n_j $	Redeemed $ R_j $
1	5	200	30
2	10	200	55
3	15	200	70
4	20	200	100
5	30	200	137

As with any modeling task, we begin by preparing the dataset, computing the proportion of redemptions at each level, and visualizing the relationship between the rebate and redemption rate.

%reset -f
# Load The Data
level = [1,2,3,4,5]
reduction = [5,10,15,20,30]
participants = [200,200,200,200,200]
redeemed = [30,55,70,100,137]
proportion = [0 for i in range(len(level))]
for i in range(len(proportion)):
    proportion[i]=redeemed[i]/participants[i]

Next, we perform some basic exploratory data analysis. Since we only have one predictor (rebate value) and a proportion response, a simple scatter plot will suffice to help us visualize the relationship.

This initial step helps us answer: is there a clear trend? Does a larger rebate lead to more redemptions?

import matplotlib.pyplot as plt
def make1plot(listx1,listy1,strlablx,strlably,strtitle):
    mydata = plt.figure(figsize = (6,6)) # build a square drawing canvass from figure class
    plt.plot(listx1,listy1, c='red', marker='o',linewidth=0) # basic data plot
    plt.xlabel(strlablx)
    plt.ylabel(strlably)
    plt.legend(['Data','Model'])# modify for argument insertion
    plt.title(strtitle)
    plt.grid(axis='y')
    plt.show()

def make2plot(listx1,listy1,listx2,listy2,strlablx,strlably,strtitle):
    mydata = plt.figure(figsize = (6,6)) # build a square drawing canvass from figure class
    plt.plot(listx1,listy1, c='red', marker='v',linewidth=0) # basic data plot
    plt.plot(listx2,listy2, c='blue',linewidth=1) # basic model plot
    plt.grid(which='both',axis='both')
    plt.xlabel(strlablx)
    plt.ylabel(strlably)
    plt.legend(['Data','Model'])# modify for argument insertion
    plt.title(strtitle)
    plt.grid(axis='y')
    plt.show()

%matplotlib inline

make1plot(reduction,proportion,'Price Reduction','Proportion Redemmed','Plot of Proportion of Coupons Redeemed')

../../_images/4cd58550d2173c662d64c6bd9d7afa3229500451bb97eab694319722377036ab.png

From the plot, we observe a clear trend: as the rebate value increases, the proportion of coupons redeemed also increases. This aligns with our intuition—higher incentives generally lead to greater participation.

But what if we want to estimate redemption rates for rebate values that weren’t tested, or optimize the rebate amount to achieve a target redemption rate? For these tasks, logistic regression offers a smooth predictive curve over the entire domain, even between and beyond the sampled rebate values.

We now define our logistic regression functions—reusing the structure from a previous example, but adapting the variable names to fit the context of this marketing study.

def pii(b0,b1,x): #sigmoidal function
    import math
    pii = math.exp(b0+b1*x)/(1+ math.exp(b0+b1*x))
    return(pii)

def sse(mod,obs): #compute sse from observations and model values
    howmany = len(mod)
    sse=0.0
    for i in range(howmany):
        sse=sse+(mod[i]-obs[i])**2
    return(sse)

def merit(beta): # merit function to minimize
    global proportion,reduction #access lists already defined external to function
    mod=[0 for i in range(len(proportion))]
    for i in range(len(level)):
        mod[i]=pii(beta[0],beta[1],reduction[i])
    merit = sse(mod,proportion)
    return(merit)

We make an initial guess for the regression coefficients and evaluate the merit function. This helps us understand the behavior of the model before full optimization.

beta = [0,0] # initial guess
merit(beta) # test for exceptions

0.22985

We now fit the model using Powell’s method, a derivative-free optimization algorithm suitable for this kind of nonlinear objective function.

import numpy as np
from scipy.optimize import minimize

x0 = np.array([-2,0.09])
res = minimize(merit, x0, method='powell',options={'disp': True})
# 
fitted=[0 for i in range(50)]
xaxis =[0 for i in range(50)]
for i in range(50):
    xaxis[i]=float(i)
    fitted[i]=pii(res.x[0],res.x[1],float(i))
    
print(" b0 = ",res.x[0])
print(" b1 = ",res.x[1])

plottitle = 'Coupons Retreived and\n Fitted Logistic Response Function\n'+'b0: '+ str(round(res.x[0],4))+ ' b1: ' +str(round(res.x[1],4))
make2plot(reduction,proportion,xaxis,fitted,'Price Reduction','Proportion Redemmed',plottitle)

Optimization terminated successfully.
         Current function value: 0.002032
         Iterations: 3
         Function evaluations: 81
 b0 =  -2.0165402749739756
 b1 =  0.09580373659184761

../../_images/80f3a7d93fa91927653fb468d596d45c948200f8a791ece1d6d555088ce8b157.png

With the model fit, we can ask practical questions: What rebate value corresponds to a 50% redemption rate? This is equivalent to finding the decision boundary of our logistic model.

guess = 21.05
fraction = pii(res.x[0],res.x[1],guess)
print("For Reduction of ",guess," projected redemption rate is ",round(fraction,3))

For Reduction of  21.05  projected redemption rate is  0.5

For more aggressive marketing goals—such as liquidating excess inventory—we might ask: what rebate value is needed to push redemption to 98%?

guess = 61.90
fraction = pii(res.x[0],res.x[1],guess)
print("For Reduction of ",guess," projected redemption rate is ",round(fraction,3))

For Reduction of  61.9  projected redemption rate is  0.98

To communicate our results to non-technical stakeholders, we generate a plot showing the fitted logistic curve and the observed redemption rates. This visualization helps reinforce trust in the model’s predictions.

fitted=[0 for i in range(80)]
xaxis =[0 for i in range(80)]
for i in range(80):
    xaxis[i]=float(i)
    fitted[i]=pii(res.x[0],res.x[1],float(i))

plottitle = 'Coupons Retreived and\n Fitted Logistic Response Function\n'+'b0: '+ str(round(res.x[0],4))+ ' b1: ' +str(round(res.x[1],4))
make2plot(reduction,proportion,xaxis,fitted,'Price Reduction','Proportion Redemmed',plottitle)

../../_images/d37a529dddc7eead0b6f71b19fa3804570384563290399e03036e001f1204a63.png

Exercises#

Use the model to estimate redemption rate for a $25 rebate.
Modify the number of participants per level and re-run the model. How sensitive are predictions to sample size?
Suppose we cap the rebate budget at $1,500 for the entire group. Can you propose an optimal allocation across price levels?

References#

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Set seed for repeatability
np.random.seed(42)

# Define rebate levels (predictor values)
rebate_levels = np.array([5, 10, 15, 20, 30])

# Randomize number of participants per level (e.g., between 150 and 250)
participants_per_level = np.random.randint(150, 250, size=len(rebate_levels))

# Define true logistic model parameters for simulation
true_intercept = -2.0
true_slope = 0.12

# Compute true probability of redemption at each level
def logistic(x, b0=true_intercept, b1=true_slope):
    return 1 / (1 + np.exp(-(b0 + b1 * x)))

true_probs = logistic(rebate_levels)

# Simulate number of redemptions as binomial draws
redemptions = np.random.binomial(participants_per_level, true_probs)

# Assemble the simulated dataset
simulated_data = pd.DataFrame({
    'Rebate': rebate_levels,
    'Participants': participants_per_level,
    'Redeemed': redemptions,
    'Proportion': redemptions / participants_per_level
})

# Show the dataset
print(simulated_data)

# Optional: plot the simulated proportions and true logistic curve
x_plot = np.linspace(0, 35, 200)
plt.scatter(rebate_levels, simulated_data['Proportion'], label='Simulated Proportions', zorder=3)
plt.plot(x_plot, logistic(x_plot), 'r--', label='True Logistic Model')
plt.xlabel('Rebate Value ($)')
plt.ylabel('Redemption Rate')
plt.title('Simulated Coupon Redemption Data')
plt.grid(True, linestyle='--', alpha=0.6)
plt.legend()
plt.tight_layout()
plt.show()

   Rebate  Participants  Redeemed  Proportion
     5           201        37    0.184080
    10           242        81    0.334711
    15           164        71    0.432927
    20           221       145    0.656109
    30           210       175    0.833333

../../_images/e59d13509a0f7d41e1945696a096d5d43ef581196fbddfbb14b89a81dbcc02a4.png

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Set seed for reproducibility
np.random.seed(123)

# Define rebate levels (predictor values)
rebate_levels = np.array([5, 10, 15, 20, 30])

# Randomize number of participants per level
participants_per_level = np.random.randint(150, 251, size=len(rebate_levels))

# Base logistic model slope (kept fixed)
true_slope = 0.12

# Define intercepts that vary by rebate level
# (e.g., due to regional or marketing channel effects)
base_intercept = -2.0
intercept_variation = np.random.normal(loc=0, scale=0.3, size=len(rebate_levels))
varying_intercepts = base_intercept + intercept_variation

# Define logistic function with per-level intercepts
def logistic(x, b0, b1=true_slope):
    return 1 / (1 + np.exp(-(b0 + b1 * x)))

# Compute probabilities and simulate redemptions
true_probs = [logistic(x, b0) for x, b0 in zip(rebate_levels, varying_intercepts)]
redemptions = np.random.binomial(participants_per_level, true_probs)

# Assemble dataset
simulated_data = pd.DataFrame({
    'Rebate': rebate_levels,
    'Intercept': np.round(varying_intercepts, 3),
    'Participants': participants_per_level,
    'Redeemed': redemptions,
    'Proportion': redemptions / participants_per_level
})

# Display the simulated data
print(simulated_data)

# Plot: Simulated data points and multiple logistic curves
x_plot = np.linspace(0, 35, 300)
plt.figure(figsize=(8, 5))

# Plot each individual curve
for i, (b0, label) in enumerate(zip(varying_intercepts, rebate_levels)):
    plt.plot(x_plot, logistic(x_plot, b0), linestyle='--', alpha=0.5)

# Overlay simulated points
plt.scatter(rebate_levels, simulated_data['Proportion'], color='black', label='Simulated Data', zorder=5)
plt.xlabel('Rebate Value ($)')
plt.ylabel('Redemption Rate')
plt.title('Simulated Data with Varying Intercepts')
plt.grid(True, linestyle='--', alpha=0.6)
plt.legend()
plt.tight_layout()
plt.show()

   Rebate  Intercept  Participants  Redeemed  Proportion
     5     -1.544           216        58    0.268519
    10     -2.014           242        84    0.347107
    15     -1.521           248       140    0.564516
    20     -2.535           167        80    0.479042
    30     -2.086           233       185    0.793991

../../_images/c23150e8734a54ac84a41bad2a93b1123fe0215c237910e4091352a00301dcbe.png

7.4 Multiple Logistic Regression#

Simple logistic regression can be extended to handle multiple predictors by expanding the definition of the design matrix $ X $ and the coefficient vector $ \beta $, just as we did in linear regression.

This approach allows us to explore more complex relationships where the probability of a binary outcome depends on multiple features.

Disease Rate vs Multiple Predictors#

In this illustrative example, researchers investigate the outbreak of a mosquito-borne illness across two urban sectors. Randomly selected individuals were surveyed to determine if they had recently contracted the disease (encoded as 1 if present, 0 otherwise).

Disease status was determined by a rapid test administered during a public health enforcement campaign.

For each participant, the following features were collected:

Age (a continuous variable),
Socioeconomic status, and
Sector of the city where the individual resides.

To use these features in a logistic regression model, we convert the categorical variables into binary indicators (one-hot encoding). Socioeconomic status is grouped into three levels:

Socioeconomic Status	$X_2$	$X_3$
Upper quartile (≥ 75% income)	0	0
Middle quartile (25–75% income)	1	0
Lower quartile (≤ 25% income)	0	1

We select the upper-income group as the reference category, anticipating that this group would experience the lowest disease incidence due to better environmental conditions (e.g., access to mosquito control).

City sector is encoded as follows:

City Sector	$X_4$
Red	0
Blue	1

The dataset is available at: West Nile Virus Data

A snippet of the raw data file looks like:

CASE,AGE,INCOME,SECTOR,SICK
1,33, 1, 1, 0, 1
2,35, 1, 1, 0, 1
3, 6, 1, 1, 0, 0
4,60, 1, 1, 0, 1
5,18, 3, 1, 1, 0
6,26, 3, 1, 0, 0
 ...
 ... many more rows
 ...

Note that the INCOME field is stored as an ordinal value (1, 2, 3) and will need to be recoded into binary indicator variables for proper interpretation in a multiple logistic regression model.

As usual, our workflow begins with importing and inspecting the data, followed by any necessary preprocessing.

%reset -f
# get data
import pandas as pd
df = pd.read_csv("APC3.DAT",
                header=0,index_col=["CASE"],
                usecols=["CASE","AGE","INCOME","SECTOR","SICK"])
df.head()

	AGE	INCOME	SECTOR	SICK
CASE
1	33	1	1	0
2	35	1	1	0
3	6	1	1	0
4	60	1	1	0
5	18	3	1	1

The data has been successfully read. Note that we applied a small fix to use the CASE column as the index.

We begin by using .describe() and .info() to get a feel for the data types, missing values, and distributions of the numeric columns.

df.describe()

	AGE	INCOME	SECTOR	SICK
count	196.000000	196.000000	196.000000	196.000000
mean	25.178571	1.964286	1.403061	0.290816
std	18.904721	0.867505	0.491769	0.455302
min	1.000000	1.000000	1.000000	0.000000
25%	10.750000	1.000000	1.000000	0.000000
50%	21.000000	2.000000	1.000000	0.000000
75%	35.000000	3.000000	2.000000	1.000000
max	85.000000	3.000000	2.000000	1.000000

Next, we prepare to encode the INCOME variable using one-hot encoding.

We’re assuming that the original INCOME column uses the following scale:

1 → Upper income group (reference)
2 → Middle income group
3 → Lower income group

We will create two new binary variables X2 and X3 to represent these groups.

df["X2"]=0
df["X3"]=0
df.head()

	AGE	INCOME	SECTOR	SICK	X2	X3
CASE
1	33	1	1	0	0	0
2	35	1	1	0	0	0
3	6	1	1	0	0	0
4	60	1	1	0	0	0
5	18	3	1	1	0	0

Here we perform the encoding based on the definitions above.

This transformation is crucial for logistic regression to correctly interpret the effect of each socioeconomic category relative to the reference group.

def isone(a):
    if a == 1:
        isone = 1
    else:
        isone = 0
    return(isone)

def istwo(a):
    if a == 2:
        isone = 1
    else:
        isone = 0
    return(isone)

def isthree(a):
    if a == 3:
        isone = 1
    else:
        isone = 0
    return(isone)

df["X2"]=df["INCOME"].apply(istwo)
df["X3"]=df["INCOME"].apply(isone)

df.head()

	AGE	INCOME	SECTOR	SICK	X2	X3
CASE
1	33	1	1	0	0	1
2	35	1	1	0	0	1
3	6	1	1	0	0	1
4	60	1	1	0	0	1
5	18	3	1	1	0	0

Now we implement our homebrew logistic regression using Python functions.

These functions compute:

The predicted probabilities using the logistic model,
The merit function (sum of squared residuals), and
A predictor for new data.

These prototypes will be useful for educational purposes and allow us to compare against library-based tools later.

def pii(b0,b1,b2,b3,b4,x1,x2,x3,x4): #sigmoidal function
    import math
    pii = math.exp(b0+b1*x1+b2*x2+b3*x3+b4*x4)/(1+ math.exp(b0+b1*x1+b2*x2+b3*x3+b4*x4))
    return(pii)

def sse(mod,obs): #compute sse from observations and model values
    howmany = len(mod)
    sse=0.0
    for i in range(howmany):
        sse=sse+(mod[i]-obs[i])**2
    return(sse)

feature_cols = ['AGE', 'X2','X3', 'SECTOR']
Xmat = df[feature_cols].values # Features as an array (not a dataframe)
Yvec = df["SICK"].values # Target as a vector (not a dataframe)

def merit(beta): # merit function to minimize
    global Yvec,Xmat #access lists already defined external to function
    mod=[0 for i in range(len(Yvec))]
    for i in range(len(Yvec)):
        mod[i]=pii(beta[0],beta[1],beta[2],beta[3],beta[4],Xmat[i][0],Xmat[i][1],Xmat[i][2],Xmat[i][3])
    merit = sse(mod,Yvec)
    return(merit)

We test the custom functions to ensure they accept inputs and return outputs without raising errors.

This sanity check confirms the functions are correctly structured before we begin optimization.

beta=[0,0,0,0,0]
irow = 1
print("The pi function ",pii(beta[0],beta[1],beta[2],beta[3],beta[4],Xmat[irow][0],Xmat[irow][1],Xmat[irow][2],Xmat[irow][3]))
print("The merit function ",merit(beta))

The pi function  0.5
The merit function  49.0

So both functions “work” in the sense that input goes in, output comes out - no exceptions are raised.

We now attempt to optimize the logistic regression parameters using Powell’s method.

Note

This is a five-dimensional optimization problem (one intercept and four coefficients), which may be challenging to solve numerically—but we’ll proceed and evaluate the outcome.

import numpy as np
from scipy.optimize import minimize

x0 = np.array(beta)
res = minimize(merit, x0, method='powell',options={'disp': True})
# 
fitted=[0 for i in range(90)]
xaxis_age =[0 for i in range(90)]
#xaxis_sector =[0 for i in range(90)]
#xaxis_x2 =[0 for i in range(90)]
#xaxis_x3 =[0 for i in range(90)]

sector_tmp = 1 # some sector
x2_tmp = 0 #  not poor
x3_tmp = 0 # not middle

for i in range(90):
    xaxis_age[i]=float(i)
    fitted[i]=pii(res.x[0],res.x[1],res.x[2],res.x[3],res.x[4],xaxis_age[i],x2_tmp,x3_tmp,sector_tmp)
    
print(" b0 = ",res.x[0])
print(" b1 = ",res.x[1])
print(" b2 = ",res.x[2])
print(" b3 = ",res.x[3])
print(" b4 = ",res.x[4])

Optimization terminated successfully.
         Current function value: 35.382536
         Iterations: 8
         Function evaluations: 460
 b0 =  -3.219051882695808
 b1 =  0.024207668771930674
 b2 =  -0.13015812819127218
 b3 =  -0.25363504886581595
 b4 =  1.2449844422211769

Now how can we evaluate the “fit”?

Interpreting a 5D logistic surface is difficult at best, but we can visualize how the predicted probability varies with age while holding other variables constant.

This gives us at least one view of the model’s behavior.

# Load a Plotting Tool
import matplotlib.pyplot as plt
def make1plot(listx1,listy1,strlablx,strlably,strtitle):
    mydata = plt.figure(figsize = (6,6)) # build a square drawing canvass from figure class
    plt.plot(listx1,listy1, c='red', marker='o',linewidth=0) # basic data plot
    plt.xlabel(strlablx)
    plt.ylabel(strlably)
    plt.legend(['Data','Model'])# modify for argument insertion
    plt.title(strtitle)
    plt.grid(axis='y')
    plt.show()
def make2plot(listx1,listy1,listx2,listy2,strlablx,strlably,strtitle):
    mydata = plt.figure(figsize = (6,6)) # build a square drawing canvass from figure class
    plt.plot(listx1,listy1, c='red', marker='o',linewidth=0) # basic data plot
    plt.plot(listx2,listy2, c='blue',linewidth=1) # basic model plot
    plt.xlabel(strlablx)
    plt.ylabel(strlably)
    plt.legend(['Data','Model'])# modify for argument insertion
    plt.title(strtitle)
    plt.grid(axis='y')
    plt.show()
 
xobs = [0 for i in range(len(Xmat))]
for i in range(len(Xmat)):    
    xobs[i] = Xmat[i][0]

plottitle = 'Sickness Probability vs Age'
make2plot(xobs,Yvec,xaxis_age,fitted,'Age','Sickness',plottitle)

../../_images/9eaca2761df8587acf5a7a91ab654ddaaced1b0272388ac4a8936af0c5e987c3.png

From the plot, it appears that age alone is not a strong predictor. The full model is also challenging to interpret directly, which is not unusual in real-world applications.

We now turn to package-based tools for more efficient and interpretable modeling.

We identify:

The design matrix $ X $, which includes all predictors,
The target vector $ y $, which is the binary response.

#split dataset in features and target variable
#feature_cols = ['AGE']
#feature_cols = ['AGE', 'SECTOR']
feature_cols = ['AGE', 'X2','X3', 'SECTOR']
X = df[feature_cols] # Features
y = df["SICK"] # Target variable

We split the dataset into training and testing subsets to evaluate generalization.

# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=0)

We fit a logistic regression model using the scikit-learn package.

This approach provides fitted coefficients, a built-in solver, and standard scoring methods.

# import the class
from sklearn.linear_model import LogisticRegression

# instantiate the model (using the default parameters)
#logreg = LogisticRegression()
logreg = LogisticRegression()
# fit the model with data
logreg.fit(X_train,y_train)

#
y_pred=logreg.predict(X_test)

Model parameters are extracted and compared to the homebrew results. The similarity provides confidence in both implementations.

print(logreg.intercept_[0])
print(logreg.coef_)
#y.head()

-3.8678278714188528
[[ 0.02518861  0.18701833 -0.22604932  1.51451584]]

We evaluate the model’s predictive performance using the .score() method.

The resulting accuracy is modest. This prompts us to investigate whether all predictors are needed or if a simpler model might perform just as well.

# import the metrics class
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_pred, y_test)
cnf_matrix
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Predicted label')
plt.xlabel('Actual label');

../../_images/a78afd409bf545f349a39ae7b2e11f88bc7e5c58151317623fa1e7d3956fb422.png

We begin by testing models with fewer predictors to assess their individual contributions.

First: AGE only This isolates the predictive value of age alone.

#split dataset in features and target variable
feature_cols = ['AGE']
#feature_cols = ['AGE', 'SECTOR']
#feature_cols = ['AGE', 'X2','X3', 'SECTOR']
X = df[feature_cols] # Features
y = df["SICK"] # Target variable
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=0)
# import the class
from sklearn.linear_model import LogisticRegression

# instantiate the model (using the default parameters)
#logreg = LogisticRegression()
logreg = LogisticRegression()
# fit the model with data
logreg.fit(X_train,y_train)

#
y_pred=logreg.predict(X_test)
print(logreg.intercept_[0])
print(logreg.coef_)
# import the metrics class
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_pred, y_test)
cnf_matrix
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Predicted label')
plt.xlabel('Actual label');

-1.7238065116617818
[[0.02719036]]

../../_images/58f08cd0f940786c1a32a762d4c762750498abd6ccf6498679fd1e051022c476.png

Next: INCOME (X2 and X3) only We evaluate the predictive value of socioeconomic status without age or sector.

#split dataset in features and target variable
#feature_cols = ['AGE']
#feature_cols = ['SECTOR']
feature_cols = ['X2','X3']
#feature_cols = ['AGE', 'X2','X3', 'SECTOR']
X = df[feature_cols] # Features
y = df["SICK"] # Target variable
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=0)
# import the class
from sklearn.linear_model import LogisticRegression

# instantiate the model (using the default parameters)
#logreg = LogisticRegression()
logreg = LogisticRegression()
# fit the model with data
logreg.fit(X_train,y_train)
y_pred=logreg.predict(X_test)
print(logreg.intercept_[0])
print(logreg.coef_)
# import the metrics class
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_pred, y_test)
cnf_matrix
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Predicted label')
plt.xlabel('Actual label');

-1.1389014522629408
[[0.3963004  0.21673499]]

Next: SECTOR only We isolate the predictive power of the city sector.

#split dataset in features and target variable
#feature_cols = ['AGE']
feature_cols = ['SECTOR']
#feature_cols = ['X2','X3']
#feature_cols = ['AGE', 'X2','X3', 'SECTOR']
X = df[feature_cols] # Features
y = df["SICK"] # Target variable
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=0)
# import the class
from sklearn.linear_model import LogisticRegression

# instantiate the model (using the default parameters)
#logreg = LogisticRegression()
logreg = LogisticRegression()
# fit the model with data
logreg.fit(X_train,y_train)
y_pred=logreg.predict(X_test)
print(logreg.intercept_[0])
print(logreg.coef_)
# import the metrics class
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_pred, y_test)
cnf_matrix
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Predicted label')
plt.xlabel('Actual label');

-3.265050761147032
[[1.56248233]]

Finally: AGE and SECTOR We explore whether combining age and sector improves predictive performance.

#split dataset in features and target variable
#feature_cols = ['AGE']
#feature_cols = ['SECTOR']
feature_cols = ['AGE','SECTOR']
#feature_cols = ['AGE', 'X2','X3', 'SECTOR']
X = df[feature_cols] # Features
y = df["SICK"] # Target variable
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=0)
# import the class
from sklearn.linear_model import LogisticRegression

# instantiate the model (using the default parameters)
#logreg = LogisticRegression()
logreg = LogisticRegression()
# fit the model with data
logreg.fit(X_train,y_train)
y_pred=logreg.predict(X_test)
print(logreg.intercept_[0])
print(logreg.coef_)
# import the metrics class
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_pred, y_test)
cnf_matrix
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Predicted label')
plt.xlabel('Actual label');

-3.8498269191334367
[[0.02426237 1.49033741]]

../../_images/32b93ad63d50e42dd911a37aa6d4157e0dd1649cb29569cd05b4ef685ba06055.png

In summary:

The full model did not substantially outperform simpler models.
Sector appears to carry some predictive power, while age alone is nearly as informative as the full set.
Income encoding adds some nuance but doesn’t dramatically improve results.

This dataset is known to be somewhat noisy and underfit, which makes it a good example of both the strengths and limitations of logistic regression.

The key learning takeaway is that we have decoded and implemented both manual and package-based multiple logistic regression pipelines—a foundational skill in statistical modeling and machine learning.

7.5 Titanic Survival Example (adapted from 2012 Kaggle Competition)#

This example uses the dataset for the 2012 Kaggle “Titanic - Machine Learning from Disaster Competition. The dataset is known to be well explained using Multiple Logistic Regression.

Verbatin from Kaggle

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc).

The variable names in the dataset are:

variable_name	Alpha Name	Meaning
survival	Survival	0 = No, 1 = Yes
pclass	Ticket class	1 = 1st, 2 = 2nd, 3 = 3rd
sex	Sex
Age	Age in years	Age is fractional if less than 1. If the age is estimated, is it in the form of xx.5
sibsp	# of siblings / spouses aboard the Titanic	Sibling = brother, sister, stepbrother, stepsister Spouse = husband, wife (mistresses and fiancés were ignored)
parch	# of parents / children aboard the Titanic	Parent = mother, father Child = daughter, son, stepdaughter, stepson Some children travelled only with a nanny, therefore parch=0 for them.
ticket	Ticket number
fare	Passenger fare
cabin	Cabin number
embarked	Port of Embarkation	C = Cherbourg, Q = Queenstown, S = Southampton

%reset -f
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import requests
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

# Try to download the dataset from GitHub (fallback to local if needed)
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
filename = "titanic.csv"

if not os.path.exists(filename):
    try:
        response = requests.get(url)
        with open(filename, 'wb') as f:
            f.write(response.content)
        print(f"Dataset downloaded and saved as '{filename}'")
    except Exception as e:
        print(f"Could not download file. Please ensure '{filename}' is available locally.\n", e)

# Load dataset
df = pd.read_csv(filename)

# Display sample
print(df.head())

   PassengerId  Survived  Pclass  \
          1         0       3   
          2         1       1   
          3         1       3   
          4         1       1   
          5         0       3   

                                                Name     Sex   Age  SibSp  \
                          Braund, Mr. Owen Harris    male  22.0      1   
Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
                           Heikkinen, Miss. Laina  female  26.0      0   
     Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
                         Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
    0         A/5 21171   7.2500   NaN        S  
    0          PC 17599  71.2833   C85        C  
    0  STON/O2. 3101282   7.9250   NaN        S  
    0            113803  53.1000  C123        S  
    0            373450   8.0500   NaN        S  

# Drop columns that are not immediately useful or too messy (Name, Ticket, Cabin)
df = df.drop(columns=['Name', 'Ticket', 'Cabin'])

# Handle missing values
df = df.dropna(subset=['Embarked'])              # drop rows with missing Embarked
df['Age'] = df['Age'].fillna(df['Age'].median()) # fill missing Age with median
df['Fare'] = df['Fare'].fillna(df['Fare'].median()) # fill Fare if needed

# Display sample
print(df.head())

   PassengerId  Survived  Pclass     Sex   Age  SibSp  Parch     Fare Embarked
          1         0       3    male  22.0      1      0   7.2500        S
          2         1       1  female  38.0      1      0  71.2833        C
          3         1       3  female  26.0      0      0   7.9250        S
          4         1       1  female  35.0      1      0  53.1000        S
          5         0       3    male  35.0      0      0   8.0500        S

# Encode categorical variables
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)

# Features and target
X = df[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked_Q', 'Embarked_S']]
y = df['Survived']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1912)

# Fit logistic regression
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

# Coefficient interpretation
coef_df = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_[0],
    'Odds Ratio': np.exp(model.coef_[0])
}).sort_values('Odds Ratio', ascending=False)

print("\nModel Coefficients and Odds Ratios:")
print(coef_df)

Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.86      0.83       111
           1       0.73      0.66      0.69        67

    accuracy                           0.78       178
   macro avg       0.77      0.76      0.76       178
weighted avg       0.78      0.78      0.78       178

Confusion Matrix:
 [[95 16]
 [23 44]]

Model Coefficients and Odds Ratios:
      Feature  Coefficient  Odds Ratio
1         Sex     2.706257   14.973123
5        Fare     0.001798    1.001799
2         Age    -0.037653    0.963047
6  Embarked_Q    -0.066455    0.935705
4       Parch    -0.098388    0.906297
3       SibSp    -0.332609    0.717051
7  Embarked_S    -0.444609    0.641075
0      Pclass    -1.128217    0.323610

# Plot: Survival rate by Sex and Pclass
sns.barplot(data=df, x='Sex', y='Survived', hue='Pclass')
plt.title("Survival Rate by Sex and Passenger Class")
plt.xlabel("Sex (0 = Male, 1 = Female)")
plt.ylabel("Survival Rate")
plt.show()

../../_images/0679155b535b561def6eb3d400878b77376a46e44b36c842dd992df40c863e96.png

# Plot: Survival rate by Sex and Pclass
sns.scatterplot(x="Fare", y="Survived", hue="Sex", data=df)
plt.title("Line Plot of Survival vs. Fare by Sex")
plt.show()

../../_images/659db6506adad1057ef9a366d0ddf5199f70b96b4b2d63309086f3a65874a7d2.png

The sex-classified plot is interesting in that fare (wealth) is less of an indicator of survival than one might expect (its kind of same for recent plane wrecks); but lets explore anyway - is the median fare for surviving females different from that of males. Just by visual inspection, yes - but how much?

So one way to handle is to modify our script to do two logistic models; survival vs fare for males, and again for females.

filtered_male_df = df.loc[df['Sex'] == 0]
# Features and target
X = filtered_male_df[['Fare']]
y = filtered_male_df['Survived']
#print(len(X))

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1912)

# Fit logistic regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train);

# Evaluate
y_pred = model.predict(X_test)
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

Classification Report:
               precision    recall  f1-score   support

           0       0.85      0.98      0.91        99
           1       0.00      0.00      0.00        17

    accuracy                           0.84       116
   macro avg       0.43      0.49      0.46       116
weighted avg       0.73      0.84      0.78       116

Confusion Matrix:
 [[97  2]
 [17  0]]

X_train.describe()

	Fare
count	461.000000
mean	25.326824
std	44.374488
min	0.000000
25%	7.854200
50%	11.500000
75%	26.550000
max	512.329200

import warnings
warnings.filterwarnings('ignore')
# Predict probability over grid
x_pred = np.linspace(0, 500, 200).reshape(-1, 1)
y_pred = model.predict_proba(x_pred)[:, 1]

plt.figure(figsize=(6, 6))
plt.scatter(X_train, y_train, label="Observed", color="black", alpha=0.5)
plt.plot(x_pred, y_pred, color="red", label="Logistic Fit")
plt.xlabel("Fare (paid)")
plt.ylabel("Survival Probability")
plt.title("Logistic Regression – Fitted Probability Curve - Male")
plt.legend()
plt.grid(True)
plt.show()

../../_images/87bab2f4ae3ee81cd33e18403512ed9b6019054289c3d117399bf42c95f4be6e.png

So a logistic model fit suggests that a male who paid a fare over 200 pounds, has a probability of survival in excess of 50%, if paid less then likely not to survive (notice from the plotted observations your chances are kind of crummy anyway). Now repeating the analysis for females:

import warnings
warnings.filterwarnings('ignore')
filtered_female_df = df.loc[df['Sex'] == 1]
# Features and target
X = filtered_female_df[['Fare']]
y = filtered_female_df['Survived']
print(len(X))
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1912)
# Fit logistic regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
# Predict probability over grid
x_pred = np.linspace(0, 500, 200).reshape(-1, 1)
y_pred = model.predict_proba(x_pred)[:, 1]

plt.figure(figsize=(6, 6))
plt.scatter(X_train, y_train, label="Observed", color="black", alpha=0.5)
plt.plot(x_pred, y_pred, color="red", label="Logistic Fit")
plt.xlabel("Fare (paid)")
plt.ylabel("Survival Probability")
plt.title("Logistic Regression – Fitted Probability Curve - Female")
plt.legend()
plt.grid(True)
plt.show()

../../_images/45eedcd86fea35265a854a6c331c03d421d9375bf8a4125fadf6c70df48b54fc.png

So from this plot the logistic model suggests just being female confers a huge survival advantage regardless of wealth.

7.6 Poison Mushrooms#

Mushroom foraging is a practice with both cultural significance and serious risks. Many wild mushrooms are edible and nutritious, but others contain potent toxins that can cause severe illness or death, even in small quantities. While advanced techniques like High-Performance Liquid Chromatography (HPLC) and mass spectrometry can accurately identify toxic compounds, these methods are laboratory-based, expensive, and time-consuming—making them impractical for real-time decision-making in the field. For amateur foragers, park rangers, and even emergency responders, the ability to assess toxicity risk based on visible features—such as cap color, gill shape, bruising reactions, and habitat—is far more practical, if not lifesaving.

This example seeks a machine learning classification algorithm to predict whether a mushroom is edible or poisonous based on easily observable characteristics. The training dataset is drawn from field guide-style descriptions, not lab tests, mimicking how a forager might gather information in the wild.

Problem Statement#

Using the database description at:

http://54.243.252.9/ce-5319-webroot/2-Exercises/ES-2/agaricus-lepiota.names

and the encoded database at:

http://54.243.252.9/ce-5319-webroot/2-Exercises/ES-2/agaricus-lepiota.data

build a suitable prediction engine to classify a mushroom as poison or non-poison. Apply your model on the mushroom Coprinopsis atramentaria which has the characteristics below:

Coprinopsis atramentaria (Black Ink Cap) Feature Encoding

Feature #	Feature Name	Likely Value	Code	Justification
1	Cap shape	conical	c	Young cap is conical or bell-shaped, becoming more flattened later
2	Cap surface	smooth	s	Smooth surface
3	Cap color	gray	g	Gray to brownish gray
4	Bruises?	no	f	Does not bruise distinctively
5	Odor	none	n	Mild or no odor
6	Gill attachment	free	f	Gills free from the stalk
7	Gill spacing	crowded	w	Gills are crowded
8	Gill size	broad	b	Broad gills, typical of inky caps
9	Gill color	black	k	Turns black and dissolves into ink
10	Stalk shape	tapering	t	Often narrows at base
11	Stalk root	rooted	r	Rooted in soil (some sources say “equal”)
12	Stalk surface above ring	smooth	s	Smooth texture
13	Stalk surface below ring	smooth	s	Smooth texture
14	Stalk color above ring	gray	g	Gray to whitish
15	Stalk color below ring	gray	g	Same as above
16	Veil type	partial	p	Has a fleeting partial veil
17	Veil color	white	w	Veil is whitish when present
18	Ring number	one	o	One ring from partial veil
19	Ring type	evanescent	e	Ring disappears quickly
20	Spore print color	black	k	Produces a black spore print
21	Population	scattered	s	Grows in scattered groups
22	Habitat	urban	u	Often found in grassy areas, roadsides, lawns (urban-ish)

The feature vector in the format of the database is : csgfnfwbktrssggpwoeksu

Sources for the table above include Wikipedia

…Measuring 3–10 centimetres in diameter, the greyish or brownish-grey cap is furrowed, initially bell-shaped, and later more convex, splitting at the margin. It melts from the outside in. The very crowded gills are free; they are white at first, then grey or pinkish and turn black and deliquesce. The stipe measures 5–17 cm high by 1–2 cm thick, is grey in colour, and lacks a ring. In young groups, the stems may be obscured by the caps. The spore print is black and the almond-shaped spores measure 8–11 by 5–6 μm. The flesh is thin and pale grey in colour. It is commonly associated with buried wood and is found in grassland, meadows, disturbed ground, and open terrain from late spring to autumn. Like many ink caps, it grows in tufts.

Warning

This example is to illustrate ML tools, DO NOT USE THIS EXAMPLE FOR REAL MUSHROOM SAFETY DETERMINATION

ML Workflow#

Download the database from remote server

Note

While the dataset is already available locally on the server, this step intentionally simulates retrieving data from a persistent, remote source. This reflects a common pattern in real-world workflows, where data is often accessed from version-controlled or centralized storage systems rather than hardcoded locations.

import sys # Module to process commands to/from the OS using a shell-type syntax
import requests
remote_url="http://54.243.252.9/ce-5319-webroot/ce5319jb/lessons/logisticregression/agaricus-lepiota.data"  # set the url
rget = requests.get(remote_url, allow_redirects=True)  # get the remote resource, follow imbedded links
localfile = open('poisonmushroom.csv','wb') # open connection to a local file same name as remote
localfile.write(rget.content) # extract from the remote the contents,insert into the local file same name
localfile.close() # close connection to the local file
# delete file if it exists

Load the database into a dataframe

Load the file downloaded suitable pandas tools.
Create an attribute list with meaningful (to us humans) names for the database columns.
Update the column names using a mapping method. In this example, a for loop creates a dictionary that maps default column indices to meaningful attribute names. Bear in mind later on this setp will create a warning, but it is suppressed intentionally. Alternatively, we could have directly assigned the list to df.columns, used a zipped dictionary, or applied set_axis()—each option has different strengths in terms of clarity, safety, and style.
List the revised database (here I choose the first 20 rows, recall the default is 5)

Note

Some other ways to handle the column naming are:

```
mymushroom.columns = attributes
```

A concise approach that assumes the number of columns exactly matches the number of attributes (len(attributes) == mymushroom.shape[1]). This approach may raise ValueError if mismatched

mymushroom = mymushroom.rename(columns=dict(zip(mymushroom.columns, attributes)))

A succinct and expressive mapping concept but all in a single line. Easily extended if only renaming a subset. Still assumes 1-to-1 mapping (i.e., same lengths)

mymushroom = mymushroom.set_axis(attributes, axis=1, inplace=False)

Functionally equivalent to .columns = … Can be method chained: df.set_axis(…).dropna() etc.

import pandas as pd
mymushroom = pd.read_csv('poisonmushroom.csv',header=None)

attributes=['poison-class','cap-shape','cap-surface','cap-color','bruises','odor','gill-attachment','gill-spacing','gill-size','gill-color','stalk-shape','stalk-root','stalk-surface-above-ring','stalk-surface-below-ring','stalk-color-above-ring','stalk-color-below-ring','veil-type','veil-color','ring-number','ring-type','spore-print-color','population','habitat']
len(attributes)

req_col_names = attributes
curr_col_names = list(mymushroom.columns)

mapper = {}
for i, name in enumerate(curr_col_names):
    mapper[name] = req_col_names[i]

mymushroom = mymushroom.rename(columns=mapper)
interim = pd.DataFrame(mymushroom)
mymushroom.head(20)

	poison-class	cap-shape	cap-surface	cap-color	bruises	odor	gill-attachment	gill-spacing	gill-size	gill-color	...	stalk-surface-below-ring	stalk-color-above-ring	stalk-color-below-ring	veil-type	veil-color	ring-number	ring-type	spore-print-color	population	habitat
0	p	x	s	n	t	p	f	c	n	k	...	s	w	w	p	w	o	p	k	s	u
1	e	x	s	y	t	a	f	c	b	k	...	s	w	w	p	w	o	p	n	n	g
2	e	b	s	w	t	l	f	c	b	n	...	s	w	w	p	w	o	p	n	n	m
3	p	x	y	w	t	p	f	c	n	n	...	s	w	w	p	w	o	p	k	s	u
4	e	x	s	g	f	n	f	w	b	k	...	s	w	w	p	w	o	e	n	a	g
5	e	x	y	y	t	a	f	c	b	n	...	s	w	w	p	w	o	p	k	n	g
6	e	b	s	w	t	a	f	c	b	g	...	s	w	w	p	w	o	p	k	n	m
7	e	b	y	w	t	l	f	c	b	n	...	s	w	w	p	w	o	p	n	s	m
8	p	x	y	w	t	p	f	c	n	p	...	s	w	w	p	w	o	p	k	v	g
9	e	b	s	y	t	a	f	c	b	g	...	s	w	w	p	w	o	p	k	s	m
10	e	x	y	y	t	l	f	c	b	g	...	s	w	w	p	w	o	p	n	n	g
11	e	x	y	y	t	a	f	c	b	n	...	s	w	w	p	w	o	p	k	s	m
12	e	b	s	y	t	a	f	c	b	w	...	s	w	w	p	w	o	p	n	s	g
13	p	x	y	w	t	p	f	c	n	k	...	s	w	w	p	w	o	p	n	v	u
14	e	x	f	n	f	n	f	w	b	n	...	f	w	w	p	w	o	e	k	a	g
15	e	s	f	g	f	n	f	c	n	k	...	s	w	w	p	w	o	p	n	y	u
16	e	f	f	w	f	n	f	w	b	k	...	s	w	w	p	w	o	e	n	a	g
17	p	x	s	n	t	p	f	c	n	n	...	s	w	w	p	w	o	p	k	s	g
18	p	x	y	w	t	p	f	c	n	n	...	s	w	w	p	w	o	p	n	s	u
19	p	x	s	n	t	p	f	c	n	k	...	s	w	w	p	w	o	p	n	s	u

20 rows × 23 columns

Encode the feature list into numeric values.

The steps below handle the encoding using a cipher substution approach, there are undoubtably better ways to handle the encoding.
The functions below are NOT one-hot encoding.

Each function interprets the feature value based on position in the vector, and uses the alphabet for that feature to assign a numeric value for the feature.

Note

This is a good example of when to use a LLM to help improve our code.

def p0(stringvalue):
    if stringvalue == 'e':
        p0 = 0
    elif stringvalue == 'p':
        p0 = 1
    else: 
        raise Exception("Encoding failed in p0 missing data maybe?")
    return(p0)
######################################################################
# Feature Encoding Functions using a Simple Substitution Cipher     ##
######################################################################
def c1(stringvalue):
#cap-shape:  bell=b,conical=c,convex=x,flat=f,knobbed=k,sunken=s
    ncode=True # set exception flag
    alphabet=['b','c','x','f','k','s']
    for i in range(len(alphabet)):
        if stringvalue == alphabet[i]:
            c1=i+1
            ncode=False #  if encoding swithc flag value
    if ncode:
        raise Exception("Encoding failed in c1 missing data maybe?")
    return(c1)
######################################################################
def c2(stringvalue):
#cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s
    ncode=True
    alphabet=['f','g','y','s']
    for i in range(len(alphabet)):
        if stringvalue == alphabet[i]:
            c2=i+1
            ncode=False
    if ncode:
        raise Exception("Encoding failed in c2 missing data maybe?")
    return(c2)
######################################################################
def c3(stringvalue):
#cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y
    ncode=True
    alphabet=['n','b','c','g','r','p','u','e','w','y']
    for i in range(len(alphabet)):
        if stringvalue == alphabet[i]:
            c3=i+1
            ncode=False
    if ncode:
        raise Exception("Encoding failed in c3 missing data maybe?")
    return(c3)
######################################################################
def c4(stringvalue): #this is a simple binary encoding column
#bruises?:bruises=t,no=f
    ncode=True
    alphabet=['f','t']
    for i in range(len(alphabet)):
        if stringvalue == alphabet[i]:
            c4=i+1
            ncode=False
    if ncode:
        raise Exception("Encoding failed in c4 missing data maybe?")
    return(c4)
######################################################################
def c5(stringvalue):
#odor:  almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s
    ncode=True
    alphabet=['a','l','c','y','f','m','n','p','s']
    for i in range(len(alphabet)):
        if stringvalue == alphabet[i]:
            c5=i+1
            ncode=False
    if ncode:
        raise Exception("Encoding failed in c5 missing data maybe?")
    return(c5)
######################################################################
def c6(stringvalue):
#gill-attachment: attached=a,descending=d,free=f,notched=n
    ncode=True
    alphabet=['a','d','f','n']
    for i in range(len(alphabet)):
        if stringvalue == alphabet[i]:
            c6=i+1
            ncode=False
    if ncode:
        raise Exception("Encoding failed in c6 missing data maybe?")
    return(c6)
######################################################################
def c7(stringvalue):
#gill-spacing: close=c,crowded=w,distant=d
    ncode=True
    alphabet=['c','w','d']
    for i in range(len(alphabet)):
        if stringvalue == alphabet[i]:
            c7=i+1
            ncode=False
    if ncode:
        raise Exception("Encoding failed in c7 missing data maybe?")
    return(c7)
######################################################################
def c8(stringvalue):
#gill-size:  broad=b,narrow=n
    ncode=True
    alphabet=['b','n']
    for i in range(len(alphabet)):
        if stringvalue == alphabet[i]:
            c8=i+1
            ncode=False
    if ncode:
        raise Exception("Encoding failed in c8 missing data maybe?")
    return(c8)
######################################################################
def c9(stringvalue):
#gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g,green=r,orange=o,pink=p,purple=u,red=e,white=w,yellow=y
    ncode=True
    alphabet=['k','n','b','h','g','r','o','p','u','e','w','y']
    for i in range(len(alphabet)):
        if stringvalue == alphabet[i]:
            c9=i+1
            ncode=False
    if ncode:
        raise Exception("Encoding failed in c9 missing data maybe?")
    return(c9)
######################################################################
def c10(stringvalue):
#stalk-shape:enlarging=e,tapering=t
    ncode=True
    alphabet=['e','t']
    for i in range(len(alphabet)):
        if stringvalue == alphabet[i]:
            c10=i+1
            ncode=False
    if ncode:
        raise Exception("Encoding failed in c10 missing data maybe?")
    return(c10)
######################################################################
def c11(stringvalue):
#stalk-root: bulbous=b,club=c,cup=u,equal=e,rhizomorphs=z,rooted=r,missing=?
    ncode=True
    alphabet=['b','c','u','e','z','r','?'] # set missing to zero position
    for i in range(len(alphabet)):
        if stringvalue == alphabet[i]:
            c11=i+1
            ncode=False
    if ncode:
        raise Exception("Encoding failed in c11 missing data maybe?")
    return(c11)
######################################################################
def c12(stringvalue):
#stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s
    ncode=True
    alphabet=['f','y','k','s','?'] # set missing to zero position
    for i in range(len(alphabet)):
        if stringvalue == alphabet[i]:
            c12=i+1
            ncode=False
    if ncode:
        raise Exception("Encoding failed in c12 missing data maybe?")
    return(c12)
######################################################################
def c13(stringvalue):
#stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s
    ncode=True
    alphabet=['f','y','k','s','?'] # set missing to zero position
    for i in range(len(alphabet)):
        if stringvalue == alphabet[i]:
            c13=i+1
            ncode=False
    if ncode:
        raise Exception("Encoding failed in c13 missing data maybe?")
    return(c13)
######################################################################
def c14(stringvalue):
#stalk-color-above-ring:   brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y
    ncode=True
    alphabet=['n','b','c','g','o','p','e','w','y'] # set missing to zero position
    for i in range(len(alphabet)):
        if stringvalue == alphabet[i]:
            c14=i+1
            ncode=False
    if ncode:
        raise Exception("Encoding failed in c14 missing data maybe?")
    return(c14)
######################################################################
def c15(stringvalue):
#stalk-color-below-ring:   brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y
    ncode=True
    alphabet=['n','b','c','g','o','p','e','w','y'] # set missing to zero position
    for i in range(len(alphabet)):
        if stringvalue == alphabet[i]:
            c15=i+1
            ncode=False
    if ncode:
        raise Exception("Encoding failed in c15 missing data maybe?")
    return(c15)
######################################################################
def c16(stringvalue):
#veil-type:                partial=p,universal=u
    ncode=True
    alphabet=['p','u'] # 
    for i in range(len(alphabet)):
        if stringvalue == alphabet[i]:
            c16=i+1
            ncode=False
    if ncode:
        raise Exception("Encoding failed in c16 missing data maybe?")
    return(c16)
######################################################################
def c17(stringvalue):
#veil-color:               brown=n,orange=o,white=w,yellow=y
    ncode=True
    alphabet=['n','o','w','y'] # 
    for i in range(len(alphabet)):
        if stringvalue == alphabet[i]:
            c17=i+1
            ncode=False
    if ncode:
        raise Exception("Encoding failed in c17 missing data maybe?")
    return(c17)
######################################################################
def c18(stringvalue):
#ring-number:              none=n,one=o,two=t
    ncode=True
    alphabet=['n','o','t'] # 
    for i in range(len(alphabet)):
        if stringvalue == alphabet[i]:
            c18=i+1
            ncode=False
    if ncode:
        raise Exception("Encoding failed in c18 missing data maybe?")
    return(c18)
######################################################################
def c19(stringvalue):
#ring-type:  cobwebby=c,evanescent=e,flaring=f,large=l,none=n,pendant=p,sheathing=s,zone=z
    ncode=True
    alphabet=['c','e','f','l','n','p','s','z'] # 
    for i in range(len(alphabet)):
        if stringvalue == alphabet[i]:
            c19=i+1
            ncode=False
    if ncode:
        raise Exception("Encoding failed in c19 missing data maybe?")
    return(c19)
######################################################################
def c20(stringvalue):
#spore-print-color:        black=k,brown=n,buff=b,chocolate=h,green=r,orange=o,purple=u,white=w,yellow=y
    ncode=True
    alphabet=['k','n','b','h','r','o','u','w','y','?'] # 
    for i in range(len(alphabet)):
        if stringvalue == alphabet[i]:
            c20=i+1
            ncode=False
    if ncode:
        raise Exception("Encoding failed in c20 missing data maybe?")
    return(c20)
######################################################################
def c21(stringvalue):
#population:abundant=a,clustered=c,numerous=n,scattered=s,several=v,solitary=y
    ncode=True
    alphabet=['a','c','n','s','v','y','?'] # 
    for i in range(len(alphabet)):
        if stringvalue == alphabet[i]:
            c21=i+1
            ncode=False
    if ncode:
        raise Exception("Encoding failed in c21 missing data maybe?")
    return(c21)
######################################################################
def c22(stringvalue):
# habitat: grasses=g,leaves=l,meadows=m,paths=p,urban=u,waste=w,woods=d   
    ncode=True
    alphabet=['g','l','m','p','u','w','d']
    for i in range(len(alphabet)):
        if stringvalue == alphabet[i]:
            c22=i+1
            ncode=False
    if ncode:
        raise Exception("Encoding failed in c22 missing data maybe?")
    return(c22)

interim['poison-class']            = mymushroom['poison-class'].apply(p0)
interim['cap-shape']               = mymushroom['cap-shape'].apply(c1)
interim['cap-surface']             = mymushroom['cap-surface'].apply(c2)
interim['cap-color']               = mymushroom['cap-color'].apply(c3)
interim['bruises']                 = mymushroom['bruises'].apply(c4)
interim['odor']                    = mymushroom['odor'].apply(c5)
interim['gill-attachment']         = mymushroom['gill-attachment'].apply(c6)
interim['gill-spacing']            = mymushroom['gill-spacing'].apply(c7)
interim['gill-size']               = mymushroom['gill-size'].apply(c8)
interim['gill-color']              = mymushroom['gill-color'].apply(c9)
interim['stalk-shape']             = mymushroom['stalk-shape'].apply(c10)
interim['stalk-root']              = mymushroom['stalk-root'].apply(c11)
interim['stalk-surface-above-ring']= mymushroom['stalk-surface-above-ring'].apply(c12)
interim['stalk-surface-below-ring']= mymushroom['stalk-surface-below-ring'].apply(c13)
interim['stalk-color-above-ring']  = mymushroom['stalk-color-above-ring'].apply(c14)
interim['stalk-color-below-ring']  = mymushroom['stalk-color-below-ring'].apply(c15)
interim['veil-type']               = mymushroom['veil-type'].apply(c16)
interim['veil-color']              = mymushroom['veil-color'].apply(c17)
interim['ring-number']             = mymushroom['ring-number'].apply(c18)
interim['ring-type']               = mymushroom['ring-type'].apply(c19)
interim['spore-print-color']       = mymushroom['spore-print-color'].apply(c20)
interim['population']              = mymushroom['population'].apply(c21)
interim['habitat']                 = mymushroom['habitat'].apply(c22)

interim.head(10)

	poison-class	cap-shape	cap-surface	cap-color	bruises	odor	gill-attachment	gill-spacing	gill-size	gill-color	...	stalk-surface-below-ring	stalk-color-above-ring	stalk-color-below-ring	veil-type	veil-color	ring-number	ring-type	spore-print-color	population	habitat
0	1	3	4	1	2	8	3	1	2	1	...	4	8	8	1	3	2	6	1	4	5
1	0	3	4	10	2	1	3	1	1	1	...	4	8	8	1	3	2	6	2	3	1
2	0	1	4	9	2	2	3	1	1	2	...	4	8	8	1	3	2	6	2	3	3
3	1	3	3	9	2	8	3	1	2	2	...	4	8	8	1	3	2	6	1	4	5
4	0	3	4	4	1	7	3	2	1	1	...	4	8	8	1	3	2	2	2	1	1
5	0	3	3	10	2	1	3	1	1	2	...	4	8	8	1	3	2	6	1	3	1
6	0	1	4	9	2	1	3	1	1	5	...	4	8	8	1	3	2	6	1	3	3
7	0	1	3	9	2	2	3	1	1	2	...	4	8	8	1	3	2	6	2	4	3
8	1	3	3	9	2	8	3	1	2	8	...	4	8	8	1	3	2	6	1	5	1
9	0	1	4	10	2	1	3	1	1	5	...	4	8	8	1	3	2	6	1	4	3

10 rows × 23 columns

Use sklearn to built training and testing data sets

Split the database
Examine outputs to verify maintain structure.

#split dataset in features and target variable
feature_cols = attributes[1:]
X = interim[feature_cols] # Features
y = interim['poison-class'] # Target variable
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=0)

X_train.head()

	cap-shape	cap-surface	cap-color	bruises	odor	gill-attachment	gill-spacing	gill-size	gill-color	stalk-shape	...	stalk-surface-below-ring	stalk-color-above-ring	stalk-color-below-ring	veil-type	veil-color	ring-number	ring-type	spore-print-color	population	habitat
5832	4	3	10	1	5	3	1	1	8	1	...	3	2	1	1	3	2	4	4	6	1
601	3	3	1	2	2	3	1	1	11	1	...	2	8	8	1	3	2	6	2	6	1
1601	4	4	4	1	7	3	2	1	1	2	...	1	8	8	1	3	2	2	1	1	1
4941	3	1	4	1	5	3	1	1	5	1	...	3	2	6	1	3	2	4	4	5	4
7492	4	3	1	1	5	3	1	2	3	2	...	3	6	8	1	3	2	2	8	5	7

5 rows × 22 columns

y_train.head()

  1
   0
  0
  1
  1
Name: poison-class, dtype: int64

X_test.head()

	cap-shape	cap-surface	cap-color	bruises	odor	gill-attachment	gill-spacing	gill-size	gill-color	stalk-shape	...	stalk-surface-below-ring	stalk-color-above-ring	stalk-color-below-ring	veil-type	veil-color	ring-number	ring-type	spore-print-color	population	habitat
380	3	3	1	2	8	3	1	2	8	1	...	4	8	8	1	3	2	6	1	4	5
3641	4	3	4	2	7	3	1	1	8	2	...	4	4	4	1	3	2	6	1	6	7
273	3	3	10	2	1	3	1	1	1	1	...	4	8	8	1	3	2	6	2	3	3
1029	3	4	9	1	7	3	2	1	1	2	...	4	8	8	1	3	2	2	1	1	1
684	3	1	1	2	7	3	1	1	8	2	...	4	6	8	1	3	2	6	2	5	7

5 rows × 22 columns

y_test.head()

   1
  0
   0
  0
   0
Name: poison-class, dtype: int64

Load the sklearn Logistic Regression package

Load the package into notebook
Create an instance of the model.
Fit on traning set
Predict on test set

# import the method
from sklearn.linear_model import LogisticRegression
# instantiate the model (using the default parameters)
logreg = LogisticRegression(max_iter=1000)
# fit the model with data
logreg.fit(X_train,y_train)
# predict the test set
y_pred=logreg.predict(X_test)
# 

Report Results

Examine logistic constants (gives idea of importantce of specific features)
Confusion matrix as heatmap

print("Logistic Regression Values")
print("Intercept:", logreg.intercept_[0])
print("Feature Coefficients: \n",logreg.coef_)

Logistic Regression Values
Intercept: 16.542399116671376
Feature Coefficients: 
 [[-0.11042847  1.09064506  0.08741523 -3.45823817  0.14263106  0.24391943
  -9.04357091  7.79074285 -0.11221928 -0.07236673 -2.05036577 -0.70728185
  -0.04710663 -0.30924264  0.34523294  0.24430201  0.02944439 -0.82700158
  -0.86654293  0.46866517 -0.59647226 -0.19592944]]

# import the metrics class
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_pred, y_test)
cnf_matrix
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Predicted label')
plt.xlabel('Actual label');

../../_images/da2dde30918b6197e15104ba7b087484e61b1242e283bc53fee061ed9016a636.png

Classify the new mushroom
This is the deployment step — the whole point is to build a prediction engine!

Here we simply provide the feature vector values.
A practical improvement would be to create a user interface that prompts for inputs.

Note

Although the classifier correctly predicts this mushroom as edible, there’s an important caveat: it is only safe to consume if the person has no alcohol in their system. A compound in the mushroom reacts with alcohol to form a liver toxin — a detail the model cannot detect because this biologically relevant interaction was not part of the training data.

This highlights a key limitation of data-driven models: accurate prediction depends on having all relevant features represented. In this case, domain knowledge reveals a missing causal factor, reminding us that models should be used as tools—not oracles.

import warnings
warnings.filterwarnings("ignore")
s_new = "csgfnfwbktrssggpwoeksu"
x_new =[c1(s_new[0]),c2(s_new[1]),c3(s_new[2]),c4(s_new[3]),c5(s_new[4]),c6(s_new[5]),c7(s_new[6]),c8(s_new[7]),c9(s_new[8]),c10(s_new[9]),c11(s_new[10]),
           c12(s_new[11]),c13(s_new[12]),c14(s_new[13]),c15(s_new[14]),c16(s_new[15]),c17(s_new[16]),c18(s_new[17]),c19(s_new[18]),c20(s_new[19]),c21(s_new[20]),c22(s_new[21]) ]
#print(x_new)
y_new = logreg.predict(np.array(x_new).reshape(1, -1))  # 1 sample, all features
if y_new < 1 :
    print("New mushroom classified as EDIBLE")
else:
    print("New mushroom classified as POISONIOUS")
    

New mushroom classified as EDIBLE

7.6.1 OpenAI Assisted Refactoring#

The code below is a result of supplying the code above, and supplying the prompt:

Can you examine the code and suggest alternate ways to generate the column names? (ie what i called the mapper method) I am using a mapping method in a for loop to change existing names (if any) to these listed names. Bear in mind later on this will create a warning, but it is suppressed intentionally.

Later on when we get to the encoding step the OpenAI prompt was:

The encoding steps are using a simple-cipher substution approach. The functions below are NOT intended for one-hot encoding. Each function interprets the feature value based on position in the vector, and uses the reduced alphabet for that feature to assign a numeric value for the feature. Can you refactor to generalize the encoding - namely to avoid having multiple functions, so I can process any string of correct length and alphabet?

These two prompts provided enough refactored code for the program below, which is more generic and modular - hence should be easier to maintain and apply for other situations.

# 7.6.1 Improved code (using OpenAI assisted refactoring)
##########################################################################################################################
##### Encoding definitions: (ordered list of categories, starting index for ordinal encoding)#############################
##########################################################################################################################
encodings = {
    'poison-class':               (['e', 'p'], 0),
    'cap-shape':                  (['b','c','x','f','k','s'], 1),
    'cap-surface':                (['f','g','y','s'], 1),
    'cap-color':                  (['n','b','c','g','r','p','u','e','w','y'], 1),
    'bruises':                    (['f','t'], 1),
    'odor':                       (['a','l','c','y','f','m','n','p','s'], 1),
    'gill-attachment':            (['a','d','f','n'], 1),
    'gill-spacing':              (['c','w','d'], 1),
    'gill-size':                 (['b','n'], 1),
    'gill-color':                (['k','n','b','h','g','r','o','p','u','e','w','y'], 1),
    'stalk-shape':              (['e','t'], 1),
    'stalk-root':               (['b','c','u','e','z','r','?'], 1),
    'stalk-surface-above-ring':(['f','y','k','s','?'], 1),
    'stalk-surface-below-ring':(['f','y','k','s','?'], 1),
    'stalk-color-above-ring':  (['n','b','c','g','o','p','e','w','y'], 1),
    'stalk-color-below-ring':  (['n','b','c','g','o','p','e','w','y'], 1),
    'veil-type':               (['p','u'], 1),
    'veil-color':              (['n','o','w','y'], 1),
    'ring-number':             (['n','o','t'], 1),
    'ring-type':               (['c','e','f','l','n','p','s','z'], 1),
    'spore-print-color':       (['k','n','b','h','r','o','u','w','y','?'], 1),
    'population':              (['a','c','n','s','v','y','?'], 1),
    'habitat':                 (['g','l','m','p','u','w','d'], 1)
}
##########################################################################################################################
###### Define Encoder Generator ##########################################################################################
##########################################################################################################################
def make_encoder(alphabet, offset=1, name='feature'):
    def encoder(val):
        if val in alphabet:
            return alphabet.index(val) + offset
        raise Exception(f"Encoding failed in '{name}': unknown value '{val}'")
    return encoder
##########################################################################################################################
##### Apply Encodings ####################################################################################################
##########################################################################################################################
from pandas import get_dummies

def apply_encodings(df, encodings, method="ordinal"):
    encoded_df = pd.DataFrame(index=df.index)
    for col, (alphabet, offset) in encodings.items():
        if method == "ordinal":
            encoder = make_encoder(alphabet, offset=offset, name=col)
            encoded_df[col] = df[col].apply(encoder)
        elif method == "onehot":
            df[col] = pd.Categorical(df[col], categories=alphabet, ordered=True)
            dummies = get_dummies(df[col], prefix=col)
            encoded_df = pd.concat([encoded_df, dummies], axis=1)
        else:
            raise ValueError("Unknown encoding method: use 'ordinal' or 'onehot'")
    return encoded_df
##########################################################################################################################
##### Report Generator ###################################################################################################
##########################################################################################################################
def generate_encoding_markdown(encodings):
    md = "# Mushroom Feature Encoding Reference\n\n"
    for feature, (alphabet, offset) in encodings.items():
        md += f"### {feature}\n"
        md += "| Category | Encoded Value |\n"
        md += "|----------|----------------|\n"
        for i, symbol in enumerate(alphabet):
            md += f"| `{symbol}` | `{i + offset}` |\n"
        md += "\n"
    return md

#########################################################################################################################
###### Step 1: Load the raw data ########################################################################################
#########################################################################################################################
###### Get the database (unchanged from original code)
import sys # Module to process commands to/from the OS using a shell-type syntax
import requests
remote_url="http://54.243.252.9/ce-5319-webroot/ce5319jb/lessons/logisticregression/agaricus-lepiota.data"  # set the url
rget = requests.get(remote_url, allow_redirects=True)  # get the remote resource, follow imbedded links
localfile = open('poisonmushroom.csv','wb') # open connection to a local file same name as remote
localfile.write(rget.content) # extract from the remote the contents,insert into the local file same name
localfile.close() # close connection to the local file
###### Read and load data into dataframe, set column names to the encoding keys
mymushroom = pd.read_csv('poisonmushroom.csv', header=None)
mymushroom.columns = list(encodings.keys())
#########################################################################################################################
###### Step 2: Encode ###################################################################################################
#########################################################################################################################
interim = apply_encodings(mymushroom.copy(), encodings, method="ordinal")  # or "onehot"
#########################################################################################################################
###### Step 3: Export Markdown (optional) ###############################################################################
#########################################################################################################################
md_report = generate_encoding_markdown(encodings)
with open("encoding_reference.md", "w") as f:
    f.write(md_report)
interim.head()
#########################################################################################################################
###### Step 4: Split Dataframe for Training, Testing  ###################################################################
#########################################################################################################################
feature_cols = mymushroom.columns[1:] #split dataset in features and target variable
X = interim[feature_cols] # Features
y = interim['poison-class'] # Target variable
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=0)
#########################################################################################################################
###### Step 5: Load Logistic Regression, Fit Data Model  ################################################################
#########################################################################################################################
from sklearn.linear_model import LogisticRegression # import the method
logreg = LogisticRegression(max_iter=1000) # instantiate the model (using the default parameters)
logreg.fit(X_train,y_train) # fit the model with data
y_pred=logreg.predict(X_test) # predict the test set
#########################################################################################################################
###### Step 6: Report Results  ##########################################################################################
#########################################################################################################################
print("Logistic Regression Values")
print("Intercept:", logreg.intercept_[0])
print("Feature Coefficients: \n",logreg.coef_)
from sklearn import metrics # import the metrics class
cnf_matrix = metrics.confusion_matrix(y_pred, y_test)
cnf_matrix
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g') # create heatmap
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Heat Map Performance (all dataframe)', y=1.1)
plt.ylabel('Predicted label')
plt.xlabel('Actual label');
#########################################################################################################################
###### Step 7: Classify New Mushroom  ###################################################################################
#########################################################################################################################
s_new = "csgfnfwbktrssggpwoeksu" # Sample input string from a new mushroom (22 characters)
# Create encoders from the original encoding dictionary
encoders = {
    col: make_encoder(alphabet, offset=offset, name=col)
    for col, (alphabet, offset) in encodings.items()
}
# Get list of feature columns (excluding 'poison-class')
feature_cols = list(encodings.keys())[1:]  # skip target
# Zip characters with corresponding features
try:
    x_new = [encoders[col](char) for col, char in zip(feature_cols, s_new)]
except Exception as e:
    print("Encoding error:", e)
    x_new = None
# Predict if encoding successful
if x_new:
    y_new = logreg.predict(np.array(x_new).reshape(1, -1))  # reshape for sklearn
if y_new < 1:
    print("\033[92mNew mushroom classified as EDIBLE\033[0m")      # Green text
else:
    print("\033[91mNew mushroom classified as POISONOUS\033[0m")   # Red text

Logistic Regression Values
Intercept: 16.542399116671376
Feature Coefficients: 
 [[-0.11042847  1.09064506  0.08741523 -3.45823817  0.14263106  0.24391943
  -9.04357091  7.79074285 -0.11221928 -0.07236673 -2.05036577 -0.70728185
  -0.04710663 -0.30924264  0.34523294  0.24430201  0.02944439 -0.82700158
  -0.86654293  0.46866517 -0.59647226 -0.19592944]]
New mushroom classified as EDIBLE

../../_images/d9f0a5443d9da67377c115780725f697b1774669be2616d3fae1c95043632b3a.png

7.7 Poisson Regression#

Poisson regression is a modeling technique used when the response variable represents count data—that is, observations take on discrete non-negative integer values:

\[ Y_i \in \{0, 1, 2, \dots\} \]

Unlike logistic regression, which models binary outcomes, Poisson regression is suitable when the outcome is a count of events (e.g., number of failures, accidents, arrivals, or occurrences within a fixed time or space window).

Modeling Assumptions#

The core assumption of Poisson regression is that the expected value of the response variable $ Y_i $ is modeled using a log-linear relationship:

\[ \mu_i(\beta) = \mathbb{E}[Y_i] = e^{X_i \cdot \beta} \]

where:

$ X_i $ is the vector of predictors for observation $ i $,
$ \beta $ is the vector of unknown coefficients,
$ \mu_i $ is the predicted mean count.

We sometimes write this equivalently as:

\[ \log(\mu_i) = X_i \cdot \beta \]

This formulation ensures the predicted mean $ \mu_i $ is always non-negative, satisfying the natural constraint for count data.

Although alternative link functions exist (like identity or square root), the canonical link for Poisson regression is the log link, due to its mathematical properties and interpretability.

Error Formulation#

Although Poisson regression is normally fit using maximum likelihood estimation (MLE), an intuitive way to introduce the model is through residual-based loss:

Let:

\[ \epsilon_i = Y_i - e^{X_i \cdot \beta} \]

Then the optimization objective becomes:

\[ \min_{\beta} \sum_{i=1}^n \left(Y_i - e^{X_i \cdot \beta}\right)^2 \]

While this is not the canonical approach for Poisson models (which use log-likelihood), it works for pedagogical purposes and allows for homebrew implementations using nonlinear least squares or numerical optimizers.

Applications#

Poisson regression is commonly used in:

Transportation engineering: vehicle accidents per mile, traffic incidents per hour.
Environmental studies: counts of disease occurrences in a region.
Civil systems: number of system failures, calls to maintenance, pipe bursts per year.
Hydrology: daily rainfall event counts over seasons or decades.

It’s particularly useful when:

The data represent rate-based outcomes,
The counts are not overly dispersed (see Negative Binomial for that case),
Zero values are common, but not dominant.

In the next sections, we explore homebrew implementations of Poisson regression.

Homebrew 1#

In this variant, we will implement a Poisson regression model using a squared-error objective function:

\[ J(\beta) = \sum_{i=1}^n \left(Y_i - e^{X_i \cdot \beta} \right)^2 \]

This allows us to use standard nonlinear optimizers such as Powell’s method or the Nelder-Mead simplex method.

We will:

Simulate or load a dataset where the target is a count.
Construct the design matrix $ X $ and response vector $ Y $.
Implement the Poisson kernel and the loss function.
Optimize using a homebrew solver.
Compare predictions to observed values.

This approach is pedagogically useful but not numerically optimal.

%reset -f
# build a dataset -  
import numpy as np
from numpy.random import normal
import math

M = 10_000
x = np.hstack([
    normal(0.0, 1.0, M).reshape(M, 1),
    normal(0.0, 1.0, M).reshape(M, 1),
    normal(0.0, 1.0, M).reshape(M, 1)
])
z = np.dot(x, np.array([0.15, 0.5, 0.2])) + 2.0 + normal(0.0, 0.01, M)
y = np.exp(z)

X    = x  # Design Matrix
Yobs = [math.trunc(item) for item in y]  # Discrete Target vector 

print(X[2][0])
print(x[2][0])
print(Yobs[0])

-0.3082091962166388
-0.3082091962166388
3

Yobs[1]

x[0:12]

array([[-1.04196896, -0.85124958, -0.89414553],
       [-0.32694623,  0.02530301,  0.69814781],
       [-0.3082092 ,  0.14995534, -0.79696863],
       [ 0.82711746,  0.50262614,  0.26527505],
       [ 1.07078142,  1.19763325,  1.31284919],
       [-3.05557742,  0.343873  ,  1.75246224],
       [ 1.00593193,  0.83969649,  1.2195609 ],
       [-1.76505643, -0.91064384,  0.6334922 ],
       [ 0.61171791, -1.11284845, -1.06640961],
       [ 2.00335928,  1.21496645,  1.08589271],
       [-0.02941931, -0.15019781,  0.43973858],
       [-1.58293956, -0.90016974, -0.78833107]])

X[0:12]

array([[-1.04196896, -0.85124958, -0.89414553],
       [-0.32694623,  0.02530301,  0.69814781],
       [-0.3082092 ,  0.14995534, -0.79696863],
       [ 0.82711746,  0.50262614,  0.26527505],
       [ 1.07078142,  1.19763325,  1.31284919],
       [-3.05557742,  0.343873  ,  1.75246224],
       [ 1.00593193,  0.83969649,  1.2195609 ],
       [-1.76505643, -0.91064384,  0.6334922 ],
       [ 0.61171791, -1.11284845, -1.06640961],
       [ 2.00335928,  1.21496645,  1.08589271],
       [-0.02941931, -0.15019781,  0.43973858],
       [-1.58293956, -0.90016974, -0.78833107]])

def mu(b0,b1,b2,b3,x,y,z): #poisson function (scalar) 3-design columns
    import math
    mu = math.exp(b0+b1*x+b2*y+b3*z)
    return(mu)

def sse(mod,obs): #compute sse from observations and model values
    howmany = len(mod)
    sse=0.0
    for i in range(howmany):
        sse=sse+(mod[i]-obs[i])**2
    return(sse)

def merit(beta): # merit function to minimize
    global Yobs,X #access lists already defined external to function
    mod=[0 for i in range(len(X))]
    for i in range(len(X)):
        mod[i]=mu(beta[0],beta[1],beta[2],beta[3],X[i][0],X[i][1],X[i][2])
    merit = sse(mod,Yobs)
    return(merit)

beta = [0,0,0,0] #initial guess of betas
merit(beta) #check that does not raise an exception

781238.0

import numpy as np
from scipy.optimize import minimize

#x0 = np.array([-3.0597,0.1615])
x0 = np.array(beta)
res = minimize(merit, x0, method='powell',options={'disp': True , 'maxiter':10 , 'return_all' : True})

Optimization terminated successfully.
         Current function value: 1035.440719
         Iterations: 5
         Function evaluations: 234

res.x

array([1.92859994, 0.15732782, 0.52267212, 0.20893387])

res.fun

array(1035.44071886)

mu(1.92859994, 0.15732782, 0.52267212, 0.20893387,-1.03262907e+00,  1.99696197e-01, -1.33509785e+00)

4.911439782570921

Homebrew 2#

In this variant, we will implement maximum likelihood estimation (MLE) directly using the Poisson log-likelihood:

\[ \log L(\beta) = \sum_{i=1}^n \left[ Y_i \cdot \log(\mu_i) - \mu_i - \log(Y_i!) \right] \]

Since:

\[ \mu_i = e^{X_i \cdot \beta} \]

the function to maximize becomes:

\[ \sum_{i=1}^n \left[ Y_i \cdot (X_i \cdot \beta) - e^{X_i \cdot \beta} - \log(Y_i!) \right] \]

This is the standard formulation used in statistical packages and is more statistically principled. However, it’s slightly more complex to implement because it involves logarithms and exponentials in the same objective.

We will proceed by:

Implementing the log-likelihood function.
Maximizing it (or equivalently minimizing the negative log-likelihood).
Comparing results with library-based solutions (e.g., statsmodels, scikit-learn).

%reset -f
# source code adapted from https://github.com/ximenasandoval/
#                          Poisson_regression/blob/main/
#                          Poisson%20regression%20model.ipynb

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from numpy.random import normal
plt.rcParams['figure.figsize'] = (16,8)
plt.style.use('ggplot')



np.random.seed(37)
sns.color_palette("Set2", as_cmap=True)

M = 10_000
x = np.hstack([
    normal(0.0, 1.0, M).reshape(M, 1),
    normal(0.0, 1.0, M).reshape(M, 1),
    normal(0.0, 1.0, M).reshape(M, 1)
])
z = np.dot(x, np.array([0.15, 0.5, 0.2])) + 2.0 + normal(0.0, 0.01, M)
y = np.exp(z)



fig, ax = plt.subplots(1, 2, figsize=(20, 5))

sns.kdeplot(z, ax=ax[0], color='#fcb103', fill=True)
ax[0].set_title(r'Distribution of Scores')
ax[0].set_xlabel('score')
ax[0].set_ylabel('probability')

sns.kdeplot(y, ax=ax[1], color='#fcb103', fill=True)
ax[1].set_title(r'Distribution of Means')
ax[1].set_xlabel('mean')
ax[1].set_ylabel('probability')

def loss(x, y, w, b):
    y_hat = np.exp(x @ w + b)
    # You can use the normal MSE error too! 
    #error = np.square(y_hat - y).mean() / 2
    error = (y_hat - np.log(y_hat) * y).mean()
    return error
        
def grad(x, y, w, b):
    M, n = x.shape
    y_hat = np.exp(x @ w + b)
    dw = (x.T @ (y_hat - y)) / M
    db = (y_hat - y).mean()
    return dw, db

def gradient_descent(x, y, w_0, b_0, alpha, num_iter):
    w, b = w_0.copy(), b_0
    hist = np.zeros(num_iter)
    M, n = x.shape
    
    for iter in range(num_iter):
        dw, db = grad(x, y, w, b)
        w -= alpha * dw 
        b -= alpha * db
        hist[iter] = loss(x, y, w, b)

    return w, b, hist


M, n = x.shape
w_0 = np.zeros((n, ))
b_0 = 1
alpha = 0.001
w, b, hist = gradient_descent(x, y, w_0, b_0, alpha, num_iter=10_000)




plt.plot(hist, 'b', color='#fcb103')
plt.title(u'Loss per iteration')
plt.xlabel(u'iteration')
plt.ylabel(r'$Loss(w, b)$')
plt.figtext(x=.6, y=.6, s="At the final iteration:\n\n w = {}\n b    = {}".format(w, b))
plt.show()



print(f"The final values for w = {w}")
print(f"The final value for b = {b}")

../../_images/92aa0aa6e65b2dfc7d639de2c3c137bb9aa2c4a0fdb377719a47af6c1cf97853.png

The final values for w = [0.14999278 0.49976397 0.20017282]
The final value for b = 1.9999653396018062

7.8 Pipe Burst Count/Prediction#

Warning

This example uses a simulated dataset.

Modeling the number of pipe burst incidents as a function of pipe age and urban location.

Simulated Dataset Overview:

Variable	Description
Pipe_Age	Age of the pipeline in years
Urban	1 if pipe is in urban area, 0 otherwise
Incidents	Number of burst incidents (count)

%reset -f
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Simulate dataset
np.random.seed(42)
n = 100
Pipe_Age = np.random.uniform(0, 10, n)
Urban = np.random.binomial(1, 0.5, n)
X = sm.add_constant(pd.DataFrame({'Pipe_Age': Pipe_Age, 'Urban': Urban}))

beta_true = np.array([0.5, 0.2, 0.8])  # intercept, age effect, urban effect
mu = np.exp(X @ beta_true)
Incidents = np.random.poisson(mu)

df_sim = pd.DataFrame({'Pipe_Age': Pipe_Age, 'Urban': Urban, 'Incidents': Incidents})

X.describe()

	const	Pipe_Age	Urban
count	100.0	100.000000	100.000000
mean	1.0	4.701807	0.510000
std	0.0	2.974894	0.502418
min	1.0	0.055221	0.000000
25%	1.0	1.932008	0.000000
50%	1.0	4.641425	1.000000
75%	1.0	7.302031	1.000000
max	1.0	9.868869	1.000000

# Fit and summarize Poisson model
poisson_model = sm.GLM(df_sim.Incidents, X, family=sm.families.Poisson()).fit()
print(poisson_model.summary())

                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:              Incidents   No. Observations:                  100
Model:                            GLM   Df Residuals:                       97
Model Family:                 Poisson   Df Model:                            2
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -233.32
Date:                Fri, 11 Jul 2025   Deviance:                       107.99
Time:                        17:18:33   Pearson chi2:                     103.
No. Iterations:                     5   Pseudo R-squ. (CS):             0.9805
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.4602      0.104      4.404      0.000       0.255       0.665
Pipe_Age       0.2085      0.013     16.400      0.000       0.184       0.233
Urban          0.8070      0.075     10.726      0.000       0.660       0.954
==============================================================================

# Plot observed vs predicted
df_sim['Predicted'] = poisson_model.predict(X)
plt.scatter(df_sim.Pipe_Age, df_sim.Incidents, label='Observed', alpha=0.6)
plt.scatter(df_sim.Pipe_Age, df_sim.Predicted, marker='x', label='Predicted', linewidth=1, alpha=0.6)
plt.xlabel('Pipe Age')
plt.ylabel('Incident Count')
plt.legend()
plt.title('Pipe Burst Incidents: Observed vs Predicted')
plt.grid(True, linestyle='--', alpha=0.5)
plt.show()

../../_images/e38c56f9be291a14ea8bf3d06b3964eb9ceed507adfd3fdcf2b4a8c62131421d.png

Tip

Generalized Linear Models (GLM) Toolbox

Many real-world engineering problems involve data that aren’t well-modeled by ordinary linear regression — such as binary outcomes, event counts, or skewed continuous responses. Generalized Linear Models (GLMs) provide a unified framework to handle these cases.

GLM Structure:

A GLM consists of three components:

Random Component: The distribution of the response variable (e.g., Binomial, Poisson, Gamma).
Systematic Component: The linear predictor:
$$ \eta_i = X_i \cdot \beta $$
Link Function: Connects the expected value to the linear predictor:
$$ g(\mu_i) = \eta_i $$

Common Link Functions:

Model Type	Distribution	Link Function$ g(\mu)$	Use Case
Logistic Regression	Binomial	$ \log\left(\frac{\mu}{1 - \mu}\right)$	Binary classification
Poisson Regression	Poisson	$ \log(\mu)$	Count data (events, rates)
Gamma Regression	Gamma	$ \log(\mu)$ or$ 1/\mu$	Skewed continuous data

Interpretation Tip:

The log link (used in Poisson models) ensures the model predicts non-negative values and interprets coefficients as multiplicative effects.
The logit link (used in logistic regression) maps probabilities to the entire real line, making classification stable and interpretable.

GLMs are powerful because they extend linear modeling techniques to non-linear problems, while preserving familiar structures like design matrices and residual analysis.

7.9 Bicycle Rental Count/Prediction#

%reset -f
import os
import requests
import zipfile
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Load dataset from UCI repo (daily counts)
url = "https://archive.ics.uci.edu/static/public/275/bike+sharing+dataset.zip"
filename = "bsd.zip"
csvname = "day.csv"
# Fallback: local copy if download fails
if not os.path.exists(filename):
    try:
        response = requests.get(url, timeout=60)
        with open(filename, 'wb') as f:
            f.write(response.content)
        print(f"Dataset downloaded and saved as '{filename}'")
    except Exception as e:
        print(f"Could not download file. Please ensure '{filename}' is available locally.\n", e)

# Unzip the archive if not already unzipped
if not os.path.exists(csvname):
    try:
        with zipfile.ZipFile(filename, 'r') as zip_ref:
            zip_ref.extractall()
        print(f"Extracted '{csvname}' from archive.")
    except Exception as e:
        print("Error extracting ZIP file:", e)

# Load the daily bike rental data
df = pd.read_csv(csvname)
df.head()

	instant	dteday	season	mnth	weekday	workingday	weathersit	temp	atemp	hum	windspeed	casual	registered	cnt
0	1	2011-01-01	1	1	6	0	2	0.344167	0.363625	0.805833	0.160446	331	654	985
1	2	2011-01-02	1	1	0	0	2	0.363478	0.353739	0.696087	0.248539	131	670	801
2	3	2011-01-03	1	1	1	1	1	0.196364	0.189405	0.437273	0.248309	120	1229	1349
3	4	2011-01-04	1	1	2	1	1	0.200000	0.212122	0.590435	0.160296	108	1454	1562
4	5	2011-01-05	1	1	3	1	1	0.226957	0.229270	0.436957	0.186900	82	1518	1600

# Variables:
# dteday: date
# temp: normalized temperature
# hum: humidity
# weathersit: weather condition
# workingday: 1 if working day, 0 otherwise
# cnt: count of total rentals

# Create copy and select relevant columns
df_small = df[['dteday', 'cnt', 'temp', 'hum', 'workingday', 'weathersit']].copy()
df_small['const'] = 1  # manually add intercept

# Fit Poisson model
X = df_small[['const', 'temp', 'hum', 'workingday', 'weathersit']]
y = df_small['cnt']

poisson_model = sm.GLM(y, X, family=sm.families.Poisson()).fit()
# Later: Assign predicted values safely

print(poisson_model.summary())

import matplotlib.dates as mdates

df_small['Predicted'] = poisson_model.predict(X)

# Ensure dteday is datetime type
df_small['dteday'] = pd.to_datetime(df_small['dteday'])

plt.figure(figsize=(12, 5))
plt.plot(df_small['dteday'], df_small['cnt'], label='Observed', alpha=0.5)
plt.plot(df_small['dteday'], df_small['Predicted'], '--', label='Poisson Prediction')

# Set monthly ticks
ax = plt.gca()
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=1))  # Every month
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b-%Y'))  # Format as Jan-2011

plt.xlabel('Date')
plt.ylabel('Daily Rentals')
plt.title('Bike Rentals in Somewhere USA — Observed vs Poisson Model')
plt.xticks(rotation=45)
plt.legend()
plt.tight_layout()
plt.grid(True, linestyle='--', alpha=0.4)
plt.show()

                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                    cnt   No. Observations:                  731
Model:                            GLM   Df Residuals:                      726
Model Family:                 Poisson   Df Model:                            4
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:            -2.0111e+05
Date:                Fri, 11 Jul 2025   Deviance:                   3.9482e+05
Time:                        17:18:37   Pearson chi2:                 3.83e+05
No. Iterations:                     5   Pseudo R-squ. (CS):              1.000
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          7.9579      0.003   2588.870      0.000       7.952       7.964
temp           1.4391      0.003    452.300      0.000       1.433       1.445
hum           -0.1660      0.005    -32.843      0.000      -0.176      -0.156
workingday     0.0410      0.001     34.115      0.000       0.039       0.043
weathersit    -0.1605      0.001   -117.577      0.000      -0.163      -0.158
==============================================================================

../../_images/ce0ac657ad5771b02693eef830f908c9ba5b17620d537365a0d255a6ca014584.png

Here’s how to compute and visualize deviance residuals and Pearson residuals, both of which are commonly used for Poisson models:

import numpy as np

# Calculate residuals
mu = df_small['Predicted']
y = df_small['cnt']

# Deviance residuals
deviance_resid = np.sign(y - mu) * np.sqrt(2 * (y * np.log((y + 1e-8) / mu) - (y - mu)))

# Pearson residuals
pearson_resid = (y - mu) / np.sqrt(mu)

# Add to DataFrame
df_small['Deviance_Resid'] = deviance_resid
df_small['Pearson_Resid'] = pearson_resid

# Plot: Deviance residuals over time
plt.figure(figsize=(12, 4))
plt.plot(df_small['dteday'], df_small['Deviance_Resid'], alpha=0.7)
plt.axhline(0, color='red', linestyle='--')
plt.title('Deviance Residuals Over Time')
plt.xlabel('Date')
plt.ylabel('Deviance Residual')
plt.xticks(rotation=45)
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

# Plot: Residuals vs Predicted
plt.figure(figsize=(8, 5))
plt.scatter(mu, pearson_resid, alpha=0.6)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted Count')
plt.ylabel('Pearson Residual')
plt.title('Residuals vs Predicted Values')
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

../../_images/570bce2267af6e2cde132b02d01d3c4d5efade8a20d10c426dcbceeac04d0bde.png

../../_images/88e500e5116f3bfc6db6159c5da4f51b354fe96e9fc0874b1cdb16af2c2f2148.png

What to Look For:#

Deviance residuals: Large outliers may signal poor model fit or unaccounted variability.
Pearson residuals vs predicted: Fan-shaped patterns suggest overdispersion (⇒ Negative Binomial).

Residual Diagnostics Checklist#

After fitting a Poisson regression model, use this checklist to evaluate how well the model fits the data and whether underlying assumptions are violated.

1. Deviance Residuals#

Plot residuals over time or index:
- Look for systematic patterns or outliers.
- A good fit should have residuals randomly scattered around zero.
If many residuals are very large (in absolute value), it may suggest:
- Overdispersion
- Unaccounted predictors
- Non-Poisson data generation

2. Pearson Residuals vs. Predicted Values#

Plot residuals against fitted values:
- Check for heteroskedasticity (e.g., funnel or fan shape).
- Residuals should be center

Summary: Logistic and Poisson Regression in Engineering Contexts#

1. Binary Classification with Logistic Regression#

In this lesson, we introduced logistic regression as a statistical modeling technique used to classify observations into two categories (e.g., success/failure, pass/fail, event/no event). The method is particularly effective when the outcome is binary and we wish to estimate the probability that a given observation falls into one class based on one or more input features.

We explored:

Simple logistic regression: using a single predictor.
Multiple logistic regression: extending the model to include multiple predictors.

Interpretation: understanding model coefficients in terms of odds ratios and assessing how changes in inputs affect the likelihood of a particular outcome.

This model is useful in both directions:

Forward prediction: Estimate the probability of a particular outcome given known inputs.
Inverse inquiry: Determine what input values are associated with a desired probability threshold.

Example: What temperature yields a 75% chance of system failure? This is conceptually similar to exceedance probabilities familiar from hydrologic frequency analysis.

2. Modeling Count Data with Poisson Regression#

We also introduced Poisson regression, a related but distinct modeling tool used when the outcome is not binary, but rather a count—i.e., the number of times an event occurs.

Poisson regression is appropriate when:

The response variable is a non-negative integer (0, 1, 2, …).
We can assume that the mean count increases or decreases exponentially with the input variables.

This model allows us to predict discrete quantities such as:

Number of pipe bursts per year,
Number of traffic accidents per day,
Number of service calls per region.

We discussed how the log link function ensures predicted counts remain non-negative and scale multiplicatively with respect to input variables.

3. Model Implementation: Homebrew and Package-Based#

Throughout the lesson, we demonstrated how to build both:

Homebrew implementations of logistic and Poisson models, useful for understanding the mathematics behind the methods.
Package-based implementations using statsmodels, scikit-learn, and other libraries, useful for practical, robust, and scalable model fitting.

We incorporated:

Nonlinear optimization methods (e.g., Powell’s method) to minimize residuals in a custom setting,
Maximum likelihood estimation (MLE) for a more statistically principled and numerically stable approach, especially for Poisson regression.

4. Creativity and Engineering Judgment#

Ultimately, these regression techniques are tools—mathematical hammers in our engineering toolbox. Their power lies in how creatively and appropriately we apply them to real-world problems. Whether predicting failure risk, analyzing safety incidents, or designing smarter infrastructure systems, these models can guide both analysis and decision-making in uncertainty.

As with all tools, we emphasized the importance of:

Residual diagnostics to check model assumptions,
Interpretable results to ensure engineering relevance,

And using models not as ends in themselves, but as parts of a broader cycle of observation, modeling, validation, and action.

	poison-class	cap-shape	cap-surface	cap-color	bruises	odor	gill-attachment	gill-spacing	gill-size	gill-color	...	stalk-surface-below-ring	stalk-color-above-ring	stalk-color-below-ring	veil-type	veil-color	ring-number	ring-type	spore-print-color	population	habitat
0	p	x	s	n	t	p	f	c	n	k	...	s	w	w	p	w	o	p	k	s	u
1	e	x	s	y	t	a	f	c	b	k	...	s	w	w	p	w	o	p	n	n	g
2	e	b	s	w	t	l	f	c	b	n	...	s	w	w	p	w	o	p	n	n	m
3	p	x	y	w	t	p	f	c	n	n	...	s	w	w	p	w	o	p	k	s	u
4	e	x	s	g	f	n	f	w	b	k	...	s	w	w	p	w	o	e	n	a	g
5	e	x	y	y	t	a	f	c	b	n	...	s	w	w	p	w	o	p	k	n	g
6	e	b	s	w	t	a	f	c	b	g	...	s	w	w	p	w	o	p	k	n	m
7	e	b	y	w	t	l	f	c	b	n	...	s	w	w	p	w	o	p	n	s	m
8	p	x	y	w	t	p	f	c	n	p	...	s	w	w	p	w	o	p	k	v	g
9	e	b	s	y	t	a	f	c	b	g	...	s	w	w	p	w	o	p	k	s	m
10	e	x	y	y	t	l	f	c	b	g	...	s	w	w	p	w	o	p	n	n	g
11	e	x	y	y	t	a	f	c	b	n	...	s	w	w	p	w	o	p	k	s	m
12	e	b	s	y	t	a	f	c	b	w	...	s	w	w	p	w	o	p	n	s	g
13	p	x	y	w	t	p	f	c	n	k	...	s	w	w	p	w	o	p	n	v	u
14	e	x	f	n	f	n	f	w	b	n	...	f	w	w	p	w	o	e	k	a	g
15	e	s	f	g	f	n	f	c	n	k	...	s	w	w	p	w	o	p	n	y	u
16	e	f	f	w	f	n	f	w	b	k	...	s	w	w	p	w	o	e	n	a	g
17	p	x	s	n	t	p	f	c	n	n	...	s	w	w	p	w	o	p	k	s	g
18	p	x	y	w	t	p	f	c	n	n	...	s	w	w	p	w	o	p	n	s	u
19	p	x	s	n	t	p	f	c	n	k	...	s	w	w	p	w	o	p	n	s	u

	poison-class	cap-shape	cap-surface	cap-color	bruises	odor	gill-attachment	gill-spacing	gill-size	gill-color	...	stalk-surface-below-ring	stalk-color-above-ring	stalk-color-below-ring	veil-type	veil-color	ring-number	ring-type	spore-print-color	population	habitat
0	1	3	4	1	2	8	3	1	2	1	...	4	8	8	1	3	2	6	1	4	5
1	0	3	4	10	2	1	3	1	1	1	...	4	8	8	1	3	2	6	2	3	1
2	0	1	4	9	2	2	3	1	1	2	...	4	8	8	1	3	2	6	2	3	3
3	1	3	3	9	2	8	3	1	2	2	...	4	8	8	1	3	2	6	1	4	5
4	0	3	4	4	1	7	3	2	1	1	...	4	8	8	1	3	2	2	2	1	1
5	0	3	3	10	2	1	3	1	1	2	...	4	8	8	1	3	2	6	1	3	1
6	0	1	4	9	2	1	3	1	1	5	...	4	8	8	1	3	2	6	1	3	3
7	0	1	3	9	2	2	3	1	1	2	...	4	8	8	1	3	2	6	2	4	3
8	1	3	3	9	2	8	3	1	2	8	...	4	8	8	1	3	2	6	1	5	1
9	0	1	4	10	2	1	3	1	1	5	...	4	8	8	1	3	2	6	1	4	3

Model Type	Distribution	Link Function\( g(\mu)\)	Use Case
Logistic Regression	Binomial	\( \log\left(\frac{\mu}{1 - \mu}\right)\)	Binary classification
Poisson Regression	Poisson	\( \log(\mu)\)	Count data (events, rates)
Gamma Regression	Gamma	\( \log(\mu)\) or\( 1/\mu\)	Skewed continuous data

	poison-class	cap-shape	cap-surface	cap-color	bruises	odor	gill-attachment	gill-spacing	gill-size	gill-color	...	stalk-surface-below-ring	stalk-color-above-ring	stalk-color-below-ring	veil-type	veil-color	ring-number	ring-type	spore-print-color	population	habitat
0	p	x	s	n	t	p	f	c	n	k	...	s	w	w	p	w	o	p	k	s	u
1	e	x	s	y	t	a	f	c	b	k	...	s	w	w	p	w	o	p	n	n	g
2	e	b	s	w	t	l	f	c	b	n	...	s	w	w	p	w	o	p	n	n	m
3	p	x	y	w	t	p	f	c	n	n	...	s	w	w	p	w	o	p	k	s	u
4	e	x	s	g	f	n	f	w	b	k	...	s	w	w	p	w	o	e	n	a	g
5	e	x	y	y	t	a	f	c	b	n	...	s	w	w	p	w	o	p	k	n	g
6	e	b	s	w	t	a	f	c	b	g	...	s	w	w	p	w	o	p	k	n	m
7	e	b	y	w	t	l	f	c	b	n	...	s	w	w	p	w	o	p	n	s	m
8	p	x	y	w	t	p	f	c	n	p	...	s	w	w	p	w	o	p	k	v	g
9	e	b	s	y	t	a	f	c	b	g	...	s	w	w	p	w	o	p	k	s	m
10	e	x	y	y	t	l	f	c	b	g	...	s	w	w	p	w	o	p	n	n	g
11	e	x	y	y	t	a	f	c	b	n	...	s	w	w	p	w	o	p	k	s	m
12	e	b	s	y	t	a	f	c	b	w	...	s	w	w	p	w	o	p	n	s	g
13	p	x	y	w	t	p	f	c	n	k	...	s	w	w	p	w	o	p	n	v	u
14	e	x	f	n	f	n	f	w	b	n	...	f	w	w	p	w	o	e	k	a	g
15	e	s	f	g	f	n	f	c	n	k	...	s	w	w	p	w	o	p	n	y	u
16	e	f	f	w	f	n	f	w	b	k	...	s	w	w	p	w	o	e	n	a	g
17	p	x	s	n	t	p	f	c	n	n	...	s	w	w	p	w	o	p	k	s	g
18	p	x	y	w	t	p	f	c	n	n	...	s	w	w	p	w	o	p	n	s	u
19	p	x	s	n	t	p	f	c	n	k	...	s	w	w	p	w	o	p	n	s	u

	poison-class	cap-shape	cap-surface	cap-color	bruises	odor	gill-attachment	gill-spacing	gill-size	gill-color	...	stalk-surface-below-ring	stalk-color-above-ring	stalk-color-below-ring	veil-type	veil-color	ring-number	ring-type	spore-print-color	population	habitat
0	1	3	4	1	2	8	3	1	2	1	...	4	8	8	1	3	2	6	1	4	5
1	0	3	4	10	2	1	3	1	1	1	...	4	8	8	1	3	2	6	2	3	1
2	0	1	4	9	2	2	3	1	1	2	...	4	8	8	1	3	2	6	2	3	3
3	1	3	3	9	2	8	3	1	2	2	...	4	8	8	1	3	2	6	1	4	5
4	0	3	4	4	1	7	3	2	1	1	...	4	8	8	1	3	2	2	2	1	1
5	0	3	3	10	2	1	3	1	1	2	...	4	8	8	1	3	2	6	1	3	1
6	0	1	4	9	2	1	3	1	1	5	...	4	8	8	1	3	2	6	1	3	3
7	0	1	3	9	2	2	3	1	1	2	...	4	8	8	1	3	2	6	2	4	3
8	1	3	3	9	2	8	3	1	2	8	...	4	8	8	1	3	2	6	1	5	1
9	0	1	4	10	2	1	3	1	1	5	...	4	8	8	1	3	2	6	1	4	3

7. Logistic Regression

Contents

7. Logistic Regression#

Readings/References#

Videos#

What is “Logistic Regression”?#

The Sigmoid Function: From Scores to Probabilities#

Python Code to Visualize the Sigmoid#

Simple Example: Logistic Model for System Failure#

7.1 A Logistic Regression Example#

Success at Task vs Experience#

Split the Data Set#

How to assess the performance of logistic regression?#

Model Evaluation using Confusion Matrix#

7.2 Selecting the Decision Boundary#

Some references#

7.3 Logistic Regression in Designed Experiments#

Redemption Rate vs. Rebate Value#

Exercises#

References#

7.4 Multiple Logistic Regression#

Disease Rate vs Multiple Predictors#

7.5 Titanic Survival Example (adapted from 2012 Kaggle Competition)#

7.6 Poison Mushrooms#

Problem Statement#

ML Workflow#

7.6.1 OpenAI Assisted Refactoring#

7.7 Poisson Regression#

Modeling Assumptions#

Error Formulation#

Applications#

Homebrew 1#

Homebrew 2#

7.8 Pipe Burst Count/Prediction#

GLM Structure:

Common Link Functions:

Interpretation Tip:

7.9 Bicycle Rental Count/Prediction#

What to Look For:#

Residual Diagnostics Checklist#

1. Deviance Residuals#

2. Pearson Residuals vs. Predicted Values#

Summary: Logistic and Poisson Regression in Engineering Contexts#

1. Binary Classification with Logistic Regression#

2. Modeling Count Data with Poisson Regression#

3. Model Implementation: Homebrew and Package-Based#

4. Creativity and Engineering Judgment#

End of Section#

	poison-class	cap-shape	cap-surface	cap-color	bruises	odor	gill-attachment	gill-spacing	gill-size	gill-color	...	stalk-surface-below-ring	stalk-color-above-ring	stalk-color-below-ring	veil-type	veil-color	ring-number	ring-type	spore-print-color	population	habitat
0	p	x	s	n	t	p	f	c	n	k	...	s	w	w	p	w	o	p	k	s	u
1	e	x	s	y	t	a	f	c	b	k	...	s	w	w	p	w	o	p	n	n	g
2	e	b	s	w	t	l	f	c	b	n	...	s	w	w	p	w	o	p	n	n	m
3	p	x	y	w	t	p	f	c	n	n	...	s	w	w	p	w	o	p	k	s	u
4	e	x	s	g	f	n	f	w	b	k	...	s	w	w	p	w	o	e	n	a	g
5	e	x	y	y	t	a	f	c	b	n	...	s	w	w	p	w	o	p	k	n	g
6	e	b	s	w	t	a	f	c	b	g	...	s	w	w	p	w	o	p	k	n	m
7	e	b	y	w	t	l	f	c	b	n	...	s	w	w	p	w	o	p	n	s	m
8	p	x	y	w	t	p	f	c	n	p	...	s	w	w	p	w	o	p	k	v	g
9	e	b	s	y	t	a	f	c	b	g	...	s	w	w	p	w	o	p	k	s	m
10	e	x	y	y	t	l	f	c	b	g	...	s	w	w	p	w	o	p	n	n	g
11	e	x	y	y	t	a	f	c	b	n	...	s	w	w	p	w	o	p	k	s	m
12	e	b	s	y	t	a	f	c	b	w	...	s	w	w	p	w	o	p	n	s	g
13	p	x	y	w	t	p	f	c	n	k	...	s	w	w	p	w	o	p	n	v	u
14	e	x	f	n	f	n	f	w	b	n	...	f	w	w	p	w	o	e	k	a	g
15	e	s	f	g	f	n	f	c	n	k	...	s	w	w	p	w	o	p	n	y	u
16	e	f	f	w	f	n	f	w	b	k	...	s	w	w	p	w	o	e	n	a	g
17	p	x	s	n	t	p	f	c	n	n	...	s	w	w	p	w	o	p	k	s	g
18	p	x	y	w	t	p	f	c	n	n	...	s	w	w	p	w	o	p	n	s	u
19	p	x	s	n	t	p	f	c	n	k	...	s	w	w	p	w	o	p	n	s	u

	poison-class	cap-shape	cap-surface	cap-color	bruises	odor	gill-attachment	gill-spacing	gill-size	gill-color	...	stalk-surface-below-ring	stalk-color-above-ring	stalk-color-below-ring	veil-type	veil-color	ring-number	ring-type	spore-print-color	population	habitat
0	1	3	4	1	2	8	3	1	2	1	...	4	8	8	1	3	2	6	1	4	5
1	0	3	4	10	2	1	3	1	1	1	...	4	8	8	1	3	2	6	2	3	1
2	0	1	4	9	2	2	3	1	1	2	...	4	8	8	1	3	2	6	2	3	3
3	1	3	3	9	2	8	3	1	2	2	...	4	8	8	1	3	2	6	1	4	5
4	0	3	4	4	1	7	3	2	1	1	...	4	8	8	1	3	2	2	2	1	1
5	0	3	3	10	2	1	3	1	1	2	...	4	8	8	1	3	2	6	1	3	1
6	0	1	4	9	2	1	3	1	1	5	...	4	8	8	1	3	2	6	1	3	3
7	0	1	3	9	2	2	3	1	1	2	...	4	8	8	1	3	2	6	2	4	3
8	1	3	3	9	2	8	3	1	2	8	...	4	8	8	1	3	2	6	1	5	1
9	0	1	4	10	2	1	3	1	1	5	...	4	8	8	1	3	2	6	1	4	3