Logistic Regression

Background

Multiple Variable Models

\[ Y_{obs} = \frac{e^{\mathbf{\beta}^T\mathbf{X}}}{1 + e^{\mathbf{\beta}^T\mathbf{X}}} + \epsilon \]

where

\[\begin{split} \begin{gather} \mathbf{\beta}= \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_m \\ \end{pmatrix} \end{gather} \end{split}\]

\[\begin{split} \begin{gather} \mathbf{X}= \begin{pmatrix} 1 & x_{1,1} & x_{1,2}&\dots & x_{1,m}\\ 1 & x_{2,1} & x_{2,2}&\dots & x_{2,m}\\ \vdots & \vdots & \vdots & \vdots & \vdots\\ 1 & x_{n,1} & x_{n,2}&\dots & x_{n,m}\\ \end{pmatrix} \end{gather} \end{split}\]

The design matrix \(\mathbf{X}\) has as many rows as observations to fit, and one more column than predictor variables: the leading column of ones multiplies the intercept \(\beta_0\).
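A minimal sketch of building such a matrix with numpy (the numbers here are made up purely for illustration):

import numpy as np

# hypothetical data: n = 3 observations of m = 2 predictor variables
predictors = np.array([[0.5, 1.2],
                       [1.1, 0.3],
                       [2.0, 4.4]])
# prepend a column of ones so the first coefficient acts as the intercept beta_0
X = np.hstack([np.ones((predictors.shape[0], 1)), predictors])
print(X.shape)  # (3, 3): n rows, m + 1 columns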

Fitting Process

The logistic regression method attempts to find values of \(\beta\) that maximize the log-likelihood function

\[\log_e L(\beta) = Y_{obs}^T \cdot (\mathbf{\beta}^T\mathbf{X}) - \mathbf{1}^T \cdot \log_e(1 + e^{\mathbf{\beta}^T\mathbf{X}})\]

The presence of the last term means there is no closed-form solution; the fit is an iterative search, typically some form of iteratively reweighted least squares or gradient ascent. In general we will rely on packages to handle this work - an example follows.
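Before turning to the example, it is worth noting (a fact stated here without derivation) that the gradient of this log-likelihood takes a simple form,

\[\nabla_{\beta} \log_e L(\beta) = \mathbf{X}^T (Y_{obs} - p), \qquad p_i = \frac{e^{\beta^T x_i}}{1 + e^{\beta^T x_i}}\]

which is exactly the np.dot(features.T, output_error_signal) term used in the homebrew gradient-ascent routine later in this section; the scikit-learn example that follows handles the equivalent optimization internally.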

Read the database

import pandas as pd
import numpy as np
import sklearn.metrics as metrics
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

mydatabase = pd.read_csv('LogisticRegressionHomebrewExample.csv')

Verify the read, and plot the database

mydatabase.describe()
                 X1            X2         Class
count  20000.000000  20000.000000  20000.000000
mean       0.496920      2.001764      0.500000
std        1.113666      2.240266      0.500013
min       -4.214002     -3.489115      0.000000
25%       -0.260873     -0.015846      0.000000
50%        0.498444      1.978725      0.500000
75%        1.264066      4.016432      1.000000
max        4.605613      7.945987      1.000000
#Split dataframe based on class value and plot
classzero = mydatabase[mydatabase['Class']<1.0]
classone = mydatabase[mydatabase['Class']>0.0]
# plot the two classes as red and blue
plt.figure(figsize=(10, 8))
plt.scatter(classzero['X1'],classzero['X2'], c="red", alpha=0.3)
plt.scatter(classone['X1'],classone['X2'], c="blue", alpha=0.3)
plt.legend(["Class 0","Class 1"])
plt.xlabel('X1 Predictor') 
plt.ylabel('X2 Predictor') 
plt.grid(which='both')
plt.show()
(Figure: scatter plot of the Class 0 (red) and Class 1 (blue) observations in the X1-X2 plane)

The two classes appear different enough that logistic-regression classification seems feasible. For example, if the predictor input is \([0,0]\) then the class is likely “Class 0”, whereas if the predictor input is \([0,4]\) then the class is likely “Class 1” according to the above plot.

#split dataset in features and target variable
feature_cols = ['X1', 'X2']
X = mydatabase[feature_cols] # Features
y = mydatabase["Class"] # Target variable
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=0)
# import the class
from sklearn.linear_model import LogisticRegression

# instantiate the model (using the default parameters)
logreg = LogisticRegression()
# fit the model with data
logreg.fit(X_train,y_train)

# predict class labels for the test set
y_pred=logreg.predict(X_test)
# import the metrics class
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_pred, y_test)
cnf_matrix
array([[2435,   58],
       [  42, 2465]])
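As a quick follow-up check (assuming y_test and y_pred from the cells above), the overall accuracy can also be computed directly with the same metrics module:

print(metrics.accuracy_score(y_test, y_pred))  # fraction classified correctly; from the matrix above, (2435+2465)/5000 = 0.98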
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Predicted label')
plt.xlabel('Actual label');
(Figure: confusion matrix heatmap, predicted label versus actual label)

Now let's build a classification (class prediction) engine using the logistic regression results.

First we need the sigmoid response function

\[\text{sigmoid}(X,\beta) = \frac{1}{1 + e^{-\mathbf{\beta}^T\mathbf{X}}} \]
def sigmoid(scores): # scores = b0+b1X1+b2X2+... where b0,b1,b2,... are specified
    return 1/(1+np.exp(-scores))
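A quick sanity check of this definition (assuming numpy is imported as above):

print(sigmoid(0.0))                    # 0.5, the decision threshold
print(sigmoid(np.array([-5.0, 5.0])))  # approximately [0.0067, 0.9933]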

Now let's use two inputs like the ones from the discussion above

inputs = np.array([[0],[1]]) # one example input; should evaluate as Class 0
scores = np.dot(logreg.coef_,inputs) + logreg.intercept_
estimated_class = sigmoid(np.dot(logreg.coef_,inputs) + logreg.intercept_)[0,0] # grab the only element
print("X1 Input = ",inputs[0][0],"\nX2 Input = ",inputs[1][0],"\nEstimated Class Value",round(estimated_class,0))
X1 Input =  0 
X2 Input =  1 
Estimated Class Value 0.0
inputs = np.array([[0],[4]]) # one example input; should evaluate as Class 1
scores = np.dot(logreg.coef_,inputs) + logreg.intercept_
estimated_class = sigmoid(np.dot(logreg.coef_,inputs) + logreg.intercept_)[0,0] # grab the only element
print("X1 Input = ",inputs[0][0],"\nX2 Input = ",inputs[1][0],"\nEstimated Class Value",round(estimated_class,0))
X1 Input =  0 
X2 Input =  4 
Estimated Class Value 1.0
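The same two checks can be made directly with the fitted scikit-learn model; predict returns hard class labels and predict_proba returns the class probabilities (a sketch, assuming logreg and feature_cols from the cells above):

checks = pd.DataFrame([[0, 1], [0, 4]], columns=feature_cols)
print(logreg.predict(checks))        # hard class labels; should match the estimates above
print(logreg.predict_proba(checks))  # columns give P(Class 0) and P(Class 1) for each row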

So now we have a classifier function. We can draw the dividing line (decision boundary) back on the original plot; we only need two points where the classifier returns a value of 0.5, as below.

inputs = np.array([[-4],[0.645]]) # one example input; should evaluate near 0.5 (on the boundary)
scores = np.dot(logreg.coef_,inputs) + logreg.intercept_
estimated_class = sigmoid(np.dot(logreg.coef_,inputs) + logreg.intercept_)[0,0] # grab the only element
print("X1 Input = ",inputs[0][0],"\nX2 Input = ",inputs[1][0],"\nEstimated Class Value",round(estimated_class,3))
X1 Input =  -4.0 
X2 Input =  0.645 
Estimated Class Value 0.502
inputs = np.array([[4],[3.018]]) # one example input; should evaluate near 0.5 (on the boundary)
scores = np.dot(logreg.coef_,inputs) + logreg.intercept_
estimated_class = sigmoid(np.dot(logreg.coef_,inputs) + logreg.intercept_)[0,0] # grab the only element
print("X1 Input = ",inputs[0][0],"\nX2 Input = ",inputs[1][0],"\nEstimated Class Value",round(estimated_class,3))
X1 Input =  4.0 
X2 Input =  3.018 
Estimated Class Value 0.5

Our two points, found by trial and error, are \([-4.0, 0.645]\) and \([4.0, 3.018]\); we can add this line to our plot.

# plot the two classes as red and blue
plt.figure(figsize=(10, 8))
plt.scatter(classzero['X1'],classzero['X2'], c="red", alpha=0.3)
plt.scatter(classone['X1'],classone['X2'], c="blue", alpha=0.3)
plt.plot([-4,4],[0.645,3.018],c="magenta",linewidth=4)
plt.legend(["Class 0","Class 1"])
plt.xlabel('X1 Predictor') 
plt.ylabel('X2 Predictor') 
plt.grid(which='both')
plt.show()
(Figure: scatter plot of the two classes with the fitted dividing line drawn in magenta)
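Instead of trial and error, boundary points can be computed directly from the fitted coefficients: the classifier returns exactly 0.5 where \(\beta_0 + \beta_1 x_1 + \beta_2 x_2 = 0\). A minimal sketch, assuming the logreg model fitted above:

b0 = logreg.intercept_[0]
b1, b2 = logreg.coef_[0]
for x1 in (-4.0, 4.0):
    x2 = -(b0 + b1 * x1) / b2   # solve b0 + b1*x1 + b2*x2 = 0 for x2
    print(x1, round(x2, 3))     # should land close to the trial-and-error points above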

Single Variable Models

A single variable model is a special case of the multi-variate approach with the following modifications:

\[\begin{split} \begin{gather} \mathbf{\beta}= \begin{pmatrix} \beta_0 \\ \beta_1 \\ \end{pmatrix} \end{gather} \end{split}\]
\[\begin{split} \begin{gather} \mathbf{X}= \begin{pmatrix} 1 & x_{1,1} \\ 1 & x_{2,1} \\ \vdots & \vdots \\ 1 & x_{n,1} \\ \end{pmatrix} \end{gather} \end{split}\]
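Written out, the modeled probability of membership in Class 1 then reduces to

\[ p(x_1) = \frac{e^{\beta_0 + \beta_1 x_1}}{1 + e^{\beta_0 + \beta_1 x_1}} \]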

A simple example follows:

toxicity = pd.DataFrame([[1,2,3,4,5,6],[28,53,93,126,172,197]],["Log_Dose","Deaths"] )
toxicity = toxicity.transpose()
toxicity.head()
# compute the death fraction; each dose was administered to 250 subjects
toxicity["Fraction"]=toxicity["Deaths"]/250.0
# declare lethal if the death fraction is >= 0.5

def islethal(value):
    if value >= 0.5:
        islethal = 1
    else:
        islethal = 0
    return(islethal)

toxicity["Class"]=toxicity["Fraction"].apply(islethal)
toxicity.head()
   Log_Dose  Deaths  Fraction  Class
0         1      28     0.112      0
1         2      53     0.212      0
2         3      93     0.372      0
3         4     126     0.504      1
4         5     172     0.688      1
def log_likelihood(features, target, weights): # features are the design variables; weights are the b0,b1,b2,...;target is the known class (0 or 1)
    scores = np.dot(features, weights)  
    ll = np.sum(target * scores - np.log(1 + np.exp(scores)))  
    return ll
def logistic_regression(features, target, num_steps, learning_rate, add_intercept=False, verbose=True, exit_tolerance = 1.0e-9):
    if add_intercept: 
        intercept = np.ones((features.shape[0], 1))  
        features = np.hstack((intercept, features))     
# print statements are for debugging        
    weights = np.zeros(features.shape[1])  # initialize all weights to zero, one per column of the design matrix
    print("initial weights = ",weights)
    print("initial log likelihood = ",log_likelihood(features, target, weights))
    
#          
    for step in range(num_steps): 
        llold = log_likelihood(features, target, weights)
        if verbose: print("start weights = ",weights) 
        scores = np.dot(features, weights) 
        if verbose: print("start scores = ",scores)
        predictions = sigmoid(scores)  
        if verbose: print("predictions = ",predictions)
        output_error_signal = target - predictions      
        gradient = np.dot(features.T, output_error_signal)    
        weights = weights + learning_rate * gradient 
        if verbose: print("new weights = ",weights)
        newscores = np.dot(features, weights) 
        if verbose: print("new scores = ",newscores)
        llnew = log_likelihood(features, target, weights)

# test for exit
        if abs(llnew-llold) <= exit_tolerance:
            print("Exit Tolerance Satisfied \n  LL is not changing")
            print("step = ",step) 
            print("ll = ",log_likelihood(features, target, weights))
            break # exit the loop
        if step % 5 == 0:      
            if verbose: print("step = ",step) 
            if verbose: print("ll = ",log_likelihood(features, target, weights))
            if verbose: print("error signal = ",output_error_signal)
            if verbose: print("predictions = ",predictions)
# print on exit
    print("step = ",step) 
    print("ll = ",log_likelihood(features, target, weights))
    print("error signal = ",output_error_signal)
    print("predictions = ",predictions)
    return weights
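With these pieces in place, the single-variable toxicity data set up earlier can be fit with the homebrew routine. A minimal sketch (the step count and learning rate here are guesses, not tuned values; because the six observations are perfectly separable, the coefficients keep growing and the step count simply caps the search):

tox_features = toxicity[["Log_Dose"]].to_numpy(dtype=float)
tox_target = toxicity["Class"].to_numpy(dtype=float)
tox_betas = logistic_regression(tox_features, tox_target, 50000, 0.01,
                                add_intercept=True, verbose=False)
print(tox_betas)                      # [beta_0, beta_1]
print(-tox_betas[0] / tox_betas[1])   # log-dose where the modeled lethality probability crosses 0.5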

Produce a design matrix using numpy and the dataframe we read. Similarly, the class vector (of 0's and 1's) is constructed from the dataframe. These new objects have row-wise correspondence; that is, a row of designmatrix corresponds to the same row of classvector. In terms of the above expressions

\[\text{classvector} = Y_{obs} \]

\[\text{designmatrix} = \mathbf{X} \]

designmatrix = mydatabase[["X1","X2"]].to_numpy()
classvector = mydatabase["Class"].to_numpy()

These are supplied to our regression machine

betas = logistic_regression(designmatrix, classvector, 900, 0.00001, add_intercept=False, verbose=False)
initial weights =  [0. 0.]
initial log likelihood =  -13862.943611198905
Exit Tolerance Satisfied 
  LL is not changing
step =  621
ll =  -8139.368262867247
step =  621
ll =  -8139.368262867247
error signal =  [-0.4258073  -0.69253714 -0.32390763 ...  0.04174041  0.01879931
  0.0332575 ]
predictions =  [0.4258073  0.69253714 0.32390763 ... 0.95825959 0.98120069 0.9667425 ]
modelvector = sigmoid(np.dot(designmatrix, betas))
# plot the modeled probability against the observed class
plt.figure(figsize=(10, 8))
plt.scatter(classvector,modelvector, c="red", alpha=0.3)
plt.xlabel('Observed Class') 
plt.ylabel('Modeled Probability') 
plt.grid(which='both')
plt.show()
(Figure: modeled probability versus observed class for the homebrew fit)
modelvector[1]
0.6925371237744782
classvector[100]
0
sigmoid(np.dot([0,2], betas))
0.8509476740456239
sigmoid(np.dot([0,4], betas))
0.9702321502038639
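Finally, as a rough check of the homebrew fit as a whole, the modeled probabilities can be converted to hard class predictions and compared with the observed classes (a sketch, assuming modelvector and classvector from the cells above):

hardclass = np.where(modelvector >= 0.5, 1, 0)  # threshold the modeled probabilities at 0.5
print((hardclass == classvector).mean())        # fraction of the 20,000 observations classified correctly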