Processing Texas Database

Following the process described in Prediction of Bridge Component Ratings Using Ordinal Logistic Regression Model but for Texas data.

import warnings
warnings.filterwarnings('ignore')
# Playing with Texas Data
local_file_name='2021TexasNBIData.csv'
import pandas as pd
# Read the NBI Database
texasdb = pd.read_csv(local_file_name)
# Verify the Structure
texasdb.describe()
STATE_CODE_001 ROUTE_PREFIX_005B SERVICE_LEVEL_005C DIRECTION_005E HIGHWAY_DISTRICT_002 COUNTY_CODE_003 PLACE_CODE_004 CRITICAL_FACILITY_006B MIN_VERT_CLR_010 KILOPOINT_011 ... FEDERAL_LANDS_105 YEAR_RECONSTRUCTED_106 PERCENT_ADT_TRUCK_109 NATIONAL_NETWORK_110 PIER_PROTECTION_111 FUTURE_ADT_114 YEAR_OF_FUTURE_ADT_115 MIN_NAV_CLR_MT_116 SUBMITTED_BY DECK_AREA
count 64864.0 64861.000000 64863.000000 64802.000000 55179.000000 64864.000000 64864.000000 0.0 64864.000000 64627.000000 ... 55179.000000 55134.000000 64570.000000 64794.000000 1332.000000 55179.000000 55179.000000 654.000000 64864.000000 55179.000000
mean 48.0 3.308383 1.536192 0.055677 12.724932 241.292335 18365.083159 NaN 84.662018 14.325147 ... 0.001287 387.471143 9.593604 0.151573 1.081832 14738.009768 2032.348049 5.944190 48.065815 964.955676
std 0.0 1.524604 1.830507 0.409700 6.339940 148.401968 23984.559132 NaN 34.798024 81.534822 ... 0.066350 786.827517 10.983355 0.358609 0.389648 33669.125288 10.215096 52.747722 1.226490 2706.771529
min 48.0 1.000000 0.000000 0.000000 0.000000 1.000000 0.000000 NaN 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 1900.000000 0.000000 48.000000 22.570000
25% 48.0 2.000000 1.000000 0.000000 9.000000 113.000000 0.000000 NaN 99.990000 1.448000 ... 0.000000 0.000000 0.000000 0.000000 1.000000 500.000000 2033.000000 0.000000 48.000000 135.150000
50% 48.0 3.000000 1.000000 0.000000 13.000000 201.000000 1000.000000 NaN 99.990000 6.437000 ... 0.000000 0.000000 6.000000 0.000000 1.000000 3510.000000 2033.000000 0.000000 48.000000 313.600000
75% 48.0 4.000000 1.000000 0.000000 18.000000 373.000000 35000.000000 NaN 99.990000 19.195500 ... 0.000000 0.000000 14.000000 0.000000 1.000000 14000.000000 2034.000000 0.000000 48.000000 870.195000
max 48.0 8.000000 8.000000 4.000000 48.000000 507.000000 99315.000000 NaN 99.990000 9999.999000 ... 5.000000 2020.000000 99.000000 1.000000 5.000000 999999.000000 2086.000000 999.900000 75.000000 125732.720000

8 rows × 80 columns

Extract just culverts send to different object name (its probably just a HASH

We name our structure txculv (or whatever you want)

txculv = texasdb.loc[texasdb['STRUCTURE_TYPE_043B'] == 19]
txculv.describe()
STATE_CODE_001 ROUTE_PREFIX_005B SERVICE_LEVEL_005C DIRECTION_005E HIGHWAY_DISTRICT_002 COUNTY_CODE_003 PLACE_CODE_004 CRITICAL_FACILITY_006B MIN_VERT_CLR_010 KILOPOINT_011 ... FEDERAL_LANDS_105 YEAR_RECONSTRUCTED_106 PERCENT_ADT_TRUCK_109 NATIONAL_NETWORK_110 PIER_PROTECTION_111 FUTURE_ADT_114 YEAR_OF_FUTURE_ADT_115 MIN_NAV_CLR_MT_116 SUBMITTED_BY DECK_AREA
count 20551.0 20551.00000 20551.000000 20551.000000 20545.000000 20551.000000 20551.000000 0.0 20551.000000 20551.000000 ... 20545.000000 20520.000000 20498.000000 20551.000000 190.000000 20545.000000 20545.000000 111.000000 20551.000000 20545.000000
mean 48.0 3.25916 1.343390 0.019123 12.900949 248.910759 15100.965111 NaN 99.702754 13.756241 ... 0.001898 583.425146 11.671675 0.077855 1.021053 11016.450961 2032.816938 6.052252 48.116247 219.712841
std 0.0 1.29903 1.492621 0.247959 6.490255 155.890098 23840.797409 NaN 5.223589 72.832328 ... 0.080438 904.383997 12.044279 0.267950 0.204652 28829.037024 7.714370 41.045256 1.674647 284.001639
min 48.0 1.00000 0.000000 0.000000 0.000000 1.000000 0.000000 NaN 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1900.000000 0.000000 48.000000 24.790000
25% 48.0 3.00000 1.000000 0.000000 8.000000 113.000000 0.000000 NaN 99.990000 1.529000 ... 0.000000 0.000000 1.000000 0.000000 1.000000 530.000000 2033.000000 0.000000 48.000000 95.950000
50% 48.0 3.00000 1.000000 0.000000 14.000000 231.000000 0.000000 NaN 99.990000 6.851000 ... 0.000000 0.000000 9.000000 0.000000 1.000000 2450.000000 2033.000000 0.000000 48.000000 149.410000
75% 48.0 4.00000 1.000000 0.000000 18.000000 401.000000 27000.000000 NaN 99.990000 18.537000 ... 0.000000 1961.000000 18.000000 0.000000 1.000000 9030.000000 2033.000000 0.000000 48.000000 251.490000
max 48.0 8.00000 8.000000 4.000000 25.000000 507.000000 83745.000000 NaN 99.990000 9999.999000 ... 5.000000 2020.000000 90.000000 1.000000 3.000000 656470.000000 2052.000000 304.500000 74.000000 19993.960000

8 rows × 80 columns

list(texasdb)
['STATE_CODE_001',
 'STRUCTURE_NUMBER_008',
 'RECORD_TYPE_005A',
 'ROUTE_PREFIX_005B',
 'SERVICE_LEVEL_005C',
 'ROUTE_NUMBER_005D',
 'DIRECTION_005E',
 'HIGHWAY_DISTRICT_002',
 'COUNTY_CODE_003',
 'PLACE_CODE_004',
 'FEATURES_DESC_006A',
 'CRITICAL_FACILITY_006B',
 'FACILITY_CARRIED_007',
 'LOCATION_009',
 'MIN_VERT_CLR_010',
 'KILOPOINT_011',
 'BASE_HWY_NETWORK_012',
 'LRS_INV_ROUTE_013A',
 'SUBROUTE_NO_013B',
 'LAT_016',
 'LONG_017',
 'DETOUR_KILOS_019',
 'TOLL_020',
 'MAINTENANCE_021',
 'OWNER_022',
 'FUNCTIONAL_CLASS_026',
 'YEAR_BUILT_027',
 'TRAFFIC_LANES_ON_028A',
 'TRAFFIC_LANES_UND_028B',
 'ADT_029',
 'YEAR_ADT_030',
 'DESIGN_LOAD_031',
 'APPR_WIDTH_MT_032',
 'MEDIAN_CODE_033',
 'DEGREES_SKEW_034',
 'STRUCTURE_FLARED_035',
 'RAILINGS_036A',
 'TRANSITIONS_036B',
 'APPR_RAIL_036C',
 'APPR_RAIL_END_036D',
 'HISTORY_037',
 'NAVIGATION_038',
 'NAV_VERT_CLR_MT_039',
 'NAV_HORR_CLR_MT_040',
 'OPEN_CLOSED_POSTED_041',
 'SERVICE_ON_042A',
 'SERVICE_UND_042B',
 'STRUCTURE_KIND_043A',
 'STRUCTURE_TYPE_043B',
 'APPR_KIND_044A',
 'APPR_TYPE_044B',
 'MAIN_UNIT_SPANS_045',
 'APPR_SPANS_046',
 'HORR_CLR_MT_047',
 'MAX_SPAN_LEN_MT_048',
 'STRUCTURE_LEN_MT_049',
 'LEFT_CURB_MT_050A',
 'RIGHT_CURB_MT_050B',
 'ROADWAY_WIDTH_MT_051',
 'DECK_WIDTH_MT_052',
 'VERT_CLR_OVER_MT_053',
 'VERT_CLR_UND_REF_054A',
 'VERT_CLR_UND_054B',
 'LAT_UND_REF_055A',
 'LAT_UND_MT_055B',
 'LEFT_LAT_UND_MT_056',
 'DECK_COND_058',
 'SUPERSTRUCTURE_COND_059',
 'SUBSTRUCTURE_COND_060',
 'CHANNEL_COND_061',
 'CULVERT_COND_062',
 'OPR_RATING_METH_063',
 'OPERATING_RATING_064',
 'INV_RATING_METH_065',
 'INVENTORY_RATING_066',
 'STRUCTURAL_EVAL_067',
 'DECK_GEOMETRY_EVAL_068',
 'UNDCLRENCE_EVAL_069',
 'POSTING_EVAL_070',
 'WATERWAY_EVAL_071',
 'APPR_ROAD_EVAL_072',
 'WORK_PROPOSED_075A',
 'WORK_DONE_BY_075B',
 'IMP_LEN_MT_076',
 'DATE_OF_INSPECT_090',
 'INSPECT_FREQ_MONTHS_091',
 'FRACTURE_092A',
 'UNDWATER_LOOK_SEE_092B',
 'SPEC_INSPECT_092C',
 'FRACTURE_LAST_DATE_093A',
 'UNDWATER_LAST_DATE_093B',
 'SPEC_LAST_DATE_093C',
 'BRIDGE_IMP_COST_094',
 'ROADWAY_IMP_COST_095',
 'TOTAL_IMP_COST_096',
 'YEAR_OF_IMP_097',
 'OTHER_STATE_CODE_098A',
 'OTHER_STATE_PCNT_098B',
 'OTHR_STATE_STRUC_NO_099',
 'STRAHNET_HIGHWAY_100',
 'PARALLEL_STRUCTURE_101',
 'TRAFFIC_DIRECTION_102',
 'TEMP_STRUCTURE_103',
 'HIGHWAY_SYSTEM_104',
 'FEDERAL_LANDS_105',
 'YEAR_RECONSTRUCTED_106',
 'DECK_STRUCTURE_TYPE_107',
 'SURFACE_TYPE_108A',
 'MEMBRANE_TYPE_108B',
 'DECK_PROTECTION_108C',
 'PERCENT_ADT_TRUCK_109',
 'NATIONAL_NETWORK_110',
 'PIER_PROTECTION_111',
 'BRIDGE_LEN_IND_112',
 'SCOUR_CRITICAL_113',
 'FUTURE_ADT_114',
 'YEAR_OF_FUTURE_ADT_115',
 'MIN_NAV_CLR_MT_116',
 'FED_AGENCY',
 'SUBMITTED_BY',
 'BRIDGE_CONDITION',
 'LOWEST_RATING',
 'DECK_AREA']
txculv = txculv[['CULVERT_COND_062','YEAR_BUILT_027','ADT_029','DESIGN_LOAD_031','OPERATING_RATING_064','STRUCTURAL_EVAL_067',
                  'PERCENT_ADT_TRUCK_109','BRIDGE_CONDITION','YEAR_RECONSTRUCTED_106']].copy()
txculv.head()
CULVERT_COND_062 YEAR_BUILT_027 ADT_029 DESIGN_LOAD_031 OPERATING_RATING_064 STRUCTURAL_EVAL_067 PERCENT_ADT_TRUCK_109 BRIDGE_CONDITION YEAR_RECONSTRUCTED_106
0 7 2008 100.0 5 38.6 7 10.0 G 0.0
5 7 1990 15.0 0 36.3 7 NaN G 0.0
18 7 1930 5581.0 0 32.7 6 16.0 G 1962.0
19 6 1934 5104.0 2 40.8 6 17.0 F 1959.0
21 6 1967 4348.0 5 44.4 6 19.0 F 0.0
txculv.describe()
YEAR_BUILT_027 ADT_029 OPERATING_RATING_064 PERCENT_ADT_TRUCK_109 YEAR_RECONSTRUCTED_106
count 20551.000000 20551.000000 20476.000000 20498.000000 20520.000000
mean 1969.900151 8578.712715 39.604732 11.671675 583.425146
std 25.511268 21636.892122 9.133697 12.044279 904.383997
min 1914.000000 0.000000 0.000000 0.000000 0.000000
25% 1951.000000 400.000000 32.700000 1.000000 0.000000
50% 1967.000000 1950.000000 44.400000 9.000000 0.000000
75% 1990.000000 7101.500000 44.400000 18.000000 1961.000000
max 2020.000000 313693.000000 99.900000 90.000000 2020.000000
age = 2022 - txculv['YEAR_BUILT_027'] # age of culvert, surrogate for service life
drat = txculv['OPERATING_RATING_064'].fillna(1) # operating rating (in tons), nan is 1
drat = pd.to_numeric(drat) # coerce to numeric
steval = txculv['STRUCTURAL_EVAL_067'].fillna(1).replace('*','1') # structural eval, nan is 1 (not in original NBI coding tables)
steval = pd.to_numeric(steval) # coerce to numeric
adt =  txculv['ADT_029'].fillna(1) # 
adt = pd.to_numeric(adt) # coerce to numeric
pctrk =  txculv['PERCENT_ADT_TRUCK_109'].fillna(txculv['PERCENT_ADT_TRUCK_109'].mean()) # 
pctrk = pd.to_numeric(pctrk) # coerce to numeric


#df_filled2 = df.fillna(df.mean())
frame = { 'AGE': age, 'DRAT': drat, "ADT" :adt, "STEVAL":steval, "PCTRK":pctrk}
#frame = { 'AGE': age, 'DRAT': drat, "STEVAL":steval}
#Creating DataFrame by passing Dictionary
X = pd.DataFrame(frame)  # our design matrix

Now build our target

Prepare the target variable, from the culvert condition. The series has NaN and we convert these to a goofy string

pre_target = txculv['CULVERT_COND_062'].fillna('11').replace('N','10') # Nan to 11, "N" to 10 so all codes able to be numeric
print("Type ",type(pre_target[0]))
Type  <class 'str'>

then convert from string to numeric.

pre_target = pd.to_numeric(pre_target)
print("Type ",type(pre_target[0]))
Type  <class 'numpy.int64'>

Need a function to select the two conditions

def isok(value_int): # function to interpret condition rating and issue binary state 0 == fail 1 == OK
    cut = 3
    if value_int > cut:
        isok = 1
    elif value_int <= cut:
        isok = 0
    return(isok)

y = pre_target.apply(isok) # our target vector

Now check that our design matrix and target vector have correct structure

X.describe()
AGE DRAT ADT STEVAL PCTRK
count 20551.000000 20551.000000 20551.000000 20551.000000 20551.000000
mean 52.099849 39.463846 8578.712715 6.164372 11.671675
std 25.511268 9.409530 21636.892122 0.838454 12.028738
min 2.000000 0.000000 0.000000 0.000000 0.000000
25% 32.000000 32.700000 400.000000 6.000000 1.000000
50% 55.000000 44.400000 1950.000000 6.000000 9.000000
75% 71.000000 44.400000 7101.500000 7.000000 18.000000
max 108.000000 99.900000 313693.000000 9.000000 90.000000
y.describe()
count    20551.000000
mean         0.999465
std          0.023130
min          0.000000
25%          1.000000
50%          1.000000
75%          1.000000
max          1.000000
Name: CULVERT_COND_062, dtype: float64
# Now we can do some machine learning
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=0)
# import the class
from sklearn.linear_model import LogisticRegression

# instantiate the model (using the default parameters)
#logreg = LogisticRegression()
logreg = LogisticRegression()
# fit the model with data
logreg.fit(X_train,y_train)

#
y_pred=logreg.predict(X_test)
print(logreg.intercept_[0])
print(logreg.coef_)
#y.head()
0.08278239586830166
[[-1.03778439e-02 -1.30938676e-01  6.65067052e-05  2.17738437e+00
   6.97735441e-01]]
# import the metrics class
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_pred, y_test)
cnf_matrix
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Predicted label')
plt.xlabel('Actual label');
../../_images/TexasNBI_database_28_0.png
# import the class
from sklearn.linear_model import PoissonRegressor

# instantiate the model (using the default parameters)
#logreg = LogisticRegression()
posreg = PoissonRegressor()
# fit the model with data
posreg.fit(X_train,y_train)

#
y_pred=posreg.predict(X_test)
print(posreg.intercept_)
print(posreg.coef_)
#y.head()
-0.0005840932119201926
[0. 0. 0. 0. 0.]
y_pred
array([0.99941608, 0.99941608, 0.99941608, ..., 0.99941608, 0.99941608,
       0.99941608])

Structurally Deficient (SD): This term was previously defined in https://www.fhwa.dot.gov/bridge/0650dsup.cfm as having a condition rating of 4 or less for Item 58 (Deck), Item 59 (Superstructure), Item 60 (Substructure), or Item 62 (Culvert), OR having an appraisal rating of 2 or less for Item 67 (Structural Condition) or Item 71 (Waterway Adequacy) Beginning with the 2018 data archive, this term will be defined in accordance with the Pavement and Bridge Condition Performance Measures final rule, published in January of 2017, as a classification given to a bridge which has any component [Item 58, 59, 60, or 62] in Poor or worse condition [code of 4 or less].

This capacity rating, referred to as the operating rating, will result in the absolute maximum permissible load level to which the structure may be subjected for the vehicle type used in the rating. Code the operating rating as a 3-digit number to represent the total mass in metric tons of the entire vehicle measured to the nearest tenth of a metric ton (with an assumed decimal point).

#type(pre_target[0])

Prepare the target variable, the culvert condition as above (not processing just yet). The series has NaN and we convert these to a goofy string then convert from string to numeric.

Now make a proper target

#target