[ML_6] Prediction of Pima Diabetes using scikit-learn

2025. 7. 17. 07:04 · python/ML

The procedure is as follows:

 

  1. Introduction to the confusion matrix, precision, recall, F1 score, and ROC AUC
  2. Data Preprocessing
  3. Data Splitting (Train/Test)
  4. Model Training and Prediction
  5. Evaluation (We will focus specifically on evaluation metrics.) 

 

1. Introduction to Confusion Matrix, Precision, Recall, F1 Score, and ROC AUC

→ Confusion Matrix
This is a matrix composed of four quadrants: False Negative (FN), False Positive (FP), True Negative (TN), and True Positive (TP).
“False” indicates an incorrect prediction compared to the actual label, while “Negative” and “Positive” refer to the model’s predicted class.
We primarily focus on reducing FN and FP, as minimizing these errors is critical to improving model reliability.
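
As a quick check (a minimal sketch with made-up 0/1 labels, not the Pima data), scikit-learn's confusion_matrix returns these quadrants as [[TN, FP], [FN, TP]] for binary labels:

from sklearn.metrics import confusion_matrix

# Toy labels for illustration only (0 = negative, 1 = positive)
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0, 1, 1]

# For binary 0/1 labels the result is laid out as
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))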

 

→ Precision
The formula for precision is:
Precision = TP / (TP + FP)
Precision is especially important in scenarios where predicting a negative instance as positive (a false positive) would cause significant operational impact, for example flagging a legitimate email as spam.

 

→ Recall
The formula for recall is:
Recall = TP / (TP + FN)
Recall becomes crucial when predicting a positive instance as negative (a false negative) could lead to serious consequences, for example missing an actual disease case in a medical screening.

 

→ F1 Score
The F1 Score is the harmonic mean of precision and recall:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
A high F1 Score indicates that precision and recall are balanced and both performing well.
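
To make the relationship concrete, here is a minimal sketch on toy labels (illustrative values, not Pima results):

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0, 1, 1]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / (3 + 2) = 0.60
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / (3 + 1) = 0.75
f1 = f1_score(y_true, y_pred)                # 2 * 0.60 * 0.75 / (0.60 + 0.75) ≈ 0.667
print(precision, recall, f1)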

 

→ ROC AUC
The ROC curve plots the true positive rate (recall) against the false positive rate as the classification threshold varies, illustrating how well the model distinguishes between the two classes.
The closer the curve approaches the top-left corner, the better the performance.
The AUC (Area Under the Curve) measures the total area under this ROC curve; 0.5 corresponds to random guessing and 1.0 to a perfect classifier.
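
A minimal sketch of computing the ROC curve and AUC from predicted probabilities (toy values, not Pima results); note that roc_auc_score takes the probability of the positive class, not hard 0/1 predictions:

from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_score = [0.1, 0.4, 0.8, 0.35, 0.9, 0.2, 0.7, 0.6]  # predicted P(class = 1)

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points of the ROC curve
print(roc_auc_score(y_true, y_score))              # area under that curve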

 

2~5. Prediction of Pima diabetes

The data source is Kaggle (the Pima Indians Diabetes dataset).

import pandas as pd 

pima_df = pd.read_csv("pima_diabetes.csv") 
pima_df.info() 

#%% 
# Data preprocessing 
# Null check, feature-drop check, data-type check
# Scaling, train/test split
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score 

y_df = pima_df['Outcome'].values
X_df = pima_df.drop(['Outcome'], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, test_size=0.2, random_state=121)
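# (Optional sketch, an assumption not in the original: passing stratify=y_df keeps the
#  positive/negative ratio the same in the train and test splits, which is useful when
#  the Outcome classes are not balanced.)
# X_train, X_test, y_train, y_test = train_test_split(
#     X_df, y_df, test_size=0.2, random_state=121, stratify=y_df)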

#%% 
from sklearn.tree import DecisionTreeClassifier 
from sklearn.ensemble import RandomForestClassifier 
from sklearn.linear_model import LogisticRegression 
from sklearn.metrics import accuracy_score  

dtc_ml = DecisionTreeClassifier() 
rfc_ml = RandomForestClassifier()  # RandomForestClassifier showed the best performance in this run
lr_ml = LogisticRegression(solver = "liblinear") 
model_list = [dtc_ml, rfc_ml, lr_ml]
acc_list = []
def fit_predict(model ,X_train=X_train, X_test=X_test, y_train=y_train, y_test=y_test): 
    model.fit(X_train, y_train) 
    y_pred = model.predict(X_test) 
    acc = accuracy_score(y_test, y_pred) 
    acc_list.append(acc) 

for model in model_list: 
    fit_predict(model) 
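
# (A small sketch, not in the original: print each model's test accuracy so the three
#  classifiers can be compared at a glance; acc_list is filled by the loop above in the
#  same order as model_list.)
for model, acc in zip(model_list, acc_list):
    print(f"{model.__class__.__name__}: accuracy = {acc:.4f}")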


#%% 
from sklearn.ensemble import RandomForestClassifier 
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, roc_curve, precision_recall_curve
import numpy as np 
import matplotlib.pyplot as plt 
%matplotlib inline
# Evaluate the random forest model fitted in the previous cell
y_pred = rfc_ml.predict(X_test)
y_pred_proba = rfc_ml.predict_proba(X_test)[:, 1]  # probability of the positive class

def evaluation_model(y_test, y_pred, pred_proba) :
    confusion = confusion_matrix(y_test,y_pred)
    accuracy = accuracy_score(y_test, y_pred) 
    precision = precision_score(y_test, y_pred) 
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, pred_proba) 
    print(f"Confusion matrix : {confusion}") 
    print(f"accuracy : {accuracy}, precision : {precision} , recall : {recall}, roc_auc : {roc_auc}") 

def plot_the_precision_recall(y_test, pred_proba_c1): 
    precision, recall, threshold = precision_recall_curve(y_test, pred_proba_c1)
    
    plt.figure(figsize = (8,6)) 
    threshold_boundary = threshold.shape[0] 
    plt.plot(threshold, precision[0:threshold_boundary], linestyle = '--', label = 'precision') 
    plt.plot(threshold, recall[0:threshold_boundary], label = 'recall') 
    
    start, end = plt.xlim() 
    plt.xticks(np.round(np.arange(start,end,0.1),2)) 
    
    plt.xlabel("Threshold value"); plt.ylabel('Precision and Recall curve') 
    plt.legend(); plt.grid() 
    plt.show()
    
    
    
def plot_the_roc_curve(y_test, y_pred_proba_c1): 
    fpr, tpr, threshold = roc_curve(y_test, y_pred_proba_c1) 
    
    plt.plot(fpr, tpr, label = "ROC") 
    plt.plot([0,1],[0,1],'k--',label = 'Random')
    start,end = plt.xlim()
    plt.xticks(np.round(np.arange(start,end,0.1),2))
    plt.xlim(0,1) ; plt.ylim(0,1) 
    plt.xlabel("FPR(1-specificity)") ; plt.ylabel("TPR(Recall)") 
    plt.legend()
    plt.show()

evaluation_model(y_test, y_pred, y_pred_proba)
plot_the_roc_curve(y_test, y_pred_proba)
plot_the_precision_recall(y_test, y_pred_proba)
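
Finally, the precision-recall plot can guide the choice of a decision threshold other than the default 0.5. A minimal sketch of re-evaluating the random forest predictions at a custom threshold (0.45 is only an illustrative value, not a tuned result):

# Turn positive-class probabilities into 0/1 labels at a custom threshold
custom_threshold = 0.45  # illustrative value; pick it from the precision-recall plot above
y_pred_custom = (y_pred_proba >= custom_threshold).astype(int)
evaluation_model(y_test, y_pred_custom, y_pred_proba)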