[ML_5] Predicting Titanic Survival with scikit-learn

2025. 7. 15. 06:30 · python/ML

Today, we are going to build a model using scikit-learn.

Our workflow is as follows:

  • Data Preprocessing: In this step, we will preprocess the dataset, e.g., handle missing values, perform feature selection, and apply label encoding.
  • Splitting the Data: Next, we will divide the data into training and test sets.
  • Model Training: Here, we will train the models on the training data. We will also demonstrate cross-validation with functions such as cross_val_score and GridSearchCV.
  • Model Evaluation: Finally, we will assess each model's performance on the test set.

 

1. Data Preprocessing

We are using the Titanic dataset, which we obtained from Kaggle.
Let’s take a look at its structure.

<data structure of titanic_csv>
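If you want to reproduce this overview yourself, a minimal sketch (assuming the file is saved as titanic_train.csv in the working directory, as in the code below) is:

import pandas as pd

# Load the Kaggle training file and inspect its structure
titanic_df = pd.read_csv("titanic_train.csv")
titanic_df.info()           # column names, dtypes, non-null counts
print(titanic_df.head(3))   # first few rows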

There are many features in this dataset, but we don’t need all of them. For example, the PassengerId feature is not significant for predicting survival, right? So this feature should be removed.

For this reason, the PassengerId, Name, and Ticket features will be dropped.

Next, we will process the string-type columns, starting with the ones whose values are too fine-grained. Look at the Cabin feature, for example: many distinct cabin numbers (such as 'C85' and 'C123') share the same deck letter, so keeping the raw values would create far too many categories. To handle this, we can use the Series string accessor (.str) and keep only the first character:

df['Cabin'].str[:1]
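As a quick illustration of what this slice does (a toy Series, not the real column):

import pandas as pd

cabins = pd.Series(['C85', 'B42', 'C123', None])
print(cabins.str[:1])   # 'C', 'B', 'C', NaN -- only the deck letter remains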

After that, we will handle the missing values. The Age, Cabin, and Embarked columns contain nulls: we replace the missing Age values with the column mean and fill the missing Cabin and Embarked values with 'N'. In the code below, Fare is also filled with 0, just as a safeguard.

 

Lastly, we will perform label encoding using scikit-learn’s LabelEncoder.
First, we initialize a LabelEncoder instance, and then sequentially apply fit and transform to encode the categorical values.
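As a minimal sketch of how fit and transform behave, here is a toy example (not the actual Titanic column):

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['S', 'C', 'Q', 'S'])            # learn the unique categories
print(le.classes_)                      # ['C' 'Q' 'S']
print(le.transform(['S', 'C', 'Q']))    # [2 0 1]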

#%%
# Preprocessing: fill null values, drop unneeded columns, label-encode categorical features
import pandas as pd
from sklearn.preprocessing import LabelEncoder
 
titanic_df = pd.read_csv("titanic_train.csv")

def null_processing(df):
    # Fill missing values: column mean for Age, 'N' for Cabin/Embarked, 0 for Fare
    df['Age'] = df['Age'].fillna(df['Age'].mean())
    df['Cabin'] = df['Cabin'].fillna('N')
    df['Embarked'] = df['Embarked'].fillna('N')
    df['Fare'] = df['Fare'].fillna(0)
    return df

def drop_features(df):
    # Drop identifier-like columns that carry no predictive signal
    df = df.drop(['PassengerId', 'Name', 'Ticket'], axis=1)
    return df

def format_features(df):
    # Keep only the deck letter of Cabin, then label-encode the categorical columns
    df['Cabin'] = df['Cabin'].str[:1]
    features = ['Cabin', 'Sex', 'Embarked']
    for feature in features:
        le = LabelEncoder()
        le = le.fit(df[feature])
        df[feature] = le.transform(df[feature])

    return df

def transform_df(df):
    # Apply the full preprocessing pipeline
    df = null_processing(df)
    df = drop_features(df)
    df = format_features(df)

    return df

y_titanic_df = titanic_df['Survived'] 
X_titanic_df = titanic_df.drop('Survived', axis = 1) 
X_titanic_df = transform_df(X_titanic_df)
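As a quick sanity check (my own addition, not part of the original pipeline), we can confirm that the features are now fully numeric and contain no nulls:

# Verify the preprocessing result
print(X_titanic_df.isnull().sum())   # should be all zeros
print(X_titanic_df.dtypes)           # should be numeric only
print(X_titanic_df.head(3))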

 

2. Splitting the Data 

 Great, we have completed the data preprocessing.

The next step is to split the dataset.

This step allows us to create training and test sets so that we can train the model and later evaluate how well it performs.

from sklearn.model_selection import train_test_split 

# Hold out 20% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(X_titanic_df, y_titanic_df, test_size=0.2, random_state=11)
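Since the two Survived classes are not evenly balanced, you may also want to preserve the class ratio in both splits. A hedged variation using the stratify parameter (shown with separate variable names; the rest of this post keeps the plain split above):

# Optional: keep the survived/not-survived ratio identical in both splits
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X_titanic_df, y_titanic_df, test_size=0.2, random_state=11, stratify=y_titanic_df
)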

 

3. Model Training

Let’s start training our models.

In this section, we will use three types of models: DecisionTreeClassifier, RandomForestClassifier, and LogisticRegression. 

from sklearn.tree import DecisionTreeClassifier 
from sklearn.ensemble import RandomForestClassifier 
from sklearn.linear_model import LogisticRegression 
from sklearn.metrics import accuracy_score 

dt_model = DecisionTreeClassifier() 
rf_model = RandomForestClassifier() 
lr_model = LogisticRegression(solver= "liblinear") 

model_list = [dt_model, rf_model, lr_model] 
for model in model_list: 
    model.fit(X_train,y_train) 
    model_pred = model.predict(X_test) 
    print(f"{model}_acc_score : {accuracy_score(y_test,model_pred)}")

 

4. KFold, cross_val_score(), and GridSearchCV

What about cross-validation?

This method splits the data into multiple folds and repeatedly uses different subsets for training and validation. Because every observation serves as validation data exactly once, it helps us detect overfitting and gives a more reliable estimate of the model's performance than a single train/test split.

#%%
from sklearn.model_selection import KFold 
import numpy as np 

def exec_Kfold(model, fold=5):
    # Manual K-fold cross-validation over the full preprocessed dataset
    kfold = KFold(n_splits=fold)
    scores = [] 
    
    for iter_num, (train_idx, test_idx) in enumerate(kfold.split(X_titanic_df)): 
        X_train, X_test = X_titanic_df.values[train_idx], X_titanic_df.values[test_idx] 
        y_train, y_test = y_titanic_df.values[train_idx], y_titanic_df.values[test_idx] 
        
        model.fit(X_train, y_train) 
        y_pred = model.predict(X_test) 
        accuracy = accuracy_score(y_test, y_pred) 
        scores.append(accuracy) 
        print(f"cv{iter_num} : {accuracy}") 
        
    mean_score = np.mean(scores) 
    print(f"avg_acc : {mean_score}") 
    

exec_Kfold(lr_model, fold = 5)
from sklearn.model_selection import cross_val_score 

scores = cross_val_score(lr_model, X_titanic_df, y_titanic_df, cv = 5 ) 
print(scores)
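One detail worth knowing: for classifiers, cross_val_score uses stratified folds by default, while exec_Kfold above uses plain KFold, so the two sets of scores will usually differ slightly. To make the splitting strategy explicit, you can pass a StratifiedKFold splitter yourself:

from sklearn.model_selection import StratifiedKFold

# Stratified folds keep the class ratio in every split
skf = StratifiedKFold(n_splits=5)
strat_scores = cross_val_score(lr_model, X_titanic_df, y_titanic_df, cv=skf)
print(strat_scores, strat_scores.mean())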
#%%
from sklearn.model_selection import GridSearchCV

parameters = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l2'],
    'solver': ['liblinear'],
    'max_iter': [100, 200, 500]
}

grid_lr = GridSearchCV(lr_model, param_grid=parameters, scoring='accuracy', cv=5)
grid_lr.fit(X_train, y_train)

print(f"grid_lr_best_params : {grid_lr.best_params_}")
print(f"grid_lr_best_score : {grid_lr.best_score_}")

best_lr_model = grid_lr.best_estimator_

predictions = best_lr_model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"best_lr_test_acc : {accuracy}")
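If you also want to see how every parameter combination performed, not just the best one, grid_lr.cv_results_ holds the full table; a quick way to view it (assuming pandas is still imported as pd):

# Cross-validation results for every parameter combination
cv_results = pd.DataFrame(grid_lr.cv_results_)
print(cv_results[['params', 'mean_test_score', 'rank_test_score']].sort_values('rank_test_score'))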