Find the best data preparation method and model using a pipeline

Yannawut Kimnaruk
May 25, 2022
Source: https://unsplash.com/photos/xNdPWGJ6UCQ?utm_source=unsplash&utm_medium=referral&utm_content=creditShareLink

In the previous article, I wrote about a simple implementation of a pipeline for machine learning model training, which can make your code neater. You can read it at the link below.

A pipeline not only makes your code tidier; it can also help with the hyperparameter tuning and data preparation process.

📔 Content in this article

  • Find the changeable pipeline parameters
  • Find the best hyperparameter sets: Add a pipeline to Grid Search
  • Find the best data preparation method: Skip a step in a pipeline
  • Find the best hyperparameter sets and the best data preparation method

🛣️ The Pipeline

Referring to the previous article, this is the pipeline:

# sets of columns to be transformed in different ways
num_cols = ['city_development_index', 'relevent_experience', 'experience',
            'last_new_job', 'training_hours']
cat_cols = ['gender', 'enrolled_university', 'education_level',
            'major_discipline', 'company_size', 'company_type']

# Create pipelines for numerical and categorical features
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.pipeline import Pipeline

num_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', MinMaxScaler())
])
cat_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('one-hot', OneHotEncoder(handle_unknown='ignore', sparse=False))
])

# Create ColumnTransformer to apply pipeline for each column type
from sklearn.compose import ColumnTransformer

col_trans = ColumnTransformer(transformers=[
    ('num_pipeline', num_pipeline, num_cols),
    ('cat_pipeline', cat_pipeline, cat_cols)
],
    remainder='drop',
    n_jobs=-1)

# Add a model to a final pipeline, clf_pipeline
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=0)
clf_pipeline = Pipeline(steps=[
    ('col_trans', col_trans),
    ('model', clf)
])

In this pipeline, clf_pipeline (the final pipeline) contains col_trans (a ColumnTransformer) and a logistic regression model.

In col_trans, there are num_pipeline and cat_pipeline, which transform the numerical and categorical features respectively.

🔍 Find the changeable pipeline parameters

First, let’s see the list of parameters that can be adjusted.
Syntax: pipeline_name.get_params()

clf_pipeline.get_params()

The result can be very long. Take a deep breath and continue reading.

The first part is just about the steps of the pipeline.

Below the first part is what we are interested in, a list of parameters that we can adjust.

The format is step1__step2__…__parameter (note the double underscores between levels).

For example, col_trans__cat_pipeline__one-hot__sparse means the sparse parameter of the one-hot step.
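If the full dump is overwhelming, you can filter the keys yourself. A small sketch listing only the parameters related to the one-hot step:

# get_params() returns a dict; iterate over its keys
# and keep only the ones mentioning a specific step
for name in clf_pipeline.get_params():
    if 'one-hot' in name:
        print(name)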

You can change parameters directly using set_params.

clf_pipeline.set_params(model__C=10)

➕ Find the best hyperparameter sets: Add a pipeline to Grid Search

Grid Search is a method for hyperparameter tuning: it exhaustively tries the specified parameter sets and finds the one that yields the highest cross-validated accuracy.

1. Set tuning parameters and their range.

Create a dictionary of tuning parameters (hyperparameters):

{'parameter_name': [possible values], …}

In this example, I want to find the best penalty type and C of a logistic regression model.

import numpy as np

grid_params = {'model__penalty': ['none', 'l2'],
               'model__C': np.logspace(-4, 4, 20)}

2. Add the pipeline to Grid Search

Syntax: GridSearchCV(estimator, param_grid, …)

Our pipeline has a model step as the final step, so we can input the pipeline directly to the GridSearchCV function.

from sklearn.model_selection import GridSearchCV

gs = GridSearchCV(clf_pipeline, grid_params, cv=5, scoring='accuracy')
gs.fit(X_train, y_train)

print("Best Score of train set: " + str(gs.best_score_))
print("Best parameter set: " + str(gs.best_params_))
print("Test Score: " + str(gs.score(X_test, y_test)))

Result of Grid Search

After setting up the grid search, you can fit it to the data and inspect the results.

  • .fit: fits the model, trying every parameter set in the tuning dictionary
  • .best_score_: the highest mean cross-validated accuracy across all parameter sets
  • .best_params_: the parameter set that yields the best score
  • .score(X_test, y_test): the score of the best model on the test set (you can also retrieve that model directly, as shown below)
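GridSearchCV also keeps the best pipeline itself, refitted on the whole training set (when refit=True, the default). A short sketch of reusing it:

# The pipeline refitted with the best parameter set
best_model = gs.best_estimator_

# Use it like any fitted pipeline
y_pred = best_model.predict(X_test)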

You can read more about GridSearchCV in the scikit-learn documentation.

⏩ Find the best data preparation method: Skip a step in a pipeline

Finding the best data preparation method can be difficult without a pipeline, since you would have to create separate variables for every data transformation case.

With a pipeline, we can create the data transformation steps inside the pipeline and perform a grid search to find the best one. The grid search will select which step to skip and compare the result of each case.
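Skipping works because scikit-learn lets you replace any named Pipeline step with the string 'passthrough', which turns that step into a no-op. A minimal sketch using num_pipeline from above:

# Replacing a named step with 'passthrough' disables it;
# grid search uses the same mechanism when skipping steps.
num_pipeline.set_params(scale='passthrough')   # MinMaxScaler is now skipped
num_pipeline.set_params(scale=MinMaxScaler())  # restore the original step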

Adjust the current pipeline a little

I want to know which scaling method will work best for my data: MinMaxScaler or StandardScaler.

I added a StandardScaler step to num_pipeline; the rest is unchanged.

from sklearn.preprocessing import StandardScaler

num_pipeline2 = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),
    ('minmax_scale', MinMaxScaler()),
    ('std_scale', StandardScaler()),
])
col_trans2 = ColumnTransformer(transformers=[
    ('num_pipeline', num_pipeline2, num_cols),
    ('cat_pipeline', cat_pipeline, cat_cols)
],
    remainder='drop',
    n_jobs=-1)
clf_pipeline2 = Pipeline(steps=[
    ('col_trans', col_trans2),
    ('model', clf)
])

Grid search

In the grid search parameters, specify the steps you want to skip by setting their value to 'passthrough'.

Since MinMaxScaler and StandardScaler should not be applied at the same time, I will use a list of dictionaries as the grid search parameters.

[{case 1},{case 2}]

With a list of dictionaries, grid search tries every parameter combination in case 1 until complete, then every combination in case 2. Therefore, there is no case where MinMaxScaler and StandardScaler are used together.

grid_step_params = [{'col_trans__num_pipeline__minmax_scale': ['passthrough']},
                    {'col_trans__num_pipeline__std_scale': ['passthrough']}]
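To sanity-check what will actually be tried, scikit-learn's ParameterGrid expands a list of dictionaries the same way GridSearchCV does. A quick sketch:

from sklearn.model_selection import ParameterGrid

# Each dictionary is expanded independently, so the two
# scalers are never skipped (or kept) in the same candidate
for candidate in ParameterGrid(grid_step_params):
    print(candidate)
# {'col_trans__num_pipeline__minmax_scale': 'passthrough'}
# {'col_trans__num_pipeline__std_scale': 'passthrough'}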

Perform Grid Search and print the results (like normal grid search).

gs2 = GridSearchCV(clf_pipeline2, grid_step_params, scoring='accuracy')
gs2.fit(X_train, y_train)

print("Best Score of train set: " + str(gs2.best_score_))
print("Best parameter set: " + str(gs2.best_params_))
print("Test Score: " + str(gs2.score(X_test, y_test)))

The best case is minmax_scale: 'passthrough' (MinMaxScaler is skipped), so StandardScaler is the better scaling method for this data.

💥 Find the best hyperparameter sets and the best data preparation method

You can find the best hyperparameter set and the best data preparation method at the same time by adding the tuning parameters to the dictionary of each data preparation case.

grid_params = {'model__penalty': ['none', 'l2'],
               'model__C': np.logspace(-4, 4, 20)}

grid_step_params2 = [{**{'col_trans__num_pipeline__minmax_scale': ['passthrough']}, **grid_params},
                     {**{'col_trans__num_pipeline__std_scale': ['passthrough']}, **grid_params}]

grid_params is added to both case 1 (skip MinMaxScaler) and case 2 (skip StandardScaler).

You can merge dictionaries using the syntax below:
merge_dict = {**dict_1, **dict_2}
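For instance, a tiny self-contained example:

dict_1 = {'a': 1}
dict_2 = {'b': 2}
merge_dict = {**dict_1, **dict_2}
print(merge_dict)  # {'a': 1, 'b': 2}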

Perform Grid Search and print the results (like normal grid search).

gs3 = GridSearchCV(clf_pipeline2, grid_step_params2, scoring='accuracy')
gs3.fit(X_train, y_train)

print("Best Score of train set: " + str(gs3.best_score_))
print("Best parameter set: " + str(gs3.best_params_))
print("Test Score: " + str(gs3.score(X_test, y_test)))

You can find the best parameter set using .best_params_. Since it contains minmax_scale: 'passthrough', StandardScaler is again the best scaling method for this data.

All grid search cases can be shown using .cv_results_:

import pandas as pd

pd.DataFrame(gs3.cv_results_)

There are 80 cases in this example (2 penalty types × 20 values of C × 2 scaling cases). The results include the running time and accuracy of each case, which is worth inspecting, since sometimes we may prefer the fastest model with acceptable accuracy over the one with the highest accuracy.
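For example, a sketch of picking a fast model among those close to the best accuracy (mean_fit_time, mean_test_score, and params are standard cv_results_ columns):

import pandas as pd

results = pd.DataFrame(gs3.cv_results_)

# Keep candidates within 1% of the best mean test accuracy,
# then sort by mean fit time to surface fast-but-good models
tolerance = 0.01
good = results[results['mean_test_score'] >= results['mean_test_score'].max() - tolerance]
print(good.sort_values('mean_fit_time')[['params', 'mean_fit_time', 'mean_test_score']])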

Conclusion

A pipeline can ease the hyperparameter tuning and data preparation process. By combining a pipeline with grid search, you can define grid search parameters that explore all the cases of interest and find the best one.

What’s next?

Finding the best machine learning model sometimes requires custom transformation functions, which I will write about in the next article. See you.
