Neat data preprocessing with Pipeline and ColumnTransformer

Yannawut Kimnaruk
5 min read · May 22, 2022

❓ Why pipeline and column transformer?

When working on a machine learning project, the most tedious step is often data cleaning and preprocessing. Especially when you work in a Jupyter Notebook, running code spread across many cells can get confusing.


Before training a model, data should be split into a training set and a test set. Each set passes through the same data cleaning and preprocessing steps before entering the machine learning model. Writing repetitive code for the training set and the test set is not efficient. This is when a pipeline comes into play.

Pipeline and ColumnTransformer are elegant ways to create a data preprocessing workflow.

First of all, imagine having a single pipeline into which you can feed any data, and that data will be transformed into the appropriate format before model training or prediction. It shortens your code and makes it easier to read and adjust.

Let’s start coding!!

💽 Dataset

The data I used comes from

You can find my article about the data exploration of this data set at the link below.

In short, this data set contains information about job candidates and whether they decided to change jobs.

Objective: Predict whether a candidate will change jobs based on their information (Classification task).

🛣️ Pipeline and ColumnTransformer

There is a big difference between Pipeline and ColumnTransformer that you must understand.

Pipeline: Use for multiple transformations of the same columns.

ColumnTransformer: Use to transform each column set differently.

⚠️ The ColumnTransformer does not transform step by step: it applies each transformer to its own set of columns separately and concatenates the results afterwards.
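To make this concrete, here is a minimal sketch on a made-up two-column dataframe (the column names 'hours' and 'city' are hypothetical, not from the dataset below): each transformer only sees its own columns, and the outputs are joined column-wise at the end.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Made-up toy dataframe, only to illustrate the behaviour (not the real dataset)
toy = pd.DataFrame({'hours': [2, 4, 8], 'city': ['A', 'B', 'A']})

ct = ColumnTransformer(transformers=[
    ('scale_num', MinMaxScaler(), ['hours']),   # sees only the 'hours' column
    ('encode_cat', OneHotEncoder(), ['city'])   # sees only the 'city' column
])

out = ct.fit_transform(toy)
print(out.shape)  # (3, 3): 1 scaled column + 2 one-hot columns, joined side by side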

🗺️ Data Preprocessing plan

Please note that I skip some of the categorical feature encoding for the simplicity of this article.

Step 1: Define the sets of columns to be transformed in different ways

Numerical and categorical features should be transformed in different ways, so I define num_cols for the numerical columns and cat_cols for the categorical columns.

num_cols = ['city_development_index','relevent_experience', 'experience','last_new_job', 'training_hours']
cat_cols = ['gender', 'enrolled_university', 'education_level', 'major_discipline', 'company_size', 'company_type']

Step 2: Split data to train and test sets

Split 20 percent of the data into a test set.

from sklearn.model_selection import train_test_split

X = df[num_cols+cat_cols]
y = df['target']
# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

I will fit the pipeline on the training set and then apply the fitted pipeline to the test set, to prevent data leakage from the test set into the model.

Step 3: Create pipelines for numerical and categorical features

Syntax of the pipeline is

Pipeline(steps = [(‘step name’, transform function), …])

For numerical features, I perform:
1. SimpleImputer to fill missing values with the mean of that column.
2. MinMaxScaler to scale values to the range 0 to 1 (this can affect regression performance).

For categorical features, I perform:
1. SimpleImputer to fill missing values with the most frequent value of that column.
2. OneHotEncoder to split each category into separate numerical columns for model training. (handle_unknown='ignore' is specified to prevent errors when an unseen category is found in the test set; see the short sketch after the pipeline code below.)

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.pipeline import Pipeline
num_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', MinMaxScaler())
])
cat_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('one-hot', OneHotEncoder(handle_unknown='ignore', sparse=False))
])
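As a quick check of what handle_unknown='ignore' means, here is a minimal sketch with made-up values (not the real dataset): a category seen only at prediction time is encoded as all zeros instead of raising an error.

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up values, only to show the handle_unknown behaviour
enc = OneHotEncoder(handle_unknown='ignore', sparse=False)  # newer scikit-learn versions use sparse_output=False
enc.fit(np.array([['Male'], ['Female']]))      # only two categories seen during fit
print(enc.transform(np.array([['Other']])))    # unseen category -> [[0. 0.]] instead of an error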

Step 4: Create ColumnTransformer to apply pipeline for each column set

Syntax of the ColumnTransformer is

ColumnTransformer(transformers=[(‘step name’, transform function,cols), …])

Pass numerical columns through the numerical pipeline and pass categorical columns through the categorical pipeline created in step 3.

remainder=’drop’ is specified to ignore other columns in a dataframe.

n_jobs=-1 means using all processors to run the transformations in parallel.

from sklearn.compose import ColumnTransformer

col_trans = ColumnTransformer(
    transformers=[
        ('num_pipeline', num_pipeline, num_cols),
        ('cat_pipeline', cat_pipeline, cat_cols)
    ],
    remainder='drop',
    n_jobs=-1)
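Before adding a model, you can sanity-check the ColumnTransformer by fitting it on the training set alone and inspecting the output shape. This is a small sketch; the exact column count depends on how many one-hot categories appear in the data.

# Fit the transformer on the training set only and inspect the result
X_train_trans = col_trans.fit_transform(X_train)
print(X_train_trans.shape)  # (rows in X_train, numerical columns + one-hot columns)
# In recent scikit-learn versions (roughly 1.1+), col_trans.get_feature_names_out()
# also lists the generated column names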

Step 5: Add a model to a final pipeline

I use the logistic regression model in this example.

Create a new pipeline that combines the ColumnTransformer from step 4 with the logistic regression model. I use a Pipeline here because the entire dataframe must pass through the ColumnTransformer step and then the modeling step, in that order.

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=0)

clf_pipeline = Pipeline(steps=[
    ('col_trans', col_trans),
    ('model', clf)
])

Step 6: Display pipeline

display(pipeline name)

from sklearn import set_config
set_config(display='diagram')
display(clf_pipeline)
Displayed pipeline

You can click on the displayed diagram to see the details of each step. How convenient!!

Expanded displayed pipeline

Step 7: Pass data through Pipeline

pipeline.fit: Pass data through the pipeline; it fits each transformer and also fits the model.

pipeline.predict: Use the model trained during pipeline.fit to predict on new data.

pipeline.score: Get the score of the model in the pipeline (the accuracy of the logistic regression in this example).

clf_pipeline.fit(X_train, y_train)
# preds = clf_pipeline.predict(X_test)
score = clf_pipeline.score(X_test, y_test)
print(f"Model score: {score}") # accuracy

(Optional) Step 8: Save the pipeline

Use the joblib library to save the pipeline for later use, so you don't need to create and fit the pipeline again. When you want to use the saved pipeline, just load the file with joblib.load.

import joblib

# Save pipeline to file "pipe.joblib"
joblib.dump(clf_pipeline,"pipe.joblib")
# Load pipeline when you want to use
same_pipe = joblib.load("pipe.joblib")

Conclusion

You can implement a pipeline from the data cleaning step to the modeling step to make your code neater. By displaying the pipeline, you can also easily visualize how you built the model.

Any feedback is welcomed!!
