Get column name after fitting the machine learning pipeline

Yannawut Kimnaruk
May 24, 2022 · 5 min read
Picture by Henry & Co. from Pexels

In the previous article, I wrote about a simple pipeline implementation for machine learning model training that makes your code neater. You can read it at the link below.

The issue when implementing a pipeline is that the pipeline returns an array without column headers, which makes the model difficult to interpret and improve.

In this article, I will share my solution and show an example chart of model coefficients.

📔 Content in this article

  1. Call a step in the pipeline
  2. Get the column name
  3. Plotting coefficients of features

🛣️ The Pipeline

Referring to the previous article, this is the pipeline.

# Sets of columns to be transformed in different ways
num_cols = ['city_development_index', 'relevent_experience', 'experience', 'last_new_job', 'training_hours']
cat_cols = ['gender', 'enrolled_university', 'education_level', 'major_discipline', 'company_size', 'company_type']

# Create pipelines for numerical and categorical features
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.pipeline import Pipeline

num_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', MinMaxScaler())
])
cat_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('one-hot', OneHotEncoder(handle_unknown='ignore', sparse=False))
])

# Create ColumnTransformer to apply a pipeline to each column type
from sklearn.compose import ColumnTransformer

col_trans = ColumnTransformer(transformers=[
    ('num_pipeline', num_pipeline, num_cols),
    ('cat_pipeline', cat_pipeline, cat_cols)
],
    remainder='drop',
    n_jobs=-1)

# Add a model to a final pipeline, clf_pipeline
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=0)
clf_pipeline = Pipeline(steps=[
    ('col_trans', col_trans),
    ('model', clf)
])

In this pipeline, the clf_pipeline (the final pipeline) contains col_trans (ColumnTransformer) and a logistic regression model.

Inside col_trans are num_pipeline and cat_pipeline, which transform the numerical features and categorical features respectively.

🌲 1. Call a step in the pipeline

Calling a step in the pipeline can be tricky. It is like climbing a tree branch by branch.

Start with 2 attributes you should remember:

named_steps : Call a step in a pipeline

named_transformers_ : Call a step in a ColumnTransformer

If you don’t know the difference between a pipeline and a ColumnTransformer, please take 5 minutes to review my previous article (link above).
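As a toy analogy, plain dictionaries can stand in for the real scikit-learn objects (in real scikit-learn, named_steps and named_transformers_ are attributes on fitted objects, not dictionary keys, but the nesting is the same idea):

```python
# Toy analogy: plain dicts standing in for Pipeline / ColumnTransformer objects.
# The strings stand in for the actual estimator instances.
pipeline = {
    "named_steps": {
        "col_trans": {                       # the ColumnTransformer
            "named_transformers_": {
                "cat_pipeline": {            # pipeline inside the ColumnTransformer
                    "named_steps": {
                        "one-hot": "OneHotEncoder",   # the step we want
                    }
                }
            }
        },
        "model": "LogisticRegression",
    }
}

# "Climb the tree" one level at a time:
step = (pipeline["named_steps"]["col_trans"]
        ["named_transformers_"]["cat_pipeline"]
        ["named_steps"]["one-hot"])
print(step)
```

The real attribute chain in the example below follows exactly this path.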

Example

I would like to access the OneHotEncoder step named 'one-hot' to call its get_feature_names function and get the feature names after one-hot encoding.

This is the code.

clf_pipeline.named_steps["col_trans"].named_transformers_["cat_pipeline"].named_steps["one-hot"].get_feature_names(cat_cols)

Code explanation:

  • Start with clf_pipeline (the final pipeline)
  • .named_steps calls “col_trans”, which is a ColumnTransformer
  • .named_transformers_ calls “cat_pipeline”, which is a pipeline inside “col_trans”
  • The second .named_steps calls “one-hot”, which is the desired step
  • .get_feature_names is like calling this function directly on the OneHotEncoder

🤏 2. Get the column name

The difficult part of pipeline implementation is getting column names, since a pipeline returns an array without column headers.

The solution is to track a pipeline and column transformer process.

Pipeline : transforms data step by step, in the order given in the steps argument.

ColumnTransformer : transforms data in parallel and concatenates the results, in the order given in the transformers argument.

Most data transformation functions don’t change column names (or the number of columns). Therefore, you only have to focus on the few functions that change the number of columns, such as OneHotEncoder.
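To see why OneHotEncoder changes the number of columns, here is a minimal plain-Python sketch. The helper one_hot_feature_names is hypothetical, but it mirrors the “column name_category value” naming that get_feature_names produces:

```python
# Hypothetical helper mirroring OneHotEncoder's output naming:
# one input column becomes one output column per category.
def one_hot_feature_names(column, categories):
    return [f"{column}_{c}" for c in categories]

names = one_hot_feature_names("gender", ["Female", "Male", "Other"])
print(names)   # one column in, three columns out
```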

# Numerical columns don't change; categorical columns change from the one-hot encoder
import numpy as np

new_cat_cols = clf_pipeline.named_steps["col_trans"]\
    .named_transformers_["cat_pipeline"]\
    .named_steps["one-hot"].get_feature_names(cat_cols)

# Concatenate categorical columns with numerical columns to get all columns
all_cols = np.concatenate([num_cols, new_cat_cols])

Code explanation:

  • Get the new categorical column names using get_feature_names. (Numerical columns aren’t changed.)
  • Passing cat_cols to get_feature_names returns column names in the format “column name_category value”.
    Without the cat_cols argument, it returns column indices instead.
Compare get_feature_names with and without the argument
  • Since col_trans, a ColumnTransformer, runs num_pipeline before cat_pipeline, the pipeline output has the numerical columns before the categorical columns. All columns are obtained by concatenating num_cols and new_cat_cols.

Note: \ is only to continue code to a new line.
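The ordering point above can be sketched with plain lists (a made-up subset of column names for illustration, not the actual pipeline output):

```python
# Toy subset: ColumnTransformer runs num_pipeline first, then cat_pipeline,
# so numerical columns come before the expanded categorical ones.
num_cols = ['city_development_index', 'training_hours']
new_cat_cols = ['gender_Female', 'gender_Male', 'gender_Other']

all_cols = num_cols + new_cat_cols   # same order as the transformed array
print(all_cols)
```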

📊 3. Plotting coefficients of features

The model in this pipeline is logistic regression, so we can quickly understand feature importance by looking at the coefficients of each feature/column.
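As a quick reminder of why the sign of a coefficient matters, here is a toy calculation with a made-up coefficient value (not from the trained model): in logistic regression, a positive coefficient pushes the predicted probability up as the feature value increases.

```python
import math

def sigmoid(z):
    # Logistic function: maps a log-odds value to a probability
    return 1.0 / (1.0 + math.exp(-z))

coef = 1.2                                # made-up positive coefficient
p_feature_off = sigmoid(0.0)              # feature value 0
p_feature_on = sigmoid(coef * 1.0)        # feature value 1
print(p_feature_off, p_feature_on)        # probability rises with the feature
```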

The coefficients can be obtained by going to the “model” step and reading its coef_ attribute.

import pandas as pd

coefs = clf_pipeline.named_steps["model"].coef_.flatten()

# Build a dataframe of features and their coefficients
coef = pd.DataFrame(zip(all_cols, coefs), columns=["feature", "coef"])
coef["abs_coef"] = coef["coef"].apply(lambda x: abs(x))
coef["colors"] = coef["coef"].apply(lambda x: "green" if x > 0 else "red")
coef = coef.sort_values("abs_coef", ascending=False)

Code explanation:

  • Access the “model” step to get coef_
  • Change the array coefs to a dataframe coef so more columns can be added
  • Add a column “abs_coef” holding the absolute value of each coefficient
  • Add a column “colors” that is green when the coefficient is positive and red when it is negative
  • Sort the dataframe by abs_coef so the most important features are at the top
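The same abs-and-sort logic can be sketched in plain Python, with made-up feature names and coefficients:

```python
# Made-up (feature, coef) pairs for illustration
rows = [("feature_a", 0.8), ("feature_b", -1.5), ("feature_c", 0.2)]

# Sort by absolute coefficient, largest first (most important on top)
ranked = sorted(rows, key=lambda r: abs(r[1]), reverse=True)

# Green for positive coefficients, red for negative
colors = ["green" if c > 0 else "red" for _, c in ranked]
print(ranked)
print(colors)
```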

Plot the coefficients using a barplot from the seaborn library, with the coef dataframe as the data.

# Plot coef
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

fig, ax = plt.subplots(1, 1, figsize=(12, 7))
sns.barplot(x="feature",
            y="coef",
            data=coef.head(20),
            palette=coef.head(20)["colors"])
ax.set_xticklabels(ax.get_xticklabels(), rotation=90, fontsize=20)
ax.set_title("Top 20 Features", fontsize=25)
ax.set_ylabel("Coef", fontsize=22)
ax.set_xlabel("Feature Name", fontsize=22)

Conclusion

You can get the column names after the pipeline by climbing the pipeline tree using named_steps and named_transformers_. Then, you can keep track of changed column names using get_feature_names. After that, you can find the model coefficients and plot a graph to visualize feature importance.

If you have a better way to get column names after pipeline transformation, please leave the idea in the comment.
