
5 Scikit-learn Pipeline Tricks to Supercharge Your Workflow
Introduction
Perhaps one of the most underrated yet powerful features that scikit-learn has to offer, pipelines are a great ally for building effective and modular machine learning workflows. They streamline the entire process — from data preparation and feature engineering to modeling, fine-tuning, and validation — while mitigating the risk of data leakage, making the code reproducible, and keeping it cleaner and easier to maintain.
In this article, we describe and exemplify — through concise but intermediate- to advanced-level use cases — five pipeline tricks to level up your ongoing machine learning projects.
Initial Setup
The following code will be used in several of the examples later on, so run these preparatory steps first. Note that we will use the popular Titanic survivorship dataset throughout:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
from sklearn.datasets import fetch_openml

# Load the dataset (Titanic survivorship)
titanic = fetch_openml("titanic", version=1, as_frame=True)
X = titanic.data[["pclass", "sex", "age", "fare"]]
y = titanic.target == "1"

# Split the dataset into training and test subsets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Select features by type
num_features = ["age", "fare"]
cat_features = ["pclass", "sex"]
Time to get started with the core of this hands-on article!
1. ColumnTransformer to Handle Mixed Data Types
In the first example, we will instantiate a ColumnTransformer object to define a robust data preprocessing pipeline. This class allows different transformations to be flexibly applied to different feature subsets in a unified manner, facilitating the processing of mixed data types and missing values without burdensome operations or repetitive code.
# Instantiate a ColumnTransformer for robust data preprocessing
preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ]), num_features),
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))
    ]), cat_features)
])

pipe = Pipeline([
    ("preprocessor", preprocessor),
    ("model", LogisticRegression(max_iter=1000))
])

pipe.fit(X_train, y_train)
print("Accuracy:", pipe.score(X_test, y_test))
This approach jointly processes numerical features, categorical ones, and missing values, integrating them into an overall pipeline before training the logistic regression classifier.
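If you want to verify exactly what the fitted preprocessor feeds into the model, you can inspect the generated feature names. Here is a minimal sketch; it assumes scikit-learn 1.0 or later, where ColumnTransformer exposes get_feature_names_out():

# Inspect the output features of the fitted ColumnTransformer
# (assumes scikit-learn >= 1.0, where get_feature_names_out() is available)
fitted_preprocessor = pipe.named_steps["preprocessor"]
print(fitted_preprocessor.get_feature_names_out())

Each name is prefixed with the transformer it came from (num__ or cat__), the same double-underscore pattern used for hyperparameter tuning later in this article.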
2. Feature Engineering with a Custom Transformer
Custom transformers in scikit-learn allow us to define our own feature-level transformation steps (be it feature engineering or preprocessing) and inject them directly into a pipeline. One example is TransformerMixin, which requires defining our own fit and transform methods but, in exchange, allows us to use fit_transform() seamlessly.
The following code defines a custom transformer class through inheritance. It maps the “age” numerical feature into binary (0 or 1) values indicating whether the passenger is an adult. We then incorporate this custom transformer into a ColumnTransformer like the one defined in the previous use case.
from sklearn.base import BaseEstimator, TransformerMixin

# Custom transformer that creates a binary "is_adult" feature from age
class IsAdult(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return (X["age"].fillna(0) >= 18).astype(int).to_frame("is_adult")

# Extended ColumnTransformer that incorporates the custom transformer
extended = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ]), num_features),
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))
    ]), cat_features),
    ("is_adult", IsAdult(), ["age"])
])

pipe = Pipeline([
    ("preprocessor", extended),
    ("model", LogisticRegression(max_iter=1000))
])

pipe.fit(X_train, y_train)
print("Accuracy with custom feature:", pipe.score(X_test, y_test))
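Because IsAdult inherits from TransformerMixin, it gets fit_transform() for free, so you can sanity-check it on its own before wiring it into the pipeline. A quick standalone usage example:

# Standalone check: fit_transform() is inherited from TransformerMixin
adult_flags = IsAdult().fit_transform(X_train[["age"]])
print(adult_flags.head())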
3. Hyperparameter Tuning Across the Entire Pipeline
This example demonstrates that hyperparameter tuning (finding the best configuration among many options) is not exclusive to the machine learning model’s settings. It can also cover choices made in earlier preprocessing steps, as the following example shows:
from sklearn.svm import SVC

pipe = Pipeline([
    ("preprocessor", preprocessor),
    ("model", SVC())
])

param_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "model__C": [0.1, 1, 10],
    "model__kernel": ["linear", "rbf"]
}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_train, y_train)
print("Best params:", search.best_params_)
print("Best score:", search.best_score_)
Note that preprocessor is the preprocessing pipeline defined in example 1, i.e. a ColumnTransformer. The key in this example is the addition of a hyperparameter to the search grid that relates to a preprocessing step, namely the missing value imputation strategy. In other words, candidate models are evaluated not only over hyperparameters of the model itself but also over specific settings of the preprocessing steps.
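The double-underscore convention (step__substep__parameter) must match the pipeline’s step names exactly. If in doubt, every tunable name can be listed directly from the pipeline itself, as in this short sketch:

# List the parameter names reachable via the double-underscore syntax
for name in pipe.get_params():
    if "imputer" in name or name.startswith("model__"):
        print(name)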
4. Integrating Feature Selection into a Pipeline
Another powerful technique, especially for datasets with many features, is to dynamically perform feature selection within the pipeline to keep the final model simpler.
This example automatically selects the most informative preprocessed features before training the model by incorporating SelectKBest into the overarching pipeline to retain the highest-scoring features (once again, we use the same preprocessor instance defined in earlier examples):
from sklearn.feature_selection import SelectKBest, f_classif

pipe = Pipeline([
    ("preprocessor", preprocessor),
    ("feature_selection", SelectKBest(score_func=f_classif, k=5)),
    ("model", LogisticRegression(max_iter=1000))
])

pipe.fit(X_train, y_train)
print("Test accuracy:", pipe.score(X_test, y_test))
The SelectKBest class requires a scoring function or criterion to determine the top-k features to retain for model training. Another possible argument is score_func=chi2, which applies a Chi-squared test and is useful when categorical features dominate; note, however, that chi2 requires non-negative inputs, so it would not work directly on the standard-scaled numerical features produced by this preprocessor.
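To see which preprocessed features actually survived the selection step, you can map the selector’s support mask back to the preprocessor’s feature names. A minimal sketch, again assuming scikit-learn 1.0+ for get_feature_names_out():

# Map the selector's support mask back to the generated feature names
feature_names = pipe.named_steps["preprocessor"].get_feature_names_out()
mask = pipe.named_steps["feature_selection"].get_support()
print("Selected features:", feature_names[mask])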
5. Stacked Pipelines
Our last example shows how to stack multiple pipelines for building an ensemble machine learning solution. Pipelines are a great way to design our own highly customizable ensembles, in cases where different models, sometimes with distinct preprocessing steps, need to be trained and combined without risking data management inconsistencies.
In the example below, two “overarching” pipelines are defined: one to preprocess the data and train a logistic regression classifier, and the other to apply the same preprocessing (for simplicity, it could have been different) but train a decision tree instead.
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier

log_reg_pipe = Pipeline([
    ("preprocessor", preprocessor),
    ("logreg", LogisticRegression(max_iter=1000))
])

tree_pipe = Pipeline([
    ("preprocessor", preprocessor),
    ("tree", DecisionTreeClassifier(max_depth=5))
])

stack = StackingClassifier(
    estimators=[("lr", log_reg_pipe), ("dt", tree_pipe)],
    final_estimator=LogisticRegression()
)

stack.fit(X_train, y_train)
print("Stacked accuracy:", stack.score(X_test, y_test))
The two pipelines are then stacked using the StackingClassifier class, which uses a final estimator to learn the best way to combine the base models’ predictions, yielding a stronger, more generalizable model.
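Whether stacking actually pays off is an empirical question, so a quick cross-validated comparison of the base pipelines against the ensemble is a sensible sanity check. The snippet below is a minimal sketch, not part of the original example:

from sklearn.model_selection import cross_val_score

# Compare each base pipeline against the stacked ensemble with 5-fold CV
for label, model in [("logreg", log_reg_pipe), ("tree", tree_pipe), ("stack", stack)]:
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(f"{label}: {scores.mean():.3f} (+/- {scores.std():.3f})")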
Wrapping Up
This article walked through five examples of what scikit-learn pipelines can do to make machine learning workflows more effective, customizable, and, in some cases, better-performing. From custom preprocessing pipelines for mixed data types to extending hyperparameter tuning into the preprocessing steps, these tricks can take your machine learning projects to the next level.