
5 Scikit-learn Pipeline Tricks to Supercharge Your Workflow
Introduction
Perhaps one of the most underrated yet powerful features that scikit-learn has to offer, pipelines are a great ally for building effective and modular machine learning workflows. They streamline the entire process — from data preparation and feature engineering to modeling, fine-tuning, and validation — while mitigating the risk of data leakage, making the code reproducible, and keeping it cleaner and easier to maintain.
In this article, we describe and exemplify — through concise but intermediate- to advanced-level use cases — five pipeline tricks to level up your ongoing machine learning projects.
Initial Setup
The following code will be used in several of the examples later on, so run these preparatory steps first. Note that we will use the popular Titanic survivorship dataset throughout:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
from sklearn.datasets import fetch_openml

# Load the dataset (Titanic survivorship)
titanic = fetch_openml("titanic", version=1, as_frame=True)
X = titanic.data[["pclass", "sex", "age", "fare"]]
y = titanic.target == "1"

# Split the dataset into training and test subsets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Select features by type
num_features = ["age", "fare"]
cat_features = ["pclass", "sex"]
Time to get started with the core of this hands-on article!
1. ColumnTransformer to Handle Mixed Data Types
In the first example, we will instantiate a ColumnTransformer object to define a robust data preprocessing pipeline. This class allows different transformations to be flexibly applied to different feature subsets in a unified manner, facilitating the processing of mixed data types and missing values without burdensome operations or repetitive code.
# Instantiate a ColumnTransformer for robust data preprocessing
preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ]), num_features),
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))
    ]), cat_features)
])

pipe = Pipeline([
    ("preprocessor", preprocessor),
    ("model", LogisticRegression(max_iter=1000))
])

pipe.fit(X_train, y_train)
print("Accuracy:", pipe.score(X_test, y_test))
This approach jointly processes numerical features, categorical ones, and missing values, integrating them into an overall pipeline before training the logistic regression classifier.
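If you want to verify exactly what the fitted preprocessor feeds into the model, you can inspect the generated feature names. Here is a minimal sketch; it assumes scikit-learn 1.0 or later, where ColumnTransformer exposes get_feature_names_out():

# Inspect the output features of the fitted ColumnTransformer
# (assumes scikit-learn >= 1.0, where get_feature_names_out() is available)
fitted_preprocessor = pipe.named_steps["preprocessor"]
print(fitted_preprocessor.get_feature_names_out())

Each name is prefixed with the transformer it came from (num__ or cat__), the same double-underscore pattern used for hyperparameter tuning later in this article.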
2. Feature Engineering with a Custom Transformer
Custom transformers in scikit-learn allow us to define our own feature-level transformation steps (be it feature engineering or preprocessing) and inject them directly into a pipeline. One example is TransformerMixin, which requires defining our own fit and transform methods but, in exchange, allows us to use fit_transform() seamlessly.
The following code defines a custom transformer class through inheritance. It maps the “age” numerical feature into binary (0 or 1) values indicating whether the passenger is an adult. We then incorporate this custom transformer into a ColumnTransformer like the one defined in the previous use case.
from sklearn.base import BaseEstimator, TransformerMixin

# Custom transformer that creates a binary "is_adult" feature from age
class IsAdult(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return (X["age"].fillna(0) >= 18).astype(int).to_frame("is_adult")

# Extended ColumnTransformer that incorporates the custom transformer
extended = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ]), num_features),
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))
    ]), cat_features),
    ("is_adult", IsAdult(), ["age"])
])

pipe = Pipeline([
    ("preprocessor", extended),
    ("model", LogisticRegression(max_iter=1000))
])

pipe.fit(X_train, y_train)
print("Accuracy with custom feature:", pipe.score(X_test, y_test))
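Because IsAdult inherits from TransformerMixin, it gets fit_transform() for free, so you can sanity-check it on its own before wiring it into the pipeline. A quick standalone usage example:

# Standalone check: fit_transform() is inherited from TransformerMixin
adult_flags = IsAdult().fit_transform(X_train[["age"]])
print(adult_flags.head())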
3. Hyperparameter Tuning Across the Entire Pipeline
This example demonstrates that hyperparameter tuning (finding the best configuration among many options) is not exclusive to the machine learning model’s settings. It can also cover choices made in earlier preprocessing steps, as the following example shows:
from sklearn.svm import SVC

pipe = Pipeline([
    ("preprocessor", preprocessor),
    ("model", SVC())
])

param_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "model__C": [0.1, 1, 10],
    "model__kernel": ["linear", "rbf"]
}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_train, y_train)
print("Best params:", search.best_params_)
print("Best score:", search.best_score_)
Note that preprocessor is the preprocessing pipeline defined in example 1, i.e. a ColumnTransformer. The key in this example is the addition of a hyperparameter to the search grid that relates to a preprocessing step, namely the missing value imputation strategy. In other words, candidate models are evaluated not only over hyperparameters of the model itself but also over specific settings of the preprocessing steps.
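The double-underscore convention (step__substep__parameter) must match the pipeline’s step names exactly. If in doubt, every tunable name can be listed directly from the pipeline itself, as in this short sketch:

# List the parameter names reachable via the double-underscore syntax
for name in pipe.get_params():
    if "imputer" in name or name.startswith("model__"):
        print(name)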
4. Integrating Feature Selection into a Pipeline
Another powerful technique, especially for datasets with many features, is to dynamically perform feature selection within the pipeline to keep the final model simpler.
This example automatically selects the most informative preprocessed features before training the model by incorporating SelectKBest into the overarching pipeline to retain the highest-scoring features (once again, we use the same preprocessor instance defined in earlier examples):
from sklearn.feature_selection import SelectKBest, f_classif

pipe = Pipeline([
    ("preprocessor", preprocessor),
    ("feature_selection", SelectKBest(score_func=f_classif, k=5)),
    ("model", LogisticRegression(max_iter=1000))
])

pipe.fit(X_train, y_train)
print("Test accuracy:", pipe.score(X_test, y_test))
The SelectKBest class requires a scoring function or criterion to determine the top-k features to retain for model training. Another possible argument is score_func=chi2, which applies a Chi-squared test and is useful when categorical features dominate; note, however, that chi2 requires non-negative inputs, so it would not work directly on the standard-scaled numerical features produced by this preprocessor.
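To see which preprocessed features actually survived the selection step, you can map the selector’s support mask back to the preprocessor’s feature names. A minimal sketch, again assuming scikit-learn 1.0+ for get_feature_names_out():

# Map the selector's support mask back to the generated feature names
feature_names = pipe.named_steps["preprocessor"].get_feature_names_out()
mask = pipe.named_steps["feature_selection"].get_support()
print("Selected features:", feature_names[mask])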
5. Stacked Pipelines
Our last example shows how to stack multiple pipelines for building an ensemble machine learning solution. Pipelines are a great way to design our own highly customizable ensembles, in cases where different models, sometimes with distinct preprocessing steps, need to be trained and combined without risking data management inconsistencies.
In the example below, two “overarching” pipelines are defined: one to preprocess the data and train a logistic regression classifier, and the other to apply the same preprocessing (for simplicity, it could have been different) but train a decision tree instead.
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier

log_reg_pipe = Pipeline([
    ("preprocessor", preprocessor),
    ("logreg", LogisticRegression(max_iter=1000))
])

tree_pipe = Pipeline([
    ("preprocessor", preprocessor),
    ("tree", DecisionTreeClassifier(max_depth=5))
])

stack = StackingClassifier(
    estimators=[("lr", log_reg_pipe), ("dt", tree_pipe)],
    final_estimator=LogisticRegression()
)

stack.fit(X_train, y_train)
print("Stacked accuracy:", stack.score(X_test, y_test))
The two pipelines are then stacked using the StackingClassifier class, which uses a final estimator to learn the best way to combine the base models’ predictions, yielding a stronger, more generalizable model.
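Whether stacking actually pays off is an empirical question, so a quick cross-validated comparison of the base pipelines against the ensemble is a sensible sanity check. The snippet below is a minimal sketch, not part of the original example:

from sklearn.model_selection import cross_val_score

# Compare each base pipeline against the stacked ensemble with 5-fold CV
for label, model in [("logreg", log_reg_pipe), ("tree", tree_pipe), ("stack", stack)]:
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(f"{label}: {scores.mean():.3f} (+/- {scores.std():.3f})")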
Wrapping Up
This article walked through five examples of what scikit-learn pipelines can do to make machine learning workflows more effective, customizable, and, in some cases, better-performing. From custom preprocessing pipelines for mixed data types to extending hyperparameter tuning into the preprocessing steps, these tricks can take your machine learning projects to the next level.