How AutoGluon Enables Modern AutoML Pipelines for Production-Grade Tabular Models with Ensembling and Distillation

In this tutorial, we build a production-grade tabular machine learning pipeline using AutoGluon, taking a real-world mixed-type dataset from raw ingestion through to deployment-ready artifacts. We train high-quality stacked and bagged ensembles, evaluate performance with robust metrics, perform subgroup and feature-level analysis, and then optimize the model for real-time inference using refit-full and distillation. Throughout the workflow, we focus on practical decisions that balance accuracy, latency, and deployability. Check out the FULL CODES here.

!pip -q install -U "autogluon==1.5.0" "scikit-learn>=1.3" "pandas>=2.0" "numpy>=1.24"


import os, time, json, warnings
warnings.filterwarnings("ignore")


import numpy as np
import pandas as pd


from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, log_loss, accuracy_score, classification_report, confusion_matrix


from autogluon.tabular import TabularPredictor

We set up the environment by installing the required libraries and importing all core dependencies used throughout the pipeline. We configure warnings to keep outputs clean and ensure numerical, tabular, and evaluation utilities are ready. Check out the FULL CODES here.

from sklearn.datasets import fetch_openml
df = fetch_openml(data_id=40945, as_frame=True).frame


target = "survived"
df[target] = df[target].astype(int)


drop_cols = [c for c in ["boat", "body", "home.dest"] if c in df.columns]
df = df.drop(columns=drop_cols, errors="ignore")


df = df.replace({None: np.nan})
print("Shape:", df.shape)
print("Target positive rate:", df[target].mean().round(4))
print("Columns:", list(df.columns))


train_df, test_df = train_test_split(
   df,
   test_size=0.2,
   random_state=42,
   stratify=df[target],
)

We load a real-world mixed-type dataset and perform light preprocessing to prepare a clean training signal. We define the target, remove highly leaky columns, and validate the dataset structure. We then create a stratified train–test split to preserve class balance. Check out the FULL CODES here.

def has_gpu():
   try:
       import torch
       return torch.cuda.is_available()
   except Exception:
       return False


presets = "extreme" if has_gpu() else "best_quality"


save_path = "/content/autogluon_titanic_advanced"
os.makedirs(save_path, exist_ok=True)


predictor = TabularPredictor(
   label=target,
   eval_metric="roc_auc",
   path=save_path,
   verbosity=2
)

We detect hardware availability to dynamically select the most suitable AutoGluon training preset. We configure a persistent model directory and initialize the tabular predictor with an appropriate evaluation metric. Check out the FULL CODES here.

start = time.time()
predictor.fit(
   train_data=train_df,
   presets=presets,
   time_limit=7 * 60,
   num_bag_folds=5,
   num_stack_levels=2,
   refit_full=False
)
train_time = time.time() - start
print(f"\nTraining done in {train_time:.1f}s with presets="{presets}"")

We train a high-quality ensemble using bagging and stacking within a controlled time budget. We rely on AutoGluon’s automated model search to efficiently explore strong architectures. We also record training time to understand computational cost. Check out the FULL CODES here.

lb = predictor.leaderboard(test_df, silent=True)
print("\n=== Leaderboard (top 15) ===")
display(lb.head(15))


proba = predictor.predict_proba(test_df)
pred = predictor.predict(test_df)


y_true = test_df[target].values
if isinstance(proba, pd.DataFrame) and 1 in proba.columns:
   y_proba = proba[1].values
else:
   y_proba = np.asarray(proba).reshape(-1)


print("\n=== Test Metrics ===")
print("ROC-AUC:", roc_auc_score(y_true, y_proba).round(5))
print("LogLoss:", log_loss(y_true, np.clip(y_proba, 1e-6, 1 - 1e-6)).round(5))
print("Accuracy:", accuracy_score(y_true, pred).round(5))
print("\nClassification report:\n", classification_report(y_true, pred))

We evaluate the trained models using a held-out test set and inspect the leaderboard to compare performance. We compute probabilistic and discrete predictions and derive key classification metrics. It gives us a comprehensive view of model accuracy and calibration. Check out the FULL CODES here.

if "pclass" in test_df.columns:
   print("\n=== Slice AUC by pclass ===")
   for grp, part in test_df.groupby("pclass"):
       part_proba = predictor.predict_proba(part)
       part_proba = part_proba[1].values if isinstance(part_proba, pd.DataFrame) and 1 in part_proba.columns else np.asarray(part_proba).reshape(-1)
       auc = roc_auc_score(part[target].values, part_proba)
       print(f"pclass={grp}: AUC={auc:.4f} (n={len(part)})")


fi = predictor.feature_importance(test_df, silent=True)
print("\n=== Feature importance (top 20) ===")
display(fi.head(20))

We analyze model behavior through subgroup performance slicing and permutation-based feature importance. We identify how performance varies across meaningful segments of the data. It helps us assess robustness and interpretability before deployment. Check out the FULL CODES here.

t0 = time.time()
refit_map = predictor.refit_full()
t_refit = time.time() - t0


print(f"\nrefit_full completed in {t_refit:.1f}s")
print("Refit mapping (sample):", dict(list(refit_map.items())[:5]))


lb_full = predictor.leaderboard(test_df, silent=True)
print("\n=== Leaderboard after refit_full (top 15) ===")
display(lb_full.head(15))


best_model = predictor.get_model_best()
full_candidates = [m for m in predictor.get_model_names() if m.endswith("_FULL")]


def bench_infer(model_name, df_in, repeats=3):
   times = []
   for _ in range(repeats):
       t1 = time.time()
       _ = predictor.predict(df_in, model=model_name)
       times.append(time.time() - t1)
   return float(np.median(times))


small_batch = test_df.drop(columns=[target]).head(256)
lat_best = bench_infer(best_model, small_batch)
print(f"\nBest model: {best_model} | median predict() latency on 256 rows: {lat_best:.4f}s")


if full_candidates:
   lb_full_sorted = lb_full.sort_values(by="score_test", ascending=False)
   best_full = lb_full_sorted[lb_full_sorted["model"].str.endswith("_FULL")].iloc[0]["model"]
   lat_full = bench_infer(best_full, small_batch)
   print(f"Best FULL model: {best_full} | median predict() latency on 256 rows: {lat_full:.4f}s")
   print(f"Speedup factor (best / full): {lat_best / max(lat_full, 1e-9):.2f}x")


try:
   t0 = time.time()
   distill_result = predictor.distill(
       train_data=train_df,
       time_limit=4 * 60,
       augment_method="spunge",
   )
   t_distill = time.time() - t0
   print(f"\nDistillation completed in {t_distill:.1f}s")
except Exception as e:
   print("\nDistillation step failed")
   print("Error:", repr(e))


lb2 = predictor.leaderboard(test_df, silent=True)
print("\n=== Leaderboard after distillation attempt (top 20) ===")
display(lb2.head(20))


predictor.save()
reloaded = TabularPredictor.load(save_path)


sample = test_df.drop(columns=[target]).sample(8, random_state=0)
sample_pred = reloaded.predict(sample)
sample_proba = reloaded.predict_proba(sample)


print("\n=== Reloaded predictor sanity-check ===")
print(sample.assign(pred=sample_pred).head())


print("\nProbabilities (head):")
display(sample_proba.head())


artifacts = {
   "path": save_path,
   "presets": presets,
   "best_model": reloaded.get_model_best(),
   "model_names": reloaded.get_model_names(),
   "leaderboard_top10": lb2.head(10).to_dict(orient="records"),
}
with open(os.path.join(save_path, "run_summary.json"), "w") as f:
   json.dump(artifacts, f, indent=2)


print("\nSaved summary to:", os.path.join(save_path, "run_summary.json"))
print("Done.")

We optimize the trained ensemble for inference by collapsing bagged models and benchmarking latency improvements. We optionally distill the ensemble into faster models and validate persistence through save-reload checks. Also, we export structured artifacts required for production handoff.

In conclusion, we implemented an end-to-end workflow with AutoGluon that transforms raw tabular data into production-ready models with minimal manual intervention, while maintaining strong control over accuracy, robustness, and inference efficiency. We performed systematic error analysis and feature importance evaluation, optimized large ensembles through refitting and distillation, and validated deployment readiness using latency benchmarking and artifact packaging. This workflow enables the deployment of high-performing, scalable, interpretable, and well-suited tabular models for real-world production environments.

Check out the FULL CODES here. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.

Source_link