Friday, March 6, 2026
mGrowTech

A Coding Guide to Build a Scalable End-to-End Machine Learning Data Pipeline Using Daft for High-Performance Structured and Image Data Processing

by Josh
March 6, 2026
in AI, Analytics and Automation


In this tutorial, we use Daft as a high-performance, Python-native data engine to build an end-to-end analytical pipeline. We start by loading the MNIST test set from a remote JSON file, then progressively transform it using UDFs, feature engineering, aggregations, and joins, relying on Daft’s lazy execution along the way. We also show how to combine structured data processing, numerical computation, and machine learning in one workflow. By the end, we have gone beyond manipulating data and built a complete, model-ready pipeline powered by Daft’s scalable execution engine.

!pip -q install daft pyarrow pandas numpy scikit-learn


import os
os.environ["DO_NOT_TRACK"] = "true"


import numpy as np
import pandas as pd
import daft
from daft import col


print("Daft version:", getattr(daft, "__version__", "unknown"))


URL = "https://github.com/Eventual-Inc/mnist-json/raw/master/mnist_handwritten_test.json.gz"


df = daft.read_json(URL)
print("\nSchema (sampled):")
print(df.schema())


print("\nPeek:")
df.show(5)

We install Daft and its supporting libraries directly in Google Colab to get a clean, reproducible environment. We set the optional DO_NOT_TRACK environment variable before importing Daft and print the installed version to confirm everything is working. This gives us a stable foundation for building our end-to-end data pipeline.

def to_28x28(pixels):
   arr = np.array(pixels, dtype=np.float32)
   if arr.size != 784:
       return None
   return arr.reshape(28, 28)


df2 = (
   df
   .with_column(
       "img_28x28",
       col("image").apply(to_28x28, return_dtype=daft.DataType.python())
   )
   .with_column(
       "pixel_mean",
       col("img_28x28").apply(lambda x: float(np.mean(x)) if x is not None else None,
                              return_dtype=daft.DataType.float32())
   )
   .with_column(
       "pixel_std",
       col("img_28x28").apply(lambda x: float(np.std(x)) if x is not None else None,
                              return_dtype=daft.DataType.float32())
   )
)


print("\nAfter reshaping + simple features:")
df2.select("label", "pixel_mean", "pixel_std").show(5)

We load a real-world MNIST JSON dataset directly from a remote URL using Daft’s native reader. We inspect the schema and preview the first rows to understand the structure and column types. This lets us validate the dataset before applying transformations and feature engineering.
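As a hedged illustration of that validation step, the sketch below checks individual records in plain Python before any transformation runs. The `validate_record` helper and the sample records are hypothetical. The field names `image` and `label` follow the columns used in this tutorial, and the 0 to 255 pixel range is an assumption based on the division by 255 in the feature code.

```python
def validate_record(record, expected_pixels=784):
    """Return a list of problems found in one MNIST-style record (empty if none)."""
    problems = []
    pixels = record.get("image")
    label = record.get("label")
    if not isinstance(pixels, list) or len(pixels) != expected_pixels:
        problems.append("image must be a list of %d values" % expected_pixels)
    elif any(not (0 <= p <= 255) for p in pixels):
        # Assumed range: 8-bit grayscale intensities.
        problems.append("pixel values must lie in [0, 255]")
    if not isinstance(label, int) or not (0 <= label <= 9):
        problems.append("label must be an integer digit 0-9")
    return problems


# Hypothetical sample records for illustration only.
good = {"image": [0] * 784, "label": 7}
bad = {"image": [0] * 100, "label": 12}

print(validate_record(good))  # no problems reported
print(validate_record(bad))   # wrong image length and invalid label
```

A check like this is cheap to run on a handful of rows taken from a preview, which is usually enough to catch a wrong column name or an unexpected image size early.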

@daft.udf(return_dtype=daft.DataType.list(daft.DataType.float32()), batch_size=512)
def featurize(images_28x28):
   out = []
   for img in images_28x28.to_pylist():
       if img is None:
           out.append(None)
           continue
       img = np.asarray(img, dtype=np.float32)
       row_sums = img.sum(axis=1) / 255.0
       col_sums = img.sum(axis=0) / 255.0
       total = img.sum() + 1e-6
       ys, xs = np.indices(img.shape)
       cy = float((ys * img).sum() / total) / 28.0
       cx = float((xs * img).sum() / total) / 28.0
       vec = np.concatenate([row_sums, col_sums, np.array([cy, cx, img.mean()/255.0, img.std()/255.0], dtype=np.float32)])
       out.append(vec.astype(np.float32).tolist())
   return out


df3 = df2.with_column("features", featurize(col("img_28x28")))


print("\nFeature column created (list[float]):")
df3.select("label", "features").show(2)
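
To make the contents of the feature vector easier to follow, here is a small dependency-free sketch of the intensity-weighted centroid that the batch UDF computes, applied to a toy 3×3 grid instead of a 28×28 image. The `centroid` helper and the toy image are hypothetical. The epsilon mirrors the `1e-6` used in the tutorial code, but the sketch omits the division by 28 that the tutorial applies afterwards.

```python
def centroid(img):
    """Intensity-weighted centre of mass (row, col) of a 2D grid of pixel values."""
    total = sum(sum(row) for row in img) + 1e-6  # epsilon guards against an all-zero image
    cy = sum(y * v for y, row in enumerate(img) for v in row) / total
    cx = sum(x * v for row in img for x, v in enumerate(row)) / total
    return cy, cx


# Toy 3x3 image with a single bright pixel at row 2, column 1.
toy = [
    [0, 0, 0],
    [0, 0, 0],
    [0, 255, 0],
]
cy, cx = centroid(toy)
print(round(cy, 3), round(cx, 3))  # 2.0 1.0

# Length of the feature vector for a 28x28 image:
# 28 row sums + 28 column sums + 4 scalars (cy, cx, mean, std).
n_features = 28 + 28 + 4
print(n_features)  # 60
```

Each image therefore maps to a 60-value vector, which is far smaller than the 784 raw pixels and still carries coarse shape and position information.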

We reshape the raw pixel arrays into structured 28×28 images using a row-wise UDF. We compute statistical features, such as the mean and standard deviation, to enrich the dataset. By applying these transformations, we convert raw image data into structured and model-friendly representations.
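
The reshape and the two summary statistics can be sketched without any dependencies, which makes it easy to see what each per-row call returns. The helper names `reshape_flat` and `pixel_stats` and the synthetic image are hypothetical. The population standard deviation is used because that is what `np.std` computes by default.

```python
import statistics


def reshape_flat(pixels, height=28, width=28):
    """Split a flat pixel list into `height` rows of `width` values, or None if the size is wrong."""
    if len(pixels) != height * width:
        return None
    return [pixels[r * width:(r + 1) * width] for r in range(height)]


def pixel_stats(pixels):
    """Mean and population standard deviation, matching the np.mean and np.std defaults."""
    return statistics.fmean(pixels), statistics.pstdev(pixels)


# Synthetic image: half the pixels are 0, the other half are 255.
flat = [0] * 392 + [255] * 392
grid = reshape_flat(flat)
mean, std = pixel_stats(flat)

print(len(grid), len(grid[0]))  # 28 28
print(mean, std)                # 127.5 127.5
print(reshape_flat([0] * 10))   # None
```

Returning `None` for a malformed row, as the tutorial's `to_28x28` does, keeps the pipeline running and lets later steps skip or drop the bad record explicitly.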

label_stats = (
   df3.groupby("label")
      .agg(
          col("label").count().alias("n"),
          col("pixel_mean").mean().alias("mean_pixel_mean"),
          col("pixel_std").mean().alias("mean_pixel_std"),
      )
      .sort("label")
)


print("\nLabel distribution + summary stats:")
label_stats.show(10)


df4 = df3.join(label_stats, on="label", how="left")


print("\nJoined label stats back onto each row:")
df4.select("label", "n", "mean_pixel_mean", "mean_pixel_std").show(5)

We implement a batch UDF to extract richer feature vectors from the reshaped images. We perform group-by aggregations and join summary statistics back to the dataset for contextual enrichment. This demonstrates how we combine scalable computation with advanced analytics within Daft.
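
The group-by, aggregate, and join-back pattern can be illustrated in plain Python on three rows. This is only a sketch of the semantics, not of how Daft executes the query. The sample rows are hypothetical, and the column names follow the ones used in the tutorial.

```python
from collections import defaultdict

# Hypothetical rows for illustration only.
rows = [
    {"label": 0, "pixel_mean": 40.0},
    {"label": 0, "pixel_mean": 50.0},
    {"label": 1, "pixel_mean": 20.0},
]

# Group-by: collect pixel_mean values per label.
groups = defaultdict(list)
for row in rows:
    groups[row["label"]].append(row["pixel_mean"])

# Aggregate: one summary row per label.
label_stats = {
    label: {"n": len(values), "mean_pixel_mean": sum(values) / len(values)}
    for label, values in groups.items()
}

# Left join: every original row keeps its data and gains its label's summary.
joined = [{**row, **label_stats.get(row["label"], {})} for row in rows]

for row in joined:
    print(row)
```

The join does not change the number of rows. It only adds per-label context, such as the class size `n`, to each individual example.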

small = df4.select("label", "features").collect().to_pandas()


small = small.dropna(subset=["label", "features"]).reset_index(drop=True)


X = np.vstack(small["features"].apply(np.array).values).astype(np.float32)
y = small["label"].astype(int).values


from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report


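# Hold out a test set before fitting (split ratio, seed, and stratification are assumed values).
X_train, X_test, y_train, y_test = train_test_split(
   X, y, test_size=0.2, random_state=42, stratify=y
)


clf = LogisticRegression(max_iter=1000, n_jobs=None)
clf.fit(X_train, y_train)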


pred = clf.predict(X_test)
acc = accuracy_score(y_test, pred)


print("\nBaseline accuracy (feature-engineered LogisticRegression):", round(acc, 4))
print("\nClassification report:")
print(classification_report(y_test, pred, digits=4))


out_df = df4.select("label", "features", "pixel_mean", "pixel_std", "n")
out_path = "/content/daft_mnist_features.parquet"
out_df.write_parquet(out_path)


print("\nWrote parquet to:", out_path)


df_back = daft.read_parquet(out_path)
print("\nRead-back check:")
df_back.show(3)

We materialize selected columns into pandas, hold out a test split, and train a baseline Logistic Regression model. We evaluate accuracy and the per-class report on the held-out data to check that the engineered features are useful. We also persist the processed dataset to Parquet and read it back as a sanity check, completing the pipeline from raw data ingestion to reusable columnar storage.
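
Evaluating on held-out data only makes sense if the split keeps every digit represented in both sets. The dependency-free sketch below shows one way to build a stratified split and compute accuracy. The helpers `stratified_split` and `accuracy` are hypothetical, the 20% test fraction and the seed are assumed values, and this is not a description of how scikit-learn implements its own splitter.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) so each label keeps roughly the same share in both sets."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # At least one test example per label; the rest go to training.
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


labels = [0] * 10 + [1] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))         # 16 4
print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.75
```

In practice a library splitter is the better choice. The sketch is only meant to show why stratification gives a per-class report that covers all ten digits.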

In this tutorial, we built a production-style data workflow using Daft, moving from raw JSON ingestion to feature engineering, aggregation, model training, and Parquet persistence. We demonstrated how to integrate advanced UDF logic, perform efficient groupby and join operations, and materialize results for downstream machine learning, all within a clean, scalable framework. Through this process, we saw how Daft enables us to handle complex transformations while remaining Pythonic and efficient. We finished with a reusable, end-to-end pipeline that showcases how we can combine modern data engineering and machine learning workflows in a unified environment.




Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.


