
10 NumPy One-Liners to Simplify Feature Engineering

By Josh
July 16, 2025
in AI, Analytics and Automation


Image by Author | Ideogram

When building machine learning models, most developers focus on model architectures and hyperparameter tuning. However, the real competitive advantage comes from crafting representative features that help your model understand the underlying patterns in your data.


While libraries like Pandas and Scikit-learn provide excellent tools for this task, NumPy’s vectorized operations can make feature engineering both faster and more elegant.

In this article, we’ll explore 10 powerful NumPy one-liners that can simplify your feature engineering workflow. These techniques use NumPy’s broadcasting, advanced indexing, and mathematical functions to create new features efficiently.

🔗 Link to the code on GitHub

1. Robust Scaling with Median Absolute Deviation

Standard scaling works well for normally distributed data, but it breaks down when outliers are present. A single extreme value can completely skew your normalization, making your features less useful for machine learning models.

This is especially problematic in domains like finance or web analytics where outliers often contain important information. Median Absolute Deviation (MAD) scaling provides a robust alternative that can tolerate up to half the data being outliers before the scale estimate breaks down.

import numpy as np

# Sample data with outliers
data = np.array([1, 200, 3, 10, 4, 50, 6, 9, 3, 100])

# One-liner: Robust scaling using MAD
scaled = (data - np.median(data)) / np.median(np.abs(data - np.median(data)))
print(scaled)

Output:

[-1.44444444 42.77777778 -1.          0.55555556 -0.77777778
  9.44444444 -0.33333333  0.33333333 -1.         20.55555556]

This one-liner works by first centering the data around the median (data - np.median(data)), then dividing by the MAD. The MAD is the median of the absolute deviations from the median. This gives you a robust measure of scale that outliers can’t corrupt, while still preserving the relative importance of extreme values.
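If you want the MAD-scaled values to be roughly comparable to ordinary z-scores, a common convention (not part of the one-liner above) is to multiply the MAD by the consistency constant 1.4826, which makes it match the standard deviation for normally distributed data. A minimal sketch:

# Optional: scale the MAD by ~1.4826 so the result is comparable
# to a z-score when the data is roughly normal
mad = np.median(np.abs(data - np.median(data)))
robust_z = (data - np.median(data)) / (1.4826 * mad)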

2. Binning Continuous Variables with Quantiles

Converting continuous variables into categorical bins is essential for many algorithms and can help capture non-linear relationships. Equal-width binning often creates imbalanced groups, especially with skewed data. With quantile-based binning, however, you get roughly the same number of samples in each bin.

This technique is particularly useful when you need to discretize variables for tree-based models or when creating interpretable features for business stakeholders.

# Sample continuous data (e.g., customer ages)
ages = np.array([18, 25, 35, 22, 45, 67, 23, 29, 34, 56, 41, 38, 52, 28, 33])

# One-liner: Create 4 equal-frequency bins labeled 0 through 3
binned = np.digitize(ages, np.percentile(ages, [25, 50, 75]))
print(binned)

Output:

[0 0 2 0 3 3 0 1 2 3 2 2 3 1 1]

The np.percentile() function calculates the quantile boundaries (25th, 50th, and 75th percentiles), then np.digitize() assigns each value to its bin, producing labels 0 through 3. This approach automatically handles skewed distributions and creates meaningful groups regardless of the underlying data distribution.
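As a quick sanity check (not shown in the original snippet), you can count the samples per bin to confirm the groups are roughly balanced:

# Equal-frequency binning should give roughly equal counts per bin
print(np.bincount(binned))  # [4 3 4 4]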

3. Polynomial Features Without Loops

Polynomial features help capture non-linear relationships between variables. Traditional approaches often involve nested loops or complex library calls.

Creating polynomial features is important when you suspect interaction effects between variables, such as the relationship between temperature and humidity affecting crop yields, or how price and quality together influence customer satisfaction.

# Original features (e.g., temperature, humidity)
X = np.array([[20, 65], [25, 70], [30, 45], [22, 80]])

# One-liner: Generate degree-2 polynomial features
poly_features = np.column_stack([X[:, [i, j]].prod(axis=1) for i in range(X.shape[1]) for j in range(i, X.shape[1])])
print(poly_features)

Output:

[[ 400 1300 4225]
 [ 625 1750 4900]
 [ 900 1350 2025]
 [ 484 1760 6400]]

This list comprehension creates all possible polynomial combinations by iterating through column pairs and computing their product. We use np.column_stack() to assemble these into a feature matrix. The result includes both squared terms (x₁², x₂²) and interaction terms (x₁x₂), giving your model access to non-linear relationships.
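In practice you often want the original degree-1 features alongside the squared and interaction terms. A minimal way to combine them (an addition, not part of the original one-liner):

# Prepend the original columns to the polynomial terms
full_features = np.column_stack([X, poly_features])  # shape (4, 5)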

4. Lag Features for Time Series

Time series analysis often requires features that capture temporal dependencies. Lag features let your model access historical values, which is essential for forecasting and anomaly detection.

Creating lag features usually requires loops with quite careful index management. This vectorized approach generates all desired lags simultaneously while handling edge cases automatically.

# Time series data (e.g., daily sales)
sales = np.array([100, 98, 120, 130, 74, 145, 110, 140, 65, 105, 135])

# One-liner: Create lag-1, lag-2, and lag-3 features
lags = np.column_stack([np.roll(sales, shift) for shift in [1, 2, 3]])[3:]
print(lags)

Output:

[[120  98 100]
 [130 120  98]
 [ 74 130 120]
 [145  74 130]
 [110 145  74]
 [140 110 145]
 [ 65 140 110]
 [105  65 140]]

The np.roll() function shifts array elements by the specified number of positions, wrapping values around from the end of the array. The list comprehension creates multiple shifted versions, and np.column_stack() combines them into a feature matrix. Slicing with [3:] removes the initial rows, where the lagged values are wrapped-around data rather than true history, ensuring clean training data.
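To use the lag matrix for supervised learning, you would align it with a target vector. A minimal sketch under the same setup:

# Each row of `lags` holds [t-1, t-2, t-3]; the matching target is the
# value at time t, so drop the same leading rows from `sales`
y = sales[3:]
print(lags.shape, y.shape)  # (8, 3) (8,)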

5. One-Hot Encoding Without pandas

One-hot encoding is essential for handling categorical variables in machine learning. While pandas provides convenient methods, a pure NumPy implementation can be faster and more memory-efficient for large datasets.

This approach is particularly valuable when working with high-cardinality categorical features.

# Categorical data (e.g., product categories)
categories = np.array([0, 1, 2, 1, 0, 2, 3, 1])

# One-liner: One-hot encode
one_hot = (categories[:, None] == np.arange(categories.max() + 1)).astype(int)
print(one_hot)

Output:

[[1 0 0 0]
 [0 1 0 0]
 [0 0 1 0]
 [0 1 0 0]
 [1 0 0 0]
 [0 0 1 0]
 [0 0 0 1]
 [0 1 0 0]]

This technique uses broadcasting to compare each category value against all possible categories. The [:, None] reshapes the array to enable broadcasting, and np.arange(categories.max() + 1) creates the comparison range. The boolean result is converted to integers, creating a binary matrix where each row represents one sample and each column represents one category.
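An equivalent NumPy idiom (an alternative, not the author's snippet) indexes rows of an identity matrix, which some readers find more readable:

# Row i of the identity matrix is the one-hot vector for category i
one_hot_alt = np.eye(categories.max() + 1, dtype=int)[categories]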

6. Distance Features from Coordinates

Geospatial features often require distance calculations from reference points. This is common in location-based models for delivery optimization, real estate pricing, or demographic analysis.

Computing distances efficiently is crucial when dealing with large datasets of coordinates, and this vectorized approach scales well to millions of data points.

# Coordinate data (latitude, longitude)
locations = np.array([[40.7128, -74.0060],
                      [34.0522, -118.2437],
                      [41.8781, -87.6298],
                      [29.7604, -95.3698]])
reference = np.array([39.7392, -104.9903])

# One-liner: Calculate Euclidean distances from reference point
distances = np.sqrt(((locations - reference) ** 2).sum(axis=1))
print(distances)

print(distances)

Output:

[30.99959263 14.42201722 17.4917653  13.86111358]

This uses NumPy’s broadcasting to subtract the reference point from all locations simultaneously. The squared differences are summed along axis 1, and the square root gives Euclidean distances. For more precise geographic distances, you could extend this to use the haversine formula.
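For reference, here is one way that haversine extension might look. This is a sketch assuming coordinates are [latitude, longitude] in degrees and using a mean Earth radius of about 6371 km:

# Vectorized haversine distances (km) from the reference point
lat1, lon1 = np.radians(locations).T
lat2, lon2 = np.radians(reference)
a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
km = 2 * 6371 * np.arcsin(np.sqrt(a))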

7. Interaction Features Between Variable Pairs

Feature interactions often reveal hidden patterns that individual features miss. This is common in domains like marketing (price × quality), medicine (drug interactions), or finance (volatility × volume).

Creating all pairwise interactions manually is tedious and error-prone. This vectorized approach generates them systematically and efficiently.

# Sample features (e.g., price, quality, brand_score)
features = np.array([[10, 8, 7], [15, 9, 6], [12, 7, 8], [20, 10, 9]])

# One-liner: Create all pairwise interactions
interactions = np.array([features[:, i] * features[:, j]
                         for i in range(features.shape[1])
                         for j in range(i + 1, features.shape[1])]).T
print(interactions)

Output:

[[ 80  70  56]
 [135  90  54]
 [ 84  96  56]
 [200 180  90]]

The nested comprehension generates all unique pairs of features (avoiding duplicates like feature 1 × feature 2 and feature 2 × feature 1). The .T transposes the result so each row represents a sample and each column represents an interaction term. This systematic approach ensures you don’t miss important feature combinations.
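An equivalent loop-free formulation (an alternative to the comprehension above) uses np.triu_indices to generate the index pairs directly:

# Upper-triangle indices (k=1 excludes the diagonal, i.e. squared terms)
i, j = np.triu_indices(features.shape[1], k=1)
interactions_alt = features[:, i] * features[:, j]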

8. Rolling Window Statistics

Rolling statistics smooth noisy data and capture local trends. This is essential for time series analysis, signal processing, and creating features that represent recent behavior rather than historical averages.

Traditional approaches often involve loops or complex pandas operations. This convolution-based method is both elegant and efficient.

# Noisy signal data (e.g., stock prices, sensor readings)
signal = np.array([10, 27, 12, 18, 11, 19, 20, 26, 12, 19, 25, 31, 28])
window_size = 4

# One-liner: Create rolling mean features
rolling_mean = np.convolve(signal, np.ones(window_size) / window_size, mode='valid')
print(rolling_mean)

Output:

[16.75 17.   15.   17.   19.   19.25 19.25 20.5  21.75 25.75]

Convolution naturally implements rolling window operations. The np.ones(window_size)/window_size creates a uniform averaging kernel, and mode='valid' ensures the output only includes positions where the window fully overlaps with the data. This approach extends easily to other window functions like Gaussian or exponential weighting.
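As an illustration of swapping in a different window function (a sketch; the bandwidth of 1.0 is an arbitrary choice), here is a Gaussian-weighted rolling mean:

# Gaussian kernel centered on the window, normalized to sum to 1
offsets = np.arange(window_size) - (window_size - 1) / 2
kernel = np.exp(-0.5 * (offsets / 1.0) ** 2)
kernel /= kernel.sum()
rolling_gauss = np.convolve(signal, kernel, mode='valid')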

9. Outlier Indicator Features

Rather than removing outliers, creating features that flag their presence can provide valuable information to your model. This is particularly useful in fraud detection, quality control, or any domain where anomalies are meaningful.

This approach preserves the information content of outliers while preventing them from dominating your model’s training process.

# Data with potential outliers (e.g., transaction amounts)
amounts = np.array([25, 30, 28, 32, 500, 29, 31, 27, 33, 26])

# One-liner: Create outlier indicator features
outlier_flags = ((amounts < np.percentile(amounts, 5)) |
                 (amounts > np.percentile(amounts, 95))).astype(int)
print(outlier_flags)

Output:

[1 0 0 0 1 0 0 0 0 0]

This technique uses the 5th and 95th percentiles as outlier thresholds, flagging any values outside this range. The boolean result is converted to integers, creating binary features that indicate anomalous observations. You can adjust the percentile thresholds based on your domain knowledge and the acceptable false positive rate.
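If fixed percentile cutoffs feel too rigid, a common alternative (not used in the snippet above) is the 1.5 × IQR rule:

# Flag values beyond 1.5 * IQR from the quartiles
q1, q3 = np.percentile(amounts, [25, 75])
iqr = q3 - q1
iqr_flags = ((amounts < q1 - 1.5 * iqr) | (amounts > q3 + 1.5 * iqr)).astype(int)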

10. Frequency Encoding for Categorical Variables

Frequency encoding replaces categorical values with their occurrence counts, which can be more informative than arbitrary label encoding. This is particularly useful when category frequency correlates with your target variable.

# Categorical data (e.g., product categories)
categories = np.array(['Electronics', 'Books', 'Electronics', 'Clothing',
                       'Books', 'Electronics', 'Home', 'Books'])

# One-liner: Frequency encode
unique_cats, counts = np.unique(categories, return_counts=True)
freq_encoded = np.array([counts[np.where(unique_cats == cat)[0][0]] for cat in categories])
print(freq_encoded)

Output:

[3 3 3 1 3 3 1 3]

This approach first uses np.unique() to find all unique categories and their counts. Then, for each original category value, it looks up the corresponding frequency count. The result is a numerical feature where each value represents how often that category appears in the dataset, providing the model with information about category popularity or rarity.
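Note that the lookup loop above can be replaced with a fully vectorized version using return_inverse (an equivalent alternative, not the author's original):

# `inverse` maps each element to its index in the unique array,
# so indexing `counts` with it recovers the frequency per element
_, inverse, counts = np.unique(categories, return_inverse=True, return_counts=True)
freq_encoded_alt = counts[inverse]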

Best Practices for Feature Engineering

When you’re creating new, more representative features, please keep the following in mind:

Memory efficiency: When working with large datasets, consider the memory implications of feature engineering. Some operations can significantly increase your dataset size.

Feature selection: More features aren’t always better. Use techniques like correlation analysis or feature importance to select the most relevant engineered features.

Validation: Always validate your engineered features on a holdout set to ensure they improve model performance and don’t cause overfitting.

Domain knowledge: The best engineered features often come from understanding your domain. These NumPy techniques are tools to implement your domain insights efficiently.

Conclusion

These NumPy one-liners are practical solutions to common feature engineering challenges.

Whether you’re working with time series, geospatial data, or traditional tabular datasets, these techniques will help you build more efficient and maintainable feature engineering pipelines. The key is knowing when to use each approach and how to combine them to extract the maximum signal from your data.

Remember that the best feature engineering technique is the one that helps your model learn the patterns specific to your problem domain. Use these one-liners as building blocks, but always validate their effectiveness through proper cross-validation and domain expertise. Happy feature engineering!


