
7 Pandas Tricks to Improve Your Machine Learning Model Development
Image by Author | ChatGPT
Introduction
If you’re reading this, you are likely already aware that the performance of a machine learning model is not just a function of the chosen algorithm; it is also heavily influenced by the quality and representation of the data the model is trained on.
Data preprocessing and feature engineering are among the most important steps in any machine learning workflow, and in the Python ecosystem, Pandas is the go-to library for these data manipulation tasks. Mastering a few select Pandas data transformation techniques can significantly streamline your workflow, make your code cleaner and more efficient, and ultimately lead to better-performing models.
This tutorial will walk you through seven practical Pandas scenarios and the tricks that can enhance your data preparation and feature engineering process, setting you up for success in your next machine learning project.
Preparing Our Data
To demonstrate these tricks, we’ll use the classic Titanic dataset. This is a useful example because it contains a mix of numerical and categorical data, as well as missing values: exactly the challenges you will frequently encounter in real-world machine learning tasks.
We can easily load the dataset into a Pandas DataFrame directly from a URL.
import pandas as pd
import numpy as np

# Load the Titanic dataset from a URL
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)

# Output shape and first 5 rows
print("Dataset shape:", df.shape)
print(df.head())
Output:
Dataset shape: (891, 12)
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
This gives us a DataFrame with columns like Survived (our target variable), Pclass (passenger class), Sex, Age, and more.
Now, let’s reach into our bag of tricks.
1. Using query() for Cleaner Data Filtering
Filtering data is a never-ending task, whether you are creating subsets for training or exploring specific segments. The standard approach, boolean indexing, can become clumsy and convoluted with multiple conditions. The query() method offers a more readable and intuitive alternative by letting you filter with a string expression.
Standard Filtering
# Filter for first-class passengers over 30 who survived
filtered_df = df[(df['Pclass'] == 1) & (df['Age'] > 30) & (df['Survived'] == 1)]
print(filtered_df.head())
Filtering with query()
# Same filter, but using the query() method
query_df = df.query('Pclass == 1 and Age > 30 and Survived == 1')
print(query_df.head())
Same output:
    PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch    Ticket     Fare Cabin Embarked
1             2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0  PC 17599  71.2833   C85        C
3             4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0    113803  53.1000  C123        S
11           12         1       1                           Bonnell, Miss. Elizabeth  female  58.0      0      0    113783  26.5500  C103        S
52           53         1       1           Harper, Mrs. Henry Sleeper (Myna Haxtun)  female  49.0      1      0  PC 17572  76.7292   D33        C
61           62         1       1                                Icard, Miss. Amelie  female  38.0      0      0    113572  80.0000   B28      NaN
I doubt you would disagree that the query() version is cleaner and easier to read, especially as the number of conditions grows.
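As a bonus, query() can also reference Python variables with the @ prefix, so thresholds don’t have to be hard-coded into the string. Here is a minimal sketch (the variable names are our own):

# Reference local variables inside query() with the @ prefix
min_age = 30
target_class = 1
query_df = df.query('Pclass == @target_class and Age > @min_age and Survived == 1')
print(query_df.shape)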
2. Creating Bins for Continuous Variables with cut()
Some models, linear models in particular, can benefit from discretizing continuous variables, which helps them capture non-linear relationships (tree-based models can find such split points on their own). The pd.cut() function bins data into custom ranges. To demonstrate, let’s create age groups.
# Define the bins and labels for age groups
bins = [0, 12, 18, 60, np.inf]
labels = ['Child', 'Teenager', 'Adult', 'Senior']

# Create the new 'AgeGroup' feature
df['AgeGroup'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)

# Display the counts of each age group
print(df['AgeGroup'].value_counts())
Output:
AgeGroup
Adult       575
Child        68
Teenager     45
Senior       26
Name: count, dtype: int64
This new AgeGroup feature is a powerful categorical variable that your model can now use.
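If you would rather let the data pick the boundaries, pd.qcut() bins by quantiles instead of fixed edges. A quick sketch, kept in a standalone variable so it doesn’t alter the DataFrame used in later sections (the fare_bands name and the four labels are our own choices):

# Quantile-based alternative: four roughly equal-sized fare bins
fare_bands = pd.qcut(df['Fare'], q=4, labels=['Low', 'Mid', 'High', 'VeryHigh'])
print(fare_bands.value_counts())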
3. Extracting Features from Text with the .str Accessor
Text columns often contain valuable, structured information. The .str accessor in Pandas provides a whole host of string processing methods that work on an entire Series at once. We can use it with a regular expression to extract passenger titles (e.g. ‘Mr.’, ‘Miss.’, ‘Dr.’) from the Name column.
# Use a raw-string regular expression to extract titles from the Name column
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

# Display the value counts of the new Title feature
print(df['Title'].value_counts())
Output:
Title
Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Mlle          2
Major         2
Col           2
Countess      1
Capt          1
Ms            1
Sir           1
Lady          1
Mme           1
Don           1
Jonkheer      1
Name: count, dtype: int64
This Title feature has often proven to be a strong predictor of survival in Titanic models.
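A common follow-up is to collapse the rare titles into a single bucket so one-off categories don’t add noise. A sketch of that idea, computed into a separate variable so the Title column used later stays untouched (the threshold of 10 is an arbitrary choice of ours):

# Collapse infrequent titles into a single 'Rare' category
title_counts = df['Title'].value_counts()
rare_titles = title_counts[title_counts < 10].index
title_grouped = df['Title'].replace(list(rare_titles), 'Rare')
print(title_grouped.value_counts())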
4. Performing Advanced Imputation with transform()
Simply dropping rows with missing data is often not an option, as it throws away information the model could otherwise learn from. In many situations, a better strategy is imputation. While filling with a global mean or median is common, a more sophisticated approach is to impute based on a related group. For example, we can fill missing Age values with the median age of passengers in the same Pclass. The groupby() and transform() methods make this straightforward and elegant.
# Check how many Age values are missing before imputation
print("Missing Age values before imputation:", df['Age'].isnull().sum())

# Calculate the median age for each passenger class
median_age_by_pclass = df.groupby('Pclass')['Age'].transform('median')

# Fill missing Age values with the calculated median for the passenger's class
df['Age'] = df['Age'].fillna(median_age_by_pclass)

# Verify that there are no more missing Age values
print("Missing Age values after imputation:", df['Age'].isnull().sum())
Output:
Missing Age values before imputation: 177
Missing Age values after imputation: 0
We did it; there are no more missing ages. This group-based imputation is often more accurate than using a single global value, because the grouping variable (here, passenger class) carries real information about the likely value of the missing entries.
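The reason this works so cleanly is that transform() returns a result aligned to the original DataFrame’s index, one value per row, whereas agg() returns one value per group. A quick illustration of the difference (this comparison is ours, for intuition only):

# agg() collapses to one value per group...
print(df.groupby('Pclass')['Age'].agg('median'))

# ...while transform() broadcasts each group's result back to every row
print(df.groupby('Pclass')['Age'].transform('median').head())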
5. Streamlining Workflows with Method Chaining and pipe()
A machine learning preprocessing pipeline often involves multiple steps. Chaining these operations together can make the code more readable and helps avoid creating unnecessary intermediate DataFrames. The pipe() method takes this a step further by allowing you to integrate your own custom functions into the chain.
First, let’s define a custom function to drop columns, and another to encode the Sex column as 0 for male and 1 for female. Then we can create a pipeline using pipe() that integrates these two custom functions into our chain.
# A custom function to drop columns
def drop_cols(df, cols_to_drop):
    return df.drop(columns=cols_to_drop)

# A custom function to encode 'Sex' as 0 (male) / 1 (female)
def encode_sex(df):
    df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
    return df

# Create a chained pipeline
processed_df = (df.copy()
                .pipe(drop_cols, cols_to_drop=['Ticket', 'Cabin', 'Name'])
                .pipe(encode_sex)
               )

print(processed_df.head())
And our output:
   PassengerId  Survived  Pclass  Sex   Age  SibSp  Parch     Fare Embarked AgeGroup Title
0            1         0       3    0  22.0      1      0   7.2500        S    Adult    Mr
1            2         1       1    1  38.0      1      0  71.2833        C    Adult   Mrs
2            3         1       3    1  26.0      0      0   7.9250        S    Adult  Miss
3            4         1       1    1  35.0      1      0  53.1000        S    Adult   Mrs
4            5         0       3    0  35.0      0      0   8.0500        S    Adult    Mr
This approach is effective for building clean, reproducible machine learning pipelines.
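Because pipe() returns the DataFrame, it also composes cleanly with built-in chain methods like assign() and query(). Here is a sketch of how the same pipeline might grow (the FamilySize feature and the fare filter are our own illustrative additions, not part of the pipeline above):

# Extend the chain with a derived feature and a filter
processed_df = (df.copy()
                .pipe(drop_cols, cols_to_drop=['Ticket', 'Cabin', 'Name'])
                .pipe(encode_sex)
                .assign(FamilySize=lambda d: d['SibSp'] + d['Parch'] + 1)
                .query('Fare > 0')
               )
print(processed_df[['Sex', 'FamilySize', 'Fare']].head())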
6. Mapping Ordinal Categories Efficiently with map()
While one-hot encoding is standard for nominal categorical data, ordinal data (where categories have a natural order) is better handled by mapping to integers. A dictionary and the map() method are perfect for this. For demonstration purposes, let’s imagine the Embarked port of embarkation has a natural ordering.
# Let's assume Embarked has an order: S > C > Q
embarked_mapping = {'S': 2, 'C': 1, 'Q': 0}
df['Embarked_mapped'] = df['Embarked'].map(embarked_mapping)

print(df[['Embarked', 'Embarked_mapped']].head())
And here is our output:
  Embarked  Embarked_mapped
0        S              2.0
1        C              1.0
2        S              2.0
3        S              2.0
4        S              2.0
This is a fast and explicit way to encode ordinal relationships for your model to learn.
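One caveat worth knowing: map() returns NaN for any value not found in the dictionary, which is also why the two passengers with a missing Embarked value end up as NaN above. A simple guard, with a fill strategy that is our own choice rather than the only option:

# Count values the mapping couldn't resolve (includes the 2 missing Embarked entries)
print("Unmapped values:", df['Embarked_mapped'].isnull().sum())

# One option: fill with the most common mapped code
embarked_filled = df['Embarked_mapped'].fillna(df['Embarked_mapped'].mode()[0])
print("After filling:", embarked_filled.isnull().sum())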
7. Optimizing Memory with astype()
When working with large datasets, memory usage can become a bottleneck. Pandas defaults to larger data types (like int64 and float64), but you can often use smaller types without losing information. Converting object columns to the category dtype is an especially effective optimization.
# Check original memory usage (info() prints directly, so no print() wrapper needed)
print("Original memory usage:")
df.info(memory_usage='deep')

# Optimize data types
df_optimized = df.copy()
df_optimized['Pclass'] = df_optimized['Pclass'].astype('int8')
df_optimized['Sex'] = df_optimized['Sex'].astype('category')
df_optimized['Age'] = df_optimized['Age'].astype('float32')
df_optimized['Embarked'] = df_optimized['Embarked'].astype('category')

# Check new memory usage
print("\nOptimized memory usage:")
df_optimized.info(memory_usage='deep')
The output:
Original memory usage:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   PassengerId      891 non-null    int64
 1   Survived         891 non-null    int64
 2   Pclass           891 non-null    int64
 3   Name             891 non-null    object
 4   Sex              891 non-null    object
 5   Age              891 non-null    float64
 6   SibSp            891 non-null    int64
 7   Parch            891 non-null    int64
 8   Ticket           891 non-null    object
 9   Fare             891 non-null    float64
 10  Cabin            204 non-null    object
 11  Embarked         889 non-null    object
 12  AgeGroup         714 non-null    category
 13  Title            891 non-null    object
 14  Embarked_mapped  889 non-null    float64
dtypes: category(1), float64(3), int64(5), object(6)
memory usage: 338.9 KB

Optimized memory usage:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   PassengerId      891 non-null    int64
 1   Survived         891 non-null    int64
 2   Pclass           891 non-null    int8
 3   Name             891 non-null    object
 4   Sex              891 non-null    category
 5   Age              891 non-null    float32
 6   SibSp            891 non-null    int64
 7   Parch            891 non-null    int64
 8   Ticket           891 non-null    object
 9   Fare             891 non-null    float64
 10  Cabin            204 non-null    object
 11  Embarked         889 non-null    category
 12  AgeGroup         714 non-null    category
 13  Title            891 non-null    object
 14  Embarked_mapped  889 non-null    float64
dtypes: category(3), float32(1), float64(2), int64(4), int8(1), object(4)
memory usage: 241.3 KB
You will often see a significant reduction in the memory footprint, which can become important for training models on large datasets without crashing your machine.
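For wider DataFrames, choosing dtypes by hand becomes tedious. pd.to_numeric() with its downcast argument can automate the numeric side; the loop below is a minimal sketch of that idea (the copy and the column selection are our own scaffolding):

# Automatically downcast numeric columns to the smallest safe types
df_small = df.copy()
for col in df_small.select_dtypes(include='integer').columns:
    df_small[col] = pd.to_numeric(df_small[col], downcast='integer')
for col in df_small.select_dtypes(include='float').columns:
    df_small[col] = pd.to_numeric(df_small[col], downcast='float')
df_small.info(memory_usage='deep')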
Wrapping Up
Machine learning always starts with well-prepared data. While the complexity of algorithms, their hyperparameters, and the model-building process often capture the spotlight, the efficient manipulation of data is where the real leverage lies.
The seven Pandas tricks covered here are more than just coding shortcuts — they represent powerful strategies for cleaning your data, engineering insightful features, and building robust, reproducible models.