Logistic Regression in Python – Preparing Data
Data preparation is one of the most important stages in building a machine learning model. Before training a Logistic Regression model, we must ensure that the dataset is clean, structured, and properly formatted.
Even the best algorithm will fail if the input data is not well-prepared. That is why data preparation plays a critical role in achieving accurate predictions.
In this tutorial, you will learn how to prepare data for Logistic Regression in Python using Scikit-Learn, step by step.
Why Data Preparation Matters
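Most machine learning models, Logistic Regression included, cannot work directly with raw data: missing values, text categories, and features on very different scales all have to be handled first.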
Proper data preparation helps to:
- Improve model accuracy
- Reduce errors during training
- Ensure consistency in data format
- Improve learning speed
- Prevent data leakage
- Enhance model performance
Without proper preparation, Logistic Regression may produce unreliable results.
Steps in Data Preparation
A typical Logistic Regression data preparation workflow includes:
1. Load Dataset
2. Inspect Data
3. Handle Missing Values
4. Encode Categorical Variables
5. Select Features
6. Scale Features
7. Split Data
8. Validate Dataset
Each step ensures the dataset is ready for training.
Step 1: Load the Dataset
The first step is loading the dataset using Pandas.
import pandas as pd
data = pd.read_csv("data/customers.csv")
print(data.head())
Example output:
   Age  Salary  Gender  Purchased
0   22   25000    Male          0
1   25   30000  Female          0
2   30   45000  Female          1
Step 2: Inspect the Data
Before preprocessing, inspect the dataset:
# info() prints its report directly and returns None, so it needs no print()
data.info()
print(data.describe())
Check for:
- Missing values
- Incorrect data types
- Outliers
- Data imbalance
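The four checks above can be sketched in a few lines of Pandas. The small DataFrame below is made-up sample data that mirrors the tutorial's column names, and the 1.5 × IQR rule for outliers is one common convention rather than something this tutorial prescribes.

```python
import pandas as pd

# Hypothetical sample data with the same columns as customers.csv
sample = pd.DataFrame({
    "Age": [22, 25, 30, 41, 35, None],
    "Salary": [25000, 30000, 45000, 52000, 400000, 38000],
    "Gender": ["Male", "Female", "Female", "Male", "Female", "Male"],
    "Purchased": [0, 0, 1, 1, 1, 0],
})

# Missing values per column
missing_counts = sample.isnull().sum()

# Data types: non-numeric columns will need encoding later
text_columns = sample.select_dtypes(exclude="number").columns.tolist()

# Outliers in Salary using the 1.5 * IQR rule
q1, q3 = sample["Salary"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (sample["Salary"] < q1 - 1.5 * iqr) | (sample["Salary"] > q3 + 1.5 * iqr)
outliers = sample[is_outlier]

# Class balance: share of each target value
class_share = sample["Purchased"].value_counts(normalize=True)

print(missing_counts)
print(text_columns)
print(outliers)
print(class_share)
```

Here the Age column reports one missing value, Gender is flagged as non-numeric, and the 400000 salary is the only row outside the IQR bounds.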
Step 3: Handle Missing Values
Missing values must be treated before training.
Check missing values:
print(data.isnull().sum())
Remove Missing Values
data = data.dropna()
Fill Missing Values
# numeric_only=True skips text columns such as Gender, which have no mean
data = data.fillna(data.mean(numeric_only=True))
Step 4: Encode Categorical Variables
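Scikit-Learn's Logistic Regression accepts only numerical input, so categorical columns such as Gender must be converted to numbers first. Label encoding and one-hot encoding are two alternatives; apply one of them to a column, not both.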
Label Encoding
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
data['Gender'] = le.fit_transform(data['Gender'])
Example:
Male → 1
Female → 0
One-Hot Encoding
data = pd.get_dummies(data, columns=['Gender'])
This converts each category into its own binary column, here Gender_Female and Gender_Male.
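For a two-category column, the two dummy columns carry the same information (one is always the opposite of the other), so many workflows keep just one. A minimal sketch, using made-up sample rows and the `drop_first` and `dtype` parameters of `pd.get_dummies`:

```python
import pandas as pd

# Hypothetical sample rows mirroring the tutorial's columns
sample = pd.DataFrame({
    "Age": [22, 25, 30],
    "Gender": ["Male", "Female", "Female"],
})

# drop_first=True drops the first category (Female, alphabetically),
# leaving a single Gender_Male column; dtype=int gives 0/1 instead of True/False
encoded = pd.get_dummies(sample, columns=["Gender"], drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded["Gender_Male"].tolist())
```

The single Gender_Male column is the one used in the feature selection step.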
Step 5: Feature Selection
Select relevant input variables.
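# Gender_Male is the column produced by one-hot encoding;
# copy() makes X independent of data so it can be modified safely later
X = data[['Age', 'Salary', 'Gender_Male']].copy()
y = data['Purchased']
Features:
- Age
- Salary
- Gender (encoded as Gender_Male)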
Target:
- Purchased
Step 6: Feature Scaling
Scaling puts features on a comparable range, so that large-valued columns such as Salary do not dominate small-valued ones such as Age.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Note: for a leakage-free evaluation, fit the scaler on the training split only
X[['Age', 'Salary']] = scaler.fit_transform(
    X[['Age', 'Salary']]
)
Why Scaling is Important
- Prevents feature dominance
- Improves gradient descent performance
- Enhances model accuracy
- Speeds up convergence
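To make the effect concrete, the sketch below standardizes a few made-up Age and Salary values and checks the result against the formula z = (x - mean) / std, which is what StandardScaler applies to each column (using the population standard deviation).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values: Age in years, Salary in currency units
values = np.array([
    [22, 25000],
    [25, 30000],
    [30, 45000],
    [41, 52000],
], dtype=float)

scaler = StandardScaler()
scaled = scaler.fit_transform(values)

# Same computation by hand: subtract the column mean, divide by the column std
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6))
print(scaled.std(axis=0).round(6))
```

After scaling, both columns have mean 0 and standard deviation 1, so neither dominates purely because of its units.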
Step 7: Train-Test Split
Split dataset into training and testing sets.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,
    random_state=42
)
Step 8: Data Validation
Ensure the dataset is properly structured.
print(X_train.shape)
print(X_test.shape)
Check:
- No missing values
- Proper scaling applied
- Correct feature selection
- Balanced dataset
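These checks can also be written as assertions so that they fail loudly. The sketch below builds a small made-up dataset, splits it with `stratify=y` (an addition not used in the steps above, which keeps the class ratio the same in both splits), fits the scaler on the training split, and then verifies each point in the list.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical dataset: 40 rows with a balanced target
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "Age": rng.integers(18, 65, size=40).astype(float),
    "Salary": rng.integers(20000, 90000, size=40).astype(float),
    "Gender_Male": rng.integers(0, 2, size=40),
})
y = pd.Series([0, 1] * 20, name="Purchased")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler()
X_train = X_train.copy()
X_test = X_test.copy()
X_train[["Age", "Salary"]] = scaler.fit_transform(X_train[["Age", "Salary"]])
X_test[["Age", "Salary"]] = scaler.transform(X_test[["Age", "Salary"]])

# No missing values
assert not X_train.isnull().any().any()
assert not X_test.isnull().any().any()

# Proper scaling: training columns have mean ~0 and std ~1
assert np.allclose(X_train[["Age", "Salary"]].mean(), 0.0, atol=1e-9)
assert np.allclose(X_train[["Age", "Salary"]].std(ddof=0), 1.0, atol=1e-9)

# Correct feature selection: expected columns, identical in both splits
assert list(X_train.columns) == ["Age", "Salary", "Gender_Male"]
assert list(X_test.columns) == list(X_train.columns)

# Balance: class shares in the training split
print(y_train.value_counts(normalize=True))
```

Only the training columns are expected to have exactly zero mean and unit variance; the test split is transformed with the training statistics, so its values will be close to but not exactly standardized.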
Full Data Preparation Code
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
# Load data
data = pd.read_csv("data/customers.csv")
# Encode categorical variable
le = LabelEncoder()
data['Gender'] = le.fit_transform(data['Gender'])
# Feature selection
# copy() makes X independent of data so the scaled values can be assigned safely
X = data[['Age', 'Salary', 'Gender']].copy()
y = data['Purchased']
# Feature scaling (for a leakage-free evaluation, fit the scaler on the training split only)
scaler = StandardScaler()
X[['Age', 'Salary']] = scaler.fit_transform(X[['Age', 'Salary']])
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print("Data is ready for Logistic Regression")
Best Practices for Data Preparation
- Always clean data before training
- Encode all categorical variables
- Scale numerical features
- Avoid data leakage: fit scalers and encoders on the training set only
- Keep preprocessing consistent between training and prediction
- Save preprocessing steps for deployment
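One way to follow the last three points is to bundle preprocessing into a Scikit-Learn Pipeline, so that the same fitted steps are applied at training and prediction time and can be saved as one object. This is a sketch under assumptions: the sample rows are made up, median imputation is one choice among several, and the file name preprocessor.joblib is hypothetical.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical sample data mirroring the tutorial's columns
data = pd.DataFrame({
    "Age": [22, 25, 30, 41, 35, 28, 50, 46, 33, 27, 39, 55],
    "Salary": [25000, 30000, 45000, 52000, None, 38000,
               80000, 72000, 41000, 29000, 60000, 90000],
    "Gender": ["Male", "Female"] * 6,
    "Purchased": [0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1],
})
X = data[["Age", "Salary", "Gender"]]
y = data["Purchased"]

# Split first so that nothing is fitted on the test rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Numeric columns: fill missing values with the median, then standardize
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Gender: one-hot encode, keeping a single column for a binary category
preprocess = ColumnTransformer([
    ("num", numeric, ["Age", "Salary"]),
    ("cat", OneHotEncoder(drop="if_binary"), ["Gender"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression()),
])

# fit() learns imputation, scaling, and encoding from the training split only
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Save the fitted pipeline and load it back, as a deployment would
path = os.path.join(tempfile.gettempdir(), "preprocessor.joblib")
joblib.dump(model, path)
restored = joblib.load(path)
print((restored.predict(X_test) == predictions).all())
```

Because the restored object contains the fitted imputer, scaler, and encoder, new raw rows can be passed to `restored.predict` without repeating any preprocessing by hand.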
Common Mistakes
Avoid:
- Training the model on unscaled data
- Forgetting to encode categories
- Ignoring missing values
- Fitting preprocessing on the combined train and test data
- Using irrelevant features
Real-World Importance
Proper data preparation is essential in:
- Fraud detection systems
- Healthcare prediction models
- Marketing analytics
- Customer behavior analysis
- Financial forecasting systems
Even advanced machine learning models rely heavily on well-prepared data.
Conclusion
Data preparation is the foundation of successful Logistic Regression modeling. By cleaning, encoding, scaling, and splitting data properly, you ensure that your machine learning model learns effectively and produces accurate results.
Mastering data preparation techniques in Python is essential for building real-world classification systems and improving model performance in any machine learning project.

