Logistic Regression in Python – Splitting Data
One of the most important steps in machine learning is splitting data into training and testing sets. Before building a Logistic Regression model, we must divide the dataset properly so that we can evaluate how well the model performs on unseen data.
If we train and test a model on the same data, the results will be overly optimistic. That is why data splitting is essential for building machine learning systems that hold up in real-world use.
In this tutorial, you will learn how to split data for Logistic Regression in Python using Scikit-Learn, along with best practices for avoiding common pitfalls.
Why Splitting Data is Important
Data splitting helps us measure how well a model generalizes to new data.
It provides:
- A fair evaluation of the model
- Detection of overfitting
- Detection of other performance issues
- A realistic estimate of real-world prediction accuracy
- A reliable machine learning workflow
Without splitting, we cannot tell whether the model has learned general patterns or simply memorized the dataset.
What is Train-Test Split?
Train-test split divides the dataset into two parts:
Training Set
Used to train the Logistic Regression model.
- Usually 70% to 80% of the data
- Model learns patterns here
Testing Set
Used to evaluate the model.
- Usually 20% to 30% of the data
- Model is tested on unseen data
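As a rough illustration of the proportions above, the sketch below splits a made-up array of 100 rows into 80% training and 20% testing data. The toy data and the variable names are assumptions for illustration only; they are not part of the tutorial's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 2 features, alternating binary labels.
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape)  # (80, 2)
print(X_test.shape)   # (20, 2)
```

The same call with test_size=0.3 would give a 70/30 split instead.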
Basic Concept
Dataset
↓
-------------------------
| Training Data (80%) |
| Testing Data (20%) |
-------------------------
Import Required Library
We use Scikit-Learn for splitting data.
from sklearn.model_selection import train_test_split
Example Dataset
import pandas as pd
data = pd.DataFrame({
'Age': [22, 25, 30, 35, 40, 45, 50],
'Salary': [25000, 30000, 45000, 60000, 70000, 85000, 95000],
'Purchased': [0, 0, 0, 1, 1, 1, 1]
})
print(data)
Defining Features and Target
Before splitting, separate input and output variables.
X = data[['Age', 'Salary']]
y = data['Purchased']
- X → Input features
- y → Target variable
Performing Train-Test Split
Now we split the dataset.
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
test_size=0.25,
random_state=42
)
Explanation of Parameters
X
Input features dataset.
y
Target labels.
test_size
Proportion of the data used for testing, given as a fraction between 0 and 1.
Example:
test_size = 0.25 → 25% test data, 75% training data
random_state
Seeds the random shuffling so that the split is reproducible.
random_state = 42 → same split every time
Checking Split Results
print("Training set size:", X_train.shape)
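print("Testing set size:", X_test.shape)
Example output:
Training set size: (5, 2)
Testing set size: (2, 2)
With 7 rows, 25% is 1.75 rows, which Scikit-Learn rounds up to 2 test rows, leaving 5 for training.
Viewing Training Data
print(X_train)
Viewing Testing Data
print(X_test)
Why Not Train on Full Data?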
If we use all data for training:
- The model may memorize the data
- There is no unseen data left to measure accuracy on
- Overfitting goes undetected
- Real-world performance remains unknown
Train-test split prevents this problem.
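One way to see the difference is to score the same model on the data it was trained on and on held-out data. The sketch below does this with a synthetic dataset from make_classification, which is an assumption made for illustration and not the tutorial's dataset. Training accuracy is often, though not always, the higher of the two numbers.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data, used here only for illustration.
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

# Keep 25% of the rows aside as unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Accuracy on data the model has already seen vs. data it has not.
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"Accuracy on training data: {train_acc:.3f}")
print(f"Accuracy on held-out data: {test_acc:.3f}")
```

Only the held-out figure says anything about how the model might behave on new data.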
Data Leakage Problem
Data leakage happens when test data influences training.
Example mistakes:
- Scaling before splitting
- Using test data in training
- Feature engineering on full dataset
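To make the first mistake concrete, the sketch below fits one StandardScaler on the full dataset and another on the training rows only, then compares the means each one learned. It reuses the tutorial's example dataset; the names leaky_scaler and clean_scaler are made up for this illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({
    'Age': [22, 25, 30, 35, 40, 45, 50],
    'Salary': [25000, 30000, 45000, 60000, 70000, 85000, 95000],
    'Purchased': [0, 0, 0, 1, 1, 1, 1]
})
X = data[['Age', 'Salary']]

X_train, X_test = train_test_split(X, test_size=0.25, random_state=42)

# Leaky: the scaler's statistics are computed from test rows as well.
leaky_scaler = StandardScaler().fit(X)

# Clean: the scaler only ever sees training rows.
clean_scaler = StandardScaler().fit(X_train)

print("Means learned from full data:    ", leaky_scaler.mean_)
print("Means learned from training data:", clean_scaler.mean_)
```

The two sets of means differ, which shows that the scaler fitted on the full dataset has absorbed information from the test rows.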
Correct approach:
Split → Train → Test
NOT
Train + Test together
Best Practice Workflow
A standard Logistic Regression pipeline:
1. Load Data
2. Clean Data
3. Split Data
4. Feature Engineering (fitted on training data only)
5. Train Model
6. Evaluate Model
Train-Test Split with Scaling (Correct Way)
Scaling should be done AFTER splitting.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Important rule:
- Fit only on training data
- Transform both training and testing data
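One possible way to enforce this rule automatically is to wrap the scaler and the model in a Scikit-Learn pipeline, so that calling fit on the training data fits the scaler on those rows only. This is a sketch of an alternative, not the approach used in the rest of this tutorial. With only two test rows, the printed accuracy is far too noisy to be meaningful.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({
    'Age': [22, 25, 30, 35, 40, 45, 50],
    'Salary': [25000, 30000, 45000, 60000, 70000, 85000, 95000],
    'Purchased': [0, 0, 0, 1, 1, 1, 1]
})
X = data[['Age', 'Salary']]
y = data['Purchased']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# The pipeline fits the scaler on X_train only, then trains the model.
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X_train, y_train)

# The scaler's learned means match the training-set means.
fitted_scaler = pipeline.named_steps["standardscaler"]
scaler_uses_train_only = np.allclose(
    fitted_scaler.mean_, X_train.mean().to_numpy()
)
print("Scaler fitted on training data only:", scaler_uses_train_only)

# At prediction time the pipeline applies transform (not fit) to X_test.
print("Test accuracy:", pipeline.score(X_test, y_test))
```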
Complete Example Code
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Small example dataset
data = pd.DataFrame({
    'Age': [22, 25, 30, 35, 40, 45, 50],
    'Salary': [25000, 30000, 45000, 60000, 70000, 85000, 95000],
    'Purchased': [0, 0, 0, 1, 1, 1, 1]
})
# Separate the input features (X) from the target (y)
X = data[['Age', 'Salary']]
y = data['Purchased']
# Hold out 25% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
# Fit the scaler on the training data only, then apply it to both sets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
print("Data successfully split and prepared")
Advantages of Train-Test Split
- Reliable evaluation
- A realistic estimate of generalization
- Helps detect overfitting
- Simulates real-world use on unseen data
- Improved model validation
Common Mistakes
Avoid:
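- Evaluating the model on the same data used for training
- Scaling before splitting
- Omitting random_state, which makes results hard to reproduce
- Choosing a test size too small to give a stable estimate, or so large that too little data is left for training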
- Data leakage issues
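The random_state item in the list above can be checked directly. The sketch below runs train_test_split twice with the same seed on a made-up array (an assumption for illustration) and confirms that both runs return identical pieces.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 features.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Two independent calls with the same seed.
first = train_test_split(X, y, test_size=0.2, random_state=42)
second = train_test_split(X, y, test_size=0.2, random_state=42)

# Compare X_train, X_test, y_train, y_test pairwise.
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print("Identical splits with the same random_state:", same_split)
```

Leaving random_state unset means each run may shuffle the rows differently, so reported metrics can change from run to run.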
Real-World Importance
Train-test split is used in:
- Fraud detection systems
- Customer prediction models
- Medical diagnosis systems
- Recommendation systems
- Financial risk models
It ensures that machine learning models perform well in real-world scenarios.
Conclusion
Splitting data is a critical step in Logistic Regression modeling. It ensures that the model is trained on one portion of data and tested on unseen data, providing a realistic evaluation of performance.
By correctly applying train-test split in Python using Scikit-Learn, you can build more reliable, accurate, and production-ready machine learning models.

