Logistic Regression in Python – Splitting Data

One of the most important steps in machine learning is splitting data into training and testing sets. Before building a Logistic Regression model, we must divide the dataset properly so that we can evaluate how well the model performs on unseen data.

If we train and test a model on the same data, the measured performance will usually be overly optimistic, because the model has already seen every example it is scored on. That is why data splitting is essential for building machine learning systems that hold up on real-world data.

In this tutorial, you will learn how to split data for Logistic Regression in Python using Scikit-Learn, along with best practices and common pitfalls.

Why Splitting Data is Important

Data splitting helps us measure how well a model generalizes to new data.

It provides:

Fair model evaluation
Detection of overfitting
Early warning of model performance issues
A realistic estimate of prediction accuracy
A reliable machine learning workflow

Without a held-out test set, there is no way to tell whether the model has learned general patterns or has simply memorized the training data.

What is Train-Test Split?

Train-test split divides the dataset into two parts:

Training Set

Used to train the Logistic Regression model.

Usually 70% to 80% of the data
Model learns patterns here

Testing Set

Used to evaluate the model.

Usually 20% to 30% of the data
Model is tested on unseen data

Basic Concept

Dataset
   ↓
-------------------------
| Training Data (80%)   |
| Testing Data (20%)    |
-------------------------

Import Required Library

We use Scikit-Learn for splitting data.

from sklearn.model_selection import train_test_split

Example Dataset

import pandas as pd

data = pd.DataFrame({
    'Age': [22, 25, 30, 35, 40, 45, 50],
    'Salary': [25000, 30000, 45000, 60000, 70000, 85000, 95000],
    'Purchased': [0, 0, 0, 1, 1, 1, 1]
})

print(data)

Defining Features and Target

Before splitting, separate input and output variables.

X = data[['Age', 'Salary']]
y = data['Purchased']

X → Input features
y → Target variable

Performing Train-Test Split

Now we split the dataset.

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,
    random_state=42
)

Explanation of Parameters

X

Input features dataset.

y

Target labels.

test_size

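Fraction of the data reserved for testing, given as a float between 0 and 1. An integer is interpreted as an absolute number of test rows.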

Example:

test_size = 0.25 → 25% test data, 75% training data

random_state

Seeds the shuffling so that the split is reproducible.

random_state = 42 → same split on every run (42 is arbitrary; any fixed integer works)
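
The effect of these two parameters can be checked directly. The following is a minimal sketch using the tutorial's toy dataset; the variable names `X_test_a`, `X_test_b`, and `X_test_int` are illustrative and not part of the original example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.DataFrame({
    'Age': [22, 25, 30, 35, 40, 45, 50],
    'Salary': [25000, 30000, 45000, 60000, 70000, 85000, 95000],
    'Purchased': [0, 0, 0, 1, 1, 1, 1]
})
X = data[['Age', 'Salary']]
y = data['Purchased']

# Two calls with the same random_state select the same test rows.
_, X_test_a, _, _ = train_test_split(X, y, test_size=0.25, random_state=42)
_, X_test_b, _, _ = train_test_split(X, y, test_size=0.25, random_state=42)
same_split = X_test_a.index.equals(X_test_b.index)
print("Same test rows on both calls:", same_split)

# An integer test_size asks for an exact number of test rows.
_, X_test_int, _, _ = train_test_split(X, y, test_size=2, random_state=42)
print("Rows in test set with test_size=2:", len(X_test_int))
```

Leaving `random_state` unset produces a different shuffle on each run, which makes results harder to compare between experiments.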

Checking Split Results

print("Training set size:", X_train.shape)
print("Testing set size:", X_test.shape)

Example output (with 7 rows, 0.25 × 7 = 1.75 is rounded up to 2 test rows):

Training set size: (5, 2)
Testing set size: (2, 2)

Viewing Training Data

print(X_train)

Viewing Testing Data

print(X_test)
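
When inspecting a split for a classification problem, it is also worth checking how the classes are distributed, since a purely random split of a small dataset can leave one class under-represented in the test set. The sketch below assumes the same toy dataset and passes `stratify=y`, a standard `train_test_split` parameter that the original example does not use, to keep class proportions similar in both sets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.DataFrame({
    'Age': [22, 25, 30, 35, 40, 45, 50],
    'Salary': [25000, 30000, 45000, 60000, 70000, 85000, 95000],
    'Purchased': [0, 0, 0, 1, 1, 1, 1]
})
X = data[['Age', 'Salary']]
y = data['Purchased']

# stratify=y asks for class proportions in each set to mirror those in y.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Count how many examples of each class landed in each set.
print("Training class counts:")
print(y_train.value_counts())
print("Testing class counts:")
print(y_test.value_counts())
```

Stratification needs at least two examples of every class, so it may not be usable on extremely small or highly imbalanced datasets.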

Why Not Train on Full Data?

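If we use all the data for training:

The model may memorize the data
There is no held-out data on which to measure accuracy
Overfitting goes undetected
Real-world performance remains unknown

A train-test split addresses this by reserving data that the model never sees during training.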
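
To see why a score on the training data is not enough, the two scores can be compared side by side. This is a rough sketch on synthetic data generated with `make_classification`, which is an assumption for illustration rather than the tutorial's dataset. The size of the gap depends on the data and the model, and it is not guaranteed to appear in every run.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustrative only).
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Accuracy on rows the model was fitted on versus rows it has never seen.
train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print("Accuracy on training data:", train_accuracy)
print("Accuracy on held-out data:", test_accuracy)
```

Only the second number says anything about how the model behaves on new data.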

Data Leakage Problem

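Data leakage happens when information from the test data influences training, which makes the test score look better than it really is.

Example mistakes:

Fitting a scaler on the full dataset before splitting
Using test rows during training
Computing feature-engineering statistics (means, encodings) on the full dataset

Correct approach:

Split → fit preprocessing and model on the training set → evaluate on the test set
NOT
Fit preprocessing or the model on training and test data together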

Best Practice Workflow

A standard Logistic Regression workflow that avoids leakage:

1. Load Data
2. Clean Data
3. Split Data
4. Feature Engineering and Scaling (fitted on the training set only)
5. Train Model
6. Evaluate Model on the test set

Train-Test Split with Scaling (Correct Way)

Scaling should be done AFTER splitting.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Important rule:

Fit only on training data
Transform both training and testing data
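
The rule can be verified by inspecting what the scaler learned. In this sketch, which assumes the tutorial's toy dataset, the scaled arrays are stored under the illustrative names `X_train_scaled` and `X_test_scaled` so that the unscaled training frame remains available for comparison.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({
    'Age': [22, 25, 30, 35, 40, 45, 50],
    'Salary': [25000, 30000, 45000, 60000, 70000, 85000, 95000],
    'Purchased': [0, 0, 0, 1, 1, 1, 1]
})
X = data[['Age', 'Salary']]
y = data['Purchased']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean and std from training rows
X_test_scaled = scaler.transform(X_test)        # reuses those training statistics

# The learned means match the training rows, not the full dataset.
print("Scaler means:      ", scaler.mean_)
print("Training set means:", X_train.mean().to_numpy())

# Scaled training columns are centred on zero; test columns generally are not.
print("Scaled training column means:", np.round(X_train_scaled.mean(axis=0), 6))
print("Scaled testing column means: ", np.round(X_test_scaled.mean(axis=0), 6))
```

The test columns not being centred on zero is expected: the test set is treated exactly as new data would be in production.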

Complete Example Code

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({
    'Age': [22, 25, 30, 35, 40, 45, 50],
    'Salary': [25000, 30000, 45000, 60000, 70000, 85000, 95000],
    'Purchased': [0, 0, 0, 1, 1, 1, 1]
})

X = data[['Age', 'Salary']]
y = data['Purchased']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print("Data successfully split and prepared")
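
The example above stops once the data is prepared. As one possible continuation, the sketch below wraps the scaler and a `LogisticRegression` model in a Scikit-Learn pipeline built with `make_pipeline`, so that calling `fit` on the training set fits the scaler and the model together and the test set is only ever transformed. The pipeline and the use of `accuracy_score` are assumptions added for illustration and are not part of the original example.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({
    'Age': [22, 25, 30, 35, 40, 45, 50],
    'Salary': [25000, 30000, 45000, 60000, 70000, 85000, 95000],
    'Purchased': [0, 0, 0, 1, 1, 1, 1]
})
X = data[['Age', 'Salary']]
y = data['Purchased']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# The pipeline fits the scaler and the classifier on the training rows only.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# Predict on the held-out rows and compare with the true labels.
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print("Predictions:", predictions)
print("Test accuracy:", accuracy)
```

With only two test rows the accuracy can take just three values (0.0, 0.5, or 1.0), so a dataset this small is useful for showing the mechanics but not for judging a model.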

Advantages of Train-Test Split

Reliable evaluation
An honest estimate of generalization
Helps detect overfitting
Simulates real-world, unseen data
Improved model validation

Common Mistakes

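Avoid:

Evaluating on the same data used for training
Fitting the scaler before splitting
Omitting random_state, which makes results hard to reproduce
Choosing a test size too small to give a stable estimate, or so large that little data is left for training
Letting test data leak into preprocessing or feature engineering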

Real-World Importance

Train-test split is used in:

Fraud detection systems
Customer prediction models
Medical diagnosis systems
Recommendation systems
Financial risk models

It gives evidence of how a machine learning model is likely to perform in real-world scenarios before it is deployed.

Conclusion

Splitting data is a critical step in Logistic Regression modeling. It ensures that the model is trained on one portion of data and tested on unseen data, providing a realistic evaluation of performance.

By correctly applying a train-test split in Python using Scikit-Learn, you can evaluate your models honestly and build more reliable machine learning systems.

Posted by: Roger John Williams
