Logistic Regression in Python – Splitting Data for Machine Learning | Train-Test Split Guide

Logistic Regression in Python – Splitting Data

One of the most important steps in machine learning is splitting data into training and testing sets. Before building a Logistic Regression model, we must divide the dataset properly so that we can evaluate how well the model performs on unseen data.

If we evaluate a model on the same data it was trained on, the scores will be overly optimistic, because the model has already seen the answers. That is why data splitting is essential for building reliable, real-world machine learning systems.

In this tutorial, you will learn how to split data for Logistic Regression in Python using Scikit-Learn, along with best practices and real-world understanding.


Why Splitting Data is Important

Data splitting helps us measure how well a model generalizes to new data.

It supports:

  • Fair model evaluation
  • Detection of overfitting
  • Detection of model performance issues
  • A realistic estimate of real-world prediction accuracy
  • A reliable machine learning workflow

Without a held-out test set, we cannot tell whether the model has learned general patterns or has simply memorized the dataset.
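
To make the memorization risk concrete, here is a minimal sketch, assuming a synthetic dataset of pure noise. The array sizes, seeds, and variable names are arbitrary choices for illustration and are not part of this tutorial's dataset. Because the labels have no relationship to the features, any training accuracy well above chance reflects memorization rather than learning.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic noise (assumption): 200 rows, 200 features, labels unrelated to features
rng = np.random.default_rng(0)
X_noise = rng.normal(size=(200, 200))
y_noise = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_noise, y_noise, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_tr, y_tr)

train_acc = model.score(X_tr, y_tr)  # accuracy on rows the model has already seen
test_acc = model.score(X_te, y_te)   # accuracy on held-out rows

print("Training accuracy:", train_acc)
print("Testing accuracy:", test_acc)
```

In a setup like this, the training accuracy is typically far higher than the testing accuracy, which tends to stay near chance. Only the held-out score reveals that nothing useful was learned.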


What is Train-Test Split?

Train-test split divides the dataset into two parts:

Training Set

Used to train the Logistic Regression model.

  • Usually 70% to 80% of the data
  • Model learns patterns here

Testing Set

Used to evaluate the model.

  • Usually 20% to 30% of the data
  • Model is tested on unseen data

Basic Concept

Dataset
   ↓
-------------------------
| Training Data (80%)   |
| Testing Data (20%)    |
-------------------------
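
The idea in the diagram can be sketched by hand with NumPy. This is only an illustration of the concept under assumed values (10 rows, a 20% test share, a hypothetical seed), not how Scikit-Learn is implemented internally.

```python
import math
import numpy as np

n_samples = 10       # assumed dataset size for illustration
test_fraction = 0.2  # 20% of rows go to the test set

# Shuffle the row positions so the split does not depend on row order
rng = np.random.default_rng(42)
shuffled = rng.permutation(n_samples)

# The first n_test shuffled rows become the test set, the rest the training set
n_test = math.ceil(test_fraction * n_samples)
test_idx = shuffled[:n_test]
train_idx = shuffled[n_test:]

print("Training rows:", sorted(train_idx.tolist()))
print("Testing rows:", sorted(test_idx.tolist()))
```

Every row lands in exactly one of the two sets, which is the property that keeps the test data unseen.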

Import Required Library

We use Scikit-Learn for splitting data.

from sklearn.model_selection import train_test_split

Example Dataset

import pandas as pd

data = pd.DataFrame({
    'Age': [22, 25, 30, 35, 40, 45, 50],
    'Salary': [25000, 30000, 45000, 60000, 70000, 85000, 95000],
    'Purchased': [0, 0, 0, 1, 1, 1, 1]
})

print(data)

Defining Features and Target

Before splitting, separate input and output variables.

X = data[['Age', 'Salary']]
y = data['Purchased']

  • X → Input features
  • y → Target variable

Performing Train-Test Split

Now we split the dataset.

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,
    random_state=42
)

Explanation of Parameters

X

Input features dataset.

y

Target labels.

test_size

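Fraction of the data used for testing, given as a float between 0 and 1. An integer is interpreted as an absolute number of test samples.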

Example:

test_size = 0.25 → 25% test data, 75% training data
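
As a quick check of both forms of test_size, the sketch below reuses the example dataset from this tutorial. The variable names X_test_frac and X_test_abs are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.DataFrame({
    'Age': [22, 25, 30, 35, 40, 45, 50],
    'Salary': [25000, 30000, 45000, 60000, 70000, 85000, 95000],
    'Purchased': [0, 0, 0, 1, 1, 1, 1]
})
X = data[['Age', 'Salary']]
y = data['Purchased']

# Float: fraction of rows held out (25% of 7 rows is 1.75, rounded up to 2)
_, X_test_frac, _, _ = train_test_split(X, y, test_size=0.25, random_state=42)

# Integer: absolute number of rows held out
_, X_test_abs, _, _ = train_test_split(X, y, test_size=3, random_state=42)

print("Rows held out with test_size=0.25:", len(X_test_frac))
print("Rows held out with test_size=3:", len(X_test_abs))
```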

random_state

Seeds the shuffling that happens before the split, which makes the result reproducible.

random_state = 42 → same split every time
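
A small sketch of what reproducibility means here, using the same example dataset: two calls with the same random_state should select the same test rows. The names split_a, split_b, and same_rows are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.DataFrame({
    'Age': [22, 25, 30, 35, 40, 45, 50],
    'Salary': [25000, 30000, 45000, 60000, 70000, 85000, 95000],
    'Purchased': [0, 0, 0, 1, 1, 1, 1]
})
X = data[['Age', 'Salary']]
y = data['Purchased']

# Two independent calls with the same seed
split_a = train_test_split(X, y, test_size=0.25, random_state=42)
split_b = train_test_split(X, y, test_size=0.25, random_state=42)

# Element 1 of each result is X_test; compare which rows were selected
same_rows = list(split_a[1].index) == list(split_b[1].index)
print("Same test rows in both calls:", same_rows)
```

Without a fixed random_state, each run may pick different rows, so scores can vary from run to run.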

Checking Split Results

print("Training set size:", X_train.shape)
print("Testing set size:", X_test.shape)

Example output (with 7 rows, the 25% test share is rounded up to 2 rows):

Training set size: (5, 2)
Testing set size: (2, 2)
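
Beyond the shapes, it can be worth checking how the classes are distributed across the two sets, especially with a dataset this small. The sketch below is an optional extension rather than part of the tutorial's main code: it assumes you want class proportions preserved and uses the stratify parameter of train_test_split for that.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.DataFrame({
    'Age': [22, 25, 30, 35, 40, 45, 50],
    'Salary': [25000, 30000, 45000, 60000, 70000, 85000, 95000],
    'Purchased': [0, 0, 0, 1, 1, 1, 1]
})
X = data[['Age', 'Salary']]
y = data['Purchased']

# stratify=y asks for class proportions to be kept similar in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print("Training class counts:")
print(y_train.value_counts())
print("Testing class counts:")
print(y_test.value_counts())
```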

Viewing Training Data

print(X_train)

Viewing Testing Data

print(X_test)

Why Not Train on Full Data?

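If we use all data for training:

  • The model may memorize the data without us noticing
  • There is no held-out data to measure accuracy
  • Overfitting goes undetected
  • Real-world performance becomes unknown

A train-test split addresses this by reserving data the model never sees during training.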


Data Leakage Problem

Data leakage happens when test data influences training.

Example mistakes:

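  • Fitting a scaler on the full dataset before splitting
  • Including test rows in the training set
  • Computing data-dependent features (for example, means or encodings) on the full dataset

Correct approach:

Split → Fit on training data → Evaluate on test data
NOT
Fit on training and test data together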
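
To see why scaling before splitting counts as leakage, the following sketch compares the statistics a scaler learns from the full dataset with those it learns from the training rows alone. The names leaky_scaler and clean_scaler are hypothetical and used only for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({
    'Age': [22, 25, 30, 35, 40, 45, 50],
    'Salary': [25000, 30000, 45000, 60000, 70000, 85000, 95000],
    'Purchased': [0, 0, 0, 1, 1, 1, 1]
})
X = data[['Age', 'Salary']]
y = data['Purchased']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

leaky_scaler = StandardScaler().fit(X)        # statistics include the test rows
clean_scaler = StandardScaler().fit(X_train)  # statistics from training rows only

# mean_[0] is the learned mean of the first column, Age
print("Mean Age learned from full data:", leaky_scaler.mean_[0])
print("Mean Age learned from training data:", clean_scaler.mean_[0])
```

The two means differ, which shows that a scaler fitted on the full dataset carries information about the test rows into training.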

Best Practice Workflow

A standard Logistic Regression pipeline:

1. Load Data
2. Clean Data
3. Split Data
4. Feature Engineering (fit data-dependent steps on the training set only)
5. Train Model
6. Evaluate Model

Train-Test Split with Scaling (Correct Way)

Scaling should be done AFTER splitting.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Important rule:

  • Fit only on training data
  • Transform both training and testing data
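
One way to follow this rule without tracking it by hand is to wrap the scaler and the model in a Scikit-Learn pipeline. The sketch below is one possible arrangement, assuming default LogisticRegression settings. With only 7 rows, the resulting score is not meaningful and is shown purely to illustrate the workflow.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({
    'Age': [22, 25, 30, 35, 40, 45, 50],
    'Salary': [25000, 30000, 45000, 60000, 70000, 85000, 95000],
    'Purchased': [0, 0, 0, 1, 1, 1, 1]
})
X = data[['Age', 'Salary']]
y = data['Purchased']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# fit() fits the scaler on X_train only, then trains the model on the scaled data
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X_train, y_train)

# score() reuses the training-set statistics to transform X_test before predicting
accuracy = pipeline.score(X_test, y_test)
print("Test accuracy:", accuracy)
```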

Complete Example Code

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Small example dataset
data = pd.DataFrame({
    'Age': [22, 25, 30, 35, 40, 45, 50],
    'Salary': [25000, 30000, 45000, 60000, 70000, 85000, 95000],
    'Purchased': [0, 0, 0, 1, 1, 1, 1]
})

# Separate input features and target variable
X = data[['Age', 'Salary']]
y = data['Purchased']

# Hold out 25% of the rows for testing; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit the scaler on the training data only, then apply it to both sets
scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print("Data successfully split and prepared")

Advantages of Train-Test Split

  • Reliable evaluation
  • Honest estimate of generalization
  • Helps detect overfitting
  • Real-world simulation
  • Improved model validation

Common Mistakes

Avoid:

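  • Training on the full dataset and then evaluating on it
  • Scaling before splitting
  • Omitting random_state, which makes results hard to reproduce
  • Choosing a test size too small for a stable estimate, or so large that little data is left for training
  • Letting test data leak into any preprocessing step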

Real-World Importance

Train-test split is used in:

  • Fraud detection systems
  • Customer prediction models
  • Medical diagnosis systems
  • Recommendation systems
  • Financial risk models

In each case, it gives an estimate of how the model will perform on data it has not seen.


Conclusion

Splitting data is a critical step in Logistic Regression modeling. It ensures that the model is trained on one portion of data and tested on unseen data, providing a realistic evaluation of performance.

By correctly applying train-test split in Python using Scikit-Learn, you can build more reliable, accurate, and production-ready machine learning models.



