Logistic Regression in Python Tutorial
Logistic Regression is one of the most popular machine learning algorithms used for classification problems. Despite its name, Logistic Regression is used to predict categorical outcomes rather than continuous values. It is widely applied in spam detection, customer churn prediction, disease diagnosis, sentiment analysis, and many other classification tasks.
In this tutorial, you will learn the fundamentals of Logistic Regression, how it works, and how to implement it in Python using Scikit-Learn.
What is Logistic Regression?
Logistic Regression is a supervised machine learning algorithm used for predicting discrete classes.
Unlike Linear Regression, which predicts numerical values, Logistic Regression predicts probabilities that can be mapped to class labels.
For example:
- Email is Spam or Not Spam
- Customer Will Buy or Not Buy
- Student Passes or Fails
- Disease Positive or Negative
The output probability ranges between 0 and 1.
How Logistic Regression Works
Logistic Regression uses the Sigmoid Function to transform predictions into probabilities.
The sigmoid function is:
$$P(y=1) = \frac{1}{1 + e^{-z}}$$
Where:
- P(y=1) = Probability of the positive class
- e = Euler's number (approximately 2.718)
- z = Linear combination of the features, z = b_0 + b_1 x_1 + ... + b_n x_n
If the probability is greater than 0.5, the prediction is usually classified as Class 1; otherwise, it is classified as Class 0. The 0.5 threshold is a convention and can be adjusted for a given problem.
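To make the sigmoid and the threshold rule concrete, here is a minimal NumPy sketch. The weight `w` and bias `b` are hypothetical values chosen only for illustration; a trained model would learn them from data.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score z to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weight and bias for a single feature (illustration only).
w, b = 1.0, -4.5
hours = np.array([2.0, 4.5, 9.0])

z = w * hours + b                    # linear combination of features
probs = sigmoid(z)                   # P(y=1) for each input
labels = (probs > 0.5).astype(int)   # Class 1 only when probability exceeds 0.5

print(probs)
print(labels)
```

Note that an input whose probability is exactly 0.5 falls into Class 0 under the strict "greater than" rule used here.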
Installing Required Libraries
Install the necessary Python packages:
pip install numpy pandas matplotlib scikit-learn
Importing Required Modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
Creating a Sample Dataset
Let's create a simple dataset representing study hours and exam results.
data = {
'Hours': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'Pass': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
}
df = pd.DataFrame(data)
print(df)
Output:
   Hours  Pass
0      1     0
1      2     0
2      3     0
3      4     0
4      5     1
...
Preparing Features and Labels
Features are input variables, while labels are target outputs.
X = df[['Hours']]
y = df['Pass']
Splitting Data into Training and Testing Sets
Machine learning models should be evaluated on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
test_size=0.2,
random_state=42
)
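As a quick check on the split, the sketch below rebuilds the same toy data and prints the size of each part. The `stratify=y` variant is an optional addition that this tutorial does not use; it asks `train_test_split` to keep the class ratio similar in both parts, which can matter on a dataset this small.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "Hours": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Pass": [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
})
X = df[["Hours"]]
y = df["Pass"]

# Same settings as the tutorial: 20% of 10 rows gives 2 test rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))

# Optional variant (an assumption, not part of the original code):
# stratify=y keeps both classes represented in the test rows.
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(sorted(ys_test.tolist()))
```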
Training the Logistic Regression Model
Create and train the model.
model = LogisticRegression()
model.fit(X_train, y_train)
The model learns the relationship between study hours and exam results.
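If you want to see what the model actually learned, you can inspect its fitted parameters. The following self-contained sketch repeats the steps so far and then reads `coef_` and `intercept_`. The `boundary` variable is a name introduced here for illustration: it is the number of study hours at which the predicted probability of passing is exactly 0.5.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "Hours": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Pass": [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
})
X_train, X_test, y_train, y_test = train_test_split(
    df[["Hours"]], df["Pass"], test_size=0.2, random_state=42
)

model = LogisticRegression()
model.fit(X_train, y_train)

w = model.coef_[0][0]      # learned weight for the Hours feature
b = model.intercept_[0]    # learned bias term
boundary = -b / w          # solve w * hours + b = 0 for hours

print("weight:", w)
print("intercept:", b)
print("decision boundary (hours):", boundary)
```

A positive weight means that more study hours push the predicted probability of passing upward.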
Making Predictions
Predict class labels for test data.
predictions = model.predict(X_test)
print(predictions)
Example output:
[1 0]
Predicting Probabilities
Logistic Regression can provide probabilities for each class.
probabilities = model.predict_proba(X_test)
print(probabilities)
Example output:
[[0.12 0.88]
[0.91 0.09]]
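Interpretation (each row lists the probability of class 0, then class 1, for one test sample):
- First sample: 88% chance of passing
- Second sample: 9% chance of passing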
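The sketch below shows where these probabilities come from and how to apply a threshold other than 0.5. It fits on the full toy dataset only to keep the example short, and the two students in `new_hours` as well as the 0.8 cutoff are hypothetical choices, not part of the tutorial's data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    "Hours": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Pass": [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
})
model = LogisticRegression().fit(df[["Hours"]], df["Pass"])

# Two hypothetical students who studied 2 and 9 hours.
new_hours = pd.DataFrame({"Hours": [2, 9]})

proba = model.predict_proba(new_hours)  # columns follow model.classes_
p_pass = proba[:, 1]                    # probability of class 1 (pass)

# Recompute the same probabilities by hand: sigmoid of the linear score.
z = model.decision_function(new_hours)
manual = 1.0 / (1.0 + np.exp(-z))

print(model.classes_)
print(p_pass)
print(np.allclose(p_pass, manual))

# A stricter, hypothetical cutoff: predict "pass" only at 80% or more.
strict = (p_pass >= 0.8).astype(int)
print(strict)
```

Raising the cutoff trades recall for precision, which can be useful when a false positive is more costly than a false negative.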
Evaluating Model Accuracy
Calculate prediction accuracy.
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)
Output:
Accuracy: 1.0
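A score of 1.0 means 100% accuracy on the test data. Because the test set here contains only two samples, treat this figure as a sanity check rather than a reliable estimate of performance.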
Complete Logistic Regression Example
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
data = {
'Hours': [1,2,3,4,5,6,7,8,9,10],
'Pass': [0,0,0,0,1,1,1,1,1,1]
}
df = pd.DataFrame(data)
X = df[['Hours']]
y = df['Pass']
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2,
random_state=42
)
model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print("Predictions:", predictions)
print("Accuracy:", accuracy)
Using the Iris Dataset
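The Iris dataset is a classic machine learning dataset with 150 samples, four numeric features, and three flower species, which makes it a multiclass classification problem.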
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2,
random_state=42
)
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print("Accuracy:",
accuracy_score(y_test, predictions))
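With three classes, `predict_proba` returns one column per species rather than two. The following sketch repeats the Iris setup and adds scikit-learn's `classification_report`, which is not used elsewhere in this tutorial, to print per-class precision and recall.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# One probability per class for the first three test flowers; each row sums to 1.
proba = model.predict_proba(X_test[:3])
print(iris.target_names)
print(proba.round(3))

# Per-class precision, recall, and F1 on the test set.
print(classification_report(
    y_test, model.predict(X_test), target_names=iris.target_names
))
```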
Important Evaluation Metrics
Beyond accuracy, Logistic Regression can be evaluated using:
Precision
Measures the fraction of predicted positives that are actually positive.
from sklearn.metrics import precision_score
Recall
Measures the fraction of actual positives that the model identified.
from sklearn.metrics import recall_score
F1 Score
Balances precision and recall by taking their harmonic mean.
from sklearn.metrics import f1_score
Confusion Matrix
Shows the counts of correct and incorrect predictions for each class.
from sklearn.metrics import confusion_matrix
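The imports above can be used as in the following sketch. The labels in `y_true` and `y_pred` are made up for illustration, so the counts are easy to verify by hand: 2 true positives, 1 false positive, 2 false negatives, and 3 true negatives.

```python
from sklearn.metrics import (
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical ground truth and predictions for eight samples.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 1, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 4
f1 = f1_score(y_true, y_pred)                # harmonic mean = 4 / 7
cm = confusion_matrix(y_true, y_pred)        # rows: actual, columns: predicted

print("Precision:", precision)
print("Recall:", recall)
print("F1:", f1)
print(cm)  # layout is [[TN, FP], [FN, TP]]
```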
Advantages of Logistic Regression
- Easy to understand and implement
- Fast training process
- Works well on linearly separable data
- Provides probability outputs
- Suitable for binary and multiclass classification
- Requires fewer computational resources
Limitations of Logistic Regression
- Assumes a linear relationship between the features and the log-odds of the outcome
- Less effective for highly complex data
- Sensitive to outliers
- Performance decreases with non-linear patterns
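The effect of non-linear patterns can be sketched on synthetic data. The example below uses scikit-learn's `make_circles`, where one class forms a ring around the other, and compares a plain model with one that first adds squared features via `PolynomialFeatures`. Both the dataset and the polynomial workaround are illustrative assumptions, not part of the tutorial's own examples.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: no straight line can separate the inner circle from the outer ring.
X, y = make_circles(n_samples=400, noise=0.05, factor=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Plain Logistic Regression on the raw coordinates.
linear = LogisticRegression().fit(X_train, y_train)
acc_linear = linear.score(X_test, y_test)

# Same algorithm after adding squared and interaction terms.
poly = make_pipeline(
    PolynomialFeatures(degree=2),
    LogisticRegression(max_iter=1000),
)
poly.fit(X_train, y_train)
acc_poly = poly.score(X_test, y_test)

print("Raw features:", acc_linear)
print("With degree-2 features:", acc_poly)
```

The model itself stays linear in its inputs; the improvement comes from giving it features in which the classes become separable.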
Real-World Applications
Logistic Regression is used in:
- Email spam detection
- Fraud detection
- Disease diagnosis
- Credit risk assessment
- Customer churn prediction
- Sentiment analysis
- Marketing campaign response prediction
Best Practices
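- Scale features when necessary, for example when they are on very different ranges, since regularization and solver convergence both depend on feature scale.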
- Remove irrelevant variables.
- Handle missing values properly.
- Evaluate using multiple metrics.
- Use cross-validation for reliable results.
- Monitor class imbalance issues.
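Several of these practices can be combined in a few lines. The sketch below is one possible arrangement, not the only one: it places `StandardScaler` and the model in a pipeline and scores it with 5-fold cross-validation on Iris. The `class_weight="balanced"` option is included only to show where class imbalance can be handled; Iris itself is balanced, so it has no real effect here.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaling sits inside the pipeline so each fold is scaled
# using statistics from its own training portion only.
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=200, class_weight="balanced"),
)

# Five accuracy scores, one per fold.
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(scores)
print("Mean accuracy:", scores.mean())
```

Reporting the mean across folds gives a steadier estimate than a single train/test split.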
Conclusion
Logistic Regression is one of the most important classification algorithms in machine learning. It is simple, efficient, interpretable, and widely used in real-world applications. By combining Logistic Regression with Python and Scikit-Learn, developers can quickly build powerful predictive models for binary and multiclass classification tasks. Understanding Logistic Regression provides a strong foundation for learning more advanced machine learning algorithms and building intelligent data-driven applications.

