Logistic Regression in Python – Getting Data
Data is the foundation of every machine learning project. Even the most advanced algorithms cannot produce meaningful results without quality data. Before training a Logistic Regression model, the first step is to collect, load, and understand the dataset that will be used for classification.
In this tutorial, you will learn how to obtain data for Logistic Regression projects, load datasets into Python, inspect data quality, and prepare data for machine learning tasks.
By the end of this guide, you will understand how to build a reliable data pipeline for Logistic Regression applications.
Why Data Matters
Machine learning models learn patterns from data.
The quality of predictions depends heavily on:
- Data accuracy
- Data completeness
- Data consistency
- Data relevance
- Data quantity
Poor-quality data often leads to poor model performance.
A common saying in data science is:
Garbage In, Garbage Out (GIGO)
This means that inaccurate or incomplete data produces unreliable predictions.
Types of Data Used in Logistic Regression
Logistic Regression is primarily used for classification problems.
Examples include:
| Application | Target Variable |
|---|---|
| Email Filtering | Spam / Not Spam |
| Medical Diagnosis | Sick / Healthy |
| Customer Churn | Leave / Stay |
| Loan Approval | Approved / Rejected |
| Product Purchase | Buy / Not Buy |
The target variable typically contains categorical values represented as:
0 = Negative Class
1 = Positive Class
Sources of Machine Learning Data
Data can come from various sources.
CSV Files
One of the most common formats.
Example:
Age,Salary,Purchased
25,30000,0
35,65000,1
45,85000,1
Advantages:
- Easy to create
- Human-readable
- Supported by Pandas
Excel Files
Many businesses store information in Excel spreadsheets.
Example:
customers.xlsx
Useful for:
- Sales reports
- Financial records
- Customer databases
Databases
Large applications often store data in:
- MySQL
- PostgreSQL
- SQLite
- SQL Server
Example:
SELECT * FROM customers;
APIs
Data can be retrieved from web services.
Examples:
- Weather APIs
- Finance APIs
- Social media APIs
- E-commerce APIs
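API responses usually arrive as JSON. As a minimal sketch, assuming a hypothetical endpoint that returns a list of customer records, the parsed records can be turned into a DataFrame as follows. The payload and field names below are made up for illustration, and no network call is made.

```python
import json

import pandas as pd

# Hypothetical JSON payload, standing in for the body of an API response.
# In a real project this text would come from an HTTP client call.
payload = """
[
  {"age": 25, "salary": 30000, "purchased": 0},
  {"age": 35, "salary": 65000, "purchased": 1},
  {"age": 45, "salary": 85000, "purchased": 1}
]
"""

records = json.loads(payload)      # list of dicts
api_data = pd.DataFrame(records)   # one row per record, one column per key

print(api_data)
```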
Public Datasets
Popular sources include:
- Kaggle
- UCI Machine Learning Repository
- Government Open Data Portals
- Academic Research Datasets
These datasets are excellent for practice and learning.
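Some public datasets can be loaded without downloading anything by hand. As one example, Scikit-Learn bundles a copy of the Breast Cancer Wisconsin dataset, which is also hosted by the UCI repository and has a binary target suitable for Logistic Regression. The choice of dataset is an assumption made here; this tutorial does not depend on it.

```python
from sklearn.datasets import load_breast_cancer

# Load the bundled dataset as a pandas DataFrame (features plus a "target" column).
bunch = load_breast_cancer(as_frame=True)
cancer = bunch.frame

print(cancer.shape)                     # (rows, columns)
print(cancer["target"].value_counts())  # counts for the two classes, 0 and 1
```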
Creating a Sample Dataset
For this tutorial, we will use customer purchase data.
Create a file named:
customers.csv
Contents:
Age,Salary,Purchased
22,25000,0
25,30000,0
30,45000,0
35,60000,1
40,70000,1
45,85000,1
50,95000,1
Save the file at:
data/customers.csv
Installing Pandas
Pandas is one of the most widely used Python libraries for data analysis.
Install it using:
pip install pandas
Verify installation:
import pandas as pd
print(pd.__version__)
Loading Data from a CSV File
Use Pandas to load the dataset.
import pandas as pd
data = pd.read_csv("data/customers.csv")
print(data)
Output:
Age Salary Purchased
0 22 25000 0
1 25 30000 0
2 30 45000 0
3 35 60000 1
4 40 70000 1
5 45 85000 1
6 50 95000 1
The dataset is now available as a DataFrame.
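If you would rather not create the file by hand, the same rows can be loaded from an in-memory string. This is a small sketch that uses io.StringIO as a stand-in for the file; the variable name csv_text is illustrative.

```python
import io

import pandas as pd

# The same rows as data/customers.csv, held in a string instead of a file.
csv_text = """Age,Salary,Purchased
22,25000,0
25,30000,0
30,45000,0
35,60000,1
40,70000,1
45,85000,1
50,95000,1
"""

# read_csv accepts any file-like object, not only a path.
data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)  # (rows, columns)
```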
Viewing the First Rows
Inspect the first few records.
print(data.head())
Output:
Age Salary Purchased
0 22 25000 0
1 25 30000 0
2 30 45000 0
3 35 60000 1
4 40 70000 1
This helps verify that the data was loaded correctly.
Viewing the Last Rows
Display the last records.
print(data.tail())
Output:
Age Salary Purchased
2 30 45000 0
3 35 60000 1
4 40 70000 1
5 45 85000 1
6 50 95000 1
Checking Dataset Information
Use the info() method. It prints its summary directly, so it does not need to be wrapped in print().
data.info()
Example output (truncated):
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 3 columns):
This shows:
- Number of rows
- Number of columns
- Data types
- Non-null counts per column, which reveal missing values
Understanding Data Types
Check column types.
print(data.dtypes)
Output:
Age int64
Salary int64
Purchased int64
dtype: object
Machine learning models require numerical values, so verify the data types before training.
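Columns sometimes load as text (dtype object), for example when a number is stored with stray characters or when class labels are words. The sketch below uses a made-up three-row frame, not the tutorial's dataset, to show one way to convert such columns to numbers.

```python
import pandas as pd

# Illustrative data: Salary contains a non-numeric entry, Purchased uses words.
raw = pd.DataFrame({
    "Age": [25, 35, 45],
    "Salary": ["30000", "65000", "unknown"],
    "Purchased": ["No", "Yes", "Yes"],
})

# Convert text to numbers; values that cannot be parsed become NaN.
raw["Salary"] = pd.to_numeric(raw["Salary"], errors="coerce")

# Map the two class labels to 0 (negative class) and 1 (positive class).
raw["Purchased"] = raw["Purchased"].map({"No": 0, "Yes": 1})

print(raw.dtypes)
```

The entry that could not be parsed is now a missing value, so it still has to be handled before training.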
Statistical Summary
Generate summary statistics.
print(data.describe())
Output (first two rows):
Age Salary Purchased
count 7.000000 7.000000 7.000000
mean 35.285714 58571.428571 0.571429
Useful metrics include:
- Mean
- Minimum
- Maximum
- Standard deviation
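To look at only the metrics listed above, the summary can be filtered by row label. This is a sketch that rebuilds the sample dataset inline so it runs on its own; the variable name summary is illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    "Age": [22, 25, 30, 35, 40, 45, 50],
    "Salary": [25000, 30000, 45000, 60000, 70000, 85000, 95000],
    "Purchased": [0, 0, 0, 1, 1, 1, 1],
})

# Keep only the mean, standard deviation, minimum, and maximum of the features.
summary = data[["Age", "Salary"]].describe().loc[["mean", "std", "min", "max"]]
print(summary.round(2))

# For a 0/1 target, the mean is the share of positive (1) labels.
print(data["Purchased"].mean())
```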
Checking Missing Values
Missing values can negatively impact model performance.
Check for missing data:
print(data.isnull().sum())
Output:
Age 0
Salary 0
Purchased 0
dtype: int64
A value greater than zero indicates missing data in that column.
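The sample dataset has no gaps, so here is a sketch with a small made-up frame that does contain missing entries. It counts them per column and then lists the affected rows.

```python
import numpy as np
import pandas as pd

# Small illustrative frame with two missing entries.
sample = pd.DataFrame({
    "Age": [25, np.nan, 35],
    "Salary": [30000, 45000, np.nan],
    "Purchased": [0, 1, 1],
})

# Number of missing values in each column.
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Rows that contain at least one missing value.
print(sample[sample.isnull().any(axis=1)])
```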
Handling Missing Values
Example dataset:
Age Salary Purchased
25 30000 0
NaN 45000 1
35 NaN 1
Remove rows with missing values:
data = data.dropna()
Or replace them with the column mean:
data = data.fillna(data.mean(numeric_only=True))
Removing Duplicate Records
Check duplicates.
print(data.duplicated().sum())
Remove duplicates.
data = data.drop_duplicates()
Duplicate records can distort model training.
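The cleaning steps from this section and the previous one can be sketched together. The frame below is made up for illustration: it contains one missing salary and one repeated row.

```python
import numpy as np
import pandas as pd

dirty = pd.DataFrame({
    "Age": [25, 25, 35, 45],
    "Salary": [30000, 30000, np.nan, 85000],
    "Purchased": [0, 0, 1, 1],
})

# Option A: drop every row that has a missing value.
dropped = dirty.dropna()

# Option B: fill missing numeric values with the column mean.
filled = dirty.fillna(dirty.mean(numeric_only=True))

# Remove exact duplicate rows and renumber the index.
clean = filled.drop_duplicates().reset_index(drop=True)

print(len(dirty), len(dropped), len(clean))
```

Whether to remove duplicates before or after filling is a judgment call, because repeated rows influence the column mean used for filling.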
Selecting Features and Target
Separate input variables and labels.
X = data[['Age', 'Salary']]
y = data['Purchased']
Features:
Age
Salary
Target:
Purchased
Scikit-Learn expects this structure: a two-dimensional feature matrix X and a one-dimensional target y.
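As a quick check of that structure, the sketch below rebuilds the sample dataset inline, prints the shapes of X and y, and passes them to Scikit-Learn's train_test_split. The split size and random seed are arbitrary choices made for this example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.DataFrame({
    "Age": [22, 25, 30, 35, 40, 45, 50],
    "Salary": [25000, 30000, 45000, 60000, 70000, 85000, 95000],
    "Purchased": [0, 0, 0, 1, 1, 1, 1],
})

X = data[["Age", "Salary"]]  # 2-D: one row per sample, one column per feature
y = data["Purchased"]        # 1-D: one label per sample

print(X.shape, y.shape)

# Hold out part of the data for evaluation, keeping both classes in each part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(len(X_train), len(X_test))
```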
Loading Data from Excel
Install support package:
pip install openpyxl
Read Excel files:
data = pd.read_excel("customers.xlsx")
This is useful when working with business spreadsheets.
Loading Data from SQL Databases
Example using SQLite:
import sqlite3
import pandas as pd
conn = sqlite3.connect("customers.db")
data = pd.read_sql_query("SELECT * FROM customers", conn)
conn.close()
Databases are common in production systems.
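To try this without an existing customers.db file, the sketch below builds a small in-memory SQLite database first. The table contents are made up for illustration, and the query also shows how a value can be passed through a placeholder.

```python
import sqlite3

import pandas as pd

# An in-memory database stands in for customers.db so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (Age INTEGER, Salary INTEGER, Purchased INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(25, 30000, 0), (35, 65000, 1), (45, 85000, 1)],
)
conn.commit()

# Use a placeholder (?) instead of string formatting to pass values into the query.
query = "SELECT Age, Salary, Purchased FROM customers WHERE Age >= ?"
db_data = pd.read_sql_query(query, conn, params=(30,))

conn.close()
print(db_data)
```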
Saving Cleaned Data
After cleaning the dataset:
data.to_csv("data/cleaned_customers.csv", index=False)
Benefits:
- Faster future loading
- Consistent preprocessing
- Reproducible results
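One way to confirm that a saved file reproduces the cleaned data is to reload it and compare. In this sketch a temporary directory stands in for the data/ folder, and the three-row frame is illustrative.

```python
import tempfile
from pathlib import Path

import pandas as pd

cleaned = pd.DataFrame({
    "Age": [22, 35, 50],
    "Salary": [25000, 60000, 95000],
    "Purchased": [0, 1, 1],
})

# A temporary directory stands in for the data/ folder.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "cleaned_customers.csv"
    # index=False avoids writing the row index as an extra unnamed column.
    cleaned.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# True when the reloaded frame has the same values, columns, and dtypes.
print(reloaded.equals(cleaned))
```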
Best Practices for Data Collection
Use Reliable Sources
Ensure data accuracy and credibility.
Gather Relevant Features
Collect variables related to the target prediction.
Maintain Data Quality
Remove errors and inconsistencies.
Keep Data Organized
Use dedicated folders for datasets.
Document Data Sources
Track where each dataset originated.
Store Raw Data Separately
Keep original data untouched.
Common Beginner Mistakes
Avoid these mistakes:
- Ignoring missing values
- Using duplicate records
- Mixing categorical and numerical data incorrectly
- Forgetting to inspect datasets
- Training models before cleaning data
- Using irrelevant features
Example Workflow
A typical machine learning data workflow:
1. Collect Data
2. Load Data
3. Inspect Data
4. Clean Data
5. Handle Missing Values
6. Remove Duplicates
7. Select Features
8. Train Model
Following this process improves model reliability.
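Steps 1 to 7 can be sketched as a single helper. The function name prepare_data and its parameters are hypothetical, the input is an in-memory CSV made up for illustration, and the sketch deliberately stops before model training.

```python
import io

import pandas as pd


def prepare_data(source, feature_cols, target_col):
    """Load a CSV, clean it, and return (X, y). Stops before model training."""
    data = pd.read_csv(source)                            # 2. load
    print(data.isnull().sum())                            # 3. inspect
    data = data.dropna()                                  # 4-5. drop rows with missing values
    data = data.drop_duplicates().reset_index(drop=True)  # 6. remove duplicates
    X = data[feature_cols]                                # 7. select features
    y = data[target_col]
    return X, y


# 1. "Collected" data: an in-memory CSV with one missing value and one duplicate row.
csv_text = """Age,Salary,Purchased
22,25000,0
22,25000,0
30,,0
35,60000,1
45,85000,1
"""

X, y = prepare_data(io.StringIO(csv_text), ["Age", "Salary"], "Purchased")
print(X.shape, y.shape)
```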
Conclusion
Getting data is the first and most important step in building a Logistic Regression model. High-quality data allows machine learning algorithms to learn meaningful patterns and produce accurate predictions.
In this tutorial, you learned how to obtain data from different sources, load datasets into Python, inspect data quality, handle missing values, remove duplicates, and prepare features for machine learning. With a clean and organized dataset, you are now ready to move on to data preprocessing, feature engineering, and Logistic Regression model training using Scikit-Learn.

