Imputing Missing Data with NumPy
Missing data is a common problem in real-world datasets.
Instead of removing missing values, another approach is to impute them.
Imputation means replacing missing values with meaningful substitutes so that valuable data is not lost.
What is Data Imputation?
Data imputation is the process of replacing missing values with estimated or calculated values.
Example (each missing value is replaced with the mean of the observed values, 20):
Before:
[10, NaN, 20, NaN, 30]
After:
[10, 20, 20, 20, 30]
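The before/after pair above can be reproduced in a few lines. This is a minimal sketch that assumes mean imputation; the variable names `before`, `fill_value`, and `after` are chosen only for this example.

```python
import numpy as np

# The "before" data from the example above; NaN marks a missing entry.
before = np.array([10, np.nan, 20, np.nan, 30])

# Mean of the observed (non-NaN) values: (10 + 20 + 30) / 3 = 20.
fill_value = np.nanmean(before)

# np.where builds a new array and leaves `before` unchanged.
after = np.where(np.isnan(before), fill_value, before)

print(after)  # [10. 20. 20. 20. 30.]
```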
Why Impute Missing Data?
- Preserve dataset size
- Prevent data loss
- Often improve machine learning accuracy compared with dropping incomplete rows
- Handle incomplete records
- Maintain statistical consistency
1. Creating an Array with Missing Values
import numpy as np
arr = np.array([10, 20, np.nan, 40, np.nan])
print(arr)
Output
[10. 20. nan 40. nan]
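Before imputing, it usually helps to check how many values are missing and where they are. The sketch below uses np.isnan for this; the names `mask`, `n_missing`, and `missing_idx` are illustrative.

```python
import numpy as np

arr = np.array([10, 20, np.nan, 40, np.nan])

# np.nan is a float, so the whole array is stored as float64.
print(arr.dtype)          # float64

# Boolean mask: True where a value is missing.
mask = np.isnan(arr)
print(mask)               # [False False  True False  True]

# Count and locate the missing entries.
n_missing = int(mask.sum())
missing_idx = np.flatnonzero(mask)
print(n_missing)          # 2
print(missing_idx)        # [2 4]
```

Because NaN is a floating-point value, an integer array cannot hold it; this is why the integers in the example are printed as floats.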
2. Replacing Missing Values with Zero
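Replacing each NaN with zero is the simplest imputation method. It is appropriate mainly when a missing value really does mean zero, because otherwise it pulls statistics such as the mean toward zero.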
import numpy as np
arr = np.array([10, 20, np.nan, 40])
imputed = np.nan_to_num(arr, nan=0)
print(imputed)
Output
[10. 20.  0. 40.]
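As a rough illustration of how zero imputation can distort a summary statistic, the following sketch compares the mean of the observed values with the mean after zero-filling. The variable names are assumptions made for this example.

```python
import numpy as np

arr = np.array([10, 20, np.nan, 40])

mean_observed = np.nanmean(arr)            # mean of 10, 20, 40
zero_filled = np.nan_to_num(arr, nan=0.0)  # returns a new array by default
mean_zero_filled = zero_filled.mean()      # mean of 10, 20, 0, 40

print(mean_zero_filled)  # 17.5
print(mean_observed)     # about 23.33
```

The zero-filled mean is noticeably lower, which is worth keeping in mind when zero is not a natural value for the data.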
3. Imputing with Mean Value
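Mean imputation is a common statistical technique: each missing value is replaced with the average of the observed values. Note that the assignment below modifies arr in place.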
import numpy as np
arr = np.array([10, 20, np.nan, 40])
mean_value = np.nanmean(arr)
arr[np.isnan(arr)] = mean_value
print(arr)
Output
[10.         20.         23.33333333 40.        ]
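If the original array should stay untouched, the same idea can be wrapped in a small function that returns a filled copy. `impute_mean` is a hypothetical helper written for this example, not a NumPy function.

```python
import numpy as np

def impute_mean(values):
    """Return a float copy of `values` with NaN replaced by the mean of the observed values.

    Hypothetical helper for illustration; not part of NumPy.
    """
    values = np.asarray(values, dtype=float)
    return np.where(np.isnan(values), np.nanmean(values), values)

original = np.array([10, 20, np.nan, 40])
filled = impute_mean(original)

print(filled)    # third entry is now the mean, about 23.33
print(original)  # still contains nan
```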
4. Imputing with Median Value
The median is often a better choice when the data contains outliers, because a few extreme values shift it far less than they shift the mean.
import numpy as np
arr = np.array([10, 20, np.nan, 40])
median_value = np.nanmedian(arr)
arr[np.isnan(arr)] = median_value
print(arr)
Output
[10. 20. 20. 40.]
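To see why outliers matter, consider the same data with one extreme value added. This sketch uses made-up numbers purely for illustration.

```python
import numpy as np

# Same values as above plus one outlier (1000), invented for this example.
arr = np.array([10, 20, np.nan, 40, 1000])

mean_value = np.nanmean(arr)      # (10 + 20 + 40 + 1000) / 4
median_value = np.nanmedian(arr)  # median of 10, 20, 40, 1000

print(mean_value)    # 267.5
print(median_value)  # 30.0
```

Filling the gap with 267.5 would place it far from most of the observed values, while 30.0 stays close to them.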
5. Imputing with Maximum Value
import numpy as np
arr = np.array([10, 20, np.nan, 40])
max_value = np.nanmax(arr)
arr[np.isnan(arr)] = max_value
print(arr)
Output
[10. 20. 40. 40.]
6. Imputing with Minimum Value
import numpy as np
arr = np.array([10, 20, np.nan, 40])
min_value = np.nanmin(arr)
arr[np.isnan(arr)] = min_value
print(arr)
Output
[10. 20. 10. 40.]
7. Imputing Missing Values in a 2D Array
Without an axis argument, np.nanmean averages over every non-missing element in the array, so all gaps receive the same value.
import numpy as np
arr = np.array([
    [1, 2, np.nan],
    [4, np.nan, 6]
])
mean_value = np.nanmean(arr)
arr[np.isnan(arr)] = mean_value
print(arr)
Output
[[1.   2.   3.25]
 [4.   3.25 6.  ]]
8. Column-Wise Imputation
Replace missing values using the mean of each column.
import numpy as np
arr = np.array([
    [10, 20, np.nan],
    [40, np.nan, 60],
    [70, 80, 90]
])
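# Mean of each column, ignoring NaN
col_means = np.nanmean(arr, axis=0)
# Row and column indices of the missing entries
inds = np.where(np.isnan(arr))
# Fill each missing entry with the mean of its own column
arr[inds] = np.take(col_means, inds[1])
print(arr)
Output
[[10. 20. 75.]
 [40. 50. 60.]
 [70. 80. 90.]]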
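The column-wise fill can also be written with np.where and broadcasting, and the same pattern extends to row-wise means. This is an alternative sketch, not a replacement for the approach above; `col_filled` and `row_filled` are names chosen for this example.

```python
import numpy as np

arr = np.array([
    [10, 20, np.nan],
    [40, np.nan, 60],
    [70, 80, 90],
])
mask = np.isnan(arr)

# Column-wise: col_means has shape (3,), which broadcasts across rows.
col_means = np.nanmean(arr, axis=0)
col_filled = np.where(mask, col_means, arr)

# Row-wise: keepdims=True gives shape (3, 1), which broadcasts across columns.
row_means = np.nanmean(arr, axis=1, keepdims=True)
row_filled = np.where(mask, row_means, arr)

print(col_filled)
print(row_filled)
```

If a column (or row) contains only NaN, np.nanmean returns NaN for it and emits a RuntimeWarning, so those entries remain missing and need a separate fallback.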
9. Real-World Example: Student Scores
import numpy as np
scores = np.array([80, 90, np.nan, 70, 85])
avg_score = np.nanmean(scores)
scores[np.isnan(scores)] = avg_score
print(scores)
Output
[80.   90.   81.25 70.   85.  ]
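Once a score has been filled in, it can no longer be told apart from a real one. One possible approach, sketched here under the assumption that this distinction matters later, is to save the mask before overwriting; `was_missing` is a name chosen for this example.

```python
import numpy as np

scores = np.array([80, 90, np.nan, 70, 85])

# Record which entries are missing before overwriting them.
was_missing = np.isnan(scores)

# The right-hand side is evaluated first, while the NaN is still present.
scores[was_missing] = np.nanmean(scores)

print(scores[2])     # 81.25
print(was_missing)   # [False False  True False False]
```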
10. Real-World Example: Temperature Dataset
import numpy as np
temps = np.array([30.5, np.nan, 29.0, 31.0])
mean_temp = np.nanmean(temps)
temps[np.isnan(temps)] = mean_temp
print(temps)
Output
[30.5        30.16666667 29.         31.        ]
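For ordered readings such as temperatures, linear interpolation between neighbouring values is sometimes used instead of the overall mean. This is a different technique from the mean imputation shown above, sketched with np.interp under the assumption that the readings are evenly spaced in time.

```python
import numpy as np

temps = np.array([30.5, np.nan, 29.0, 31.0])

# Positions 0, 1, 2, 3 stand in for timestamps (assumes evenly spaced readings).
positions = np.arange(temps.size)
known = ~np.isnan(temps)

# Estimate each missing reading from the known readings on either side of it.
filled = temps.copy()
filled[~known] = np.interp(positions[~known], positions[known], temps[known])

print(filled)  # the gap becomes 29.75, halfway between 30.5 and 29.0
```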
Common Imputation Techniques
| Technique | Description |
|---|---|
| Zero Imputation | Replace with 0 |
| Mean Imputation | Replace with average |
| Median Imputation | Replace with median |
| Maximum Imputation | Replace with max value |
| Minimum Imputation | Replace with min value |
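The techniques in the table differ only in how the fill value is computed, so they can be collected behind one function. The sketch below is one possible arrangement; `FILL_FUNCS` and `impute` are hypothetical names for this example, not part of NumPy.

```python
import numpy as np

# Map each technique in the table to a function that computes the fill value.
# These names are hypothetical, chosen for this example.
FILL_FUNCS = {
    "zero": lambda a: 0.0,
    "mean": np.nanmean,
    "median": np.nanmedian,
    "max": np.nanmax,
    "min": np.nanmin,
}

def impute(values, method="mean"):
    """Return a float copy of `values` with NaN replaced using `method`."""
    if method not in FILL_FUNCS:
        raise ValueError(f"unknown method: {method!r}")
    values = np.asarray(values, dtype=float)
    fill_value = FILL_FUNCS[method](values)
    return np.where(np.isnan(values), fill_value, values)

arr = np.array([10, 20, np.nan, 40])
for name in FILL_FUNCS:
    print(name, impute(arr, name))
```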
Choosing an Imputation Method
| Method | Best Use Case |
|---|---|
| Mean | Normally distributed data |
| Median | Data with outliers |
| Zero | Placeholder values |
| Max/Min | Special domain requirements |
Advantages of Imputing Missing Data
- Keeps all records
- Prevents information loss
- Can improve model performance compared with discarding records
- Maintains dataset structure
- Easy to implement
Summary
Imputing missing data is an important preprocessing step in data science. NumPy provides powerful functions such as nanmean(), nanmedian(), and nan_to_num() to replace missing values efficiently.
These functions are part of NumPy itself, so no additional library is needed.
Conclusion
Instead of deleting valuable records, imputation allows you to intelligently replace missing values and preserve your dataset. By using NumPy's built-in tools, you can prepare clean and complete data for analysis, visualization, and machine learning.

