NumPy Finding Unique Rows
When working with large datasets, duplicate rows often appear due to:
- Data entry errors
- Data merging operations
- Database exports
- Sensor readings
- Machine learning datasets
Before analyzing data, it's important to remove duplicates and keep only unique records.
NumPy provides an efficient way to accomplish this using:
np.unique()
Finding unique rows is a common task in data cleaning and preprocessing.
What are Unique Rows?
Unique rows are the distinct rows of a dataset: each different row is kept exactly once, no matter how many times it appears.
Consider this array:
[[1 2]
 [3 4]
 [1 2]
 [5 6]]
The row:
[1 2]
appears twice.
Unique rows are:
[[1 2]
 [3 4]
 [5 6]]
Why Find Unique Rows?
Finding unique rows helps:
- Remove duplicate records
- Clean datasets
- Reduce storage requirements
- Avoid over-weighting repeated samples when training models
- Prepare data for analysis
NumPy Function for Unique Rows
The primary function is:
np.unique()
For rows, use:
np.unique(array, axis=0)
Syntax
np.unique(
    array,
    axis=0
)
Parameters
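| Parameter | Description |
|---|---|
| array | Input array |
| axis=0 | Compares whole rows instead of individual elements |
| return_counts | If True, also returns how many times each unique row occurs |
| return_index | If True, also returns the index of the first occurrence of each unique row |
| return_inverse | If True, also returns the indices that rebuild the original array from the unique rows |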
Basic Example
import numpy as np

arr = np.array([
    [1, 2],
    [3, 4],
    [1, 2],
    [5, 6]
])

# axis=0 treats each row as one item to compare
unique_rows = np.unique(
    arr,
    axis=0
)
print(unique_rows)
Output
[[1 2]
 [3 4]
 [5 6]]
Understanding the Result
Original array:
[1 2]
[3 4]
[1 2]
[5 6]
Duplicate:
[1 2]
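Result after removing the duplicate:
[1 2]
[3 4]
[5 6]
The rows come back in sorted order. In this example the input was already sorted, so the order looks unchanged.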
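Because np.unique() sorts its result, the output order can differ from the input order. A minimal sketch, using an illustrative array (not one of the examples above) whose rows are deliberately out of order:

```python
import numpy as np

# Illustrative input: rows are not sorted, and [1, 2] appears twice.
arr = np.array([
    [5, 6],
    [1, 2],
    [3, 4],
    [1, 2],
])

# Rows are compared column by column and returned in ascending order.
unique_rows = np.unique(arr, axis=0)
print(unique_rows)
```

The row [5 6] comes first in the input but last in the result, so do not rely on np.unique() to keep the original ordering.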
Example with Multiple Duplicates
import numpy as np

arr = np.array([
    [10, 20],
    [30, 40],
    [10, 20],
    [30, 40],
    [50, 60]
])

print(
    np.unique(arr, axis=0)
)
Output
[[10 20]
 [30 40]
 [50 60]]
Finding Unique Rows and Their Counts
Use:
return_counts=True
Example:
import numpy as np

arr = np.array([
    [1, 2],
    [1, 2],
    [3, 4],
    [5, 6],
    [5, 6]
])

# counts[i] is the number of times rows[i] occurs in arr
rows, counts = np.unique(
    arr,
    axis=0,
    return_counts=True
)
print(rows)
print(counts)
Output
[[1 2]
 [3 4]
 [5 6]]
[2 1 2]
Explanation
Row frequencies:
| Row | Count |
|---|---|
| [1, 2] | 2 |
| [3, 4] | 1 |
| [5, 6] | 2 |
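The counts can also be used as a mask to separate rows that are repeated from rows that occur exactly once. The following is a small sketch built on the same array; the variable names duplicated_rows and singleton_rows are chosen here for illustration:

```python
import numpy as np

arr = np.array([
    [1, 2],
    [1, 2],
    [3, 4],
    [5, 6],
    [5, 6],
])

rows, counts = np.unique(arr, axis=0, return_counts=True)

# Boolean masks over the unique rows, based on how often each one occurred.
duplicated_rows = rows[counts > 1]   # rows seen more than once
singleton_rows = rows[counts == 1]   # rows seen exactly once

print(duplicated_rows)
print(singleton_rows)
```

This makes the distinction explicit: [3 4] is the only row that appears once, while the result of np.unique() contains all three distinct rows.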
Finding Original Row Indices
Use:
return_index=True
Example:
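import numpy as np

arr = np.array([
    [1, 2],
    [3, 4],
    [1, 2],
    [5, 6]
])

# indices[i] is the position in arr where rows[i] first appears
rows, indices = np.unique(
    arr,
    axis=0,
    return_index=True
)
print(rows)
print(indices)
Output
[[1 2]
 [3 4]
 [5 6]]
[0 1 3]
Each index is the position of the first occurrence of the corresponding unique row. The second [1 2] at position 2 is the duplicate, so 2 does not appear.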
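The first-occurrence indices are also a common way to remove duplicates while keeping the original row order, since np.unique() itself returns sorted rows. A minimal sketch, using an illustrative unsorted array:

```python
import numpy as np

# Illustrative input: unsorted rows, with [5, 6] repeated.
arr = np.array([
    [5, 6],
    [1, 2],
    [5, 6],
    [3, 4],
])

# Only the first-occurrence indices are needed here.
_, first_idx = np.unique(arr, axis=0, return_index=True)

# Sorting the indices restores the order in which rows first appeared.
in_order = arr[np.sort(first_idx)]
print(in_order)
```

The result keeps [5 6] ahead of [1 2] and [3 4], matching the order of first appearance in the input.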
Finding Inverse Mapping
Use:
return_inverse=True
Example:
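import numpy as np

arr = np.array([
    [1, 2],
    [3, 4],
    [1, 2]
])

# inverse[i] is the index in rows that matches arr[i]
rows, inverse = np.unique(
    arr,
    axis=0,
    return_inverse=True
)
print(rows)
print(inverse)
Output
[[1 2]
 [3 4]]
[0 1 0]
For every row of the original array, the inverse mapping gives the index of the matching row in the unique result. Here the first and third rows both map to unique row 0.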
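One use of the inverse mapping is rebuilding the original array from the unique rows. The sketch below flattens the inverse with np.ravel() first; this is a defensive assumption, since the shape of the inverse for multi-dimensional input has differed between NumPy releases:

```python
import numpy as np

arr = np.array([
    [1, 2],
    [3, 4],
    [1, 2],
])

rows, inverse = np.unique(arr, axis=0, return_inverse=True)

# Flatten to a 1-D index array (assumption: guards against an extra dimension).
flat_inverse = np.ravel(inverse)

# Indexing the unique rows with the inverse reproduces the original array.
rebuilt = rows[flat_inverse]
print(np.array_equal(rebuilt, arr))
```

The same index array can serve as a compact integer label for each row, for example when grouping records by their distinct values.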
Real-World Example: Customer Database
import numpy as np

# Each row is (customer ID, age)
customers = np.array([
    [101, 25],
    [102, 30],
    [101, 25],
    [103, 28]
])

unique_customers = np.unique(
    customers,
    axis=0
)
print(unique_customers)
Output
[[101 25]
 [102 30]
 [103 28]]
Real-World Example: Sensor Data
import numpy as np

sensor = np.array([
    [100, 200],
    [100, 200],
    [150, 250]
])

print(
    np.unique(sensor, axis=0)
)
Output
[[100 200]
 [150 250]]
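The example above uses integer readings, which compare exactly. With floating-point readings, two values that are meant to be equal can differ in their last bits and will then be treated as different rows. One possible workaround, shown here as a sketch with made-up readings and an arbitrarily chosen precision of 6 decimals, is to round before calling np.unique():

```python
import numpy as np

# Made-up float readings: the first two rows are "the same" up to rounding error.
readings = np.array([
    [0.1 + 0.2, 1.0],   # 0.1 + 0.2 is not exactly 0.3 in floating point
    [0.3, 1.0],
    [0.5, 2.0],
])

# Exact comparison keeps both near-identical rows.
exact = np.unique(readings, axis=0)

# Rounding to a chosen precision (6 decimals is an assumption) merges them.
rounded = np.unique(np.round(readings, 6), axis=0)

print(len(exact))
print(len(rounded))
```

The right precision depends on the sensor and the data, so treat the value used here as a placeholder.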
Unique Rows vs Unique Elements
Unique Elements
np.unique(arr)
Returns:
The sorted unique values of the flattened array
Unique Rows
np.unique(arr, axis=0)
Returns:
The sorted unique rows, each kept whole
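A short side-by-side sketch of the two calls on the same illustrative array shows the difference in shape and content:

```python
import numpy as np

arr = np.array([
    [1, 2],
    [3, 4],
    [1, 2],
])

# Without axis, the array is flattened and individual values are compared.
unique_values = np.unique(arr)

# With axis=0, whole rows are compared and the 2-D shape is kept.
unique_rows = np.unique(arr, axis=0)

print(unique_values)
print(unique_rows)
```

The first call returns a 1-D array of four values, while the second returns a 2-D array with two rows.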
Performance Benefits
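NumPy's implementation:
- Runs in compiled code instead of a Python-level loop
- Is based on sorting, so the cost grows roughly as O(n log n) in the number of rows
- Returns a new array and leaves the input unchanged
- Is often faster than an explicit Python loop on large arrays, though it is worth measuring on your own data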
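To check the performance claim on your own data, one option is to compare np.unique() against a pure-Python baseline. The sketch below is only a measurement harness under stated assumptions: the random test data, the helper names unique_rows_python and unique_rows_numpy, and the repeat count are all illustrative, and no particular timing is implied.

```python
import timeit

import numpy as np

# Illustrative data: 10,000 rows from a small value range, so duplicates are common.
rng = np.random.default_rng(0)
arr = rng.integers(0, 5, size=(10_000, 2))


def unique_rows_python(a):
    """Pure-Python baseline: collect row tuples in a set, then sort them."""
    return sorted(set(map(tuple, a.tolist())))


def unique_rows_numpy(a):
    """NumPy version: distinct rows in sorted order."""
    return np.unique(a, axis=0)


# Both approaches should agree on the distinct rows.
py_result = unique_rows_python(arr)
np_result = unique_rows_numpy(arr)
same = np_result.tolist() == [list(row) for row in py_result]
print(same)

# Timings depend on machine, array size, and dtype, so none are claimed here.
print(timeit.timeit(lambda: unique_rows_python(arr), number=5))
print(timeit.timeit(lambda: unique_rows_numpy(arr), number=5))
```

Which approach is faster can change with the number of rows, the number of columns, and the dtype, so it is worth rerunning the comparison with data that resembles your real workload.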
Practical Applications
Finding unique rows is used in:
- Data cleaning
- Machine learning preprocessing
- Database management
- Log analysis
- Customer records
- Financial datasets
- Scientific research
Advantages of Finding Unique Rows
- Removes duplicate records
- Improves data quality
- Saves storage space
- Speeds up analysis
- Simplifies preprocessing
Summary
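Finding unique rows in NumPy is simple using np.unique() with axis=0. It removes duplicate rows and returns the distinct rows in sorted order, making it an essential tool for data cleaning and preprocessing. The optional return_counts, return_index, and return_inverse arguments additionally report how often each row occurs, where it first appears, and how to map the original rows onto the unique ones.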
This functionality is provided by NumPy and is widely used in data science workflows built with Python.
Conclusion
Understanding how to find unique rows is an important skill for working with real-world datasets. Whether you're cleaning customer records, preparing machine learning data, or analyzing scientific measurements, np.unique() with axis=0 provides a concise, vectorized solution.

