How to detect outliers in a dataframe?
Here we will see the steps involved in identifying outlier values in a dataframe using the interquartile range (IQR) method.
Follow the steps below:
- Define the problem: decide which column (feature) to check for outliers.
- Once the column is identified, calculate its 25th and 75th percentiles (q1 and q3).
- From the percentiles, calculate the interquartile range (IQR) by subtracting the 25th percentile from the 75th: iqr = q3 - q1
- With the IQR in hand, calculate the lower and upper bounds: lower_bound = q1 - (threshold * iqr) and upper_bound = q3 + (threshold * iqr). A threshold of 1.5 is the usual default.
- Filter for the values that fall outside the range from lower_bound to upper_bound.
- Values outside this range are considered outliers.
import pandas as pd
import numpy as np

def outliers(data, threshold=1.5):
    """Return the positions of values outside the IQR-based bounds."""
    data = np.asarray(data)  # lets plain lists work as well as Series
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - (threshold * iqr)
    upper_bound = q3 + (threshold * iqr)
    # np.where returns a tuple of arrays; [0] takes the positions
    return np.where((data < lower_bound) | (data > upper_bound))[0]

df = pd.DataFrame({'x': [4, 5, 6, 8, 8, 9, 2], 'x1': [6, 3, 4, 5, 68, 9, 0]})
# a = [5, 3, 5, 81, 2, 9, 0, 3]
result = outliers(df.x1)
print(f'The row index of outliers are {result}')
print(f'Values are {[df.loc[ind, :] for ind in result]}')
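The threshold controls how strict the check is: the number of outliers goes down as the threshold goes up.
A smaller threshold narrows the bounds, so more values are flagged as outliers; a larger threshold widens them, so fewer values are flagged.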
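To see the effect of the threshold, here is a minimal sketch that counts the flagged values for a few thresholds. The helper name count_outliers and the thresholds 0.25 and 20 are chosen only for illustration; the data is the same as the x1 column used in this post.

```python
import numpy as np

def count_outliers(values, threshold=1.5):
    """Count values outside [q1 - threshold*iqr, q3 + threshold*iqr]."""
    values = np.asarray(values)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - threshold * iqr
    upper_bound = q3 + threshold * iqr
    return int(np.sum((values < lower_bound) | (values > upper_bound)))

x1 = [6, 3, 4, 5, 68, 9, 0]  # same values as the x1 column

# Illustrative thresholds: one strict, the default, and one very loose
counts = {t: count_outliers(x1, t) for t in (0.25, 1.5, 20)}
print(counts)  # {0.25: 3, 1.5: 1, 20: 0}
```

With q1 = 3.5 and q3 = 7.5 the IQR is 4, so a threshold of 0.25 gives bounds of 2.5 and 8.5 and flags 0, 9 and 68, while a threshold of 20 gives bounds of -76.5 and 87.5 and flags nothing.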
Output of outliers(df.x1) with the default threshold of 1.5:
The row index of outliers are [4]
Values are [x      8
x1    68
Name: 4, dtype: int64]
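Once the outliers are identified, the same bounds can be used to drop those rows from the dataframe. This is a sketch of one possible way to do it, using the pandas methods Series.quantile and Series.between; the helper name drop_outliers is hypothetical and is not part of the code in this post.

```python
import pandas as pd

def drop_outliers(df, column, threshold=1.5):
    """Return only the rows whose `column` value lies within the IQR bounds."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - threshold * iqr
    upper_bound = q3 + threshold * iqr
    # between() is inclusive by default, so values on a bound are kept
    keep = df[column].between(lower_bound, upper_bound)
    return df[keep]

df = pd.DataFrame({'x': [4, 5, 6, 8, 8, 9, 2], 'x1': [6, 3, 4, 5, 68, 9, 0]})
cleaned = drop_outliers(df, 'x1')
print(cleaned)
```

For this data the row with index 4 (x1 = 68) is removed and the other six rows are kept. The original index labels are preserved; call reset_index(drop=True) on the result if a fresh 0..n-1 index is needed.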
JAMES JULIUS TUDU