How to detect outliers in a DataFrame?

Here we will go through the steps involved in identifying outlier values in a column of a pandas DataFrame, using the interquartile range (IQR) method.

Follow the steps below:

  1. Define the problem: decide which column to check for outliers.
  2. Once the feature/column is identified, calculate its 25th and 75th percentiles (q1 and q3).
  3. From the percentiles, calculate the interquartile range (IQR) by subtracting the 25th percentile from the 75th, i.e. iqr = q3 - q1.
  4. Using the IQR, calculate the lower and upper bounds: lower_bound = q1 - (threshold * iqr) and upper_bound = q3 + (threshold * iqr).
  5. Filter for the values that fall outside this range.
  6. Values outside this range are considered outliers.
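To make the steps concrete, here is a minimal sketch that walks through them one at a time and prints the intermediate numbers. The sample values and the variable names are only illustrative, and it uses pandas' `Series.quantile` with its default linear interpolation; other percentile conventions can give slightly different bounds.

```python
import pandas as pd

# Illustrative sample column; 68 sits far away from the other values.
values = pd.Series([6, 3, 4, 5, 68, 9, 0])
threshold = 1.5  # assumed here; a commonly used value for the IQR rule

# Step 2: 25th and 75th percentiles
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)

# Step 3: interquartile range
iqr = q3 - q1

# Step 4: lower and upper bounds
lower_bound = q1 - threshold * iqr
upper_bound = q3 + threshold * iqr

# Steps 5 and 6: keep only the values that fall outside the bounds
is_outlier = (values < lower_bound) | (values > upper_bound)
outlier_values = values[is_outlier]

print(f'q1={q1}, q3={q3}, iqr={iqr}')
print(f'bounds=({lower_bound}, {upper_bound})')
print(f'outliers={outlier_values.tolist()}')
```

For this sample, q1 is 3.5 and q3 is 7.5, so the IQR is 4.0 and the bounds are -2.5 and 13.5; only 68 falls outside them.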

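import pandas as pd
import numpy as np


def outliers(data, threshold=1.5):
    """Return the positions of the values lying outside the IQR-based bounds."""
    # 25th and 75th percentiles of the column
    q1, q3 = np.percentile(data, [25, 75])

    # Interquartile range and the bounds derived from it
    iqr = q3 - q1
    lower_bound = q1 - (threshold * iqr)
    upper_bound = q3 + (threshold * iqr)
    # np.where returns a tuple; its first element holds the outlier positions
    return np.where((data < lower_bound) | (data > upper_bound))[0]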



df = pd.DataFrame({'x':[4,5,6,8,8,9,2],
                  'x1':[6,3,4,5,68,9,0]})

#a = [5,3,5,81,2,9,0,3]
result = outliers(df.x1)
print(f'The row index of outliers are {result}')
# result holds positions, so use position-based iloc rather than label-based loc
print(f'Values are {[df.iloc[ind] for ind in result]}')
    
    

The threshold controls how strict the rule is: the larger the threshold, the wider the bounds and the fewer values are flagged as outliers.

A smaller threshold value means more outliers.

#Output
The row index of outliers are [4]
Values are [x      8
x1    68
Name: 4, dtype: int64]
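
To illustrate the earlier point about the threshold, the sketch below counts the outliers in the x1 values for a few thresholds. The helper `count_outliers` and the chosen thresholds are hypothetical, picked only for this demonstration; the bounds are computed the same way as in the function above.

```python
import numpy as np


def count_outliers(data, threshold):
    """Count values outside [q1 - threshold * iqr, q3 + threshold * iqr]."""
    data = np.asarray(data)
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - threshold * iqr
    upper_bound = q3 + threshold * iqr
    return int(np.sum((data < lower_bound) | (data > upper_bound)))


# Same values as the x1 column in the example above
x1 = [6, 3, 4, 5, 68, 9, 0]

# Hypothetical thresholds, from very strict to very lenient
counts = {t: count_outliers(x1, t) for t in (0.1, 0.5, 1.5, 3.0)}
for t, n in counts.items():
    print(f'threshold={t}: {n} outlier(s)')
```

On this data the count drops from 4 at a threshold of 0.1, to 2 at 0.5, to 1 at 1.5 and 3.0. The count never grows as the threshold grows, but the relationship is not a strict proportion: it depends on how the values are spread.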