4 - Machine learning: code for identifying outliers in a Jupyter notebook

We first load the dataset into a pandas dataframe using the pd.read_csv() function.

We then define a function called detect_outliers that takes a column of data and uses the z-score method to detect outliers. The function first calculates the mean and standard deviation of the data, and then sets the threshold: a value counts as an outlier when its absolute z-score exceeds 3, that is, when it lies more than three standard deviations from the mean.


import pandas as pd
import numpy as np

# load the dataset
df = pd.read_csv('path/to/dataset.csv')

# define a function to detect outliers
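def detect_outliers(data):
    # work on a float array so lists and pandas Series behave the same
    values = np.asarray(data, dtype=float)
    # calculate the mean and standard deviation of the data
    mean = np.mean(values)
    std = np.std(values)
    # z-scores are already measured in standard deviations,
    # so the threshold is the plain number 3, not 3 * std
    threshold = 3
    # identify outliers using the z-score method
    z_scores = (values - mean) / std
    outliers = np.where(np.abs(z_scores) > threshold)
    # np.where returns a tuple; its first element holds the positions
    return outliers[0]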

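# apply the detect_outliers function to each numeric column of the dataset
# (text columns have no mean or standard deviation, so they are skipped)
outliers = {}
for column in df.select_dtypes(include='number').columns:
    column_outliers = detect_outliers(df[column])
    outliers[column] = column_outliers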

# print the outliers for each column
for column, column_outliers in outliers.items():
    print('Outliers in column {}: {}'.format(column, column_outliers))

The function then calculates the z-score of each data point and identifies as outliers all points whose absolute z-score is greater than the threshold. It returns the positions of those points within the column.
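
To make the z-score step concrete, here is a minimal sketch on a small made-up sample. The sample values and the helper name zscore_outlier_positions are illustrative and not part of the notebook above. It assumes the population standard deviation, which is what NumPy computes by default.

```python
import numpy as np

def zscore_outlier_positions(values, threshold=3.0):
    """Return positions whose absolute z-score exceeds `threshold`."""
    values = np.asarray(values, dtype=float)
    std = values.std()  # population standard deviation (ddof=0)
    if std == 0:
        # all values are identical, so nothing can be an outlier
        return np.array([], dtype=int)
    z_scores = (values - values.mean()) / std
    return np.flatnonzero(np.abs(z_scores) > threshold)

# made-up sample: twenty ordinary readings and one extreme value
sample = [10.0] * 10 + [11.0] * 10 + [50.0]
positions = zscore_outlier_positions(sample)
print(positions)  # the extreme value sits at position 20
```

The extreme reading has a z-score of roughly 4.5, while the ordinary readings stay well below 1, so only the last position is flagged.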

Next, we apply the detect_outliers function to each column of the dataset using a for loop, and store the outliers for each column in a dictionary called outliers.
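
The per-column loop can also be tried without a CSV file. The sketch below is only an illustration: it builds a small made-up DataFrame (the column names height, weight and label are hypothetical), restricts the loop to numeric columns, and then uses the stored positions to look up the full rows.

```python
import numpy as np
import pandas as pd

def detect_outliers(data, threshold=3.0):
    """Positions of values whose absolute z-score exceeds `threshold`."""
    values = np.asarray(data, dtype=float)
    z_scores = (values - values.mean()) / values.std()
    return np.flatnonzero(np.abs(z_scores) > threshold)

# made-up data: 20 ordinary rows plus one row with an extreme height
toy = pd.DataFrame({
    "height": [170.0] * 10 + [172.0] * 10 + [400.0],
    "weight": [60.0, 62.0, 64.0] * 7,
    "label": ["a"] * 21,  # non-numeric column, skipped below
})

# one entry per numeric column: column name -> array of positions
outliers = {}
for column in toy.select_dtypes(include="number").columns:
    outliers[column] = detect_outliers(toy[column])

# look up the full rows behind the flagged positions
flagged_rows = toy.iloc[outliers["height"]]
print(flagged_rows)
```

Because the function returns positions rather than index labels, iloc is the matching accessor for retrieving the rows.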

Finally, we print the outliers for each column by iterating through the outliers dictionary using another for loop.
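
Raw position arrays can be hard to read for larger tables. One possible refinement, sketched here with hand-written stand-ins for the notebook's DataFrame and dictionary (example_df, example_outliers and format_outlier_report are hypothetical names), is to report the count and the offending values for each column.

```python
import numpy as np
import pandas as pd

def format_outlier_report(df, outliers):
    """Build one report line per column from a {column: positions} dict."""
    lines = []
    for column, positions in outliers.items():
        # translate positions back into the actual values
        values = df[column].iloc[positions].tolist()
        lines.append(
            f"Outliers in column {column}: "
            f"{len(positions)} found, values {values}"
        )
    return lines

# hand-written stand-ins; the positions here are given, not computed
example_df = pd.DataFrame({"price": [1.0, 2.0, 300.0], "qty": [5, 6, 7]})
example_outliers = {
    "price": np.array([2]),
    "qty": np.array([], dtype=int),
}

report = format_outlier_report(example_df, example_outliers)
for line in report:
    print(line)
```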