
4 - Machine learning: code for identifying outliers in a Jupyter notebook
We first load the dataset into a pandas DataFrame using the pd.read_csv() function.
We then define a function called detect_outliers that takes in a column of data and uses the z-score method to detect outliers. The function first calculates the mean and standard deviation of the data, and then sets the threshold for detecting outliers at three standard deviations from the mean, that is, an absolute z-score of 3.
import numpy as np
import pandas as pd

# load the dataset
df = pd.read_csv('path/to/dataset.csv')

# define a function to detect outliers
def detect_outliers(data):
    # calculate the mean and standard deviation of the data
    mean = np.mean(data)
    std = np.std(data)
    # set the z-score threshold: flag points more than 3 standard deviations from the mean
    threshold = 3
    # identify outliers using the z-score method
    z_scores = (data - mean) / std
    outliers = np.where(np.abs(z_scores) > threshold)
    # np.where returns a tuple; its first element holds the row positions
    return outliers[0]

# apply the detect_outliers function to each numeric column of the dataset
# (z-scores are undefined for text columns, so those are skipped)
outliers = {}
for column in df.select_dtypes(include='number').columns:
    column_outliers = detect_outliers(df[column])
    outliers[column] = column_outliers

# print the outliers for each column
for column, column_outliers in outliers.items():
    print('Outliers in column {}: {}'.format(column, column_outliers))
The function then calculates the z-scores for each data point in the dataset, and identifies outliers as any data point with an absolute z-score greater than the threshold.
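To see the z-score rule on concrete numbers, the following minimal sketch applies it to a small made-up sample. The sample values and the helper name zscore_outlier_positions are illustrative assumptions and are not part of the dataset above; the cutoff of 3 is the conventional choice described in the text.

```python
import numpy as np

def zscore_outlier_positions(values, threshold=3.0):
    """Return the positions whose absolute z-score exceeds the threshold."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    std = values.std()  # population standard deviation (ddof=0)
    if std == 0:
        # all values are identical, so nothing can be flagged
        return np.array([], dtype=int)
    z_scores = (values - mean) / std
    return np.flatnonzero(np.abs(z_scores) > threshold)

# made-up sample: twenty values near 10 and one extreme value at position 20
sample = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10,
          11, 9, 10, 12, 10, 9, 11, 10, 10, 11, 100]
positions = zscore_outlier_positions(sample)
print(positions)  # [20]
```

Only the value 100 is flagged: its z-score is roughly 4.47, while every other point lies well within one standard deviation of the mean.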
Next, we apply the detect_outliers function to each column of the dataset using a for loop, and store the outliers for each column in a dictionary called outliers.
Finally, we print the outliers for each column by iterating through the outliers dictionary using another for loop.
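Because np.where returns integer row positions rather than the outlying values themselves, the stored arrays can be passed to iloc to inspect or remove the flagged rows. The sketch below assumes a small made-up DataFrame; the names df_demo and outliers_demo, the column names, and the hard-coded positions are all hypothetical stand-ins for the real dataset and the outliers dictionary.

```python
import numpy as np
import pandas as pd

# made-up data for illustration; the column names are hypothetical
df_demo = pd.DataFrame({
    'height': [170, 172, 168, 171, 169, 250],
    'weight': [65, 70, 68, 300, 66, 67],
})

# hard-coded positions per column, in the same shape as the outliers dictionary
outliers_demo = {
    'height': np.array([5]),
    'weight': np.array([3]),
}

# look up the flagged values in each column by position
for column, positions in outliers_demo.items():
    print(column, df_demo[column].iloc[positions].tolist())

# collect every flagged row position and drop those rows
flagged = np.unique(np.concatenate(list(outliers_demo.values())))
df_clean = df_demo.drop(index=df_demo.index[flagged])
print(df_clean.shape)  # (4, 2)
```

Whether to drop, cap, or simply review flagged rows depends on the dataset; dropping is shown here only because it is the shortest option to demonstrate.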