Data Preprocessing

Naman Mehra
Jun 22, 2021

Data preprocessing is one of the most important steps in applying machine learning to a dataset, and it is the first step in building a model. It refers to preparing the raw data (handling missing values, encoding categories, scaling features) so that it is suitable for a model.

Reading the dataset

import pandas as pd
import numpy as np
dataset = pd.read_csv('sample - sample.csv')
dataset.head()

Checking the count of null values in columns
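
A minimal sketch of this check, assuming a small stand-in DataFrame instead of the CSV file. The Age and Salary column names and all of the values are made up for illustration; only Country and Purchased are named in this post.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV loaded above; the values are made up.
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, np.nan, 54000.0, np.nan],
    "Purchased": ["No", "Yes", "No", "Yes"],
})

# isnull() marks every missing cell as True; sum() then counts them per column.
null_counts = dataset.isnull().sum()
print(null_counts)
```

Columns with a count above zero are the ones that need to be filled in or dropped before training.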

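Separating the features and the target

x = dataset.iloc[:,:-1].values  # features: every column except the last
y = dataset.iloc[:,-1].values  # target: the last column (Purchased), as a 1-D array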

Replacing the missing values with the mean
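
One common way to do this is scikit-learn's SimpleImputer. The sketch below is an assumption about how this step could look, not necessarily the code used for this post: it assumes the numeric columns sit at positions 1 and 2 of x, and the small array is made-up data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix: one category column, then two numeric columns.
x = np.array([
    ["France", 44.0, 72000.0],
    ["Spain", 27.0, np.nan],
    ["Germany", np.nan, 54000.0],
    ["Spain", 38.0, 61000.0],
], dtype=object)

# strategy="mean" replaces each NaN with the mean of its own column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# Only the numeric columns are passed in; the cast avoids object-dtype issues.
x[:, 1:3] = imputer.fit_transform(x[:, 1:3].astype(float))
print(x)
```

The mean is computed from the non-missing values of each column, so the missing age becomes the average of the three known ages.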

Handling the Categorical Column

We will handle the categorical columns (Country and Purchased) with the help of LabelEncoder, which replaces each category with an integer.

from sklearn.preprocessing import LabelEncoder
LabEnc = LabelEncoder()
x[:,0] = LabEnc.fit_transform(x[:,0])  # label-encode column 0 (Country)
x
y = LabEnc.fit_transform(y.ravel())  # refit the encoder on the target; ravel() makes sure y is 1-D
y

Applying OneHotEncoder

We need to apply OneHotEncoder because, when a column has more than two categories (the Country column), the integer labels imply an order and a distance that do not really exist. For example, if France is 1, Germany is 2 and Spain is 3, then the average of France and Spain comes out as Germany, which is meaningless.

OneHotEncoder replaces the column it is applied to.
If the column has three categories, OneHotEncoder removes the original column and adds three columns of 1s and 0s, one per category, with a 1 in the column of the category that is present.
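
To see what those three columns look like, here is a small standalone sketch on a made-up Country column (the four rows are illustrative, not taken from the dataset).

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single-column input with three distinct categories.
countries = np.array([["France"], ["Germany"], ["Spain"], ["France"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense.
encoded = encoder.fit_transform(countries).toarray()

print(encoder.categories_[0])  # the category behind each output column
print(encoded)
```

Each row contains exactly one 1, so no category is treated as larger than or in between the others.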

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
# one-hot encode column 0; remainder='passthrough' keeps the remaining columns unchanged
transform = ColumnTransformer([('norm1',OneHotEncoder(),[0])], remainder = 'passthrough')
x = transform.fit_transform(x)
x

Splitting the Data into Train and Test Sets

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=0)
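
A quick sketch of what the two parameters do, using made-up arrays with ten rows: test_size=0.2 holds out 20% of the rows, and a fixed random_state makes the shuffle repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, built so that x[i, 0] == 2 * y[i].
x = np.arange(20).reshape(10, 2)
y = np.arange(10)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
# Same random_state, so the second call produces the same split.
x_train2, x_test2, _, _ = train_test_split(x, y, test_size=0.2, random_state=0)

print(x_train.shape, x_test.shape)
print(np.array_equal(x_test, x_test2))
```

Features and labels are shuffled together, so every row of x_test still lines up with its entry in y_test.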

Applying Standard Scaler

x_scaled = (x - mean) / standard_deviation
mean = sum(x) / count(x)
standard_deviation = sqrt(sum((x - mean)^2) / count(x))

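After scaling, each column has a mean of 0 and a standard deviation of 1. The scaled values are not limited to a fixed range: most of them fall between roughly -3 and 3, and outliers can fall outside that.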
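
The formula can be checked by hand against StandardScaler. The sketch below uses five made-up values, computes the mean and the (population) standard deviation exactly as written above, and compares the result with scikit-learn's output.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single feature column.
values = np.array([[20.0], [30.0], [40.0], [50.0], [100.0]])

# Manual version of the formula: divide by count(x), not count(x) - 1.
mean = values.sum() / len(values)
std = np.sqrt(((values - mean) ** 2).sum() / len(values))
manual = (values - mean) / std

scaled = StandardScaler().fit_transform(values)

print(np.allclose(manual, scaled))   # manual and library results agree
print(scaled.min(), scaled.max())    # smallest and largest scaled value
```

With these numbers the mean is 48 and the largest scaled value is about 1.87, which shows how a value far from the mean ends up more than one standard deviation away.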

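from sklearn.preprocessing import StandardScaler
St = StandardScaler()
x_train[:,3:5] = St.fit_transform(x_train[:,3:5])  # fit on the training set only, skipping the one-hot columns
x_test[:,3:5] = St.transform(x_test[:,3:5])  # reuse the training mean and standard deviation on the test set
x_train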

Thanks all for reading this far.
