Data Preprocessing
Data preprocessing is one of the most important steps in applying machine learning to a dataset, and it is the first step in building a model. It refers to the techniques used to prepare raw data so that it is suitable for a model.
Reading the dataset
import pandas as pd
import numpy as np
dataset = pd.read_csv('sample - sample.csv')
dataset.head()
Checking the count of null values in columns
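The null count itself can be sketched as follows. This is a minimal illustration, not the original notebook cell: the small DataFrame is a hypothetical stand-in for 'sample - sample.csv', and the Age and Salary column names are assumptions (only Country and Purchased are named in this post).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for 'sample - sample.csv'; Age and Salary are assumed names.
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, np.nan, 54000.0, 61000.0],
    "Purchased": ["No", "Yes", "No", "No"],
})

# isnull() flags every missing cell; sum() then counts the flags per column.
null_counts = dataset.isnull().sum()
print(null_counts)
```

Any column with a non-zero count needs to be handled before the data is passed to a model.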
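# features: every column except the last
x = dataset.iloc[:, :-1].values
# target: the last column (Purchased), as a 1-D array so LabelEncoder accepts it without a warning
y = dataset.iloc[:, -1].values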
Replacing the missing values with the mean
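One way to do this is with scikit-learn's SimpleImputer. The sketch below is an assumption about how the step could look, not the original code: the feature matrix is hypothetical, with Country in column 0 and two numeric columns (assumed to be Age and Salary) in columns 1 and 2.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix: column 0 is Country, columns 1-2 are numeric.
x = np.array([
    ["France", 44.0, 72000.0],
    ["Spain", 27.0, np.nan],
    ["Germany", np.nan, 54000.0],
    ["Spain", 38.0, 60000.0],
], dtype=object)

# Replace each NaN with the mean of the non-missing values in its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
x[:, 1:3] = imputer.fit_transform(x[:, 1:3].astype(float))
print(x)
```

Only the numeric columns are passed to the imputer, because a mean is not defined for the text values in the Country column.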
Handling the Categorical Column
We will encode the categorical columns (Country and Purchased) with the help of LabelEncoder.
from sklearn.preprocessing import LabelEncoder
LabEnc = LabelEncoder()
x[:,0] = LabEnc.fit_transform(x[:,0]) # label-encode column 0 (Country)
x
y = LabEnc.fit_transform(y)
y
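To see which integer each category receives, the fitted encoder can be inspected. This is a small sketch on illustrative country values, not on the real dataset; LabelEncoder assigns the integers in sorted order of the class names.

```python
from sklearn.preprocessing import LabelEncoder

countries = ["France", "Spain", "Germany", "Spain"]  # illustrative values

encoder = LabelEncoder()
codes = encoder.fit_transform(countries)

# classes_ holds the sorted categories; a category's index is its code.
print(list(encoder.classes_))
print(list(codes))

# inverse_transform maps the integer codes back to the original labels.
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```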
Applying OneHotEncoder
We need to apply OneHotEncoder because when a column has more than two categories (as the Country column does), integer labels imply an order and a distance between categories that do not actually exist. For example, if France is encoded as 0, Germany as 1 and Spain as 2, then the average of France and Spain would be Germany.
OneHotEncoder changes only the column it is applied to.
If the column has three categories, OneHotEncoder removes the original column and adds three columns of 1s and 0s, with a 1 marking the category that is present in each row.
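The effect on a single column can be sketched in isolation. The country values below are illustrative only; the encoder's default sparse output is converted with toarray() so the result prints as a plain array.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Illustrative single-column input with three distinct categories.
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

encoder = OneHotEncoder()
one_hot = encoder.fit_transform(countries).toarray()

# categories_ gives the column order: one output column per category.
print(encoder.categories_[0])
print(one_hot)
```

Each row contains exactly one 1, so no category is numerically "between" two others.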
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
transform = ColumnTransformer([('norm1',OneHotEncoder(),[0])], remainder = 'passthrough') # 'passthrough' keeps the remaining columns unchanged
x = transform.fit_transform(x)
x
Splitting the Data into Train and Test Sets
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=0)
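The effect of test_size and random_state can be checked on toy data. The arrays here are made up for illustration: with 10 rows and test_size=0.2, 8 rows go to the training set and 2 to the test set, and a fixed random_state reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: row i of x is [2*i, 2*i + 1] and its label is i.
x = np.arange(20).reshape(10, 2)
y = np.arange(10)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
print(x_train.shape, x_test.shape)

# Repeating the call with the same random_state gives the same test rows.
_, x_test_again, _, _ = train_test_split(x, y, test_size=0.2, random_state=0)
print(np.array_equal(x_test, x_test_again))
```

Features and labels are shuffled together, so each row of x_train still lines up with its entry in y_train.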
Applying Standard Scaler
z = (x - mean) / standard_deviation
mean = sum(x) / count(x)
standard_deviation = sqrt(sum((x - mean)^2) / count(x))
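StandardScaler rescales each feature to have a mean of 0 and a standard deviation of 1. The scaled values are not confined to a fixed range such as -1 to 1.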
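The formulas above can be checked numerically against StandardScaler. This is a small sketch on made-up values; it computes the mean and the population standard deviation by hand and compares the result with scikit-learn's output.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single feature column.
values = np.array([[30.0], [40.0], [50.0], [60.0], [70.0]])

# Manual computation, following the formulas term by term.
mean = values.sum() / values.size
standard_deviation = np.sqrt(((values - mean) ** 2).sum() / values.size)
manual = (values - mean) / standard_deviation

# The same computation done by scikit-learn.
scaled = StandardScaler().fit_transform(values)

print(np.allclose(manual, scaled))
print(scaled.mean(), scaled.std())
```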
from sklearn.preprocessing import StandardScaler
St = StandardScaler()
x_train[:,3:5] = St.fit_transform(x_train[:,3:5])
x_train
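The code above scales only the training set. A common practice, sketched here under the assumption that columns 3 and 4 hold the numeric features after one-hot encoding, is to reuse the statistics learned from the training set when scaling the test set, so that no information from the test set influences the scaler. The arrays are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: columns 0-2 are one-hot Country, columns 3-4 are numeric.
x_train = np.array([
    [1.0, 0.0, 0.0, 44.0, 72000.0],
    [0.0, 0.0, 1.0, 27.0, 48000.0],
    [0.0, 1.0, 0.0, 30.0, 54000.0],
    [0.0, 0.0, 1.0, 38.0, 61000.0],
])
x_test = np.array([
    [0.0, 1.0, 0.0, 40.0, 58000.0],
    [1.0, 0.0, 0.0, 35.0, 52000.0],
])

scaler = StandardScaler()
# fit_transform learns the mean and standard deviation from the training set.
x_train[:, 3:5] = scaler.fit_transform(x_train[:, 3:5])
# transform reuses those training statistics on the test set.
x_test[:, 3:5] = scaler.transform(x_test[:, 3:5])

print(x_train)
print(x_test)
```

The one-hot columns are left out of the scaling, since they are already 0 or 1.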
Thanks all for reading this far.