Comparative analysis of different Machine Learning algorithms.

YALAMANCHILI DIWAKAR
Oct 13, 2019

This blog applies different kinds of Machine Learning algorithms to a data set of Uber cab booking requests and compares how well each one predicts the status of a request.

Data set:

Many of us have experienced issues booking cabs while traveling from the city to the airport and vice versa. This data set provides a large chunk of supply/demand data for Uber cabs, including the status of each request. It is adopted from the link mentioned below.

Data set link: https://www.kaggle.com/hellbuoy/uber-supplydemand-gap

Data set Information:

This data set consists of 6 different attributes that are associated with requests made by customers.

1) Request id: A unique identifier of the request

2) Time of request: The date and time at which the customer made the trip request

3) Drop-off time: The drop-off date and time, in case the trip was completed

4) Pick-up point: The point from which the request was made

5) Driver id: The unique identification number of the driver

6) Status of the request: The final status of the trip, which can be completed, cancelled by the driver, or no cars available

Let us begin by applying these algorithms to the data.

Importing the libraries:

import pandas as pd 
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

Acquiring data:

df = pd.read_csv('Uber Request Data.csv')
df.head()

Cleaning of data:

Remove the columns that contain mixed and unwanted data:

df = df.drop(["Request timestamp", "Drop timestamp"], axis=1)
df.head()
Data set after removal of columns

Characteristics of the data set:

df.info()
df.shape

(6745, 4)

Pre-processing of data:

The function df.isnull() identifies the missing values in the data set, and df.fillna() helps to fill in those missing values.

df.isnull().sum()
Missing values sum

Missing values are found in the Driver_id column. Since it contains arbitrary identifier values, we use the mode (the most frequent value) to fill them in.

df['Driver_id'].fillna(df['Driver_id'].mode()[0], inplace=True)
df.isnull().sum()
Missing values after filling it with mode

To apply the different classification algorithms we need numerical data, so we preprocess the remaining object-type columns.

Since the 'Status' column is of object type, let us perform label encoding on that column and convert the string data into fixed numbers.

Label encoding VS One-Hot encoding:

Label encoding:

Label encoding converts categorical text data into model-understandable numerical data using the LabelEncoder class. So all we have to do to label encode a column is import the LabelEncoder class from the sklearn library, fit and transform the column, and then replace the existing text data with the new encoded data.

Application: We chose label encoding for the Status column because it consists of discrete class labels, which is exactly the kind of target that classification algorithms expect; encoding turns those labels into discrete numerical values.

One-Hot encoding:

One-hot encoding takes a column containing categorical data and splits it into multiple columns, one per category. The categories are replaced by 1s and 0s, depending on which column holds which value.

Application: We chose one-hot encoding for the Pickup_point column because it is a categorical value whose effect depends on the place. One-hot encoding splits the places into separate 0/1 columns in the data set.
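
As a quick side-by-side sketch of the two encodings on a standalone toy frame (the values here are illustrative, modeled on this data set's columns):

import pandas as pd
from sklearn import preprocessing

toy = pd.DataFrame({'Pickup_point': ['Airport', 'City', 'Airport'],
                    'Status': ['Trip Completed', 'Cancelled', 'No Cars Available']})

# Label encoding: each distinct string becomes one integer code in a single column
le = preprocessing.LabelEncoder()
print(le.fit_transform(toy['Status']))      # [2 0 1] -- codes are assigned alphabetically

# One-hot encoding: each distinct string becomes its own 0/1 column
print(pd.get_dummies(toy['Pickup_point']))  # two columns: Airport, City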

label_encoder = preprocessing.LabelEncoder()
df['Status'] = label_encoder.fit_transform(df['Status'])
df['Status'].unique()
status column before and after encoding

Splitting the data set into x and y data sets:

y = df[["Status"]]
y
df = df.drop("Status", axis=1)
df.head()

Perform one-hot encoding for the Pickup_point column, as it is also a column of object type.

dummies = pd.get_dummies(df.Pickup_point)
dummies
df = pd.concat([df, dummies], axis='columns')
df = df.drop('Pickup_point', axis=1)  # drop the original text column so only numeric columns remain for scaling
df
image of x after one hot encoding

Normalization of data:

In statistics and its applications, normalization can have a range of meanings; here it means converting numerical values into a range between 0 and 1. Normalization creates new values that maintain the general distribution and ratios of the source data while keeping all values on a common scale across all numeric columns used in the model. Min-max scaling maps each value x of a column to (x - min) / (max - min). So we apply normalization to the data set.
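
As a minimal sketch of what MinMaxScaler computes per column (toy numbers, not from this data set):

import numpy as np

a = np.array([10.0, 20.0, 40.0])
print((a - a.min()) / (a.max() - a.min()))  # [0.         0.33333333 1.        ]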

x = df.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df = pd.DataFrame(x_scaled)
df.head()

Split the data into train and test sets to run the algorithms and compare them based on accuracy. This code uses a 70/30 train/test split.

x_train, x_test, y_train, y_test = train_test_split(x_scaled, y, test_size=0.3)  # use the scaled features so the normalization step takes effect

Classification VS Regression:

Regression and classification are categorized under the same umbrella of supervised machine learning. Both share the same concept of utilizing known data sets (referred to as training data sets) to make predictions.

Regression: Regression is an ML algorithm that can be trained to predict real-numbered outputs, like temperature or stock price. Regression is based on a hypothesis that can be linear, quadratic, polynomial, non-linear, etc. The hypothesis is a function of some hidden parameters and the input values.

Classification: Classification is the process of predicting the class of given data points. Classes are sometimes called targets, labels, or categories. Classification belongs to the category of supervised learning, where the targets are provided along with the input data.
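
As a toy illustration of the difference, here is a minimal sketch on made-up numbers (not from this data set):

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1], [2], [3], [4]])

# Regression: a continuous target, so the prediction is a real number
reg = LinearRegression().fit(X, [1.5, 3.1, 4.4, 6.2])
print(reg.predict([[5]]))  # about 7.65

# Classification: a discrete target, so the prediction is one of the known labels
clf = LogisticRegression().fit(X, [0, 0, 1, 1])
print(clf.predict([[5]]))  # [1]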

We chose classification for this data set because the target column, Status, contains specific categorical label values. Common families of classification algorithms include:

  1. Linear Classifiers: Logistic Regression, Naive Bayes Classifier
  2. Nearest Neighbor
  3. Support Vector Machines
  4. Decision Trees
  5. Boosted Trees
  6. Random Forest
  7. Neural Networks
  8. Perceptron

Now let us apply some of these classifier algorithms.

CART (Decision Tree):

from sklearn.tree import DecisionTreeClassifier  # Import Decision Tree Classifier
from sklearn import metrics  # Import scikit-learn metrics module for accuracy calculation

# Create Decision Tree classifier object
clf = DecisionTreeClassifier()
# Train Decision Tree classifier
clf = clf.fit(x_train, y_train)
# Predict the response for the test data set
y_pred = clf.predict(x_test)

Accuracy:

Accuracy_decision_tree = metrics.accuracy_score(y_test, y_pred)
Accuracy_decision_tree

A pandas Series is a one-dimensional ndarray with axis labels. The labels need not be unique but must be a hashable type. The object supports both integer- and label-based indexing and provides a host of methods for operations involving the index. The call y.values.ravel() is used in the following sections to flatten the single-column y data frame into a one-dimensional ndarray, which is the shape sklearn estimators expect for the target.
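
A minimal sketch of that flattening (the column name matches our target; the numbers are illustrative):

import pandas as pd

y_toy = pd.DataFrame({'Status': [2, 0, 1]})
print(y_toy.values.shape)    # (3, 1) -- a 2-D column
print(y_toy.values.ravel())  # [2 0 1] -- flattened to 1-D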

Random Forest:

Choose n_estimators (the number of trees) based on the size of the data you are processing.

# Import the Random Forest model
from sklearn.ensemble import RandomForestClassifier
# Create a Random Forest classifier with 500 trees
clf = RandomForestClassifier(n_estimators=500)
# Train the model using the training sets
clf.fit(x_train, y_train.values.ravel())
# Predict the response for the test data set
y_pred = clf.predict(x_test)

Accuracy:

Accuracy_random_forest=metrics.accuracy_score(y_test, y_pred)
Accuracy_random_forest

Perceptron:

Keep max_iter high for large data sets so the perceptron has enough passes over the data to converge.

from sklearn.linear_model import Perceptron

perceptron = Perceptron(max_iter=20)
perceptron.fit(x_train, y_train.values.ravel())
y_pred = perceptron.predict(x_test)

Accuracy:

accuracy_perceptron = round(perceptron.score(x_test, y_test), 2)  # evaluate on the test set, consistent with the other models
accuracy_perceptron

Logistic Regression:

We need to specify the solver, multi_class, and max_iter parameters to avoid convergence warnings and deprecation errors.

# import the class
from sklearn.linear_model import LogisticRegression
# instantiate the model (using the default parameters)
logreg = LogisticRegression(max_iter=500, random_state=0, solver='lbfgs', multi_class='multinomial')
# fit the model with data
logreg.fit(x_train,y_train.values.ravel())
# predict the response for the test data set
y_pred=logreg.predict(x_test)

Accuracy:

Accuracy_LogisticRegression=metrics.accuracy_score(y_test, y_pred)
Accuracy_LogisticRegression

Neural Networks:

We need to specify the activation, solver, and max_iter parameters to avoid convergence warnings.

from sklearn.neural_network import MLPClassifier

# A small network with three hidden layers of 8 neurons each
mlp = MLPClassifier(hidden_layer_sizes=(8, 8, 8), activation='relu', solver='adam', max_iter=500)
mlp.fit(x_train,y_train.values.ravel())
predict_train = mlp.predict(x_train)
predict_test = mlp.predict(x_test)

Accuracy:

Accuracy_Neural_Networks=metrics.accuracy_score(y_test, predict_test)
Accuracy_Neural_Networks

Naive Bayes:

from sklearn.naive_bayes import GaussianNB

# Create a Gaussian Naive Bayes classifier
gnb = GaussianNB()
#Train the model using the training sets
gnb.fit(x_train, y_train.values.ravel())
#Predict the response for test dataset
y_pred = gnb.predict(x_test)

Accuracy:

Accuracy_Naive_Bayes=metrics.accuracy_score(y_test, y_pred)
Accuracy_Naive_Bayes

Display all the different accuracies in descending order:

results = pd.DataFrame({
    'Model': ['Logistic Regression', 'Random Forest', 'Naive Bayes', 'Perceptron', 'Decision Tree', 'Neural Networks'],
    'Score': [Accuracy_LogisticRegression, Accuracy_random_forest, Accuracy_Naive_Bayes, accuracy_perceptron, Accuracy_decision_tree, Accuracy_Neural_Networks]})
result_df = results.sort_values(by='Score', ascending=False)
result_df = result_df.set_index('Score')
result_df
Accuracies

Conclusion:

The conclusion is that Random Forest is the best algorithm for this data set, with an accuracy of about 83% among all the algorithms we applied.

Summary:

To summarize: this blog loads the data set, cleans and preprocesses it, applies different types of classification algorithms, and calculates the accuracy of each. Based on those accuracies, we conclude the best possible algorithm for this data set.
