Comparative analysis of different Machine Learning algorithms.


This blog applies different Machine Learning algorithms to a data set of Uber car booking requests and their status, and compares the results.

Data set:

Data set link:

Data set Information:

This data set consists of 6 different attributes that are associated with requests made by customers.

1) Request id: A unique identifier of the request

2) Time of request: The date and time at which the customer made the trip request

3) Drop-off time: The drop-off date and time, in case the trip was completed

4) Pick-up point: The point from which the request was made

5) Driver id: The unique identification number of the driver

6) Status of the request: The final status of the trip, which can be completed, canceled by the driver, or no cars available

Let's begin applying those algorithms to this data.

Importing the libraries:

import pandas as pd 
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

Acquiring data:

df = pd.read_csv('Uber Request Data.csv')

Cleaning of data:

df = df.drop(["Request timestamp", "Drop timestamp"], axis=1)
Data set after removal of columns

Characteristics of the data set:

(6745, 4)

Pre-processing of data:

Missing values sum

Missing values are found in Driver_id. Because its values are arbitrary identifiers rather than quantities, a mean or median would not be meaningful, so use the mode (the most frequent value) to fill them in.

df['Driver_id'] = df['Driver_id'].fillna(df['Driver_id'].mode()[0])
df.isnull().sum()
Missing values after filling it with mode
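To see what the mode fill does in isolation, here is a minimal sketch on a made-up frame. The toy values are assumptions for illustration, not rows from the Uber data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the Uber data set
toy = pd.DataFrame({"Driver_id": [1.0, 2.0, 2.0, np.nan, 3.0, np.nan]})

# mode() returns a Series of the most frequent values; [0] picks the first one
most_common = toy["Driver_id"].mode()[0]

# Assign the filled column back rather than relying on inplace=True
toy["Driver_id"] = toy["Driver_id"].fillna(most_common)

# Count the missing values that remain after filling
missing_after = int(toy["Driver_id"].isnull().sum())
print(most_common, missing_after)
```

Assigning the result back to the column avoids the chained-assignment warnings that newer pandas versions raise for inplace=True on a selected column.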

Classification algorithms need numerical data, so the remaining text columns have to be preprocessed first.

Since the Status column is of object type, let's label encode it to convert the string data into fixed numbers.

Label encoding VS One-Hot encoding:

Label encoding converts categorical text data into numerical data that a model can work with. In scikit-learn this is done with the LabelEncoder class: import it from the sklearn library, fit and transform the column, and then replace the existing text data with the new encoded data.

Application: We chose label encoding for the Status column because it holds a small set of discrete labels, which makes this data set suitable for classification algorithms. Label encoding turns each of those labels into a fixed integer.
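A minimal sketch of label encoding on its own may help here. The status strings below are assumed example values that mirror the three outcomes described earlier; they are not read from the file.

```python
from sklearn import preprocessing

# Hypothetical status values, assumed for illustration
statuses = ["Trip Completed", "Cancelled", "No Cars Available", "Cancelled"]

encoder = preprocessing.LabelEncoder()
codes = encoder.fit_transform(statuses)

# classes_ holds the labels in sorted order; a label's code is its position there
print(list(encoder.classes_))
print(list(codes))

# inverse_transform maps the integers back to the original strings
restored = list(encoder.inverse_transform(codes))
```

Keeping the fitted encoder around is useful later, since inverse_transform lets you report predictions as readable labels again.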

One-Hot encoding:

One-hot encoding takes a column of categorical data (which may already have been label encoded) and splits it into multiple columns, one per category. The values are replaced by 1s and 0s, depending on which column matches the row's category.

Application: We chose one-hot encoding for the Pickup_point column because it is a categorical value with no natural order; the place a request comes from should not be read as a larger or smaller number. One-hot encoding splits the place into two separate columns in the data set.
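Here is a small sketch of one-hot encoding with pandas. The pick-up values "City" and "Airport" are assumed example categories, and the frame is made up for illustration.

```python
import pandas as pd

# Hypothetical pick-up points; "City" and "Airport" are assumed example values
toy = pd.DataFrame({"Pickup_point": ["City", "Airport", "City"]})

# get_dummies creates one indicator column per distinct category
dummies = pd.get_dummies(toy["Pickup_point"], dtype=int)

# Attach the indicator columns and drop the original text column
encoded = pd.concat([toy, dummies], axis="columns").drop(columns="Pickup_point")
print(encoded)
```

Dropping the original text column matters if the frame is passed to a scaler or a classifier afterwards, since those expect numeric input only.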

label_encoder = preprocessing.LabelEncoder()
df['Status'] = label_encoder.fit_transform(df['Status'])
df['Status'].unique()
status column before and after encoding

Splitting of data set into x and y data sets.
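One possible way to do this split is sketched below on a made-up frame. The column name Request_id and all the values are assumptions; only Status as the target comes from the text above.

```python
import pandas as pd

# Hypothetical frame with the four columns left after cleaning
toy = pd.DataFrame({
    "Request_id": [1, 2, 3],
    "Pickup_point": ["City", "Airport", "City"],
    "Driver_id": [10.0, 11.0, 10.0],
    "Status": [2, 0, 1],
})

# Features: every column except the target
x = toy.drop(columns=["Status"])

# Target: kept as a one-column DataFrame (double brackets),
# which is why .values.ravel() is needed before fitting a model
y = toy[["Status"]]

print(x.shape, y.shape)
```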


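Perform one-hot encoding on the Pickup_point column, as it is also a column of object type.

# Build the indicator columns (assumed here to come from pandas get_dummies)
dummies = pd.get_dummies(df['Pickup_point'])
df = pd.concat([df, dummies], axis='columns')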
image of x after one hot encoding

Normalization of data:

In statistics and applications of statistics, normalization can have a range of meanings. Here it means converting the numerical values into a range between 0 and 1. When columns sit on very different scales (for example, large id numbers next to 0/1 indicator columns), the larger ones can dominate a model. Normalization avoids this problem by creating new values that maintain the general distribution and ratios in the source data, while keeping values within a scale applied across all numeric columns used in the model. So apply normalization to the data set.

x = df.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df = pd.DataFrame(x_scaled)
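To make the scaling concrete: min-max normalization maps each column with (value - column minimum) / (column maximum - column minimum). The sketch below checks that formula against MinMaxScaler on a small made-up array.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical values: one wide-range column and one 0/1 column
values = np.array([[10.0, 1.0],
                   [20.0, 0.0],
                   [40.0, 1.0]])

# Library result
scaled = preprocessing.MinMaxScaler().fit_transform(values)

# Same computation by hand, column by column
col_min = values.min(axis=0)
col_max = values.max(axis=0)
manual = (values - col_min) / (col_max - col_min)

print(np.allclose(scaled, manual))
```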

Split the data into training and test sets so the algorithms can be fitted on one part and compared by accuracy on the other. This code puts 30% of the rows in the test set and the remaining 70% in the training set.

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3)
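Because the split is random, the accuracies change from run to run. A hedged variant is sketched below on made-up arrays: random_state fixes the shuffle and stratify keeps the class proportions equal in both parts. Neither argument appears in the original code, so treat both as optional additions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 rows, 2 features, two balanced classes
x_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0, 1] * 10)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo,
    y_demo,
    test_size=0.3,      # 30% of the rows go to the test set
    random_state=0,     # fixed seed so the split is reproducible
    stratify=y_demo,    # keep the class balance in both parts
)

print(x_tr.shape, x_te.shape)
```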

Classification VS Regression:

Regression: Regression is an ML technique that can be trained to predict real-numbered outputs, like temperature, stock price, etc. Regression is based on a hypothesis that can be linear, quadratic, polynomial, non-linear, etc. The hypothesis is a function that is based on some hidden parameters and the input values.

Classification: Classification is the process of predicting the class of given data points. Classes are sometimes called targets, labels or categories. Classification belongs to the category of supervised learning, where the targets are also provided with the input data.
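The difference shows up directly in what a fitted model returns. In this small sketch, the distances, fares and labels are invented for illustration: the regressor returns a real number, while the classifier returns one of the known classes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical single feature (say, trip distance in km)
distance = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])

# Regression target: a real number (fare = 10 + 2 * distance)
fare = np.array([12.0, 14.0, 16.0, 18.0, 20.0, 22.0])

# Classification target: a discrete label (0 = short trip, 1 = long trip)
label = np.array([0, 0, 0, 1, 1, 1])

reg = LinearRegression().fit(distance, fare)
clf = LogisticRegression().fit(distance, label)

reg_pred = reg.predict([[7.0]])   # a continuous value
clf_pred = clf.predict([[7.0]])   # one of the class labels
print(reg_pred, clf_pred)
```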

We chose classification for this data set because the target column, Status, holds a fixed set of categorical labels. Commonly used classification algorithms include:

  1. Linear Classifiers: Logistic Regression, Naive Bayes Classifier
  2. Nearest Neighbor
  3. Support Vector Machines
  4. Decision Trees
  5. Boosted Trees
  6. Random Forest
  7. Neural Networks
  8. Perceptron

Application of some of these Classifier Algorithms.

CART (Decision Tree):

from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
# Create Decision Tree classifier object
clf = DecisionTreeClassifier()
# Train Decision Tree classifier
clf = clf.fit(x_train, y_train)
# Predict the response for the test data set
y_pred = clf.predict(x_test)


Accuracy_descision_tree=metrics.accuracy_score(y_test, y_pred)
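Accuracy here is simply the share of test rows whose predicted label matches the true one. A quick check on made-up labels shows that accuracy_score agrees with the hand computation.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted labels for eight test rows
y_true = np.array([0, 1, 2, 2, 1, 0, 2, 1])
y_hat = np.array([0, 1, 2, 1, 1, 0, 2, 0])

# Fraction of positions where the prediction equals the truth
manual = float(np.mean(y_true == y_hat))

# Same figure from scikit-learn
library = metrics.accuracy_score(y_true, y_hat)

print(manual, library)
```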

A pandas Series is a one-dimensional ndarray with axis labels. The labels need not be unique but must be of a hashable type. The object supports both integer- and label-based indexing and provides a host of methods for performing operations involving the index. In the code that follows, y_train.values gives the underlying ndarray and ravel() flattens it into one dimension, which is the shape the scikit-learn classifiers expect for the target.
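The effect of .values.ravel() is easiest to see from the shapes. The one-column frame below is a made-up stand-in for the target.

```python
import pandas as pd

# Hypothetical one-column target frame
y_frame = pd.DataFrame({"Status": [2, 0, 1, 0]})

# .values on a one-column DataFrame is two-dimensional: one row per sample, one column
column = y_frame.values

# ravel() flattens it to the one-dimensional shape classifiers expect
flat = y_frame.values.ravel()

print(column.shape, flat.shape)
```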

Random Forest:

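# Import Random Forest model
from sklearn.ensemble import RandomForestClassifier
# Create a Random Forest classifier (default settings assumed)
clf = RandomForestClassifier()
# Train the model using the training sets
clf.fit(x_train, y_train.values.ravel())
# Predict the response for the test data set
y_pred = clf.predict(x_test)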


Accuracy_random_forest=metrics.accuracy_score(y_test, y_pred)


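Perceptron:

from sklearn.linear_model import Perceptron
perceptron = Perceptron(max_iter=20)
perceptron.fit(x_train, y_train.values.ravel())
y_pred = perceptron.predict(x_test)


# Score on the test set so the figure is comparable with the other models
accuracy_perceptron = round(perceptron.score(x_test, y_test), 2)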

Logistic Regression:

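# import the class
from sklearn.linear_model import LogisticRegression
# instantiate the model; lbfgs fits a multinomial model for multi-class targets,
# so the multi_class argument (deprecated in recent scikit-learn) is left out
logreg = LogisticRegression(max_iter=500, random_state=0, solver='lbfgs')
# fit the model with data
logreg.fit(x_train, y_train.values.ravel())
# predict the response for the test data set
y_pred = logreg.predict(x_test)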


Accuracy_LogisticRegression=metrics.accuracy_score(y_test, y_pred)

Neural Networks:

from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(hidden_layer_sizes=(8,8,8), activation='relu', solver='adam', max_iter=500)
mlp.fit(x_train, y_train.values.ravel())
predict_train = mlp.predict(x_train)
predict_test = mlp.predict(x_test)


Accuracy_Neural_Networks=metrics.accuracy_score(y_test, predict_test)

Naive Bayes:

from sklearn.naive_bayes import GaussianNB
# Create a Gaussian Naive Bayes classifier
gnb = GaussianNB()
# Train the model using the training sets
gnb.fit(x_train, y_train.values.ravel())
#Predict the response for test dataset
y_pred = gnb.predict(x_test)


Accuracy_Naive_Bayes=metrics.accuracy_score(y_test, y_pred)

Display all the accuracies in descending order:

results = pd.DataFrame({
    'Model': ['Logistic Regression', 'Random Forest', 'Naive Bayes', 'Perceptron', 'Decision Tree', 'Neural Networks'],
    'Score': [Accuracy_LogisticRegression, Accuracy_random_forest, Accuracy_Naive_Bayes, accuracy_perceptron, Accuracy_descision_tree, Accuracy_Neural_Networks]})
result_df = results.sort_values(by='Score', ascending=False)
result_df = result_df.set_index('Score')
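The same comparison can be written as one loop, which keeps every model on the same split and the same metric. This is only a sketch: it uses synthetic data from make_classification in place of the Uber file, and the three models and their settings are picked for illustration.

```python
import pandas as pd
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data, standing in for the real data set
x_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.3, random_state=0
)

# Hypothetical selection of models to compare
models = {
    "Logistic Regression": LogisticRegression(max_iter=500),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# Fit each model and record its accuracy on the held-out test set
rows = []
for name, model in models.items():
    model.fit(x_tr, y_tr)
    score = metrics.accuracy_score(y_te, model.predict(x_te))
    rows.append({"Model": name, "Score": score})

# Sort so the best model comes first
comparison = pd.DataFrame(rows).sort_values(by="Score", ascending=False)
print(comparison.to_string(index=False))
```

Indexing the final table by Model rather than Score may also read more naturally, since scores can tie while model names cannot.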