Suppose you are a product manager and want to classify customer reviews into positive and negative classes.

Or, as a loan manager, you want to identify which loan applicants are safe and which are risky.

Or, as a healthcare analyst, you want to predict which patients are likely to develop diabetes.

All of these examples pose the same kind of problem: classifying reviews, loan applicants, or patients.

Naive Bayes is one of the simplest and fastest classification algorithms, and it scales well to large datasets. Naive Bayes classifiers are used successfully in applications such as spam filtering, text classification, sentiment analysis, and recommender systems. The algorithm uses Bayes' theorem to predict the class of an unseen example.
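As a reminder of what "using Bayes' theorem" means here, the standard formulation is written out below. The symbols are notation introduced for this sketch and do not come from the text above: $y$ is a class and $x_1, \dots, x_n$ are the feature values of one example.

$$
P(y \mid x_1, \dots, x_n) = \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}
$$

The "naive" part is the assumption that the features are conditionally independent given the class, which lets the likelihood factor into one term per feature. The predicted class is then the one with the largest numerator, since the denominator is the same for every class:

$$
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$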


# Classification workflow 

Whenever you perform classification, the first step is to understand the problem and identify potential features and the label. Features are the characteristics or attributes that affect the value of the label. For example, in loan distribution, bank managers look at a customer's occupation, income, age, location, previous loan history, transaction history, and credit score. These characteristics are the features that help the model classify customers. Classification has two phases: a learning phase and an evaluation phase. In the learning phase, the classifier trains its model on a given dataset; in the evaluation phase, the classifier's performance is tested. Performance is evaluated using metrics such as accuracy, error, precision, and recall.
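To make the evaluation metrics concrete, here is a minimal sketch that computes them with scikit-learn. The labels and predictions are made up for illustration (1 = risky applicant, 0 = safe applicant) and are not taken from any real dataset.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical true labels and model predictions for eight loan applicants
# (1 = risky, 0 = safe). These values are invented for this example.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_hat = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_hat)    # share of all predictions that are correct
precision = precision_score(y_true, y_hat)  # share of predicted 1s that are truly 1
recall = recall_score(y_true, y_hat)        # share of true 1s that were predicted as 1
error = 1 - accuracy                        # share of predictions that are wrong

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("Error:", error)
```

With these toy values there are 3 true positives, 1 false positive, and 1 false negative, so accuracy, precision, and recall all come out to 0.75.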


![image.png](attachment:image.png)


![image-2.png](attachment:image-2.png)

Let's understand how Naive Bayes works through an example. You are given a set of weather conditions together with whether sports were played under them. Based on the weather, you need to classify whether players will play or not.


![image.png](attachment:image.png)


# 1) Defining Dataset
In this example, you can use a dummy dataset with three columns: weather, temperature, and play. The first two (weather and temperature) are the features, and play is the label.


In [45]:
# Assigning features and label variables
weather=['Sunny','Sunny','Overcast','Rainy','Rainy','Rainy','Overcast','Sunny','Sunny',
'Rainy','Sunny','Overcast','Overcast','Rainy']
temp=['Hot','Hot','Hot','Mild','Cool','Cool','Cool','Mild','Cool','Mild','Mild','Mild','Hot','Mild']

play=['No','No','Yes','Yes','Yes','No','Yes','No','Yes','Yes','Yes','Yes','Yes','No']
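
Before handing the data to a library, it can help to see the calculation for a single feature done by hand. The sketch below estimates the probability of playing given sunny weather from frequency counts. The helper name `posterior` is introduced here for illustration only.

```python
from collections import Counter

weather = ['Sunny', 'Sunny', 'Overcast', 'Rainy', 'Rainy', 'Rainy', 'Overcast',
           'Sunny', 'Sunny', 'Rainy', 'Sunny', 'Overcast', 'Overcast', 'Rainy']
play = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
        'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']

n = len(play)
class_counts = Counter(play)                 # how often each label occurs
weather_counts = Counter(weather)            # how often each weather value occurs
joint_counts = Counter(zip(weather, play))   # how often each (weather, label) pair occurs


def posterior(label, condition):
    """P(label | condition) via Bayes' theorem, estimated from counts."""
    prior = class_counts[label] / n                                    # P(label)
    likelihood = joint_counts[(condition, label)] / class_counts[label]  # P(condition | label)
    evidence = weather_counts[condition] / n                           # P(condition)
    return likelihood * prior / evidence


p_yes_sunny = posterior('Yes', 'Sunny')
p_no_sunny = posterior('No', 'Sunny')
print("P(Yes | Sunny):", round(p_yes_sunny, 3))
print("P(No  | Sunny):", round(p_no_sunny, 3))
```

Of the five sunny days, two were played and three were not, so the posterior for 'Yes' is 0.4 and for 'No' is 0.6; on weather alone, the prediction for a sunny day would be 'No'.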

# Encoding Features
First, you need to convert these string labels into numbers, for example 'Overcast', 'Rainy', and 'Sunny' into 0, 1, and 2. This is known as label encoding.

Scikit-learn provides the LabelEncoder class, which encodes labels with values between 0 and one less than the number of discrete classes.

In [46]:
# Import LabelEncoder
from sklearn import preprocessing
#creating labelEncoder
le = preprocessing.LabelEncoder()
# Converting string labels into numbers.
weather_encoded=le.fit_transform(weather)
print (weather_encoded)

[2 2 0 1 1 1 0 2 2 1 2 0 0 1]


Similarly, you can also encode temp and play columns.

In [47]:
# Converting string labels into numbers
temp_encoded=le.fit_transform(temp)
label=le.fit_transform(play)
print ("Temp:",temp_encoded)
print ("Play:",label)

Temp: [1 1 1 2 0 0 0 2 0 2 2 2 1 2]
Play: [0 0 1 1 1 0 1 0 1 1 1 1 1 0]
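
One thing to keep in mind is that refitting the same `LabelEncoder` replaces its previous mapping, so after the cells above `le` only remembers the `play` classes. The sketch below keeps one encoder per column so that codes can be translated back later. The short lists and the variable names `weather_le` and `play_le` are assumptions made for this example.

```python
from sklearn.preprocessing import LabelEncoder

# A few sample values, just enough to show the mapping
weather_sample = ['Sunny', 'Overcast', 'Rainy', 'Sunny']
play_sample = ['No', 'Yes', 'Yes', 'No']

# One encoder per column keeps each mapping available
weather_le = LabelEncoder()
play_le = LabelEncoder()

weather_codes = weather_le.fit_transform(weather_sample)
play_codes = play_le.fit_transform(play_sample)

# classes_ lists the original strings in code order (sorted alphabetically)
print(list(weather_le.classes_))
print(list(play_le.classes_))

# inverse_transform maps integer codes back to the original strings
decoded = weather_le.inverse_transform([2, 0])
print(list(decoded))
```

Because the classes are sorted alphabetically, 'Overcast' maps to 0, 'Rainy' to 1, and 'Sunny' to 2, which matches the encoded output shown above.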


Now combine both features (weather and temp) into a single variable: a 2-D NumPy array with one row per observation.

In [48]:
import numpy as np
# Combining weather and temp into a single 2-D array (one row per sample)

features = np.array((weather_encoded,temp_encoded)).T
features

array([[2, 1],
 [2, 1],
 [0, 1],
 [1, 2],
 [1, 0],
 [1, 0],
 [0, 0],
 [2, 2],
 [2, 0],
 [1, 2],
 [2, 2],
 [0, 2],
 [0, 1],
 [1, 2]])

# Generating Model
Generate a model using the Naive Bayes classifier in the following steps:

1. Create the Naive Bayes classifier
2. Fit the classifier on the dataset
3. Perform prediction

In [50]:
#Import Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB

#Create a Gaussian Classifier
model = GaussianNB()

# Train the model using the training sets
model.fit(features,label)

#Predict Output
predicted = model.predict([[2, 1]])  # 2: Sunny, 1: Hot
print ("Predicted Value:", predicted)

Predicted Value: [0]


Here, 0 indicates that players will not play ('No'); a value of 1 would mean 'Yes'.
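
The features in this toy example are categories rather than continuous measurements, so a categorical variant of Naive Bayes may be a more natural fit than the Gaussian one. The sketch below is an alternative, not the approach used in this notebook: it swaps in scikit-learn's `CategoricalNB` and uses `OrdinalEncoder` to encode both feature columns at once. The variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

weather = ['Sunny', 'Sunny', 'Overcast', 'Rainy', 'Rainy', 'Rainy', 'Overcast',
           'Sunny', 'Sunny', 'Rainy', 'Sunny', 'Overcast', 'Overcast', 'Rainy']
temp = ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool',
        'Mild', 'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild']
play = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
        'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']

# Stack the two feature columns into a (14, 2) array of strings
X_raw = np.array(list(zip(weather, temp)))

# Encode each column to integer category codes
feature_encoder = OrdinalEncoder()
X_cat = feature_encoder.fit_transform(X_raw).astype(int)

# CategoricalNB models each feature as a categorical distribution per class
cat_model = CategoricalNB()  # default Laplace smoothing (alpha=1.0)
cat_model.fit(X_cat, play)   # string labels are accepted as-is

# Classify a sunny, hot day
query = feature_encoder.transform([['Sunny', 'Hot']]).astype(int)
prediction = cat_model.predict(query)
print("Predicted:", prediction[0])
print("Class order:", list(cat_model.classes_))
print("Probabilities:", cat_model.predict_proba(query)[0])
```

For a sunny, hot day this variant also predicts 'No', in agreement with the Gaussian model above.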

# Naive Bayes with Multiple Labels
So far you have learned Naive Bayes classification with binary labels. Now you will learn about classification with more than two classes, which is known as multi-class classification. An example is classifying a news article as technology, entertainment, politics, or sports.

In the model building part, you can use the wine dataset, which is a well-known multi-class classification problem. "This dataset is the result of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars." (UC Irvine) The dataset comprises 13 features (alcohol, malic_acid, ash, alcalinity_of_ash, magnesium, total_phenols, flavanoids, nonflavanoid_phenols, proanthocyanins, color_intensity, hue, od280/od315_of_diluted_wines, proline) and the type of wine cultivar. The data has three types of wine: class_0, class_1, and class_2. Here you can build a model to classify the type of wine. The dataset is available in the scikit-learn library.

## Loading Data
Let's first load the required wine dataset from scikit-learn datasets.

In [51]:
#Import scikit-learn dataset library
from sklearn import datasets

#Load dataset
wine = datasets.load_wine()

## Exploring Data
You can print the target and feature names to make sure you have the right dataset:

In [53]:
# print the names of the 13 features
print ("Features: ", wine.feature_names)

# print the label type of wine(class_0, class_1, class_2)
print ("\nLabels: ", wine.target_names)

Features: ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']

Labels: ['class_0' 'class_1' 'class_2']


It's a good idea to always explore your data a bit, so you know what you're working with. The cells below show the feature names, the shape of the data, the first five rows of the dataset, and the target variable for the whole dataset.

In [68]:
wine.feature_names

['alcohol',
 'malic_acid',
 'ash',
 'alcalinity_of_ash',
 'magnesium',
 'total_phenols',
 'flavanoids',
 'nonflavanoid_phenols',
 'proanthocyanins',
 'color_intensity',
 'hue',
 'od280/od315_of_diluted_wines',
 'proline']

In [23]:
# print data(feature)shape
wine.data.shape

(178, 13)

In [26]:
# print the wine data features (top 5 records)
print (wine.data[0:5]) # Here we should apply a StandardScaler to the data!

[[1.423e+01 1.710e+00 2.430e+00 1.560e+01 1.270e+02 2.800e+00 3.060e+00
 2.800e-01 2.290e+00 5.640e+00 1.040e+00 3.920e+00 1.065e+03]
 [1.320e+01 1.780e+00 2.140e+00 1.120e+01 1.000e+02 2.650e+00 2.760e+00
 2.600e-01 1.280e+00 4.380e+00 1.050e+00 3.400e+00 1.050e+03]
 [1.316e+01 2.360e+00 2.670e+00 1.860e+01 1.010e+02 2.800e+00 3.240e+00
 3.000e-01 2.810e+00 5.680e+00 1.030e+00 3.170e+00 1.185e+03]
 [1.437e+01 1.950e+00 2.500e+00 1.680e+01 1.130e+02 3.850e+00 3.490e+00
 2.400e-01 2.180e+00 7.800e+00 8.600e-01 3.450e+00 1.480e+03]
 [1.324e+01 2.590e+00 2.870e+00 2.100e+01 1.180e+02 2.800e+00 2.690e+00
 3.900e-01 1.820e+00 4.320e+00 1.040e+00 2.930e+00 7.350e+02]]


In [27]:
# print the wine labels (0:class_0, 1:class_1, 2:class_2)
print (wine.target)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
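
The raw target array is hard to read at a glance. As a small addition to the exploration above, the sketch below counts how many samples belong to each class. The variable name `counts` is an assumption made for this example.

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()

# np.bincount counts the occurrences of each integer label 0, 1, 2
counts = np.bincount(wine.target)

for name, count in zip(wine.target_names, counts):
    print(name, count)
```

The classes are not perfectly balanced (59, 71, and 48 samples), which is worth remembering when splitting the data and when reading the accuracy score.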


## Splitting Data
First, you separate the columns into independent and dependent variables (that is, features and label). Then you split those variables into a training set and a test set.
![image.png](attachment:image.png)



In [71]:
# Import train_test_split function
from sklearn.model_selection import train_test_split

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.3, random_state=109) # 70% training and 30% test



In [73]:
X_train.shape

(124, 13)
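
Because the three wine classes have different sizes, one option worth considering is a stratified split, which keeps the class proportions roughly equal in the training and test sets. The sketch below is a variation on the split above, not a replacement for it. It uses the `stratify` parameter of `train_test_split`, and the `_s` suffix on the variable names is an assumption made to avoid overwriting the notebook's variables.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

wine = load_wine()

# stratify=wine.target preserves the class ratios in both subsets
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=109, stratify=wine.target
)

print("Train shape:", X_train_s.shape)
print("Test shape:", X_test_s.shape)

# Class proportions in each subset
print("Train class share:", np.round(np.bincount(y_train_s) / len(y_train_s), 2))
print("Test class share:", np.round(np.bincount(y_test_s) / len(y_test_s), 2))
```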

In [30]:
X_train

array([[1.323e+01, 3.300e+00, 2.280e+00, ..., 5.600e-01, 1.510e+00,
 6.750e+02],
 [1.384e+01, 4.120e+00, 2.380e+00, ..., 5.700e-01, 1.640e+00,
 4.800e+02],
 [1.220e+01, 3.030e+00, 2.320e+00, ..., 6.600e-01, 1.830e+00,
 5.100e+02],
 ...,
 [1.362e+01, 4.950e+00, 2.350e+00, ..., 9.100e-01, 2.050e+00,
 5.500e+02],
 [1.336e+01, 2.560e+00, 2.350e+00, ..., 7.000e-01, 2.470e+00,
 7.800e+02],
 [1.439e+01, 1.870e+00, 2.450e+00, ..., 1.020e+00, 3.580e+00,
 1.290e+03]])

In [31]:
X_test

array([[1.330000e+01, 1.720000e+00, 2.140000e+00, 1.700000e+01,
 9.400000e+01, 2.400000e+00, 2.190000e+00, 2.700000e-01,
 1.350000e+00, 3.950000e+00, 1.020000e+00, 2.770000e+00,
 1.285000e+03],
 [1.293000e+01, 3.800000e+00, 2.650000e+00, 1.860000e+01,
 1.020000e+02, 2.410000e+00, 2.410000e+00, 2.500000e-01,
 1.980000e+00, 4.500000e+00, 1.030000e+00, 3.520000e+00,
 7.700000e+02],
 [1.221000e+01, 1.190000e+00, 1.750000e+00, 1.680000e+01,
 1.510000e+02, 1.850000e+00, 1.280000e+00, 1.400000e-01,
 2.500000e+00, 2.850000e+00, 1.280000e+00, 3.070000e+00,
 7.180000e+02],
 [1.253000e+01, 5.510000e+00, 2.640000e+00, 2.500000e+01,
 9.600000e+01, 1.790000e+00, 6.000000e-01, 6.300000e-01,
 1.100000e+00, 5.000000e+00, 8.200000e-01, 1.690000e+00,
 5.150000e+02],
 [1.421000e+01, 4.040000e+00, 2.440000e+00, 1.890000e+01,
 1.110000e+02, 2.850000e+00, 2.650000e+00, 3.000000e-01,
 1.250000e+00, 5.240000e+00, 8.700000e-01, 3.330000e+00,
 1.080000e+03],
 [1.311000e+01, 1.010000e+00, 1.700000e+00, 1.500000e+

In [32]:
y_train

array([2, 2, 2, 2, 1, 0, 2, 1, 2, 1, 1, 0, 1, 0, 2, 2, 1, 1, 0, 1, 0, 1,
 2, 1, 2, 0, 0, 0, 1, 0, 0, 2, 1, 2, 1, 1, 2, 0, 1, 2, 1, 0, 0, 2,
 1, 0, 2, 1, 1, 0, 1, 1, 2, 1, 1, 1, 0, 2, 0, 1, 1, 0, 1, 1, 1, 1,
 2, 1, 1, 1, 1, 0, 0, 0, 1, 1, 2, 2, 1, 1, 0, 1, 2, 2, 1, 2, 2, 1,
 1, 2, 2, 2, 0, 0, 0, 0, 0, 1, 1, 1, 0, 2, 0, 1, 2, 0, 0, 1, 0, 1,
 1, 2, 1, 2, 0, 0, 0, 1, 0, 0, 1, 2, 2, 0])

In [33]:
y_test

array([0, 0, 1, 2, 0, 1, 0, 1, 1, 0, 1, 1, 2, 2, 0, 1, 1, 0, 0, 1, 2, 1,
 0, 2, 0, 0, 1, 2, 0, 1, 2, 1, 1, 0, 1, 1, 0, 2, 2, 0, 2, 0, 0, 0,
 0, 2, 2, 0, 1, 1, 2, 1, 0, 2])

## Model Generation


In [74]:
#Import Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB

#Create a Gaussian Classifier
gnb = GaussianNB()

#Train the model using the training sets
gnb.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = gnb.predict(X_test)

## Evaluating Model
After model generation, check the accuracy using actual and predicted values.

In [75]:
#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.9074074074074074
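
Accuracy is a single number and hides which classes get confused with each other. As a complement, the sketch below prints a confusion matrix and per-class precision and recall. It repeats the loading, splitting, and fitting steps so that it runs on its own, and the variable names ending in `_cm` are assumptions made for this example.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

wine = load_wine()
X_train_cm, X_test_cm, y_train_cm, y_test_cm = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=109
)

gnb_cm = GaussianNB().fit(X_train_cm, y_train_cm)
y_pred_cm = gnb_cm.predict(X_test_cm)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test_cm, y_pred_cm)
print(cm)

# Per-class precision, recall, and F1 score
print(classification_report(y_test_cm, y_pred_cm, target_names=wine.target_names))
```

The diagonal of the matrix holds the correctly classified samples; off-diagonal entries show which class a misclassified wine was mistaken for.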


# Try with another model : Decision Tree Classifier

In [43]:
from sklearn import tree
model = tree.DecisionTreeClassifier()
model.fit(X_train,y_train)
# Predict with the fitted decision tree (not gnb) so the accuracy reflects this model
y_pred = model.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))


# Try with another model : Random Forest Classifier

In [44]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(X_train,y_train)
# Predict with the fitted random forest (not gnb) so the accuracy reflects this model
y_pred = clf.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
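
When comparing several models, it is easy to evaluate the wrong one by accident. One way to reduce that risk is to loop over the models so that each one is fitted and scored with its own predictions. The sketch below is self-contained; the dictionary names `models` and `scores` are assumptions made for this example, and the decision tree is given a `random_state` only to make runs repeatable.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()
X_tr, X_te, y_tr, y_te = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=109
)

models = {
    "GaussianNB": GaussianNB(),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(max_depth=2, random_state=0),
}

scores = {}
for name, estimator in models.items():
    estimator.fit(X_tr, y_tr)
    # Each estimator is scored on its own predictions
    scores[name] = accuracy_score(y_te, estimator.predict(X_te))

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```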




# Let's do better data cleaning and preparation!

In [None]:
1) Download a toy dataset:
 
 https://scikit-learn.org/stable/datasets/toy_dataset.html
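
As a possible starting point for this exercise, the sketch below loads one of the toy datasets (iris is an arbitrary choice here) and wraps scaling and the classifier in a single pipeline, evaluated with 5-fold cross-validation. Standardizing the features should have little effect on Gaussian Naive Bayes itself, but the same pipeline pattern carries over to models that are sensitive to feature scale. The variable names are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy datasets ship with scikit-learn, so no separate download is needed
iris = load_iris()

# The scaler is fitted on the training folds only, which avoids leaking
# information from the validation fold into the preprocessing step
pipeline = make_pipeline(StandardScaler(), GaussianNB())

cv_scores = cross_val_score(pipeline, iris.data, iris.target, cv=5)
print("Fold accuracies:", cv_scores.round(3))
print("Mean accuracy:", cv_scores.mean().round(3))
```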