Machine Learning With Python [Just Model!]


It has been 34 days since I started the 100 days of code challenge, and 21 days into the 70-day pre-boot camp classes by Data Science Nigeria. I have tried to keep the two sets of tasks separate, and I have succeeded some of the time but not all of it. When I first started, I was curious. I wanted to see how to create models right away. I wanted to see the real thing. Although after I did, I had to go back to really understand what was going on. I'm guessing some people are as curious as I was/am, so let's be on our way to building our model. By the way, there are no plots here. Hopefully, that will come some other time.

Firstly, machine learning is broadly categorized into two types:

  • Supervised Learning: This is used when the dataset has a known target variable.

    Variables are the column headings in your dataset, assuming your dataset is in tabular form. The target variable is like the final result that every other column leads to. Supervised learning consists of classification and regression algorithms.

  • Unsupervised learning: This is used when the target variable is unknown. It basically focuses on clustering algorithms.

We will be focusing on running classification and regression algorithms. These algorithms include Support Vector Machine (SVM), Linear Regression, Decision Tree, Naïve Bayes, Neural Network, etc. Note that I used Spyder for this code.
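Just to give a rough sense of where these live (this is only a sketch of imports, not part of the steps below; the module paths are the standard scikit-learn ones):

from sklearn.svm import SVC                         # Support Vector Machine (classification)
from sklearn.linear_model import LinearRegression   # Linear Regression
from sklearn.tree import DecisionTreeClassifier     # Decision Tree
from sklearn.naive_bayes import GaussianNB          # Naïve Bayes
from sklearn.neural_network import MLPClassifier    # a simple Neural Network
from sklearn.cluster import KMeans                  # clustering (unsupervised learning)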

The example will be done with Support Vector Classifier:

STEP 1:

Import your modules. Modules are like your building blocks. Yes, you can mould your own blocks (write algorithms, modules and classes from scratch), but it's definitely easier to use ones that are already made:

import numpy as np
import pandas as pd
from sklearn import svm
from sklearn import metrics  # used later for the confusion matrix and classification report
from sklearn.model_selection import train_test_split

STEP 2:

Import and read your dataset: This uses the pandas library to import and read your dataset as a dataframe with just one line of code. Isn't that awesome?!

Fill in whatever name your dataset file has, including the extension, and give whatever name you want to the newly imported dataframe. Somehow I have gotten used to "df", basically because it is short to write and easy to follow when you are using a guide to learn. Most people conventionally just call their imported dataset df.

The breast cancer dataset can be downloaded from the UCI Machine Learning Repository.

df = pd.read_csv('breast-cancer-wisconsin.data')
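One thing to watch out for: the raw .data file from the UCI repository has no header row, so pandas will treat the first row of data as the column headings. If your copy of the file doesn't already have headings, you may need to supply names yourself. The ones below are my own shorthand labels based on the attribute list in the UCI documentation; the steps that follow only rely on 'id' and 'class':

column_names = ['id', 'clump_thickness', 'cell_size_uniformity', 'cell_shape_uniformity',
                'marginal_adhesion', 'single_epithelial_cell_size', 'bare_nuclei',
                'bland_chromatin', 'normal_nucleoli', 'mitoses', 'class']
df = pd.read_csv('breast-cancer-wisconsin.data', names=column_names)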

STEP 3: Replace missing values (in this case, cells marked with "?") with some other value. Pick a value far from every other value in the dataset; -9999 is a good choice because the algorithm will treat it as an outlier (a value well outside the rest of the data). The parameter inplace has to be set to True: inplace=True applies the change directly to the dataframe instead of returning a modified copy.

df.replace('?', -9999, inplace=True)

STEP 4: Drop any column that won't be useful to your analysis. I prefer doing this in Excel before bringing the data into Python, but the advantage of doing it in Python is that the original file stays untouched on disk. axis=1 means it is a column-wise operation and axis=0 means a row-wise operation.

df.drop(['id'], axis=1, inplace=True)

STEP 5:

Create your feature variable set (the non-target variables). This comprises every column except the target column (read the intro to understand these variables). In this case, 'class' is the target variable, so we drop it from the feature set.

x = np.array(df.drop(['class'], axis=1))

STEP 6:

Create your target variable set. This is the 'class' column (the main classification column, the one that tells whether the diagnosis is benign or malignant).

y = np.array(df['class'])
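If you want to sanity-check the target before going on, the UCI documentation encodes the class column as 2 for benign and 4 for malignant, so a quick look at the unique values (just a check, not a required step) should show those two numbers:

print(np.unique(y))   # expect [2 4]: 2 = benign, 4 = malignant in this dataset's encoding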

STEP 7:

Split your dataset into training and testing sets. The order does matter, so be sure to keep the order shown below. The variable names on the left-hand side can be changed, but it's always safer to use the conventional ones. The x and y in the brackets are the arrays created above; each is split into a training part and a testing part. The training set is the part of the data used to build the model, and the testing set is used to check how the created model fares. The test_size parameter defines how the dataset is split. Below, test_size is 0.2, which means the test set should be 20% of the entire dataset and the training set should be 80%.

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
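To convince yourself that the 80/20 split actually happened, you can print the shapes of the four pieces. If you also want the exact same split on every run, train_test_split accepts a random_state argument; the value 42 below is just an arbitrary example:

print(x_train.shape, x_test.shape)   # roughly 80% and 20% of the rows
print(y_train.shape, y_test.shape)
# for a reproducible split:
# x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)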

STEP 8:

The model (classifier) is then created. Below, the model is called 'clf', but it can be called whatever you want. Also, whatever algorithm you want to use can go in place of the one given below on the right-hand side (svm.SVC). Passing gamma='auto' avoids a warning caused by the default value of gamma changing in recent versions of scikit-learn. Although I noticed that when I used 'scale' instead of 'auto', I got worse accuracy (like, it was so bad! More than a 30% difference).

clf = svm.SVC(gamma='auto')
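If you are curious about the 'auto' vs 'scale' difference I mentioned, one quick way to compare them on your own split is a small loop like the one below (it uses the .fit and .score methods explained in the next steps, and your numbers will depend on your data and split):

for g in ['auto', 'scale']:
    clf_gamma = svm.SVC(gamma=g)                 # same classifier, different gamma setting
    clf_gamma.fit(x_train, y_train)              # train on the training split
    print(g, clf_gamma.score(x_test, y_test))    # accuracy on the test split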

STEP 9:

The training data is fit to the algorithm to build the model. Note the order of the parameters. This is the main model-building step.

clf.fit(x_train, y_train)

STEP 10:

The remaining steps are to see how your model performs. Accuracy is one good measure (though not always the only or the best one) of how well the model works, and the test dataset is used to check it. Note that the training data could also be used to score the model, but that wouldn't tell you much: the accuracy would be misleadingly high because the model has already 'learned' that data. It is always better to evaluate on data the model hasn't seen before.

accuracy = clf.score(x_test, y_test)
print(accuracy)
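To see why scoring on the training data is misleading, you can compare the two scores side by side (the exact numbers will vary from run to run):

print(clf.score(x_train, y_train))   # usually higher: the model has already seen this data
print(clf.score(x_test, y_test))     # the more honest estimate of performance on unseen data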

STEP 11:

Individual rows or groups of rows from the test dataset can be used to confirm whether the model works well. New data can also be passed in for immediate prediction.

For groups:

y_pred = clf.predict(x_test)
print(y_pred)

OR, for data where you want to specify and see all the inputs yourself:

y_predict = np.array([[4,2,1,1,1,2,3,2,1], [4,2,1,2,2,2,3,2,1]])
y_predict = y_predict.reshape(len(y_predict), -1)
prediction = clf.predict(y_predict)
print(prediction)

For Individual Prediction:

y_single_predict = clf.predict([[4,2,1,1,1,2,3,2,1]])  # note the double brackets: predict expects a 2D array even for one row
print(y_single_predict)

STEP 12:

Confusion Matrix. This is used to see how your model performs: it tells you how many data points were correctly and wrongly predicted. It is also the basis on which accuracy is computed, which means that from a confusion matrix you can calculate the accuracy of a model. You can read more about it, but let me tell you that it is a very detailed indicator of how well your model works.

y_pred = clf.predict(x_test)
print(y_pred)
y_act = y_test
print(y_act)
result = metrics.confusion_matrix(y_act, y_pred)
print(result)
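To make the link between the confusion matrix and accuracy concrete: the correct predictions sit on the diagonal of the matrix, so accuracy is just the diagonal sum divided by the total number of test points. This should match the score from STEP 10:

accuracy_from_cm = np.trace(result) / result.sum()   # correct predictions / all predictions
print(accuracy_from_cm)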

STEP 13: Further test results can be viewed, such as Precision, Recall, F1-score and support (you can read about all of these at https://en.m.wikipedia.org/wiki/Precision_and_recall).

print(metrics.classification_report(y_act, y_pred))
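As rough intuition for two of those numbers, treating the malignant class (the second row and column of the matrix above) as the "positive" class: precision is the share of predicted positives that really were positive, and recall is the share of real positives the model found. A hand calculation from the confusion matrix looks roughly like this:

tn, fp, fn, tp = result.ravel()    # 2x2 confusion matrix unpacked as tn, fp, fn, tp
precision = tp / (tp + fp)         # of everything predicted malignant, how much really was
recall = tp / (tp + fn)            # of everything really malignant, how much the model caught
print(precision, recall)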

STEP 14:

Go back to step 0 to understand all these better. 🤗

Best wishes!!!
