Simple Machine Learning Classification
Hi! This is my first Medium article, so thank you for coming here.
In this article, I will tell you about Classification 101. Let’s begin with a brief introduction to machine learning and the classification method.
Machine learning is an application of artificial intelligence (AI) that gives systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on developing computer programs that can access data and use it to learn for themselves.
The learning process begins with observations or data, such as examples, direct experience, or instruction, in order to look for patterns in the data and make better decisions in the future based on the examples we provide. The primary aim is to allow computers to learn automatically, without human intervention or assistance, and to adjust their actions accordingly.
Machine learning is divided into supervised learning, unsupervised learning, and reinforcement learning. Supervised learning is divided into classification and regression. Unsupervised learning is divided into clustering and dimensionality reduction.
Our focus here is the classification method, which falls under supervised learning.
In machine learning, classification refers to a predictive modeling problem where a class label is predicted for a given example of input data, for instance, predicting whether a message is spam or predicting an early indication of stroke.
Now let’s break down the example of predicting an early stroke indication using machine learning classification!
First, we need to import some libraries. In this project, there are three essential libraries to import, listed below (with a short import sketch after the list):
- NumPy provides objects for multi-dimensional arrays.
- Pandas provides high-performance, easy-to-use data structures and data analysis tools. Pandas also provides an in-memory 2D table object called DataFrame.
- Matplotlib provides data visualization.
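Here is a minimal sketch of these imports (the aliases np, pd, and plt are only the usual conventions, not something the project requires):

```python
# Core libraries for this project
import numpy as np                # multi-dimensional arrays
import pandas as pd               # DataFrames and data analysis tools
import matplotlib.pyplot as plt   # data visualization
```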
Another library in this project will be explained later. I used Google Colab to code this project.
To load the data, I use the google.colab files module and io. Both provide functions to import data from the local disk into Google Colab. After that, I convert the raw data into a Pandas DataFrame so we can see an overview of our data; a sketch follows.
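A minimal sketch of that upload step; the filename ‘stroke.csv’ is hypothetical, so use whatever file you actually upload:

```python
import io
from google.colab import files

# Opens a file picker and uploads the chosen file into the Colab runtime
uploaded = files.upload()

# 'stroke.csv' is a hypothetical filename for the stroke dataset
df = pd.read_csv(io.BytesIO(uploaded['stroke.csv']))
df.head()  # quick overview of the data
```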
After this, let’s explore our data using EDA (exploratory data analysis). First, remove duplicate rows and rows with null values. Then convert any string data into numeric labels using LabelEncoder.
Now remove the id_pasien (patient ID) column as well, because an identifier is not related to stroke indication. A sketch of these cleaning steps follows.
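Roughly, assuming the DataFrame is named df as in the upload sketch:

```python
from sklearn.preprocessing import LabelEncoder

# Drop duplicate rows and rows containing null values
df = df.drop_duplicates().dropna()

# Convert every string (object) column into numeric labels
for col in df.select_dtypes(include='object').columns:
    df[col] = LabelEncoder().fit_transform(df[col])

# A patient ID carries no information about stroke risk
df = df.drop(columns=['id_pasien'])
```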
Now let’s make a correlation table to see how related each feature is to an early indication of stroke.
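One simple way to build and visualize the correlation table with Pandas and Matplotlib (just one possible approach, not necessarily the one used in the original notebook):

```python
# Pairwise correlation of all (now numeric) columns
corr = df.corr()
print(corr['stroke'].sort_values(ascending=False))

# Optional: visualize the whole correlation matrix
plt.matshow(corr)
plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.colorbar()
plt.show()
```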
You can see that not all features are correlated with an early indication of stroke. Now remove the less useful features using the VIF method. Variance inflation factor (VIF) is a measure of the amount of multicollinearity in a set of multiple regression variables. Mathematically, the VIF for a model variable is the ratio of the variance of the full model to the variance of a model that includes only that single independent variable; equivalently, VIF_i = 1 / (1 − R_i²), where R_i² is obtained by regressing the i-th variable on all the other independent variables. To use the VIF method, separate the dependent variable from the independent variables. A dependent variable is a variable whose variation depends on the independent variables; in this case, it is ‘stroke’. An independent variable is a variable whose variation does not depend on another variable; in this case, it is every feature besides ‘stroke’.
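A sketch of the VIF computation using statsmodels, assuming the cleaned DataFrame df from the earlier steps:

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Dependent variable: 'stroke'; independent variables: everything else
X = df.drop(columns=['stroke'])
y = df['stroke']

# VIF for each independent variable
vif = pd.DataFrame({
    'feature': X.columns,
    'VIF': [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
})
print(vif.sort_values('VIF', ascending=False))
```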
After using the VIF method, you can see that ‘umur’ (age) is not correlated enough with an early indication of stroke, so it is removed. Therefore, the final independent variables are [‘jenis_kelamin’, ‘hipertensi’, ‘penyakit_jantung’, ‘sudah_menikah’, ‘jenis_pekerjaan’, ‘jenis_tempat_tinggal’, ‘rata2_level_glukosa’, ‘bmi’, ‘merokok’].
Now let’s make the model using three different machine learning classification methods. But the first thing we have to do is separate the data into train data and test data. I use scikit-learn’s train_test_split function for this, allocating 75% of the data to train data and the other 25% to test data, as in the sketch below.
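A minimal sketch of the split, reusing X and y from the VIF step (random_state=42 is an arbitrary choice for reproducibility, not a value from the original project):

```python
from sklearn.model_selection import train_test_split

# Drop 'umur' (age), which was removed after the VIF step
X = X.drop(columns=['umur'])

# 75% train data, 25% test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
```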
To make each model, first fit the independent and dependent variables of the train data, then predict the result using the test data. Each method below follows this same fit-and-predict pattern.
Logistic Regression method:
Logistic regression uses a logistic (sigmoid) function, σ(z) = 1 / (1 + e^(−z)), to frame a binary output model. The output of the logistic regression is a probability x (0 ≤ x ≤ 1) and can be used to predict the binary 0 or 1 as the output (if x < 0.5, output = 0; else output = 1).
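A minimal scikit-learn sketch of this method, following the fit-and-predict pattern described above (max_iter=1000 is an assumption to help the solver converge):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

logreg = LogisticRegression(max_iter=1000)  # extra iterations to ensure convergence
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)
print('Logistic Regression accuracy:', accuracy_score(y_test, y_pred))
```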
Decision Tree:
A decision tree is derived from the independent variables, with each node having a condition over a feature. The conditions decide which node to navigate to next, and once a leaf node is reached, the output is predicted. The right sequence of conditions makes the tree efficient. Entropy/information gain is used as the criterion to select the conditions in the nodes, and a recursive, greedy algorithm is used to derive the tree structure.
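A sketch with scikit-learn’s DecisionTreeClassifier; criterion='entropy' matches the entropy/information-gain criterion described above, and random_state is again an arbitrary reproducibility choice:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

tree = DecisionTreeClassifier(criterion='entropy', random_state=42)
tree.fit(X_train, y_train)

y_pred = tree.predict(X_test)
print('Decision Tree accuracy:', accuracy_score(y_test, y_pred))
```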
K-Nearest Neighbors (KNN):
The basic logic behind KNN is to explore a test data point’s neighborhood, assume the point is similar to its neighbors, and derive the output from them. In KNN, we look at the k nearest neighbors and come up with the prediction.
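A sketch with scikit-learn’s KNeighborsClassifier; k = 5 is the library default and an assumption here, not a tuned value:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knn = KNeighborsClassifier(n_neighbors=5)  # k = 5 is an arbitrary default
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)
print('KNN accuracy:', accuracy_score(y_test, y_pred))
```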
In conclusion, from all the methods above, we can say that both K-Nearest Neighbors and Logistic Regression are more suitable methods for making a machine learning model with the data we have.