Understanding Supervised Learning: A Step-by-Step Guide

What is supervised machine learning?

This is one of the main models in the machine learning module and is one of the most basic and widely used training models. In this model, you train the system by providing the correct features and the correct answers (labels). After training the model, you can predict the values you need by inputting new feature values.

I will explain this clearly using a commonly used example: house price prediction based on house features.
If we take House 1 with an area of 2000 square feet, priced at 23K dollars, and House 2 with an area of 2500 square feet, priced at 27K dollars, the right answers we give to the training model are the house prices.

The value on which the price depends is called the feature, which in this case is the area of the house. First, you provide the feature (house area) and then give the correct price of the house for that feature to train the model.

After training, the model can predict the prices of unknown houses by inputting their area.

Supervised machine learning types

Supervised machine learning can be divided into two main training models:

Regression model
Classification model

Regression Model

Models used for predicting values such as house prices and student marks come under regression models. This model predicts numerical values from an infinite set of possible outputs.

You may have studied regression analysis graphs, where we use a scatter plot to mark data points and get a visualization. Using this scatter plot, we draw a graph that best fits the data and passes through as many data points as possible, as shown in the graph below.

There are two types of regression models as well:

Linear regression model – In this model, we train the machine by providing the features (x) and the correct answers (y) to predict data using a straight-line graph. As you may have learned in school, we use the graph function y = mx + c. However, there is a slight difference in the constants used in machine learning. In linear regression, the function is written as:
ŷ = wx + b

Where:

x → feature
y → target
ŷ → predicted value
w → weight
b → bias

2. Polynomial regression model → This is an extension the linear regression model, where the graph is curve-shaped instead of a straight line. This allows us to fit the feature and target variables more accurately. The further the power of the equation increases, the more closely it can fit the data.

In the image below, you can see how the polynomial graph fits the training data.

Classification model

As the name suggests, this model is used to classify objects into specific groups, such as Mango and Orange under fruits, and Carrot and Potato under vegetables. This model has a small, finite number of possible outputs.

The model checks whether an object belongs to a specific category (yes → 1, no → 0).

The graph below shows how the data is categorized. There is a decision boundary that divides the data and shows which category each data point belongs to.

Important Concepts You should know for supervised machine learning

Data Table

This is similar to a database that contains data collected from research and used to train the model. It is also called the training set.

The Function to Predict

For linear regression, we use the straight-line function:
f(x) = wx + b

Where:

x → feature

y → target

ŷ or f(x) → predicted value

w → weight

b → bias

For polynomial regression, the function is a quadratic or higher-degree equation (An extension of the linear regression function).

For classification, we use logistic regression, which applies the sigmoid function:
σ(z) = 1 / (1 + e⁻ᶻ)

Where:

z = f(x) → regression function

σ(x) → sigmoid function

Cost Function

This function is used to find the error or difference between the actual target value and the predicted target value. It measures how close the prediction is to the real value.

We calculate the cost and try to gradually reduce it to the minimum possible value to obtain a good model. This value is referred to as either error or cost.

The cost function commonly used for regression models is called the squared cost function. Below is the squared cost function

The cost function we use for the classification model is known as the logistic loss function. Elow is the equation of the logistic loss function.

Gradient Descent

This is the algorithm used to minimize the cost by gradually updating the values of w and b until the minimum possible cost is reached.

We use α (alpha), known as the learning rate, which helps reduce the cost step by step from a random starting value of w and b.

This gradient descent algorithm is used to find the most suitable values of w and b that produce the minimum cost.

The below graph shows how this algorithm reduces the cost;

Vectorization

When coding, x, y, w, and b are array values. Calculating values using loops can be difficult and very slow, especially when the dataset is large. Therefore, we use vectorization, which allows us to perform calculations on all array values in parallel.

For example, to calculate all predicted values (ŷ), we could use w * x[i] + b inside a loop. However, with vectorization, we can directly use the dot product to reduce runtime:
np.dot(w, x) + b (numpy library used for this function), which can be written mathematically as w·x + b.

Feature Scaling

Feature scaling is one of the important steps you should do before training the model. When feature values are very large, such as 25000 or 7000, it becomes difficult to visualize them on a graph and harder for gradient descent to find the minimum cost efficiently.

Therefore, we scale the data to smaller values, usually around 0 to 10 or 0 to 1. This makes it easier to plot graphs and helps gradient descent converge faster to the minimum cost.

You can refer to the graph below to better understand feature scaling.

These are the two type of equations we use for the feature scaling as below diagram.

Feature engineering

We should properly choose and create new features from existing features to improve our model’s accuracy. Transforming and combining features to create new ones is known as feature engineering.

For example, calculating the area using the length and breadth features creates a new feature. This new feature can make our model more accurate. We can also transform a linear regression model into a polynomial regression model to improve accuracy.

The problem you may face and the solution

Overfiting and uderfitting

As the names suggest:

Underfitting occurs when the model doesn’t fit the dataset well. The cost may be high, and there may be other models that fit the data better than the underfitted model.
Overfitting occurs when the model fits the training data too well, matching the exact values of the dataset. This may lead to errors when predicting new data because the model has learned the noise as well. Overfitting often results in a very curvy graph, reducing prediction accuracy for unseen data.

This is similar to the story of Goldilocks and the Three Bears: one porridge was too hot, one was too cold, and one was “just right.” The “just right” model is neither underfitted nor overfitted.

The graphs below will help you clearly understand overfitting and underfitting.

Regularization

This method is used to overcome overfitting. In an overfitted model, there are many local minima of w in the cost function. When using gradient descent, the algorithm might not find the best value of w to achieve the lowest cost.

Regularization modifies the model to make the graph less bumpy and smoother, which helps it predict more accurately.

Conclusion

Supervised machine learning is one of the most widely used approaches in artificial intelligence, allowing models to learn from labeled data and make accurate predictions. By understanding key concepts such as regression and classification models, cost functions, gradient descent, feature scaling, vectorization, and regularization, you can train models that are both efficient and accurate.

Proper feature engineering and awareness of overfitting and underfitting are essential to ensure your model generalizes well to new data. With these techniques, supervised learning can be applied to a wide range of real-world problems, from predicting house prices to classifying objects.

References / Further Reading

Scikit-Learn Documentation – https://scikit-learn.org
Machine Learning Course by DeepLearning.AI (Andrem Ng) - https://www.coursera.org/learn/machine-learning
Wikipedia – Supervised Learning – https://en.wikipedia.org/wiki/Supervised_learning

Understanding Supervised Learning: A Step-by-Step Guide

What is supervised machine learning?