Gradient Descent algorithm and its variants

Gradient Descent is a popular optimization algorithm used in machine learning to find the parameters of a model that minimize a cost function. In this article, we explain how Gradient Descent works, describe its main variants, and provide examples to help you understand this important optimization algorithm.

Gradient Descent

The goal of Gradient Descent is to find the minimum of a cost function by iteratively adjusting the parameters of the model. The cost function measures how far the model's predictions are from the actual values of the target variable; for linear regression, a common choice is the mean squared error, J(θ) = (1/2m) Σᵢ (h_θ(xᵢ) - yᵢ)², where m is the number of training examples and h_θ(xᵢ) is the model's prediction for example i. The Gradient Descent algorithm works by taking small steps in the direction of the steepest descent of the cost function.

The algorithm starts with an initial guess of the parameters, and then iteratively updates them until the cost function is minimized. The update rule for the parameters is given by:

θ = θ - α∇J(θ)

Here, θ represents the parameters of the model, α is the learning rate (step size), and ∇J(θ) is the gradient of the cost function with respect to the parameters.

The gradient points in the direction of the steepest ascent, so we subtract it from the current parameters to move in the direction of the steepest descent. The learning rate determines how large a step we take along the negative gradient.
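
To make the update rule concrete, here is a minimal toy sketch that applies it to the one-dimensional cost function J(θ) = θ², whose gradient is 2θ. The function, starting point, and learning rate are chosen purely for illustration and are not part of the examples later in the article:

# Toy example: minimize J(theta) = theta**2, whose gradient is 2 * theta
theta = 5.0   # initial guess
alpha = 0.1   # learning rate

for step in range(50):
    grad = 2 * theta              # gradient of J at the current theta
    theta = theta - alpha * grad  # step along the negative gradient

print(theta)  # approaches 0, the minimizer of J
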

Batch Gradient Descent

Batch Gradient Descent is the simplest form of Gradient Descent. In this variant, the gradient is computed over the entire training dataset, and then the parameters are updated once per iteration.

Let's take a look at an example of Batch Gradient Descent using a simple linear regression model. We will use the Boston Housing dataset, which contains information about housing in the Boston area; note that load_boston was removed from scikit-learn in version 1.2, so running this example as written requires an older version. Our goal is to predict the median value of owner-occupied homes from the average number of rooms per dwelling.


from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; requires an older version
import numpy as np

boston = load_boston()
X = boston.data[:, 5].reshape(-1, 1)  # column 5 is RM, the average number of rooms per dwelling
y = boston.target.reshape(-1, 1)      # MEDV, the median home value

# Normalize the features
X = (X - np.mean(X)) / np.std(X)

# Add a column of ones to X for the bias term
X = np.hstack((np.ones(X.shape), X))

# Initialize the parameters
theta = np.zeros((2, 1))

# Define the learning rate
alpha = 0.01

# Define the number of iterations
num_iters = 1000

# Batch Gradient Descent
for i in range(num_iters):
    # Compute the gradient
    grad = (1 / len(X)) * X.T.dot(X.dot(theta) - y)
   
    # Update the parameters
    theta = theta - alpha * grad
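
After the loop finishes, it is worth checking what was learned. The short follow-up sketch below, reusing the variables defined above, prints the fitted parameters and the mean squared error of the model's predictions:

# Inspect the learned parameters and the final mean squared error
predictions = X.dot(theta)
mse = np.mean((predictions - y) ** 2)
print("theta:", theta.ravel())
print("mean squared error:", mse)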



Stochastic Gradient Descent

Stochastic Gradient Descent is a variant of Gradient Descent that updates the parameters after each individual training example rather than after a full pass over the dataset. Because it performs many more updates per pass over the data, it can make progress faster than Batch Gradient Descent on large datasets, although the individual updates are noisier.

The update rule for Stochastic Gradient Descent is given by:

θ = θ - α∇J_i(θ)

Here, J_i(θ) is the cost computed on the i-th training example, so each update uses the gradient of a single example.


# Stochastic Gradient Descent
theta = np.zeros((2, 1))  # re-initialize the parameters

for i in range(num_iters):
    for j in range(len(X)):
        # Randomly select a single training example
        random_index = np.random.randint(0, len(X))
        xi = X[random_index].reshape(1, -1)
        yi = y[random_index].reshape(1, -1)

        # Compute the gradient on this single example
        grad = xi.T.dot(xi.dot(theta) - yi)

        # Update the parameters
        theta = theta - alpha * grad


Mini-batch Gradient Descent

Mini-batch Gradient Descent is a variant of Gradient Descent that updates the parameters after a small batch of examples in the training dataset. This variant is a compromise between Batch Gradient Descent and Stochastic Gradient Descent and is the most commonly used variant in practice.

The update rule for Mini-batch Gradient Descent is given by:

θ = θ - α∇J_i:i+b(θ)

Here, J_i:i+b(θ) is the cost computed on the mini-batch of b examples starting at index i.


# Mini-batch Gradient Descent
batch_size = 32
theta = np.zeros((2, 1))  # re-initialize the parameters
for i in range(num_iters):
    # Randomly shuffle the training data
    shuffled_indices = np.random.permutation(len(X))
    X_shuffled = X[shuffled_indices]
    y_shuffled = y[shuffled_indices]
   
    for j in range(0, len(X), batch_size):
        # Select a mini-batch of examples
        X_batch = X_shuffled[j:j+batch_size]
        y_batch = y_shuffled[j:j+batch_size]
       
        # Compute the gradient
        grad = (1 / len(X_batch)) * X_batch.T.dot(X_batch.dot(theta) - y_batch)
       
        # Update the parameters
        theta = theta - alpha * grad
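
Once training has finished, the parameters can be used to make predictions for new inputs. The sketch below is an illustrative example (the 6-room house is an arbitrary value, not taken from the dataset) that reuses the normalization statistics of the original feature before applying the learned parameters:

# Predict the median home value for a house with 6 rooms (illustrative example)
rooms = 6.0
rooms_norm = (rooms - np.mean(boston.data[:, 5])) / np.std(boston.data[:, 5])
x_new = np.array([[1.0, rooms_norm]])   # bias term plus the normalized feature
prediction = x_new.dot(theta)
print("predicted median value:", prediction[0, 0])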

Conclusion

In this article, we explained the Gradient Descent algorithm and its variants: Batch Gradient Descent, Stochastic Gradient Descent, and Mini-batch Gradient Descent, along with examples of each. The choice of variant depends on the size of the dataset, the available computational resources, and the desired speed of convergence. Understanding Gradient Descent and its variants is important for optimizing machine learning models and achieving good performance on real-world problems.

