How to Implement a Simple Gaussian Process in Python Using PyTorch?

Introduction

Gaussian Processes are powerful probabilistic models that represent distributions over functions.
They excel in regression and classification when uncertainty quantification and small data performance are important.

Their main drawbacks are computational cost and reliance on good kernel design.

What is a Gaussian Process (GP)?

A Gaussian Process (GP) is a probabilistic model that defines a distribution over functions.
Instead of assuming a specific functional form (like a polynomial or neural network), a GP assumes:

Any finite set of function values follows a joint multivariate Gaussian distribution.

A GP is completely defined by:

  • a mean function m(x)
  • a covariance function, or kernel, k(x, x')

\begin{equation}
f(x) \sim \mathrm{GP}(m(x), k(x,x'))
\end{equation}

The kernel encodes smoothness, similarity, periodicity, noise, and other structural assumptions about the function.
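To make this concrete, here is a small illustrative sketch (plain NumPy, for brevity) that draws random functions from a zero-mean GP prior with an RBF kernel; the kernel alone determines how smooth the sampled functions look:

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    # k(x, x') = variance * exp(-0.5 * (x - x')^2 / lengthscale^2)
    sqdist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale**2)

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)

# A zero mean function plus the RBF covariance fully specify the prior:
# any finite set of function values is jointly Gaussian.
K = rbf_kernel(x, x, lengthscale=1.0) + 1e-8 * np.eye(len(x))  # jitter for stability
samples = rng.multivariate_normal(mean=np.zeros(len(x)), cov=K, size=3)

print(samples.shape)  # (3, 50): three random functions evaluated at x
```

Shrinking the lengthscale makes the sampled functions wiggle faster; increasing the variance makes them swing over a wider range.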

Gaussian Process Regression (GPR)

GPR models a relationship:

\begin{equation}
y = f(x) + \epsilon, \quad \epsilon \sim \mathrm{N}(0, \sigma^2)
\end{equation}

Given training data (X, y), the GP infers a posterior distribution over functions.
This gives:

  • a predictive mean (best estimate)
  • a predictive variance (uncertainty)

So GPR is not just a prediction method; it is a framework with quantified uncertainty built in.

Why is GPR good for regression?

  • Works well with small to medium datasets
  • Provides uncertainty (confidence intervals)
  • Automatically balances bias and complexity through the kernel

Gaussian Process Classification (GPC)

For classification, outputs are not Gaussian, so GPs use a link function (like logistic or probit):

\begin{equation}
p(y=1 \mid f) = \sigma(f)
\end{equation}

The GP still defines a prior on f(x), but the likelihood is non-Gaussian.
Inference requires approximation methods:

  • Laplace approximation
  • Expectation Propagation
  • Variational inference

GPC yields probabilistic predictions, giving classification confidence and calibrated probabilities.
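To illustrate the link-function idea on its own (this is not an inference routine), the sketch below draws one latent function from a GP prior and squashes it through the logistic sigmoid to obtain class probabilities:

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    sqdist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale**2)

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100)
K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))

# Latent GP sample f(x); the link maps it to p(y=1 | f) = sigmoid(f)
f = rng.multivariate_normal(np.zeros(len(x)), K)
p = sigmoid(f)

print(p.min() > 0 and p.max() < 1)  # True: valid probabilities everywhere
```

Posterior inference over f given 0/1 labels is what requires the approximations listed above, because the Bernoulli likelihood breaks the Gaussian conjugacy.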

Advantages of Gaussian Processes

1. Built-in uncertainty quantification

GPs return an entire probability distribution (mean + variance), not just a point estimate.

2. Excellent performance on small or noisy datasets

Ideal when data is expensive or limited.

3. Flexible through kernels

Choose kernels for:

  • smooth functions
  • periodic patterns
  • spatial correlations
  • sums/products for complex structures

4. Few hyperparameters

Training involves only the kernel and noise parameters, typically far fewer than the weights of a neural network.

5. Non-parametric

Model complexity grows with the data; there is no need to specify a fixed functional form.

Limitations of Gaussian Processes

1. Poor scaling to large datasets

Training costs:

\begin{equation}
O(n^3) \quad \text{time}, \qquad O(n^2) \quad \text{memory}
\end{equation}

Unsuitable for large-scale data unless approximations (sparse GPs, inducing points) are used.
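A quick NumPy illustration of the quadratic memory cost: doubling the number of training points quadruples the kernel-matrix storage (the cubic time cost of the Cholesky factorization grows even faster):

```python
import numpy as np

def kernel_matrix(n):
    # Dense n x n RBF kernel matrix over n evenly spaced inputs
    x = np.linspace(-3, 3, n)
    sqdist = (x[:, None] - x[None, :]) ** 2
    return np.exp(-0.5 * sqdist)

K_small = kernel_matrix(500)
K_large = kernel_matrix(1000)

# O(n^2) memory: doubling n quadruples the bytes stored
print(K_large.nbytes / K_small.nbytes)  # 4.0
```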

2. Kernel engineering can be difficult

Performance depends heavily on choosing and tuning the kernel.

3. GPC requires approximate inference

Classification is more computationally intensive due to non-Gaussian likelihoods.

4. Limited prediction speed

Prediction scales with number of training points; large datasets slow down inference.

PyTorch Implementation of Gaussian Process Regression (GPR)

Homoscedastic Noise - Example 1

import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt

# Use double precision for numerical stability with linear algebra
dtype = torch.double
torch.manual_seed(0)
np.random.seed(0)

# 1) Generate synthetic data
def generate_data(n_train=25, noise_std=0.2):
    X = np.linspace(-3, 3, n_train)[:, None]
    y = np.sin(X).ravel() + noise_std * np.random.randn(n_train)
    return X, y

X_train_np, y_train_np = generate_data(n_train=25, noise_std=0.2)
X_test_np = np.linspace(-5, 5, 400)[:, None]

X_train = torch.tensor(X_train_np, dtype=dtype)
y_train = torch.tensor(y_train_np[:, None], dtype=dtype)  # column vector
X_test = torch.tensor(X_test_np, dtype=dtype)

# 2) RBF kernel implemented in PyTorch (parametrized in log-space)
class RBFKernel(nn.Module):
    def __init__(self, init_lengthscale=1.0, init_variance=1.0):
        super().__init__()
        self.log_lengthscale = nn.Parameter(torch.tensor(np.log(init_lengthscale), dtype=dtype))
        self.log_variance = nn.Parameter(torch.tensor(np.log(init_variance), dtype=dtype))

    def forward(self, x1, x2):
        # Squared distance matrix
        x1_sq = (x1**2).sum(dim=1, keepdim=True)
        x2_sq = (x2**2).sum(dim=1, keepdim=True)
        sqdist = x1_sq + x2_sq.T - 2.0 * (x1 @ x2.T)
        lengthscale = torch.exp(self.log_lengthscale)
        variance = torch.exp(self.log_variance)
        K = variance * torch.exp(-0.5 * sqdist / (lengthscale**2))
        return K, lengthscale, variance

# 3) Gaussian Process Regression class (NLML and prediction)
class GaussianProcessRegression(nn.Module):
    def __init__(self, kernel: RBFKernel, noise_init=0.2):
        super().__init__()
        self.kernel = kernel
        self.log_noise = nn.Parameter(torch.tensor(np.log(noise_init), dtype=dtype))

    def negative_log_marginal_likelihood(self, X, y, jitter=1e-6):
        K, _, _ = self.kernel(X, X)
        noise = torch.exp(self.log_noise)
        n = X.shape[0]
        Kt = K + (noise**2 + jitter) * torch.eye(n, dtype=dtype)

        # Cholesky decomposition
        L = torch.linalg.cholesky(Kt)
        # Solve for alpha = K^{-1} y using triangular solves
        v = torch.linalg.solve_triangular(L, y, upper=False)
        alpha = torch.linalg.solve_triangular(L.T, v, upper=True)

        data_fit = 0.5 * (y.T @ alpha)
        log_det = torch.sum(torch.log(torch.diag(L)))
        constant = 0.5 * n * torch.log(torch.tensor(2.0 * np.pi, dtype=dtype))
        nlml = data_fit + log_det + constant
        return nlml.squeeze()

    def predict(self, X_train, y_train, X_test, jitter=1e-6):
        K_tt, _, _ = self.kernel(X_train, X_train)
        noise = torch.exp(self.log_noise)
        n = X_train.shape[0]
        Kt = K_tt + (noise**2 + jitter) * torch.eye(n, dtype=dtype)
        L = torch.linalg.cholesky(Kt)

        K_s, _, _ = self.kernel(X_train, X_test)
        K_ss, _, _ = self.kernel(X_test, X_test)

        # predictive mean
        v = torch.linalg.solve_triangular(L, y_train, upper=False)
        alpha = torch.linalg.solve_triangular(L.T, v, upper=True)
        pred_mean = K_s.T @ alpha

        # predictive covariance via Schur complement
        w = torch.linalg.solve_triangular(L, K_s, upper=False)
        z = torch.linalg.solve_triangular(L.T, w, upper=True)
        pred_cov = K_ss - K_s.T @ z
        pred_var = torch.clamp(torch.diag(pred_cov), min=1e-12).unsqueeze(1)
        return pred_mean.detach(), pred_var.detach()

# 4) Instantiate and train
kernel = RBFKernel(init_lengthscale=1.0, init_variance=1.0)
gpr = GaussianProcessRegression(kernel=kernel, noise_init=0.2).double()

optimizer = torch.optim.Adam(gpr.parameters(), lr=0.05)
n_iters = 300  # reduce iterations for quicker runs; increase for better convergence
loss_history = []

for i in range(n_iters):
    optimizer.zero_grad()
    nlml = gpr.negative_log_marginal_likelihood(X_train, y_train)
    nlml.backward()
    optimizer.step()
    loss_history.append(nlml.item())
    if (i+1) % 50 == 0 or i==0:
        with torch.no_grad():
            ls = torch.exp(gpr.kernel.log_lengthscale).item()
            var = torch.exp(gpr.kernel.log_variance).item()
            noise = torch.exp(gpr.log_noise).item()
        print(f"Iter {i+1}/{n_iters} - NLML: {nlml.item():.4f}  lengthscale: {ls:.4f}  variance: {var:.4f}  noise: {noise:.4f}")

# 5) Predict and plot
pred_mean, pred_var = gpr.predict(X_train, y_train, X_test)
pred_mean = pred_mean.numpy().ravel()
pred_std = np.sqrt(pred_var.numpy().ravel())

plt.figure(figsize=(8,5))
plt.scatter(X_train_np.ravel(), y_train_np.ravel(), marker='o', label='Train data')
plt.plot(X_test_np.ravel(), pred_mean, label='Predictive mean')
upper = pred_mean + 1.96 * pred_std
lower = pred_mean - 1.96 * pred_std
plt.fill_between(X_test_np.ravel(), lower, upper, alpha=0.3)
plt.title("Gaussian Process Regression (PyTorch)")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()

with torch.no_grad():
    final_ls = torch.exp(gpr.kernel.log_lengthscale).item()
    final_var = torch.exp(gpr.kernel.log_variance).item()
    final_noise = torch.exp(gpr.log_noise).item()
print(f"Final hyperparameters:\n lengthscale = {final_ls:.4f}\n variance = {final_var:.4f}\n noise = {final_noise:.4f}")


Explanation of the Gaussian Process Regression Code

This code implements Gaussian Process Regression (GPR) from scratch in PyTorch, including kernel definition, marginal likelihood optimization, prediction, and visualization.

1. Imports and Setup

import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt

We import PyTorch for automatic differentiation and linear algebra, NumPy to generate synthetic data, and Matplotlib for plotting.

The model uses:

dtype = torch.double
torch.manual_seed(0)
np.random.seed(0)

Double precision is advised for Gaussian Processes due to Cholesky decomposition and matrix inversions. Seeds ensure reproducible results.

2. Generate Synthetic Training and Testing Data

def generate_data(n_train=25, noise_std=0.2):
    X = np.linspace(-3, 3, n_train)[:, None]
    y = np.sin(X).ravel() + noise_std * np.random.randn(n_train)
    return X, y

  • Inputs X are evenly spaced points from -3 to 3.
  • Targets y follow a noisy sine wave.
  • Test inputs span a wider interval (-5 to 5) for visualization.

The arrays are converted to PyTorch tensors in double precision.

3. RBF Kernel Class

class RBFKernel(nn.Module):
    ...

The Radial Basis Function (RBF) kernel is a common covariance function:

\begin{equation}
k(x,x') = \sigma^2 \exp\left( -\frac{|x-x'|^2}{2\ell^2} \right)
\end{equation}

Parameters are learned in log-space for numerical stability:

  • log_lengthscale
  • log_variance

Inside forward, the pairwise squared distances between all points in x1 and x2 are computed, and the kernel matrix K is constructed.
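The vectorized identity behind this, ||a - b||² = a² + b² - 2ab, is worth checking against a naive double loop; a small NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.standard_normal((5, 1))
x2 = rng.standard_normal((7, 1))

# Vectorized form, mirroring RBFKernel.forward
sqdist = (x1**2).sum(1, keepdims=True) + (x2**2).sum(1) - 2.0 * x1 @ x2.T

# Naive double loop for comparison
expected = np.array([[((a - b) ** 2).sum() for b in x2] for a in x1])

print(np.allclose(sqdist, expected))  # True
```

The broadcasting of a (5, 1) column against a (7,) row is what produces the full (5, 7) distance matrix in one shot.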

4. Gaussian Process Regression Class

class GaussianProcessRegression(nn.Module):
    ...

This class performs:

(a) Negative Log Marginal Likelihood (NLML)

The GP marginal likelihood:

\begin{equation}
\log p(y|X,\theta) = -\frac12 y^\top K^{-1} y - \frac12 \log |K| - \frac{n}{2}\log(2\pi)
\end{equation}

This is implemented using:

  • Kernel matrix computation
  • Adding noise variance
  • Cholesky decomposition (stable alternative to matrix inversion)
  • Solving triangular systems to compute:

\begin{equation}
\alpha = K^{-1}y
\end{equation}

The NLML is returned and used as a loss function for training.
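As a sanity check (illustrative NumPy, with a small hand-built kernel), the Cholesky-based NLML should agree with the direct formula evaluated via slogdet and an explicit inverse:

```python
import numpy as np

n = 10
x = np.linspace(-3, 3, n)
y = np.sin(x)

# Noisy RBF kernel matrix: K + sigma_n^2 I with sigma_n = 0.2
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2) + 0.04 * np.eye(n)

# Cholesky route (what the code does); note sum(log diag L) = 0.5 log|K|
L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
nlml_chol = 0.5 * y @ alpha + np.log(np.diag(L)).sum() + 0.5 * n * np.log(2 * np.pi)

# Direct route: 0.5 y^T K^{-1} y + 0.5 log|K| + (n/2) log(2*pi)
sign, logdet = np.linalg.slogdet(K)
nlml_direct = 0.5 * y @ np.linalg.inv(K) @ y + 0.5 * logdet + 0.5 * n * np.log(2 * np.pi)

print(np.isclose(nlml_chol, nlml_direct))  # True
```

The Cholesky route is preferred in practice because it avoids forming K⁻¹ explicitly and halves the log-determinant computation.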

(b) Prediction Function

Given new test points X_*:

\begin{equation}
\mu_* = K_{*}^{T} K^{-1} y
\end{equation}

\begin{equation}
\Sigma_{*} = K_{**} - K_{*}^{T} K^{-1} K_*
\end{equation}

The code computes:

  • Predictive mean (pred_mean)
  • Predictive variance (pred_var)

Both are detached to avoid tracking gradients.
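The triangular-solve route can likewise be verified against a direct version that explicitly inverts K (fine at this toy size, though less stable in general); a NumPy sketch:

```python
import numpy as np

x = np.linspace(-3, 3, 15)
xs = np.linspace(-4, 4, 8)
y = np.sin(x)

def rbf(a, b):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

K = rbf(x, x) + 0.04 * np.eye(len(x))   # kernel + noise variance
Ks = rbf(x, xs)
Kss = rbf(xs, xs)

# Triangular-solve route, mirroring predict()
L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
mean_chol = Ks.T @ alpha
z = np.linalg.solve(L.T, np.linalg.solve(L, Ks))
cov_chol = Kss - Ks.T @ z

# Direct route with an explicit inverse
Kinv = np.linalg.inv(K)
mean_direct = Ks.T @ Kinv @ y
cov_direct = Kss - Ks.T @ Kinv @ Ks

print(np.allclose(mean_chol, mean_direct) and np.allclose(cov_chol, cov_direct))  # True
```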

5. Model Instantiation and Training

kernel = RBFKernel(...)
gpr = GaussianProcessRegression(...)
optimizer = torch.optim.Adam(gpr.parameters(), lr=0.05)

The model learns:

  • lengthscale
  • variance
  • noise variance

by minimizing the negative log marginal likelihood.

The training loop:

  • Computes NLML
  • Backpropagates gradients
  • Updates parameters with Adam
  • Prints progress every 50 iterations

6. Prediction and Visualization

After training:

pred_mean, pred_var = gpr.predict(...)

The predictive standard deviation is:

pred_std = np.sqrt(pred_var.numpy().ravel())

The plot shows:

  • Training data points
  • GP posterior mean (smooth curve)
  • 95% confidence interval (shaded region)

How to change the observation noise

The observation noise is controlled entirely by this line inside GaussianProcessRegression:

self.log_noise = nn.Parameter(torch.tensor(np.log(noise_init), dtype=dtype))

and used here during training:

noise = torch.exp(self.log_noise)
Kt = K + (noise**2 + jitter) * torch.eye(n, dtype=dtype)

This means the model assumes one global noise value, learned during training.

OPTION 1 — Set a different initial noise value

Change:

gpr = GaussianProcessRegression(kernel=kernel, noise_init=0.2)

For example:

gpr = GaussianProcessRegression(kernel=kernel, noise_init=0.8)

But remember: The GP will still learn a single global noise level, regardless of the initial value.

OPTION 2 — Freeze the noise and set it manually

If you want a fixed noise level (not learned), do:

gpr.log_noise.requires_grad_(False)
gpr.log_noise.data = torch.log(torch.tensor(0.5, dtype=dtype))

This forces the GP to use:

\begin{equation}
\sigma_n = 0.5
\end{equation}

for all observations.

Homoscedastic Noise with Known Measurement Errors - Example 2

import torch
import numpy as np
import matplotlib.pyplot as plt
torch.set_default_dtype(torch.double)

# 1) Data
X = torch.linspace(-3, 3, 25).unsqueeze(1)
y = torch.sin(X) + 0.2 * torch.randn_like(X)

# measurement uncertainty (example)
y_err = 0.5 * torch.ones_like(y)

X_test = torch.linspace(-5, 5, 400).unsqueeze(1)

# 2) Parameters (log-space)
log_length = torch.tensor(0.0, requires_grad=True)
log_var    = torch.tensor(0.0, requires_grad=True)
log_noise  = torch.tensor(-1.0, requires_grad=True)

# 3) RBF kernel
def rbf(x1, x2, log_l, log_v):
    l = torch.exp(log_l)
    v = torch.exp(log_v)
    sqdist = (x1**2).sum(1, keepdim=True) + (x2**2).sum(1) - 2*x1@x2.T
    return v * torch.exp(-0.5 * sqdist / l**2)

# 4) Negative log marginal likelihood
def nlml():
    K = rbf(X, X, log_length, log_var)
    noise = torch.exp(log_noise)
    # Include the known per-point measurement variances (y_err) so that
    # training uses the same noise model as prediction below
    K = K + torch.diag(noise**2 + y_err.squeeze()**2)
    L = torch.linalg.cholesky(K)
    alpha = torch.cholesky_solve(y, L)
    return (
        0.5 * (y.T @ alpha) +
        torch.sum(torch.log(torch.diag(L))) +
        0.5 * len(X) * torch.log(torch.tensor(2 * np.pi))
    )

# 5) Train hyperparameters
opt = torch.optim.Adam([log_length, log_var, log_noise], lr=0.05)
for i in range(300):
    opt.zero_grad()
    loss = nlml()
    loss.backward()
    opt.step()
    if i % 50 == 0: print(i, loss.item())

# 6) Predictive distribution
def predict(X_new):
    K = rbf(X, X, log_length, log_var)
    noise = torch.exp(log_noise)
    K = K + torch.diag(noise**2 + y_err.squeeze()**2)

    L = torch.linalg.cholesky(K)
    Ks = rbf(X, X_new, log_length, log_var)
    Kss = rbf(X_new, X_new, log_length, log_var)

    alpha = torch.cholesky_solve(y, L)
    mean = Ks.T @ alpha

    v = torch.linalg.solve_triangular(L, Ks, upper=False)
    cov = Kss - v.T @ v
    std = torch.diag(cov).clamp(min=1e-9).sqrt()

    return mean.squeeze().detach(), std.detach()

mu, std = predict(X_test)

# 7) Plot
plt.figure(figsize=(8,5))

# Training data WITH error bars
plt.errorbar(
    X.squeeze().numpy(),
    y.squeeze().numpy(),
    yerr=y_err.squeeze().numpy(),
    fmt='o',
    color='black',
    ecolor='gray',
    elinewidth=1.5,
    capsize=3,
    label="Training data ± error"
)

# Predictive mean
plt.plot(X_test.numpy(), mu.numpy(), label="Predictive mean")

# Predictive shaded region
plt.fill_between(
    X_test.squeeze().numpy(),
    (mu - 2*std).numpy(),
    (mu + 2*std).numpy(),
    alpha=0.3,
    label="GP ±2σ"
)

plt.legend()
plt.title("Gaussian Process Regression with Training Data Error Bars")
plt.show()


Heteroscedastic Noise - Example 3

import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt

# Use double precision for numerical stability with linear algebra
dtype = torch.double
torch.manual_seed(0)
np.random.seed(0)

# -------------------------------------------------------
# 1) Generate synthetic data + example heteroscedastic errors
# -------------------------------------------------------
def generate_data(n_train=25, noise_std=0.2):
    X = np.linspace(-3, 3, n_train)[:, None]
    y = np.sin(X).ravel() + noise_std * np.random.randn(n_train)
    return X, y

X_train_np, y_train_np = generate_data(n_train=25, noise_std=0.2)
X_test_np = np.linspace(-5, 5, 400)[:, None]

# Example: heteroscedastic obs uncertainty
# Make errors larger near edges just for demonstration
obs_error_std_np = 0.1 + 0.3 * np.abs(X_train_np.ravel()) / 3.0

X_train = torch.tensor(X_train_np, dtype=dtype)
y_train = torch.tensor(y_train_np[:, None], dtype=dtype)
obs_error_std = torch.tensor(obs_error_std_np[:, None], dtype=dtype)

X_test = torch.tensor(X_test_np, dtype=dtype)

# -------------------------------------------------------
# 2) RBF kernel
# -------------------------------------------------------
class RBFKernel(nn.Module):
    def __init__(self, init_lengthscale=1.0, init_variance=1.0):
        super().__init__()
        self.log_lengthscale = nn.Parameter(torch.tensor(np.log(init_lengthscale), dtype=dtype))
        self.log_variance = nn.Parameter(torch.tensor(np.log(init_variance), dtype=dtype))

    def forward(self, x1, x2):
        x1_sq = (x1**2).sum(dim=1, keepdim=True)
        x2_sq = (x2**2).sum(dim=1, keepdim=True)
        sqdist = x1_sq + x2_sq.T - 2.0 * (x1 @ x2.T)
        lengthscale = torch.exp(self.log_lengthscale)
        variance = torch.exp(self.log_variance)
        K = variance * torch.exp(-0.5 * sqdist / (lengthscale**2))
        return K, lengthscale, variance

# -------------------------------------------------------
# 3) Gaussian Process Regression with observation errors
# -------------------------------------------------------
class GaussianProcessRegression(nn.Module):
    def __init__(self, kernel: RBFKernel):
        super().__init__()
        self.kernel = kernel

    def negative_log_marginal_likelihood(self, X, y, obs_error_std, jitter=1e-6):
        K, _, _ = self.kernel(X, X)
        n = X.shape[0]

        # --- NEW: Use per-observation variances ---
        noise_diag = obs_error_std.squeeze()**2
        Kt = K + torch.diag(noise_diag + jitter)

        # Cholesky decomposition
        L = torch.linalg.cholesky(Kt)

        v = torch.linalg.solve_triangular(L, y, upper=False)
        alpha = torch.linalg.solve_triangular(L.T, v, upper=True)

        data_fit = 0.5 * (y.T @ alpha)
        log_det = torch.sum(torch.log(torch.diag(L)))
        constant = 0.5 * n * torch.log(torch.tensor(2.0 * np.pi, dtype=dtype))
        nlml = data_fit + log_det + constant
        return nlml.squeeze()

    def predict(self, X_train, y_train, X_test, obs_error_std, jitter=1e-6):
        K_tt, _, _ = self.kernel(X_train, X_train)

        noise_diag = obs_error_std.squeeze()**2
        Kt = K_tt + torch.diag(noise_diag + jitter)

        L = torch.linalg.cholesky(Kt)

        K_s, _, _ = self.kernel(X_train, X_test)
        K_ss, _, _ = self.kernel(X_test, X_test)

        v = torch.linalg.solve_triangular(L, y_train, upper=False)
        alpha = torch.linalg.solve_triangular(L.T, v, upper=True)
        pred_mean = K_s.T @ alpha

        w = torch.linalg.solve_triangular(L, K_s, upper=False)
        z = torch.linalg.solve_triangular(L.T, w, upper=True)
        pred_cov = K_ss - K_s.T @ z

        pred_var = torch.clamp(torch.diag(pred_cov), min=1e-12).unsqueeze(1)
        return pred_mean.detach(), pred_var.detach()

# -------------------------------------------------------
# 4) Instantiate and train
# -------------------------------------------------------
kernel = RBFKernel(init_lengthscale=1.0, init_variance=1.0)
gpr = GaussianProcessRegression(kernel=kernel).double()

optimizer = torch.optim.Adam(gpr.parameters(), lr=0.05)
n_iters = 300
loss_history = []

for i in range(n_iters):
    optimizer.zero_grad()
    nlml = gpr.negative_log_marginal_likelihood(X_train, y_train, obs_error_std)
    nlml.backward()
    optimizer.step()
    loss_history.append(nlml.item())
    if (i+1) % 50 == 0:
        with torch.no_grad():
            ls = torch.exp(gpr.kernel.log_lengthscale).item()
            var = torch.exp(gpr.kernel.log_variance).item()
        print(f"Iter {i+1}/{n_iters} | NLML={nlml:.4f} | ls={ls:.4f} | var={var:.4f}")

# -------------------------------------------------------
# 5) Predict and plot
# -------------------------------------------------------
pred_mean, pred_var = gpr.predict(X_train, y_train, X_test, obs_error_std)
pred_mean = pred_mean.numpy().ravel()
pred_std = np.sqrt(pred_var.numpy().ravel())

plt.figure(figsize=(10,5))
plt.scatter(X_train_np.ravel(), y_train_np.ravel(), c='k', label='Train', s=40)
plt.errorbar(
    X_train_np.ravel(), 
    y_train_np.ravel(),
    yerr=obs_error_std_np,
    fmt='none',
    ecolor='gray',
    alpha=0.5,
    label='Obs errors'
)
plt.plot(X_test_np.ravel(), pred_mean, label="Mean")
plt.fill_between(
    X_test_np.ravel(),
    pred_mean - 1.96*pred_std,
    pred_mean + 1.96*pred_std,
    alpha=0.3,
    label="95% CI"
)
plt.legend()
plt.title("Gaussian Process Regression with per-observation errors")
plt.show()


Gaussian Processes using GPyTorch

What is GPyTorch?

GPyTorch is a modern, scalable Gaussian Process (GP) library built on PyTorch.
It provides GPU acceleration, automatic differentiation, and efficient linear algebra methods that allow GPs to be trained on much larger datasets than classical GP implementations.

It was developed primarily by Gaussian-process researchers at Cornell University, and supports:

  • Exact Gaussian Processes (small/medium datasets)
  • Approximate/Variational GPs (large datasets)
  • Additive kernels, multitask GPs, spectral mixture kernels
  • Natural integration with PyTorch models

How to Install GPyTorch

Install using pip:

pip install gpytorch

or with conda:

conda install -c conda-forge gpytorch

You must have PyTorch installed first:

pip install torch

Why GPyTorch is interesting

Automatic differentiation

Kernel parameters, noise, and model parameters are automatically optimized using PyTorch autograd.

GPU acceleration

Cholesky decompositions, matrix solves, and kernel operations run on GPU.

Built-in kernels

RBF, Matérn, spectral mixture, rational quadratic (RQ), polynomial, and more.

Large-Scale GP Support

Via structured kernel interpolation (SKI) and variational/inducing-point approximations.

Clean, modular API

Models are subclasses of gpytorch.models.ExactGP.

Heteroscedastic noise supported

Via likelihood(noise) or custom noise models.

Full GPyTorch Implementation (with heteroscedastic noise)

101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
import torch
import gpytorch
import numpy as np
import matplotlib.pyplot as plt

dtype = torch.double
torch.manual_seed(0)
np.random.seed(0)

# -------------------------------------------------------
# 1) Generate data
# -------------------------------------------------------
def generate_data(n_train=25, noise_std=0.2):
    X = np.linspace(-3, 3, n_train)[:, None]
    y = np.sin(X).ravel() + noise_std * np.random.randn(n_train)
    return X, y

X_train_np, y_train_np = generate_data()
X_test_np = np.linspace(-5, 5, 400)[:, None]

# Heteroscedastic noise example
obs_error_std_np = 0.1 + 0.3 * np.abs(X_train_np.ravel()) / 3.0

X_train = torch.tensor(X_train_np, dtype=dtype)
y_train = torch.tensor(y_train_np, dtype=dtype)
obs_noise = torch.tensor(obs_error_std_np**2, dtype=dtype)

X_test = torch.tensor(X_test_np, dtype=dtype)


# -------------------------------------------------------
# 2) Define GP Model (Exact GP)
# -------------------------------------------------------
class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel()
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        cov_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, cov_x)


# -------------------------------------------------------
# 3) Setup likelihood with heteroscedastic noise
# -------------------------------------------------------
likelihood = gpytorch.likelihoods.FixedNoiseGaussianLikelihood(
    noise=obs_noise,
    learn_additional_noise=False  # prevents learning extra homoscedastic term
)

model = ExactGPModel(X_train, y_train, likelihood).double()

model.train()
likelihood.train()

optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)


# -------------------------------------------------------
# 4) Train GP hyperparameters
# -------------------------------------------------------
training_iter = 300
for i in range(training_iter):
    optimizer.zero_grad()
    output = model(X_train)
    loss = -mll(output, y_train)
    loss.backward()
    if (i + 1) % 50 == 0:
        ls = model.covar_module.base_kernel.lengthscale.item()
        var = model.covar_module.outputscale.item()
        print(f"Iter {i+1}/{training_iter} | Loss={loss.item():.4f} | ls={ls:.3f} | var={var:.3f}")
    optimizer.step()


# -------------------------------------------------------
# 5) Evaluation
# -------------------------------------------------------
model.eval()
likelihood.eval()

with torch.no_grad():
    # FixedNoiseGaussianLikelihood has no noise values registered for the
    # test inputs, so GPyTorch warns here; use model(X_test) directly for
    # the noise-free latent function, or pass noise=... for test points.
    preds = likelihood(model(X_test))
    mean = preds.mean.numpy()
    lower, upper = preds.confidence_region()

# -------------------------------------------------------
# 6) Plot
# -------------------------------------------------------
plt.figure(figsize=(10, 5))
plt.scatter(X_train_np, y_train_np, c='k', s=40, label="Train")
plt.errorbar(
    X_train_np, y_train_np,
    yerr=obs_error_std_np,
    fmt='none',
    ecolor='gray',
    alpha=0.6,
    label="Obs noise"
)

plt.plot(X_test_np, mean, label='Prediction mean')
plt.fill_between(
    X_test_np.ravel(),
    lower.numpy(),
    upper.numpy(),
    alpha=0.3,
    label="95% CI"
)
plt.title("GPyTorch — GP Regression with Heteroscedastic Noise")
plt.legend()
plt.show()


Another Example

We will now revisit the example introduced in the previous article, where we manually implemented a Gaussian Process model without relying on PyTorch. This will allow us to compare that approach with a clean implementation using GPyTorch.

1D Example

import torch
import gpytorch
import matplotlib.pyplot as plt

# -----------------------------------------
# 1) Synthetic training data 
# -----------------------------------------
X = torch.tensor([1., 3., 5., 6., 7., 8.]).unsqueeze(1)
Y = X.squeeze() * torch.sin(X.squeeze())

# NEW: different noise per observation
obs_noise_std = torch.tensor([0.2, 0.1, 0.5, 0.3, 0.15, 0.4])   # example
obs_noise_var = obs_noise_std**2

# -----------------------------------------
# 2) GP Model with FixedNoiseGaussianLikelihood
# -----------------------------------------
class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel()
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

# FIXED NOISE likelihood → uses OBS noise
likelihood = gpytorch.likelihoods.FixedNoiseGaussianLikelihood(
    noise=obs_noise_var
)

model = ExactGPModel(X, Y, likelihood)

# -----------------------------------------
# 3) Training
# -----------------------------------------
model.train()
likelihood.train()

optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

for i in range(120):
    optimizer.zero_grad()
    output = model(X)
    loss = -mll(output, Y)
    loss.backward()
    optimizer.step()

# -----------------------------------------
# 4) Prediction
# -----------------------------------------
model.eval()
likelihood.eval()

Xtest = torch.linspace(0, 10, 200).unsqueeze(1)

with torch.no_grad(), gpytorch.settings.fast_pred_var():
    pred = likelihood(model(Xtest))

pred_mean = pred.mean.numpy()
pred_std = pred.stddev.numpy()

# -----------------------------------------
# 5) Plot results
# -----------------------------------------
plt.figure(figsize=(9,5))

# show per-point noise in the visualization
plt.errorbar(
    X.squeeze().numpy(),
    Y.numpy(),
    yerr=obs_noise_std.numpy(),
    fmt='o',
    capsize=4,
    label="Training data (heteroscedastic noise)"
)

plt.plot(Xtest.numpy(), pred_mean, "r--", label="GP mean")

plt.fill_between(
    Xtest.squeeze().numpy(),
    pred_mean - 1.96*pred_std,
    pred_mean + 1.96*pred_std,
    alpha=0.3,
    label="95% interval"
)

plt.xlabel("x")
plt.ylabel("y")
plt.title("Gaussian Process Regression (with per-point noise)")
plt.legend()
plt.grid(True)
plt.show()


2D Example

GP Regression Model (2D RBF kernel)

\begin{equation}
k(x, x') = \sigma_f^2 \exp\left(-\frac{(x_1-x_1')^2}{2\ell_1^2}-\frac{(x_2-x_2')^2}{2\ell_2^2}\right)
\end{equation}
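The kernel above can be computed directly in a few lines. This is a minimal stand-alone sketch in plain NumPy (not GPyTorch's `RBFKernel`), with one lengthscale per input dimension (ARD) and a hypothetical output scale:

```python
import numpy as np

def ard_rbf_kernel(x1, x2, lengthscales, outputscale):
    # x1: (n, d), x2: (m, d); one lengthscale per input dimension (ARD)
    diff = (x1[:, None, :] - x2[None, :, :]) / lengthscales  # (n, m, d)
    sq_dist = (diff ** 2).sum(axis=-1)
    return outputscale * np.exp(-0.5 * sq_dist)

X = np.array([[0.0, 0.0], [1.0, 2.0]])
K = ard_rbf_kernel(X, X, lengthscales=np.array([1.0, 2.0]), outputscale=1.5)
# Diagonal entries equal sigma_f^2; off-diagonals decay with scaled distance
```

GPyTorch's `ScaleKernel(RBFKernel(ard_num_dims=2))`, used in the code below, implements the same function with `outputscale` and the two lengthscales learned from data.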

Updated Code

import torch
import gpytorch
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm

torch.manual_seed(0)

# -----------------------------
# Data (same as our original example)
# -----------------------------
X_np = np.array([
    [-8.0,-8.0], [-6.0,-3.0], [-7.0,2.0], [-4.0,4.0],
    [ 2.0, 3.0], [ 5.0, 7.0], [ 1.0,-1.0], [ 3.0,-4.0], [ 7.0,-7.0]
], dtype=np.float32)

# labels originally -1 / +1 -> convert to 0/1 for Bernoulli likelihood
Y_class_np = np.array([-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0], dtype=np.float32)
Y_class_01_np = (Y_class_np > 0).astype(np.float32)  # 0 or 1

# Reuse the same ±1 numeric labels as regression targets (float)
Y_reg_np = Y_class_np.copy()

# Torch tensors (float)
X = torch.from_numpy(X_np)
Y_reg = torch.from_numpy(Y_reg_np)
Y_class = torch.from_numpy(Y_class_01_np)  # 0/1

# -----------------------------
# Regression: Exact GP (RBF with ARD)
# -----------------------------
class ExactGPRegression(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        # RBF with ARD -> separate lengthscale per input dim
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel(ard_num_dims=train_x.shape[1])
        )

    def forward(self, x):
        mean = self.mean_module(x)
        cov = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean, cov)

# Setup likelihood & model
likelihood_reg = gpytorch.likelihoods.GaussianLikelihood()
model_reg = ExactGPRegression(X, Y_reg, likelihood_reg)

# Train regression model
model_reg.train()
likelihood_reg.train()
optimizer = torch.optim.Adam(model_reg.parameters(), lr=0.1)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood_reg, model_reg)

n_iter_reg = 200
for i in range(n_iter_reg):
    optimizer.zero_grad()
    output = model_reg(X)
    loss = -mll(output, Y_reg)
    loss.backward()
    optimizer.step()
    if (i+1) % 50 == 0:
        print(f"[Reg] Iter {i+1}/{n_iter_reg}  NLML: {loss.item():.4f}")

# -----------------------------
# Regression: predictions on grid
# -----------------------------
model_reg.eval()
likelihood_reg.eval()

# grid settings
grid_x1 = np.arange(-10.0, 10.0, 0.12, dtype=np.float32)
grid_x2 = np.arange(-10.0, 10.0, 0.12, dtype=np.float32)
X1g, X2g = np.meshgrid(grid_x1, grid_x2, indexing='xy')
X_grid = torch.from_numpy(np.column_stack([X1g.ravel(), X2g.ravel()]).astype(np.float32))

with torch.no_grad(), gpytorch.settings.fast_pred_var():
    pred_reg = likelihood_reg(model_reg(X_grid))
    mean_grid = pred_reg.mean.reshape(X1g.shape).numpy()
    var_grid = pred_reg.variance.reshape(X1g.shape).numpy()

# Plot regression mean
plt.figure(figsize=(6,5))
plt.imshow(mean_grid, origin='lower', extent=[-10,10,-10,10], cmap=cm.jet)
plt.colorbar(label='Predictive mean')
plt.scatter(X_np[:,0], X_np[:,1], c=Y_reg_np, edgecolors='k', cmap=cm.coolwarm, s=60)
plt.title("GP Regression — Predictive Mean")
plt.xlabel("x1"); plt.ylabel("x2")
plt.grid(True, linestyle='--', alpha=0.4)
plt.show()

# Plot regression variance
plt.figure(figsize=(6,5))
plt.imshow(var_grid, origin='lower', extent=[-10,10,-10,10], cmap=cm.jet)
plt.colorbar(label='Predictive variance')
plt.scatter(X_np[:,0], X_np[:,1], c='white', edgecolors='k', s=60)
plt.title("GP Regression — Predictive Variance")
plt.xlabel("x1"); plt.ylabel("x2")
plt.grid(True, linestyle='--', alpha=0.4)
plt.show()

# -----------------------------
# Classification: Variational GP (Sparse/Approximate)
# -----------------------------
# Convert dataset for classifier: X (same), Y_class (0/1)
# Use inducing points (we'll use training inputs as inducing for this tiny example)
inducing_points = X.clone()

class VariationalGPClassifier(gpytorch.models.ApproximateGP):
    def __init__(self, inducing_points):
        variational_distribution = gpytorch.variational.CholeskyVariationalDistribution(inducing_points.size(0))
        variational_strategy = gpytorch.variational.VariationalStrategy(
            self, inducing_points, variational_distribution, learn_inducing_locations=True
        )
        super().__init__(variational_strategy)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel(ard_num_dims=inducing_points.shape[1])
        )

    def forward(self, x):
        mean = self.mean_module(x)
        cov = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean, cov)

# Likelihood and model
likelihood_clf = gpytorch.likelihoods.BernoulliLikelihood()
model_clf = VariationalGPClassifier(inducing_points)

# Put into training mode
model_clf.train()
likelihood_clf.train()

# Optimizer and ELBO
optimizer_clf = torch.optim.Adam(model_clf.parameters(), lr=0.05)
num_data = X.shape[0]
elbo = gpytorch.mlls.VariationalELBO(likelihood_clf, model_clf, num_data)

# Training loop
n_iter_clf = 500
for i in range(n_iter_clf):
    optimizer_clf.zero_grad()
    output = model_clf(X)
    loss = -elbo(output, Y_class)
    loss.backward()
    optimizer_clf.step()
    if (i+1) % 100 == 0:
        print(f"[Clf] Iter {i+1}/{n_iter_clf}  ELBO loss: {loss.item():.4f}")

# -----------------------------
# Classification: predictions on grid
# -----------------------------
model_clf.eval()
likelihood_clf.eval()

with torch.no_grad(), gpytorch.settings.fast_pred_var():
    preds_clf = likelihood_clf(model_clf(X_grid))
    probs = preds_clf.mean.reshape(X1g.shape).numpy()  # P(Y=1)

# Plot classification probability map
plt.figure(figsize=(6,5))
plt.imshow(probs, origin='lower', extent=[-10,10,-10,10], cmap=cm.jet, vmin=0, vmax=1)
plt.colorbar(label='P(Y=1)')
# plot training points (use marker styles like original)
for i, y in enumerate(Y_class_np):
    if y == -1.0:
        plt.scatter(X_np[i,0], X_np[i,1], marker='o', color='#1f77b4', edgecolors='k', s=80)
    else:
        plt.scatter(X_np[i,0], X_np[i,1], marker='x', color='#ff7f0e', s=80, linewidths=2)
plt.title("GP Classification — Predicted Probability P(Y=1)")
plt.xlabel("x1"); plt.ylabel("x2")
plt.grid(True, linestyle='--', alpha=0.4)
plt.show()

# Contour of decision surface
plt.figure(figsize=(6,5))
CS = plt.contour(X1g, X2g, probs, levels=np.linspace(0,1,11), cmap=cm.jet)
plt.clabel(CS, inline=True, fontsize=8)
plt.title("GP Classification — Probability Contours")
plt.xlabel("x1"); plt.ylabel("x2")
plt.grid(True, linestyle='--', alpha=0.4)
plt.scatter(X_np[:,0], X_np[:,1], c='k', s=30)
plt.show()
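To recover hard class predictions from the probability map, threshold P(Y=1) at 0.5 and map back to the original -1/+1 labels. A minimal stand-alone sketch (the `probs` values here are made-up placeholders, not outputs of the model above):

```python
import numpy as np

# Hypothetical P(Y=1) values standing in for the grid computed above
probs = np.array([0.05, 0.40, 0.55, 0.90])

# Threshold at 0.5 and map back to the original -1/+1 labels
labels = np.where(probs > 0.5, 1.0, -1.0)
print(labels)  # [-1. -1.  1.  1.]
```

The 0.5 contour in the plot above is therefore the decision boundary of the classifier.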


References

- PyTorch Official Documentation: https://pytorch.org/docs/stable/index.html
- PyTorch Cholesky Decomposition (torch.linalg.cholesky): https://pytorch.org/docs/stable/generated/torch.linalg.cholesky.html
- PyTorch Optimizers: https://pytorch.org/docs/stable/optim.html
- Scikit-Learn Gaussian Processes (reference implementation): https://scikit-learn.org/stable/modules/gaussian_process.html
- GPyTorch, the official library for Gaussian Processes in PyTorch: https://gpytorch.ai
- GPML textbook (Rasmussen & Williams): https://gaussianprocess.org/gpml/chapters/
- Wikipedia overview of Gaussian Processes: https://en.wikipedia.org/wiki/Gaussian_process