Introduction
Gaussian Processes are powerful probabilistic models that represent distributions over functions.
They excel in regression and classification tasks where uncertainty quantification and strong performance on small datasets matter.
Their main drawbacks are computational cost and reliance on good kernel design.
What is a Gaussian Process (GP)?
A Gaussian Process (GP) is a probabilistic model that defines a distribution over functions.
Instead of assuming a specific functional form (like a polynomial or neural network), a GP assumes:
Any finite set of function values follows a joint multivariate Gaussian distribution.
A GP is completely defined by:
- a mean function ( m(x) )
- a covariance function or kernel ( k(x, x') )
\begin{equation}
f(x) \sim \mathrm{GP}(m(x), k(x,x'))
\end{equation}
The kernel encodes smoothness, similarity, periodicity, noise, and other structural assumptions about the function.
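To make the finite-dimensional view concrete, here is a minimal sketch (plain NumPy; the RBF kernel and its hyperparameters are illustrative choices): drawing a "function" at finitely many inputs is just taking one sample from a multivariate Gaussian.

```python
import numpy as np

# Illustrative RBF kernel with assumed hyperparameters
def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    sqdist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale**2)

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)                    # any finite set of inputs
K = rbf_kernel(x, x) + 1e-9 * np.eye(len(x))  # jitter keeps K positive definite
# f ~ N(0, K): one joint multivariate Gaussian draw is one sampled "function"
f = rng.multivariate_normal(np.zeros(len(x)), K)
print(f.shape)  # (50,)
```

Sampling at a different finite set of inputs works the same way; that consistency is exactly what makes the object a process rather than a fixed-dimensional distribution.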
Gaussian Process Regression (GPR)
GPR models the relationship:
\begin{equation}
y = f(x) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2)
\end{equation}
Given training data ( X, y ), the GP infers a posterior distribution over functions.
This gives:
- a predictive mean (best estimate)
- a predictive variance (uncertainty)
So GPR is not just a prediction method; it is a framework for quantified uncertainty.
Why is GPR good for regression?
- Works well with small to medium datasets
- Provides uncertainty (confidence intervals)
- Automatically balances bias and complexity through the kernel
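These points can be seen directly in the closed-form posterior. The sketch below (plain NumPy; the kernel, noise level, and test points are illustrative assumptions, not part of the text above) returns both a predictive mean and a predictive variance, with the variance growing away from the training data:

```python
import numpy as np

# Illustrative 1D RBF kernel
def k(a, b, ell=1.0, var=1.0):
    return var * np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 20)
y = np.sin(X) + 0.1 * rng.standard_normal(20)
sigma2 = 0.1**2                       # assumed observation noise variance
Xs = np.array([0.0, 4.0])             # one point inside the data, one outside

K = k(X, X) + sigma2 * np.eye(len(X))
Ks = k(X, Xs)
Kss = k(Xs, Xs)
alpha = np.linalg.solve(K, y)
mean = Ks.T @ alpha                   # predictive mean
cov = Kss - Ks.T @ np.linalg.solve(K, Ks)
var = np.diag(cov)                    # predictive variance
# Uncertainty is small near the data and grows away from it:
print(var[0] < var[1])  # True
```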
Gaussian Process Classification (GPC)
For classification, outputs are not Gaussian, so GPs use a link function (like logistic or probit):
\begin{equation}
p(y=1 \mid f) = \sigma(f)
\end{equation}
The GP still defines a prior on ( f(x) ), but the likelihood is non-Gaussian.
Inference requires approximation methods:
- Laplace approximation
- Expectation Propagation
- Variational inference
GPC yields probabilistic predictions, giving classification confidence and calibrated probabilities.
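To see why approximation is needed, here is a small Monte Carlo sketch (the latent mean and variance at a test point are hypothetical numbers, standing in for the output of one of the approximations above): the predictive probability E[σ(f)] under a Gaussian posterior on f has no closed form for the logistic link, so it is estimated by sampling.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
mu, var = 0.8, 0.5**2          # hypothetical Gaussian approximate posterior over latent f
f_samples = rng.normal(mu, np.sqrt(var), size=100_000)
p = sigmoid(f_samples).mean()  # Monte Carlo estimate of E[sigma(f)]

# Latent uncertainty pulls the predicted probability toward 0.5:
print(p < sigmoid(mu))  # True
```

This shrinkage toward 0.5 under latent uncertainty is part of what makes GPC probabilities well calibrated.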
Advantages of Gaussian Processes
1. Built-in uncertainty quantification
GPs return an entire probability distribution (mean + variance), not just a point estimate.
2. Excellent performance on small or noisy datasets
Ideal when data is expensive or limited.
3. Flexible through kernels
Choose kernels for:
- smooth functions
- periodic patterns
- spatial correlations
- sums/products for complex structures
4. Few hyperparameters
Training involves only kernel parameters, often fewer than neural networks.
5. Non-parametric
Model complexity grows with the data—no need to specify a fixed function form.
Limitations of Gaussian Processes
1. Poor scaling to large datasets
Training costs:
\begin{equation}
O(n^3) \quad \text{time}, \qquad O(n^2) \quad \text{memory}
\end{equation}
Unsuitable for large-scale data unless approximations (sparse GPs, inducing points) are used.
2. Kernel engineering can be difficult
Performance depends heavily on choosing and tuning the kernel.
3. GPC requires approximate inference
Classification is more computationally intensive due to non-Gaussian likelihoods.
4. Limited prediction speed
Prediction scales with number of training points; large datasets slow down inference.
Gaussian Process Regression (GPR)
Homoscedastic Noise - Example 1
```python
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt

# Use double precision for numerical stability with linear algebra
dtype = torch.double
torch.manual_seed(0)
np.random.seed(0)

# 1) Generate synthetic data
def generate_data(n_train=25, noise_std=0.2):
    X = np.linspace(-3, 3, n_train)[:, None]
    y = np.sin(X).ravel() + noise_std * np.random.randn(n_train)
    return X, y

X_train_np, y_train_np = generate_data(n_train=25, noise_std=0.2)
X_test_np = np.linspace(-5, 5, 400)[:, None]

X_train = torch.tensor(X_train_np, dtype=dtype)
y_train = torch.tensor(y_train_np[:, None], dtype=dtype)  # column vector
X_test = torch.tensor(X_test_np, dtype=dtype)

# 2) RBF kernel implemented in PyTorch (parametrized in log-space)
class RBFKernel(nn.Module):
    def __init__(self, init_lengthscale=1.0, init_variance=1.0):
        super().__init__()
        self.log_lengthscale = nn.Parameter(torch.tensor(np.log(init_lengthscale), dtype=dtype))
        self.log_variance = nn.Parameter(torch.tensor(np.log(init_variance), dtype=dtype))

    def forward(self, x1, x2):
        # Squared distance matrix
        x1_sq = (x1**2).sum(dim=1, keepdim=True)
        x2_sq = (x2**2).sum(dim=1, keepdim=True)
        sqdist = x1_sq + x2_sq.T - 2.0 * (x1 @ x2.T)
        lengthscale = torch.exp(self.log_lengthscale)
        variance = torch.exp(self.log_variance)
        K = variance * torch.exp(-0.5 * sqdist / (lengthscale**2))
        return K, lengthscale, variance

# 3) Gaussian Process Regression class (NLML and prediction)
class GaussianProcessRegression(nn.Module):
    def __init__(self, kernel: RBFKernel, noise_init=0.2):
        super().__init__()
        self.kernel = kernel
        self.log_noise = nn.Parameter(torch.tensor(np.log(noise_init), dtype=dtype))

    def negative_log_marginal_likelihood(self, X, y, jitter=1e-6):
        K, _, _ = self.kernel(X, X)
        noise = torch.exp(self.log_noise)
        n = X.shape[0]
        Kt = K + (noise**2 + jitter) * torch.eye(n, dtype=dtype)
        # Cholesky decomposition
        L = torch.linalg.cholesky(Kt)
        # Solve for alpha = K^{-1} y using triangular solves
        v = torch.linalg.solve_triangular(L, y, upper=False)
        alpha = torch.linalg.solve_triangular(L.T, v, upper=True)
        data_fit = 0.5 * (y.T @ alpha)
        log_det = torch.sum(torch.log(torch.diag(L)))
        constant = 0.5 * n * torch.log(torch.tensor(2.0 * np.pi, dtype=dtype))
        nlml = data_fit + log_det + constant
        return nlml.squeeze()

    def predict(self, X_train, y_train, X_test, jitter=1e-6):
        K_tt, _, _ = self.kernel(X_train, X_train)
        noise = torch.exp(self.log_noise)
        n = X_train.shape[0]
        Kt = K_tt + (noise**2 + jitter) * torch.eye(n, dtype=dtype)
        L = torch.linalg.cholesky(Kt)
        K_s, _, _ = self.kernel(X_train, X_test)
        K_ss, _, _ = self.kernel(X_test, X_test)
        # predictive mean
        v = torch.linalg.solve_triangular(L, y_train, upper=False)
        alpha = torch.linalg.solve_triangular(L.T, v, upper=True)
        pred_mean = K_s.T @ alpha
        # predictive covariance via Schur complement
        w = torch.linalg.solve_triangular(L, K_s, upper=False)
        z = torch.linalg.solve_triangular(L.T, w, upper=True)
        pred_cov = K_ss - K_s.T @ z
        pred_var = torch.clamp(torch.diag(pred_cov), min=1e-12).unsqueeze(1)
        return pred_mean.detach(), pred_var.detach()

# 4) Instantiate and train
kernel = RBFKernel(init_lengthscale=1.0, init_variance=1.0)
gpr = GaussianProcessRegression(kernel=kernel, noise_init=0.2).double()
optimizer = torch.optim.Adam(gpr.parameters(), lr=0.05)

n_iters = 300  # reduce iterations for quicker runs; increase for better convergence
loss_history = []
for i in range(n_iters):
    optimizer.zero_grad()
    nlml = gpr.negative_log_marginal_likelihood(X_train, y_train)
    nlml.backward()
    optimizer.step()
    loss_history.append(nlml.item())
    if (i+1) % 50 == 0 or i == 0:
        with torch.no_grad():
            ls = torch.exp(gpr.kernel.log_lengthscale).item()
            var = torch.exp(gpr.kernel.log_variance).item()
            noise = torch.exp(gpr.log_noise).item()
            print(f"Iter {i+1}/{n_iters} - NLML: {nlml.item():.4f} lengthscale: {ls:.4f} variance: {var:.4f} noise: {noise:.4f}")

# 5) Predict and plot
pred_mean, pred_var = gpr.predict(X_train, y_train, X_test)
pred_mean = pred_mean.numpy().ravel()
pred_std = np.sqrt(pred_var.numpy().ravel())

plt.figure(figsize=(8,5))
plt.scatter(X_train_np.ravel(), y_train_np.ravel(), marker='o', label='Train data')
plt.plot(X_test_np.ravel(), pred_mean, label='Predictive mean')
upper = pred_mean + 1.96 * pred_std
lower = pred_mean - 1.96 * pred_std
plt.fill_between(X_test_np.ravel(), lower, upper, alpha=0.3)
plt.title("Gaussian Process Regression (PyTorch)")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()

with torch.no_grad():
    final_ls = torch.exp(gpr.kernel.log_lengthscale).item()
    final_var = torch.exp(gpr.kernel.log_variance).item()
    final_noise = torch.exp(gpr.log_noise).item()
print(f"Final hyperparameters:\n lengthscale = {final_ls:.4f}\n variance = {final_var:.4f}\n noise = {final_noise:.4f}")
```

Explanation of the Gaussian Process Regression Code
This code implements Gaussian Process Regression (GPR) from scratch in PyTorch, including kernel definition, marginal likelihood optimization, prediction, and visualization.
1. Imports and Setup
```python
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt
```
We import PyTorch for automatic differentiation and linear algebra, NumPy to generate synthetic data, and Matplotlib for plotting.
The model uses:
```python
dtype = torch.double
torch.manual_seed(0)
np.random.seed(0)
```
Double precision is advised for Gaussian Processes due to Cholesky decomposition and matrix inversions. Seeds ensure reproducible results.
2. Generate Synthetic Training and Testing Data
```python
def generate_data(n_train=25, noise_std=0.2):
    X = np.linspace(-3, 3, n_train)[:, None]
    y = np.sin(X).ravel() + noise_std * np.random.randn(n_train)
    return X, y
```
- Inputs ( X ) are points from -3 to 3.
- Targets ( y ) follow a noisy sine wave.
- Test inputs are defined over a wider interval for visualization.
The arrays are converted to PyTorch tensors in double precision.
3. RBF Kernel Class
```python
class RBFKernel(nn.Module):
    ...
```
The Radial Basis Function (RBF) kernel is a common covariance function:
\begin{equation}
k(x,x') = \sigma^2 \exp\left( -\frac{\lVert x-x' \rVert^2}{2\ell^2} \right)
\end{equation}
Parameters are learned in log-space for numerical stability:
- log_lengthscale
- log_variance
Inside forward, the pairwise squared distances between all points in x1 and x2 are computed, and the kernel matrix ( K ) is constructed.
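The identity behind that computation, ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b applied to all pairs at once, can be checked against torch.cdist in a few lines (the shapes here are arbitrary illustrations):

```python
import torch

# Vectorized pairwise squared distances, as in RBFKernel.forward
x1 = torch.randn(5, 2, dtype=torch.double)
x2 = torch.randn(3, 2, dtype=torch.double)

x1_sq = (x1**2).sum(dim=1, keepdim=True)      # (5, 1)
x2_sq = (x2**2).sum(dim=1, keepdim=True)      # (3, 1)
sqdist = x1_sq + x2_sq.T - 2.0 * (x1 @ x2.T)  # (5, 3)

# Reference: torch.cdist gives pairwise Euclidean distances
reference = torch.cdist(x1, x2) ** 2
print(torch.allclose(sqdist, reference))  # True
```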
4. Gaussian Process Regression Class
```python
class GaussianProcessRegression(nn.Module):
    ...
```
This class performs:
(a) Negative Log Marginal Likelihood (NLML)
The GP marginal likelihood:
\begin{equation}
\log p(y|X,\theta) = -\frac12 y^\top K^{-1} y - \frac12 \log |K| - \frac{n}{2}\log(2\pi)
\end{equation}
This is implemented using:
- Kernel matrix computation
- Adding noise variance
- Cholesky decomposition (stable alternative to matrix inversion)
- Solving triangular systems to compute:
\begin{equation}
\alpha = K^{-1}y
\end{equation}
The NLML is returned and used as a loss function for training.
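As a quick sanity check (a NumPy sketch on an arbitrary SPD matrix standing in for the kernel-plus-noise matrix), the Cholesky-based terms reproduce the NLML computed with a direct log-determinant:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n))
K = A @ A.T + 0.1 * np.eye(n)   # stand-in for kernel + noise matrix (SPD)
y = rng.standard_normal(n)

# Cholesky route, as in the class above: sum(log diag(L)) = 0.5 * log|K|
L = np.linalg.cholesky(K)
nlml_chol = (0.5 * y @ np.linalg.solve(K, y)
             + np.sum(np.log(np.diag(L)))
             + 0.5 * n * np.log(2 * np.pi))

# Direct route via slogdet
sign, logdet = np.linalg.slogdet(K)
nlml_direct = 0.5 * (y @ np.linalg.solve(K, y) + logdet + n * np.log(2 * np.pi))

print(np.isclose(nlml_chol, nlml_direct))  # True
```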
(b) Prediction Function
Given new test points ( X_* ):
\begin{equation}
\mu_* = K_{*}^{T} K^{-1} y
\end{equation}
\begin{equation}
\Sigma_{*} = K_{**} - K_{*}^{T} K^{-1} K_*
\end{equation}
The code computes:
- Predictive mean (
pred_mean) - Predictive variance (
pred_var)
Both are detached to avoid tracking gradients.
5. Model Instantiation and Training
```python
kernel = RBFKernel(...)
gpr = GaussianProcessRegression(...)
optimizer = torch.optim.Adam(gpr.parameters(), lr=0.05)
```
The model learns:
- lengthscale
- variance
- noise variance
by minimizing the negative log marginal likelihood.
The training loop:
- Computes NLML
- Backpropagates gradients
- Updates parameters with Adam
- Prints progress every 50 iterations
6. Prediction and Visualization
After training:
```python
pred_mean, pred_var = gpr.predict(...)
```
The predictive standard deviation is:
```python
pred_std = np.sqrt(pred_var.numpy().ravel())
```
The plot shows:
- Training data points
- GP posterior mean (smooth curve)
- 95% confidence interval (shaded region)
How to change the observation noise
The observation noise is controlled entirely by this line inside GaussianProcessRegression:
```python
self.log_noise = nn.Parameter(torch.tensor(np.log(noise_init), dtype=dtype))
```
and used here during training:
```python
noise = torch.exp(self.log_noise)
Kt = K + (noise**2 + jitter) * torch.eye(n, dtype=dtype)
```
This means the model assumes one global noise value, learned during training.
OPTION 1 — Set a different initial noise value
Change:
```python
gpr = GaussianProcessRegression(kernel=kernel, noise_init=0.2)
```
For example:
```python
gpr = GaussianProcessRegression(kernel=kernel, noise_init=0.8)
```
But remember: The GP will still learn a single global noise level, regardless of the initial value.
OPTION 2 — Freeze the noise and set it manually
If you want a fixed noise level (not learned), do:
```python
gpr.log_noise.requires_grad_(False)
gpr.log_noise.data = torch.log(torch.tensor(0.5, dtype=dtype))
```
This forces the GP to use:
\begin{equation}
\sigma_n = 0.5
\end{equation}
for all observations.
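A self-contained sketch of this freezing pattern (using a toy module in place of GaussianProcessRegression) confirms the frozen parameter is untouched by training:

```python
import torch
import torch.nn as nn

# Toy module standing in for the GP model: one frozen noise parameter, one trainable weight
class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.log_noise = nn.Parameter(torch.tensor(0.0, dtype=torch.double))
        self.w = nn.Parameter(torch.tensor(1.0, dtype=torch.double))

m = Toy()
m.log_noise.requires_grad_(False)
m.log_noise.data = torch.log(torch.tensor(0.5, dtype=torch.double))

# Only pass the trainable parameters to the optimizer
opt = torch.optim.Adam([p for p in m.parameters() if p.requires_grad], lr=0.1)
for _ in range(10):
    opt.zero_grad()
    loss = (m.w - 3.0) ** 2  # any loss; log_noise receives no updates
    loss.backward()
    opt.step()

print(torch.exp(m.log_noise).item())  # ~0.5, unchanged by training
```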
Homoscedastic Noise - Example 2
```python
import torch
import numpy as np
import matplotlib.pyplot as plt

torch.set_default_dtype(torch.double)

# 1) Data
X = torch.linspace(-3, 3, 25).unsqueeze(1)
y = torch.sin(X) + 0.2 * torch.randn_like(X)

# measurement uncertainty (example)
y_err = 0.5 * torch.ones_like(y)

X_test = torch.linspace(-5, 5, 400).unsqueeze(1)

# 2) Parameters (log-space)
log_length = torch.tensor(0.0, requires_grad=True)
log_var = torch.tensor(0.0, requires_grad=True)
log_noise = torch.tensor(-1.0, requires_grad=True)

# 3) RBF kernel
def rbf(x1, x2, log_l, log_v):
    l = torch.exp(log_l)
    v = torch.exp(log_v)
    sqdist = (x1**2).sum(1, keepdim=True) + (x2**2).sum(1) - 2*x1@x2.T
    return v * torch.exp(-0.5 * sqdist / l**2)

# 4) Negative log marginal likelihood
def nlml():
    K = rbf(X, X, log_length, log_var)
    noise = torch.exp(log_noise)
    K = K + noise**2 * torch.eye(len(X))
    L = torch.linalg.cholesky(K)
    alpha = torch.cholesky_solve(y, L)
    return (
        0.5 * (y.T @ alpha)
        + torch.sum(torch.log(torch.diag(L)))
        + 0.5 * len(X) * torch.log(torch.tensor(2*np.pi))
    )

# 5) Train hyperparameters
opt = torch.optim.Adam([log_length, log_var, log_noise], lr=0.05)
for i in range(300):
    opt.zero_grad()
    loss = nlml()
    loss.backward()
    opt.step()
    if i % 50 == 0:
        print(i, loss.item())

# 6) Predictive distribution
def predict(X_new):
    K = rbf(X, X, log_length, log_var)
    noise = torch.exp(log_noise)
    K = K + torch.diag(noise**2 + y_err.squeeze()**2)
    L = torch.linalg.cholesky(K)
    Ks = rbf(X, X_new, log_length, log_var)
    Kss = rbf(X_new, X_new, log_length, log_var)
    alpha = torch.cholesky_solve(y, L)
    mean = Ks.T @ alpha
    v = torch.linalg.solve(L, Ks)
    cov = Kss - v.T @ v
    std = torch.diag(cov).clamp(min=1e-9).sqrt()
    return mean.squeeze().detach(), std.detach()

mu, std = predict(X_test)

# 7) Plot
plt.figure(figsize=(8,5))

# Training data WITH error bars
plt.errorbar(
    X.squeeze().numpy(), y.squeeze().numpy(),
    yerr=y_err.squeeze().numpy(),
    fmt='o', color='black', ecolor='gray',
    elinewidth=1.5, capsize=3,
    label="Training data ± error"
)

# Predictive mean
plt.plot(X_test.numpy(), mu.numpy(), label="Predictive mean")

# Predictive shaded region
plt.fill_between(
    X_test.squeeze().numpy(),
    (mu - 2*std).numpy(),
    (mu + 2*std).numpy(),
    alpha=0.3, label="GP ±2σ"
)

plt.legend()
plt.title("Gaussian Process Regression with Training Data Error Bars")
plt.show()
```

Heteroscedastic Noise - Example 3
```python
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt

# Use double precision for numerical stability with linear algebra
dtype = torch.double
torch.manual_seed(0)
np.random.seed(0)

# -------------------------------------------------------
# 1) Generate synthetic data + example heteroscedastic errors
# -------------------------------------------------------
def generate_data(n_train=25, noise_std=0.2):
    X = np.linspace(-3, 3, n_train)[:, None]
    y = np.sin(X).ravel() + noise_std * np.random.randn(n_train)
    return X, y

X_train_np, y_train_np = generate_data(n_train=25, noise_std=0.2)
X_test_np = np.linspace(-5, 5, 400)[:, None]

# Example: heteroscedastic obs uncertainty
# Make errors larger near edges just for demonstration
obs_error_std_np = 0.1 + 0.3 * np.abs(X_train_np.ravel()) / 3.0

X_train = torch.tensor(X_train_np, dtype=dtype)
y_train = torch.tensor(y_train_np[:, None], dtype=dtype)
obs_error_std = torch.tensor(obs_error_std_np[:, None], dtype=dtype)
X_test = torch.tensor(X_test_np, dtype=dtype)

# -------------------------------------------------------
# 2) RBF kernel
# -------------------------------------------------------
class RBFKernel(nn.Module):
    def __init__(self, init_lengthscale=1.0, init_variance=1.0):
        super().__init__()
        self.log_lengthscale = nn.Parameter(torch.tensor(np.log(init_lengthscale), dtype=dtype))
        self.log_variance = nn.Parameter(torch.tensor(np.log(init_variance), dtype=dtype))

    def forward(self, x1, x2):
        x1_sq = (x1**2).sum(dim=1, keepdim=True)
        x2_sq = (x2**2).sum(dim=1, keepdim=True)
        sqdist = x1_sq + x2_sq.T - 2.0 * (x1 @ x2.T)
        lengthscale = torch.exp(self.log_lengthscale)
        variance = torch.exp(self.log_variance)
        K = variance * torch.exp(-0.5 * sqdist / (lengthscale**2))
        return K, lengthscale, variance

# -------------------------------------------------------
# 3) Gaussian Process Regression with observation errors
# -------------------------------------------------------
class GaussianProcessRegression(nn.Module):
    def __init__(self, kernel: RBFKernel):
        super().__init__()
        self.kernel = kernel

    def negative_log_marginal_likelihood(self, X, y, obs_error_std, jitter=1e-6):
        K, _, _ = self.kernel(X, X)
        n = X.shape[0]
        # --- NEW: Use per-observation variances ---
        noise_diag = obs_error_std.squeeze()**2
        Kt = K + torch.diag(noise_diag + jitter)
        # Cholesky decomposition
        L = torch.linalg.cholesky(Kt)
        v = torch.linalg.solve_triangular(L, y, upper=False)
        alpha = torch.linalg.solve_triangular(L.T, v, upper=True)
        data_fit = 0.5 * (y.T @ alpha)
        log_det = torch.sum(torch.log(torch.diag(L)))
        constant = 0.5 * n * torch.log(torch.tensor(2.0 * np.pi, dtype=dtype))
        nlml = data_fit + log_det + constant
        return nlml.squeeze()

    def predict(self, X_train, y_train, X_test, obs_error_std, jitter=1e-6):
        K_tt, _, _ = self.kernel(X_train, X_train)
        noise_diag = obs_error_std.squeeze()**2
        Kt = K_tt + torch.diag(noise_diag + jitter)
        L = torch.linalg.cholesky(Kt)
        K_s, _, _ = self.kernel(X_train, X_test)
        K_ss, _, _ = self.kernel(X_test, X_test)
        v = torch.linalg.solve_triangular(L, y_train, upper=False)
        alpha = torch.linalg.solve_triangular(L.T, v, upper=True)
        pred_mean = K_s.T @ alpha
        w = torch.linalg.solve_triangular(L, K_s, upper=False)
        z = torch.linalg.solve_triangular(L.T, w, upper=True)
        pred_cov = K_ss - K_s.T @ z
        pred_var = torch.clamp(torch.diag(pred_cov), min=1e-12).unsqueeze(1)
        return pred_mean.detach(), pred_var.detach()

# -------------------------------------------------------
# 4) Instantiate and train
# -------------------------------------------------------
kernel = RBFKernel(init_lengthscale=1.0, init_variance=1.0)
gpr = GaussianProcessRegression(kernel=kernel).double()
optimizer = torch.optim.Adam(gpr.parameters(), lr=0.05)

n_iters = 300
loss_history = []
for i in range(n_iters):
    optimizer.zero_grad()
    nlml = gpr.negative_log_marginal_likelihood(X_train, y_train, obs_error_std)
    nlml.backward()
    optimizer.step()
    loss_history.append(nlml.item())
    if (i+1) % 50 == 0:
        with torch.no_grad():
            ls = torch.exp(gpr.kernel.log_lengthscale).item()
            var = torch.exp(gpr.kernel.log_variance).item()
            print(f"Iter {i+1}/{n_iters} | NLML={nlml.item():.4f} | ls={ls:.4f} | var={var:.4f}")

# -------------------------------------------------------
# 5) Predict and plot
# -------------------------------------------------------
pred_mean, pred_var = gpr.predict(X_train, y_train, X_test, obs_error_std)
pred_mean = pred_mean.numpy().ravel()
pred_std = np.sqrt(pred_var.numpy().ravel())

plt.figure(figsize=(10,5))
plt.scatter(X_train_np.ravel(), y_train_np.ravel(), c='k', label='Train', s=40)
plt.errorbar(
    X_train_np.ravel(), y_train_np.ravel(),
    yerr=obs_error_std_np,
    fmt='none', ecolor='gray', alpha=0.5,
    label='Obs errors'
)
plt.plot(X_test_np.ravel(), pred_mean, label="Mean")
plt.fill_between(
    X_test_np.ravel(),
    pred_mean - 1.96*pred_std,
    pred_mean + 1.96*pred_std,
    alpha=0.3, label="95% CI"
)
plt.legend()
plt.title("Gaussian Process Regression with per-observation errors")
plt.show()
```

Gaussian Processes using GPyTorch
What is GPyTorch?
GPyTorch is a modern, scalable Gaussian Process (GP) library built on PyTorch.
It provides GPU acceleration, automatic differentiation, and efficient linear algebra methods that allow GPs to be trained on much larger datasets than classical GP implementations.
It was developed by researchers at Cornell University and Uber AI Labs, and supports:
- Exact Gaussian Processes (small/medium datasets)
- Approximate/Variational GPs (large datasets)
- Additive kernels, multitask GPs, spectral mixture kernels
- Natural integration with PyTorch models
How to Install GPyTorch
Install using pip:
```shell
pip install gpytorch
```
or with conda:
```shell
conda install -c conda-forge gpytorch
```
You must have PyTorch installed first:
```shell
pip install torch
```
Why GPyTorch is interesting
Automatic differentiation
Kernel parameters, noise, and model parameters are automatically optimized using PyTorch autograd.
GPU acceleration
Cholesky decompositions, matrix solves, and kernel operations run on GPU.
Built-in kernels
RBF, Matérn, Spectral Mixture, Rational Quadratic (RQ), polynomial, and more.
Large-Scale GP Support
Via SKI kernels, variational GPs, and structured kernel interpolation.
Clean, modular API
Models are subclasses of gpytorch.models.ExactGP.
Heteroscedastic noise supported
Via likelihood(noise) or custom noise models.
Full GPyTorch Implementation (with heteroscedastic noise)
```python
import torch
import gpytorch
import numpy as np
import matplotlib.pyplot as plt

dtype = torch.double
torch.manual_seed(0)
np.random.seed(0)

# -------------------------------------------------------
# 1) Generate data
# -------------------------------------------------------
def generate_data(n_train=25, noise_std=0.2):
    X = np.linspace(-3, 3, n_train)[:, None]
    y = np.sin(X).ravel() + noise_std * np.random.randn(n_train)
    return X, y

X_train_np, y_train_np = generate_data()
X_test_np = np.linspace(-5, 5, 400)[:, None]

# Heteroscedastic noise example
obs_error_std_np = 0.1 + 0.3 * np.abs(X_train_np.ravel()) / 3.0

X_train = torch.tensor(X_train_np, dtype=dtype)
y_train = torch.tensor(y_train_np, dtype=dtype)
obs_noise = torch.tensor(obs_error_std_np**2, dtype=dtype)
X_test = torch.tensor(X_test_np, dtype=dtype)

# -------------------------------------------------------
# 2) Define GP Model (Exact GP)
# -------------------------------------------------------
class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel()
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        cov_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, cov_x)

# -------------------------------------------------------
# 3) Setup likelihood with heteroscedastic noise
# -------------------------------------------------------
likelihood = gpytorch.likelihoods.FixedNoiseGaussianLikelihood(
    noise=obs_noise,
    learn_additional_noise=False  # prevents learning extra homoscedastic term
)

model = ExactGPModel(X_train, y_train, likelihood).double()

model.train()
likelihood.train()

optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

# -------------------------------------------------------
# 4) Train GP hyperparameters
# -------------------------------------------------------
training_iter = 300
for i in range(training_iter):
    optimizer.zero_grad()
    output = model(X_train)
    loss = -mll(output, y_train)
    loss.backward()
    if (i + 1) % 50 == 0:
        ls = model.covar_module.base_kernel.lengthscale.item()
        var = model.covar_module.outputscale.item()
        print(f"Iter {i+1}/{training_iter} | Loss={loss.item():.4f} | ls={ls:.3f} | var={var:.3f}")
    optimizer.step()

# -------------------------------------------------------
# 5) Evaluation
# -------------------------------------------------------
model.eval()
likelihood.eval()

with torch.no_grad():
    preds = likelihood(model(X_test))
    mean = preds.mean.numpy()
    lower, upper = preds.confidence_region()

# -------------------------------------------------------
# 6) Plot
# -------------------------------------------------------
plt.figure(figsize=(10, 5))
plt.scatter(X_train_np, y_train_np, c='k', s=40, label="Train")
plt.errorbar(
    X_train_np.ravel(), y_train_np,
    yerr=obs_error_std_np,
    fmt='none', ecolor='gray', alpha=0.6,
    label="Obs noise"
)
plt.plot(X_test_np, mean, label='Prediction mean')
plt.fill_between(
    X_test_np.ravel(),
    lower.numpy(), upper.numpy(),
    alpha=0.3, label="95% CI"
)
plt.title("GPyTorch — GP Regression with Heteroscedastic Noise")
plt.legend()
plt.show()
```

Another Example
We will now revisit the example introduced in the previous article, where we manually implemented a Gaussian Process model without relying on PyTorch. This will allow us to compare that approach with a clean implementation using GPyTorch.
1D Example
```python
import torch
import gpytorch
import matplotlib.pyplot as plt

# -----------------------------------------
# 1) Synthetic training data
# -----------------------------------------
X = torch.tensor([1., 3., 5., 6., 7., 8.]).unsqueeze(1)
Y = X.squeeze() * torch.sin(X.squeeze())

# NEW: different noise per observation
obs_noise_std = torch.tensor([0.2, 0.1, 0.5, 0.3, 0.15, 0.4])  # example
obs_noise_var = obs_noise_std**2

# -----------------------------------------
# 2) GP Model with FixedNoiseGaussianLikelihood
# -----------------------------------------
class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel()
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

# FIXED NOISE likelihood → uses OBS noise
likelihood = gpytorch.likelihoods.FixedNoiseGaussianLikelihood(
    noise=obs_noise_var
)

model = ExactGPModel(X, Y, likelihood)

# -----------------------------------------
# 3) Training
# -----------------------------------------
model.train()
likelihood.train()

optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

for i in range(120):
    optimizer.zero_grad()
    output = model(X)
    loss = -mll(output, Y)
    loss.backward()
    optimizer.step()

# -----------------------------------------
# 4) Prediction
# -----------------------------------------
model.eval()
likelihood.eval()

Xtest = torch.linspace(0, 10, 200).unsqueeze(1)

with torch.no_grad(), gpytorch.settings.fast_pred_var():
    pred = likelihood(model(Xtest))
    pred_mean = pred.mean.numpy()
    pred_std = pred.stddev.numpy()

# -----------------------------------------
# 5) Plot results
# -----------------------------------------
plt.figure(figsize=(9,5))

# show per-point noise in the visualization
plt.errorbar(
    X.squeeze().numpy(), Y.numpy(),
    yerr=obs_noise_std.numpy(),
    fmt='o', capsize=4,
    label="Training data (heteroscedastic noise)"
)

plt.plot(Xtest.numpy(), pred_mean, "r--", label="GP mean")
plt.fill_between(
    Xtest.squeeze().numpy(),
    pred_mean - 1.96*pred_std,
    pred_mean + 1.96*pred_std,
    alpha=0.3, label="95% interval"
)

plt.xlabel("x")
plt.ylabel("y")
plt.title("Gaussian Process Regression (with per-point noise)")
plt.legend()
plt.grid(True)
plt.show()
```

2D Example
GP Regression Model (2D RBF kernel)
\begin{equation}
k(x, x') = \sigma_f^2 \exp\left(-\frac{(x_1-x_1')^2}{2\ell_1^2}-\frac{(x_2-x_2')^2}{2\ell_2^2}\right)
\end{equation}
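As a sketch of what this ARD form means (hand-rolled in PyTorch, with illustrative lengthscales rather than learned ones), each input dimension gets its own lengthscale, so correlation decays at a different rate along x1 than along x2:

```python
import torch

# 2D ARD-RBF kernel: one lengthscale per input dimension
def ard_rbf(x1, x2, ells, sigma_f=1.0):
    # x1: (n, 2), x2: (m, 2), ells: (2,)
    diff = (x1[:, None, :] - x2[None, :, :]) / ells   # per-dimension scaling
    return sigma_f**2 * torch.exp(-0.5 * (diff**2).sum(-1))

x = torch.tensor([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
ells = torch.tensor([2.0, 0.5])   # assumed: slow variation in x1, fast in x2
K = ard_rbf(x, x, ells)
# Moving one unit along x2 decorrelates much faster than along x1:
print(K[0, 1] > K[0, 2])  # True
```

This is what `gpytorch.kernels.RBFKernel(ard_num_dims=2)` learns automatically in the code below: one lengthscale per dimension.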
Updated Code
```python
import torch
import gpytorch
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm

torch.manual_seed(0)

# -----------------------------
# Data (same as our original example)
# -----------------------------
X_np = np.array([
    [-8.0, -8.0],
    [-6.0, -3.0],
    [-7.0,  2.0],
    [-4.0,  4.0],
    [ 2.0,  3.0],
    [ 5.0,  7.0],
    [ 1.0, -1.0],
    [ 3.0, -4.0],
    [ 7.0, -7.0]
], dtype=np.float32)

# labels originally -1 / +1 -> convert to 0/1 for Bernoulli likelihood
Y_class_np = np.array([-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0], dtype=np.float32)
Y_class_01_np = (Y_class_np > 0).astype(np.float32)  # 0 or 1

# We'll also use Y_reg as same numeric labels for regression (float)
Y_reg_np = Y_class_np.copy()

# Torch tensors (float)
X = torch.from_numpy(X_np)
Y_reg = torch.from_numpy(Y_reg_np)
Y_class = torch.from_numpy(Y_class_01_np)  # 0/1

# -----------------------------
# Regression: Exact GP (RBF with ARD)
# -----------------------------
class ExactGPRegression(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        # RBF with ARD -> separate lengthscale per input dim
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel(ard_num_dims=train_x.shape[1])
        )

    def forward(self, x):
        mean = self.mean_module(x)
        cov = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean, cov)

# Setup likelihood & model
likelihood_reg = gpytorch.likelihoods.GaussianLikelihood()
model_reg = ExactGPRegression(X, Y_reg, likelihood_reg)

# Train regression model
model_reg.train()
likelihood_reg.train()
optimizer = torch.optim.Adam(model_reg.parameters(), lr=0.1)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood_reg, model_reg)

n_iter_reg = 200
for i in range(n_iter_reg):
    optimizer.zero_grad()
    output = model_reg(X)
    loss = -mll(output, Y_reg)
    loss.backward()
    optimizer.step()
    if (i + 1) % 50 == 0:
        print(f"[Reg] Iter {i+1}/{n_iter_reg} NLML: {loss.item():.4f}")

# -----------------------------
# Regression: predictions on grid
# -----------------------------
model_reg.eval()
likelihood_reg.eval()

# grid settings
grid_x1 = np.arange(-10.0, 10.0, 0.12, dtype=np.float32)
grid_x2 = np.arange(-10.0, 10.0, 0.12, dtype=np.float32)
X1g, X2g = np.meshgrid(grid_x1, grid_x2, indexing='xy')
X_grid = torch.from_numpy(np.column_stack([X1g.ravel(), X2g.ravel()]).astype(np.float32))

with torch.no_grad(), gpytorch.settings.fast_pred_var():
    pred_reg = likelihood_reg(model_reg(X_grid))
    mean_grid = pred_reg.mean.reshape(X1g.shape).numpy()
    var_grid = pred_reg.variance.reshape(X1g.shape).numpy()

# Plot regression mean
plt.figure(figsize=(6, 5))
plt.imshow(mean_grid, origin='lower', extent=[-10, 10, -10, 10], cmap=cm.jet)
plt.colorbar(label='Predictive mean')
plt.scatter(X_np[:, 0], X_np[:, 1], c=Y_reg_np, edgecolors='k', cmap=cm.coolwarm, s=60)
plt.title("GP Regression — Predictive Mean")
plt.xlabel("x1"); plt.ylabel("x2")
plt.grid(True, linestyle='--', alpha=0.4)
plt.show()

# Plot regression variance
plt.figure(figsize=(6, 5))
plt.imshow(var_grid, origin='lower', extent=[-10, 10, -10, 10], cmap=cm.jet)
plt.colorbar(label='Predictive variance')
plt.scatter(X_np[:, 0], X_np[:, 1], c='white', edgecolors='k', s=60)
plt.title("GP Regression — Predictive Variance")
plt.xlabel("x1"); plt.ylabel("x2")
plt.grid(True, linestyle='--', alpha=0.4)
plt.show()

# -----------------------------
# Classification: Variational GP (Sparse/Approximate)
# -----------------------------
# Convert dataset for classifier: X (same), Y_class (0/1)
# Use inducing points (we'll use training inputs as inducing for this tiny example)
inducing_points = X.clone()

class VariationalGPClassifier(gpytorch.models.ApproximateGP):
    def __init__(self, inducing_points):
        variational_distribution = gpytorch.variational.CholeskyVariationalDistribution(
            inducing_points.size(0)
        )
        variational_strategy = gpytorch.variational.VariationalStrategy(
            self, inducing_points, variational_distribution, learn_inducing_locations=True
        )
        super().__init__(variational_strategy)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel(ard_num_dims=inducing_points.shape[1])
        )

    def forward(self, x):
        mean = self.mean_module(x)
        cov = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean, cov)

# Likelihood and model
likelihood_clf = gpytorch.likelihoods.BernoulliLikelihood()
model_clf = VariationalGPClassifier(inducing_points)

# Put into training mode
model_clf.train()
likelihood_clf.train()

# Optimizer and ELBO
optimizer_clf = torch.optim.Adam(model_clf.parameters(), lr=0.05)
num_data = X.shape[0]
elbo = gpytorch.mlls.VariationalELBO(likelihood_clf, model_clf, num_data)

# Training loop
n_iter_clf = 500
for i in range(n_iter_clf):
    optimizer_clf.zero_grad()
    output = model_clf(X)
    loss = -elbo(output, Y_class)
    loss.backward()
    optimizer_clf.step()
    if (i + 1) % 100 == 0:
        print(f"[Clf] Iter {i+1}/{n_iter_clf} ELBO loss: {loss.item():.4f}")

# -----------------------------
# Classification: predictions on grid
# -----------------------------
model_clf.eval()
likelihood_clf.eval()

with torch.no_grad(), gpytorch.settings.fast_pred_var():
    preds_clf = likelihood_clf(model_clf(X_grid))
    probs = preds_clf.mean.reshape(X1g.shape).numpy()  # P(Y=1)

# Plot classification probability map
plt.figure(figsize=(6, 5))
plt.imshow(probs, origin='lower', extent=[-10, 10, -10, 10], cmap=cm.jet, vmin=0, vmax=1)
plt.colorbar(label='P(Y=1)')

# plot training points (use marker styles like original)
for i, y in enumerate(Y_class_np):
    if y == -1.0:
        plt.scatter(X_np[i, 0], X_np[i, 1], marker='o', color='#1f77b4', edgecolors='k', s=80)
    else:
        plt.scatter(X_np[i, 0], X_np[i, 1], marker='x', color='#ff7f0e', s=80, linewidths=2)

plt.title("GP Classification — Predicted Probability P(Y=1)")
plt.xlabel("x1"); plt.ylabel("x2")
plt.grid(True, linestyle='--', alpha=0.4)
plt.show()

# Contour of decision surface
plt.figure(figsize=(6, 5))
CS = plt.contour(X1g, X2g, probs, levels=np.linspace(0, 1, 11), cmap=cm.jet)
plt.clabel(CS, inline=True, fontsize=8)
plt.title("GP Classification — Probability Contours")
plt.xlabel("x1"); plt.ylabel("x2")
plt.grid(True, linestyle='--', alpha=0.4)
plt.scatter(X_np[:, 0], X_np[:, 1], c='k', s=30)
plt.show()
```
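Under the hood, the exact GP regression predictions above are the standard closed-form posterior equations. As a minimal sketch of what GPyTorch computes (with an arbitrary fixed lengthscale, output scale, and noise rather than learned hyperparameters), the predictive mean and variance can be written directly in NumPy:

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, outputscale=1.0):
    # Squared-exponential kernel: k(x, x') = s^2 * exp(-||x - x'||^2 / (2 l^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return outputscale * np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X_train, y_train, X_test, noise=0.1):
    # Standard GP regression predictive equations:
    #   mean = K_* (K + sigma^2 I)^{-1} y
    #   cov  = K_** - K_* (K + sigma^2 I)^{-1} K_*^T
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    Ks = rbf_kernel(X_test, X_train)
    Kss = rbf_kernel(X_test, X_test)
    alpha = np.linalg.solve(K, y_train)
    mean = Ks @ alpha
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
    return mean, np.diag(cov)

# Tiny 2D example: two points with labels -1 and +1
X_train = np.array([[-1.0, 0.0], [1.0, 0.0]])
y_train = np.array([-1.0, 1.0])
X_test = np.array([[0.0, 0.0], [5.0, 5.0]])
mean, var = gp_posterior(X_train, y_train, X_test)
# By symmetry the mean at the midpoint is 0; far from the data,
# the variance reverts to the prior (outputscale = 1.0).
```

This mirrors the behavior visible in the variance plot: uncertainty is lowest near the training points and grows back toward the prior away from them. In practice a Cholesky factorization is used instead of repeated `np.linalg.solve` calls for numerical stability and efficiency.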


References
| Link | Resource |
|---|---|
| https://pytorch.org/docs/stable/index.html | PyTorch Official Documentation |
| https://pytorch.org/docs/stable/generated/torch.linalg.cholesky.html | PyTorch Cholesky Decomposition |
| https://pytorch.org/docs/stable/optim.html | PyTorch Optimizers |
| https://scikit-learn.org/stable/modules/gaussian_process.html | Scikit-Learn Gaussian Processes (reference implementation) |
| https://gpytorch.ai | GPyTorch Official Library for Gaussian Processes in PyTorch |
| https://gaussianprocess.org/gpml/chapters/ | GPML Textbook (Rasmussen & Williams) |
| https://en.wikipedia.org/wiki/Gaussian_process | Overview of Gaussian Processes |
