Bayesian Optimization

by Ibai Roman

ISG at EHU

Problem description

  • Many machine learning algorithms must be fine-tuned to achieve good performance.
  • Manual tuning is often intractable.

Problem properties

  • Finding the best parameter set can be seen as a nonlinear function optimization problem: $$ \textbf{x}^* = \operatorname*{argmin}_{\textbf{x} \in A \subset \mathbb{R}^{n_d}} f(\textbf{x}) $$
  • The objective function is a black-box function.
  • The objective function is frequently very expensive to evaluate.
  • The best solution must be found in as few function evaluations as possible.
  • Bayesian Optimization is well suited to this kind of optimization problem (a hypothetical example of such a black-box objective follows).
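
To make the problem concrete, here is a hypothetical example (not from the original slides) of casting hyperparameter tuning as black-box minimization; the dataset, model, and search space are assumptions chosen purely for illustration:

```python
# Hypothetical example: SVM hyperparameter tuning cast as a
# black-box minimization problem f(x) over x = (log10 C, log10 gamma).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X_data, y_data = load_iris(return_X_y=True)

def f(x):
    """Black-box objective: expensive to evaluate, no analytic form,
    no gradients. Returns the negative cross-validated accuracy."""
    model = SVC(C=10.0 ** x[0], gamma=10.0 ** x[1])
    return -cross_val_score(model, X_data, y_data, cv=5).mean()

# A single (expensive) evaluation at x = (0, -2), i.e. C = 1, gamma = 0.01.
print(f(np.array([0.0, -2.0])))
```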

Bayesian Optimization (BO)

  • Sequential global optimization algorithm.
  • As the analytic form of the objective function is generally unknown, the sampling strategy is based on a probability distribution over all possible objective functions.
  • The sampling strategy: it uses a utility function ($u(\cdot)$), also called the acquisition function, which measures the utility of evaluating each candidate solution.
  • The probability distribution: it acts as a surrogate model ($SM$) and is updated every time a solution is evaluated on the objective function (a minimal sketch of this loop follows the list).
  • Why Bayesian? The prior belief over functions is updated with the likelihood of the observations, yielding a posterior distribution over functions: $$ P\left( f \mid D_{1:t} \right) \propto P\left( D_{1:t} \mid f \right) P\left( f \right) $$
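
A minimal sketch of the loop just described, assuming a 1-D toy objective, a scikit-learn GP as the surrogate model, random candidate sampling, and a simple confidence-bound utility (the slides use expected improvement, shown later); this is an illustration, not the authors' exact implementation:

```python
# Minimal BO loop: fit surrogate, maximize utility, evaluate, repeat.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def f(x):
    """Toy stand-in for the expensive black-box objective."""
    return np.sin(3.0 * x) + 0.5 * x ** 2

rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(3, 1))    # initial random design
y = f(X).ravel()

for t in range(20):
    # Update the surrogate model with all evaluations so far.
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    # Sampling strategy: maximize a utility over random candidates.
    cand = rng.uniform(-2.0, 2.0, size=(1000, 1))
    mu, sigma = gp.predict(cand, return_std=True)
    utility = -(mu - 2.0 * sigma)           # negated lower confidence bound
    x_next = cand[np.argmax(utility)]
    # One expensive evaluation of the objective function.
    X = np.vstack([X, x_next[None, :]])
    y = np.append(y, f(x_next))

print("best x:", X[np.argmin(y)], "f:", y.min())
```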

Gaussian Process (GP)

  • One of the most popular choices as $SM$ in BO.
  • Definition: a GP is a collection of random variables, any finite number of which have a joint (multivariate) Gaussian distribution.
  • Key property for BO: the posterior distribution of a GP after sampling the objective function is also a GP (a NumPy sketch of this follows the list).
  • A GP can be viewed as an infinite-dimensional multivariate Gaussian.
  • A GP is completely defined by a mean function and a covariance (kernel) function: $$ f(\textbf{x}) \sim \mathcal{GP}(m(\textbf{x}), k(\textbf{x},\textbf{x}')) $$
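
The key property above has a closed form; a minimal NumPy sketch, assuming a zero mean function and a squared exponential kernel with unit length-scale:

```python
# GP posterior inference from scratch (zero mean function assumed).
import numpy as np

def kernel(A, B, length=1.0):
    """Squared exponential kernel k(x, x') on row-wise inputs."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * d2 / length ** 2)

def gp_posterior(X, y, X_star, noise=1e-6):
    """Posterior mean and std of the GP at test points X_star."""
    K = kernel(X, X) + noise * np.eye(len(X))        # train covariance
    K_s = kernel(X, X_star)                          # train/test covariance
    K_ss = kernel(X_star, X_star)                    # test covariance
    alpha = np.linalg.solve(K, y)
    mu = K_s.T @ alpha                               # posterior mean
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)     # posterior covariance
    return mu, np.sqrt(np.maximum(np.diag(cov), 0.0))

# Usage: condition on three observations, predict on a grid.
X = np.array([[-1.0], [0.0], [1.5]])
y = np.sin(X).ravel()
mu, sigma = gp_posterior(X, y, np.linspace(-2, 2, 5)[:, None])
```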

Kernel Function

  • Establishes the relation between the objective function values of each pair of solutions.
  • It usually has its own parameters, which need to be tuned; two major approaches are optimizing them (OPT, typically by maximizing the marginal likelihood) and marginalizing them via Markov chain Monte Carlo sampling (MCMC).
  • The choice of the kernel and its parameters is crucial for BO performance.
  • Kernel function examples: squared exponential, Matérn (3/2, 5/2), rational quadratic, γ-exponential (two are sketched below).
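
The slide showed the kernel examples as images; as a hedged reconstruction, here are two of the kernels that appear in the experiments, written as stationary kernels of the distance r = |x − x′|. The parameterizations are the textbook ones, and alpha = 2 for the rational quadratic is an assumption based on the name RationalQuadratic2 in the results:

```python
import numpy as np

def matern32(r, length=1.0):
    """Matérn 3/2 kernel as a function of the distance r = |x - x'|."""
    s = np.sqrt(3.0) * r / length
    return (1.0 + s) * np.exp(-s)

def rational_quadratic(r, length=1.0, alpha=2.0):
    """Rational quadratic kernel; alpha controls the scale mixture."""
    return (1.0 + r ** 2 / (2.0 * alpha * length ** 2)) ** (-alpha)
```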

Acquisition Function

  • The sampling strategy is based on this acquisition function.
  • It selects the next point to evaluate.
  • It provides a measure of the utility of each candidate solution.
  • It regulates the exploration versus exploitation trade-off.
  • Maximizing it is a secondary, but much cheaper, optimization problem.
  • Example: Expected Improvement (for minimization), where $\textbf{x}^+$ is the best solution found so far and $\Phi$ and $\phi$ are the standard Gaussian CDF and PDF (transcribed in code below): $$ u_{EI}(\textbf{x}) = \begin{cases} \sigma(\textbf{x}) \left( Z \, \Phi(Z) + \phi(Z) \right) & \quad \text{if } \sigma(\textbf{x}) > 0 \\ 0 & \quad \text{if } \sigma(\textbf{x}) = 0 \end{cases} \qquad Z = \begin{cases} \frac{f(\textbf{x}^+) - \mu(\textbf{x})}{\sigma(\textbf{x})} & \quad \text{if } \sigma(\textbf{x}) > 0 \\ 0 & \quad \text{if } \sigma(\textbf{x}) = 0 \end{cases} $$
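
A direct transcription of the expected improvement formula above, given the GP posterior mean and standard deviation at a candidate point; the name f_best stands for f(x+), the best value observed so far:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    """u_EI(x) for minimization, from the GP posterior (mu, sigma) at x."""
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    z = np.zeros_like(sigma)
    np.divide(f_best - mu, sigma, out=z, where=sigma > 0)  # Z, 0 when sigma = 0
    ei = sigma * (z * norm.cdf(z) + norm.pdf(z))
    return np.where(sigma > 0, ei, 0.0)
```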

Adaptive Kernel Selection Criteria

  • The choice of the kernel and its parameters is crucial for BO performance.
  • It seems reasonable to select the kernel automatically, to avoid introducing yet another optimization problem for the practitioner.
  • Although kernel parameter tuning has attracted much attention, adaptive kernel selection has not been extensively studied in the BO research field.
  • We present a general framework for adaptive kernel selection in BO.
  • We propose to carry out the optimization process with several GPs in parallel (each with a different kernel) and to adaptively choose one of them.
  • By adaptively choosing the kernel during the optimization process, there is no need to select a kernel in advance.

Adaptive Kernel Selection Criteria

Adaptive Random:
A GP is selected at random at each iteration.
Best Likelihood:
The GP with the highest Gaussian (marginal) likelihood is selected.
Best Utility:
The GP whose suggested point has the highest expected improvement is selected for the next evaluation (sketched, together with Best Likelihood, after this list).
Weighted Best:
A weighting system is added to the previous approach.
Parallel Test:
All the points suggested by the GPs are evaluated in parallel.
Utility Mean:
The next point is selected by maximizing the mean of the acquisition functions.
Weighted Mixture:
The next point is selected by maximizing the weighted mean of the acquisition functions.
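
A hedged sketch of two of these criteria (Best Likelihood and Best Utility), using scikit-learn GPs as the parallel surrogate models; the kernel set and function names are assumptions for illustration, not the authors' bopy/bolib implementation:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern, RationalQuadratic

KERNELS = [RBF(), Matern(nu=1.5), Matern(nu=2.5), RationalQuadratic()]

def fit_all(X, y):
    """One GP per kernel, all fitted on the same evaluations."""
    return [GaussianProcessRegressor(kernel=k, normalize_y=True).fit(X, y)
            for k in KERNELS]

def best_likelihood(gps):
    """Best Likelihood: keep the GP with the highest log marginal likelihood."""
    return max(gps, key=lambda gp: gp.log_marginal_likelihood_value_)

def best_utility(gps, candidates, utility, f_best):
    """Best Utility: evaluate the point whose GP gives the highest utility
    (utility can be the expected_improvement function sketched earlier)."""
    best_u, best_x = -np.inf, None
    for gp in gps:
        mu, sigma = gp.predict(candidates, return_std=True)
        u = utility(mu, sigma, f_best)
        if u.max() > best_u:
            best_u, best_x = u.max(), candidates[np.argmax(u)]
    return best_x                            # next point to evaluate
```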

Experimental setup

  • 6 kernels.
  • 6 optimization problems (2 synthetic, 4 machine learning parameter tuning problems).
  • 7 adaptive vs. 7 fixed-kernel approaches (one per kernel, plus Static Random, which fixes a randomly chosen kernel).
  • Due to the random choice of the first solution, each experiment was repeated 30 times.
  • 2 kernel parameter tuning approaches: OPT and MCMC.
  • A Python framework (bopy/bolib) was developed for the experiments.

Experimental results

Branin-Hoo



Score  Strategy            Param. tuning
10     ParallelTest        -
10     Matern32            -
10     AdaptiveRandom      MCMC
8      WeightedBest        -
7      Matern52            MCMC
6      SquaredExponential  -
6      RationalQuadratic2  -
5      FixedAtRandom       -
...



Hartmann 6D



Score  Strategy            Param. tuning
10     AdaptiveRandom      MCMC
9      ParallelTest        MCMC
7      WeightedMixture     MCMC
5      UtilityMean         -
3      BestUtility         -
3      WeightedBest        MCMC
3      RationalQuadratic2  -
2      GammaExponential15  -
...



Log. Regression (no cv)



Score  Strategy            Param. tuning
1      Exponential         OPT
0      BestUtility         OPT
0      WeightedMixture     OPT
0      WeightedBest        OPT
0      ParallelTest        OPT
0      AdaptiveRandom      -
0      UtilityMean         OPT
0      GammaExponential15  -
...



Log. Regression (cv)



Score  Strategy            Param. tuning
3      ParallelTest        OPT
1      BestUtility         OPT
1      WeightedBest        OPT
0      Exponential         OPT
0      AdaptiveRandom      OPT
0      RationalQuadratic2  -
0      Matern52            -
0      FixedAtRandom       -
...



LDA on grid



Score  Strategy            Param. tuning
0      UtilityMean         OPT
0      WeightedMixture     OPT
0      Matern32            -
0      FixedAtRandom       -
0      AdaptiveRandom      -
0      ParallelTest        OPT
0      RationalQuadratic2  -
0      Matern52            -
...



SVM on grid *



Score  Strategy            Param. tuning
0      WeightedBest        OPT
0      ParallelTest        -
0      BestUtility         -
0      Exponential         OPT
0      AdaptiveRandom      -
0      Matern52            OPT
0      BestLikelihood      OPT
0      Matern32            -
...



Conclusions

  • Adaptive techniques improve on fixed-kernel approaches (even when the best kernel was selected).
  • They can be incorporated into BO in a straightforward way.
  • Among the adaptive techniques, Parallel Test produces good results.
  • Our results show that, for kernel parameter tuning, OPT is still the best choice for BO in most real-world scenarios.

Future Work

  • Population-based acquisition functions (batch evaluation).
  • Parameter tuning of evolutionary algorithms.

Eskerrik asko! (Thank you!)

Presentation created with flowtime.js

Images: intro slide by Angelo DeSantis (CC BY 2.0); thanks slide by Paul Stumpr (CC BY-SA 2.0); all others by Ibai Roman (CC BY-SA 2.0).