by Ibai Roman

**ISG** at EHU

- Many machine-learning algorithms need to be fine-tuned to guarantee good performance.
- Sometimes manual optimization is **intractable**.

- Finding the best parameter set can be seen as a nonlinear function optimization problem: $$ \textbf{x}^* = \operatorname{argmin}_{\textbf{x} \in A \subset \mathbb{R}^{n_{d}}} f(\textbf{x}) $$
- The objective function is a **black-box** function.
- The objective function is frequently very **expensive to evaluate**.
- The best solution must be achieved with as few function evaluations as possible.
- Bayesian Optimization (BO) is suitable for this kind of optimization problem.

- **Sequential** global optimization algorithm.
- As the analytic form of the objective function is generally unknown, the **sampling strategy** is based on a **probability distribution** over all the possible objective functions.
- The sampling strategy uses a utility function ($u(\cdot)$), also called the acquisition function, which provides a measure of the utility of each candidate solution.
- The probability distribution acts as a surrogate model ($SM$), and it is updated every time a solution is evaluated with the objective function.
- Why Bayesian? The prior belief over functions is updated with the likelihood of the observations, generating a posterior distribution over functions: $$ P\left ( f|D_{1:t} \right ) \propto P\left ( D_{1:t}|f \right )P\left ( f \right ) $$
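As a toy illustration of this sequential loop, the sketch below runs BO on a 1-D function with a zero-mean GP surrogate (squared-exponential kernel) and Expected Improvement maximized over a fixed grid. All names, kernel settings, and the grid-based acquisition search are illustrative simplifications, not the implementation behind these slides.

```python
import math
import numpy as np

# Toy sequential BO loop: fit surrogate -> maximize acquisition -> evaluate.
def kern(a, b, ell=0.3):
    """Squared-exponential kernel matrix between 1-D input vectors."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    """Posterior mean and stddev of a zero-mean GP at grid points Xs."""
    K = kern(X, X) + noise * np.eye(len(X))
    Ks = kern(X, Xs)
    mu = Ks.T @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, f_best):
    """EI for minimization (sigma is strictly positive on the grid here)."""
    z = (f_best - mu) / sigma
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return sigma * (z * cdf + pdf)

def bo_minimize(f, lo, hi, n_init=3, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(lo, hi, n_init)        # random initial design
    y = np.array([f(x) for x in X])
    grid = np.linspace(lo, hi, 201)        # candidate points for acquisition
    for _ in range(n_iter):
        mu, sigma = gp_posterior(X, y, grid)
        x_next = grid[np.argmax(expected_improvement(mu, sigma, y.min()))]
        X = np.append(X, x_next)
        y = np.append(y, f(x_next))        # the only expensive evaluations
    return X[np.argmin(y)], y.min()
```

For example, `bo_minimize(lambda x: (x - 0.5) ** 2, 0.0, 1.0)` locates the minimum near 0.5 with only a handful of function evaluations.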

- One of the most popular choices as $SM$ in BO.
- **Definition**: A GP is a collection of random variables, any finite number of which has a multivariate Gaussian distribution.
- Key property for BO: the posterior distribution of a GP after sampling the objective function is **also a GP**.
- Intuitively, an infinite-dimensional multivariate Gaussian.
- GPs are completely defined by a mean function and a covariance function, which depends on a **kernel**: $$ f(\textbf{x}) \sim GP(m(\textbf{x}), k(\textbf{x},\textbf{x}')) $$
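The closed-form GP posterior can be written down directly; a minimal sketch with a zero mean function and a squared-exponential kernel (function names are illustrative):

```python
import numpy as np

# Posterior of a zero-mean GP conditioned on (nearly) noise-free
# observations (X, y), using the standard closed-form equations.
def kernel(a, b, ell=1.0):
    """Squared-exponential kernel matrix between 1-D input vectors."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def gp_posterior(X, y, Xs, noise=1e-8):
    K = kernel(X, X) + noise * np.eye(len(X))   # train covariance
    Ks = kernel(X, Xs)                          # train/test covariance
    Kss = kernel(Xs, Xs)                        # test covariance
    mu = Ks.T @ np.linalg.solve(K, y)           # posterior mean
    cov = Kss - Ks.T @ np.linalg.solve(K, Ks)   # posterior covariance
    return mu, cov

# In the noise-free limit, the posterior interpolates the observations:
X = np.array([-1.0, 0.0, 1.0])
y = np.sin(X)
mu, cov = gp_posterior(X, y, X)
```

Note that the posterior is again fully specified by a mean and a covariance, which is the "also a GP" property exploited by BO.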

- Establishes the relation between the objective function values of each pair of solutions.
- Usually has its own parameters, which need to be tuned. Two major approaches: OPT and MCMC.
- The choice of the kernel and its parameters is **crucial** for BO performance.
- Kernel function examples:
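As a sketch, some common stationary kernels written as functions of the distance $r = |\textbf{x} - \textbf{x}'|$ with lengthscale $\ell$ (the parameterizations below are common textbook forms; names are illustrative):

```python
import numpy as np

# Common stationary kernel functions of r = |x - x'| with lengthscale ell.
def squared_exponential(r, ell=1.0):
    return np.exp(-0.5 * (r / ell) ** 2)

def matern32(r, ell=1.0):
    s = np.sqrt(3.0) * r / ell
    return (1.0 + s) * np.exp(-s)

def matern52(r, ell=1.0):
    s = np.sqrt(5.0) * r / ell
    return (1.0 + s + s ** 2 / 3.0) * np.exp(-s)

def rational_quadratic(r, ell=1.0, alpha=2.0):
    return (1.0 + r ** 2 / (2.0 * alpha * ell ** 2)) ** (-alpha)

def exponential(r, ell=1.0):
    return np.exp(-r / ell)

def gamma_exponential(r, ell=1.0, gamma=1.5):
    return np.exp(-((r / ell) ** gamma))
```

Each kernel equals 1 at $r = 0$ and decays with distance; they differ mainly in the smoothness they impose on the GP sample functions.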

- The sampling strategy is based on the acquisition function.
- Selects the next point to evaluate.
- Provides a measure of **utility**.
- Regulates the exploration versus exploitation trade-off.
- **Secondary optimization problem**.
- Example: Expected Improvement $$ \begin{split} u_{EI}(\textbf{x}) &= \begin{cases} \sigma(\textbf{x}) \left( Z \cdot \Phi(Z) + \phi(Z) \right) & \quad \text{if } \sigma(\textbf{x}) > 0\\ 0 & \quad \text{if } \sigma(\textbf{x}) = 0\\ \end{cases} \\ \text{where}\quad Z &= \begin{cases} \frac{f(\textbf{x}^+)-\mu(\textbf{x})}{\sigma(\textbf{x})} & \quad \text{if } \sigma(\textbf{x}) > 0\\ 0 & \quad \text{if } \sigma(\textbf{x}) = 0\\ \end{cases} \end{split} $$
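The piecewise EI definition for minimization can be implemented directly, with $\Phi$ and $\phi$ the standard normal CDF and PDF (a minimal sketch; function names are illustrative):

```python
import math
import numpy as np

# Expected Improvement for minimization:
# u_EI(x) = sigma(x) * (Z * Phi(Z) + phi(Z)) if sigma(x) > 0, else 0,
# with Z = (f(x+) - mu(x)) / sigma(x) and f(x+) the best value so far.
def norm_pdf(z):
    return np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))

def expected_improvement(mu, sigma, f_best):
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    safe_sigma = np.where(sigma > 0, sigma, 1.0)   # avoid division by zero
    z = np.where(sigma > 0, (f_best - mu) / safe_sigma, 0.0)
    ei = sigma * (z * norm_cdf(z) + norm_pdf(z))
    return np.where(sigma > 0, ei, 0.0)            # EI is 0 when sigma = 0
```

Points with a low predicted mean (exploitation) or a high predictive variance (exploration) both receive high EI, which is how the trade-off is regulated.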

- The choice of the kernel and its parameters is crucial for BO performance.
- It seems reasonable to select them automatically, avoiding yet another parameter optimization problem.
- Although kernel parameter tuning has attracted much attention, adaptive kernel selection has not been extensively studied in the BO research field.
- We present a general framework for adaptive kernel selection in BO.
- We propose to carry out the optimization process with several GPs in parallel (**each one with a different kernel**), adaptively choosing one of them.
- By adaptively choosing the kernel during the optimization process, there is no need to select a kernel in advance.

- Adaptive Random:
  - A GP will be randomly selected.
- Best Likelihood:
  - Relies on the Gaussian likelihood function to select the best GP.
- Best Utility:
  - The GP with the highest expected improvement is selected for the next evaluation.
- Weighted Best:
  - A weighting system is added to the previous approach.
- Parallel Test:
  - All the points suggested by the GPs are evaluated in parallel.
- Utility Mean:
  - The next point is selected by maximizing the mean of the acquisition functions.
- Weighted Mixture:
  - The next point is selected by maximizing the weighted mean of the acquisition functions.
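To illustrate one of these strategies, the sketch below implements a Best Likelihood-style selection: given the same observations, each candidate GP (one per kernel) is scored by its log marginal likelihood and the best one is kept. This is a hypothetical sketch, not the bolib implementation.

```python
import numpy as np

# One GP per candidate kernel; the "Best Likelihood" strategy keeps the
# GP whose log marginal likelihood on the current data is highest.
def se(a, b, ell=1.0):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def matern32(a, b, ell=1.0):
    s = np.sqrt(3.0) * np.abs(a[:, None] - b[None, :]) / ell
    return (1.0 + s) * np.exp(-s)

def exponential(a, b, ell=1.0):
    return np.exp(-np.abs(a[:, None] - b[None, :]) / ell)

def log_marginal_likelihood(kern, X, y, noise=1e-6):
    """Standard zero-mean GP log marginal likelihood via Cholesky."""
    K = kern(X, X) + noise * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.log(np.diag(L)).sum()
            - 0.5 * len(X) * np.log(2.0 * np.pi))

def best_likelihood(kernels, X, y):
    scores = {name: log_marginal_likelihood(k, X, y) for name, k in kernels.items()}
    return max(scores, key=scores.get)

kernels = {"SquaredExponential": se, "Matern32": matern32, "Exponential": exponential}
X = np.linspace(-2.0, 2.0, 15)
y = np.sin(2.0 * X)                 # smooth data tends to favour smoother kernels
chosen = best_likelihood(kernels, X, y)
```

The other strategies plug into the same structure: e.g. Best Utility would score each GP by its maximum acquisition value instead of its likelihood.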

- 6 kernels.
- 6 optimization problems (2 synthetic, 4 machine-learning parameter tuning problems).
- 7 adaptive vs. 7 fixed-kernel approaches (one per kernel + Static Random).
- Due to the random choice of the first solution, each experiment was repeated **30 times**.
- 2 kernel parameter tuning approaches: OPT and MCMC.
- A Python framework, called ~~bopy~~ **bolib**, was developed.

Branin-Hoo

| Score | Strategy           | Comp. |
|-------|--------------------|-------|
| 10    | ParallelTest       | -     |
| 10    | Matern32           | -     |
| 10    | AdaptativeRandom   | MCMC  |
| 8     | WeightedBest       | -     |
| 7     | Matern52           | MCMC  |
| 6     | SquaredExponential | -     |
| 6     | RationalQuadratic2 | -     |
| 5     | FixedAtRandom      | -     |
| ...   |                    |       |

Hartmann 6D

| Score | Strategy           | Comp. |
|-------|--------------------|-------|
| 10    | AdaptativeRandom   | MCMC  |
| 9     | ParallelTest       | MCMC  |
| 7     | WeightedMixture    | MCMC  |
| 5     | UtilityMean        | -     |
| 3     | BestUtility        | -     |
| 3     | WeightedBest       | MCMC  |
| 3     | RationalQuadratic2 | -     |
| 2     | GammaExponential15 | -     |
| ...   |                    |       |

Log. Regression (no cv)

| Score | Strategy           | Comp. |
|-------|--------------------|-------|
| 1     | Exponential        | OPT   |
| 0     | BestUtility        | OPT   |
| 0     | WeightedMixture    | OPT   |
| 0     | WeightedBest       | OPT   |
| 0     | ParallelTest       | OPT   |
| 0     | AdaptativeRandom   | -     |
| 0     | UtilityMean        | OPT   |
| 0     | GammaExponential15 | -     |
| ...   |                    |       |

Log. Regression (cv)

| Score | Strategy           | Comp. |
|-------|--------------------|-------|
| 3     | ParallelTest       | OPT   |
| 1     | BestUtility        | OPT   |
| 1     | WeightedBest       | OPT   |
| 0     | Exponential        | OPT   |
| 0     | AdaptativeRandom   | OPT   |
| 0     | RationalQuadratic2 | -     |
| 0     | Matern52           | -     |
| 0     | FixedAtRandom      | -     |
| ...   |                    |       |

LDA on grid

| Score | Strategy           | Comp. |
|-------|--------------------|-------|
| 0     | UtilityMean        | OPT   |
| 0     | WeightedMixture    | OPT   |
| 0     | Matern32           | -     |
| 0     | FixedAtRandom      | -     |
| 0     | AdaptativeRandom   | -     |
| 0     | ParallelTest       | OPT   |
| 0     | RationalQuadratic2 | -     |
| 0     | Matern52           | -     |
| ...   |                    |       |

SVM on grid *

| Score | Strategy         | Comp. |
|-------|------------------|-------|
| 0     | WeightedBest     | OPT   |
| 0     | ParallelTest     | -     |
| 0     | BestUtility      | -     |
| 0     | Exponential      | OPT   |
| 0     | AdaptativeRandom | -     |
| 0     | Matern52         | OPT   |
| 0     | BestLikelihood   | OPT   |
| 0     | Matern32         | -     |
| ...   |                  |       |

- Adaptive techniques **improve on** fixed-kernel approaches (even when the best kernel was selected).
- They can be incorporated into BO in a straightforward way.
- Among the adaptive techniques, we have shown that **Parallel Test** produces good results.
- Our results show that in most **real-world** scenarios OPT is still the best choice for BO.

- Population-based acquisition functions (batch evaluation).
- Parameter tuning of evolutionary algorithms.

Presentation created with flowtime.js

Images by: Intro, Angelo DeSantis (CC BY 2.0). Thanks, Paul Stumpr (CC BY-SA 2.0). Others, Ibai Roman (CC BY-SA 2.0).