Cutting edge hyperparameter tuning made simple with ray tune
XiaoweiJiang7
Dec 18, 2021
Ray Tune: cutting-edge hyperparameter tuning made simple
Agenda
1. Hyperparameter tuning (HPO): whys and challenges
2. HPO methods offered by Ray Tune
3. Distributed HPO made simple by Ray Tune
4. Ray Tune APIs and integration with other ML libraries
5. Demo
6. Q&A
Hyperparameter tuning - what and why
What are hyperparameters?
Hyperparameters (set before training):
●Model type and architecture
●Learning and training related parameters
●Pipeline related parameters
Model parameters (learnt during training)
Why are hyperparameters important?
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Example pipeline: Imputer → Categorical encoder → Under/oversampler → XGBoost
●Imputer. Type: simple or iterative? Simple strategy: mean, median or constant?
●Categorical encoder. Type: one-hot encoding or label encoding?
●Under/oversampler. Type: SMOTE or random undersampling? Number of neighbors?
●XGBoost: 6 - 10 hyperparameters to tune
Total: 15 hyperparameters to tune!
●Hyperparameters cover not only model training but also data preprocessing and feature engineering
●They are relevant in both classical ML and DL models
●They have a significant impact on ML model/pipeline performance
Hyperparameter tuning is expensive
●Hyperparameter tuning is the trial-and-error process of finding the optimal hyperparameter configuration for a machine learning task.
●It is a black-box optimization problem over a non-convex, nonlinear, high-dimensional and noisy search space.
●There are a lot of configurations (trials) to try out.
●Evaluating each configuration involves training a model.
Ray Tune makes HPO easy
Cutting-edge optimization algorithms
By combining efficient algorithms with effective distributed execution!
With easy-to-use APIs!!
Ray Tune offers a wide collection of HPO algorithms
Exhaustive search (grid search)
●Cross product of all possible configurations
●Needs a discrete search space
●Simple and easy to parallelize, but inefficient
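The cross-product idea can be sketched in a few lines of plain Python. This is only an illustration, not Ray Tune's implementation: the search space, the `grid` helper and the `toy_score` function are hypothetical stand-ins, with `toy_score` replacing a real training run.

```python
from itertools import product

# Hypothetical discrete search space, loosely modelled on the pipeline example.
search_space = {
    "imputer_strategy": ["mean", "median", "constant"],
    "encoder": ["one_hot", "label"],
    "max_depth": [3, 6, 9],
}

def grid(space):
    """Yield every configuration in the cross product of the value lists."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))

def toy_score(config):
    """Stand-in for training and validating a model (higher is better)."""
    score = {"mean": 0.1, "median": 0.2, "constant": 0.0}[config["imputer_strategy"]]
    score += {"one_hot": 0.3, "label": 0.1}[config["encoder"]]
    score -= abs(config["max_depth"] - 6) * 0.05
    return score

configs = list(grid(search_space))   # 3 * 2 * 3 configurations
best = max(configs, key=toy_score)   # every configuration is evaluated
print(len(configs), best)
```

The number of configurations is the product of the list lengths, which is why the approach becomes impractical as hyperparameters are added.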
Random search
●Samples configurations randomly
●Generally superior to grid search
●Hard to beat in high-dimensional search spaces
●Easily parallelizable, but still inefficient!
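A minimal sketch of random sampling, using only the standard library. The hyperparameter names, their ranges and the `toy_loss` function are assumptions made for illustration; in practice the loss would come from training a model.

```python
import random

def sample_config(rng):
    """Draw one configuration; the ranges are illustrative assumptions."""
    return {
        "learning_rate": 10 ** rng.uniform(-4, -1),  # log-uniform in [1e-4, 1e-1]
        "max_depth": rng.randint(2, 10),             # inclusive integer range
        "subsample": rng.uniform(0.5, 1.0),
    }

def toy_loss(config):
    """Stand-in for a validation loss (lower is better)."""
    return (abs(config["learning_rate"] - 0.01) * 10
            + abs(config["max_depth"] - 6) * 0.1
            + (1.0 - config["subsample"]))

def random_search(num_trials, seed=0):
    """Sample num_trials independent configurations and keep the best one."""
    rng = random.Random(seed)
    trials = [sample_config(rng) for _ in range(num_trials)]
    return min(trials, key=toy_loss)

best_config = random_search(num_trials=50)
print(best_config)
```

Because every trial is drawn independently, the trials can run in any order or in parallel, and continuous ranges need no discretization.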
Bayesian optimization
●Uses information from previous configurations to decide which configuration to try next
●Builds a surrogate model
●Different approaches to build this surrogate model
●Inherently sequential
https://www.wikiwand.com/en/Hyperparameter_optimization
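The loop below sketches the general shape of sequential model-based optimization on a one-dimensional toy problem. It is a deliberately simplified assumption, not what any Ray Tune searcher does: the surrogate is an inverse-distance-weighted average rather than a Gaussian process or TPE, and `objective`, `surrogate`, `suggest` and `optimise` are hypothetical names.

```python
import random

def objective(x):
    """Stand-in for an expensive training run; returns a loss to minimise."""
    return (x - 0.3) ** 2

def surrogate(x, history):
    """Cheap model of the objective built from past (x, loss) pairs.

    Returns (predicted_loss, uncertainty), where the uncertainty proxy is
    the distance to the nearest point that has already been evaluated.
    """
    nearest = min(abs(x - xi) for xi, _ in history)
    if nearest == 0.0:
        # x was already evaluated: return its observed loss, no uncertainty
        return next(loss for xi, loss in history if xi == x), 0.0
    weights = [1.0 / abs(x - xi) ** 2 for xi, _ in history]
    predicted = sum(w * loss for w, (_, loss) in zip(weights, history)) / sum(weights)
    return predicted, nearest

def suggest(history, candidates, kappa=1.0):
    """Pick the candidate with the lowest 'predicted loss minus uncertainty'."""
    def acquisition(x):
        predicted, uncertainty = surrogate(x, history)
        return predicted - kappa * uncertainty
    return min(candidates, key=acquisition)

def optimise(num_trials=15, seed=0):
    rng = random.Random(seed)
    candidates = [i / 100 for i in range(101)]
    # a few random warm-up evaluations before the surrogate is used
    history = [(x, objective(x)) for x in rng.sample(candidates, 3)]
    for _ in range(num_trials - 3):
        x = suggest(history, candidates)    # depends on all previous results
        history.append((x, objective(x)))   # hence one trial at a time
    return min(history, key=lambda pair: pair[1])

best_x, best_loss = optimise()
print(best_x, best_loss)
```

Each suggestion depends on every earlier result, which is where the "inherently sequential" property comes from.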
Early stopping
●Use intermediate results (epochs, trees) to prune underperforming trials, saving time and computing resources
●Median stopping, HyperBand, ASHA
●Inherently parallelizable
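As an illustration of pruning on intermediate results, here is a sketch of synchronous successive halving, the building block that HyperBand and ASHA extend. The toy training curve and all names are assumptions; real learning curves are noisier, so early rankings are less reliable than in this example.

```python
import random

def train_one_step(trial):
    """Stand-in for one epoch: the loss decays toward a per-trial floor."""
    trial["loss"] = trial["floor"] + 0.5 * (trial["loss"] - trial["floor"])
    return trial["loss"]

def successive_halving(num_trials=8, steps_per_rung=2, seed=0):
    rng = random.Random(seed)
    # 'floor' plays the role of the hidden quality of each configuration
    trials = [{"id": i, "floor": rng.random(), "loss": 2.0, "steps": 0}
              for i in range(num_trials)]
    alive = trials
    while len(alive) > 1:
        # train every surviving trial a little further
        for trial in alive:
            for _ in range(steps_per_rung):
                train_one_step(trial)
                trial["steps"] += 1
        # keep the better half, judged on intermediate losses only
        alive = sorted(alive, key=lambda t: t["loss"])[:len(alive) // 2]
    return alive[0], trials

winner, trials = successive_halving()
total_steps = sum(t["steps"] for t in trials)
print(winner["id"], total_steps)
```

With 8 trials and 2 steps per rung, the survivors receive 2, 4 and 6 steps, for 28 training steps in total, compared with 48 if all 8 trials were trained for 6 steps.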
BOHB
●Standard bandit algorithms use random search, so they make uninformed decisions. Solution: "BOHB: Robust and Efficient Hyperparameter Optimization at Scale" by Falkner et al.
●Combines HyperBand with BO (Bayesian optimization) to make informed decisions based on partial results
●Parallelizable
HyperSched
●Standard bandit algorithms aren't deadline aware. Solution: "HyperSched: Dynamic Resource Reallocation for Model Development on a Deadline" by Liaw et al.
●Reallocates resources to prioritize promising trials
https://arxiv.org/abs/2001.02338
BlendSearch
●Standard HPO methods try to minimize the number of iterations, but not the actual execution time (cost). Solution: "Economic hyperparameter optimization with blended search strategy" by Wang et al.
●Combines global search with directed local search
●Aware of hyperparameter cost and deadlines: tries to first choose configurations that are cheap to evaluate
●Can be combined with bandit pruning
https://openreview.net/forum?id=VbLH04pRA3
Ray Tune manages distributed HPO for you
What is Ray?
Key concepts
●Execute functions remotely as tasks, and instantiate classes remotely as actors
○Supports both stateless and stateful computations
●Asynchronous execution using futures
○Enables parallelism
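The futures idea can be illustrated with Python's standard library alone. This is an analogy, not Ray code: `ThreadPoolExecutor.submit` stands in for launching a remote task and `Future.result` for fetching its value, and the `evaluate` function is a hypothetical placeholder for a training run.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate(config):
    """Stateless task: the result depends only on the input configuration."""
    return {"lr": config["lr"], "loss": (config["lr"] - 0.01) ** 2}

configs = [{"lr": lr} for lr in (0.001, 0.01, 0.1)]

with ThreadPoolExecutor(max_workers=3) as pool:
    # submit() returns a future immediately, so all three tasks can overlap
    futures = [pool.submit(evaluate, c) for c in configs]
    # result() blocks until the corresponding task has finished
    results = [f.result() for f in futures]

best = min(results, key=lambda r: r["loss"])
print(best)
```

Submitting first and collecting afterwards is what allows the work to run in parallel; calling `result()` right after each `submit()` would serialize it.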
Trainer Class
# config and Trainer._setup() are assumed to be defined elsewhere
class Trainer(object):
    def __init__(self, config):
        self._iter = 0
        self._config = config
        self._setup()

    def step(self):
        # train for one iteration
        self._iter += 1
        return {"iter": self._iter}

t = Trainer(config=config)
train_result = t.step()
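Trainer Class → Actor
import ray

# The decorator turns the class into a Ray actor that reserves one GPU
@ray.remote(num_gpus=1)
class Trainer(object):
    def __init__(self, config):
        self._iter = 0
        self._config = config
        self._setup()

    def step(self):
        # train for one iteration
        self._iter += 1
        return {"iter": self._iter}

# Not valid yet: an actor class cannot be instantiated directly,
# so these two calls must become Trainer.remote(...) and step.remote()
t = Trainer(config=config)
train_result = t.step()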
Trainer Class → Actor
import ray

# Runs in worker processes
@ray.remote(num_gpus=1)
class Trainer(object):
    def __init__(self, config):
        self._iter = 0
        self._config = config
        self._setup()

    def step(self):
        # train for one iteration
        self._iter += 1
        return {"iter": self._iter}

# Runs in driver process
t_handle = Trainer.remote(config=config)      # create the actor
train_result_future = t_handle.step.remote()  # returns a future immediately
train_result = ray.get(train_result_future)   # block until the result is ready
Ray Tune in Ray Architecture
[Diagram: a head node and a worker node, each hosting several worker processes. Ray Tune runs as the driver process and contains the TrialRunner and the Searcher/Scheduler. Ray Core takes care of distributed orchestration, scheduling and the object store.]
Woohoo!
Let’s review what we have talked about.
What makes Ray Tune special?
●Wealth of efficient search and scheduling algorithms (everything we have talked about previously, and more!)
●Leverages Ray for distributed HPO, which lets you save time by running trials in parallel
○Same code for HPO on a laptop and a cluster
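Tune API example
from ray import tune

def train_model(config):
    # ConvNet and epochs are assumed to be defined elsewhere
    model = ConvNet(config)
    for i in range(epochs):
        current_loss = model.train()
        # report the intermediate loss to Tune after every epoch
        tune.report(loss=current_loss)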