4. Training with the estimator API
The TensorFlow estimator is a high-level API that encapsulates several common functions that are useful during the development of deep learning models. These functions include (but are not limited to) training and evaluation methods, which provide the primary interface between the user and the underlying model. Because the estimator works at a higher level of abstraction than the session-based style common to many TensorFlow scripts, it is covered here as a separate topic. The estimator is supported in Graphcore’s TensorFlow release.
There are two primary facets of the estimator: the train method and the
evaluate method. In the train method, the user provides an input pipeline
which is a function that is called for fetching a mini-batch from the training
dataset. The number of iterations in the training loop can be specified by
providing a steps parameter. The state of the model is captured in the
checkpoint (.ckpt) file, which is stored in a specified model directory. The
evaluate method is typically used for model quality assessment, where certain
metrics are produced to quantify the performance of a model based on validation
data. The model under evaluation is fetched via a specified checkpoint (.ckpt)
file. Similar to the train method, the user is expected to provide the input
pipeline function and the number of steps when the evaluate method is invoked.
4.1. Instantiate an estimator for the IPU
The estimator is named IPUEstimator in the API. The following provides the
essential arguments for instantiating an IPUEstimator:
config: set the configuration of the estimator, which includes:
    IPU profiling configuration
    IPU selection configuration (number of IPUs to target, ID of IPU and so on)
    graph placement configuration: number of shards, number of replicas and so on
    logging configuration: parameters which control the logging frequency
    output configuration: directory for output checkpoint files
model_fn: definition of the model function. Refer to Write a model function for details on how to write this function.
model_dir: directory to save model parameters, graph and other data
params: hyperparameters to pass to the model function
warm_start_from: optional path to checkpoint file that you can use for a warm start
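The params argument is an ordinary Python dictionary. The following is a minimal sketch of one, using the keys that the abridged sample in section 4.2 reads; the values, the example directory path and the missing_keys helper are illustrative assumptions and not part of the Graphcore API.

```python
# Keys read by the abridged sample script in section 4.2.
REQUIRED_KEYS = {
    "num_devices",
    "iterations_per_loop",
    "log_interval",
    "summary_interval",
    "model_dir",
}

# Illustrative values only; choose values that suit your model and system.
params = {
    "num_devices": 1,            # IPUs to select, also used as replica count
    "iterations_per_loop": 100,  # steps run on the device per loop
    "log_interval": 100,         # passed to log_step_count_steps
    "summary_interval": 100,     # passed to save_summary_steps
    "model_dir": "/tmp/ipu_estimator_model",  # hypothetical output directory
}


def missing_keys(candidate):
    """Return the required keys absent from candidate, in sorted order."""
    return sorted(REQUIRED_KEYS - set(candidate))


# An empty list means the dictionary has every key the sample expects.
print(missing_keys(params))
```

Checking the dictionary before constructing the estimator turns a late KeyError inside the configuration code into an early, readable message.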
4.2. Abridged code sample for the estimator
The following is an abridged sample script that instantiates an estimator:
from tensorflow.python import ipu


def create_ipu_estimator(model_fn, model_dir, params):
    # Create IPU configuration
    ipu_options = ipu.config.IPUConfig()

    # IPU selection configuration
    ipu_options.auto_select_ipus = params['num_devices']

    # Graph placement configuration
    ipu_run_config = ipu.ipu_run_config.IPURunConfig(
        iterations_per_loop=params['iterations_per_loop'],
        ipu_options=ipu_options,
        num_shards=1,
        num_replicas=params['num_devices'],
        autosharding=False)

    # logging and output configuration
    config = ipu.ipu_run_config.RunConfig(
        ipu_run_config=ipu_run_config,
        log_step_count_steps=params['log_interval'],
        save_summary_steps=params['summary_interval'],
        model_dir=model_dir)

    # return an IPUEstimator instance
    return ipu.ipu_estimator.IPUEstimator(
        config=config,
        model_fn=model_fn,
        params=params)


# Instantiate an IPUEstimator
estimator = create_ipu_estimator(
    model_fn,
    model_dir=params['model_dir'],
    params=params)
4.3. Train and evaluate methods
Once an IPUEstimator is instantiated, you can run training and
evaluation with the train and evaluate methods. You can run these
as follows:
import functools

# partial is used to configure the training mode of the input function (input_fn)
train_input_fn = functools.partial(input_fn, params=params, is_training=True)
eval_input_fn = functools.partial(input_fn, params=params, is_training=False)

estimator.train(train_input_fn, steps=train_steps)
estimator.evaluate(eval_input_fn,
                   checkpoint_path=estimator.latest_checkpoint(),
                   steps=eval_steps)
For multiple epochs of training, you can calculate a number of steps according
to the number of samples in the data set and the batch size. Then you invoke
the train and evaluate methods in a loop of epochs as shown below:
for i in range(args.epochs):
    print("Training epoch {}/{}".format(i + 1, args.epochs))
    estimator.train(train_input_fn, steps=train_steps)
    estimator.evaluate(eval_input_fn,
                       checkpoint_path=estimator.latest_checkpoint(),
                       steps=eval_steps)
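The step counts passed to train and evaluate can be derived with a small helper. The following is a minimal sketch and not part of the Graphcore API: the name steps_per_epoch and its arguments are hypothetical, the sample counts are made up, and it assumes that each step consumes one mini-batch and that the step count should be a whole multiple of iterations_per_loop.

```python
def steps_per_epoch(num_samples, batch_size, iterations_per_loop=1):
    """Return the number of whole steps that fit in one pass over the data.

    Assumes one mini-batch per step. The result is rounded down to a
    multiple of iterations_per_loop, so a trailing partial loop is dropped.
    """
    if batch_size <= 0 or iterations_per_loop <= 0:
        raise ValueError("batch_size and iterations_per_loop must be positive")
    steps = num_samples // batch_size
    return (steps // iterations_per_loop) * iterations_per_loop


# Hypothetical data set sizes, for illustration only.
train_steps = steps_per_epoch(num_samples=50000, batch_size=32,
                              iterations_per_loop=100)
eval_steps = steps_per_epoch(num_samples=10000, batch_size=32)

print(train_steps, eval_steps)
```

Rounding down means a few samples at the end of each epoch are skipped. If the training input function shuffles and repeats the data set, different samples are skipped on each pass.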
The train_input_fn and eval_input_fn parameters, which are supplied
to the estimator methods, are the input pipeline functions. Their main
purpose is to extract feature and label pairs from examples in the
dataset. For more information about how these input pipelines can be composed,
search for input_fn in the evaluate section of the
TensorFlow Estimator documentation.
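As an illustration of the input_fn used above, the following is a minimal sketch built on tf.data with synthetic data. It is not taken from the Graphcore examples: the params keys num_samples, num_features and batch_size are hypothetical, and a real pipeline would read from files rather than generate random tensors. TensorFlow is imported inside the function only so that the sketch can be loaded without TensorFlow installed.

```python
import functools


def input_fn(params, is_training):
    """Return a tf.data.Dataset that yields (features, labels) batches."""
    import tensorflow as tf  # imported lazily; see the note above

    num_samples = params["num_samples"]    # hypothetical key
    num_features = params["num_features"]  # hypothetical key

    # Synthetic stand-in for a real data set.
    features = tf.random.uniform([num_samples, num_features])
    labels = tf.zeros([num_samples], dtype=tf.int32)
    dataset = tf.data.Dataset.from_tensor_slices((features, labels))

    if is_training:
        # Shuffle and repeat so that training can run for any step count.
        dataset = dataset.shuffle(num_samples).repeat()

    # drop_remainder=True keeps every batch the same shape.
    return dataset.batch(params["batch_size"], drop_remainder=True)


# Illustrative values only.
params = {"num_samples": 1024, "num_features": 4, "batch_size": 32}

# Bind everything except the call itself, as in the section above.
train_input_fn = functools.partial(input_fn, params=params, is_training=True)
eval_input_fn = functools.partial(input_fn, params=params, is_training=False)
```

Because the evaluation data set in this sketch is not repeated, the number of evaluation steps should not exceed the number of full batches it contains.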