4. Training with the estimator API

The TensorFlow Estimator is a high-level API that encapsulates several functions commonly needed when developing deep learning models. These include, but are not limited to, training and evaluation methods, which provide the primary interface between the user and the underlying model. Because of this level of abstraction, the Estimator departs from the session-scope paradigm common to many TensorFlow scripts, so it is reviewed as a separate topic. The Estimator is supported in Graphcore’s TensorFlow release.

There are two primary facets of the estimator: the train method and the evaluate method. For the train method, you provide an input pipeline: a function that is called to fetch mini-batches from the training dataset. You can set the number of iterations in the training loop with the steps parameter. The state of the model is captured in a checkpoint (.ckpt) file, which is stored in the specified model directory. The evaluate method is typically used to assess model quality: it produces metrics that quantify the performance of the model on validation data. The model under evaluation is loaded from a specified checkpoint (.ckpt) file. As with the train method, you provide the input pipeline function and the number of steps when you invoke the evaluate method.

4.1. Instantiate an estimator for the IPU

The estimator is named IPUEstimator in the API. The essential arguments for instantiating an IPUEstimator are:

  • config: set the configuration of the estimator, which includes:

    • IPU profiling configuration

    • IPU selection configuration (number of IPUs to target, ID of IPU and so on)

    • graph placement configuration: number of shards, number of replicas and so on

    • logging configuration: parameters which control the logging frequency

    • output configuration: directory for output checkpoint files

  • model_fn: definition of the model function. Refer to Write a model function for details on how to write this function.

  • model_dir: directory to save model parameters, graph and other data

  • params: hyperparameters to pass to the model function

  • warm_start_from: optional path to checkpoint file that you can use for a warm start

4.2. Abridged code sample for the estimator

The following is an abridged sample script that instantiates an estimator:

from tensorflow.python import ipu

def create_ipu_estimator(model_fn, model_dir, params):
   # Create IPU configuration
   ipu_options = ipu.config.IPUConfig()

   # IPU selection configuration
   ipu_options.auto_select_ipus = params['num_devices']

   # Graph placement configuration
   ipu_run_config = ipu.ipu_run_config.IPURunConfig(
         iterations_per_loop=params['iterations_per_loop'],
         ipu_options=ipu_options,
         num_shards=1,
         num_replicas=params['num_devices'],
         autosharding=False)

   # Logging and output configuration
   config = ipu.ipu_run_config.RunConfig(
         ipu_run_config=ipu_run_config,
         log_step_count_steps=params['log_interval'],
         save_summary_steps=params['summary_interval'],
         model_dir=model_dir)

   # return an IPUEstimator instance
   return ipu.ipu_estimator.IPUEstimator(
         config=config,
         model_fn=model_fn,
         params=params)

# Instantiate an IPUEstimator
estimator = create_ipu_estimator(
            model_fn,
            model_dir=params['model_dir'],
            params=params)
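
The sample above reads several keys from params without showing the dictionary itself. The following is a minimal sketch of such a dictionary, assuming only the key names used in the sample. The values and the check_params helper are hypothetical and are not part of the Graphcore API.

```python
# Keys read by create_ipu_estimator() and by the instantiation call above.
REQUIRED_KEYS = (
    "num_devices",
    "iterations_per_loop",
    "log_interval",
    "summary_interval",
    "model_dir",
)


def check_params(params):
    """Hypothetical helper: fail early if a key used by the sample is missing."""
    missing = [key for key in REQUIRED_KEYS if key not in params]
    if missing:
        raise KeyError("missing params keys: {}".format(", ".join(missing)))
    return params


# Illustrative values only; choose values that suit your model and system.
params = check_params({
    "num_devices": 2,            # used for auto_select_ipus and num_replicas
    "iterations_per_loop": 100,  # passed to IPURunConfig
    "log_interval": 100,         # passed as log_step_count_steps
    "summary_interval": 100,     # passed as save_summary_steps
    "model_dir": "/tmp/ipu_estimator_model",  # hypothetical checkpoint directory
})
```

Checking the keys before building the configuration means a missing entry is reported once, by name, rather than surfacing as a KeyError partway through create_ipu_estimator.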

4.3. Train and evaluate methods

Once an IPUEstimator is instantiated, you can run training and evaluation with the train and evaluate methods. You can run these as follows:

import functools

# partial is used to configure the training mode of the input function (input_fn)
train_input_fn = functools.partial(input_fn, params=params, is_training=True)

eval_input_fn = functools.partial(input_fn, params=params, is_training=False)

estimator.train(train_input_fn, steps=train_steps)
estimator.evaluate(eval_input_fn,
                   checkpoint_path=estimator.latest_checkpoint(),
                   steps=eval_steps)
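
To see what functools.partial does here, the sketch below uses a stand-in input_fn that returns a plain dictionary instead of a dataset. The body of the stand-in and the batch_size key are assumptions made for illustration; a real input_fn would build and return the input pipeline.

```python
import functools


def input_fn(params, is_training):
    """Stand-in for a real input function (hypothetical body).

    A real input_fn would build and return the dataset. This one only
    reports how it was configured, so the effect of partial is visible.
    """
    return {
        "batch_size": params["batch_size"],
        "shuffle": is_training,  # assumption: shuffle only when training
    }


params = {"batch_size": 32}  # illustrative value

# Bind both arguments now, leaving callables that take no arguments.
train_input_fn = functools.partial(input_fn, params=params, is_training=True)
eval_input_fn = functools.partial(input_fn, params=params, is_training=False)

train_config = train_input_fn()
eval_config = eval_input_fn()
print(train_config)
print(eval_config)
```

Because both arguments are bound in advance, the same input_fn definition serves training and evaluation, and each resulting callable can be invoked with no arguments.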

For multiple epochs of training, calculate the number of steps per epoch from the number of samples in the dataset and the batch size. Then invoke the train and evaluate methods in a loop over the epochs, as shown below:

for i in range(args.epochs):
   # Report epochs starting from 1 rather than 0
   print("Training epoch {}/{}".format(i + 1, args.epochs))
   estimator.train(train_input_fn, steps=train_steps)

   estimator.evaluate(eval_input_fn,
                      checkpoint_path=estimator.latest_checkpoint(),
                      steps=eval_steps)
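
The step counts used in the loop can be derived from the dataset size and the batch size. The following is a minimal sketch under stated assumptions: the helper name steps_per_epoch, the sample counts and the batch size are hypothetical, batch_size is taken to mean the number of samples consumed per step, and rounding down to a multiple of iterations_per_loop reflects the assumption that the estimator expects steps to be a multiple of that value.

```python
def steps_per_epoch(num_samples, batch_size, iterations_per_loop=1):
    """Hypothetical helper: whole steps in one pass over the dataset.

    The result is rounded down to a multiple of iterations_per_loop
    (assumption: steps should be a multiple of that value).
    """
    if num_samples <= 0 or batch_size <= 0 or iterations_per_loop <= 0:
        raise ValueError("all arguments must be positive")
    steps = num_samples // batch_size      # drop any final partial batch
    steps -= steps % iterations_per_loop   # round down to a whole number of loops
    if steps == 0:
        raise ValueError("dataset too small for one loop of iterations")
    return steps


# Illustrative numbers only.
train_steps = steps_per_epoch(num_samples=50000, batch_size=32,
                              iterations_per_loop=100)
eval_steps = steps_per_epoch(num_samples=10000, batch_size=32,
                             iterations_per_loop=100)
```

Rounding down rather than up means a few samples at the end of each epoch may be skipped; if that matters for your model, adjust the batch size or iterations_per_loop so that they divide the dataset evenly.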

The train_input_fn and eval_input_fn functions, which are supplied to the estimator methods, are the input pipelines. Their main job is to extract feature and label pairs from examples in the dataset. For more information about how these input pipelines can be composed, search for input_fn in the evaluate section of the TensorFlow Estimator documentation.