12. IPU-optimised operations
Several custom versions of operators are provided to target functions available in PopLibs. See the TensorFlow Python API for more details.
12.1. Image operations
Our architecture is well suited to efficiently handling convolutions over four-channel tensors; however, images are commonly represented with three channels. To obtain better IPU performance, in terms of both latency and memory, we advise padding the channel dimension to four when dealing with three-channel inputs.
See tensorflow.python.ipu.image_ops.normalise_image() for the op that can perform this padding, in addition to normalising and casting if needed. Note that this padding will be performed on-device, after the data has been transferred to the IPU.
An example of its use can be found in the fused_normalise_image() function in the CNN training application example in Graphcore’s examples repository on GitHub.
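For illustration, the following plain-TensorFlow sketch performs an equivalent cast, normalisation and channel padding to that carried out by the fused normalise_image() op. The per-channel offsets and scales shown are only example values (the commonly used ImageNet statistics), and the exact arguments of the IPU op should be taken from the API reference.

    import tensorflow as tf

    def normalise_and_pad(images):
        # Example per-channel statistics (here, the usual ImageNet values).
        offsets = tf.constant([0.485, 0.456, 0.406], dtype=tf.float16)
        scales = tf.constant([1 / 0.229, 1 / 0.224, 1 / 0.225], dtype=tf.float16)

        # Cast the uint8 NHWC input into [0, 1], then normalise each channel.
        images = tf.cast(images, tf.float16) / 255.0
        images = (images - offsets) * scales

        # Pad the channel dimension from three channels to four.
        return tf.pad(images, [[0, 0], [0, 0], [0, 0], [0, 1]])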
12.2. Matmul serialisation
Matrix multiplications can be serialised along a particular dimension, reducing the code size and the temporary memory requirements of the matmul at the expense of extra computation.
See tensorflow.python.ipu.math_ops.serialized_matmul() for details of the op. An example of its use can be found in the mlm_head() function in the BERT application example in Graphcore’s examples repository on GitHub.
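To illustrate what serialisation means, the following plain-TensorFlow sketch computes the same result as a single matmul by splitting one operand and concatenating the partial products; serialized_matmul() expresses this as a single op on the IPU. The dimension and factor chosen here are only an example, and the arguments accepted by the op are listed in the API reference.

    import tensorflow as tf

    def serialised_matmul_sketch(a, b, serialization_factor=4):
        # Equivalent to tf.matmul(a, b), but computed as `serialization_factor`
        # smaller matmuls over slices of the columns of b. Each smaller
        # multiplication needs less code and smaller temporary buffers than one
        # large matmul, at the cost of extra computation.
        b_slices = tf.split(b, serialization_factor, axis=-1)
        partials = [tf.matmul(a, b_slice) for b_slice in b_slices]
        return tf.concat(partials, axis=-1)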
12.3. Dropout
The PopLibs version of dropout does not need to store the dropout mask between the forward and backward parts of the graph, saving memory.
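A minimal usage sketch is shown below; the rand_ops module path and the rate keyword are assumptions to be checked against the API reference.

    import tensorflow as tf
    from tensorflow.python.ipu import rand_ops

    def dense_with_dropout(x, weights, rate=0.1):
        # The PopLibs dropout does not need to store the dropout mask between
        # the forward and backward parts of the graph, saving memory.
        y = tf.matmul(x, weights)
        return rand_ops.dropout(y, rate=rate)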
12.4. Embedding lookup
This is a version of embedding lookup that has been optimised for the IPU. It allows the lookup to be serialised into smaller lookups, which can reduce peak memory usage at the cost of extra computation when the embedding tensors are used by multiple operations.
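A minimal usage sketch, assuming the embedding_ops.embedding_lookup() path and a serialization_factor argument that controls how many smaller lookups the operation is split into (check the API reference for the exact signature).

    from tensorflow.python.ipu import embedding_ops

    def lookup_tokens(embedding_table, token_ids):
        # Serialise the lookup into four smaller lookups to reduce peak memory
        # usage, at the cost of some extra computation.
        return embedding_ops.embedding_lookup(embedding_table, token_ids,
                                              serialization_factor=4)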
12.5. Group normalisation
Group normalisation is an alternative to batch normalisation, and produces smaller and more optimised graphs.
The original paper on group normalisation is “Group Normalization”, Yuxin Wu, Kaiming He.
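A minimal usage sketch, assuming the op lives in normalization_ops and accepts a groups argument (see the API reference for the full signature).

    from tensorflow.python.ipu import normalization_ops

    def normalise_features(x):
        # Normalise an NHWC activation tensor over groups of channels.
        return normalization_ops.group_norm(x, groups=8)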
12.6. Instance normalisation
Instance normalisation is another alternative to batch normalisation.
The original paper on instance normalisation is “Instance Normalization: The Missing Ingredient for Fast Stylization”, Dmitry Ulyanov, Andrea Vedaldi, Victor Lempitsky.
See tensorflow.python.ipu.normalization_ops.instance_norm().
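A minimal usage sketch with default parameters, using the op referenced above.

    from tensorflow.python.ipu import normalization_ops

    def normalise_instance(x):
        # Normalise each example independently, per channel.
        return normalization_ops.instance_norm(x)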
12.7. Layer normalisation
Layer normalisation is another alternative to batch normalisation.
The original paper on layer normalisation is “Layer Normalization”, Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton.
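A minimal usage sketch, assuming a normalization_ops.layer_norm() op analogous to the group and instance variants (see the API reference for the exact signature).

    from tensorflow.python.ipu import normalization_ops

    def normalise_layer(x):
        # Normalise across the features of each example.
        return normalization_ops.layer_norm(x)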
12.8. GeLU activation
The Gaussian error linear unit (GeLU) is an alternative to the ReLU non-linearity. It is described in “Gaussian Error Linear Units (GELUs)”, Dan Hendrycks, Kevin Gimpel.
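A minimal usage sketch, assuming the op is exposed as nn_ops.gelu(); the comment states the function that GeLU computes.

    import tensorflow as tf
    from tensorflow.python.ipu import nn_ops

    def feed_forward(x, weights):
        # GeLU(x) = x * Phi(x), where Phi is the cumulative distribution
        # function of the standard normal distribution.
        return nn_ops.gelu(tf.matmul(x, weights))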
12.9. Sequence slice
A set of sequence slicing ops is provided for the IPU. See tensorflow.python.ipu.slicing_ops.sequence_slice(), tensorflow.python.ipu.slicing_ops.sequence_slice_unpack() and tensorflow.python.ipu.slicing_ops.sequence_slice_pack().
12.10. Histogram
A set of histogram ops is provided for the IPU. See tensorflow.python.ipu.statistics_ops.histogram(), tensorflow.python.ipu.statistics_ops.histogram_update(), tensorflow.python.ipu.statistics_ops.fixed_width_bins() and tensorflow.python.ipu.statistics_ops.histogram_normalize().