8. Graphcore Communication Library (GCL)

The Graphcore Communication Library (GCL) enables high-performance scale-out for IPU systems. GCL utilises the IPU's built-in hardware support for transferring data directly from the memory of one IPU to another via the IPU-Fabric. The result is a low-overhead, high-throughput communication library, specifically targeted at systems such as the IPU-POD128.

GCL is used by frameworks, such as TensorFlow, to implement operations such as data-parallel gradient reductions using all-reduce.

8.1. Example

A full example of an all-reduce operation using GCL is available. The graph creation code is shown in Listing 8.1.

Listing 8.1 gcl_allreduce_example.cpp
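  // Main program
  program::Sequence prog;
  // Copy the input from the host stream into the data tensor
  prog.add(program::Copy(inStream, data));
  // Sum the data tensor across all replicas, in place
  gcl::allReduceInPlaceCrossReplica(graph, data,
                                    popops::CollectiveOperator::ADD, prog);
  // Copy the reduced result back to the host stream
  prog.add(program::Copy(data, outStream));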

You can download the complete code, then compile and run it with the command:

$ g++ gcl_allreduce_example.cpp -lpoplar -lgcl_ct -lpopops \
      -o gcl_allreduce_example && ./gcl_allreduce_example

Download gcl_allreduce_example.cpp

For more information see the GCL API reference.
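
To make the semantics of the all-reduce in Listing 8.1 concrete, the following is a minimal host-only sketch that does not use Poplar or GCL. It models each replica's tensor as a plain vector and shows the result expected from an in-place all-reduce with the ADD operator: every replica ends up holding the element-wise sum over all replicas. The function name allReduceAddInPlace is hypothetical and chosen for illustration; it says nothing about how GCL performs the reduction on the IPU-Fabric.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-only model of an in-place all-reduce with ADD (illustrative helper,
// not part of the GCL API). replicas[r] stands for the tensor held by
// replica r; on return, every replica holds the element-wise sum.
void allReduceAddInPlace(std::vector<std::vector<float>> &replicas) {
  if (replicas.empty()) {
    return;
  }
  const std::size_t numElements = replicas.front().size();

  // Accumulate the element-wise sum over all replicas.
  std::vector<float> sum(numElements, 0.0f);
  for (const auto &replica : replicas) {
    assert(replica.size() == numElements &&
           "all replicas must hold the same number of elements");
    for (std::size_t i = 0; i < numElements; ++i) {
      sum[i] += replica[i];
    }
  }

  // Write the reduced result back to every replica ("in place").
  for (auto &replica : replicas) {
    replica = sum;
  }
}
```

For example, calling this function on three replicas holding {1, 2}, {3, 4} and {5, 6} leaves {9, 12} in each of them, which is the behaviour a data-parallel gradient reduction relies on.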

8.2. Topologies

8.2.1. Physical topologies

There are two ways of connecting IPU-Links and sync signals: in a mesh or as a torus. The mesh structure is similar to a ladder, where pairs of IPUs form each rung. In a torus, the ends of the “ladder” loop round to form a closed loop. See Fig. 8.1.

GCL supports both of these topologies. Each has different restrictions on traffic flow and replica size, which are described in the following section.

_images/ladder-torus.png

Fig. 8.1 Ladder and torus topologies used by GCL

8.2.2. Logical topologies

GCL supports a number of logical topologies that describe the traffic flow in the physical topology. Fig. 8.2 illustrates these topologies.

_images/col-topologies.png

Fig. 8.2 Logical topologies used by GCL

The following logical topologies are supported:

  • peripheral-ring is only relevant for replica size 1. The traffic follows a single ring on the periphery of the IPU-Link mesh. Assuming replica numbers are assigned linearly from the bottom and the communication-group size is even, the communication will follow this pattern:

    0 - 1 - 3 - ... - <comm_size-3> - <comm_size-1> - <comm_size-2> - <comm_size-4> - ... - 4 - 2 - 0
    
  • barley-twist is only relevant for replica size 1 on an IPU-Link torus (that is, with loop-back cables). The traffic is split over two concurrent rings, forming a dual-serpent-like pattern through the IPU-Link torus. In this way, all eight IPU-Links are used for communication, giving the best available bandwidth. The communication follows this pseudocode pattern:

    int next_addr = (stream == barley_twist_X) ? 0 : 1;
    for (int duo_step = 0; duo_step < comm_size/2; duo_step++) {
      next_addr = next_addr ^ 1; // Go sideways
      next_addr = (next_addr + 2) % comm_size; // Go up
    }
    
  • ring-on-line is relevant for replica size 2. The traffic follows one of two virtual rings, one on each side of the IPU-Link mesh. Assuming replica numbers are assigned linearly from the bottom and the communication-group size is even, the communication will follow this pattern:

    0 - 1 - 3 - ... - <comm_size-3> - <comm_size-1> - <comm_size-2> - <comm_size-4> - ... - 4 - 2 - 0
    
  • rung-ring-[2,4,8] is relevant for replica size 2, 4 and 8. The traffic follows one of two physical rings, one on each side of the IPU-Link mesh, by moving straight up and assuming wrap-around at the top for a torus. The communication will follow this pattern, looping-back to rank 0:

    0 - 1 - 2 - ... - <comm_size-2> - <comm_size-1> - 0
    
  • rung-ring-[4,8] is also a valid topology for up to 16 IPUs per IPU-Link domain (ILD), on a mesh or torus physical topology with DNC routing.
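
The traversal patterns listed above can be reproduced with a short host-only sketch, which may help when checking a pattern for a specific communication-group size. The function names are hypothetical and none of this is GCL API. The barley-twist trace follows the pseudocode above and records the address after each "sideways" and each "up" move; treating both moves as hops on the ring is an assumption made for illustration.

```cpp
#include <cassert>
#include <vector>

// peripheral-ring order: 0, then odd ranks going up, then even ranks coming
// back down to 0. Requires an even communication-group size.
std::vector<int> peripheralRingOrder(int commSize) {
  assert(commSize >= 2 && commSize % 2 == 0);
  std::vector<int> order{0};
  for (int rank = 1; rank < commSize; rank += 2) {
    order.push_back(rank); // up one side: 1, 3, ..., commSize-1
  }
  for (int rank = commSize - 2; rank >= 0; rank -= 2) {
    order.push_back(rank); // down the other side: commSize-2, ..., 2, 0
  }
  return order;
}

// rung-ring order: straight up through the ranks, then back to rank 0.
std::vector<int> rungRingOrder(int commSize) {
  assert(commSize >= 1);
  std::vector<int> order;
  for (int rank = 0; rank < commSize; ++rank) {
    order.push_back(rank);
  }
  order.push_back(0); // loop back to rank 0
  return order;
}

// barley-twist trace, following the pseudocode: startAddr is 0 for one ring
// and 1 for the other. Assumption: every intermediate address is recorded.
std::vector<int> barleyTwistTrace(int commSize, int startAddr) {
  assert(commSize >= 2 && commSize % 2 == 0);
  assert(startAddr == 0 || startAddr == 1);
  int nextAddr = startAddr;
  std::vector<int> trace{nextAddr};
  for (int duoStep = 0; duoStep < commSize / 2; ++duoStep) {
    nextAddr = nextAddr ^ 1; // go sideways
    trace.push_back(nextAddr);
    nextAddr = (nextAddr + 2) % commSize; // go up, wrapping at the top
    trace.push_back(nextAddr);
  }
  return trace;
}
```

With a communication-group size of 8, peripheralRingOrder gives 0 - 1 - 3 - 5 - 7 - 6 - 4 - 2 - 0, and the two barley-twist traces starting at 0 and 1 each visit all eight addresses before returning to their starting address.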

8.2.3. Relationship between logical and physical topologies

Table 8.1 lists the different relationships between logical and physical topologies, depending on the size of the replica and the IPU-Link routing.

Table 8.1 Relationship between logical and physical topologies

Replica size   Logical topology   Physical topology   IPU-Link routing
1              peripheral-ring    mesh, torus         DNC, SWNC, RINGSWNC
1              barley-twist       torus               BTNC
2              ring-on-line       mesh                DNC, SWNC
2              rung-ring-2        torus               DNC, RINGSWNC
4              rung-ring-4        mesh, torus         DNC
4              rung-ring-4        torus               RINGSWNC
8              rung-ring-8        mesh, torus         DNC
8              rung-ring-8        torus               RINGSWNC

Key: IPU-Link routing options

BTNC       Barley-twist network configuration
DNC        Default network configuration
SWNC       Sliding-window network configuration
RINGSWNC   Ring with sliding-window network configuration
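
As an illustration of how Table 8.1 can be read, the sketch below encodes its rows as data and looks up the logical topology for a given replica size, physical topology and IPU-Link routing. The type and function names (TopologyRow, logicalTopologyFor) are hypothetical; this is a reading aid for the table, not a description of how GCL selects a topology internally.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// One row of Table 8.1 (hypothetical type, for illustration only).
struct TopologyRow {
  int replicaSize;
  std::string logical;
  std::vector<std::string> physical; // allowed physical topologies
  std::vector<std::string> routing;  // allowed IPU-Link routing options
};

// The rows of Table 8.1, in the order they appear in the table.
const std::vector<TopologyRow> &topologyTable() {
  static const std::vector<TopologyRow> rows = {
      {1, "peripheral-ring", {"mesh", "torus"}, {"DNC", "SWNC", "RINGSWNC"}},
      {1, "barley-twist", {"torus"}, {"BTNC"}},
      {2, "ring-on-line", {"mesh"}, {"DNC", "SWNC"}},
      {2, "rung-ring-2", {"torus"}, {"DNC", "RINGSWNC"}},
      {4, "rung-ring-4", {"mesh", "torus"}, {"DNC"}},
      {4, "rung-ring-4", {"torus"}, {"RINGSWNC"}},
      {8, "rung-ring-8", {"mesh", "torus"}, {"DNC"}},
      {8, "rung-ring-8", {"torus"}, {"RINGSWNC"}},
  };
  return rows;
}

static bool contains(const std::vector<std::string> &values,
                     const std::string &value) {
  return std::find(values.begin(), values.end(), value) != values.end();
}

// Returns the logical topology of the first matching row, or an empty
// string if the combination does not appear in Table 8.1.
std::string logicalTopologyFor(int replicaSize, const std::string &physical,
                               const std::string &routing) {
  assert(replicaSize > 0);
  for (const auto &row : topologyTable()) {
    if (row.replicaSize == replicaSize && contains(row.physical, physical) &&
        contains(row.routing, routing)) {
      return row.logical;
    }
  }
  return "";
}
```

For example, logicalTopologyFor(2, "torus", "DNC") returns "rung-ring-2", while a combination that is not in the table, such as replica size 2 on a mesh with RINGSWNC routing, returns an empty string.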