Making Software Faster: RouteBricks (text)

Welcome back.
We are continuing our study of programmable data planes.
And in particular we are now going to talk about
how to make programmable data planes both fast and programmable.
Designing fast programmable data planes presents a tension.
On the one hand, software is incredibly flexible,
but it is typically not as fast as hardware.
Hardware, on the other hand, can forward network traffic at very
high rates, but it is typically not as flexible as software.
So what we'd like to do in designing a programmable
data plane is to get the best of both worlds.
We would like to have the programmability, flexibility,
and extensibility of software, but the performance of hardware.
There are essentially two ways to get the best of both worlds.
One is trying to make software solutions perform better.
The other is to make hardware more programmable.
In the two parts of this lesson, we will look at each of these approaches.
In this part of the lesson, we'll explore an architecture called RouteBricks,
which uses a cluster of servers to achieve fast forwarding rates in software.
Many new protocols require changes to the data plane,
and these protocols must forward packets at acceptable speeds.
They may also need to run in parallel with existing protocols.
So we need a programmable data plane platform,
for developing and deploying these new network protocols,
that can forward packets at high speeds, and
can also run multiple data plane protocols in parallel.
There are various approaches we might take for achieving this goal.
One is to develop custom software.
The advantage to custom software is that it's flexible and easy to program.
The disadvantage is that forwarding speeds can be slow.
In this lesson, however, we will look at how
to increase the forwarding speed of custom software.
Another approach is to develop modules with custom hardware.
Custom hardware of course provides excellent
performance, but development cycles can be long,
and the designs are rigid and fixed once the hardware has been fabricated.
Finally, we can develop custom data planes, using programmable hardware.
The advantage of programmable hardware is that it's
flexible and fast, but programming can be difficult.
In the next part of the lesson, we'll
look at how to make programming programmable hardware easier.
Let's first review the internals of a hardware router with N line cards,
each of which processes traffic at a rate of R bits per second.
The switch fabric must thus switch traffic at a rate of N times R.
Now it's worth keeping in mind as well that N and R could be quite large.
Each line card might forward at a rate
of anywhere between 10 and 100 gigabits per second.
And we may also have multiple line cards in the hardware router chassis.
So when we talk about taking that hardware switch
fabric and replacing it with a commodity interconnect, where forwarding
takes place on servers, as opposed to high-end
hardware line cards, we have a new set of challenges.
Assuming that each server hosts a line card, each server
must process traffic at a rate of R bits per second.
And the interconnect must switch at a rate of about R bits per second per server.
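To make these numbers concrete, here is a quick back-of-the-envelope sketch in Python (the values of N and R are illustrative assumptions, not figures from the lecture):

    # Back-of-the-envelope arithmetic for the rates discussed above.
    # N and R are assumed example values, not figures from the lecture.
    N = 8          # line cards (hardware router) or servers (cluster)
    R = 10e9       # per-port rate in bits per second (10 Gb/s)

    # Hardware router: the switch fabric carries traffic from all N line cards.
    fabric_rate = N * R
    print(f"switch fabric: {fabric_rate / 1e9:.0f} Gb/s")

    # Server cluster: each server stands in for one line card, so it must
    # process about R, and the interconnect moves roughly R per server.
    print(f"per-server processing: {R / 1e9:.0f} Gb/s")
    print(f"interconnect load per server: {R / 1e9:.0f} Gb/s")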
The interconnect has some additional requirements.
We want internal link rates to be less than R,
because we are trying to build the interconnect out of commodity hardware.
If the interconnect required a speed-up beyond that of the
externally facing network interfaces,
we would be unable to build it with commodity hardware.
We'd like the servers themselves to be able to process traffic at
a rate of C times R, where C is a small integer constant.
We will see the reason for that speed-up shortly.
Finally, we'd like the per-server fan-out to be constant.
Because servers can host only a limited number of network interface cards,
and each of those cards has only a limited number of ports,
we cannot design a topology whereby each server node needs a
large or ever-increasing number of ports as the switch grows larger.
So using a cluster of servers to design a
programmable software data plane presents a number of challenges.
The internal link rates can't exceed the external
link rates, otherwise we can't use commodity hardware.
Each node has a limited processing rate,
and there is a limited per-node fan-out.
Let's now see how we might tackle these challenges.
A strawman approach is to take each server and connect it to N other servers.
Assuming that we have N external links, each of capacity R, this interconnection
topology requires N squared internal links, each having capacity R.
This interconnection topology obviously does not scale well as the number of
external ports, and hence as the number of servers, grows.
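To see how quickly this blows up, a small sketch (illustrative Python, not from the lecture) counts the internal links a full mesh needs as N grows:

    # Full-mesh strawman: every server links directly to every other server,
    # so the number of internal links grows roughly as N squared.
    for N in (4, 8, 16, 32, 64):
        internal_links = N * (N - 1)   # directed links of capacity R each (the lecture rounds this to N^2)
        print(f"N = {N:2d} external ports -> {internal_links:4d} internal links of capacity R")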
Let's see what we can do to make this topology
scale better with the number of servers in the cluster.
One approach is to apply a technique called Valiant load balancing.
The idea here is that instead of sending traffic that
arrives on an input port directly to the output port,
the server that processes the incoming traffic first picks an
intermediate server at random to which to send that traffic.
That intermediate server then forwards the traffic to the intended output port.
If each server sees an incoming traffic rate of R, then each link
to one of the N intermediate nodes need only have a capacity of R over N.
This dramatically reduces the capacity that is required on the interconnect.
And helps us satisfy the design goal of having the
internal links at capacities that are lower than the external links.
Of course, this comes at a cost.
Each server must now process traffic at a rate of 3R.
To see where that 3R comes from, count the incoming traffic at a rate of R,
the traffic passed to the N intermediate nodes at an aggregate rate of R
(R over N to each), and the traffic on the output port at a rate of R.
The required per-server processing rate of 3R, of course, assumes that
traffic is not uniformly balanced across the output links in the first place.
If it is, then we can avoid sending traffic to that
intermediate phase and then we only need a server capacity of 2R.
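The R-over-N link capacity and the 2R/3R per-server figures can be written down directly. The following sketch simply restates the lecture's counting with assumed values for N and R:

    # Valiant load balancing: per-link and per-server load, following the
    # counting in the lecture. N and R are assumed example values.
    N = 8          # servers / external ports
    R = 10e9       # external port rate in bits per second

    # Each of the N internal links toward intermediate servers carries ~R/N.
    per_internal_link = R / N

    # Worst case (traffic not uniform across outputs), one server handles:
    #   R for traffic arriving on its external port,
    #   R for traffic it relays as a randomly chosen intermediate hop,
    #   R for traffic leaving on its external port.
    worst_case = 3 * R
    # If traffic is already uniform, the intermediate hop can be skipped: 2R.
    uniform_case = 2 * R

    print(f"internal link capacity: {per_internal_link / 1e9:.2f} Gb/s (R/N)")
    print(f"per-server rate: {worst_case / 1e9:.0f} Gb/s worst case, "
          f"{uniform_case / 1e9:.0f} Gb/s with uniform traffic")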
So now we've solved the problem of reducing the capacity of
the interconnect links, but what about the processing rate of individual servers?
Each server must also process traffic as quickly as possible.
With the initial architecture, the designers saw single-server forwarding rates of about 1.3 gigabits per second.
Given that each server has multiple cores, it makes sense to figure out whether
we can take advantage of parallelism that
the processor offers to increase our forwarding rates.
Before we can even take advantage of parallelism, it's useful to recognize
that processing packets one at a time involves tremendous bookkeeping overhead.
The server must manage descriptors for each packet, move the packet between the
network interface card and memory, and
update the bookkeeping state associated with that packet.
We can speed up forwarding operations by batching them together, in other
words, waiting for multiple packets to arrive before processing them.
Instead of processing each packet individually, the
network interface card can batch multiple packet descriptors,
and the CPU can then poll for multiple packets at once,
thereby amortizing the overhead associated with processing each packet.
The cost of batching, of course,
is increased latency and increased jitter,
since packets may sit in memory on the interface card for
a variable amount of time before the CPU polls for them.
Performing this type of batching results in single-server
forwarding rates of about three gigabits per second.
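As a rough sketch of what batched polling looks like, the hypothetical Python below stands in for a real NIC descriptor ring (it is not RouteBricks code); the point is only that the per-poll bookkeeping is paid once for an entire batch rather than once per packet:

    # Hypothetical sketch of batched polling: the fixed per-poll bookkeeping
    # is paid once per batch instead of once per packet.
    BATCH_SIZE = 32

    def poll_batch(nic_ring, max_packets=BATCH_SIZE):
        # Pull up to max_packets descriptors from the ring in one poll.
        batch = nic_ring[:max_packets]
        del nic_ring[:max_packets]
        return batch

    def forwarding_loop(nic_ring, process_packet):
        # Drain the ring in batches; one round of bookkeeping per batch.
        while nic_ring:
            for pkt in poll_batch(nic_ring):
                process_packet(pkt)

    # Example: forward 100 dummy "packets" in batches of 32.
    ring = [f"pkt{i}" for i in range(100)]
    forwarding_loop(ring, process_packet=lambda pkt: None)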
Now, let's see if we can take advantage of parallelism
on a multi-core processor to increase this forwarding rate further.
The challenge is figuring out how to map the traffic that's arriving on multiple
incoming ports to the multiple cores that we have available on any single server.
There are clearly many options for mapping traffic arriving on ports to cores.
One could, for example, assign each port to a separate core.
Or one might imagine creating a pipeline of cores,
where each packet passes through several cores and each core performs part of the processing.
It turns out that there are two design rules that tend to work well.
One is to assign a single core per queue.
This avoids locking.
By contrast, imagine if we had multiple cores accessing the same queue.
This would require each core to lock memory to prevent other
cores from accessing that same memory at the same time.
Therefore, it's much faster to simply assign one core for each queue.
Empirically, the designers also found that, as opposed to sending a packet
through a pipeline of cores, it is much faster to assign one core per packet.
So there are two rules.
Assign one core per queue, and assign one core per packet.
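A minimal sketch of these two rules, assuming a hypothetical machine with four cores and four receive queues (illustrative Python, not RouteBricks code):

    # Illustrative mapping of queues to cores (assumed 4 cores, 4 receive queues).
    NUM_CORES = 4
    NUM_QUEUES = 4

    # Rule 1: one core per queue, so no two cores ever contend for (and
    # therefore never need to lock) the same queue.
    queue_to_core = {q: q % NUM_CORES for q in range(NUM_QUEUES)}

    def handle_packet(core_id, pkt):
        # Rule 2: one core per packet. The core that dequeues the packet
        # performs the entire forwarding path itself (lookup, rewrite,
        # transmit), rather than handing the packet down a pipeline of cores.
        pass

    for queue_id, core_id in queue_to_core.items():
        print(f"queue {queue_id} -> core {core_id}")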
These optimizations allow the single-server forwarding performance
to reach nearly ten gigabits per second.
There are other tricks that other architectures
have used to achieve fast forwarding in software.
Another architecture, called PacketShader, used a
large socket buffer to hold multiple packets at once,
amortizing the overhead associated with processing each packet.
The Trellis architecture simplified packet lookup by
using Ethernet GRE tunnels, avoiding lookups on
the software bridge between the virtual interfaces and the physical
interfaces on the host running the virtual machine.
In summary, one way of achieving a fast programmable
data plane is to try to make software faster.
A variety of projects have shown that software
routers can in fact be fast, and that general-purpose
infrastructure is capable of fast forwarding performance, but
the low-level details and optimizations do matter.
There are also other efforts underway, such as Intel's Data Plane Development
Kit (DPDK), that are making programmable software data planes faster.
You can read more about DPDK and other efforts
for making software data planes faster on the course home page.