
Challenges in Separating the Data and Control Planes (English text)

Welcome back to the course on software defined networking. In this lesson, we're continuing our discussion of the separation of the control and data planes, and in particular the challenges associated with that separation, including scalability, reliability, and consistency. We'll also talk about approaches to solving these problems in the form of two different systems: the Routing Control Platform (RCP) and ONIX.
Let's first take a look at some of the scalability challenges faced by the RCP and the approaches that platform takes to solving them. One of the problems the RCP faces is that it must store routes and compute routing decisions for every router across the autonomous system, and a single autonomous system may have hundreds to thousands of routers. That's potentially many routing tables and a great deal of routing computation, all performed at a single node, whereas before those computations were distributed across the routers themselves.
Several scalability principles follow from the RCP design. The first is to eliminate redundancy: rather than storing a separate routing table for every router in the autonomous system, the RCP stores a single copy of each route, and when routes are duplicated across the routers in the autonomous system, as is commonly the case, that redundancy can be represented by storing pointers into a common data structure.

The second principle is to accelerate lookups by maintaining indexes that identify the particular routers that may be affected by a change in network conditions, such as the advertisement of a new route or a node or link failure. When such an event happens, the RCP only needs to compute new routing information, or routing tables, for the routers affected by that change, rather than recomputing state for the entire network.

Finally, the RCP sidesteps some scalability problems by simply not handling every routing protocol in the network: it focuses on performing inter-domain routing alone.
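The following short Python sketch illustrates the first two principles under some assumed data structures (the class and field names are purely illustrative, not the RCP's actual implementation): each route is stored once, per-router tables hold only pointers into that shared store, and an index maps an egress router to the routers whose state depends on it, so an event triggers recomputation only for those routers.

# A minimal sketch, assuming a hypothetical in-memory layout, of the two
# scalability ideas: store each route once and let per-router tables point
# into the shared store, and keep an index so an event only triggers
# recomputation for the routers it affects.
class SharedRouteStore:
    def __init__(self):
        self.routes = {}        # route_id -> route attributes (single copy)
        self.rib = {}           # router_id -> {prefix: route_id} (pointers only)
        self.affected_by = {}   # egress router -> set of routers using it

    def install(self, router_id, prefix, route_id, route, egress):
        self.routes[route_id] = route                        # stored once
        self.rib.setdefault(router_id, {})[prefix] = route_id
        self.affected_by.setdefault(egress, set()).add(router_id)

    def routers_to_recompute(self, changed_egress):
        # On a failure or new advertisement affecting this egress, look up
        # only the routers whose state depends on it.
        return self.affected_by.get(changed_egress, set())

store = SharedRouteStore()
store.install("A", "10.0.0.0/8", "r1", {"next_hop": "B", "egress": "D"}, egress="D")
store.install("B", "10.0.0.0/8", "r1", {"next_hop": "B", "egress": "D"}, egress="D")
print(store.routers_to_recompute("D"))   # {'A', 'B'}: only these need new routes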
The ONIX network controller applies a couple of related principles to handle scalability problems. The first is partitioning, whereby an ONIX controller might keep track of only a subset of the overall network information base and network state, and then apply consistency protocols to maintain consistency across the different partitions. The ONIX controller takes advantage of two different consistency models: a strong consistency model, which ensures that different replicas are strongly consistent at the expense of some performance, and a weaker consistency model, which is more efficient and propagates information more quickly. The second scalability principle is aggregation: the ONIX design describes a hierarchical set of controllers, such as having an ONIX controller for a department or a building within a larger enterprise network, and a single "super" ONIX controller that effectively controls those sub-controllers for the overall domain.
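Here is a hypothetical sketch of the partitioning and aggregation ideas in Python (the class names and fields are illustrative assumptions, not ONIX's actual API): each controller keeps only its own slice of the network information base (NIB), and a parent controller works from aggregated summaries of its children rather than from the full detailed state.

# A minimal sketch, assuming illustrative names, of partitioning plus
# hierarchical aggregation across controllers.
class PartitionController:
    def __init__(self, name, switches):
        self.name = name
        # Local slice of the NIB only: the switches in this partition.
        self.nib = {sw: {"links_up": True} for sw in switches}

    def summary(self):
        # Expose an aggregate view instead of per-switch detail.
        return {"partition": self.name,
                "switch_count": len(self.nib),
                "healthy": all(s["links_up"] for s in self.nib.values())}

class RootController:
    def __init__(self, children):
        self.children = children

    def global_view(self):
        # The parent reasons over summaries, not the full state of every switch.
        return [child.summary() for child in self.children]

building_a = PartitionController("building-A", ["sw1", "sw2", "sw3"])
building_b = PartitionController("building-B", ["sw4", "sw5"])
root = RootController([building_a, building_b])
print(root.global_view())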
Let's now take a look at how these systems tackle a second challenge: reliability. One approach to reliability is simply to replicate. The RCP design advocates having a hot spare, whereby multiple identical RCP servers run in parallel, and a backup or standby RCP can take over in the event that the primary fails. The idea is that the network runs independent replicas of the RCP, where each replica has its own feed of the routes from all the routers in the autonomous system. If each replica receives exactly the same inputs and runs exactly the same routing algorithm, then the output, or resulting state, that each of these RCPs pushes back into the routers should be exactly the same, because the inputs and the algorithm are identical. So in the hot spare approach, there is actually no need for a consistency protocol, as long as both replicas always see the same information. There are potential consistency problems, however, if different replicas see different information. Let's see how that might be the case.
Here are two RCPs. Suppose that they see different information and, as a result, compute different outcomes, or desired routing table state, for routers A and B in this autonomous system. The RCP on the left might compute an egress route for router A that says: use egress router D to reach a particular destination, and hence use router B as the next hop to reach egress router D. Similarly, the second RCP, the one on the right, might install conflicting state into router B that says: use egress router C to reach that destination, and use A as the next hop to reach egress router C. You can see that if these two replicas install this respective state into routers A and B, then we have a forwarding loop between router A and router B: in trying to reach that destination, router A is going to use the gold route to try to egress via router D, and router B is going to use the grey route to try to egress via router C. When each of these routers receives packets for that destination from its respective neighbor, it is just going to bounce the packets back and forth. So what we want is for route assignments to be consistent even in the presence of failures and partitions.
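The small Python sketch below reproduces the failure mode just described (the function and data layout are illustrative assumptions, not part of the RCP): each replica installs its own next hop for the same destination, and simply following the next-hop pointers reveals the forwarding loop between A and B.

# A minimal sketch: detect a forwarding loop by walking the installed
# next-hop state for one destination.
def has_forwarding_loop(next_hop, start, egresses):
    """Follow next hops from `start`; return True if a router is revisited
    before any egress router is reached."""
    seen, current = set(), start
    while current not in egresses:
        if current in seen or current not in next_hop:
            return current in seen
        seen.add(current)
        current = next_hop[current]
    return False

# State installed by the two inconsistent replicas for one destination:
next_hop = {"A": "B",   # left replica: reach egress D via B
            "B": "A"}   # right replica: reach egress C via A
print(has_forwarding_loop(next_hop, "A", egresses={"C", "D"}))  # True: A -> B -> A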
We previously said that if every RCP receives the same input and runs the same algorithm, then the output should be consistent, and we want some way to guarantee that. Fortunately, a flooding-based interior gateway protocol such as OSPF or IS-IS, as we learned in our networking course, essentially means that each one of these replicas already knows which partitions it is connected to. That is, if the RCP is participating in the intra-domain routing protocol, the IGP, then it sees the full link state of each partition it is connected to, and that information is enough to make sure that the RCP only computes routing table information, or routing state, for the routers in those partitions. That alone is enough to guarantee correctness. Let's see why.

Suppose that we have a network partition where the routers in partition one can't see, or can't forward traffic to, the routers in partition two, and vice versa. In this case, the solution is to have the single RCP use only state from the routers in each of these partitions when assigning routes. For example, to assign routes to routers in partition one, the RCP would use only the set of candidate routes that it learned from the routers in partition one; it would not use any candidate routes learned from routers in partition two. That alone is sufficient to guarantee consistent forwarding. You can intuitively see why: if the RCP never assigns a route learned from partition two to a router in partition one, then effectively partition one and partition two are simply acting as separate networks with a common routing control platform.
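The following sketch shows the correctness rule just described, under an assumed data layout (the function name, route tuples, and partition map are hypothetical): when assigning a route to a router, only candidate routes learned from routers in that router's own IGP partition are considered.

# A minimal sketch of partition-aware route assignment.
def assign_routes(candidates, partition_of):
    """candidates: list of (learned_from_router, destination, route)
    partition_of: router -> partition id, as learned from the IGP."""
    assignments = {}
    for router, partition in partition_of.items():
        for learned_from, dest, route in candidates:
            # Skip any route learned in a different partition.
            if partition_of.get(learned_from) != partition:
                continue
            assignments.setdefault(router, {})[dest] = route
    return assignments

partition_of = {"A": 1, "B": 1, "C": 2, "D": 2}
candidates = [("B", "10.0.0.0/8", "egress via B"),
              ("C", "10.0.0.0/8", "egress via C")]
# Routers in partition 1 only get the route learned from B;
# routers in partition 2 only get the route learned from C.
print(assign_routes(candidates, partition_of))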
Now suppose that we've actually replicated the RCP, but the network itself has multiple partitions. Here you might think we have a more serious problem, because there may be partitions that are reachable by, or visible to, both RCPs, while others are reachable only by subsets of the RCPs, and those subsets may be non-overlapping. The approach here is to ensure that the RCPs receive the same state from each partition that they can reach. The IGP provides complete visibility and connectivity within each of these partitions, and if an RCP only acts on a partition when it has the complete state for that partition, then it is guaranteed that the routes it assigns for that partition will be consistent. In other words, there will be no forwarding loops.
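A sketch of that decision rule might look like the following (the membership information and names are assumptions made purely for illustration): a replica acts on a partition only if the IGP has given it the complete link state for that partition, so any replica that does act computes from identical inputs.

# A minimal sketch: each replica acts only on partitions for which it has
# complete state.
def partitions_to_act_on(visible_routers, full_membership):
    """visible_routers: partition -> routers this replica hears from via the IGP.
    full_membership: partition -> all routers in that partition (assumed known)."""
    return {p for p, members in full_membership.items()
            if visible_routers.get(p, set()) >= members}   # complete state only

full_membership = {1: {"A", "B"}, 2: {"C", "D"}}
rcp1_view = {1: {"A", "B"}, 2: {"C"}}                 # partial view of partition 2
print(partitions_to_act_on(rcp1_view, full_membership))   # {1}: skip partition 2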
Let's look at how ONIX tackles the challenge of reliability. ONIX considers different types of failures that may occur in the network. The first is network failures: in this case, ONIX simply assumes that it is the application's responsibility to detect and recover from those failures. If a network failure affects reachability to ONIX itself, the design suggests that the use of a reliable protocol or multi-path routing could help ensure that the ONIX controller remains reachable even in the case of a network failure. If ONIX itself fails, the solution takes a similar approach: apply replication, and then use a distributed coordination protocol among the replicas. Because ONIX has been designed for a far more general set of applications than the Routing Control Platform, a more complicated distributed coordination protocol is necessary. Some of the details of those coordination protocols are discussed in the paper referenced on the slide where we introduced ONIX in this lecture.
In summary, separating the control and data planes poses three significant challenges. The first is scalability: a single controller must now make routing decisions, or various other control plane decisions, on behalf of many network elements that previously each performed those computations independently. The second challenge is reliability: guaranteeing correct operation under failure of the network, or failure of the controller itself. The third challenge is consistency: ensuring consistency across multiple controller replicas, particularly in cases of network partitions or failures. We explored each of these challenges in some detail and talked about various techniques, including hierarchy, aggregation, and clever state management and distribution, that systems such as the RCP and ONIX have used to tackle them. Each particular controller tackles these challenges in a different way, but many of these principles apply across different controller designs and implementations.