mirror of
https://github.com/fdiskyou/Zines.git
synced 2025-03-09 00:00:00 +01:00
866 lines
48 KiB
Text
866 lines
48 KiB
Text
---[ Phrack Magazine Volume 8, Issue 53 July 8, 1998, article 05 of 15
|
|
|
|
|
|
-------------------------[ Introduction and Overview of Internet Routing
|
|
|
|
|
|
--------[ krnl <krnl@heuristic.org>
|
|
|
|
|
|
|
|
----[ Routing Overview:
|
|
|
|
The process of routing can be quickly summarized as a node finding the path to
|
|
every possible destination. Routing is present in everything from layer 1
|
|
(the physical layer) on up. The routing that most people are familiar with,
|
|
however, occurs at layer 3 (the network layer) and as such, we will only
|
|
reference layer 3 (and more specifically) Internet Protocol (IP) routing in
|
|
this document.
|
|
|
|
Protocols for exchange of routing information connect multiple routers around
|
|
the world to provide them with a common view of the network through their
|
|
heterogeneous, though generally consistent routing tables. Routing tables
|
|
store all information necessary for the router to reach every destination on
|
|
the network irrespective of size (i.e. the network could be j.random LAN with
|
|
one ip router and two hosts off of an ethernet port or it could be the
|
|
Internet proper).
|
|
|
|
----[ Routing Protocols:
|
|
|
|
There are a wide variety of routing protocols used to contribute to the
|
|
routing tables across a network. Protocols such as BGP, OSPF, RIP and ISIS
|
|
help to convey a correct and coherent picture of the network to all network
|
|
switches (routers).
|
|
|
|
----[ Routing Goals:
|
|
|
|
You can imagine that if each router has to store information that would allow
|
|
it to reach every destination on the network, there is the possibility for it
|
|
to amass a large routing table. Large routing tables are difficult (and
|
|
sometimes impossible) for routers to process because of physical constraints
|
|
(cpu, memory or a combination). Therefore, we would like to minimize the
|
|
routing table space without sacrificing the ability to reach every destination
|
|
on the network. For example, if the router is connected to the Internet via
|
|
one DS1 link to another router, the router could store routing table
|
|
information for each destination on the Internet or it could just default
|
|
non-local information out that serial link. What defaulting means is that if
|
|
the router does not have a specific entry in its table for the destination
|
|
that the packet is trying to find, it sends it out the default link. The
|
|
router towards which a router sends defaulted packets is sometimes called the
|
|
'gateway of last resort'. This simple trick allows many routing tables to
|
|
save a number of entries on the 30th order of magnitude. Routing information
|
|
should not be exchanged between routers in a spurious fashion. Frequent churn
|
|
in the routing table puts unnecessary stresses on the scare memory and cpu of
|
|
any given router. Information propagation should not interfere with the
|
|
forwarding operations of the router. Though this means that you should not
|
|
send routing updates every nanosecond, it does not mean that routing
|
|
information should only be exchanged and updated weekly. One of the important
|
|
goals of routing is that it provide the host with a table which accurately
|
|
reflects the current status of the network.
|
|
|
|
The most important aspect of a router's operation is sending packets from
|
|
input to correct output. Misrouting packets could cause a loss of data.
|
|
Routing table inconsistencies could also cause routing loops whereby a packet
|
|
is passed between two adjacent interfaces ad infinitum.
|
|
|
|
It is desirous for routers to have quick convergence. Convergence can be
|
|
informally defined as a metric which gauges the speed at which routers arrive
|
|
at a consistent view of the network. It would be ideal to have infinitesimal
|
|
convergence times because that would ensure that each router on the network
|
|
can accurately reflect the current topology even after a drastic change (link
|
|
failure). When the network is changing, each router must propagate data which
|
|
will aid other routers in converging to the correct picture of the network
|
|
status. Problems with quick convergence are found in the routing updates. If
|
|
a link is flapping (changing line status from up to down) rapidly, it can
|
|
generate numerous installation and withdrawal requests. Therefore, that one
|
|
link can end up consuming the resources of every router on the network because
|
|
the other routers are forced to install and withdraw the route in rapid
|
|
succession. While convergence is an important goal of routing protocols, it
|
|
is not a panacea to network woes.
|
|
|
|
|
|
----[ Distance Vector Routing
|
|
|
|
Distance vector routing protocols distribute a list of <destination, cost>
|
|
tuples to all of the router's neighbors. These tuples assign a cost to reach
|
|
every other node of the network. It is important to note that this routing
|
|
information is only distributed to routers which are assigned as neighbors to
|
|
the originating router. These neighbors are often physical, but can be
|
|
logical in the case of eBGP multihop. That cost is the sum of the link costs
|
|
for the router to reach a destination. Routers periodically send their
|
|
neighbors distance vector updates; the neighbor then compares the received
|
|
distance vector to its current distance vector. If the received values are
|
|
lower, the router sends output to the destination in the distance vector over
|
|
the link that it received the vector over.
|
|
|
|
The count to infinity problem is a problem with many distance vector
|
|
implementations. We will assume that all links have a unit cost and that each
|
|
hop corresponds to a unit. For example, if router X is connected to router Y
|
|
and router Y is connected to router Z, we can demonstrate this problem (see fig
|
|
1). Y knows a 1 hop path to Z and X knows a 2 hop path to Z. Assume that
|
|
link YZ goes down and the cost of that route is increased to infinity (fig 2).
|
|
Now, Y knows an infinite cost route to Z because it knows the link is down so
|
|
it propagates this distance vector to X. Suppose X has sent an update to Y
|
|
which advertises a 2 hop distance vector. Now, Y will think that it can get
|
|
to Z through X, so it sends X an update that says it can get to Z in three
|
|
hops (fig 3). Note that X has no idea that the distance vector being
|
|
advertised to it was originated from X. This is a serious flaw in distance
|
|
vectors. In their unmodified form, they do not contain the full path
|
|
information that the route has traversed. As illustrated above, the router
|
|
alternates states of advertising a path to Z and advertising infinity to Z.
|
|
They keep this exchange up forever or until they have reached some internally
|
|
defined infinity count (say 15 as in the case of RIP).
|
|
|
|
Count to Infinity problem:
|
|
|
|
X--------------------Y--------------------Z
|
|
|
|
Y:1 X:1 X:2
|
|
Z:2 Z:1 Y:1
|
|
|
|
[ fig 1 ]
|
|
All links are up, below each node we note the destination and hopcount
|
|
from each respective node.
|
|
|
|
|
|
X--------------------Y--------* *---------Z
|
|
|
|
Y:1 <------------- Z:infinity
|
|
Z:2 -------------> X:1
|
|
|
|
[ fig 2 ]
|
|
The link Y - Z breaks. Node X advertises Z:2 to node Y.
|
|
|
|
|
|
|
|
X--------------------Y--------* *---------Z
|
|
|
|
Z:infinity(frm Y) -> X:1
|
|
Y:1 <------------- Z:3
|
|
|
|
[ fig 3 ]
|
|
Node Y sends its Z distance vector to X _before_ it recieves node X's
|
|
infinity. Once node Y receives node X's infinity, it sets its distance to
|
|
infinity.
|
|
|
|
A path vector is an easy way to defeat the count-to-infinity problem.
|
|
Basically, each distance vector also includes the router path that it
|
|
traversed (fig 4). The router rejects an update from its neighbor if the path
|
|
included in the update includes the router receiving the update (fig 5). The
|
|
Border Gateway Protocol (which is used to exchange routing information between
|
|
Autonomous Systems on the Internet) incorporates the path vector to stop the
|
|
count-to-infinity problem. Obviously, you have to incorporate more
|
|
information in the routing table if you want to include the AS path
|
|
information that the route has traversed. The designers of BGP decided that it
|
|
was optimal to sacrifice storage space and processing power for the robustness
|
|
that the path vector affords the routing protocol.
|
|
|
|
Path Vector Solution:
|
|
|
|
X--------------------Y--------------------Z
|
|
|
|
Y:1 (Y) X:1 (X) X:2 (YX)
|
|
Z:2 (YZ) Z:1 (Z) Y:1 (Y)
|
|
|
|
[ fig 4 ]
|
|
All links are up, below each node we note the destination, hopcount and
|
|
path vector from each respective node.
|
|
|
|
|
|
X--------------------Y--------* *---------Z
|
|
|
|
Y:1 (Y) X:1 (X)
|
|
Z:2 (Y Z) Z:infinity
|
|
|
|
[ fig 5 ]
|
|
The link Y - Z breaks. Node Y knows to ignore Xs advertisement of Z
|
|
because Y is the path vector. The avoids the count-to-infinity problem.
|
|
|
|
|
|
Another way to counter this problem is the split horizon. Basically, this
|
|
means that a router shouldn't advertise a path to a neighbor if that neighbor
|
|
is the next hop to the destination. This solves the problem presented in the
|
|
example above because the path to Z from X through Y would not have been
|
|
advertised to Y because Y is the neighbor _and_ the next hop to the
|
|
destination (Z). A variation called split horizon with poisonous reverse has
|
|
router X advertise an infinite cost to get to destination Z. Under a split
|
|
horizon, router X would not advertise that it could get to router Z.
|
|
|
|
|
|
----[ Link State Routing
|
|
|
|
A router using a link state routing protocol distributes the distance to its
|
|
neighbors to every other router on the network. This allows each router on
|
|
the network to make a routing table without knowing the full cost to the
|
|
destination from any one source. The problems of loops are avoided because
|
|
each router contains the full topology of the network. Basically, the router
|
|
makes a 3 tuple containing the source router (itself) the neighbor and the
|
|
cost to its neighbor. Therefore, if router A is connected to Router B over a
|
|
link of cost 3 and router A is connected to router C over link cost 5, then
|
|
router A would advertise the Link State Packets (LSPs) <A,B,3> and <A,C,5> to
|
|
all routers on this network. Each router on the network would evaluate all of
|
|
the LSPs that it receives and calculate a shortest path to every destination
|
|
on the network.
|
|
|
|
Obviously, the LSP is an integral part of the convergence process. If someone
|
|
could inject false LSPs into the network, it could result in misrouted
|
|
information (a packet taking a longer path than it should) or even in the
|
|
blackholing of a router on the network. This is not necessary a malicious
|
|
attack of a network, however. Router C could advertise a link to its neighbor
|
|
D with the 3 tuple <C,D,6> and then withdraw the announcement when the link
|
|
goes down. Unfortunately, if the LSP advertising the link having an infinite
|
|
cost arrives before the LSP advertising the cost of that link being 6, the
|
|
routing table will not reflect the topology of the network and will be in that
|
|
state until another LSP comes to correct the problem.
|
|
|
|
To combat this, a sequence number is introduced into the LSP. Therefore, all
|
|
of the routers on the network would initialize their sequence number to some
|
|
starting value and then start advertising their LSPs. This solves the above
|
|
problem in that the LSP advertising the link of infinite cost would have a
|
|
higher sequence number than the LSP advertising the link as having cost 6.
|
|
|
|
Some problems encountered when using sequences numbers are finite sequence
|
|
space, sequence initialization, and aging. It is in the best interest of a
|
|
robust link state protocol needs to protect its LSPs as well as choose a
|
|
sequence space which is sufficiently large to accommodate updates. The
|
|
sequence space that the LSPs can use is set to some finite value. Therefore,
|
|
when the sequence numbers reach the top of the space, they must wrap around
|
|
towards the smallest sequence number. This presents a problem because when a
|
|
router compares link state updates, the greater sequence number takes
|
|
preference. To combat this problem, you can define a maximum age of the LSP.
|
|
Therefore, if you have not received an update in X ticks, you discard the
|
|
current LSP information and wait for a further update. It must be noted that
|
|
this invalidates the path information to a destination. For example, if
|
|
router Y advertises a cost to its neighbor router Z where router Y is
|
|
connected by one link to a meshed network, when the link between the mesh and
|
|
router Y breaks, the other routers in the mesh have preserved link state
|
|
information that will allow them to find a path towards Z. If they receive no
|
|
updates in MAX_AGE, then they will assume that the link to Y is unreachable.
|
|
This will allow each router to converge its table and allow it to advertise an
|
|
infinite LSP for Y and Z.
|
|
|
|
Sequence initialization is also an important facet of this problem. Say
|
|
router Y crashed and is rebooting while the network is recalculating paths to
|
|
it. When it starts its link state protocol back up, it must somehow indicate
|
|
that it needs to reinitialize its sequence number to the last number it gave
|
|
all of the other routers to allow for coherence. Therefore, it can announce
|
|
paths with a sequence number in a special "initialization set". This
|
|
initialization set will tell the other routers that this router needs the
|
|
sequence where it left off. This is the "lollipop sequence" idiom. The
|
|
sequence space really resembles a lollipop in that the normal sequence number
|
|
keep churning around the finite sequence space while reinitialization takes
|
|
place in a short linear sequence space (comparable to the stick :).
|
|
|
|
Great pains are taken to ensure the integrity of LSPs. In fact, this entire
|
|
routing algorithm depends on the LSP being digested in a coherent method to
|
|
guarantee that each router has the correct view of the network topology. The
|
|
question still remains how the root node router computes the distance to each
|
|
destination.
|
|
|
|
Because of the general nature of a link state protocol, you have various nodes
|
|
of the network advertising the distance to get to their neighbors to every
|
|
other node on the network. Thus each node has a collection of neighbor
|
|
distances from various routers on the network. The routing table is basically
|
|
'grown' outward from the root node to all of the network extremities. This
|
|
will be explained in a slightly rigorous fashion in the next section.
|
|
|
|
|
|
----[ Dijkstra's Algorithm
|
|
|
|
This algorithm is a simple and elegant way to determine network topology.
|
|
Basically, there are two distinct sets of destinations on the network.
|
|
Destinations in set K are known routes for which a shortest path has been
|
|
computed. Destinations in set U are routers for which the best path to that
|
|
router is not currently known. In this set, paths are being considered as
|
|
candidates to be the best path to that destination.
|
|
|
|
To start off, add the current node p into the set K. Then add all of its
|
|
neighbors into the set U with path/cost associations. If there is another path
|
|
to one of the neighbors in the U set, then choose the path which costs the
|
|
least. When the neighbors N* are added to U make sure that they indicate the
|
|
cost through p as well as p's ID .
|
|
|
|
Once this has been done for the set U, then pick the neighbor n to p which has
|
|
the smallest cost to reach p. This is assuming that the neighbor has not
|
|
already been installed in K. This algorithm stops when set U is equivalent to
|
|
the empty set. When set U is null, it is implied that all destinations are in
|
|
set K and have the shortest cost from the root node P on which this algorithm
|
|
is running. Note, that each step evaluates adds ONE neighbor into K. That
|
|
neighbor is the router with the smallest cost to reach p.
|
|
|
|
|
|
----[ Distance Vector vs. Link State
|
|
|
|
We are left with these protocols like BGP which uses path vector and OSPF
|
|
which uses link state. Why do they occupy such orthogonal space? When a link
|
|
state protocol is working correctly, it guarantees that there will be no
|
|
routing loops in the network. The link state protocol also guarantees fast
|
|
convergence when there is a change in the topology of the network because the
|
|
link state is distributed on a flat routing space. Since link state protocols
|
|
contain these inherent advantages, why do protocols like BGP chose to employ
|
|
the path vector approach?
|
|
|
|
Taking a cross-section of routing protocols that are employed on the internet,
|
|
one finds that the majority of large providers use OSPF to resolve routing
|
|
information on their internal network and BGP to talk to other distinct
|
|
networks (or autonomous systems) at their borders of administration. What
|
|
suits BGP as an external protocol and OSPF for an internal routing protocol?
|
|
|
|
One issue, which will be discussed in the next section, is hierarchy. BGP
|
|
provides a mechanism for a routing hierarchy which enables it to greatly
|
|
reduce the space of its table. OSPF, which is a link state protocol,
|
|
provides a flat routing table whereby any internal router knows the full
|
|
hop by hop path to any destination within the autonomous system. Furthermore,
|
|
distance vector protocols understand that different areas can have different
|
|
views of the network where link state protocols require that each node
|
|
independently compute a consistent view of the network. This saves the DV
|
|
protocol the overhead of maintaining a correct LSP database. BGP also has
|
|
another 'advantage' in that it is layered on top of the Transmission Control
|
|
Protocol (TCP). Therefore, in the 'best-effort' service of IP networks, BGP
|
|
has assurance (to the level that TCP can guarantee) that routing information
|
|
will be propagated. Whereas, you can (or should) be able to govern the status
|
|
of your internal network, the nebulous exterior past your border routers
|
|
confers no delivery guarantee on your routing information.
|
|
|
|
Each type of routing algorithm is suited for its function. Link State
|
|
protocols provide the quick convergence that is essential to an internal
|
|
network while distance vector protocols provide external reachability.
|
|
Hierarchy is not something that is inherent in distance vector protocols,
|
|
but the implementation of a hierarchy has made BGP a widely used exterior
|
|
gateway protocol.
|
|
|
|
|
|
----[ Routing Hierarchy
|
|
|
|
Routing hierarchy is an oft fought debate that borders on religion. There
|
|
are constantly questions about how hierarchy should be implemented (if at
|
|
all) in the free form state of the current global network. Hierarchy imposes
|
|
a tree of authority with the overall authority at the top of the tree and
|
|
branching down to regional authorities, local authorities ad infinitum.
|
|
Hierarchy simplifies routing because if a destination is not locally routable
|
|
(or under your section of the tree). You can iterate up towards the top tree
|
|
to try and deliver that information. As you move towards the top, the routing
|
|
information contained in the routers becomes less and less specific until you
|
|
reach the root node which is the least specific. It does, however, know how
|
|
to route information to every possible destination on the network. It may help
|
|
you to envision the hierarchy of the telephone network (built under one
|
|
collective). If a call cannot be placed within a central office, it is handed
|
|
to either another central office in the area code or a wide area link. The
|
|
wide area link understands how to route to each area code in a full national
|
|
mesh whilst the local 5ess switch only knows routing information for more
|
|
specific prefixes. As the phone number becomes less specific (from right
|
|
to left), the routing decision moves further up the strict hierarchy.
|
|
|
|
This similar to how the domain name system (DNS) works on the internet (fig 6).
|
|
You provide local records for domains that you host. When your nameserver
|
|
receives a query for a record, it either returns the fact that it has
|
|
authority for that record or points toward the root nameserver. The root
|
|
nameserver knows the delegations of .com, .net, .org et al. and then points
|
|
towards the site responsible for that top level domain. That site then points
|
|
towards the site that has authority for the specific second level domain.
|
|
Domain names take the form of most specific -> least specific; i.e.
|
|
microsoft.com is more specific than just .com. Likewise
|
|
gates.house.microsoft.com is more specific than microsoft.com.
|
|
|
|
DNS Hierarchy:
|
|
___ . ___
|
|
/ | \
|
|
.com. .org. .edu.
|
|
/ | \
|
|
microsoft.com. eff.org. isi.edu.
|
|
/ | \
|
|
billy.microsoft.com. x0r.eff.org. rs.isi.edu.
|
|
|
|
[ fig 6 ]
|
|
Each level in the hierarchy is responsible for levels of greater
|
|
specificity.
|
|
|
|
Root authority is controlled by the Internet Assigned Numbers Authority
|
|
(IANA). It provides the top of the hierarchy in a "centrally" managed
|
|
database (in fact, there are multiple root servers distributed across the
|
|
county which maintain a consistent database). This is the closest example of
|
|
strict hierarchy that can be found on the internet.
|
|
|
|
With IP addresses, specificity increases in the opposite direction. IP
|
|
addresses (version 4) are 32-bits. The rightmost bit signifies the greatest
|
|
amount of specificity and the leftmost, the least. IP routing authority
|
|
information is not maintained in a centralized database. Routing information
|
|
is exchanged between autonomous systems via the BGP protocol. Routes take
|
|
preference in order of most specific -> least specific. In this way, there is
|
|
some type of hierarchy in the system (even though it is more loose than the
|
|
DNS example). Generally, larger providers control larger parts of the total
|
|
IPv4 space ((2^32) - 3 addresses). The converse is also true.
|
|
|
|
Classless Inter-Domain Routing (CIDR) also helped to decrease the size of
|
|
routing tables and increase the appearance of hierarchy. Now, instead of
|
|
Sprint announcing routes to 130.4.0.0 through 130.20.0.0 (on the classical B
|
|
space boundary) it could announce 130.4.0.0/12 which encompasses that entire
|
|
16 class B range. The classful ranges, subnetworking and the like are
|
|
discussed in my "introduction to IP" page and are therefore not included in
|
|
this document.
|
|
|
|
|
|
----[ Routing Hierarchy and Aggregation
|
|
|
|
BBN divides their 8/8 network into two subnetworks and advertises reachability
|
|
to the aggregate to save table space. Once inside an AS, routing obeys a fairly
|
|
strict hierarchy. Router A is responsible for the entire 131.103/16. It
|
|
divides it into two /17. Likewise, Router D in AS1 is responsible for 8/8 and
|
|
chooses to divide it into 8.0/9 and 8.128/9 and divides responsibility for
|
|
these networks to Routers E and F respectively (fig 7). Routers B, C, E, and F
|
|
can further choose to subdivide their networks in a hierarchical fashion.
|
|
Because of the binary nature of subnetting, networks can only be divided in
|
|
half.
|
|
|
|
Routing Hierarchy and Aggregation:
|
|
|
|
BGP
|
|
|
|
131.169.0.0/16 <--------------------> 8.0.0.0/8
|
|
A (AS1239) D (AS1)
|
|
/ \ / \
|
|
B / \ C E / \ F
|
|
/ \ / \
|
|
131.169.0.0/17 131.169.128.0/17 8.0/9 8.128/9
|
|
|
|
[ fig 7 ]
|
|
In the internet, there is no strict routing hierarchy. There are simply
|
|
large networks which peer via BGP to distribute aggregated routing
|
|
information.
|
|
|
|
|
|
The national backbone is populated by few nodes (when compared to the end
|
|
nodes). Most national providers are one or two router hops away from every
|
|
large network. Through aggregation in networks below, national providers
|
|
provide fully (or at least we hope) aggregated routing information. In a
|
|
strict hierarchy, only one router on any given hierarchy level can advertise
|
|
reachability to a specific portion of the network. In the current state of
|
|
the Internet, multiple routers can advertise reachability information. For
|
|
example, Sprint announces 131.169.0.0/16 out to Digex, MCI, and BBN. Though
|
|
this breaks some of the benefits of a strict hierarchy, it confers other
|
|
benefits. This scheme allows for distributed control of routing information
|
|
instead of depending on the node above. Also, nodes on the same level are
|
|
often interconnected to aid in the dissemination of routing information.
|
|
|
|
|
|
----[ Aggregation
|
|
|
|
As discussed slightly before, aggregation allowed the internet to reduce the
|
|
size of its external reachability tables. Before, the granularity of route
|
|
announcements allowed for only /8, /16, and /24 (octet boundaries). Now, with
|
|
CIDR you could use variable length subnet masks. The only requirement was
|
|
that they fall on one of the 32-bit boundaries of the IP address.
|
|
|
|
Classless routing not only allows us to minimize routing table space, it also
|
|
allows us to divide up large chunks of unused space into manageable pieces.
|
|
Much of the Class A space is terribly under-utilized. With this scheme one
|
|
can more efficiently allocate IP addresses to providers/netizens. The American
|
|
Registry of Internet Numbers (ARIN) controls the allocation of IP addresses
|
|
within the United States.
|
|
|
|
Aggregation helps alleviate the problems of not being in a strict hierarchical
|
|
structure. It allows the least amount of route table specificity at each
|
|
level (assuming the routers on that level choose to fully aggregate their
|
|
announcements.) The less specific aggregates do not necessarily indicate the
|
|
position of a router in the hierarchy. For example, a university may announce
|
|
a /8 and be 3 hops away from the national backbone.
|
|
|
|
A problem with aggregates can be found when we examine candidate route
|
|
specificity. If ISP A only has address space from within the allocated block
|
|
to their parent P, then aggregation could cause problems if ISP A wanted to
|
|
multihome to parent Q. The problem comes in that ISP A is obligated to
|
|
advertise reachability only to their space. This would constitute them
|
|
announcing their address space to Parent Q. Assume that parent P aggregates
|
|
ISP A's space into a /16 for the sake of saving route announcements. Now, ISP
|
|
A would seem to have better reachability only through parent Q because of the
|
|
specificity of the route announcement (remember that more specific routes take
|
|
precedence over less specific routes). This would nullify the benefits of
|
|
multihoming in an attempt to distribute load over the two lines. In this case,
|
|
ISP A would ask parent P to announce a more specific destination which has a
|
|
length matching the length of the aggregate assigned to ISP A. Therefore, to
|
|
the world above parent P and parent Q, the path to ISP A looks equally
|
|
appealing.
|
|
|
|
|
|
----[ Exterior/Interior
|
|
|
|
It is important to look at how routing information is disseminated throughout
|
|
the network. It has already been discussed that we use separate routing
|
|
protocols (with their respective benefits/costs) to talk to the internal and
|
|
external world. However, these protocols cannot take orthogonal views on
|
|
routing information. In fact, the interplay between interior and exterior
|
|
routing protocols is what allows data to be effectively relayed to a
|
|
destination external to the network as well as to distribute external routing
|
|
information to all nodes on the internal network.
|
|
|
|
There are a few ways to ensure that each router has a consistent view of the
|
|
network. One is to distribute the external protocol into the internal
|
|
protocol. Thereby, the internal protocol instantly has routing information
|
|
injected in it for the best path to every external destination. Note that the
|
|
router which talks eBGP (or comparable protocol) only redistributes the route
|
|
that it installs in its routing table and not the other candidate routes which
|
|
may have been advertised to it.
|
|
|
|
Another approach is to inject the interior protocol into the exterior protocol.
|
|
Of course, this necessitates filtering at the entrance point to the exterior
|
|
protocol to prevent the announcement of all internal specifics. You can
|
|
accomplish internal routing dissemination inside an Interior Gateway Protocol
|
|
mesh. Because of the specifics of protocols like BGP, externally learned
|
|
routing information will only be propagated one logical hop within the network.
|
|
Therefore, every router that must know this external reachability information,
|
|
must be fully meshed with the eBGP speaking routers. Also, if other routers
|
|
are injecting information into the Exterior Gateway Protocol, the router
|
|
should be logically fully meshed with them.
|
|
|
|
|
|
----[ Multicast Routing Overview
|
|
|
|
What we have been talking about above is unicast routing. In unicast routing,
|
|
you assume that each packet has a single destination address. Assuming
|
|
infinite resources being available, unicast is a feasible solution for every
|
|
situation. However, there are situations when it would be advantageous to send
|
|
a packet to multiple destinations from a single source (point to multipoint) or
|
|
from multiple sources to multiple recipients (multipoint to multipoint).
|
|
|
|
The point of multicast is to provide a multicast group to which hosts can
|
|
subscribe and get the specific multicast feed. The multicast group is a single
|
|
IP address in class D space. Therefore, the senders and receivers are only
|
|
associated by a given multicast group address. Thus, the senders move their
|
|
data towards the multicast group address and the receivers specify that they
|
|
want to receive information from a given group address. In fact, the sender
|
|
is not required to know any information about the hosts that are receiving the
|
|
feed.
|
|
|
|
|
|
----[ Multicast vs. Unicast
|
|
|
|
If one was sending packets from a single source to a set of destinations, it
|
|
is important to investigate how multicast and unicast handle the distribution.
|
|
|
|
Assume that router A is sending data to routers B, D and E. A is at the top of
|
|
the hierarchy, B and C are at the second level of the hierarchy, and D and E
|
|
are below router B. With multiple unicast (fig 8), router A makes 3 copies of
|
|
the packet and sends them down link AB. Router B then sends one packet to a
|
|
host off of its ethernet and forwards the last two packets to routers D and E
|
|
whereupon those routers send the packets to the their respective hosts in the
|
|
multicast group.
|
|
|
|
Therefore, this transmission takes up 3 packets per second on link AB and 1
|
|
pps on links B->Host(b), router D and router E. In a multicast routing
|
|
implementation, assuming the same topology, we will have less packets. The
|
|
difference is that router A sends _one_ packet over link AB. Router B then
|
|
triplicates the packet and sends it to Host(b), router D and router E (fig 9).
|
|
One way for triplicating the packet is to simultaneously close crossbars on a
|
|
fully crossed switch fabric, thus sending data from one input to three outputs
|
|
simultaneously. As long as there is route redundancy, multicast is very
|
|
efficient because it minimizes redundant packets traveling to the same
|
|
next-hop. Simply, as long as there is route redundancy for the distributed
|
|
session (for example, an audio feed) you will see an advantage with multicast
|
|
over unicast.
|
|
|
|
Multicast Overview Example:
|
|
|
|
Multiple Unicast:
|
|
A A sends 3 packets to B.
|
|
/ \
|
|
/ \ 3
|
|
/ \
|
|
C B B sends 1 packet to each to D and E.
|
|
/ \
|
|
1 / \ 1
|
|
/ \
|
|
D E D and E send 1 packet to their respective
|
|
hosts.
|
|
|
|
[ fig 8 ]
|
|
|
|
Multicast:
|
|
|
|
A A sends 1 packet to B
|
|
/ \
|
|
/ \ 1
|
|
/ \
|
|
C B B duplicates the packet for its host;
|
|
/ \ therefore, there is 1 packet (at most) on
|
|
1 / \ 1 each link.
|
|
/ \
|
|
D E
|
|
|
|
[ fig 9 ]
|
|
|
|
|
|
This is a multicast topology rooted at node A. There is also a shortest path
|
|
from A to every destination in the multicast group. This is called the
|
|
shortest path multicast tree rooted in A. Data would like to shortest path on
|
|
the network layer. One problem with multicast sessions is that recipients
|
|
join and leave during a multicast session. This requires pruning of the
|
|
multicast "tree" so that packets do not clutter a link on which there is no
|
|
one requesting data from a given multicast group.
|
|
|
|
To detect if there are hosts on a particular broadcast LAN that are interested
|
|
in a multicast group, the router sends out Internet Group Management Protocol
|
|
(IGMP) messages. Each packet has a random reply time from which the host will
|
|
express interest. This is to prevent all hosts on a broadcast LAN from
|
|
responding at the same time to an IGMP query. Once one host desires to
|
|
receive data destined for a particular multicast groups, all other hosts which
|
|
desire to be in the multicast group can discard their replies because the
|
|
router knows to multicast all incoming packets destined for that group. The
|
|
host then configures its adapter to answer for the MAC address corresponding
|
|
to that group.
|
|
|
|
Multicast must also be functional outside of the broadcast LAN. A simple
|
|
solution to the problem is to give each router for which multicast is enabled
|
|
the multicast packet. This is called flooding. Basically, it functions by
|
|
forwarding the packet to every interface other than the one that the packet
|
|
arrived on. The inherent flaws in this approach is that there is packet
|
|
duplication as well as packets being sent to routers which have no hosts
|
|
subscribed to the multicast group. To clarify the duplication statement, if
|
|
Router A is physically meshed with routers B, C, and D and linked to its
|
|
upstream via serial, when router A receives the multicast packet, it floods it
|
|
out the interfaces to routers B, C, and D. These routers then flood the packet
|
|
out the interface other than the one they received the packet on (namely the
|
|
interface that connects them to A). This results in each of these routers
|
|
receiving two copies of the packet (other than the one they received from A)
|
|
in this exchange.
|
|
|
|
A solution to this problem can be found in a technique called Reverse Path
|
|
Forwarding (RPF). RPF specifies that the router forwards a packet with a
|
|
source address of X only if the interface which the router receives the
|
|
packet on corresponds to the shortest path that router has to source
|
|
X (fig 10). Therefore, in the above example, each of the meshed routers
|
|
_still_ receives 2 duplicate packets in the second step, but they refuse to
|
|
forward them because only the packet received from the interface directly
|
|
connected to A will be forwarded. As noted, RPF does not completely solve
|
|
the problem of packet duplication. To solve this, we must introduce
|
|
pruning. The idea is simplistic: inform neighbors that you do not wish to
|
|
receive multicast packets from source X to multicast group Y. You can also
|
|
specify prunes to a particular group. If a router tells its neighbors that it
|
|
did not desire to receive packets for group Y and then has a host which
|
|
desires to receive group Y, it sends a graft message to its neighbors
|
|
specifying that it be added into the multicast tree.
|
|
|
|
As a unicast aside, RPF can also be used to eliminate source address spoofing
|
|
in that the router will only forward packets from source Y if it is receiving
|
|
it on the interface which is the shortest path to source Y. Thus, if the
|
|
router receives packets from an external link which say their saddr ==
|
|
saddr(y), the router will not forward them because its shortest path to Y is
|
|
not from the external link.
|
|
|
|
RPF Example:
|
|
|
|
| <-- Point of ingress.
|
|
|
|
|
A-----------C
|
|
|\ /|
|
|
| \_______/ |
|
|
| / \ |
|
|
|/ \|
|
|
B-----------D
|
|
|
|
[ fig 10 ]
|
|
ABCD are physically meshed. When A distributes a packet to BCD, there is
|
|
no problem. Now, in the next step, B, C,and D forward the packet to each
|
|
of their respective neighbors (for B it would be C and D and ! A because
|
|
it received the packet from A). This results in C and D receiving 2
|
|
packets in this entire exchange. Note that C and D now do _not_ forward
|
|
the packet they have received from A through B because that is not their
|
|
shortest path to A. Their shortest path is their direct link.
|
|
|
|
|
|
----[ The Multicast Backbone (MBONE)
|
|
|
|
It would be myopic to assume that every router on the internet supports
|
|
multicast. Thus, when a router needed to find the shortest path to a
|
|
destination (for forwarding of a multicast packet) it could look in the
|
|
unicast routing table. Unfortunately (or fortunately depending on your
|
|
perspective) most routers on the Internet do not support multicast or do
|
|
not have it enabled by default. Therefore, until most routers support
|
|
multicast, it has been "layered" over IP and tunneled from multicast router to
|
|
multicast router (more specifically, the multicast header and data is
|
|
encapsulated in a unicast IP header). The tunnel (which bridges the gap of
|
|
unicast only routers between multicast routers) informs each end that some
|
|
packets will contain a multicast group in their payload. This allows data to
|
|
be routed by using unicast forwarding tables while at the same time preserving
|
|
the integrity of the multicast group id.
|
|
|
|
Because these multicast routers are not necessarily connected physically
|
|
(though they are tunneled logically), they must be connected by a multicast
|
|
routing protocol. This is necessary because the closest path via multicast
|
|
may not correspond to the shortest path over unicast only routers. Distance
|
|
Vector Multicast Routing Protocol (DVMRP) is an initial foray into this realm.
|
|
DVMRP distributes "unicast" routes to facilitate the construction of shortest
|
|
path trees. DVMRP uses the flood and prune method discussed above to discover
|
|
/maintain multicast trees. There is also a link state foray into this arena.
|
|
Multicast Open Shortest Path First (MOSPF) takes the LSP approach and
|
|
calculates shortest absolute path. One host off of a multicast router can
|
|
request to be in a multicast group. That router then distributes an LSP over
|
|
the network. Of course, MOSPF (being a link state protocol) runs into
|
|
salability problems. It is computationally expensive for a router to compute
|
|
reachability information for each end node router.
|
|
|
|
Core based trees (CBT) are an attempt to alleviate the problems that DVMRP and
|
|
MOSPF experience. The concept is that multicast routers send join requests to
|
|
core routers of arbitrary designation. When a router learns of a host which
|
|
wishes to join a specific multicast group, it forwards a packet to the core
|
|
multicast router. Every router along the way forwards the packet towards the
|
|
core router and marks the interface on which the packet arrives so that it
|
|
knows where to forward the multicast packets from the core. This solves the
|
|
problem of having to communicate topology among all of the endpoints. The
|
|
choice of a core multicast router is a non-trivial because all multicast
|
|
traffic for multicast routers branching off of it _must_ pass through the core
|
|
router.
|
|
|
|
|
|
----[ Protocol Independent Multicast
|
|
|
|
Protocol independent multicast (PIM). Pim is a balance between flood and
|
|
prune and CBT. When there are many multicast routers on a given network, it
|
|
is more efficient to use the flood-and-prune method. This network environment
|
|
is called "dense". On the contrary, sparse mode defines networks where there
|
|
are few multicast routers. In sparse mode, it is more efficient to use CBT
|
|
because the core router is not weighted in an environment when it 'polices'
|
|
few multicast routers. When most of network is comprised of multicast routers,
|
|
it is not prudent to require all sessions to be coordinated (and routed
|
|
through) a core. Sparse mode PIM has been adapted from CBT to allow data to
|
|
reach its destination via the core or through the shortest path tree.
|
|
Currently, the operator must define whether groups are sparse or dense instead
|
|
of leaving it up to the protocol. cisco systems' implementation of pim also
|
|
supports a middle ground called 'sparse-dense' mode.
|
|
|
|
|
|
----[ Border Gateway Protocol
|
|
|
|
There has been some mention of the Border Gateway Protocol (BGP) in this
|
|
document. BGP was groomed as the successor to the Exterior Gateway Protocol
|
|
(EGP). BGP is mainly an exterior routing protocol. It is used to communicate
|
|
with systems outside of the operator's control and both distribute and receive
|
|
network layer reachability information (NRLI) from the neighboring routers.
|
|
BGP must be a robust protocol which has the capability of quick convergence
|
|
while at the same time, not being influenced by frequent shifts in topology.
|
|
When you use BGP to receive routing information, you are depending on the
|
|
other networks to distribute correct information to your network.
|
|
|
|
A BGP speaking router communicates with its peers via TCP. TCP over IP is a
|
|
mechanism for guaranteeing the transmission of data over a best effort service
|
|
at the IP layer. The choice of TCP as the distribution mechanism for BGP
|
|
information is a point of contention. While TCP provides inherent checksums,
|
|
acknowledgments, retransmissions and duplicate suppression mechanisms for
|
|
received packets, it does not guarantee packet order or packet path. This can
|
|
lead to headaches for the router receiving this information.
|
|
|
|
BGP peers communicate with a variety of message formats. BGP speakers use the
|
|
OPEN message to establish a peering relationship with other speakers. BGP
|
|
speakers use the UPDATE message to transfer routing information between peers.
|
|
Update information includes all routes and their associated attributes.
|
|
KEEPALIVE messages assure that BGP peers are active. NOTIFICATION messages
|
|
inform peers of error conditions.
|
|
|
|
|
|
----[ BGP path selection
|
|
|
|
It is important that we understand the messages that constitute the Border
|
|
Gateway Protocol, but we are still left with the question of how BGP works on
|
|
the internet. One important area of clarification is in the BGP path selection
|
|
algorithm. This algorithm is how BGP decides which route to prefer and
|
|
attempt to install in the routing table.
|
|
|
|
This algorithm is employed when there are multiple paths to a destination. As
|
|
a failsafe, the first selection looks at the next hop and determines if it is
|
|
accessible. If the next hop is not accessible, it is important not to
|
|
consider that route as a candidate path to a destination because all data sent
|
|
to its next_hop will be blackholed. The second selection mechanism is the
|
|
weight of the route. Weight is a proprietary implementation to cisco Systems
|
|
routers and is analogous to local preference. If two routes have different
|
|
weights, the route with the largest weight is selected. Notice that the
|
|
selection mechanism is basically a logical if->then chain. If candidate paths
|
|
differ at a level of the selection algorithm, then the favorable path is
|
|
selected and the algorithm ceases trying to decide which path to prefer. The
|
|
next level is the local_pref attribute. This is a well known mandatory BGP
|
|
attribute. Much like weight, the path with the highest local_pref is
|
|
preferred. After local preference, the router selects the path that it
|
|
originated. If the router didn't originate the paths, then the path with the
|
|
shortest AS_PATH length should be selected. AS path length gauges the number
|
|
of autonomous systems that this routing information has traveled through.
|
|
The purpose of this selection relies in the assumption that the less ASNs the
|
|
route has passed through, the closer the destination. If all of the above
|
|
attributes are identical, then prefer path origin in this fashion IGP > EGP >
|
|
Incomplete. If the path origins are the same, prefer the path with the lowest
|
|
value MULTI_EXIT_DESCRIMINATOR (MED). MEDs are commonly used to distinguish
|
|
between multiple exit points to the same peer AS. If none of these attributes
|
|
are dissimilar, then prefer the path through the closest IGP neighbor. If
|
|
that fails, the tiebreaker is preferring the path with the lowest IP address
|
|
specified in the BGP router-id section discussed above.
|
|
|
|
This selection algorithm allows effective establishment of routing policy. If
|
|
I wanted to prefer routes from a certain AS over routes to another AS, I could
|
|
establish a route-map at both entry points of external routing information and
|
|
assign a higher LOCAL_PREF to the routes from the AS I want to favor.
|
|
Unfortunately, this does not provide much granularity. This means that all
|
|
traffic will be routed to the favorable AS and does not allow us to balance
|
|
the load over our multiple connections. If you allow path selection to
|
|
progress to the AS path length decision level, then you will get decent
|
|
(though not 50-50) load balancing to destinations. This of course is assuming
|
|
that you have providers with comparable customer routes and connectivity. If
|
|
you have a DS3 to MCI and a DS3 to the local BFE provider, nearly all traffic
|
|
will move out the MCI pipe if the BGP decision is allowed to progress down to
|
|
the AS path length category. At earlier selections, you can change the
|
|
preference of routes by using AS path access lists to select routes based on
|
|
as path regular expressions. For example, if you wanted to select all routes
|
|
that traversed UUnet and send them out your BFE provider, you could use a route
|
|
map to match an AS path access list which contained _701_ and set the
|
|
local_pref to 100 (or some value higher than the UUwho traversed paths from
|
|
MCI). This will force all traffic destined for UUwho to exit your AS over
|
|
your BFE DS3. While this affords you some granularity in load balancing, it
|
|
is often not optimal. Basically, you are forcing traffic out a path that it
|
|
would not normally select. If that BFE provider has many hops before it can
|
|
reach UUnet, you are forcing the traffic you send out that link to traverse
|
|
all of those hops and be subject to (possibly) more link congestion, latency,
|
|
etc.
|
|
|
|
Routing policy is something that requires the tweaking of many knobs. Much of
|
|
the tweaking I have described above pertains to cisco Systems routers. It is
|
|
important to understand that you must think through routing policy before you
|
|
implement it. You must evaluate what load balancing you want, what traffic
|
|
symmetry you want, and what general quality of service your traffic will
|
|
receive because of your decisions.
|
|
|
|
For information more specific than this, read the BGP RFC or current BGPv4
|
|
internet draft [1].
|
|
|
|
|
|
----[ Open Shortest Path First v2 (OSPFv2)
|
|
|
|
We are not going into great detail about OSPF. It is a link state routing
|
|
algorithm. As noted above, link state algorithms route on flat space (no
|
|
hierarchy). OSPF is an improvement over the vanilla LS protocol because it
|
|
provides areas of maintenance for hierarchy purposes. Areas distribute full
|
|
information internally by running a separate OSPF process with its area ID.
|
|
Each router has an identical link state database with other routers within its
|
|
area, but not with external routers. Each area operates in an autonomous
|
|
state and transfers inter-area information at junction routers called area
|
|
border routers. These routers are in two or more areas and help distribute
|
|
information between these areas. The router has separate link state databases
|
|
for each area to which it is connected.
|
|
|
|
OSPFv2's main advantage is that it supports Variable Length Subnet Masks
|
|
(VLSM). This means that a router can advertise reachability with more
|
|
granularity than a scheme which advertised host reachability. Therefore, if
|
|
the router can distribute packets to all hosts from 206.4.4.1 -> 206.4.5.254
|
|
it advertises reachability to 206.4.4.0/23 instead of each classful network
|
|
separately or each host separately. This obviously saves immensely on link
|
|
state database size and processing power required.
|
|
|
|
For information more specific than this, read the current OSPFv2 RFC or
|
|
internet draft [2].
|
|
|
|
|
|
----[ References
|
|
|
|
[1] Rehkter, Y., Li, T., " A Border Gateway Protocol 4 (BGP-4)",
|
|
draft-ietf-idr-bgp4-07.txt, February 1998.
|
|
|
|
[2] Moy, J., "OSPF Version 2", draft-ietf-ospf-vers2-02.txt,
|
|
January 1998.
|
|
|
|
----[ EOF
|
|
|