Layer-3 Discovery and Liveness

Arrcus & Internet Initiative Japan, 5147 Crystal Springs, Bainbridge Island, WA 98110, US, randy@psg.com
Arrcus, Inc, sra@hactrn.net
Arrcus, 2077 Gateway Place, Suite #400, San Jose, CA 95119, US, keyur@arrcus.com

In Massive Data Centers, BGP-SPF and similar routing protocols
are used to build topology and reachability databases. These
protocols need to discover IP Layer-3 attributes of links, such as
neighbor IP addressing, logical link IP encapsulation abilities, and
link liveness. This Layer-3 Discovery and Liveness protocol
collects these data, which may then be disseminated using BGP-SPF
and similar protocols.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in
BCP 14 when,
and only when, they appear in all capitals, as shown here.

The Massive Data Center (MDC) environment presents unusual
problems of scale, e.g. O(10,000) forwarding devices, while its
homogeneity presents opportunities for simple approaches.
Approaches such as Jupiter Rising use a
central controller to deal with scaling, while BGP-SPF provides massive scale-out without
centralization using a tried and tested scalable distributed control
plane, offering a scalable routing solution in Clos and similar environments.
But BGP-SPF and similar higher level device-spanning protocols
need logical link
state and addressing data from the network to build the routing
topology. They also need prompt but prudent reaction to (logical)
link failure.

Layer-3 Discovery and Liveness (L3DL) provides brutally simple
mechanisms for devices to

   o  Discover each other's unique endpoint identification,

   o  Discover mutually supported layer-3 encapsulations, e.g.
      IP/MPLS,

   o  Discover Layer-3 IP and/or MPLS addressing of interfaces of the
      encapsulations,

   o  Present these data, using a very restricted profile of a BGP-LS
      API, to BGP-SPF which computes the
      topology and builds routing and forwarding tables,

   o  Enable Layer-3 link liveness such as BFD,

   o  Provide Layer-2 keep-alive messages for session continuity, and
      finally

   o  Provide for authenticity verification of protocol messages.

In this document, the use case for L3DL is for point to point
links in a datacenter Clos in order to exchange the data needed for
BGP-SPF bootstrap and
continuity. Once layer-2 connectivity has been leveraged to get
layer-3 addressability and forwarding capabilities, normal layer-3
forwarding and routing can take over.

L3DL might be found to be more widely applicable to a range of
routing and similar protocols which need layer-3 discovery and
characterisation.

Even though it concentrates on the inter-device layer, this
document relies heavily on routing terminology. The following
attempts to clarify the use of some possibly confusing terms:
   ASN:  Autonomous System Number, a BGP identifier for an originator of
      Layer-3 routes, particularly BGP announcements.

   BGP-LS:  A mechanism by which link-state and TE
      information can be collected from networks and shared with
      external components using the BGP routing protocol.

   BGP-SPF:  A hybrid protocol using BGP transport but
      a Dijkstra Shortest Path First decision process.

   Clos:  A hierarchic subset of a crossbar switch
      topology commonly used in data centers.

   Datagram:  The L3DL content of a single Layer-2
      frame, sans Ethernet framing. A full L3DL PDU may be packaged in
      multiple Datagrams.

   Encapsulation:  Address Family Indicator and
      Subsequent Address Family Indicator (AFI/SAFI). I.e. classes of
      layer-2.5 and 3 addresses such as IPv4, IPv6, MPLS, etc.

   Frame:  A Layer-2 Ethernet packet.

   Logical Link:  A logical connection between
      two logical ports on two devices. E.g. two VLANs between the same
      two ports are two links.

   LLEI:  Logical Link Endpoint Identifier, the unique
      identifier of one end of a logical link.

   MAC Address:  48-bit Layer-2 addresses are assumed
      since they are used by all widely deployed Layer-2 network
      technologies of interest, especially Ethernet.

   MDC:  Massive Data Center, commonly composed of
      thousands of Top of Rack Switches (TORs).

   MTU:  Maximum Transmission Unit, the size in octets
      of the largest packet that can be sent on a medium, see 1.3.3.

   PDU:  Protocol Data Unit, an L3DL application layer
      message. A PDU's content may need to be broken into multiple
      Datagrams to make it through MTU or other restrictions.

   RouterID:  A 32-bit identifier unique in the
      current routing domain.

   Session:  An established, via OPEN PDUs, session
      between two L3DL capable link end-points.

   SPF:  Shortest Path First, an algorithm for finding
      the shortest paths between nodes in a graph; AKA Dijkstra's
      algorithm.

   System Identifier:  An eight octet ISO System
      Identifier a la System ID.

   TOR:  Top Of Rack switch, aggregates the servers in
      a rack and connects to aggregation layers of the Clos tree, AKA
      the Clos spine.

   ZTP:  Zero Touch Provisioning gives devices initial
      addresses, credentials, etc. on boot/restart.

L3DL is primarily designed for a Clos type datacenter scale and
topology, but can accommodate richer topologies which contain
potential cycles.

While L3DL is designed for the MDC, there are no inherent reasons
it could not run on a WAN. The authentication and authorization
needed to run safely on a WAN need to be considered, and the
appropriate level of security options chosen.

L3DL assumes a new IEEE assigned EtherType (TBD).

The number of addresses of one Encapsulation type on an interface
link may be quite large given a TOR with tens of servers, each
server having a few hundred micro-services, resulting in an
inordinate number of addresses. And highly automated micro-service
migration can cause serious address prefix disaggregation, resulting
in interfaces with thousands of disaggregated prefixes.

Therefore the L3DL protocol is session oriented and uses
incremental announcement and withdrawal with session restart, a la
BGP.

   o  Devices discover each other on logical links

   o  Logical Link Endpoint Identifiers (LLEIs) are exchanged

   o  Layer-2 Liveness checks may be started

   o  Encapsulation data are exchanged and IP-Level Liveness checks
      enabled

   o  A BGP-like upper layer protocol is assumed to use the
      identifiers and encapsulation data to discover and build a topology
      database

There are two protocols, the inter-device (left-right in the
diagram) per-link layer-3 discovery and the API to the upper level
BGP-like routing protocol (up-down in the above diagram):

   o  Inter-device PDUs are used to exchange device and logical link
      identities and layer-2.5 (MPLS) and 3 identifiers (not payloads),
      e.g. device IDs, port identities, VLAN IDs, Encapsulations, and IP
      addresses.

   o  A Link Layer to BGP API presents these data up the stack to
      a BGP protocol or another device-spanning upper layer protocol,
      presenting them using the BGP-LS BGP-like data format.

The upper layer BGP family routing protocols cross all the
devices, though they are not part of these L3DL protocols.

To simplify this document, Layer-2 framing is not shown. L3DL is
about layer-3.

Two devices discover each other and their respective identities
by sending multicast HELLO PDUs. To assure
discovery of new devices coming up on a multi-link topology, devices
on such a topology, and only on a multi-link topology, send periodic
HELLOs forever.

Once a new device is recognized, both devices attempt to
negotiate and establish a session by sending unicast OPEN PDUs
to the source MAC addresses (plus VIDs if
VLANs) of the received HELLOs. Once a session is established
through the OPEN exchange, the Encapsulations configured on an end point may be announced and
modified. Note that these are only the encapsulation and addresses
configured on the announcing interface; though a device's loopback
and overlay interface(s) may also be announced. When two devices on
a link have compatible Encapsulations and addresses, i.e. the same
AFI/SAFI and the same subnet, the link is announced via the BGP-LS
API.

The HELLO is a priming message sent on
all configured logical links. It is a small L3DL PDU encapsulated
in an Ethernet multicast frame with the simple goal of discovering
the identities of logical link endpoint(s) reachable from a
Logical Link Endpoint.

The HELLO and OPEN PDUs, which are used
to discover and exchange detailed Logical Link Endpoint
Identifiers, LLEIs, and the ACK/ERROR PDU, are mandatory; other
PDUs are optional, though at least one encapsulation SHOULD be
agreed at some point.

The following is a ladder-style diagram of the L3DL protocol
exchanges:

L3DL PDUs are carried by a simple transport layer which allows
long PDUs to occupy many Ethernet frames. The L3DL content of a
single Ethernet frame, exclusive of Ethernet framing data, is
referred to as a Datagram.

The L3DL Transport Layer encapsulates each Datagram using a
common transport header.

If a PDU does not fit in a single Datagram, it is broken into
multiple Datagrams and reassembled by the receiver a la Section 2.3 Fragmentation.

This is not classic 'fragmentation', but rather decomposition at
the origin to allow PDU payloads larger than the frame allows.
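
As a rough illustration of this decomposition, the following Python sketch splits a PDU into Datagrams at the origin and reassembles them at the receiver. It assumes the transport header is laid out in the order its fields are described in this section (8-bit Version, 16-bit Transmission Sequence Number, a one-bit last-Datagram flag sharing 24 bits with the 23-bit Datagram Number, 16-bit Datagram Length, 32-bit Checksum). The exact bit positions, the names split_pdu and reassemble, and the zero checksum placeholder are assumptions made for the example and are not taken from this specification.

```python
import struct

HEADER_FMT = "!BH3sHI"                    # assumed field order and widths
HEADER_LEN = struct.calcsize(HEADER_FMT)  # 12 octets
VERSION = 0
MAX_DATAGRAMS = 1 << 23                   # Datagram Number is 23 bits


def split_pdu(pdu: bytes, tsn: int, max_datagram: int) -> list:
    """Decompose one PDU into Datagrams of at most max_datagram octets."""
    if max_datagram > 0xFFFF:
        raise ValueError("Datagram Length is a 16-bit field")
    room = max_datagram - HEADER_LEN
    if room <= 0:
        raise ValueError("max_datagram leaves no room for payload")
    chunks = [pdu[i:i + room] for i in range(0, len(pdu), room)] or [b""]
    if len(chunks) > MAX_DATAGRAMS:
        raise ValueError("PDU needs more Datagrams than can be numbered")
    datagrams = []
    for number, chunk in enumerate(chunks):
        last = 1 if number == len(chunks) - 1 else 0
        # Assumption: the flag is the high bit of the 24-bit group.
        group = ((last << 23) | number).to_bytes(3, "big")
        header = struct.pack(HEADER_FMT, VERSION, tsn & 0xFFFF, group,
                             HEADER_LEN + len(chunk),
                             0)  # checksum left as a zero placeholder
        datagrams.append(header + chunk)
    return datagrams


def reassemble(datagrams) -> bytes:
    """Rebuild a PDU from its Datagrams, in whatever order they arrived."""
    parts, last_number = {}, None
    for datagram in datagrams:
        _ver, _tsn, group, length, _ck = struct.unpack(
            HEADER_FMT, datagram[:HEADER_LEN])
        group = int.from_bytes(group, "big")
        number = group & 0x7FFFFF
        parts[number] = datagram[HEADER_LEN:length]
        if group >> 23:
            last_number = number
    if last_number is None or set(parts) != set(range(last_number + 1)):
        raise ValueError("PDU is incomplete")
    return b"".join(parts[n] for n in range(last_number + 1))
```

With these assumptions, a 2,560-octet PDU sent with a 1,500-octet Datagram limit yields two Datagrams, and only the second has the flag set.
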
There are no intermediate devices capable of further fragmentation
or reassembly.

A PDU might need a large number of frames to be sent. As
fragments are not ACK paced (as PDUs are), to avoid overwhelming
bursts, the sender should pace fragments of a large PDU.

L3DL is carrying a relatively small amount of data on relatively
high bandwidth links, and at a time when the link is not active with
other data as it does not yet have layer-3 connectivity. So
congestion is not considered a sufficiently significant risk to
warrant additional complexity.

Should a PDU need to be retransmitted, it MUST be sent as the
identical Datagram set as the original transmission. The
Transmission Sequence Number informs the receiver that it is the
same PDU.

The fields of the L3DL Transport Header are as follows:
   Version:  Eight-bit Version number of the protocol,
      currently 0. Values other than 0 MUST be treated as an error.
      The protocol version needs to be in one and only one place, so it
      is in the Datagram as opposed to, for example, the PDU header.

   Transmission Sequence Number:  A 16-bit strictly
      increasing unsigned integer identifying this PDU, possibly across
      retransmissions, that wraps from 2^16-1 to 0. The initial value
      is arbitrary. See DNS Serial Number
      Arithmetic for too much detail on comparing and incrementing a
      wrapping sequence number.

   Last Datagram bit:  A bit that is set to one if this Datagram is the
      last Datagram of the PDU. For a PDU which fits in
      only one Datagram, it is set to one. Note that this is the
      inverse of the marking technique used by .

   Datagram Number:  A monotonically increasing 23-bit
      value which starts at zero for each PDU. This is used to
      reassemble frames into PDUs a la Section
      2.3. Note that this limits an L3DL PDU to 2^23 frames.

   Datagram Length:  Total number of octets in the
      Datagram including all payloads and fields. Note that this limits
      a Datagram to 2^16 octets; though Ethernet framing is likely to
      impose a smaller limit.

   Checksum:  A 32 bit hash over the Datagram to detect
      bit flips.

      If a Datagram fails checksum verification, the Datagram is
      invalid and SHOULD be silently discarded. The sender will
      retransmit the PDU, and the receiver can assemble it.

   Payload:  The PDU being transported or a fragment
      thereof.

To avoid the need for a receiver to reassemble two PDUs at the
same time, a sender MUST NOT send a subsequent PDU when a PDU is
already in flight and not yet acknowledged, assuming it is an ACKed
PDU Type.

There is a reason conservative folk use a checksum in UDP. And
as many operators stretch to jumbo frames (over 1,500 octets) longer
checksums are the prudent approach.

For the purpose of computing a checksum, the checksum field
itself is assumed to be zero.

The following code describes a suggested algorithm. This
specification avoids mandatory-to-implement algorithms, algorithm agility, etc.
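
Purely to illustrate the rule that the checksum field is treated as zero while the hash is computed, and not as the algorithm this specification suggests, the following Python sketch uses CRC-32 from the standard zlib module as a stand-in 32-bit hash. The checksum offset of eight octets assumes the header fields appear in the order they are listed above; that offset and the names compute_checksum, seal, and verify are assumptions made for the example.

```python
import zlib

CHECKSUM_OFFSET = 8   # assumed: after Version, TSN, flag+number, and length
CHECKSUM_LEN = 4
HEADER_LEN = CHECKSUM_OFFSET + CHECKSUM_LEN


def compute_checksum(datagram: bytes) -> int:
    """Hash the Datagram with its checksum field taken to be zero."""
    if len(datagram) < HEADER_LEN:
        raise ValueError("Datagram shorter than the transport header")
    zeroed = (datagram[:CHECKSUM_OFFSET] + bytes(CHECKSUM_LEN)
              + datagram[HEADER_LEN:])
    # Stand-in hash only; any 32-bit hash used consistently would do here.
    return zlib.crc32(zeroed) & 0xFFFFFFFF


def seal(datagram: bytes) -> bytes:
    """Return a copy of the Datagram with the checksum field filled in."""
    value = compute_checksum(datagram).to_bytes(CHECKSUM_LEN, "big")
    return datagram[:CHECKSUM_OFFSET] + value + datagram[HEADER_LEN:]


def verify(datagram: bytes) -> bool:
    """True if the stored checksum matches; a receiver discards otherwise."""
    stored = int.from_bytes(datagram[CHECKSUM_OFFSET:HEADER_LEN], "big")
    return stored == compute_checksum(datagram)
```

Because the field is zeroed before hashing, computing the checksum over a sealed Datagram gives the same value as over the unsealed one, which is what lets the receiver verify in place.
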
What matters is that the same algorithm is used consistently in any
deployment.

The basic L3DL application layer PDU is a typical TLV (Type
Length Value) PDU. It includes a signature to provide optional
integrity and authentication. It may be broken into multiple
Datagrams.

The fields of the basic L3DL header are as follows:

   PDU Type:  An integer differentiating PDU payload
      types.

   Payload Length:  Total number of octets in the
      Payload field.

   Payload:  The application layer content of the L3DL
      PDU.

   Sig Type:  The type of the Signature. Type 0, a null signature, is defined in
      this document.

      Sig Type 0 indicates a null Signature. For a trivial PDU such
      as KEEPALIVE, the underlying Datagram checksum may be sufficient
      for integrity, though it lacks authenticity.

      Other Sig Types may be defined in other documents.

   Signature Length:  The length of the Signature,
      possibly including padding, in octets. If Sig Type is 0,
      Signature Length MUST be 0.

   Signature:  The result of running the signature
      algorithm specified in Sig Type over all octets of the PDU except
      for the Signature itself.

L3DL discovers neighbors on logical links and establishes
sessions between the two ends of all consenting discovered logical
links. A logical link is described by a pair of Logical Link
Endpoint Identifiers, LLEIs.

An LLEI is a variable length descriptor which could be an ASN, a
classic RouterID, a catenation of the two, an eight octet ISO System
Identifier, or any other identifier unique
to a single logical link endpoint in the topology.

An L3DL deployment will choose and define an LLEI which suits its
needs, simple or complex. Examples of two extremes follow:

A simplistic view of a link between two devices is two ports,
identified by unique MAC addresses, carrying a layer-3 protocol
conversation. In this case, the MAC addresses might suffice for the
LLEIs.

Unfortunately, things can get more complex. Multiple VLANs can
run between those two MAC addresses. In practice, many real devices
use the same MAC address on multiple ports and/or
sub-interfaces.

Therefore, in the general circumstance, a fully described LLEI
might be as follows:

System Identifier is an eight
octet identifier unique in the entire operational space. Routers
and switches usually have internal MAC Addresses which can be padded
with high order zeros and used if no System ID exists on the device.
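
A minimal sketch of building such an LLEI follows, in Python. It assumes the fully described LLEI is the eight octet System Identifier followed by the ifIndex discussed in this section, encoded as a 32-bit big-endian integer; the ifIndex width and the function names system_id_from_mac and make_llei are assumptions for illustration only.

```python
import struct


def system_id_from_mac(mac: str) -> bytes:
    """Pad a 48-bit MAC address with high order zeros to eight octets."""
    raw = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(raw) != 6:
        raise ValueError("expected a 48-bit MAC address")
    return raw.rjust(8, b"\x00")


def make_llei(system_id: bytes, if_index: int) -> bytes:
    """Catenate System Identifier and ifIndex, both big-endian."""
    if len(system_id) != 8:
        raise ValueError("System Identifier must be eight octets")
    # Assumption: ifIndex is carried as an unsigned 32-bit integer.
    return system_id + struct.pack("!I", if_index)
```

For example, make_llei(system_id_from_mac("00:1b:21:3c:4d:5e"), 7) produces a twelve octet identifier under these assumptions.
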
If no unique identifier is burned into a device, the local L3DL
configuration SHOULD create and assign a unique one, likely by
configuration.

ifIndex is the SNMP identifier of the (sub-)interface. This uniquely identifies the port.

For a layer-3 tagged sub-interface or a VLAN/SVI interface,
ifIndex is that of the logical sub-interface, so no further
disambiguation is needed.

L3DL PDUs learned over VLAN-ports may be interpreted by upper
layer-3 routing protocols as being learned on the corresponding
layer-3 SVI interface for the VLAN.

LLEIs are big-endian.

The HELLO PDU is unique in that it is encapsulated in a multicast
Ethernet frame. It solicits response(s) from other LLEI(s) on the
link. The
destination multicast MAC Address to be used MUST be one of the
following, see Clause 9.2.2 of :
   o  Nearest Bridge = Propagation
      constrained to a single physical link; stopped by all types of
      bridges (including MPRs (media converters)). This SHOULD be used
      when the link is known to be a simple point to point link.

   o  When a switch receives a frame with
      a multicast destination MAC it does not recognize, it forwards to
      all ports. This destination MAC SHOULD be sent when the interface
      is known to be connected to a switch. This SHOULD be used when the link may be a
      multi-point link.

All other L3DL PDUs are encapsulated in unicast frames, as the
peer's destination MAC address is known after the HELLO
exchange.

When an interface is turned up on a device, it SHOULD issue a
HELLO if it is to participate in L3DL sessions.

If a constrained Nearest Bridge destination address has been
configured for a point-to-point interface, see above, then the HELLO
SHOULD NOT be repeated once a session has been created by an
exchange of OPENs.

If the configured destination address is one that is propagated
by switches, the HELLO SHOULD be repeated at a configured interval,
with a default of 60 seconds. This allows discovery by new devices
which come up on the layer-2 mesh. In this multi-link scenario, the
operator should be aware of the trade-off between timer tuning and
network noise and adjust the inter-HELLO timer accordingly.

If more than one device responds, one adjacency is formed for
each unique source LLEI response. L3DL treats each adjacency as a
separate logical link.

When a HELLO is received from a source MAC address (plus VID if
VLAN) with which there is no established L3DL session, the receiver
SHOULD respond by sending an OPEN PDU to the source MAC address
(plus VID). The two devices establish an L3DL session by exchanging
OPEN PDUs.

To ameliorate possible load spikes during bootstrap or event
recovery, there SHOULD be a jittered delay between receipt of a
HELLO and issue of the OPEN. The default delay range SHOULD be zero
to five seconds, and MUST be configurable.

If a HELLO is received from a MAC address with which there is an
established session, the HELLO should be dropped.

The Payload Length is zero as there is no payload.

HELLO PDUs can not be signed as keying material has yet to be
exchanged. Hence the signature MUST always be the null type.

Each device has learned the other's MAC Address from the HELLO
exchange. Therefore the OPEN and all
subsequent PDUs MUST be unicast, as opposed to the HELLO's multicast
frame.

The Payload Length is the number of octets in all fields of the
PDU from the Nonce through the Serial Number, not including the
three final signature fields.

The Nonce enables detection of a duplicate OPEN PDU. It SHOULD
be either a random number or a high resolution timestamp. It is
needed to prevent session closure due to a repeated OPEN caused by a
race or a dropped or delayed ACK.

My LLEI is the sender's LLEI.

AttrCount is the number of attributes in the Attribute List.
Attributes are single octets the semantics of which are
operator-defined.

A node may have zero or more operator-defined attributes, e.g.:
spine, leaf, backbone, route reflector, arabica, ...

Attribute syntax and semantics are local to an operator or
datacenter; hence there is no global registry. Nodes exchange
their attributes only in the OPEN PDU.

Auth Type is the Signature algorithm suite.

Key Length is a 16-bit field denoting the length in octets of the
Key itself, not including the Auth Type or the Key Length. If the
Auth Type is zero, then the Key Length MUST also be zero, and there
MUST be no Key data.

The Key is specific to the operational environment. A failure to
authenticate is a failure to start the L3DL session; an ERROR PDU
MUST be sent (Error Code 3), and HELLOs MUST be restarted.

Although delay and jitter in responding with an OPEN were
specified above, beware of load created by long strings of
authentication failures and retries. A configurable failure count
limit (default 8) SHOULD result in giving up on the connection
attempt.

The Serial Number is a monotonically increasing 32-bit value
representing the sender's state at the time of sending the last PDU.
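
The Serial Number handling described in this section, skipping zero when incrementing and comparing values in a wrap-aware way, can be sketched in Python as below. The comparison follows DNS Serial Number Arithmetic as defined in RFC 1982; the function names next_serial and serial_gt are hypothetical, and an integer counter is assumed rather than a timestamp.

```python
SERIAL_BITS = 32
MODULUS = 1 << SERIAL_BITS
HALF = 1 << (SERIAL_BITS - 1)


def next_serial(current: int) -> int:
    """Increment a Serial Number, stepping over zero on wrap.

    Zero is skipped because an OPEN uses zero to mean 'send everything'.
    """
    nxt = (current + 1) % MODULUS
    return nxt if nxt != 0 else 1


def serial_gt(a: int, b: int) -> bool:
    """True if a is greater than b in serial number arithmetic.

    Values exactly half the number space apart are not ordered, so the
    function returns False for both directions in that case.
    """
    return a != b and ((a - b) % MODULUS) < HALF
```

A resuming sender could then select only those PDUs whose Serial Number satisfies serial_gt(pdu_serial, requested_serial), where requested_serial is the non-zero value received in the OPEN.
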
It may be an integer, a timestamp, etc. If incrementing the Serial
Number would cause it to be zero, it should be incremented
again.

On session restart (new OPEN), a receiver MAY send the last
received Serial Number to tell the sender to only send data with a
Serial Number greater (in the serial number arithmetic sense), or
send a Serial Number of zero to request all data.

The Serial Number supports session resumption in anticipation of
peers having a very large amount of state they would prefer not to
re-exchange because of some glitch. The Serial Number is not
expected to wrap for a considerable time, e.g. days or weeks. But
to address the rare case it does, DNS
Serial Number Arithmetic should be used as it is in the Transmission
Sequence Number.

This allows a sender of an OPEN to tell the receiver that the
sender would like to resume a session and that the receiver only
needs to send data starting with the PDU with the lowest Serial
Number greater (in the serial number arithmetic sense) than the one
sent in the OPEN. If the sender is not trying to resume a dropped
session, the Serial Number MUST be zero.

If the receiver of an OPEN PDU with a non-zero Serial Number can
not resume from the requested point, it should return an ACK with an
Error Code of 2, Session could not be continued. The sender of the
failing OPEN PDU SHOULD then send an OPEN PDU with a Serial Number
of zero.

The Signature fields, in
an asymmetric key environment, serve as a proof of possession of the
signing auth data by the sender.

Once two logical link endpoints know each other, and have ACKed
each other's OPEN PDUs, Layer-2 KEEPALIVEs MAY be started to ensure Layer-2 liveness and
keep the session semantics alive. The timing and acceptable drop of
KEEPALIVE PDUs are discussed later in this document.

If a sender of OPEN does not receive an ACK of the OPEN PDU, then
it MUST resend the same OPEN PDU, with the same Nonce. Resending
an unacknowledged OPEN PDU, like other ACKed PDUs, SHOULD use
exponential back-off.

If a properly authenticated OPEN arrives at L3DL speaker A with a
new Nonce from an LLEI, speaker B, with which A believes it already
has an L3DL session (OPENs have already been exchanged), and the
Serial Number in the OPEN PDU is non-zero, speaker A SHOULD
establish a new sending session by sending an OPEN with the Serial
Number being the same as that of A's last sent and ACKed PDU. A
MUST resume sending encapsulations etc. subsequent to the requested
Serial Number. And B MUST retain all previously discovered
encapsulation and other data received from A.

If a properly authenticated OPEN arrives with a new Nonce from an
LLEI with which the receiving logical link endpoint believes it
already has an L3DL session (OPENs have already been exchanged), and
the Serial Number in the OPEN is zero, then the receiver MUST assume
that the sending LLEI or entire device has been reset. All
previously discovered encapsulation data MUST NOT be kept and MUST
be withdrawn via the BGP-LS API and the recipient MUST respond with
a new OPEN.

The ACK PDU acknowledges receipt of a PDU and reports any error
condition which might have been raised.

The ACK acknowledges receipt of an OPEN, Encapsulation, VENDOR
PDU, etc.

The ACKed PDU is the PDU Type of the PDU being acknowledged,
e.g., OPEN, one of the Encapsulations, etc.

If there was an error processing the received PDU, then the EType
is non-zero. If the EType is zero, Error Code and Error Hint MUST
also be zero.

A non-zero EType is the receiver's way of telling the PDU's
sender that the receiver had problems processing the PDU. The Error
Code and Error Hint will tell the sender more detail about the
error.

The decimal value of EType gives a strong hint how the receiver
sending the ACK believes things should proceed:

   0 - No Error, Error Code and Error Hint MUST be zero

   1 - Warning, something not too serious happened, continue

   2 - Session should not be continued, try to restart

   3 - Restart is hopeless, call the operator

   4-15 - Reserved

The Error Codes note protocol failures. Someone stuck in the 1990s might think of the
catenation of EType and Error Code as an echo of 0x1zzz, 0x2zzz,
etc. They might be right; or not.

The Error Hint, an arbitrary 16 bits, is any additional data the
sender of the error PDU thinks will help the recipient or the
debugger with the particular error.

If a PDU sender expects an ACK, e.g. for an OPEN, an
Encapsulation, a VENDOR PDU, etc., and does not receive the ACK
for a configurable time (default one second), and the interface is
live at layer-2, the sender resends the PDU using exponential
back-off. This cycle MAY be
repeated a configurable number of times (default three) before it
is considered a failure. The session MAY be considered closed
in the case of this ACK failure.

If the link is broken at layer-2, retransmission MAY be retried
when the link is restored.

Once the devices know each other's LLEIs, know each other's upper
layer (L2.5 and L3) identities, have means to ensure link state,
etc., the L3DL session is considered established, and the devices
SHOULD exchange L3 interface encapsulations, L3 addresses, and L2.5
labels.

The Encapsulation types the peers exchange may be IPv4, IPv6, MPLS IPv4, MPLS IPv6, and/or
possibly others not defined here.

The sender of an Encapsulation PDU MUST NOT assume that the peer
is capable of the same Encapsulation Type. An ACK merely acknowledges receipt. Only if both peers
have sent the same Encapsulation Type is it safe for Layer-3
protocols to assume that they are compatible for that type.

A receiver of an encapsulation might recognize an addressing
conflict, such as both ends of the link trying to use the same
address. In this case, the receiver SHOULD respond with an error
(Error Code 2) ACK. As there may be other usable addresses or
encapsulations, this error might be logged and the session continued, letting an upper
layer topology builder deal with what works.

Further, to consider a logical link of a type to formally be
established so that it may be pushed up to upper layer protocols,
the addressing for the type must be compatible, e.g. on the same
IP subnet.

The header for all encapsulation PDUs is as follows:

An Encapsulation PDU describes zero or more addresses of the
encapsulation type.

The 24-bit Count is the number of Encapsulations in the
Encapsulation list.

The Serial Number is a monotonically increasing 32-bit value
representing the sender's state in time. It may be an integer, a
timestamp, etc. On session restart (new OPEN), a receiver MAY
send the last received Serial Number to tell the sender to only
send newer data.

If a sender has multiple links on the same interface, separate
state (data, ACKs, etc.) must be kept for each peer session.

Over time, multiple Encapsulation PDUs may be sent for an
interface as configuration changes.

If the length of an Encapsulation PDU exceeds the Datagram size
limit on media, the PDU is broken into multiple Datagrams.

The Receiver MUST acknowledge the Encapsulation PDU with a
Type=3, ACK PDU with the Encapsulation Type
being that of the encapsulation being announced.

If the Sender does not receive an ACK in a configurable
interval (default one second), and the interface is live at
layer-2, it SHOULD retransmit. After a user configurable number
of failures (default three), the L3DL session should be considered
dead and the OPEN process SHOULD be restarted.

If the link is broken at layer-2, retransmission MAY be retried
if data have not changed in the interim.

The Encapsulation Flags are a sequence of bit fields as
follows:

Each encapsulation in an Encapsulation PDU of Type T may
announce new and/or withdraw old encapsulations of Type T. It
indicates this with the Ann/With Encapsulation Flag, Announce ==
1, Withdraw == 0.

Each Encapsulation interface address in an Encapsulation PDU is
either a new encapsulation to be announced (Ann/With == 1) (yes, a la
BGP) or a request that one be withdrawn (Ann/With == 0). Adding an
encapsulation which already exists SHOULD raise an
Announce/Withdraw Error; the EType
SHOULD be 2, suggesting a session restart so all encapsulations will be resent.

If an LLEI has multiple addresses for an encapsulation type,
one and only one address MAY be marked as primary (Primary Flag ==
1) for that Encapsulation Type.

An Encapsulation interface address in an Encapsulation PDU MAY
be marked as a loopback, in which case the Loopback bit is set.
Loopback addresses are generally not seen directly on an external
interface. One or more loopback addresses MAY be exposed by
configuration on one or more L3DL speaking external interfaces,
e.g. for iBGP peering. They SHOULD be marked as such, Loopback
Flag == 1.

Each Encapsulation interface address in an Encapsulation PDU is
that of the direct 'underlay' interface (Under/Over == 1), or an
'overlay' address (Under/Over == 0), likely that of a VM or
container guest bridged or configured on to the interface already
having an underlay address.

The IPv4 Encapsulation describes a device's ability to exchange
IPv4 packets on one or more subnets. It does so by stating the
interface's addresses and the corresponding prefix lengths.

The 24-bit Count is the sum of the number of IPv4
Encapsulations being announced and/or withdrawn.

The IPv6 Encapsulation describes a logical link's ability to
exchange IPv6 packets on one or more subnets. It does so by
stating the interface's addresses and the corresponding prefix
lengths.

The 24-bit Count is the sum of the number of IPv6
Encapsulations being announced and/or withdrawn.

As an MPLS enabled interface may have a label stack, a variable length list of labels is needed.
These are the labels the sender will accept for the prefix to
which the list is attached.

A Label Count of zero is an implicit withdraw of all labels for
that prefix on that interface.

The MPLS IPv4 Encapsulation describes a logical link's ability
to exchange labeled IPv4 packets on one or more subnets. It does
so by stating the interface's addresses, the corresponding prefix
lengths, and the corresponding labels which will be accepted for
each address.

The 24-bit Count is the sum of the number of MPLSv4
Encapsulations being announced and/or withdrawn.

The MPLS IPv6 Encapsulation describes a logical link's ability
to exchange labeled IPv6 packets on one or more subnets. It does
so by stating the interface's addresses, the corresponding prefix
lengths, and the corresponding labels which will be accepted for
each address.

The 24-bit Count is the sum of the number of MPLSv6
Encapsulations being announced and/or withdrawn.

Vendors or enterprises may define TLVs beyond the scope of L3DL
standards. This is done using a Private Enterprise Number followed by Enterprise Data in a format
defined for that Enterprise Number and Ent Type.

Ent Type allows a VENDOR PDU to be sub-typed in the event that
the vendor/enterprise needs multiple PDU types.

As with Encapsulation PDUs, a receiver of a VENDOR PDU MUST
respond with an ACK or an ERROR PDU. Similarly, a VENDOR PDU MUST
only be sent over an open session.

L3DL devices SHOULD beacon frequent Layer-2 KEEPALIVE PDUs to
ensure session continuity. The inter-KEEPALIVE interval is
configurable, with a default of ten seconds. A receiver may choose
to ignore KEEPALIVE PDUs.

An operational deployment MUST be configured whether to use
KEEPALIVEs or not, either globally, or as finely as per-link
granularity. Disagreement MAY result in repeated session failure
and reestablishment.

KEEPALIVEs SHOULD be beaconed at a configured frequency. One per
second is the default. Layer-3 liveness, such as BFD, may be more
(or less) aggressive.

When a sender transmits a PDU which is not a KEEPALIVE, the
sender SHOULD reset the KEEPALIVE timer. I.e. sending any PDU acts
as a keepalive. Once the last fragment has been sent, the
KEEPALIVE timer SHOULD be restarted. Do not wait for the ACK.

If a KEEPALIVE or other PDUs have not been received from a peer
with which a receiver has an open session for a configurable time
(default 30 seconds), the link SHOULD be presumed down. The devices
MAY keep configuration state and restore it without retransmission
if no data have changed. Otherwise, a new session SHOULD be
established and new Encapsulation PDUs exchanged.Layer-2 liveness may be continuously tested by KEEPALIVE PDUs,
see . As layer-2.5 or layer-3
connectivity could still break, liveness above layer-2 MAY be
frequently tested using BFD () or a similar
technique.This protocol assumes that one or more Encapsulation addresses
may be used to ping, run BFD, or whatever the operator
configures.Thus far, a one-hop point-to-point logical link discovery
protocol has been defined.The devices know their unique LLEIs and know the unique peer
LLEIs and Encapsulations on each logical link interface.Full topology discovery is not appropriate at the L3DL layer, so
Dijkstra a la IS-IS etc. is assumed to be done by higher level
protocols such as BGP-SPF.Therefore the LLEIs, link Encapsulations, and state changes are
pushed North via a small subset of the BGP-LS API. The upper layer
routing protocol(s), e.g. BGP-SPF, learn and maintain the topology,
run Dijkstra, and build the routing database(s).

For example, if a neighbor's IPv4 Encapsulation address changes,
the devices seeing the change push that change Northbound.

BGP-LS defines BGP-like Datagrams
describing logical link state (links, nodes, link prefixes, and
many other things), and a new BGP path attribute providing
Northbound transport, all of which can be ingested by upper layer
protocols such as BGP-SPF; see Section 4 of .

For IPv4 links, TLVs 259 and 260 are used. For IPv6 links,
TLVs 261 and 262 are used. If there are multiple addresses on a link,
multiple TLV pairs are pushed North, having the same ID pairs.

The Northbound protocol needs a few minor extensions to BGP-LS.
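
The TLV selection for IPv4 and IPv6 links described above can be
illustrated with a minimal Python sketch. It is not part of the
protocol. The function names and the input format (a list of
local/neighbor address pairs) are hypothetical. The sketch assumes
that 259 and 261 carry the local interface address and 260 and 262
the neighbor address, as in the BGP-LS link descriptor code points,
and that each TLV uses the usual BGP-LS two-octet type and two-octet
length header.

```python
import ipaddress
import struct

# BGP-LS link descriptor TLV code points named in the text.
# Assumption: 259/261 = local interface address, 260/262 = neighbor address.
IPV4_INTERFACE_ADDR = 259
IPV4_NEIGHBOR_ADDR = 260
IPV6_INTERFACE_ADDR = 261
IPV6_NEIGHBOR_ADDR = 262


def link_address_tlvs(address_pairs):
    """Return (tlv_type, packed_address) tuples for one logical link.

    address_pairs is a hypothetical list of (local, neighbor) address
    strings; one TLV pair is produced per address pair.
    """
    tlvs = []
    for local, neighbor in address_pairs:
        local_ip = ipaddress.ip_address(local)
        neighbor_ip = ipaddress.ip_address(neighbor)
        if local_ip.version != neighbor_ip.version:
            raise ValueError("address families of a pair must match")
        if local_ip.version == 4:
            types = (IPV4_INTERFACE_ADDR, IPV4_NEIGHBOR_ADDR)
        else:
            types = (IPV6_INTERFACE_ADDR, IPV6_NEIGHBOR_ADDR)
        tlvs.append((types[0], local_ip.packed))
        tlvs.append((types[1], neighbor_ip.packed))
    return tlvs


def encode_tlv(tlv_type, value):
    """Encode one TLV as 2-octet type, 2-octet length, then the value."""
    return struct.pack("!HH", tlv_type, len(value)) + value
```

A link carrying one IPv4 and one IPv6 address pair would thus yield
four TLVs, two per address family, all pushed North with the same
node ID pair.
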
Luckily, others have needed the same extensions.

Similarly to BGP-SPF, the BGP protocol is used in the
Protocol-ID field specified in table 1 of . The local and
remote node descriptors for all NLRI are the IDs described in
. This is equivalent to an adjacency SID or
a node SID if the address is a loopback address.

Label Sub-TLVs from Section 2.1.1,
are used to associate one or more MPLS Labels with a link.

This section explores some trade-offs made and some
considerations.

A device with multiple Layer-2 interfaces, traditionally called
a switch, may be used to forward frames and therefore packets from
multiple devices to one logical interface (LLEI), I, on an L3DL
speaking device. Interface I could discover a peer J across the
switch. Later, a prospective peer K could come up across the
switch. If I were no longer sending and listening for HELLOs, the
potential peering with K could not be discovered. Therefore, on
multi-link interfaces, L3DL MUST continue to send HELLOs as long
as those interfaces are up.

Both HELLO and KEEPALIVE are periodic, so KEEPALIVE might be
eliminated in favor of keeping only HELLOs. But KEEPALIVEs are
unicast, and thus less noisy on the network, especially if HELLO
is configured to transit layer-2-only switches, see .

One can think of the protocol as an instance (i.e. state machine)
which runs on each logical link of a device.

As the upper routing layer must view VLAN topologies as separate
graphs, L3DL treats VLAN ports as separate links.

L3DL PDUs learned over VLAN ports may be interpreted by upper
layer-3 routing protocols as being learned on the corresponding
layer-3 SVI for the VLAN.

As Sub-Interfaces each have their own LLEIs, they act as separate
interfaces, forming their own links.

An implementation SHOULD provide the ability to configure each
logical interface as L3DL speaking or not.

An implementation SHOULD provide the ability to configure whether
HELLOs on an L3DL enabled interface are sent to the Nearest Bridge
MAC address or to the MAC address which is propagated by switches
from that interface; see .

An implementation SHOULD provide the ability to distribute one or
more loopback addresses or interfaces into L3DL on an external L3DL
speaking interface.

An implementation SHOULD provide the ability to distribute one or
more overlay and/or underlay addresses or interfaces into L3DL on an
external L3DL speaking interface.

An implementation SHOULD provide the ability to configure one of
the addresses of an encapsulation as primary on an L3DL speaking
interface. If there is only one address for a particular
encapsulation, the implementation MAY mark it as primary by
default.

An implementation MAY allow optional configuration which updates
the local forwarding table with overlay and underlay data both
learned from L3DL peers and configured locally.

The protocol as is MUST NOT be used outside a datacenter or
similarly closed environment without authentication and
authorization mechanisms such as .

Many MDC operators have a strange belief that physical walls and
firewalls provide sufficient security. This is not credible. All
MDC protocols need to be examined for exposure and attack surface.
In the case of L3DL, Authentication and Integrity as provided in
is strongly recommended.

It is generally unwise to assume that on-the-wire Layer-2 is
secure. Strange/unauthorized devices may plug into a port.
Mis-wiring is very common in datacenter installations. A poisoned
laptop might be plugged into a device's port and form malicious
sessions to divert, intercept, or drop traffic.

Similarly, malicious nodes/devices could mis-announce
addressing.

If OPENs are not being authenticated, an attacker could forge an
OPEN for an existing session and cause the session to be reset.

For these reasons, the OPEN PDU's authentication data exchange
SHOULD be used.

If the KEEPALIVE PDU is not signed (as suggested in ) to save
computation, then a MITM could fake a session being alive.

This document requests the IANA create a registry for L3DL PDU
Type, which may range from 0 to 255. The name of the registry
should be L3DL-PDU-Type. The policy for adding to the registry is
RFC Required per , either standards track or
experimental. The initial entries should be the following:

This document requests the IANA create a registry for L3DL
Signature Type, AKA Sig Type, which may range from 0 to 255. The
name of the registry should be L3DL-Signature-Type. The policy for
adding to the registry is RFC Required per ,
either standards track or experimental. The initial entries should
be the following:

This document requests the IANA create a registry for L3DL PL
Flag Bits, which may range from 0 to 7. The name of the registry
should be L3DL-PL-Flag-Bits. The policy for adding to the registry is
RFC Required per , either standards track or
experimental. The initial entries should be the following:

This document requests the IANA create a registry for L3DL Error
Codes, a 16 bit integer. The name of the registry should be
L3DL-Error-Codes. The policy for adding to the registry is RFC
Required per , either standards track or
experimental. The initial entries should be the following:

This document requires a new EtherType.

This document requires a new multicast MAC address that will be
broadcast through a switch.

The authors thank Cristel Pelsser for multiple reviews, Harsha
Kovuru for comments during implementation, Jeff Haas for review and
comments, Jörg Ott for an early but deep transport review, Joe
Clarke for a useful review, John Scudder for deeply serious review
and comments, Larry Kreeger for a lot of layer-2 clue, Martijn
Schmidt for his contribution, Nalinaksh Pai for transport
discussions, Neeraj Malhotra for review, Paul Congdon for Ethernet
hints, Russ Housley for checksum discussion and sBox, and Steve
Bellovin for checksum advice.

IANA Private Enterprise Numbers

IEEE Standard for Local and Metropolitan Area Networks:
Overview and Architecture, IEEE

Local and Metropolitan Area Networks: Overview and Architecture,
Institute of Electrical and Electronics Engineers

A study of non-blocking switching networks [PAYWALLED]

Clos Network