Remote Direct Memory Access (RDMA) architecture enables efficient data transfer between Compute Nodes (CN) in a High-Performance Computing (HPC) environment. RDMA over Converged Ethernet version 2 (RoCEv2) utilizes a routed IP Fabric as a transport network for RDMA messages. Due to the nature of RDMA packet flow, the transport network must provide lossless, low-latency packet transmission. The RoCEv2 solution uses UDP in the transport layer, which does not handle packet losses caused by network congestion (buffer overflow on switches or on a receiving Compute Node). To avoid buffer overflow issues, Priority Flow Control (PFC) and Explicit Congestion Notification (ECN) are used as signaling mechanisms to react to buffer threshold violations by requesting a lower packet transfer rate.
Before moving to RDMA processes, let’s take a brief look at our example Compute Nodes. Figure 1-1 illustrates our example Compute Nodes (CN). Both Client and Server CNs are equipped with one Graphical Processing Unit (GPU). The GPU has a Network Interface Card (NIC) with one interface. Additionally, the GPU has Device Memory Units to which it has a direct connection, bypassing the CPU. In real life, a CN may have several GPUs, each with multiple memory units. Intra-GPU communication within the CN happens over high-speed NVLinks. The connection to remote CNs occurs over the NIC, which has at least one high-speed uplink port/interface.
Figure 1-1 also shows the basic idea of a stacked Fine-Grained 3D DRAM (FG-DRAM) solution. In our example, there are four vertically interconnected DRAM dies, each divided into eight Banks. Each Bank contains four memory arrays, each consisting of rows and columns that contain memory units (transistors whose charge indicates whether a bit is set to 1 or 0). FG-DRAM enables cross-DRAM grouping into Ranks, increasing memory capacity and bandwidth.
The upcoming sections introduce the required processes and operations when the Client Compute Node wants to write data from its device memory to the Server Compute Node’s device memory. I will discuss the design models and requirements for lossless IP Fabric in later chapters.
In this section, we will first examine the update process of the BGP tables on the VTEP switch Leaf-102 when it receives a BGP Update message from Spine-11. After that, we will go through the update processes for the MAC-VRF and the MAC Address Table. Finally, we will examine how the VXLAN manager on Leaf-102 learns the IP address of Leaf-10's NVE interface and creates a unidirectional NVE peer record in the NVE Peer Database based on this information.
We have configured switches Leaf-101 and Leaf-102 as Route Reflector Clients on the Spine-11 switch. Spine-11 has stored the content of the BGP Update message sent by Leaf-101 in the neighbor-specific Adj-RIB-In of Leaf-101. Spine-11 does not import this information in its local BGP Loc-RIB because we have not defined a BGP import policy. Since Leaf-102 is an RR Client, the BGP process on Spine-11 copies this information in the neighbor-specific Adj-RIB-Out table for Leaf-102 and sends the information to Leaf-102 in a BGP Update message. The BGP process on Leaf-102 stores the received information from the Adj-RIB-In table to the BGP Loc-RIB according to the import policy of EVPN Instance 10010 (import RT 65000:10010). During the import process, the Route Distinguisher values are also modified to match the configuration of Leaf-102: change the RD value from 192.168.10.101:32777 (received RD) to 192.168.10.102:32777 (local RD).
In this scenario, we are building a protected Broadcast Domain (BD), which we extend to the VXLAN Tunnel Endpoint (VTEP) switches of the EVPN Fabric, Leaf-101 and Leaf-102. Note that the VTEP operates in the Network Virtualization Edge (NVE) role for the VXLAN segment. The term NVE refers to devices that encapsulate data packets to transport them over routed IP infrastructure. Another example of an NVE device is the MPLS Provider Edge (MPLS-PE) router at the edge of the MPLS network, doing MPLS labeling. The term “Tenant System” (TS) refers to a physical host, virtual machine, or an intra-tenant forwarding component attached to one or more Tenant-specific Virtual Networks. Examples of TS forwarding components include firewalls, load balancers, switches, and routers.
We begin by configuring L2 VLAN 10 to Leaf-101 and Leaf-102 and associate it with the vn-segment 10010. From the NVE perspective, this constitutes an L2-Only network segment, meaning we do not configure an Anycast Gateway (AGW) for the segment, and it does not have any VRF association.
Next, we deploy a Layer 2 EVPN Instance (EVI) with VXLAN Network Identifier (VNI) 10010. We utilize the 'auto' option to generate the Route Distinguisher (RD) and the Route Target (RT) import and export values for the EVI. The RD value is derived from the NVE Interface IP address and the VLAN Identifier (VLAN 10) associated with the EVI, added to the base value 32767 (e.g., 192.168.100.101:32777). The use of the VLAN ID as part of the automatically generated RD value is the reason why VLAN is configured before the EVPN Instance. Similarly, the RT values are derived from the BGP ASN and the VNI (e.g., 65000:10010).
As the final step for EVPN Instance deployment, we add EVI 10010 under the NVE interface configuration as a member vni with the Multicast Group 239.1.1.1 we are using for Broadcast, Unknown Unicast, and Multicast (BUM) traffic.
For connecting TS1 and TS2 to the Broadcast domain, we will configure Leaf-101's interface Eth 1/5 and Leaf-102's interface Eth1/3 as access ports for VLAN 10.
A few words regarding the terminology utilized in Figure 3-2. '3-Stage Routed Clos Fabric' denotes both the physical topology of the network and the model for forwarding data packets. The 3-Stage Clos topology has three switches (ingress, spine, and egress) between the attached Tenant Systems. Routed, in turn, means that switches forward packets based on the destination IP address.
With the term VXLAN Segment, I refer to a stretched Broadcast Domain, identified by the VXLAN Network Identifier value defined under the EVPN Instance on Leaf switches.
Figure 3-2: L2-Only Intra VN Connection.
Continue readingIn the previous section, we built a Single-AS EVPN Fabric with OSPF-enabled Underlay Unicast routing and PIM-SM for Multicast routing using Any Source Multicast service. In this section, we configure two L2-Only EVPN Instances (L2-EVI) and two L2/L3 EVPN Instances (L2/3-EVI) in the EVPN Fabric. We examine their operations in six scenarios depicted in Figure 3-1.
Scenario 1 (L2-Only EVI, Intra-VN):
In the Deployment section, we configure an L2-Only EVI with a Layer 2 VXLAN Network Identifier (L2VNI) of 10010. The Default Gateway for the VLAN associated with the EVI is a firewall. In the Analyze section, we observe the Control Plane and Data Plane operation when a) connecting Tenant Systems TS1 and TS2 to the segment, and b) TS1 communicates with TS2 (Intra-VN Communication).
Scenario 2 (L2-Only EVI, Inter-VN):
In the Deployment section, we configure another L2-Only EVI with L2VNI 10020, to which we attach TS3 and TS4. In the Analyze section, we examine EVPN Fabric's Control Plane and Data Plane operations when TS2 (L2VNI 10010) sends data to TS3 (L2VNI 10020), Inter-VN Communication.
Scenario 3 (L2/L3 EVI, Intra-VN):
In the Deployment section, we configure a Virtual Routing and Forwarding (VRF) Instance named VRF-NWKT with L3VNI 10077. Next, Continue reading
Instead of being a protocol, EVPN is a solution that utilizes the Multi-Protocol Border Gateway Protocol (MP-BGP) for its control plane in an overlay network. Besides, EVPN employs Virtual eXtensible Local Area Network (VXLAN) encapsulation for the data plane of the overlay network.
Multi-Protocol BGP (MP-BGP) is an extension of BGP-4 that allows BGP speakers to encode Network Layer Reachability Information (NLRI) of various address types, including IPv4/6, VPNv4, and MAC addresses, into BGP Update messages. The MP_REACH_NLRI path attribute (PA) carried within MP-BGP update messages includes Address Family Identifier (AFI) and Subsequent Address Family Identifier (SAFI) attributes. The combination of AFI and SAFI determines the semantics of the carried Network Layer Reachability Information (NLRI). For example, AFI-25 (L2VPN) with SAFI-70 (EVPN) defines an MP-BGP-based L2VPN solution, which extends a broadcast domain in a multipoint manner over a routed IPv4 infrastructure using an Ethernet VPN (EVPN) solution.
BGP EVPN Route Types (BGP RT) carried in BGP update messages describe the advertised EVPN NLRIs (Network Layer Reachability Information) type. Besides publishing IP Prefix information with IP Prefix Route (EVPN RT 5), BGP EVPN uses MAC Advertisement Route (EVPN RT 2) Continue reading
In a traditional Layer 2 network, switches forward Intra-VLAN data traffic based on the destination MAC address of Ethernet frames. Therefore, hosts within the same VLAN must resolve each other's MAC-IP address bindings using Address Resolution Protocol (ARP). When a host wants to open a new IP connection with a device in the same subnet and the destination MAC address is unknown, the connection initiator generates an ARP Request message. In the message, the sender provides its own MAC-IP binding information and queries the MAC address of the owner of the target IP. The ARP Request messages are Layer 2 Broadcast messages with the destination MAC address FF:FF:FF:FF:FF:FF.
EVPN Fabric is a routed network and requires a solution for Layer 2 Broadcast messages. We can select either BGP EVPN-based Ingress-Replication (IR) solution or enable Multicast routing in Underlay network. This chapter introduces the latter model. As in previous Unicast Routing section, we follow the Multicast deployment workflow of Nexus Dashboard Fabric Controller (NDFC) graphical user interface.
Figure 2-4 depicts the components needed to deploy Multicast service in the Underlay network. The default option for selecting “RP mode” is ASM (Any-Source Multicast). ASM is Continue reading
Image 2-1 illustrates the components essential for designing a Single-AS, Multicast-enabled OSPF Underlay EVPN Fabric. These components need to be established before constructing the EVPN fabric. I've grouped them into five categories based on their function.
The model presented in Figure 2-1 outlines the steps for configuring an EVPN fabric using the Continue reading
Figure illustrates the simplified operation model of EVPN Fabric. At the bottom of the figure is four devices, Tenant Systems (TS), connected to the network. When speaking about TS, I am referring to physical or virtual hosts. Besides, The Tenant System can be a forwarding component attached to one or more Tenant-specific Virtual Networks. Examples of TS forwarding components include firewalls, load balancers, switches, and routers.
We have connected TS1 and TS2 to VLAN 10 and TS3-4 to VLAN 20. VLAN 10 is associated with EVPN Instance (EVI) 10010 and VLAN 20 to EVI 10020. Note that VLAN-Id is switch-specific, while EVI is Fabric-wide. Thus, subnet A can have VLAN-Id XX on one Leaf switch and VLAN-Id YY on another. However, we must map both VLAN XX and YY to the same EVPN Instance.
When a TS connected to the Fabric sends the first Ethernet frame, the Leaf switch stores the source MAC address in the MAC address table, where it is copied to the Layer 2 routing table (L2RIB) of the EVPN Instance. Then, the BGP process of the Leaf switch advertises the MAC address with its reachability information to its BGP EVPN peers, essentially the Spine switches. Continue reading
During the load balancer deployment process, we define a virtual IP (a.k.a front-end IP) for our published service. As a next step, we create a backend (BE) pool to which we attach Virtual Machines using either their associated vNIC or Direct IP (DIP). Then, we bind the VIP to BE using an Inbound rule. Besides, in this phase, we create and associate health probes with inbound rules for monitoring VM's service availability. If VMs in the backend pool also initiate outbound connections, we build an outbound policy, which states the source Network Address Translation (SNAT) rule (DIP, src port > VIP, src port).
This chapter provides an overview of the components of the Azure load balancer service: Centralized SDN Controller, Virtual Load balancer pools, and Host Agents. In this chapter, we discuss control plane and data plane operation.
Figure 20-1 depicts our example diagram. The top-most box, Loadbalancer deployment, shows our LB settings. We intend to forward HTTP traffic from the Internet to VIP 1.2.3.4 to either DIP 10.0.0.4 (vm-beetle) or DIP 10.0.0.5 (vm-bailey). The health probe associated with Continue reading
In the previous chapter, we discussed how a VTEP learns the local TS's MAC address and the process through which the MAC address is programmed into BGP tables. An example VTEP device was configured with a Layer 2 VLAN and an EVPN Instance without deploying a VRF Context or VLAN routing interface. This chapter introduces, at a theoretical level, how the VTEP device, besides the TS's MAC address, learns the TS's IP address information after we have configured the VRF Context and routing interface for our example VLAN.
Figure 1-3: MAC-VRF Tenant System’s IP Address Propagation.
I have divided Figure 1-3 into three sections. The section on the top left, Integrated Routing and Bridging - IRB illustrates the components required for intra-tenant routing and their interdependencies. By configuring a Virtual Routing and Forwarding Context (VRF Context), we create a closed routing environment with a per-tenant IP-VRF L3 Routing Information Base (L3RIB). Within the VRF Context, we define the Layer 3 Virtual Network Identifier (L3VNI) along with the Route Distinguisher (RD) and Route Target (RT) values. The RD of the VRF Context enables the use of overlapping IP addresses across different tenants. Based on the RT value of the VRF Context, Continue reading
Instead of being a protocol, EVPN is a solution that utilizes the Multi-Protocol Border Gateway Protocol (MP-BGP) for its control plane in an overlay network. Besides, EVPN employs Virtual extensible Local Area Network (VXLAN) encapsulation for the data plane of the overlay network.
Multi-Protocol BGP (MP-BGP) is an extension of BGP-4 that allows BGP speakers to encode Network Layer Reachability Information (NLRI) of various address types, including IPv4/6, VPNv4, and MAC addresses, into BGP Update messages.
The MP_REACH_NLRI path attribute (PA) carried within MP-BGP update messages includes Address Family Identifier (AFI) and Subsequent Address Family Identifier (SAFI) attributes. The combination of AFI and SAFI determines the semantics of the carried Network Layer Reachability Information (NLRI). For example, AFI-25 (L2VPN) with SAFI-70 (EVPN) defines an MP-BGP-based L2VPN solution, which extends a broadcast domain in a multipoint manner over a routed IPv4 infrastructure using an Ethernet VPN (EVPN) solution.
BGP EVPN Route Types (BGP RT) carried in BGP update messages describe the advertised EVPN NLRIs (Network Layer Reachability Information) type. Besides publishing IP Prefix information with IP Prefix Route (EVPN RT 5), BGP EVPN uses MAC Advertisement Route Continue reading
In Figure 1-3 we have VLAN 10 mapped to EVI/MAC-VRF L2VNI10000. TS-A1 (IP: 192.168.11.12/MAC: 1000.0010.beef) is connected to VLAN10 via Attachment Circuit (AC) Ethernet 1/2, (ifindex: 1a000200).
Figure 1-3: MAC-VRF: L2RIB Local Learning Process.
Example 1-1 shows the VLAN to L2VNI mapping information.
Example 1-1: VLAN to EVPN Instance Mapping Information.
During the startup process, TS-A1 sends a Gratuitous ARP (GARP) message to announce its presence on the network and validate the uniqueness of its IP address. It uses its IP address in the Target IP field (Example 1-2). If another host responds to this unsolicited ARP reply, it indicates a potential IP address conflict.
Multi-Protocol BGP (MP-BGP) is a BGP-4 extension that enables BGP speakers to encode Network Layer Reachability Information (NLRI) of various address types, such as IPv4/6, VPNv4, and MAC addresses, into BGP Update messages. MP-BGP features an MP_REACH_NLRI Path-Attribute (PA), which utilizes an Address Family Identifier (AFI) to describe service categories. Subsequent Address Family Identifier (SAFI), in turn, defines the solution used for providing the service. For example, L2VPN (AFI 25) is a primary category for Layer-2 VPN services, and the Ethernet Virtual Private Network (EVPN: SAFI 70) provides the service. Another L2VPN service is Virtual Private LAN Service (VPLS: SAFI 65). The main differences between these two L2VPN services are that only EVPN supports active/active multihoming, has a control-plane-based MAC address learning mechanism, and operates over an IP-routed infrastructure.
EVPN utilizes various Route Types (EVPN RT) to describe the Network Layer Reachability Information (NLRI) associated with Unicast, BUM (Broadcast, Unknown unicast, and Multicast) traffic, as well as ESI Multihoming. The following sections explain how EVPN RT 2 (MAC Advertisement Route) is employed to distribute MAC and IP address information of Tenant Systems enabling the expansion of VLAN over routed infrastructure.
The Tenant System refers to a host, virtual machine, Continue reading
In Figure 1-1, we have a routed 3-stage Clos Fabric, where all Inter-Switch links are routed point-to-point layer-3 connections. As explained in previous sections, a switched layer-2 network with an STP control plane allows only one active path per VLAN/Instance and VLAN-based traffic load sharing. Due to the Equal Cost Multi-Path (ECMP) supported by routing protocols, a routed Clos Fabric enables flow-based traffic load balancing using all links from the ingress leaf via the spine layer down to the egress leaf. The convergence time for routing protocols is faster and less disruptive than STP topology change. Besides, a routed Clos Fabric architecture allows horizontal bandwidth scaling. We can increase the overall bandwidth capacity between switches, by adding a new spine switch. Dynamic routing protocols allow standalone and virtualized devices lossless In-Service Software Update (ISSU) by advertising infinite metrics or withdrawing all advertised routes.
But how do we stretch layer-2 segments over layer-3 infrastructure in a Multipoint-to-Multipoint manner, allowing tenant isolation and routing between segments? The answer relies on the Network Virtualization Overlay (NVO3) framework.
BGP EVPN, as an NVO3 control plane protocol, uses EVPN Route Types (RT) in update messages for identifying the type of advertised EVPN NLRIs (Network Continue reading
The default Layer 2 Control Plane protocol in Cisco NX-OS is a Rapid Per-VLAN Spanning Tree Plus (Rapid PVST+), which runs 802.1w standard Rapid Spanning Tree Protocol (RSTP) instance per VLAN. Rapid PVST+ builds a VLAN-specific, loop-free Layer 2 data path from the STP root switch to all non-root switches. Spanning Tree Protocol, no matter which mode we use, allows only one active path at a time and blocks all redundant links. One general solution for activating all Inter-switch links is placing an STP root switch for odd and even VLANs into different switches. However, STP allows only a VLAN-based traffic load balancing.
After building a loop-free data path, switches running Rapid PVST+ monitor the state of the network by using Spanning Tree instance-based Bridge Protocol Data Units (BPDU). By default, each switch sends instance-based BPDU messages from their designated port in two-second intervals. If we have 2000 VLANs, all switches must process 2000 BPDUs. To reduce CPU and Memory consumption caused by BPDU processing, we can use Multiple Spanning Tree – MSTP (802.1s), where VLANs are associated with Instances. For example, we can attach VLANs 1-999 to one instance and Continue reading
Before you can add Cisco ISE to Catalyst Center’s global network settings as an Authentication, Authorization, and Accounting server (AAA) for clients and manage the Group-Based access policy implemented in Cisco ISE, you must integrate them.
This post starts by explaining how to activate the pxGrid service on ISE, which it uses for pushing policy changes to Catalyst Center (steps 1a-f). Next, it illustrates the procedure to enable External RESTful API (ERS) read/write on Cisco ISE to allow external clients to Create, Read, Update, and Delete (CRUD) processes on ISE. Catalyst Center uses ERS for pushing configuration to ISE. After starting the pxGrid service and enabling ERS, this post discusses how to initiate the connection between ISE and Catalyst Center (steps 2a-h and 3a-b). The last part depicts the Group-Based Access Control migration processes (4a-b).
Open the Administrator tab on the main view of Cisco ISE. Then, under the System tab, select the Deployment option. The Deployment Nodes section displays the Cisco ISE Node along with its personas. In Figure 1-3, a standalone ISE Node is comprised of three personas: Policy Continue reading
This chapter introduces Cisco's approach to Intent-based Networking (IBN) through their Centralized SDN Controller, Cisco DNA Center, rebranded as Cisco Catalyst Center (from now on, I am using the abbreviation C3 for Cisco Catalyst Center). We focus on the network green field installation, showing workflows, configuration parameters, and relationships and dependencies between building blocks. The C3 workflow is divided into four main entities: 1) Design, 2) Policy, 3) Provision, and 4) Assurance, each having its own sub-processes. This chapter introduces the Design phase focusing on Network Hierarchy, Network Settings, and Network Profile with Configuration Templates.
This post deprecates the previous post, "Cisco Intent-Based Networking: Part I, Overview."
Network Hierarchy is a logical structure for organizing network devices. At the root of this hierarchy is the Global Area, where you establish your desired network structure. In our example, the hierarchy consists of four layers: Area (country - Finland), Sub-area (city - Joensuu), Building (JNS01), and Floor (JNS01-FLR01). Areas and Buildings indicate the location, while Floors provide environmental information relevant to wireless networks, such as floor type, measurements, and wall properties.
Network settings define device credentials (CLI, HTTP(S), SNMP, and NETCONF) required for accessing devices Continue reading
This post introduces Cisco's approach to Intent-based Networking (IBN) through their Centralized SDN Controller, DNA Center, rebranded as Catalyst Center. We focus on the network green field installation, showing workflows, configuration parameters, and relationships and dependencies between building blocks.
Figure 1-1 is divided into three main areas: a) Onboard and Provisioning, b) Network Hierarchy and Global Network Settings, c) and Configuration Templates and Site Profiles.
We start a green field network deployment by creating a Network Design. In this phase, we first build a Network Hierarchy for our sites. For example, a hierarchy can define Continent/Country/City/Building/Floor structure. Then, we configure global Network Settings. This phase includes both Network and Device Credentials configuration. AAA, DHCP, DNS serves, DNS name, and Time Zone, which are automatically inherited throughout the hierarchy, are part of the Network portion. Device Credentials, in turn, define CLI, SNMP read/write, HTTP(S) read/write username/password, and CLI enable password. The credentials are used later in the Discovery phase.
Next, we build a site and device type-specific configuration templates. As a first step, we create a Project, a folder for our templates. In Figure 1-1, we have a Composite template into which we attach two Regular templates. Regular templates include Continue reading
Available at Leanpub and Amazon
About This Book
A modern application typically comprises several modules, each assigned specific roles and responsibilities within the system. Application architecture governs the interactions and communications between these modules and users. One prevalent architecture is the three-tier architecture, encompassing the Presentation, Application, and Data tiers. This book explains how you can build a secure and scalable networking environment for your applications running in Microsoft Azure. Besides a basic introduction to Microsoft Azure, the book explains various solutions for Virtual Machines Internet Access, connectivity, security, and scalability perspectives.
Azure Basics: You will learn the hierarchy of Microsoft Azure datacenters, i.e., how a group of physical datacenters forms an Availability Zone within the Azure Region. Besides, you learn how to create a Virtual Network (VNet), divide it into subnets, and deploy Virtual Machines (VM). You will also learn how the subnet in Azure differs from the subnet in traditional networks.
Internet Access: Depending on the role of the application, VMs have different Internet access requirements. Typically, front-end VMs in the presentation tier/DMZ are visible on the Internet, allowing external hosts to initiate connections. VMs in the Application and Data tiers are rarely accessible from Continue reading
In Chapter Five, we deployed an internal load balancer (ILB) in the vnet-hub. It was attached to the subnet 10.0.0.0/24, where it obtained the frontend IP (FIP) 10.0.1.6. Next, we created a backend pool and associated our NVAs with it. Finally, we bound the frontend IP 10.0.1.6 to the backend pool to complete the ILB setup.
Next, in vnet-spoke1, we created a route table called rt-spoke1. This route table contained a user-defined route (UDR) for 10.2.0.0/24 (vnet-spoke2) with the next-hop set as 10.0.1.6. We attached this route table to the subnet 10.1.0.0/24. Similarly, in vnet-spoke2, we implemented a user-defined route for 10.1.0.0/24 (vnet-spoke1). By configuring these UDRs, we ensured that the spoke-to-spoke traffic would pass through the ILB and one of the NVAs on vnet-hub. Note that in this design, the Virtual Network Gateway is not required for spoke-to-spoke traffic.
In this chapter, we will add a Virtual Network Gateway (VGW) into the topology and establish an IPsec VPN connection between the on-premises network edge router and VGW. Additionally, we will deploy a new route table called "rt-gw-snet" where we add routing entries to the spoke VNets with the next-hop IP address 10.0.1.6 (ILB's frontend IP). Besides, we will add a routing entry 10.3.0.0/16 > 10.0.1.6 into the existing route tables on vnet-spoke-1 and vnet-spoke-2 (not shown in figure 6-1). This configuration will ensure that the spoke to spoke and spoke to on-prem flows are directed through one of the Network Virtual Appliances (NVAs) via ILB. The NVAs use the default route table, where the VGW propagates all the routes learned from VPN peers. However, we do not propagate routes from the default route table into the "rt-gw-snet" and "rt-prod-1" route tables. To enable the spoke VNets to use the VGW on the hub VNet, we allow it in VNet peering configurations.