NSX-T: Routing where you need it (Part 1)

 

Network virtualization has come a long way. NSX has played a key role in redefining and modernizing networking in the Datacenter. Providing an optimal routing path for traffic has been one of the topmost priorities of Network Architects. Thanks to NSX distributed routing, routing between different subnets on an ESXi hypervisor can be done in the kernel, so traffic never has to leave the hypervisor. With NSX-T, we take this a step further and extend the same network functionality to multi-hypervisor and multi-cloud environments. NSX-T is a platform that provides Network and Security virtualization for a plethora of compute nodes such as ESXi, KVM, Bare Metal, Public Clouds and Containers.

 

This blog series will introduce NSX-T Routing & focus primarily on Distributed Routing. I will explain Distributed Routing in detail with packet walks between VMs sitting on the same and on different hypervisors, connectivity to the physical infrastructure and multi-tenant routing. Let’s start with a quick reference to the NSX-T architecture.

 

NSX-T Architecture

NSX-T has a built-in separation of the Management Plane (NSX-T Manager), Control Plane (Controllers) and Data Plane (Hypervisors, Containers etc.). I highly recommend going through the NSX-T Whitepaper for detailed information on the architecture and to understand the components and functionality of each of the planes.

 

A couple of interesting points that I want to highlight about the architecture:

  • NSX-T Manager is decoupled from vCenter and is designed to run across all these heterogeneous platforms.
  • NSX-T Controllers serve as the central control point for all logical switches within a network and maintain information about hosts, logical switches and logical routers.
  • NSX-T Manager and NSX-T Controllers can be deployed in a VM form factor on either ESXi or KVM. 
  • In order to provide networking to different types of compute nodes, NSX-T relies on a virtual switch called the “hostswitch”. The NSX management plane fully manages the lifecycle of this hostswitch, which is implemented as a variant of the VMware virtual switch on ESXi-based endpoints and as Open vSwitch (OVS) on KVM-based endpoints.
  • Data Plane stretches across a variety of compute nodes: ESXi, KVM, Containers, and NSX-T edge nodes (on/off ramp to physical infrastructure).
  • Each compute node is a transport node and will have a TEP (Tunnel End Point); depending upon the teaming policy, a host could have one or more TEPs.
  • NSX-T uses GENEVE as the underlying overlay protocol for these TEPs to carry Layer 2 information across Layer 3. GENEVE provides us the flexibility of inserting Metadata as TLV (Type, Length, Value) fields which can be used for new features. One example of this Metadata is the VNI (Virtual Network Identifier). We recommend an MTU of 1600 to account for the encapsulation header (a minimal header-layout sketch follows this list). More details on GENEVE can be found in the following IETF draft: https://datatracker.ietf.org/doc/draft-ietf-nvo3-geneve/
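To make the encapsulation concrete, here is a minimal Python sketch (an illustration under stated assumptions, not NSX-T code) that packs the 8-byte GENEVE base header described in the IETF draft and tallies the outer-header overhead behind the 1600-byte MTU recommendation. The VNI value 21386 is borrowed from the DB-LS example later in this post; the absence of options and the Ethernet payload type are illustrative assumptions.

```python
import struct

def geneve_base_header(vni, opt_len_words=0, protocol_type=0x6558):
    """Pack the 8-byte GENEVE base header (no options).

    Byte 0: Ver (2 bits, 0) | Opt Len (6 bits, in 4-byte words)
    Byte 1: O flag | C flag | 6 reserved bits (all 0 here)
    Bytes 2-3: Protocol Type (0x6558 = Ethernet payload)
    Bytes 4-6: 24-bit VNI, Byte 7: reserved
    """
    ver_optlen = (0 << 6) | (opt_len_words & 0x3F)
    flags = 0
    vni_field = (vni & 0xFFFFFF) << 8          # VNI in the top 24 bits
    return struct.pack("!BBHI", ver_optlen, flags, protocol_type, vni_field)

# Illustrative VNI (21386 is the DB-LS VNI shown later in this post)
hdr = geneve_base_header(vni=21386)
assert len(hdr) == 8

# Overlay overhead per packet: outer Ethernet + outer IPv4 + UDP + GENEVE base header
overhead = 14 + 20 + 8 + 8                     # = 50 bytes, more if options are added
print(f"Base header: {hdr.hex()}  overhead without options: {overhead} bytes")
```

This overhead is why the original inner MTU of 1500 needs roughly 50 or more extra bytes of headroom on the transport network, hence the 1600-byte recommendation.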

 

Before we dive deep into routing, let me define a few key terms.

A Logical Switch is a broadcast domain which can span multiple compute hypervisors. VMs in the same subnet connect to the same logical switch.

 

A Logical Router provides North-South and East-West routing between different subnets & has two components: a distributed component that runs as a kernel module in the hypervisor, and a centralized component that takes care of centralized functions like NAT, DHCP and LB, and provides connectivity to the physical infrastructure.

Types of interfaces on a Logical Router

  • Downlink – interface connecting to a Logical switch.
  • Uplink – interface connecting to the physical infrastructure/physical router.
  • RouterLink – interface connecting two Logical routers (a small data-model sketch follows this list).
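As a quick mental model (purely illustrative, not an NSX-T API), the following Python sketch represents a logical router and its three port types as plain data structures. The gateway address 172.16.10.1 is stated later in this post; the other addresses, the /24 masks and the port names are assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List

class PortType(Enum):
    DOWNLINK = "downlink"      # connects to a logical switch
    UPLINK = "uplink"          # connects to the physical infrastructure/router
    ROUTERLINK = "routerlink"  # connects two logical routers

@dataclass
class RouterPort:
    name: str
    port_type: PortType
    ip_cidr: str               # for a downlink, this is the subnet's gateway address

@dataclass
class LogicalRouter:
    name: str
    ports: List[RouterPort] = field(default_factory=list)

# Tenant 1 Router from this post's topology; 172.16.10.1 comes from the text,
# the App/DB gateway addresses and /24 masks are assumed for illustration.
tenant1 = LogicalRouter("Tenant 1 Router", ports=[
    RouterPort("to-Web-LS", PortType.DOWNLINK, "172.16.10.1/24"),
    RouterPort("to-App-LS", PortType.DOWNLINK, "172.16.20.1/24"),
    RouterPort("to-DB-LS",  PortType.DOWNLINK, "172.16.30.1/24"),
])
print([(p.name, p.port_type.value) for p in tenant1.ports])
```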

 

Edge nodes are appliances with a pool of capacity to run centralized services and act as an on/off ramp to the physical infrastructure. You can think of an Edge node as an empty container which can host one or more Logical routers to provide centralized services and connectivity to physical routers. An Edge node is a transport node just like a compute node and will also have a TEP IP to terminate overlay tunnels.

They are available in two form factors: Bare Metal or VM (leveraging Intel’s DPDK technology).

 

Moving on, let’s also get familiarized with the topology that I will use throughout this blog series.

I have two hypervisors in the above topology, one ESXi and one KVM. Both hypervisors have been prepared for NSX & have been assigned a TEP (Tunnel End Point) IP: 192.168.140.151 for the ESXi host and 192.168.150.152 for the KVM host. The hosts have L3 connectivity between them via the transport network. I have created 3 Logical switches via NSX Manager & have connected a VM to each of the switches. I have also created a Logical Router named Tenant 1 Router, which is connected to all the logical switches and acts as the gateway for each subnet.

Before we look at the routing table, packet walks etc., let’s look at how the configuration looks in NSX Manager. Here is the switching configuration, showing the 3 Logical switches.

Following is the routing configuration showing the Tenant 1 Logical Router.
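If you prefer to check the same configuration programmatically, the sketch below queries the NSX-T Manager REST API with Python. Treat the /api/v1/logical-switches and /api/v1/logical-routers endpoints, the Manager address, the credentials and the field names as assumptions for illustration; adjust them to your environment and NSX-T version.

```python
import requests

NSX_MANAGER = "https://nsx-mgr.lab.local"   # assumed Manager address
AUTH = ("admin", "password")                # assumed credentials

def get_collection(path):
    """Fetch a collection from the NSX-T Manager API (lab setup, self-signed cert)."""
    r = requests.get(f"{NSX_MANAGER}{path}", auth=AUTH, verify=False)
    r.raise_for_status()
    return r.json().get("results", [])

# Assumed endpoints: list the logical switches and logical routers created above
for ls in get_collection("/api/v1/logical-switches"):
    print("Logical switch:", ls.get("display_name"), "VNI:", ls.get("vni"))

for lr in get_collection("/api/v1/logical-routers"):
    print("Logical router:", lr.get("display_name"), lr.get("router_type"))
```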

 

Once configured via NSX Manager, the logical switches and routers are pushed to both the hosts, ESXi and KVM. Let’s validate that on both hosts. Following is the output from ESXi showing the Logical switches and router.

 

Following is the output from KVM host showing the Logical switches and router.

 

NSX Controller MAC learning and advertisement

 

Before we look at the packet walk, it is important to understand how remote MAC addresses are learnt by the compute hosts. This is done via the NSX Controllers. As soon as a VM comes up and connects to a Logical switch, the local TEP registers the VM’s MAC address with the NSX Controller. The following output from the NSX Controller shows that the MAC addresses of Web VM1, App VM1 and DB VM1 have been reported by their respective TEPs. The NSX Controller then publishes this MAC/TEP association table to the compute hosts, depending upon the type of host.
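Conceptually, the controller keeps a per-VNI table mapping VM MAC addresses to the TEP that reported them and pushes it to the transport nodes. The minimal sketch below models that bookkeeping in Python; the TEP IPs are the ones used in this topology, while the MAC addresses and the Web-LS/App-LS VNIs are made-up assumptions (only DB-LS’s VNI 21386 appears in this post).

```python
from collections import defaultdict

class ControlPlaneMacTable:
    """Toy model of the controller's MAC/TEP learning and publication."""

    def __init__(self):
        # vni -> { vm_mac: tep_ip }
        self.table = defaultdict(dict)

    def register(self, vni, vm_mac, tep_ip):
        """Called when a TEP reports a locally attached VM's MAC address."""
        self.table[vni][vm_mac] = tep_ip

    def publish(self, vni):
        """Return the MAC/TEP associations pushed to hosts for one logical switch."""
        return dict(self.table[vni])

ctrl = ControlPlaneMacTable()
# TEP IPs from the topology; MACs and the first two VNIs are assumptions
ctrl.register(vni=21384, vm_mac="00:50:56:aa:aa:01", tep_ip="192.168.140.151")  # Web VM1
ctrl.register(vni=21385, vm_mac="00:50:56:aa:aa:02", tep_ip="192.168.140.151")  # App VM1
ctrl.register(vni=21386, vm_mac="00:50:56:aa:aa:03", tep_ip="192.168.150.152")  # DB VM1
print(ctrl.publish(21386))  # what a compute host learns for DB-LS
```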

Now, we will look at the communication between VMs on the same hypervisor.

 

Distributed Routing for VMs hosted on the same Hypervisor

 

We have Web VM1 and App VM1 hosted on the same ESXi hypervisor. Since we are discussing communication between VMs on the same host, I am only showing the relevant topology below.

 

 Following is how traffic would go from Web VM1 to App VM1.

  1. Web VM1 (172.16.10.11) sends traffic to its gateway, 172.16.10.1, as the destination (172.16.20.11) is in a different subnet. This traffic traverses Web-LS and goes to the Downlink interface of the local distributed router running as a kernel module on the ESXi host.
  2. A routing lookup happens on the ESXi distributed router; the 172.16.20.0 subnet is a Connected route. The packet gets routed and is put on App-LS.
  3. A destination MAC lookup for the MAC address of App VM1 is needed to forward the frame. Since App VM1 is hosted on the same ESXi host, the MAC address lookup finds a local MAC entry, as highlighted in the diagram above.
  4. The L3 rewrite is done and the packet is sent to App VM1. (A small sketch of this forwarding decision follows this list.)
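The following Python sketch mimics the distributed router’s decision for this same-host case: a longest-prefix match against the connected routes, then a MAC lookup that resolves locally, so no tunnel is needed. The route entries reflect the three subnets in the topology (the /24 masks are assumed); the MAC table contents are illustrative assumptions.

```python
import ipaddress

# Connected routes on the Tenant 1 distributed router (subnets from the topology)
ROUTES = {
    ipaddress.ip_network("172.16.10.0/24"): "Web-LS",
    ipaddress.ip_network("172.16.20.0/24"): "App-LS",
    ipaddress.ip_network("172.16.30.0/24"): "DB-LS",
}

# MAC table on the ESXi host: mac -> ("local", None) or ("remote", tep_ip); assumed values
MAC_TABLE = {
    "00:50:56:aa:aa:02": ("local", None),               # App VM1 on this host
    "00:50:56:aa:aa:03": ("remote", "192.168.150.152"),  # DB VM1 behind the KVM TEP
}

def route(dst_ip):
    """Return the outgoing logical switch for a destination IP (longest prefix match)."""
    dst = ipaddress.ip_address(dst_ip)
    matches = [net for net in ROUTES if dst in net]
    return ROUTES[max(matches, key=lambda n: n.prefixlen)] if matches else None

def forward(dst_ip, dst_mac):
    ls = route(dst_ip)
    location, tep = MAC_TABLE[dst_mac]
    if location == "local":
        return f"route to {ls}, deliver locally in the kernel"
    return f"route to {ls}, encapsulate and tunnel to TEP {tep}"

print(forward("172.16.20.11", "00:50:56:aa:aa:02"))  # Web VM1 -> App VM1 (same host)
```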

Please note that the packet didn’t have to leave the hypervisor to get routed; this routing happened in the kernel. Now that we understand the communication between two VMs (in different subnets) on the same hypervisor, let’s take a look at the packet walk from Web VM1 (172.16.10.11) on ESXi to DB VM1 (172.16.30.11) hosted on KVM.

 

Distributed Routing for VMs hosted on different Hypervisors (ESXi & KVM)

 

  1. Web VM1 (172.16.10.11) sends traffic to its gateway, 172.16.10.1, as the destination (172.16.30.11) is in a different subnet. This traffic traverses Web-LS and goes to the Downlink interface of the local distributed router on the ESXi host.
  2. A routing lookup happens on the ESXi distributed router. The packet gets routed and is put on DB-LS. The following output shows the distributed router on the ESXi host and its routing table.

  3. A destination MAC lookup for the MAC address of DB VM1 is needed to forward the frame. The MAC lookup is done and the MAC address of DB VM1 has been learnt via the remote TEP 192.168.150.152. Again, this MAC/TEP association table was published by the NSX Controller to the hosts.

  4. The ESXi TEP encapsulates the packet and sends it to the remote TEP with Src IP=192.168.140.151 and Dst IP=192.168.150.152.
  5. The packet is received at the remote KVM TEP 192.168.150.152, where the VNI (21386) is matched. A MAC lookup is done and the packet is delivered to DB VM1 after removing the encapsulation header. (The sketch after this list traces the tunnelling decision.)
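Putting the pieces together, the sketch below traces the cross-host case: after the route lookup puts the packet on DB-LS, the MAC/TEP association learnt from the controller points at the remote KVM TEP, and the ESXi TEP builds the outer GENEVE/UDP headers. The TEP IPs and VNI 21386 come from this post; DB VM1’s MAC address is an illustrative assumption.

```python
GENEVE_UDP_DPORT = 6081  # IANA-assigned GENEVE destination port

# Inputs from the topology: DB-LS VNI and the MAC/TEP association published by the
# controller (the MAC address value itself is an assumption)
DB_LS_VNI = 21386
MAC_TO_TEP = {"00:50:56:aa:aa:03": "192.168.150.152"}
LOCAL_TEP = "192.168.140.151"  # ESXi host TEP

def encap_decision(dst_mac, vni):
    """Return the outer headers the local TEP would put on the routed frame."""
    remote_tep = MAC_TO_TEP[dst_mac]
    return {
        "outer_src_ip": LOCAL_TEP,
        "outer_dst_ip": remote_tep,
        "udp_dst_port": GENEVE_UDP_DPORT,
        "geneve_vni": vni,
    }

# Web VM1 -> DB VM1: the frame is tunnelled from the ESXi TEP to the KVM TEP
print(encap_decision("00:50:56:aa:aa:03", DB_LS_VNI))
# {'outer_src_ip': '192.168.140.151', 'outer_dst_ip': '192.168.150.152',
#  'udp_dst_port': 6081, 'geneve_vni': 21386}
```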

A quick traceflow validates the above packet walk.

This concludes the routing components part of this blog. In the next blog of this series, I will discuss multi-tenant routing and connectivity to physical infrastructure.


The Sun sets on Solaris and Sparc

After years of struggle and fading recognition, it seems the end is finally here for Solaris and Sparc, the Unix operating system and RISC processor designed and championed by Sun Microsystems and inherited by Oracle in 2010.

In a move that will win it no PR points, Oracle sent out recorded telephone messages to employees who were let go on the Friday before the Labor Day weekend. Yes, firing by voice mail. Classy.

The exact number is being debated, but talk on one message board puts it at 2,500. That’s both Solaris and Sparc engineers. The workers affected are primarily in the Santa Clara area, in a former Sun office not even a mile from Intel’s headquarters, but they are in three other states and India as well.

IDG Contributor Network: Connecting the car

It’s an interesting paradox: the more connected the car, the less connected the driver.

As a gearhead, that frightens me. I’ve always gravitated toward muscle cars, both new and old, because I like to feel connected to the car. I like the feeling of being pushed back in the seat when I press the long, skinny pedal. I love the experience of going through the gears and the throaty roar of a finely-tuned V8, like the one in my “TrackPack” optioned Mustang GT.

But as a techie, I also appreciate connected and autonomous cars – so long as they don’t look like inflated characters from the classic video game, Dig Dug.

You Googlers know exactly what I mean!

Apple Watch / Red Sox cheating scandal points toward larger issues

Newsflash: Someone just found a viable use case for the Apple Watch. Too bad it turned out to be cheating at baseball!

Wearable technology such as fitness trackers and smart watches has long been seen as an aid for athletes to improve their performance and help them win. And that’s great. But now that the Boston Red Sox have been caught red-handed using Apple Watches to communicate and transfer signs stolen from the Yankees, it seems there may also be an unanticipated dark side to the Internet of Things (IoT) in sports.

Using Keybase with GPG on macOS

During my too-brief stint using Fedora Linux as my primary laptop OS (see here for some details), I became attached to using GPG (GNU Privacy Guard)—in conjunction with Keybase—for signing Git commits and signing e-mail messages. Upon moving back to macOS, I found that I needed to set this configuration back up again, and so I thought I’d document it here in case others find it useful.

I’m making a couple assumptions here:

  1. First, I’m assuming you’ve already signed up for Keybase, generated your proofs, installed the app (this provides the keybase CLI tool), and created a PGP key using Keybase. (Hard-core PGP/GPG users will probably prefer to create their key outside of Keybase and import it, but then again they aren’t necessarily the target audience for this article either.)

  2. Second, I’m assuming you’ve already installed GPG on your Mac, typically via something like GPG Tools.

With those assumptions in mind, let’s assume that you’d like to be able to use the PGP key generated by/stored in Keybase with something like GPGMail. Here are the steps you’d need to follow to do that:

  1. First, you’ll need to export the PGP public key out of Keybase and into Continue reading

Troubleshooting connectivity problems in leaf and spine fabrics

Introducing data center fabric, the next-generation Facebook data center network describes the benefits of moving to a leaf and spine network architecture. The diagram shows how the leaf and spine architecture creates many paths between each pair of hosts. Multiple paths increase available bandwidth and resilience against the loss of a link or a switch. While most networks don't have the scale requirements of Facebook, smaller scale leaf and spine designs deliver high bandwidth, low latency networking to support cloud workloads (e.g. vSphere, OpenStack, Docker, Hadoop, etc.).

Unlike traditional hierarchical network designs, where a small number of links can be monitored to provide visibility, a leaf and spine network has no special links or switches where running CLI commands or attaching a probe would provide visibility. Even if it were possible to attach probes, the effective bandwidth of a leaf and spine network can be as high as a Petabit/second, well beyond the capabilities of current generation monitoring tools.

Fortunately, industry standard sFlow monitoring technology is built into the commodity switch hardware used to build leaf and spine networks. Enabling sFlow telemetry on all the switches in the network provides centralized, real-time, visibility into network traffic.
Fabric View Continue reading

Intent-Based Network Automation with Ansible

The latest in all the networking buzz these days is Intent-Based Networking (IBN). There are varying definitions of what IBN is and is not. Does IBN mean you need to deploy networking solely from business policy, does IBN mean you must be streaming telemetry from every network device in real-time, is it a combination of both? Is it automation?

This article isn’t meant to define IBN, rather, it’s meant to provide a broader, yet more practical perspective on automation and intent.

Intent isn’t New

One could argue that intent-based systems have been around for years, especially when managing servers. Why not look at DevOps tools like CFEngine, Chef, and Puppet (being three of the first)? They focused on desired state–their goal was to get managed systems into a technical desired state.

If something is in its desired state, doesn’t that mean it’s in its intended state?

These tools did this by eliminating the need to know the specific Linux server commands to configure the device–you simply defined your desired state with a declarative approach to systems management, e.g. ensure Bob is configured on the system without worrying about the command to add Bob. One major difference was those tools used Continue reading

Docker in the Enterprise Showcased at VMworld 2017

Last week, in the blistering heat of Las Vegas, Docker had a chance to interact with thousands of VMworld attendees to talk about containers. The message we heard again and again was that those in charge of infrastructure and virtualization are now being asked to manage containers. Sometimes it is being driven by developer teams that are already using Docker, but sometimes it is the infrastructure teams who recognize the benefits of moving applications to containers for easier maintenance and operations as well as cost savings. With Docker Enterprise Edition (EE), we have a solution that is designed to let IT secure and manage containerized applications.

Learn More About Docker for VMware Admins

If you’re interested in learning more about how Docker EE allows IT to improve operations, move workloads to the cloud and increase application agility all while saving costs, we’re happy to be partnering with the VMware User Group (VMUG) to deliver a webcast tomorrow. Register now to see how Docker EE is being used by some of the world’s largest container deployments, how Docker works with Windows applications or to clarify any confusion you may have about how Docker containers and VMs work together.

Docker + vSphere: Two Continue reading