Author Archives: Eric Pulvino

Best practices: MLAG backup IP

Recently there was a conversation in the Cumulus community (details in the debriefing below) about the best way to build a redundant backup IP link for multi-chassis link aggregation (MLAG). Like all good consulting-led blogs, we bring a healthy dose of pragmatism to our recommendations, and this technology is no different. But if you’re looking for the short answer, let’s just say: it depends.

The MLAG backup IP feature goes by many names in the industry. In Cisco-land you might call this the “peer keepalive link,” in Arista-ville you might call this the “peer-address heartbeat” and in Dell VLTs it is known as the “backup destination.” No matter what you call it, the functionality offered is nearly the same.

What does it do?

Before we get into the meat of the recommendation, let’s talk about what the backup IP is designed to do. The backup IP link gives MLAG an additional liveness signal to monitor, so a switch knows whether its peer is still reachable. Most implementations use this backup IP link solely as a heartbeat, meaning it is not used to synchronize MAC addresses between the two MLAG peers. This is also the case with Cumulus …
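For concreteness, here is roughly what the backup IP looks like in Cumulus Linux, where clagd reads its peering parameters from the peer link’s subinterface in /etc/network/interfaces. This is a minimal sketch: the addresses and system MAC are illustrative, and the backup IP would typically point at the peer’s management (eth0) address.

    auto peerlink.4094
    iface peerlink.4094
        address 169.254.1.1/30
        # In-band peer address used for MLAG synchronization
        clagd-peer-ip 169.254.1.2
        # Out-of-band heartbeat target, reachable even if the peer link dies
        clagd-backup-ip 192.168.0.22
        clagd-sys-mac 44:38:39:ff:40:94

If your management network lives in a VRF, the backup IP can be scoped to it (for example, appending “vrf mgmt” to the clagd-backup-ip line).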

Comparing Upgrade Strategies with Cumulus Linux

You’ve been running your Cumulus Linux network for a while, and everything has been running perfectly. Cumulus Linux has sliced your bread, and you’ve gotten a promotion because your boss can’t believe how successful the project was and how much money it is saving the organization. Your company has even been able to fire the accountant because Cumulus Linux has, surprisingly, also done your taxes for the coming year. In general, everything is going swimmingly with your open networking.

So what now? Is our story over? Not exactly: enterprise networks have long lifespans. Hyperscalers typically operate on a refresh cycle of 3-5 years; for them, anything over three years old is considered tech debt, and anything over five years old is considered a critical fault point. Your typical enterprise network may be around even longer than that. Over a lifespan like that, it is very common for the needs of the applications to change, requiring the network to change too, which often means supporting newer features at some point in the lifecycle of the equipment.

While the scenario above is quite rosy (hey, this is our blog after all!), the reasons for wanting to upgrade are many and varied. New features, …

Tales from the field: Best practices for initial provisioning (Part 1)

Working with the Cumulus Professional Services team, we get the privilege of seeing how many folks use and operationalize Cumulus Linux. Over time, we’ve learned many lessons and best practices that can benefit others who are getting started on the journey. It’s for that reason that we’re putting virtual pen to virtual paper and writing this post. This article is the first in a series of two that will discuss how to use Zero Touch Provisioning (ZTP) and automation tools together for maximum efficiency in your initial provisioning. This post is going to focus on ZTP while the next will focus on automation tooling.

Cumulus Linux: Batteries included

Let’s recap — what comes configured with Cumulus out of the box? (You can spot-check each of these with the commands sketched after the list.)

  • Two user accounts (cumulus and root, but only cumulus supports login, via a default password)
  • A set of Control Plane Policing (COPP) rules that limit many different types of control-plane traffic
  • SSHv2 enabled by default
  • NTP enabled by default
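As a quick sanity check, you can confirm these defaults from the switch itself. A minimal sketch, assuming a recent Cumulus Linux release where the default COPP rules are programmed as iptables entries:

    # SSH and NTP are both enabled out of the box
    systemctl status ssh ntp

    # The default control-plane policing rules show up in the ACL tables
    sudo iptables -L -nv | head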

You’ll notice here that we’ve said nothing about interface configuration. Like all network switches and routers, we start with a pretty blank slate from an interface-configuration perspective. We leverage ZTP to give the node some initial configuration to use …
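To make that concrete, here is a minimal ZTP script sketch. The CUMULUS-AUTOPROVISIONING string in a comment is the marker the switch looks for to treat the script as a valid ZTP payload; the config-server URL below is hypothetical.

    #!/bin/bash
    # CUMULUS-AUTOPROVISIONING
    # Fetch a baseline interfaces file from a (hypothetical) config server and apply it.
    set -e
    curl -sf http://192.0.2.10/base/interfaces -o /etc/network/interfaces
    ifreload -a
    exit 0

In practice, the ZTP script usually does just enough (users, keys, package repos) to hand the switch off to your automation tooling, which is the subject of part two.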

Celebrating ECMP in Linux — part two

In part one of our series on ECMP, we discussed the basics of ECMP, the recent changes that have been made and Cumulus’ part in moving the ball forward for Linux networking. Now, it’s time to get a little more technical and review how advancements in ECMP development for IPv4 and IPv6 have made ECMP what it is today — and what it can be in the near future.

Setting the stage: defining our terminologies

Hashing algorithms

Hashing algorithms are the biggest component of ECMP behavior, so it makes sense for us to talk for a moment about what we specifically mean when we refer to each one.

1.) Per-packet hash
This was the original hashing algorithm used in the kernel’s ECMP behavior. It is trivially simple to understand: it uses a pseudo-random value in the kernel, taken at the time the packet is being processed (jiffies), to determine which link in an ECMP bundle the traffic will use for egress. With this algorithm in place, each packet of a single flow could use a different link to reach the destination, which leads to all kinds of bad behaviors in TCP and higher-level applications/protocols …
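For contrast with that historical behavior, modern kernels expose the hashing choice as sysctls. A minimal sketch of those knobs (the IPv4 policy arrived in kernel v4.12 and its IPv6 counterpart in v4.17); a value of 0 hashes on L3 fields only, while 1 enables L4 5-tuple hashing:

    # Hash on the full 5-tuple instead of L3 fields only
    sudo sysctl -w net.ipv4.fib_multipath_hash_policy=1
    sudo sysctl -w net.ipv6.fib_multipath_hash_policy=1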

Celebrating ECMP in Linux — part one

ECMP in Linux: A brief history

Equal Cost Multi-Path (ECMP) routes are a big component of all the super-trendy data center network designs that are in vogue right now. Clos networks, and the ECMP that underpins those designs, are the best tools we have today to deliver high-bandwidth, highly fault-tolerant networks. Clos networks are rich with multiple equal-cost paths to get from Server A to Server B.

Linux kernel ECMP

2 paths (Host → ToR) × 8 paths (ToR → Leaf) × 16 paths (Leaf → Spine) × 8 paths (Spine → Leaf) × 2 paths (Leaf → ToR)
= 4,096 possible unique paths between Server A and Server B

FYI: The above is an actual customer network. Names have been changed to protect the innocent!

Cumulus has been working to improve the behavior of ECMP routes in the Linux kernel over the last several kernel releases, and with kernel v4.17 we have achieved the milestone we set out to attain: Linux hosts can now leverage the “5-Tuple” style hashing used inside traditional network devices for …
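As a quick illustration of what ECMP looks like from the Linux host’s perspective, here is a sketch using iproute2; the prefix, next-hop addresses and interface names are illustrative:

    # Install a route with two equal-cost next hops
    sudo ip route add 203.0.113.0/24 \
        nexthop via 10.1.1.1 dev swp1 \
        nexthop via 10.1.2.1 dev swp2

    # Confirm both next hops were installed
    ip route show 203.0.113.0/24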

Data center network monitoring best practices part 3: Modernizing tooling

Implementing your strategy using modern tooling

In the previous two posts we discussed gathering metrics for long-term trend analysis and then combining them with event-based alerts for actionable results. To combine these two elements, we need strong network monitoring tooling that lets us overlay these activities into an effective solution.

Understanding drawbacks of older network monitoring tooling

The legacy approach to monitoring is to deploy a monitoring server that periodically polls your network devices via the Simple Network Management Protocol (SNMP). SNMP is a very old protocol, originally developed in 1988, and while some things do get better with age, computer protocols are rarely among them. SNMP has been showing its age in many ways.

Inflexibility

SNMP uses data structures called Management Information Bases (MIBs) to exchange information. These MIBs are often proprietary, and they are difficult to modify or extend to cover new and interesting metrics.
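To see that MIB dependence in action, consider that even naming a metric requires the right MIB files to be installed locally. A small sketch using the net-snmp tools:

    # Translate a symbolic name to its raw OID; without IF-MIB installed,
    # your tooling is stuck working with the numeric form.
    snmptranslate -On IF-MIB::ifHCInOctets
    # .1.3.6.1.2.1.31.1.1.1.6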

Polling vs event driven

Polling doesn’t offer enough granularity to catch every event. For instance, even if you check disk utilization once every five minutes, it can spike over the threshold and drop back between polls, and you will never know.

An inefficient protocol

SNMP’s polling design is a “call and response” protocol, meaning the monitoring server will …
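To give a feel for the per-poll overhead, here is what routine counter collection looks like with the net-snmp tools; the target address and community string are illustrative:

    # Each table walk is a series of request/response round trips; on a
    # 48-port switch that is dozens of GETNEXT exchanges for one counter column.
    snmpwalk -v2c -c public 192.0.2.1 IF-MIB::ifHCInOctets

Multiply that by every table, every device and every polling interval, and the overhead adds up quickly.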

Data center network monitoring best practices part 1: Metrics and alerts

One of the least loved areas of any data center network is monitoring. This is ironic because, at its core, the network has two goals: 1) get packets from A to B, and 2) make sure the packets got from A to B. In the deployments I’ve seen, it is not uncommon for the monitoring budget to be effectively $0, and an organization’s budget generally reflects its priorities. Despite spending thousands, or even hundreds of thousands, of dollars on networking equipment to facilitate goal #1, there is often little money, thought or time spent in pursuit of goal #2. In the next several paragraphs I’ll go into some basic data center network monitoring best practices that will work with any budget.

It is not hard to see why monitoring the data center network can be a daunting task. Monitoring your network, just like designing your network, takes a conscious plan of action. Tooling in the monitoring space today is highly fragmented, with more than 100 “best of breed” tools that each accommodate a specific use case; just evaluating all the tools would be a full-time job. A recent Big Panda report and their video overview of it (38 mins) …