Ivan Pepelnjak

Author Archives: Ivan Pepelnjak

The Mythical Use Cases: Traffic Engineering for Data Center Backups

Vendor product managers love discussing mythical use cases to warrant complex functionality in their gear. Long-distance VM mobility was one of those (using it for disaster avoidance was Mission Impossible under any real-world assumptions), and high-volume network-based backups seems to be another. Here’s what someone had to say about that particular unicorn in a LinkedIn comment when discussing whether we need traffic engineering in a data center fabric.

When you’re dealing with a large cluster on a fabric, you will see things like inband backup. The most common one I’ve seen is VEEAM. Those inband backups can flood a single link, and no amount of link scheduling really solves that; depending on the source, they can saturate 100G. There are a couple of solutions; IPv6 or eBGP SID has been used to avoid these links or schedule avoidance for other traffic.

It is true that (A) in-band backups can be bandwidth intensive and that (B) well-written applications can saturate 100G server links. However:

Worth Exploring: Infrahub by Opsmill

A year or two after Damien Garros told me that “he moved to France and is working on something new” we can admire the results: Infrahub, a version-control-based system that includes a data store and a repository of all source code you use in your network automation environment. Or, straight from the GitHub repository,

A central hub to manage the data, templates and playbooks that powers your infrastructure by combining the version control and branch management capabilities of Git with the flexible data model and UI of a graph database.

I’ve seen an early demo, and it looks highly promising and absolutely worth exploring. Have fun ;)

Fun fact: the OpsMill team includes two guest speakers in the ipSpace.net automation course and a netlab contributor.

Worth Reading: ChatGPT Does Not Summarize

I mostly gave up on LLMs being any help (apart from generating copious amounts of bullshit), but I still thought that generating summaries might be an interesting use case. I was wrong.

As Gerben Wierda explains in his recent “When ChatGPT summarises, it actually does nothing of the kind” blog post, you have to understand a text if you want to generate a useful summary, and that’s not what LLMs do. They can generate a shorter version of the text, which might not focus on the significant bits.

BGP Labs: Graceful Shutdown

Using the typical default router configurations, it can take minutes between a failure of an inter-AS link and the convergence of BGP routes. You can fine-tune that behavior with BGP timers and BFD (and still get pwned by Graceful Restart). While you can’t influence link failures, you could drain the traffic from a link before starting maintenance operations on it, and it would be a shame not to do that considering there’s a standard way to do that – the GRACEFUL_SHUTDOWN BGP community defined in RFC 8326. That’s what you’ll practice in the next BGP lab exercise.

BGP Route Reflectors Considered Harmful

The recent IBGP Full Mesh Between EVPN Leaf Switches blog post generated an interesting discussion on LinkedIn focused on whether we need route reflectors (in small fabrics) and whether they do more harm than good. Here are some of the highlights of that discussion, together with a running commentary.

Please note that we’re talking about BGP route reflectors in reasonably small data center fabrics. Large service provider networks with millions of customer VPN routes are a completely different story. As always, what you read in a random blog post might not apply to your network design. YMMV.

Worth Reading: Using AWS Services via IPv6

AWS started charging for public IPv4 addresses a few months ago, supposedly to encourage users to move to IPv6. As it turns out, you need public IPv4 addresses (or a private link) to access many AWS services, clearly demonstrating that it’s just another way of fleecing the sheep Hotel California tax. I’m so glad I moved my videos to Cloudflare ;)

For more details, read AWS: Egress Traffic and Using AWS Services via IPv6 (rendered in beautiful, easy-to-read teletype font).

EVPN Designs: IBGP Full Mesh Between Leaf Switches

In the previous blog post in the EVPN Designs series, we explored the simplest possible VXLAN-based fabric design: static ingress replication without any L2VPN control plane. This time, we’ll add the simplest possible EVPN control plane: a full mesh of IBGP sessions between the leaf switches.

This blog post describes an initial BGP design that we’ll refine in subsequent blog posts. Having a full mesh of IBGP sessions between leaf switches is a bad idea unless you have a tiny fabric or you’re deploying a small-scale EVPN pilot.

Testing Network Automation Data Transformation

Every complex enough network automation solution has to introduce a high-level (user-manageable) data model that is eventually transformed into a low-level (device) data model.

High-level overview of the process

High-level overview of the process

The transformation code (business logic) is one of the most complex pieces of a network automation solution, and there’s only one way to ensure it works properly: you test the heck out of it ;) Let me show you how we solved that challenge in netlab.

Must Read: OSPF Protocol Analysis (RFC 1245)

Daniel Dib found the ancient OSPF Protocol Analysis (RFC 1245) that includes the Router CPU section. Please keep in mind the RFC was published in 1991 (35 years ago):

Steve Deering presented results for the Dijkstra calculation in the “MOSPF meeting report” in [3]. Steve’s calculation was done on a DEC 5000 (10 mips processor), using the Stanford internet as a model. His graphs are based on numbers of networks, not number of routers. However, if we extrapolate that the ratio of routers to networks remains the same, the time to run Dijkstra for 200 routers in Steve’s implementation was around 15 milliseconds.

MLAG Deep Dive: LAG Member Failures in VXLAN Fabrics

In the Dealing with LAG Member Failures blog post, we figured out how easy it is to deal with a LAG member failure in a traditional MLAG cluster. The failover could happen in hardware, and even if it’s software-driven, it does not depend on the control plane.

Let’s add a bit of complexity and replace a traditional layer-2 fabric with a VXLAN fabric. The MLAG cluster members still use an MLAG peer link and an anycast VTEP IP address (more details).

netlab 1.8.2: Bug Fixes, Usability Improvements

netlab release 1.8.2 contains dozens of bug fixes and minor tweaks to device configuration templates. We also added a few safeguards including:

  • Check for Vagrant boxes or Docker containers before starting the lab and display pointers to build recipes.
  • Check installed Ansible collections before trying to configure the lab devices.
  • Display a warning if the lab topology was modified after the lab was created

netlab: VRF Instantiation on Lab Devices

In the previous blog post on this topic, I described how node and global VRFs work in netlab.

TL&DR: If you use the same VRF on multiple devices, it’s better to define it globally.

However, you might not need every VRF on every lab device in a more complex lab topology. Considering that, netlab tries to minimize the number of VRFs configured on lab devices using a simple rule: a VRF is configured on a lab device only if the device has at least one interface in that VRF.

1 2 3 124