Author Archives: Russ

BIER Basics

Multicast is, at best, difficult to deploy in large scale networks: PIM sparse and BIDIR are both complex, adding large amounts of state to intermediate devices. In the worst case, such as large-scale spine and leaf networks (variations on the venerable Clos fabric), there is no apparent way to deploy any existing version of PIM. BIER, described in RFC 8279, aims to solve the problem of per-device state in traditional multicast.

In this network, assume A has some packet that needs to be delivered to T, V, and X. A could generate three packets, each one addressed to one of the destinations, but replicating the packet at A wastes network resources on the A->B link, at least. Using PIM, these three destinations could be placed in a multicast group (a multicast address can be created that describes T, V, and X as a single destination). After this, a reverse shortest path tree can be calculated from each of the destinations in the group towards the source, A, and the correct forwarding state (the outgoing interface list) can be installed at each of the routers in the network (or at least along the correct paths). This, however, adds a lot of state to the network.
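To make the cost of replicating at the source concrete, here is a minimal sketch that counts packet copies per link for both approaches. The topology is an assumption: the network diagram is not reproduced here, so the links below are hypothetical and chosen only so that A reaches T, V, and X through B. For simplicity the tree is computed from A outward, which gives the same result as building it from the receivers when the links are symmetric and unweighted.

```python
from collections import Counter, deque

# Hypothetical topology (an assumption, not the network from the post).
LINKS = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "T"), ("C", "V"), ("D", "X")]


def neighbors(links):
    """Build an undirected adjacency map from a list of links."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    return adj


def shortest_path(adj, src, dst):
    """Breadth-first search; returns the list of nodes from src to dst."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for nxt in adj[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    path = []
    node = dst
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


def link_load(adj, src, receivers, replicate_in_network):
    """Count packet copies per link for one packet sent to every receiver."""
    load = Counter()
    for dst in receivers:
        path = shortest_path(adj, src, dst)
        for link in zip(path, path[1:]):
            if replicate_in_network:
                load[link] = 1   # the tree carries one copy per link
            else:
                load[link] += 1  # the source sends one copy per receiver
    return load


adj = neighbors(LINKS)
receivers = ["T", "V", "X"]
unicast = link_load(adj, "A", receivers, replicate_in_network=False)
tree = link_load(adj, "A", receivers, replicate_in_network=True)
print("copies on A->B:", unicast[("A", "B")], "vs", tree[("A", "B")])
print("total copies:", sum(unicast.values()), "vs", sum(tree.values()))
```

The saving on the links is paid for in state: each router on the tree has to hold an outgoing interface list for the group, which is the per-device state the paragraph above describes.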

History of Networking: OpenConfig with Anees Shaikh and Rob Shakir

OpenConfig is an effort among many cooperating network operators to define vendor-neutral data models for configuring and managing networks programmatically. In this episode we talk with Anees Shaikh and Rob Shakir about the roots of the OpenConfig project and where it stands today.

Outro Music:
Danger Storm Kevin MacLeod (incompetech.com)
Licensed under Creative Commons: By Attribution 3.0 License
http://creativecommons.org/licenses/by/3.0/

Choose Simple Solutions

In my experience, simplicity is not valued enough in software development. Instead, there is a lot of emphasis placed on flexibility. —Felix

Replace “software” with “network,” and think about it. How often do network engineers select the chassis-based system that promises to “never need to be replaced?” How often do we build networks like they will be “in use” 20+ years from now? Now it does happen from time to time; I have heard of devices with many years of uptime, for instance. I have worked on AT&T Brouters in production—essentially a Cisco AGS+ rebranded and resold by AT&T—that were some ten or fifteen years old even back when I worked on them. These things certainly happen, and sometimes they even happen for good reasons.

But knowing such things happen and planning for such things to happen are two different mindsets. At least some of the complexity in networks comes from just this sort of “must make it permanent” thinking:

Many developers like to write code which handles any problem which might appear at any point in the future. In that regard, they are fortune tellers, trying to find a solution for eventual problems. This can work out very Continue reading

Instagram and Friday Photos

I’m making some changes to the Friday Photo series (which is why I’ve not posted any of these in a bit). I will be posting a small copy of each photo to Instagram, and a fuller image over on my SmugMug page. I will be including a link to the SmugMug version in the Instagram post, but because of the way Instagram sets things up you’ll have to copy the link out and paste it into a browser separately.

I will be going back through all my images and reprocessing them, so you will probably see duplicates from time to time.

Social Media, Limits, and Productivity

If there is one question I get most often, it is “how do you get so much done?” One answer to this question is: I limit my use of social media. There is another angle to social media use which is a bit more… philosophical.

Some of you might know that I am currently working on a PhD in Philosophy—which might seem like an odd thing to do for someone who has been in the engineering world for, well, pretty much my entire life. My particular area of study, however, is what might be called media ecology and humanness. How do these two interact? What impact does, for instance, social media have on things like human freedom and dignity?

Social media (and mediated reality in general) has a bad habit of making people into objects—objectification is just part of the mediation process. If you go “all in” to the mediated world, then you become wholly mediated. This is ultimately dehumanizing, and a very bad thing.

Returning to the first question I raised above: what impact does social media have on my use of time? Does it make me more or less productive?

If we think social media does have Continue reading

Upcoming Safari Books Webinars

I have two webinars on Safari that might be of interest to folks who read here.

Network Troubleshooting Theory and Process

In this course I relate my formal training in electronics to the networking world. The primary topic is the half-split method of troubleshooting, which tends to be much faster than the “hunch, hunt, and peck” method most folks seem to intuitively use. This is a course I give on a regular basis, though I suspect I will move to giving it twice a year in the future.
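As a rough illustration of why half-splitting is fast, the sketch below treats a path as an ordered list of stages and tests the midpoint each time, discarding the half that cannot contain the fault. It is not taken from the course material. The stage names and the `probe` function are hypothetical, and the sketch assumes a single fault, so that every stage before the fault tests good and every stage from the fault onward tests bad.

```python
def half_split(stages, is_good_at):
    """Return the first stage whose output tests bad, or None if all are good.

    Assumes a single fault: stages before it test good, and stages from it
    onward test bad.
    """
    lo, hi = 0, len(stages)  # the fault, if any, lies in stages[lo:hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if is_good_at(stages[mid]):
            lo = mid + 1     # everything up to mid is fine; look downstream
        else:
            hi = mid         # the fault is at mid or upstream of it
    return stages[lo] if lo < len(stages) else None


# Hypothetical path and fault location, for illustration only.
path = ["host", "access-switch", "distribution", "core",
        "firewall", "wan-edge", "isp", "server"]
faulty = "firewall"
probes = []


def probe(stage):
    """Stand-in for a real test (a ping, a capture, a counter check)."""
    probes.append(stage)
    return path.index(stage) < path.index(faulty)


found = half_split(path, probe)
print(found, "found after", len(probes), "tests")
```

With eight stages the fault is isolated in three tests, where checking each stage in order from the host would take five.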

How Networks Really Work

This is a course I just started developing. Essentially, this will be split into two pieces. The first part will be walking through packets traversing a network; the second will be walking through various routing protocols converging on some common topologies. The aim here is to connect some of the theory I talk about to the “real world,” so this is not just about covering the material, but also about covering the mindset.

I also have two more LiveLessons in production, one with Dinesh Dutt on disaggregation, and another on various forms of abstraction and the tradeoffs around abstraction (such as summarization and aggregation). I hope to have Continue reading

Grey Failure Lessons Learned

Grey Failures in the Real World

Most “smaller scale” operators probably believe they are not impacted by grey failures, but this is probably not true. Given the law of large numbers, there must be some number of grey failures in some percentage of smaller networks simply because there are so many of them. What is interesting about grey failures is there is so little study in this area; since these errors can exist in a network for years without being discovered, they are difficult to track down and repair, and they are often “fixed” by someone randomly doing things in surrounding systems that end up performing an “unintentional repair” (for instance by resetting some software state through a reboot). It is interesting, then, to see a group of operators collating the grey failures they have seen across a number of larger scale networks.

Gunawi, Haryadi S., Riza O. Suminto, Russell Sears, Casey Golliher, Swaminathan Sundararaman, Xing Lin, Tim Emami, et al. “Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems,” 1–14, 2018. https://www.usenix.org/conference/fast18/presentation/gunawi.

Some interesting results of the compilation are covered in a table early in the document. One of these is that grey Continue reading

The Network Sized Holes in Serverless

Until about 2017, the cloud was going to replace all on-premises data centers. As it turns out, however, the cloud has not replaced all on-premises data centers. Why not? Based on the paper under review, one potential answer is because containers in the cloud are still too much like “serverfull” computing. Developers must still create and manage what appear to be virtual machines, including:

  • Machine level redundancy, including georedundancy
  • Load balancing and request routing
  • Scaling up and down based on load
  • Monitoring and logging
  • System upgrades and security
  • Migration to new instances

Serverless solves these problems by placing applications directly onto the cloud, or rather a set of libraries within the cloud.

Jonas, Eric, Johann Schleier-Smith, Vikram Sreekanti, Chia-Che Tsai, Anurag Khandelwal, Qifan Pu, Vaishaal Shankar, et al. “Cloud Programming Simplified: A Berkeley View on Serverless Computing.” ArXiv:1902.03383 [Cs], February 9, 2019. http://arxiv.org/abs/1902.03383.

The authors define serverless by contrasting it with serverfull computing. While software is run based on an event in serverless, software runs until stopped in a cloud environment. While an application does not have a maximum run time in a serverfull environment, there is some maximum set by the provider in a serverless Continue reading
