Russ

Author Archives: Russ

Grey Failure Lessons Learne

Grey Failures in the Real World

Most “smaller scale” operators probably believe they are not impacted by grey failures, but this is probably not true. Given the law of large numbers, there must be some number of grey failures in some percentage of smaller networks simply because there are so many of them. What is interesting about grey failures is there is so little study in this area; since these errors can exist in a network for years without being discovered, they are difficult to track down and repair, and they are often “fixed” by someone randomly doing things in surrounding systems that end up performing an “unintentional repair” (for instance by resetting some software state through a reboot). It is interesting, then, to see a group of operators collating the grey failures they have seen across a number of larger scale networks.

Gunawi, Haryadi S., Riza O. Suminto, Russell Sears, Casey Golliher, Swaminathan Sundararaman, Xing Lin, Tim Emami, et al. “Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems,” 1–14, 2018. https://www.usenix.org/conference/fast18/presentation/gunawi.

Some interesting results of the compilation are covered in a table early in the document. One of these is that grey Continue reading

The Network Sized Holes in Serverless

Until about 2017, the cloud was going to replace all on-premises data centers. As it turns out, however, the cloud has not replaced all on-premises data centers. Why not? Based on the paper under review, one potential answer is because containers in the cloud are still too much like “serverfull” computing. Developers must still create and manage what appear to be virtual machines, including:

  • Machine level redundancy, including georedundancy
  • Load balancing and request routing
  • Scaling up and down based on load
  • Monitoring and logging
  • System upgrades and security
  • Migration to new instances

Serverless solves these problems by placing applications directly onto the cloud, or rather a set of libraries within the cloud.

Jonas, Eric, Johann Schleier-Smith, Vikram Sreekanti, Chia-Che Tsai, Anurag Khandelwal, Qifan Pu, Vaishaal Shankar, et al. “Cloud Programming Simplified: A Berkeley View on Serverless Computing.” ArXiv:1902.03383 [Cs], February 9, 2019. http://arxiv.org/abs/1902.03383.

The authors define serverless by contrasting it with serverfull computing. While software is run based on an event in serverless, software runs until stopped in a cloud environment. While an application does not have a maximum run time in a serverfull environment, there is some maximum set by the provider in a serverless Continue reading

Bridging the Gap

Mike Bushong and Denise Donohue join Eyvonne, Jordan, and I to discuss the gap between network engineering and “the business,” and give us some thoughts on bridging it.

Outro Music:
Danger Storm Kevin MacLeod (incompetech.com)
Licensed under Creative Commons: By Attribution 3.0 License
http://creativecommons.org/licenses/by/3.0/

Research: Practical Challenge-Response for DNS

Because the speed of DNS is so important to the performance of any connection on the ‘net, a lot of thought goes into making DNS servers fast, including optimized software that can respond to queries in milliseconds, and connecting DNS servers to the ‘net through high bandwidth links. To set the stage for massive DDoS attacks based in the DNS system, add a third point: DNS responses tend to be much larger than DNS queries. In fact, a carefully DNS response can be many times larger than the query.

To use a DNS server as an amplifier in a DDoS attack, then, the attacker sends a query to some number of publicly accessible DNS servers. The source of this query is the address of the system to be attacked. If the DNS query is carefully crafted, the attacker can send small packets that cause a number of DNS servers to send large responses to a single IP address, causing large amounts of traffic to the system under attack.

Rami Al-Dalky, Michael Rabinovich, and Mark Allman. 2018. Practical Challenge-Response for DNS. In Proceedings of the Applied Networking Research Workshop (ANRW ’18). ACM, New York, NY, USA, 74-74. DOI: https://doi.org/10.1145/3232755. Continue reading

History of Networking: George Swallow on MPLS/TE

Traffic engineering (TE) is one of the most complex technologies used in large scale networks today. George Swallow joins us for a look at how and why TE was invented, and where some of the ideas came from.

Outro Music:
Danger Storm Kevin MacLeod (incompetech.com)
Licensed under Creative Commons: By Attribution 3.0 License
http://creativecommons.org/licenses/by/3.0/

Is it Balance, or Workism?

While we tend to focus on work/life balance, perhaps the better question is: how effective are we at using the time we use for work? From a recent study (which you may have already seen):

  • Workers average just 2 hours and 48 minutes of productive device time a day
  • 21% of working hours are spent on entertainment, news, and social media
  • 28% of workers start their day before 8:30 AM (and 5% start before 7 AM)
  • 40% of people use their computers after 10 PM
  • 26% of work is done outside of normal working hours
  • Workers average at least 1 hour of work outside of working hours on 89 days/year (and on ~50% of all weekend days)
  • We check email and IM, on average, every 6 minutes

This is odd—we are starting work earlier, finishing later, and working over weekends, but we still only “work” less than three hours a day.

The first question must be: is this right? How are they measuring productive versus unproductive device time? What is “work time,” really? I know I don’t keep any sort of recognizable “office hours,’ so it seems like it would be hard to measure how much time I spend Continue reading

Webinar: How the Internet Really Works

I’m doing a live webinar at Safari Books Online on March 15thabout the operation of the ‘net—

This live training will provide an overview of the systems, providers, and standards bodies important to the operation of the global Internet, including the Domain Name System (DNS), the routing and transport systems, standards bodies, and registrars.

You can register here.

Troubleshooting Webinar at Safari Books Online

I’m doing a live webinar on troubleshooting on Safari Books on the 19th of April—

Troubleshooting is a fundamental skill for all network engineers, from the least to most experienced. However, there is little material on correct and efficient troubleshooting techniques in a network engineering context, and no (apparent) live training in this area. Some chapters in books exist (such as the Computer Networking Problems and Solutions, published in December 2017), and some presentations in Cisco Live, but the level of coverage for this critical skill is far below what engineers working in the field to develop solid troubleshooting skills.

This course is grounded in the formal half split method of troubleshooting I learned back in electronic engineering applied to networks. It’s probably the most efficient and effective method of troubleshooting available.

You can register here.

Why not OSPF for the Internet Core?

Every now and again (not often enough, if I’m to be honest), someone will write me with what might seem like an odd question that actually turns out to be really interesting. This one is from Surya Ahuja, a student at NC State, where I occasionally drop by to do a guest lecture.

We were recently working on an example design problem in one of our courses, and like a dedicated student I was preparing the State, optimisation, surface sheet ? One of the design decisions was to explain the selection of the routing protocol. This got me thinking. When BGP was being created, were there modifications to OSPF itself considered? … it could have been made possible to just use OSPF across enterprise and the internet. Then why not use it?

A quick answer might go somethign like this: OSPF did not exist when BGP was invented. The IETF working group for OSPF was, in fact, started in 1988, while the original version of BGP, called EGP, was originally specified by Eric Rosen in 1982, some 6 years earlier. The short answer might be, then, that OSPF was not considered because OSPF had not been invented yet. ?

But this quick Continue reading

1 36 37 38 39 40 162