Worth Reading: Azure Datacenter Switch Failures

Microsoft engineers published an analysis of switch failures in 130 Azure regions (review of the article, The Next Platform summary):

  • A data center switch has a 2% chance of failing in 3 months (= less than 10% per year);
  • ~60% of the failures are caused by hardware faults or power failures, another 17% are software bugs;
  • 50% of failures lasted less than 6 minutes (obviously crashes or power glitches followed by a reboot).
  • Switches running SONiC had a lower failure rate than switches running vendor NOS on the same hardware. Looks like bloatware results in more bugs, and taking months to fix bugs results in more crashes. Who would have thought…
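
The "less than 10% per year" figure in the first bullet can be sanity-checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes the 2% quarterly failure probability is constant and that quarters are independent, neither of which is stated in the summary above, and the function name is made up for illustration.

```python
def annualized_failure_probability(p_period: float, periods_per_year: int) -> float:
    """Probability of at least one failure in a year.

    Assumes (hypothetically) that each period fails independently
    with the same probability p_period.
    """
    # A switch survives the year only if it survives every period.
    survival = (1.0 - p_period) ** periods_per_year
    return 1.0 - survival


quarterly = 0.02  # 2% chance of failing in 3 months (from the summary above)
annual = annualized_failure_probability(quarterly, periods_per_year=4)
print(f"{annual:.2%}")  # prints 7.76%
```

Under those assumptions the yearly figure comes out just under 8%, consistent with the "less than 10% per year" shorthand.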

Tools 7. Show me your packets … with TCP dump

Hello my friend,

When something goes wrong with a distributed application that involves the network (e.g., between a client and a web service, or between the frontend and backend of a service), the network is the first thing to be blamed. After troubleshooting, it often turns out that the network is innocent, but first we need to prove it.


No part of this blogpost could be reproduced, stored in a
retrieval system, or transmitted in any form or by any
means, electronic, mechanical or photocopying, recording,
or otherwise, for commercial purposes without the
prior permission of the author.

Automated troubleshooting for automated networks?

The truth is that automation has helped me figure out the root cause of network outages or malfunctions so many times that I have stopped counting. I would say that automation solutions work perfectly if you create them to solve your own issues and tailor them to your environment.

That’s what our Live Network Automation Training (10 weeks) and Automation with Nornir (2 weeks) are all about: to show you real automation in a real environment with multiple vendors together. No matter what those vendors are, the automation principles, tools Continue reading

Worth Reading: Running BGP in Large-Scale Data Centers

Here’s one of the major differences between Facebook and Google: one of them publishes research papers with helpful and actionable information, the other uses publications as a recruitment drive full of “we’re so awesome but you have to trust us – we’re not sharing the crucial details.”

Recent data point: Facebook published an interesting paper describing their data center BGP design. Absolutely worth reading.

Just in case you haven’t realized: Petr Lapukhov of RFC 7938 fame moved from Microsoft to Facebook a few years ago. Coincidence? I think not.

QUIC Version 1 is live on Cloudflare

On May 27, 2021, the Internet Engineering Task Force published RFC 9000, the standardized version of the QUIC transport protocol. The QUIC Working Group declared themselves done by issuing a Last Call 7 months ago. The i's have been dotted and the t's crossed: RFC 8999 through RFC 9002 are a suite of documents that capture years of engineering design and testing of QUIC. This marks a big occasion.

And today, one day later, we’ve made the standardized version of QUIC available to Cloudflare customers.

Transport protocols have a history of being hard to deploy on the Internet. QUIC overcomes this challenge by basing itself on top of UDP. Compared to TCP, QUIC has security by default, protecting almost all bytes from prying eyes or "helpful" middleboxes that can end up making things worse. It has designed-in features that speed up connection handshakes and mitigate the performance perils that can strike on networks that suffer loss or delays. It is pluggable, providing clear standardised extension points that will allow smooth, iterative development and deployment of new features or performance enhancements for years to come.

The killer feature of QUIC, however, is that it is deployable in reality. We are Continue reading

How to Protect Azure VMware Solution Resources with Azure Application Gateway

Azure VMware Solution (AVS) is a VMware-validated private cloud solution managed and maintained by Azure. It runs on dedicated bare-metal Azure infrastructure. AVS allows customers to manage and secure applications across VMware environments and Microsoft Azure with a consistent operating framework. It supports workload migration, VM deployment, and Azure service consumption.

As the AVS private cloud runs in an isolated Azure environment, it is not accessible from Azure or the Internet by default. Users can use either ExpressRoute Global Reach (i.e., from on-prem) or a jump box (i.e., on an Azure VNet) to access the AVS private cloud. This means AVS workload VMs are confined within the AVS private cloud and not accessible from the Internet.

But what if customers want to make AVS private cloud resources, such as web servers, accessible from the Internet? In that case, a public IP needs to be deployed. There are a couple of ways to do this: (1) Azure Application Gateway, and (2) Destination NAT or DNAT using Azure WAN Hub and Firewall. Azure Application Gateway is Continue reading

Document The First Time, Every Time

Imagine you’re deep into a massive issue. You’ve been troubleshooting for hours trying to figure out why something isn’t working. You’ve pulled in resources to help and you’re on the line with the TAC to try and get a resolution. You know this has to be related to something recent because you just got notified about it yesterday. You’re working through logs and configuration settings trying to gain insights into what went wrong. That’s when the TAC engineer hits you with an armor-piercing question:

When did this start happening?

Now you’re sunk. When did you first start seeing it? Was it happening before and no one noticed? Did a tree fall in the forest and no one was around to hear the sound? What is the meaning of life now?

It’s not too hard to imagine the above scenario because we’ve found ourselves in it more times than we can count. We’ve started working on a problem and traced it back to a root cause only to find out that the actual inciting incident goes back even further than that. Maybe the symptoms just took a while to show up. Perhaps someone unknowingly “fixed” the issue with a Continue reading

Heavy Networking 580: Multivendor EVPN? Nope

Today's Heavy Networking is a nerdy excursion into EVPN VXLAN, including how it works, why you might want it, and why multivendor interoperability is so difficult with this standard. Guest Tony Bourke and host Ethan Banks also explore hardware challenges, automation strategies, EVPN flooding mechanisms, BGP multi-homing, and more.

The post Heavy Networking 580: Multivendor EVPN? Nope appeared first on Packet Pushers.

3 Reasons to Process Closer to the Edge

Being able to increase processing speeds will inevitably improve productivity and efficiency, but more importantly, edge computing will help with the management and use of data. In today's world and business environment, we have endless data being collected and transmitted.

Mauritius Must Not Fall into the ‘Mass Surveillance’ Trap

This article was originally published in French in L’express. On 17 May, 2020, The Internet Society, alongside the IGF Mauritius, submitted a response to the government of Mauritius’ call for input on the proposed amendments to the ICT Act for regulating the use and addressing the abuse and misuse of Social Media […]

The post Mauritius Must Not Fall into the ‘Mass Surveillance’ Trap appeared first on Internet Society.

Edge devices improve drilling efficiency for energy company

Moving compute and storage resources to edge locations can reduce latency and bandwidth needs, improve performance and save money. At the same time, widespread edge computing deployments can introduce significant management challenges. Servers can be hard enough to maintain when they’re in an on-prem data center. What if they’re deployed in the middle of nowhere?

Energy companies know all too well the challenges of remote computing.

“When we drill a well, it’s always in the middle of nowhere,” says Dingzhou Cao, senior advisor for data science at independent shale producer Devon Energy, a Fortune 500 company based in Oklahoma City, Okla.
