Grey Failure Lessons Learned

Grey Failures in the Real World

Most “smaller scale” operators probably believe they are not affected by grey failures, but this is probably not true. Given the law of large numbers, some percentage of smaller networks must be experiencing grey failures simply because there are so many of them. What is interesting about grey failures is how little study there is in this area. Because these errors can exist in a network for years without being discovered, they are difficult to track down and repair, and they are often “fixed” by someone randomly doing things in surrounding systems that end up performing an “unintentional repair” (for instance, by resetting some software state through a reboot). It is interesting, then, to see a group of operators collating the grey failures they have seen across a number of larger scale networks.

Gunawi, Haryadi S., Riza O. Suminto, Russell Sears, Casey Golliher, Swaminathan Sundararaman, Xing Lin, Tim Emami, et al. “Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems,” 1–14, 2018. https://www.usenix.org/conference/fast18/presentation/gunawi.

Some interesting results of the compilation are covered in a table early in the document. One of these is that grey Continue reading

How Facebook Might Find Nervana For Machine Learning Training

There is a rumor going around that a certain hyperscaler is going to be augmenting its GPU-based machine learning training and will be adopting Intel’s Nervana Neural Network Processor (NNP) for at least some of its workloads.

How Facebook Might Find Nervana For Machine Learning Training was written by Timothy Prickett Morgan.

Network Break 227: Facebook’s Plaintext Password Blunder; How Google’s Gaming Service Might Impact Broadband

Today's Network Break looks at Facebook's plaintext password blunder, Nokia's new investment in the IETF, the potential impact of Google's gaming service on broadband, new products from VMware and Dell EMC, and more tech news.

The post Network Break 227: Facebook’s Plaintext Password Blunder; How Google’s Gaming Service Might Impact Broadband appeared first on Packet Pushers.

The Week in Internet News: Russia Moves Toward Test of Internet Disconnect

A separate Internet: The MIT Technology Review looks at the implications of Russia’s test to cut itself off from the rest of the Internet, scheduled for early April. The shutdown is a test of an Internet sovereignty law being considered in Russia, but it’s unclear how the country will actually accomplish the disconnect.

Clamping down: Egypt is cracking down on fake news with new rules that critics say are meant to curb dissent and restrict information the government believes is a threat to national security, The Hill reports. The country’s Supreme Media Regulatory Council can now block websites and some social media accounts with more than 5,000 followers for what it believes is “fake news” and can fine operators up to US$14,400 without getting a court order. Meanwhile, Russian President Vladimir Putin has signed two bills that critics say amount to censorship, Ars Technica says. One bill allows stiff fines for disseminating what the government determines is fake news, and the second allows fines and jail time for insulting government officials, including Putin.

Encrypting the IoT: The U.S. National Institute of Standards and Technology is looking at encryption methods to protect the Internet of Things and other computing devices against future encryption-cracking technologies, Continue reading

IT needs to make mobile unified communications a priority

The need for safe, reliable, and easy-to-use communications tools has given rise to unified communications (UC), a strategy that integrates multiple communications modalities under a single management and security umbrella. The result is more effective communication, improved collaboration, and a boost to security and regulatory policies. Now that mobility is the primary networking vehicle for end users, it’s time for IT departments to make mobile unified communications (MUC) a priority.

The most important benefit of MUC is the ability of organizations to finally leave behind the uncontrolled, untracked mish-mash of consumer-centric, carrier, and third-party communications tools traditionally applied over the years. Communications are a critical organizational resource; MUC is a much easier vehicle to manage and scale, and MUC offers the visibility and control that’s essential to enterprise IT deployments. These advantages will enable MUC to become the dominant provisioning strategy and mechanism for organizational communications over the next five to 10 years.

Last Week on ipSpace.net (2019W12)

Spring started for real, so it was time for some early-spring cleaning and I managed to complete two webinars during last week:

Both webinars are part of the standard ipSpace.net subscription.

Amazon Aurora: design considerations for high throughput cloud-native relational databases

Amazon Aurora: design considerations for high throughput cloud-native relational databases Verbitski et al., SIGMOD’17

Werner Vogels recently published a blog post describing Amazon Aurora as their fastest-growing service ever. That post provides a high-level overview of Aurora and then links to two SIGMOD papers for further details. Also of note is the recent announcement of Aurora serverless. So the plan for this week on The Morning Paper is to cover both of these Aurora papers and then look at Calvin, which underpins FaunaDB.

Say you’re AWS, and the task in hand is to take an existing relational database (MySQL) and retrofit it to work well in a cloud-native environment. Where do you start? What are the key design considerations and how can you accommodate them? These are the questions our first paper digs into. (Note that Aurora supports PostgreSQL as well these days.)

Here’s the starting point:

In modern distributed cloud services, resilience and scalability are increasingly achieved by decoupling compute from storage and by replicating storage across multiple nodes. Doing so lets us handle operations such as replacing misbehaving or unreachable hosts, adding replicas, failing over from a writer to a replica, scaling the size Continue reading

Relentless Competition Drives Down Ethernet Switch Costs

If you want to see what real competition might look like at some point in the future of the server racket, look no further than the Ethernet switch market, where switch ASICs and the companies that build switches alike have to fight for every dollar and make it up in volume every year without pause.

Relentless Competition Drives Down Ethernet Switch Costs was written by Timothy Prickett Morgan.

Heavy Networking 437: Melding Policy And Technology With The Internet Governance Project

On today's Heavy Networking we explore the intersection of policy, politics, and technology with the Internet Governance Project (IGP), which connects tech and policy experts to help bridge gaps in understanding between these two communities, with the goal of influencing outcomes on issues such as free expression, privacy, and security.

The post Heavy Networking 437: Melding Policy And Technology With The Internet Governance Project appeared first on Packet Pushers.

A Full CI/CD Pipeline for Workers with Travis CI


In today’s post we’re going to talk about building a CI/CD pipeline for Cloudflare Workers using Travis CI. If you aren’t yet aware, Cloudflare Workers allow you to run JavaScript in all 165 of our data centers, and they deploy globally in about 30 seconds. Learn more here.

There are a few steps before we get started. We need to have a Worker script we want to deploy, some optional unit tests for the script, a serverless.yml file to deploy via the Serverless Framework, a .gitignore file to ignore the node_modules folder, and finally, a .travis.yml configuration file. All of these files will live in the same GitHub repository, which should have a final layout like:

----- worker.js
----- serverless.yml
----- test
      . worker-test.js
----- node_modules
----- package.json
----- package-lock.json
----- .travis.yml
----- .gitignore

The Worker Script

In a recent post we discussed a method for testing Workers. We’ll reuse this method here to test a really simple Worker script below which simply returns Hello World! in the body of the response. We will name our Worker worker.js.

addEventListener('fetch', event => {
  event.respondWith(handleRequest(event.request))
})
async function handleRequest(request) {
  return new  Continue reading
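The script excerpt above is cut off, so the rest of the original is not shown here. As a rough illustration only, a Worker matching the description (one that returns Hello World! in the response body) might look like the sketch below. The bare Response with no extra options and the typeof guard around addEventListener are assumptions, not taken from the post. The guard is there only so the same file can be loaded outside the Workers runtime, for example by a Node-based test runner.

```javascript
// worker.js (illustrative sketch; the original post's full script is truncated)

// Builds the response described in the post: "Hello World!" in the body.
// Passing no status or headers is an assumption; the defaults apply.
async function handleRequest(request) {
  return new Response('Hello World!')
}

// Register the fetch handler only where the runtime provides
// addEventListener (as the Workers runtime does). This guard is an
// assumption added so the file can also be loaded by a test runner.
if (typeof addEventListener === 'function') {
  addEventListener('fetch', event => {
    event.respondWith(handleRequest(event.request))
  })
}
```

With this shape, a unit test can call handleRequest directly and check the body text, assuming the test environment provides a global Response (Node 18 and later do).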