Author Archives: Todd Hoff
Author Archives: Todd Hoff
Hey, it's HighScalability time:
Failover does not cut it anymore. You need an ALWAYS ON architecture with multiple data centers.-- Martin Van Ryswyk, VP of Engineering at DataStax
Failover, switching to a redundant or standby system when a component fails, has a long and checkered history as a way of dealing with failure. The reason is your failover mechanism becomes a single point of failure that often fails just when it's needed most. Having worked on a few telecom systems that used a failover strategy I know exactly how stressful failover events can be and how stupid you feel when your failover fails. If you have a double or triple fault in your system failover is exactly the time when it will happen.
For a long time the only real trick we had for achieving fault tolerance was to have a hot, warm, or cold standby (disk, interface, card, server, router, generator, datacenter, etc.) and failover to it when there's a problem. This old style of Disaster Recovery planning is no longer adequate or necessary.
Now, thanks to cloud infrastructures, at least at a software system level, we have an alternative: an always on architecture. Google calls this a Continue reading
Hey, it's HighScalability time:
IT Security Engineering. At Gusto we are on a mission to create a world where work empowers a better life. As Gusto's IT Security Engineer you'll shape the future of IT security and compliance. We're looking for a strong IT technical lead to manage security audits and write and implement controls. You'll also focus on our employee, network, and endpoint posture. As Gusto's first IT Security Engineer, you will be able to build the security organization with direct impact to protecting PII and ePHI. Read more and apply here.
How did Paypal take a billion hits a day system that might traditionally run on a 100s of VMs and shrink it down to run on 8 VMs, stay responsive even at 90% CPU, at transaction densities Paypal has never seen before, with jobs that take 1/10th the time, while reducing costs and allowing for much better organizational growth without growing the compute infrastructure accordingly?
PayPal moved to an Actor model based on Akka. PayPal told their story here: squbs: A New, Reactive Way for PayPal to Build Applications. They open source squbs and you can find it here: squbs on GitHub.
The stateful service model still doesn't get enough consideration when projects are choosing a way of doing things. To learn more about stateful services there's an article, Making The Case For Building Scalable Stateful Services In The Modern Era, based on an great talk given by Caitie McCaffrey. And if that doesn't convince you here's WhatsApp, who used Erlang, an Akka competitor, to achieve incredible throughput: The WhatsApp Architecture Facebook Bought For $19 Billion.
I refer to the above articles because the PayPal article is short on architectural details. It's more about the factors the led the selection of Akka and the Continue reading
Hey, it's HighScalability time:
The big middle finger to the Olympic Committee. They pulled this video of the incredibly beautiful Olympic cauldron at Rio.
I have dozens if not hundreds of half finished articles and snippets of ideas in the haunted house that is my Google Docs. Walking the house around midnight, with the lights turned off course, I stumbled upon one ghost that has been haunting me since 2012. It is time to perform the ritual of exorcism by just publishing something.
You may or may not remember Obama for America, which in 2012 had a staff of 120 people that built and maintained the infrastructure that helped get out the vote for Obama.
Harper Reed and Dylan Richard headed up the effort. Around that time they were getting a lot of press. One of the things that interested me was how they held Gameday test events, where they would simulate failure modes in their testing environments. Google calls these DiRT (Disaster Recovery Testing event) exercises.
So I asked Harper and Dylan what these exercises actually were and they were kind enough to reply. And I apparently forgot all about it. My apologies. Better late than never? Yah, let's go with that.
Here are some of the failure testing scenarios carried out by the Obama for America team:
Hey, it's HighScalability time:
There's a meme out there that hardware is dead. Maybe not. Hardware is becoming more specialized as the general purpose CPU can't keep up. The tick-tock cycle created by Moore's law meant designers had a choice: build or buy. Make your own hardware to deep inspect 1gps of network traffic (for example) and release later or use an off-the-shelf CPU and release sooner.
Now in the anarchy of a Moore's lawless it looks like build is back. Jeff Dean is giving a talk at #scaledmlconf where he talks about this trend at Google.
CPU@jackclarkSF: Jeff Dean says Google can run its full Inception' v3 image model on a phone at about 6fps. And specialized ASICs are coming.
And Mo Patel captured this slide from the talk:
IT Security Engineering. At Gusto we are on a mission to create a world where work empowers a better life. As Gusto's IT Security Engineer you'll shape the future of IT security and compliance. We're looking for a strong IT technical lead to manage security audits and write and implement controls. You'll also focus on our employee, network, and endpoint posture. As Gusto's first IT Security Engineer, you will be able to build the security organization with direct impact to protecting PII and ePHI. Read more and apply here.
This is a guest post by Alessandro Pieri, Software Architect at Stream. Try out this 5 minute interactive tutorial to learn more about Stream’s API.
Originally built by Facebook in 2009, Apache Cassandra is a free and open-source distributed database designed to handle large amounts of data across a large number of servers. At Stream, we use Cassandra as the primary data store for our feeds. Cassandra stands out because it’s able to:
Shard data automatically
Handle partial outages without data loss or downtime
Scales close to linearly
If you’re already using Cassandra, your cluster is likely configured to handle the loss of 1 or 2 nodes. However, what happens when a full availability zone goes down?
In this article you will learn how to setup Cassandra to survive a full availability zone outage. Afterwards, we will analyze how moving from a single to a multi availability zone cluster impacts availability, cost, and performance.
Hey, it's HighScalability time:
We've been following an increasing ephemerality curve to get more and more utilization out of our big brawny boxes. VMs, VMs in the cloud, containers, containers in the cloud, and now serverless, which looks to be our first native cloud infrastructure.
Serverless is said to be about functions, but you really need a zip file of code to do much of anything useful, which is basically a container.
So serverless isn't so much about packaging as it is about not standing up your own chunky persistent services. Those services, like storage, like the database, etc, have moved to the environment.
Your code orchestrates the dance and implements specific behaviours. Serverless is nothing if not a framework writ large.
Serverless also intensifies the developer friendly disintermediation of infrastructure that the cloud started.
Upload your code and charge it on your credit card. All the developer has to worry about their function. Oh, and linking everything together (events, DNS, credentials, backups, etc) through a Byzantine patch panel of a UI; uploading each of your zillions of "functions" on every change; managing versions so you can separate out test, development, and production. But hey, nothing is perfect.
What may drive serverless more Continue reading
Hey, it's HighScalability time:
This is a guest repost by Donatas Abraitis, Lead Systems Engineer at at Hostinger International.
This article is about how we built the new high scalable cloud hosting solution using IPv6-only communication between commodity servers, what problems we faced with IPv6 protocol and how we tackled them for handling more than ten millions active users.
At Hostinger we care much about innovation technologies, thus we decided to run a new project named Awex that is based on this protocol. If we can, so why not start since today? Only frontend (user facing) services are running in dual-stack environment, everything else is IPv6-only for west-east traffic.
IT Security Engineering. At Gusto we are on a mission to create a world where work empowers a better life. As Gusto's IT Security Engineer you'll shape the future of IT security and compliance. We're looking for a strong IT technical lead to manage security audits and write and implement controls. You'll also focus on our employee, network, and endpoint posture. As Gusto's first IT Security Engineer, you will be able to build the security organization with direct impact to protecting PII and ePHI. Read more and apply here.
Awake Networks is an early stage network security and analytics startup that processes, analyzes, and stores billions of events at network speed. We help security teams respond to intrusions with super-human efficiency and provide macroscopic and microscopic insight into the networks they defend. We're looking for folks that are excited about building systems that handle scale in a constrained environment. We have many open-ended problems to solve around stream-processing, distributed systems, machine learning, query processing, data modeling, and much more! Please check out our jobs page to learn more.
How does Google keep all its services up and running? They almost never seem to fail. If you've ever wondered we get a wonderful peek behind the curtain in a talk given at GCP NEXT 2016 by Melissa Binde, Director, Storage SRE at Google: How Google Does Planet-Scale Engineering for Planet-Scale Infrastructure.
Melissa's talk is short, but it's packed with wisdom and delivered in a no nonsense style that makes you think if your service is down Melissa is definitely the kind of person you want on the case.
Oh, just what is SRE? It stands for Site Reliability Engineering, but a definition is more elusive. It's like the kind of answers you get when you ask for a definition of the Tao. It's more a process than a thing, as is made clear by Ben Sloss 24x7 VP, Google, who defines SRE as:
what happens when a software engineer is tasked with what used to be called operations.
Let that bounce around your head for awhile.
Above and beyond all else one thing is clear: SREs are the custodian of production. SREs are the custodian of customer experience, for both google.com and GCP.
Some Continue reading
Hey, it's HighScalability time:
When Lee Atchison arrived at Amazon, Amazon was in the process of moving from a large monolithic application to a Service Oriented Architecture.
Lee talks about this evolution in an interesting interview on Software Engineering Daily: Scalable Architecture with Lee Atchison, about Lee's new book: Architecting for Scale: High Availability for Your Growing Applications.
This is a topic Adrian Cockcroft has talked a lot about in relation to his work at Netflix, but it's a powerful experience to hear Lee talk about how Amazon made the transition with us having the understanding of what Amazon would later become.
Amazon was running into the problems of success. Not so much from a scaling to handle the requests perspective, but they were suffering from the problem of scaling the number of engineers working in the same code base.
At the time their philosophy was based on the Two Pizza team. A small group owns a particular piece of functionality. The problem is it doesn’t work to have hundreds of pizza teams working on the same code base. It became very difficult to innovate and add new features. It even became hard to build the application, pass the test suites, and Continue reading
This is just to delicious a parallel to pass up.
Here we have Google building a new four story datacenter Scaling Up: Google Building Four-Story Data Centers:
And here we have a new vertical farm from AeroFarms:
Both have racks of consumables. One is a rack of bits, the other is a rack of bites. Both used to sprawl horizontally across huge swaths of land and now are building up. Both designs are driven by economic efficiency, extracting the most value per square foot. Both are expanding to meet increased demand. It's a strange sort of convergence.