Author Archives: adriancolyer

Overload control for scaling WeChat microservices

Overload control for scaling WeChat microservices Zhou et al., SoCC’18

There are two reasons to love this paper. First off, we get some insights into the backend that powers WeChat; and secondly the authors share the design of the battle-hardened overload control system DAGOR that has been in production at WeChat for five years. This system has been specifically designed to take into account the peculiarities of microservice architectures. If you’re looking to put an overload control strategy in place for your own microservices, you could do a lot worse than start here.

WeChat

The WeChat backend at this point consists of over 3000 mobile services, including instant messaging, social networking, mobile payment, and third-party authorization. The platform sees between 10^10 and 10^11 external requests per day. Each such request can trigger many more internal microservice requests, such that the WeChat backend as a whole needs to handle hundreds of millions of requests per second.
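
A quick back-of-the-envelope check shows how those numbers hang together. This is a tiny Python sketch; the fan-out factor is an assumption for illustration only, not a figure from the paper.

external_per_day = 1e11                       # upper end of the 10^10 - 10^11 range
external_per_sec = external_per_day / 86_400  # seconds in a day
print(f"external requests/s ~ {external_per_sec:,.0f}")           # ~1.2 million/s
assumed_fanout = 200                          # internal calls per external request
print(f"internal requests/s ~ {external_per_sec * assumed_fanout:,.0f}")
# -> hundreds of millions of internal requests per second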

WeChat’s microservice system accommodates more than 3000 services running on over 20,000 machines in the WeChat business system, and these numbers keep increasing as WeChat is becoming immensely popular… As WeChat is ever actively evolving, its microservice system has been undergoing fast iteration of service updates. For instance, Continue reading

Unikernels as processes

Unikernels as processes Williams et al., SoCC’18

Ah, unikernels. Small size, fast booting, tiny attack surface, resource efficient, hard to deploy on existing cloud platforms, and undebuggable in production. There’s no shortage of strong claims on both sides of the fence.

See for example:

In today’s paper choice, Williams et al. give us an intriguing new option in the design space: running unikernels as processes. Yes, that’s initially confusing to get your head around! That means you still have a full-fat OS underneath the process, and you don’t get to take advantage of the strong isolation afforded by VMs. But through a clever use of seccomp, unikernels as processes still have strong isolation as well as increased throughput, reduced startup time, and increased memory density. Most importantly though, with unikernels as processes we can reuse standard infrastructure and tools:

We believe that running unikernels as processes is an important step towards running them in production, because, as processes, they can Continue reading

Debugging distributed systems with why-across-time provenance

Debugging distributed systems with why-across-time provenance Whittaker et al., SoCC’18

This value is 17 here, and it shouldn’t be. Why did the get request return 17?

Sometimes the simplest questions can be the hardest to answer. As the opening sentence of this paper states:

Debugging distributed systems is hard.

The kind of why questions we’re interested in for this paper are questions of provenance. What are the causes of this output? Provenance has been studied in the context of relational databases and dataflow systems, but here we’re interested in general distributed systems. (Strictly, those where the behaviour of each node can be modelled by a deterministic state machine: non-deterministic behaviour is left to future work).

Why why-provenance doesn’t work

Relational databases have why-provenance, which sounds on the surface exactly like what we’re looking for.

Given a relational database, a query issued against the database, and a tuple in the output of the query, why-provenance explains why the output tuple was produced. That is, why-provenance produces the input tuples that, if passed through the relational operators of the query, would produce the output tuple in question.
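
As a toy illustration of that definition (my own example in Python, not one from the paper), take a two-table join and ask for the why-provenance of a single output tuple: it is the set of input-tuple combinations (‘witnesses’) that produce it.

# Toy why-provenance for: SELECT e.name, d.city FROM emp e JOIN dept d ON e.dept = d.id
emp  = [("alice", "eng"), ("bob", "eng"), ("carol", "ops")]
dept = [("eng", "london"), ("ops", "nyc")]

def join():
    # Returns (input tuples, output tuple) pairs for the join above.
    return [((e, d), (e[0], d[1])) for e in emp for d in dept if e[1] == d[0]]

def why_provenance(output_tuple):
    # All combinations of input tuples that yield the given output tuple.
    return [inputs for inputs, out in join() if out == output_tuple]

print(why_provenance(("alice", "london")))
# [(('alice', 'eng'), ('eng', 'london'))] -- the witness for this output tuple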

One reason that won’t work in our distributed systems setting is that Continue reading

ApproxJoin: approximate distributed joins

ApproxJoin: approximate distributed joins Le Quoc et al., SoCC’18

GitHub: https://ApproxJoin.github.io

The join is a fundamental data processing operation and has been heavily optimised in relational databases. When you’re working with large volumes of unstructured data though, say with a data processing framework such as Flink or Spark, joins become distributed and much more expensive. One of the reasons for this is the amount of data that needs to be moved over the network. In many use cases, approximate results would be acceptable, and as we’ve seen before, likely much faster and cheaper to compute. Approximate computing with joins is tricky though: if you sample datasets before the join you reduce data movement, but also sacrifice up to an order of magnitude in accuracy; if you sample results after the join you don’t save on any data movement and the process is slow.

This paper introduces an approximate distributed join technique, ApproxJoin, which is able to sample before data shuffling without loss of end result accuracy. Compared to unmodified Spark joins with the same sampling ratio it achieves a speedup of 9x while reducing the shuffled data volume by 82x.
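
A very rough, single-machine Python sketch of the core idea: only tuples whose join keys appear on both sides can contribute to the result, so filter to that overlap first and sample within it. This illustrates the principle only; it is not ApproxJoin’s actual distributed implementation.

import random

def approx_join_sample(left, right, sampling_ratio):
    # Keys that appear on both sides; everything else can never join.
    overlap = {k for k, _ in left} & {k for k, _ in right}
    # Sample only within the overlap -- the data we would otherwise shuffle.
    left_sample = [(k, v) for k, v in left
                   if k in overlap and random.random() < sampling_ratio]
    right_by_key = {}
    for k, v in right:
        if k in overlap:
            right_by_key.setdefault(k, []).append(v)
    # Join the sampled left side against the filtered right side; aggregate
    # estimates over the result would be scaled up by 1/sampling_ratio.
    return [(k, lv, rv) for k, lv in left_sample for rv in right_by_key.get(k, [])]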

The following charts show ApproxJoin’s latency Continue reading

ASAP: fast, approximate graph pattern mining at scale

ASAP: fast, approximate graph pattern mining at scale Iyer et al., OSDI’18

I have a real soft spot for approximate computations. In general, we waste a lot of resources on overly accurate analyses when understanding the trends and / or the neighbourhood is quite good enough (do you really need to know it’s 78.763895% vs 78 ± 1%?). You can always drill in with more accuracy if the approximate results hint at something interesting or unexpected.

Approximate analytics is an area that has gathered attention in big data analytics, where the goal is to let the user trade-off accuracy for much faster results.

(See e.g. ApproxHadoop which we covered on The Morning Paper a while back).

In the realm of graph processing, graph pattern mining algorithms, which discover structural patterns in a graph, can reveal very interesting things in our data but struggle to scale to larger graphs. This is in contrast to graph analysis algorithms such as PageRank which typically compute properties of a graph using neighbourhood information.
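
To give a flavour of what approximate pattern mining looks like, here’s a generic edge-sampling estimator for triangle counting in Python. This is a textbook-style illustration, not necessarily ASAP’s own estimator: sample each edge with probability p, count triangles in the much smaller sample, and scale up by 1/p^3.

import random
from itertools import combinations

def count_triangles(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Each triangle is seen once from each of its three vertices.
    return sum(1 for u in adj
                 for v, w in combinations(adj[u], 2)
                 if w in adj.get(v, set())) // 3

def approx_triangles(edges, p=0.1, trials=20):
    # A triangle survives edge-sampling with probability p^3, so scale up.
    estimates = [count_triangles([e for e in edges if random.random() < p]) / p**3
                 for _ in range(trials)]
    return sum(estimates) / len(estimates)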

Today, a deluge of graph processing frameworks exist, both in academia and open-source… a vast majority of the existing graph processing frameworks however have focused on graph Continue reading

Sharding the shards: managing datastore locality at scale with Akkio

Sharding the shards: managing datastore locality at scale with Akkio Annamalai et al., OSDI’18

In Harry Potter, the Accio Summoning Charm summons an object to the caster of the spell, sometimes transporting it over a significant distance. In Facebook, Akkio summons data to a datacenter with the goal of improving data access locality for clients. Central to Akkio is the notion of microshards (μ-shards), units of data much smaller than a typical shard. μ-shards are defined by the client application, and should exhibit strong access locality (i.e., the application tends to read/write the data in a μ-shard together in a small window of time). Sitting as a layer between client applications and underlying datastores, Akkio has been in production at Facebook since 2014, where it manages around 100PB of data.
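
To make the layering concrete, here’s a highly simplified Python sketch of the kind of indirection a μ-shard layer introduces. The names and the migration policy are my own illustration, not Facebook’s implementation: the application maps keys to μ-shards, a location service maps μ-shards to datacenters, and μ-shards migrate towards wherever they are actually being accessed.

class MicroShardRouter:
    def __init__(self, home_dc):
        self.placement = {}     # mu-shard id -> datacenter currently holding it
        self.accesses = {}      # (mu-shard id, requesting dc) -> recent count
        self.home_dc = home_dc

    def shard_of(self, key):
        # Application-defined mapping: a mu-shard is a small, client-chosen unit
        # of data that tends to be read/written together (e.g. one user's data).
        return f"user:{key}"

    def locate(self, key, from_dc):
        shard = self.shard_of(key)
        dc = self.placement.setdefault(shard, self.home_dc)
        self.accesses[(shard, from_dc)] = self.accesses.get((shard, from_dc), 0) + 1
        # Illustrative policy: if another DC keeps asking, move the mu-shard there
        # (Akkio would do this asynchronously, based on richer access statistics).
        if from_dc != dc and self.accesses[(shard, from_dc)] > 10:
            self.placement[shard] = from_dc
        return dc   # the datacenter the request should be routed to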

Measurements from our production environment show that Akkio reduces latencies by up to 50%, cross-datacenter traffic by up to 50%, and storage footprint by up to 40% compared to reasonable alternatives.

Akkio can support trillions of μ-shards and many 10s of millions of data access requests per second.

Motivation

Our work in this area was initially motivated by our aim to reduce service response times and resource Continue reading

The FuzzyLog: a partially ordered shared log

The FuzzyLog: a partially ordered shared log Lockerman et al., OSDI’18

If you want to build a distributed system then having a distributed shared log as an abstraction to build upon — one that gives you an agreed upon total order for all events — is such a big help that it’s practically cheating! (See the “Can’t we all just agree” mini-series of posts for some of the background on consensus).

Services built over a shared log are simple, compact layers that map a high-level API to append/read operations on the shared log, which acts as the source of strong consistency, durability, failure atomicity, and transactional isolation. For example, a shared log version of ZooKeeper uses 1K lines of code, an order of magnitude lower than the original system.
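
The ‘simple, compact layer over append/read’ point is easy to see in code. Here’s a minimal Python sketch of a key-value map layered over a totally ordered shared log, with an in-memory stand-in for the log API (purely illustrative, not FuzzyLog’s interface):

class SharedLog:
    # In-memory stand-in for a durable, totally ordered shared log.
    def __init__(self):
        self.entries = []
    def append(self, entry):
        self.entries.append(entry)
        return len(self.entries) - 1      # position in the total order
    def read(self, from_pos):
        return self.entries[from_pos:]

class LogBackedKV:
    # Writes append to the log; reads replay any entries not yet applied locally.
    # Strong consistency falls out of the log's single total order.
    def __init__(self, log):
        self.log, self.state, self.applied = log, {}, 0
    def put(self, key, value):
        self.log.append(("put", key, value))
    def get(self, key):
        for op, k, v in self.log.read(self.applied):
            if op == "put":
                self.state[k] = v
            self.applied += 1
        return self.state.get(key)

log = SharedLog()
replica_a, replica_b = LogBackedKV(log), LogBackedKV(log)
replica_a.put("x", 1)
print(replica_b.get("x"))   # 1 -- replica b observes a's write by replaying the log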

There’s a catch of course. System-wide total orders are expensive to maintain. Sometimes it may be impossible (e.g. in the event of a network partition). But perhaps we don’t always need a total ordering. Oftentimes for example causal consistency is strong enough. FuzzyLog aims to provide the simplicity of a shared log without imposing a total order: it provides partial ordering instead. It’s designed for a world Continue reading

Moment-based quantile sketches for efficient high cardinality aggregation queries

Moment-based quantile sketches for efficient high cardinality aggregation queries Gan et al., VLDB’18

Today we’re temporarily pausing our tour through some of the OSDI’18 papers in order to look at a great sketch-based data structure for quantile queries over high-cardinality aggregates.

That’s a bit of a mouthful so let’s jump straight into an example of the problem at hand. Say you have telemetry data from millions of heterogeneous mobile devices running your app. Each device tracks multiple metrics such as request latency and memory usage, and is associated with dimensional metadata (categorical variables) such as application version and hardware model.

In applications such as A/B testing, exploratory data analysis, and operations monitoring, analysts perform aggregation queries to understand how specific user cohorts, device types, and feature flags are behaving.

We want to be able to ask questions like “what’s the 99%-ile latency over the last two weeks for v8.2 of the app?”

SELECT percentile(latency, 99) FROM requests
WHERE time > date_sub(curdate(), INTERVAL 2 WEEK)
AND app_version = "v8.2"

As well as threshold queries such as “what combinations of app version and hardware platform have a 99th percentile latency exceeding 100ms?”

SELECT app_version, hw_model, PERCENTILE(latency,  Continue reading
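
To give a feel for why a sketch helps with high-cardinality group-bys, here’s a simplified Python illustration of the ‘moments’ idea: each (dimension-combination, time-window) cell keeps only a tiny, mergeable summary consisting of count, min, max, and the first few power sums, so answering queries like the ones above merges pre-aggregated cells rather than touching raw latencies. Note the caveat in the code: the real moments sketch recovers quantiles from these statistics via a maximum-entropy fit; the normal approximation below is a crude stand-in for illustration only.

import math

class MomentSketch:
    Z = {0.5: 0.0, 0.95: 1.645, 0.99: 2.326}   # standard normal quantiles

    def __init__(self, k=4):
        self.n, self.mn, self.mx = 0, math.inf, -math.inf
        self.sums = [0.0] * k                   # sum(x^1) ... sum(x^k)

    def add(self, x):
        self.n += 1
        self.mn, self.mx = min(self.mn, x), max(self.mx, x)
        for i in range(len(self.sums)):
            self.sums[i] += x ** (i + 1)

    def merge(self, other):                     # the key property: cheap to merge
        self.n += other.n
        self.mn, self.mx = min(self.mn, other.mn), max(self.mx, other.mx)
        self.sums = [a + b for a, b in zip(self.sums, other.sums)]

    def quantile(self, q):
        # Crude stand-in: the paper fits a maximum-entropy distribution to the
        # moments rather than assuming normality.
        mean = self.sums[0] / self.n
        var = max(self.sums[1] / self.n - mean ** 2, 0.0)
        return min(max(mean + self.Z[q] * math.sqrt(var), self.mn), self.mx)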

Noria: dynamic, partially-stateful data-flow for high-performance web applications

Noria: dynamic, partially-stateful data-flow for high-performance web applications Gjengset, Schwarzkopf et al., OSDI’18

I have way more margin notes for this paper than I typically do, and that’s a reflection of my struggle to figure out what kind of thing we’re dealing with here. Noria doesn’t want to fit neatly into any existing box!

We’ve seen streaming data-flow engines that maintain state and offer SQL interfaces and even transactions (e.g. Apache Flink, and data Artisans’ Streaming Ledger for Flink). The primary model here is data-flow, and SQL is bolted on as an interface to the state. The title of this paper sets me off thinking along those lines, but from the end user perspective, Noria looks and feels more like a database. The SQL interface is primary, not ancillary, and it maintains relational data in base tables (using RocksDB as the storage engine). Noria makes intelligent use of data-flow beneath the SQL interface (i.e., dataflow is not exposed as an end-user programming model) in order to maintain a set of (semi-)materialized views. Noria itself figures out the most efficient data-flows to maintain those views, and how to update the data-flow graphs in the face of Continue reading
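
A minimal Python sketch of the ‘dataflow maintaining a materialized view’ idea (a toy of my own, nothing like Noria’s partially-stateful operators): writes to a base table stream through an aggregation operator, and reads are plain lookups into the maintained state.

from collections import defaultdict

class VoteCountView:
    # Toy materialized view for:
    #   SELECT story_id, COUNT(*) FROM votes GROUP BY story_id
    def __init__(self):
        self.counts = defaultdict(int)

    def on_write(self, vote):       # the dataflow update path
        self.counts[vote["story_id"]] += 1

    def read(self, story_id):       # the read path: no per-read aggregation
        return self.counts[story_id]

view = VoteCountView()
for story in (1, 1, 2):
    view.on_write({"story_id": story})
print(view.read(1))   # 2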

RobinHood: tail latency aware caching – dynamic reallocation from cache-rich to cache-poor

RobinHood: tail latency aware caching – dynamic reallocation from cache-rich to cache-poor Berger et al., OSDI’18

It’s time to rethink everything you thought you knew about caching! My mental model goes something like this: we have a set of items that probably follow a power-law of popularity.

We have a certain finite cache capacity, and we use it to cache the most frequently requested items, speeding up request processing.

Now, there’s a long tail of less frequently requested items, and if we request one of these that’s not in the cache the request is going to take longer (higher latency). But it makes no sense whatsoever to try and improve the latency for these requests by ‘shifting our cache to the right.’

Hence the received wisdom that, unless the full working set fits entirely in the cache, a caching layer doesn’t address tail latency.
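
A tiny Python simulation makes the point (all numbers are illustrative): with Zipf-like popularity and a cache holding the hottest 10% of items, the mean latency drops a lot, but the 99th percentile is still set by the misses.

import random

random.seed(0)
N_ITEMS, CACHE_TOP_K = 10_000, 1_000          # cache holds the 10% hottest items
HIT_MS, MISS_MS, N_REQUESTS = 1, 100, 20_000  # illustrative latencies

# Crude Zipf-like popularity: item i is requested with probability ~ 1/i.
items = range(1, N_ITEMS + 1)
weights = [1 / i for i in items]
requests = random.choices(items, weights=weights, k=N_REQUESTS)

latencies = sorted(HIT_MS if item <= CACHE_TOP_K else MISS_MS for item in requests)
hit_rate = latencies.count(HIT_MS) / len(latencies)
mean = sum(latencies) / len(latencies)
p99 = latencies[int(0.99 * len(latencies))]
print(f"hit rate ~ {hit_rate:.0%}, mean ~ {mean:.1f} ms, p99 = {p99} ms")
# Roughly: a healthy hit rate and a low mean, but p99 stays at the miss
# latency -- the cache barely touches the tail.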

So far we’ve been talking about one uniform cache. But in a typical web application one incoming request might fan out to many back-end service requests processed in parallel. The OneRF page rendering framework at Microsoft (which serves msn.com, microsoft.com and xbox.com among others) relies on more than 20 backend Continue reading

Maelstrom: mitigating datacenter-level disasters by draining interdependent traffic safely and efficiently

Maelstrom: mitigating datacenter-level disasters by draining interdependent traffic safely and efficiently Veeraraghavan et al., OSDI’18

Here’s a really valuable paper detailing four plus years of experience dealing with datacenter outages at Facebook. Maelstrom is the system Facebook use in production to mitigate and recover from datacenter-level disasters. The high level idea is simple: drain traffic away from the failed datacenter and move it to other datacenters. Doing that safely, reliably, and repeatedly, not so simple!

Modern Internet services are composed of hundreds of interdependent systems spanning dozens of geo-distributed datacenters. At this scale, seemingly rare natural disasters, such as hurricanes blowing down power lines and flooding, occur regularly.

How regularly? Well, we’re told that Maelstrom has been in production at Facebook for over four years, and in that time has helped Facebook to recover from over 100 disasters. I make that about one disaster every two weeks!! Not all of these are total loss of a datacenter, but they’re all serious datacenter-wide incidents. One example was fibre cuts leading to a loss of 85% of the backbone network capability connecting a datacenter to the FB infrastructure. Maelstrom had most user-facing traffic drained in about 17 minutes, and all traffic Continue reading

LegoOS: a disseminated, distributed OS for hardware resource disaggregation

LegoOS: a disseminated, distributed OS for hardware resource disaggregation Shan et al., OSDI’18

One of the interesting trends in hardware is the proliferation and importance of dedicated accelerators as general-purpose CPUs stopped benefitting from Moore’s law. At the same time we’ve seen networking getting faster and faster, causing us to rethink some of the trade-offs between local I/O and network access. The monolithic server as the unit of packaging for collections of such devices is starting to look less attractive:

  • It leads to inefficient resource utilisation, since CPU and memory for a job have to be allocated from the same machine. This can lead to eviction even when utilisation is overall low (e.g. 50%).

  • It is difficult to add, move, remove, or reconfigure hardware components after they have been installed in a server, leading to long up-front planning cycles for hardware rollouts at odds with the fast-moving rate of change in requirements.
  • It creates a coarse failure domain – when any hardware component within a server fails, the whole server is often unusable.
  • It doesn’t work well with heterogeneous devices and their rollout: e.g. GPGPUs, TPUs, DPUs, FPGAs, NVM, and NVMe-based SSDs.

To fully support the Continue reading

Orca: differential bug localization in large-scale services

Orca: differential bug localization in large-scale services Bhagwan et al., OSDI’18

Earlier this week we looked at REPT, the reverse debugging tool deployed live in the Windows Error Reporting service. Today it’s the turn of Orca, a bug localisation service that Microsoft have in production usage for six of their large online services. The focus of this paper is on the use of Orca with ‘Orion,’ where Orion is a codename given to a ‘large enterprise email and collaboration service that supports several millions of users, run across hundreds of thousands of machines, and serves millions of requests per second.’ We could call it ‘Office 365’ perhaps? Like REPT, Orca won a best paper award (meaning Microsoft Research scooped 2 out of the three awards at OSDI this year!).

Orca is designed to support on-call engineers (OCEs) in quickly figuring out the change (commit) that introduced a bug to a service so that it can be backed out. (Fixes can come later!). That’s a much harder task than it sounds in highly dynamic and fast moving environments. In ‘Orion’ for example there are many developers concurrently committing code. Post review the changes are eligible for inclusion in a Continue reading

REPT: reverse debugging of failures in deployed software

REPT: reverse debugging of failures in deployed software Cui et al., OSDI’18

REPT (‘repeat’) won a best paper award at OSDI’18 this month. It addresses the problem of debugging crashes in production software, when all you have available is a memory dump. In particular, we’re talking about debugging Windows binaries. To effectively understand and fix bugs, a developer wants to be able to follow the path leading up to the point of failure. Not just the control flow, but also the data values involved. This is known as reverse debugging. What’s so clever about REPT is that it enables reverse debugging without the high overheads of record/replay systems (typically up to 200%). It combines low overhead hardware tracing (Intel PT) to record a program’s control flow, with a novel binary analysis technique to recover (a very good percentage of) data flow information. Evaluated on 16 real-world bugs in software such as Chrome, Apache, PHP, and Python, REPT enabled effective reverse debugging for 14 of them, including 2 concurrency bugs.

REPT’s offline binary analysis and reverse debugging is integrated into WinDbg, and the Windows Error Reporting service (WER) is enhanced to support REPT so that developers can request Intel PT Continue reading

Capturing and enhancing in situ system observability for failure detection

Capturing and enhancing in situ system observability for failure detection Huang et al., OSDI’18

The central idea in this paper is simple and brilliant. The place where we have the most relevant information about the health of a process or thread is in the clients that call it. Today the state of the practice is to log and try to recover from a failed call at the client, while a totally separate failure detection infrastructure is responsible for figuring out whether or not things are working as desired. What Panorama does is turn clients into observers and reporters of the components they call, using these observations to determine component health. It works really well!
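
A minimal Python sketch of the ‘client as observer’ idea (the interfaces and the verdict rule are my own simplification, not Panorama’s actual design): wrap outbound calls, record success/error/slow observations about each callee, and summarise recent observations into a health verdict.

import time
from collections import defaultdict, deque

class ObservationStore:
    # Each process keeps recent observations about the components it calls.
    def __init__(self, window=100):
        self.obs = defaultdict(lambda: deque(maxlen=window))

    def report(self, component, status):        # status: "ok", "error" or "slow"
        self.obs[component].append(status)

    def verdict(self, component):
        recent = self.obs[component]
        if not recent:
            return "unknown"
        bad = sum(1 for s in recent if s != "ok")
        return "unhealthy" if bad / len(recent) > 0.5 else "healthy"

def observed_call(store, component, fn, slow_ms=100):
    # Wrap an outbound call so the caller doubles as a failure detector,
    # capturing gray failures (slowness, errors) and not just crash-stops.
    start = time.time()
    try:
        result = fn()
    except Exception:
        store.report(component, "error")
        raise
    elapsed_ms = (time.time() - start) * 1000
    store.report(component, "slow" if elapsed_ms > slow_ms else "ok")
    return result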

Panorama can easily integrate with popular distributed systems and detect all 15 real-world gray failures that we reproduced in less than 7s, whereas existing approaches detect only one of them in under 300s.

Panorama is open source and available at https://github.com/ryanphuang/panorama.

Combating gray failures with multiple observers

Panorama is primarily designed to catch gray failures, in which components and systems offer degraded performance but typically don’t crash-stop. One example of such a failure is a ZooKeeper cluster that could no longer service write Continue reading

Automatic discovery of tactics in spatio-temporal soccer match data

Automatic discovery of tactics in spatio-temporal soccer match data Decroos et al., KDD’18

Here’s a fun paper to end the week. Data collection from sporting events is now widespread. This fuels an endless thirst for team and player statistics. In terms of football (a term which shall refer to the game of soccer throughout this write-up) that leads to metrics such as completed passes, distance covered, interceptions, shots-on-goal, and so on. Decroos et al. want to go one level deeper though, and use the data to uncover team tactics. The state of the art today for tactical analysis still involves watching hours of video footage.

This paper focuses on the problem of detecting tactics from professional soccer matches based on spatio-temporal data.

Now when I think of tactics, a key component in my mind is the team shape and movement of players off the ball. Unfortunately Decroos et al. don’t have the data available to analyse that. So they have to do what they can based on more limited information.

Our dataset consists of event data for the English Premier League for the 2015/2016 season. This event data was manually collected by humans who watch video feeds of the matches Continue reading

Detecting spacecraft anomalies using LSTMs and nonparametric dynamic thresholding

Detecting spacecraft anomalies using LSTMs and nonparametric dynamic thresholding Hundman et al., KDD’18

How do you effectively monitor a spacecraft? That was the question facing NASA’s Jet Propulsion Laboratory as they looked forward towards exponentially increasing telemetry data rates for Earth Science satellites (e.g., around 85 terabytes/day for a Synthetic Aperture Radar satellite).

Spacecraft are exceptionally complex and expensive machines with thousands of telemetry channels detailing aspects such as temperature, radiation, power, instrumentation, and computational activities. Monitoring these channels is an important and necessary component of spacecraft operations given their complexity and cost. In an environment where a failure to detect and respond to potential hazards could result in the full or partial loss of spacecraft, anomaly detection is a critical tool to alert operations engineers of unexpected behavior.

Anomaly detection systems deployed today typically consist of tiered alarms indicating when values fall outside of pre-defined limits. There are also limited instances of expert systems and nearest-neighbour based approaches being tried, but their limitations prevented widespread adoption. A more accurate and scalable approach to anomaly detection that makes better use of limited engineering resources is required.

Any such system needs to work with data that is highly Continue reading

Online parameter selection for web-based ranking problems

Online parameter selection for web-based ranking problems Agarwal et al., KDD’18

Last week we looked at production systems from Facebook, Airbnb, and Snap Inc.; today it’s the turn of LinkedIn. This paper describes the system and model that LinkedIn use to determine the items to be shown in a user’s feed:

It replaces previous hand-tuning of the feature mix and ranking:

In the past, a developer spent days to accurately estimate [the feature weights] with desired behavior with every significant drift and intervention. Using the approach described in this paper, we have completely eliminated the laborious manual tuning, significantly increased developer productivity as well as obtained better Pareto optimal solutions.

Here are representative results comparing the impact on three key metrics for a setting found by the new algorithm over a period of 36 hours, vs the best setting that could be found in 7 days of hand-tuning:

Viral Actions is the most important metric for the LinkedIn feed, ‘with far reaching consequences from more subsequent sessions to more revenue.’ The algorithm is tasked with maximising this metric while not allowing others to drop by more than 1%. (Increases in the other metrics as well are Continue reading

I know you’ll be back: interpretable new user clustering and churn prediction on a mobile social application

I know you’ll be back: interpretable new user clustering and churn prediction on a mobile social application Yang et al., KDD’18

Churn rates (how fast users abandon your app / service) are really important in modelling a business. If the churn rate is too high, it’s hard to maintain growth. Since acquiring new customers is also typically much more expensive than expanding in existing accounts, churn hits you twice over. So it’s really important to understand what causes users to churn in your business, and ideally to be able to predict users likely to churn so that you can intervene. This paper describes ClusChurn, the churn prediction system deployed at Snapchat.

It has been argued for decades that acquiring new users is often more expensive than keeping old ones. Surprisingly, however, user retention and its core component, churn prediction, have received much less attention from the research community… While this paper focuses on the Snapchat data as a comprehensive example, the techniques developed here can be readily leveraged for other online platforms, where users interact with the platform functions as well as other users.

The central idea in ClusChurn is to cluster users into clusters that are meaningful Continue reading

Customized regression model for Airbnb dynamic pricing

Customized regression model for Airbnb dynamic pricing Ye et al., KDD’18

This paper details the methods that Airbnb use to suggest prices to listing hosts (hosts ultimately remain in control of pricing on the Airbnb platform).

The proposed strategy model has been deployed in production for more than 1 year at Airbnb. The launch of the first iteration of the strategy model yielded significant gains on bookings and booking values for hosts who have adopted our suggestions… multiple iterations of the strategy model have been experimented [with] and launched into production to further improve the quality of our price suggestions.

Figuring out the right price for a night in a given Airbnb listing is challenging because no two listings are the same. Even when we constrain to e.g. similar sized properties in the same region, factors such as the number of five star reviews can influence price. Furthermore demand is time-varying due to seasonality and regional events (with different seasonality patterns for different countries). And then of course, how far in advance a booking is being made also factors into the price (“as lead time reduces, there are less opportunities for this night to be booked, which Continue reading
