Archive

Category Archives for "The Morning Paper"

Automating chaos experiments in production

Automating chaos experiments in production Basiri et al., ICSE 2019

Are you ready to take your system assurance programme to the next level? This is a fascinating paper from members of Netflix’s Resilience Engineering team describing their chaos engineering initiatives: automated controlled experiments designed to verify hypotheses about how the system should behave under gray failure conditions, and to probe for and flush out any weaknesses. The ‘controlled’ part is important here because given the scale and complexity of the environment under test, the only meaningful place to do this is in production with real users.

Maybe that sounds scary, but one of the interesting perspectives this paper brings is to make you realise that it’s really not so different from any other change you might be rolling out into production (e.g. a bug fix, configuration change, new feature, or A/B test). In all cases we need to be able to carefully monitor the impact on the system, and back out if things start going badly wrong. Moreover, just like an A/B test, we’ll be collecting metrics while the experiment is underway and performing statistical analysis at the end to interpret the results.
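
To make the A/B analogy concrete, here is a minimal sketch of the kind of end-of-experiment check involved: compare a key metric between a control group and a group receiving the injected failure, and abort if the difference is significant. The metric, counts, and significance threshold are invented for illustration; this is not Netflix's actual analysis pipeline.

```python
# Minimal sketch of the statistical check behind a controlled chaos experiment:
# compare a key metric (e.g. successful playback starts) between a control group
# and an experiment group with the failure injected. Names, counts and thresholds
# are illustrative only.
import math

def two_proportion_z_test(success_ctrl, n_ctrl, success_exp, n_exp):
    """Return (z, two-sided p-value) for H0: both groups have the same success rate."""
    p_ctrl = success_ctrl / n_ctrl
    p_exp = success_exp / n_exp
    pooled = (success_ctrl + success_exp) / (n_ctrl + n_exp)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_ctrl + 1 / n_exp))
    z = (p_exp - p_ctrl) / se
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical counts collected while the experiment is underway.
z, p = two_proportion_z_test(success_ctrl=99_500, n_ctrl=100_000,
                             success_exp=99_350, n_exp=100_000)
print(f"z={z:.2f}, p={p:.4f}")
if p < 0.01:
    print("Significant impact detected: abort the experiment and restore normal traffic.")
```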

Netflix’s system is deployed on Continue reading

One SQL to rule them all: an efficient and syntactically idiomatic approach to management of streams and tables

One SQL to rule them all: an efficient and syntactically idiomatic approach to management of streams and tables Begoli et al., SIGMOD’19

In data processing it seems, all roads eventually lead back to SQL! Today’s paper choice is authored by a collection of experts from the Apache Beam, Apache Calcite, and Apache Flink projects, outlining their experiences building SQL interfaces for streaming. The net result is a set of proposed extensions to the SQL standard itself, being worked on under the auspices of the international SQL standardization body.

The thesis of this paper, supported by experience developing large open-source frameworks supporting real-world streaming use cases, is that the SQL language and relational model as-is and with minor non-intrusive extensions, can be very effective for manipulation of streaming data.
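
To give a flavour of what that looks like, here is a hypothetical windowed aggregation written in the group-window style that Calcite and Flink already support (exact syntax varies by engine, and the proposed standard extensions may differ). The table and column names are invented; the Python wrapper just holds the query string.

```python
# A hypothetical streaming aggregation in the Calcite/Flink group-window style.
# Table and column names are invented; the exact syntax of the proposed SQL
# standard extensions may differ from what shipping engines accept today.
TUMBLING_CLICK_COUNTS = """
SELECT
  user_id,
  TUMBLE_START(event_time, INTERVAL '1' HOUR) AS window_start,
  COUNT(*) AS clicks
FROM Clicks
GROUP BY
  user_id,
  TUMBLE(event_time, INTERVAL '1' HOUR)
"""

if __name__ == "__main__":
    # In a real deployment this string would be handed to a streaming SQL engine
    # (e.g. a Flink table environment or a Beam SQL transform); here we just print it.
    print(TUMBLING_CLICK_COUNTS)
```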

Many of the ideas presented here are already implemented by Apache Beam, Calcite, and Flink in some form, as one option amongst several. The streaming SQL interface has been adopted by Alibaba, Huawei, Lyft, Uber and others, with the following feedback presented to the authors as to why they made this choice:

  • Development and adoption costs are significantly lower compared to non-declarative stream processing APIs
  • Familiarity with standard SQL eases adoption Continue reading

The convoy phenomenon

The convoy phenomenon Blasgen et al., IBM Research Report 1977 (revised 1979)

Today we’re jumping from HotOS topics of 2019, to hot topics of 1977! With thanks to Pat Helland for the recommendation, and with Jim Gray as one of the authors, we have a combination that’s very hard to ignore :).

Here’s the set-up as relayed to me by Pat (with permission):

At work, I am part of a good-sized team working on a large system implementation. One of the very senior engineers with 25+ years’ experience mentioned a problem with the system. It seems that under test load, it would behave beautifully until the performance just fell to the floor. The system just crawled forever and never seemed to get out of this state. Work was getting done but at a pathetic rate. I said: “You have a convoy.” His response was: “Huh?”. I forwarded him the paper on “The Convoy Phenomenon”…

I have to confess I hadn’t heard of the convoy phenomenon either! Before we go on, take a moment to think of possible causes for the system behaviour described above. Lots of things can cause a performance cliff, but the interesting thing Continue reading

Machine learning systems are stuck in a rut

Machine learning systems are stuck in a rut Barham & Isard, HotOS’19

In this paper we argue that systems for numerical computing are stuck in a local basin of performance and programmability. Systems researchers are doing an excellent job improving the performance of 5-year old benchmarks, but gradually making it harder to explore innovative machine learning research ideas.

The thrust of the argument is that there’s a chain of inter-linked assumptions / dependencies from the hardware all the way to the programming model, and any time you step outside of the mainstream it’s sufficiently hard to get acceptable performance that researchers are discouraged from doing so.

Take a simple example: it would be really nice if we could have named dimensions instead of always having to work with indices.

Named dimensions improve readability by making it easier to determine how dimensions in the code correspond to the semantic dimensions described in, e.g., a research paper. We believe their impact could be even greater in improving code modularity, as named dimensions would enable a language to move away from fixing an order on the dimensions of a given tensor, which in turn would make function lifting more convenient…
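
To illustrate the point, here is a toy named-dimension wrapper in Python. It is not the authors' proposal, just a sketch of reducing over a dimension by name rather than by positional index.

```python
# Toy illustration of named dimensions: reduce over a dimension by name rather
# than by positional index. This is a sketch, not the paper's proposed design.
import numpy as np

class NamedTensor:
    def __init__(self, data, dims):
        assert data.ndim == len(dims), "one name per dimension"
        self.data, self.dims = data, tuple(dims)

    def sum(self, dim):
        axis = self.dims.index(dim)                     # look up the axis by name
        remaining = self.dims[:axis] + self.dims[axis + 1:]
        return NamedTensor(self.data.sum(axis=axis), remaining)

    def __repr__(self):
        return f"NamedTensor(dims={self.dims}, shape={self.data.shape})"

# The caller never needs to know (or fix) the storage order of the dimensions.
x = NamedTensor(np.ones((32, 3, 224, 224)), dims=("batch", "channel", "height", "width"))
print(x.sum("channel"))   # NamedTensor(dims=('batch', 'height', 'width'), shape=(32, 224, 224))
```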

For Continue reading

Designing far memory data structures: think outside the box

Designing far memory data structures: think outside the box Aguilera et al., HotOS’19

Last time out we looked at some of the trade-offs between RInKs and LInKs, and the advantages of local in-memory data structures. There’s another emerging option that we didn’t talk about there: the use of far-memory, memory attached to the network that can be remotely accessed without mediation by a local processor. For many data center applications this looks to me like it could be a compelling future choice.

Far memory brings many potential benefits over near memory: higher memory capacity through disaggregation, separate scaling between processing and far memory, better availability due to separate fault domains for far memory, and better shareability among processors.

It’s not all straightforward though. As we’ve seen a number of times before, there’s a trade-off between fast one-sided access that doesn’t involve the remote CPU, and a more traditional RPC style that does. In particular, if you end up needing to make multiple one-sided requests to get to the data you really need, it’s often faster to just go the RPC route.
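
A back-of-the-envelope model makes that trade-off concrete. The latency numbers below are illustrative assumptions, not measurements from the paper.

```python
# Back-of-the-envelope comparison of k dependent one-sided accesses (e.g. pointer
# chasing through a remote data structure) vs. a single RPC that does the whole
# traversal at the far-memory side. Latency numbers are illustrative assumptions.
ONE_SIDED_RTT_US = 2.0    # one remote read, no remote CPU involved
RPC_RTT_US = 4.0          # request/response through the remote CPU
REMOTE_WORK_US = 1.0      # time for the remote side to walk the structure

def one_sided_cost(dependent_accesses: int) -> float:
    # Each access depends on the previous result, so the round-trips serialize.
    return dependent_accesses * ONE_SIDED_RTT_US

def rpc_cost() -> float:
    return RPC_RTT_US + REMOTE_WORK_US

for k in (1, 2, 3, 5):
    winner = "one-sided" if one_sided_cost(k) < rpc_cost() else "RPC"
    print(f"{k} dependent accesses: one-sided={one_sided_cost(k):.1f}us "
          f"rpc={rpc_cost():.1f}us -> {winner}")
```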

Therefore, if we want to make full use of one-sided far memory, we need to think Continue reading

Fast key-value stores: an idea whose time has come and gone

Fast key-value stores: an idea whose time has come and gone Adya et al., HotOS’19

No controversy here! Adya et al. would like you to stop using Memcached and Redis, and start building 11-factor apps. Factor VI in the 12-factor app manifesto, “Execute the app as one or more stateless processes,” to be dropped and replaced with “Execute the app as one or more stateful processes.”

It all makes for a highly engaging position paper (even if that engagement doesn’t necessarily take the form of agreement on all points)! It’s healthy to challenge the status-quo from time to time…

Remote, in-memory key-value (RInK) stores such as Memcached and Redis are widely used in industry and are an active area of academic research. Coupled with stateless application servers to execute business logic and a database-like system to provide persistent storage, they form a core component of popular data center service architectures. We argue that the time of the RInK store has come and gone…

What on earth are they thinking? ;)

Why are developers using RInK systems as part of their design? Generally to cache data (including non-persistent data that never sees a backing store), to share Continue reading

What bugs cause production cloud incidents?

What bugs cause production cloud incidents? Liu et al., HotOS’19

Last time out we looked at SLOs for cloud platforms, today we’re looking at what causes them to be broken! This is a study of every high severity production incident at Microsoft Azure services over a span of six months, where the root cause of that incident was a software bug. In total, there were 112 such incidents over the period March – September 2018 (not all of them affecting external customers). Software bugs are the most common cause of incidents during this period, accounting for around 40% of all incidents (so we can infer there were around 280 incidents total in the pool).

The 112 incidents caused by software bugs are further broken down into categories, with data-format bugs, fault-related bugs, timing bugs, and constant-value bugs being the largest categories. Interestingly, outages caused by configuration errors represented only a small number of incidents in this study. This could be an artefact of that data set in some way, or it might be due to the tool chain that Microsoft uses:

The types of bugs we observed in production are biased by the fact that Microsoft uses effective Continue reading

Nines are not enough: meaningful metrics for clouds

Nines are not enough: meaningful metrics for clouds Mogul & Wilkes, HotOS’19

It’s hard to define good SLOs, especially when outcomes aren’t fully under the control of any single party. The authors of today’s paper should know a thing or two about that: Jeffrey Mogul and John Wilkes at Google! John Wilkes was also one of the co-authors of chapter 4 “Service Level Objectives” in the SRE book, which is good background reading for the discussion in this paper.

The opening paragraph of the abstract does a great job of framing the problem:

Cloud customers want strong, understandable promises (Service Level Objectives, or SLOs) that their applications will run reliably and with adequate performance, but cloud providers don’t want to offer them, because they are technically hard to meet in the face of arbitrary customer behavior and the hidden interactions brought about by statistical multiplexing of shared resources.

When it comes to SLOs, the interests of the customer and the cloud provider are at odds, and so we end up with SLAs (Service Level Agreements) that tie SLOs to contractual agreements.

What are we talking about

Let’s start out by getting some terms straight: SLIs, SLOs, SLAs, and Continue reading
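
As a minimal illustration of those terms (with invented numbers): an SLI is something you measure, an SLO is the target you hold it to, and the gap between them is your error budget.

```python
# Minimal sketch: compute an availability SLI over a window and the error budget
# left under a given SLO. All numbers are illustrative.
SLO_TARGET = 0.999          # "three nines" availability objective
total_requests = 10_000_000
failed_requests = 7_200

sli = 1 - failed_requests / total_requests          # measured service level indicator
budget = (1 - SLO_TARGET) * total_requests          # failures the SLO allows this window
budget_remaining = budget - failed_requests

print(f"SLI = {sli:.5f} (target {SLO_TARGET})")
print(f"Error budget: {budget:.0f} failures allowed, {budget_remaining:.0f} remaining")
# An SLA would additionally attach contractual consequences (e.g. service credits)
# to missing the SLO; that part lives in the contract, not in code.
```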

Towards multiverse databases

Towards multiverse databases Marzoev et al., HotOS’19

A typical backing store for a web application contains data for many users. The application makes queries on behalf of an authenticated user, but it is up to the application itself to make sure that the user only sees data they are entitled to see.

Any frontend can access the whole store, regardless of the application user consuming the results. Therefore, frontend code is responsible for permission checks and privacy-preserving transformations that protect users’ data. This is dangerous and error-prone, and has caused many real-world bugs… the trusted computing base (TCB) effectively includes the entire application.

The central idea behind multiverse databases is to push the data access and privacy rules into the database itself. The database takes on responsibility for authorization and transformation, and the application retains responsibility only for authentication and correct delegation of the authenticated principal on a database call. Such a design rules out an entire class of application errors, protecting private data from accidentally leaking.
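
Here is a toy sketch of that shift in responsibility. It is not the multiverse design itself: the store applies a per-user policy to every read, so a buggy frontend can only ever see the authenticated principal's slice of the data. The table, policy, and class names are invented.

```python
# Toy sketch of enforcing privacy policies at the storage interface rather than
# in frontend code. Illustrative only, not the multiverse database design itself.
MESSAGES = [
    {"id": 1, "sender": "alice", "recipient": "bob",  "body": "hi bob"},
    {"id": 2, "sender": "carol", "recipient": "dave", "body": "secret"},
]

def policy(row, principal):
    """A row is visible only to its sender or recipient; other rows never leave the store."""
    return principal in (row["sender"], row["recipient"])

class PolicyEnforcingStore:
    def __init__(self, rows, policy):
        self.rows, self.policy = rows, policy

    def query(self, principal, predicate=lambda r: True):
        # The application only supplies the authenticated principal and a query;
        # filtering happens here, inside the trusted store, on every read.
        return [r for r in self.rows if self.policy(r, principal) and predicate(r)]

store = PolicyEnforcingStore(MESSAGES, policy)
print(store.query("bob"))    # sees message 1 only
print(store.query("eve"))    # sees nothing, regardless of frontend bugs
```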

It would be safer and easier to specify and transparently enforce access policies once, at the shared backend store interface. Although state-of-the-art databases have security features designed for exactly this purpose, Continue reading

A case for managed and model-less inference serving

A case for managed and model-less inference serving Yadwadkar et al., HotOS’19

HotOS’19 is presenting me with something of a problem as there are so many interesting looking papers in the proceedings this year it’s going to be hard to cover them all! As a transition from the SysML papers we’ve been looking at recently, I’ve chosen a HotOS position paper from the Stanford Platform Lab to kick things off. As we saw with the SOAP paper last time out, even with a fixed model variant and hardware there are a lot of different ways to map a training workload over the available hardware. In “A case for managed and model-less inference serving” Yadwadkar et al. look at a similar universe of possibilities for model serving at inference time, and conclude that it’s too much to expect users to navigate this by themselves.

Making queries to an inference engine has many of the same throughput, latency, and cost considerations as making queries to a datastore, and more and more applications are coming to depend on such queries. “For instance, Facebook applications issue tens-of-trillions of inference queries per day with varying performance, accuracy, and cost constraints.”
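
To see why pushing this choice onto users is unreasonable, consider even a toy version of the selection problem: pick the cheapest (model variant, hardware) pair that meets a latency objective at the expected query rate. All figures below are invented for illustration.

```python
# Toy version of the search a 'model-less' serving system would hide from users:
# choose the cheapest (model variant, hardware) configuration that meets a latency
# objective at the expected query rate. All figures are invented.
CONFIGS = [
    # (variant, hardware, p99 latency ms, queries/sec per instance, $/hour per instance)
    ("resnet50",  "cpu", 180, 40,   0.20),
    ("resnet50",  "gpu",  25, 900,  2.50),
    ("mobilenet", "cpu",  35, 200,  0.20),
    ("mobilenet", "gpu",   8, 3000, 2.50),
]

def cheapest_config(latency_slo_ms, target_qps):
    best = None
    for variant, hw, p99, qps, cost in CONFIGS:
        if p99 > latency_slo_ms:
            continue                           # violates the latency objective
        instances = -(-target_qps // qps)      # ceiling division
        hourly = instances * cost
        if best is None or hourly < best[-1]:
            best = (variant, hw, instances, hourly)
    return best

print(cheapest_config(latency_slo_ms=50, target_qps=5000))
```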

If Continue reading

Beyond data and model parallelism for deep neural networks


Beyond data and model parallelism for deep neural networks Jia et al., SysML’19

I’m guessing the authors of this paper were spared some of the XML excesses of the late nineties and early noughties, since they have no qualms putting SOAP at the core of their work! To me that means the “simple” object access protocol, but not here:

We introduce SOAP, a more comprehensive search space of parallelization strategies for DNNs that includes strategies to parallelize a DNN in the Sample, Operator, Attribute, and Parameter dimensions.

The goal here is to reduce the training times of DNNs by finding efficient parallel execution strategies, and even including its search time, FlexFlow is able to increase training throughput by up to 3.3x compared to state-of-the-art approaches.

There are two key ideas behind FlexFlow. The first is to expand the set of possible solutions (and hence also the search space!) in the hope of covering more interesting potential solutions. The second is an efficient execution simulator that makes searching that space possible by giving a quick evaluation of the potential performance of a given parallelisation strategy. Combine those with an off-the-shelf Metropolis-Hastings MCMC search strategy and Bob’s your uncle.
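
The shape of that search loop is easy to sketch. The configuration space and cost function below are stand-ins (the real system scores candidates with its execution simulator over the full SOAP space), so treat this as an illustration of the Metropolis-Hastings skeleton only.

```python
# Shape of a Metropolis-Hastings search over parallelization strategies. The cost
# function below is a stand-in for FlexFlow's execution simulator, and the tiny
# configuration space is invented purely for illustration.
import math
import random

random.seed(0)
LAYERS = 4
DEGREES = [1, 2, 4, 8]          # per-layer parallelism degree along some SOAP dimension

def simulated_cost(strategy):
    # Stand-in for the simulator: pretend 4-way parallelism is ideal per layer,
    # with a penalty when adjacent layers disagree (communication cost).
    compute = sum(abs(d - 4) for d in strategy)
    comm = sum(abs(a - b) for a, b in zip(strategy, strategy[1:]))
    return 1.0 + compute + 0.5 * comm

def metropolis_hastings(steps=2000, temperature=1.0):
    current = [random.choice(DEGREES) for _ in range(LAYERS)]
    best = list(current)
    for _ in range(steps):
        proposal = list(current)
        proposal[random.randrange(LAYERS)] = random.choice(DEGREES)   # local move
        delta = simulated_cost(proposal) - simulated_cost(current)
        if delta < 0 or random.random() < math.exp(-delta / temperature):
            current = proposal                                        # accept
            if simulated_cost(current) < simulated_cost(best):
                best = list(current)
    return best, simulated_cost(best)

print(metropolis_hastings())    # e.g. ([4, 4, 4, 4], 1.0)
```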

Continue reading

PyTorch-BigGraph: a large-scale graph embedding system

PyTorch-BigGraph: a large-scale graph embedding system Lerer et al., SysML’19

We looked at graph neural networks earlier this year, which operate directly over a graph structure. Via graph autoencoders or other means, another approach is to learn embeddings for the nodes in the graph, and then use these embeddings as inputs into a (regular) neural network:

Working with graph data directly is difficult, so a common technique is to use graph embedding methods to create vector representations for each node so that distances between these vectors predict the occurrence of edges in the graph.
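
A miniature version of that idea in plain numpy (not PyTorch-BigGraph itself): learn node vectors so that observed edges score higher, under a dot product, than randomly sampled non-edges. The toy graph and hyperparameters are invented.

```python
# Miniature link-prediction embedding trainer in plain numpy: learn node vectors so
# that observed edges score higher (via dot product) than sampled negative edges.
import numpy as np

rng = np.random.default_rng(0)
NUM_NODES, DIM, LR = 6, 8, 0.1
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]    # toy graph

emb = rng.normal(scale=0.1, size=(NUM_NODES, DIM))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for epoch in range(200):
    for src, dst in edges:
        neg = rng.integers(NUM_NODES)                # one negative sample per edge
        for target, label in ((dst, 1.0), (neg, 0.0)):
            score = emb[src] @ emb[target]
            grad = sigmoid(score) - label            # gradient of the logistic loss
            g_src = grad * emb[target]
            g_tgt = grad * emb[src]
            emb[src] -= LR * g_src
            emb[target] -= LR * g_tgt

print("edge score   :", emb[0] @ emb[1])             # typically higher after training
print("non-edge     :", emb[0] @ emb[4])             # typically lower
```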

When you’re Facebook, the challenge in learning embeddings is that the graph is big: over two billion user nodes, and over a trillion edges. Alibaba’s graph has more than one billion users and two billion items; Pinterest’s graph has more than 2 billion entities and over 17 billion edges. At this scale we have to find a way to divide-and-conquer. We’ll need to find some parallelism to embed graphs with trillions of edges in reasonable time, and a way of partitioning the problem so that we don’t need all of the embeddings in memory at each node (‘many standard methods exceed the memory Continue reading

Towards federated learning at scale: system design

Towards federated learning at scale: system design Bonawitz et al., SysML 2019

This is a high level paper describing Google’s production system for federated learning. One of the most interesting things to me here is simply to know that Google are working on this, have a first version in production working with tens of millions of devices, and see significant future expansion for the technology (‘we anticipate uses where the number of devices reaches billions’).

So what exactly is federated learning?

Federated Learning (FL) is a distributed machine learning approach which enables training on a large corpus of decentralized data residing on devices like mobile phones. FL is one instance of the more general approach of “bringing the code to the data, instead of the data to the code” and addresses the fundamental problems of privacy, ownership, and locality of data.

Note that this goes beyond using an on-device model to make predictions based on local data. Here we’re actually training the model in a distributed fashion, using data collected on the devices, without the data ever leaving those devices. The FL system contains a number of privacy-enhancing building blocks, but the privacy guarantees of any end-to-end system Continue reading
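
At the heart of this is federated averaging: the server combines model updates from participating devices, weighted by how much local data each device trained on. The sketch below shows only that aggregation step, with invented weights, and omits the device selection, secure aggregation, and pacing machinery the paper's system is actually about.

```python
# Minimal FedAvg-style aggregation sketch: the server combines model weights from
# devices, weighted by how many local examples each device trained on. It omits
# the selection, secure aggregation, and pacing machinery a production system adds.
import numpy as np

def federated_average(global_weights, client_updates):
    """client_updates: list of (num_examples, updated_weights) from participating devices."""
    total = sum(n for n, _ in client_updates)
    new_weights = np.zeros_like(global_weights)
    for n, w in client_updates:
        new_weights += (n / total) * w
    return new_weights

global_w = np.zeros(4)
# Each device trains locally on data that never leaves it, then reports weights only.
updates = [
    (100, np.array([0.9, 1.1, 0.0, 0.5])),
    (300, np.array([1.1, 0.9, 0.2, 0.3])),
]
print(federated_average(global_w, updates))   # weighted toward the larger client
```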

Data validation for machine learning

Data validation for machine learning Breck et al., SysML’19

Last time out we looked at continuous integration testing of machine learning models, but arguably even more important than the model is the data. Garbage in, garbage out.

In this paper we focus on the problem of validating the input data fed to ML pipelines. The importance of this problem is hard to overstate, especially for production pipelines. Irrespective of the ML algorithms used, data errors can adversely affect the quality of the generated model.

Breck et al. describe for us the data validation pipeline deployed in production at Google, “used by hundreds of product teams to continuously monitor and validate several petabytes of production data per day.” That’s trillions of training and serving examples per day, across more than 700 machine learning pipelines. More than enough to have accumulated some hard-won experience on what can go wrong and the kinds of safeguards it is useful to have in place!
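
As a flavour of the kind of checks involved, here is a toy schema-driven batch validator. It is illustrative only, not the data validation system described in the paper; the schema, feature names, and thresholds are invented.

```python
# Toy data-validation sketch: check each incoming training batch against an
# expected schema before it reaches the trainer. Illustrative only.
SCHEMA = {
    "country":   {"type": str, "allowed": {"US", "GB", "DE", "FR"}},
    "age":       {"type": int, "min": 0, "max": 130},
    "clicks_7d": {"type": int, "min": 0},
}

def validate_batch(batch):
    anomalies = []
    for i, example in enumerate(batch):
        for feature, spec in SCHEMA.items():
            if feature not in example:
                anomalies.append(f"example {i}: missing feature '{feature}'")
                continue
            value = example[feature]
            if not isinstance(value, spec["type"]):
                anomalies.append(f"example {i}: '{feature}' has type {type(value).__name__}")
            elif "allowed" in spec and value not in spec["allowed"]:
                anomalies.append(f"example {i}: unexpected value {value!r} for '{feature}'")
            elif "min" in spec and value < spec["min"]:
                anomalies.append(f"example {i}: '{feature}'={value} below {spec['min']}")
            elif "max" in spec and value > spec["max"]:
                anomalies.append(f"example {i}: '{feature}'={value} above {spec['max']}")
    return anomalies

batch = [
    {"country": "US", "age": 34, "clicks_7d": 12},
    {"country": "??", "age": -1},                      # corrupted upstream data
]
for anomaly in validate_batch(batch):
    print(anomaly)
```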

What could possibly go wrong?

The motivating example is based on an actual production outage at Google, and demonstrates a couple of the trickier issues: feedback loops caused by training on corrupted data, and distance between data Continue reading

Continuous integration of machine learning models with ease.ml/ci

Continuous integration of machine learning models with ease.ml/ci: towards a rigorous yet practical treatment Renggli et al., SysML’19

Developing machine learning models is no different from developing traditional software, in the sense that it is also a full life cycle involving design, implementation, tuning, testing, and deployment. As machine learning models are used in more task-critical applications and are more tightly integrated with traditional software stacks, it becomes increasingly important for the ML development life cycle also to be managed following systematic, rigid engineering discipline.

I didn’t find this an easy paper to follow at all points, but the question it addresses is certainly interesting: what does a continuous integration testing environment look like for a machine learning model? ease.ml/ci is a CI system for machine learning, and it has to take into account two main differences from a regular CI test suite:

  1. Machine learning is inherently probabilistic, so test conditions are evaluated with respect to an (ε, δ)-reliability requirement, where 1 − δ is the probability of a valid test and ε is the error tolerance (see the sample-size sketch after this list).
  2. By getting pass/fail feedback from the CI server, the model can adapt to the test set used in the CI environment, leading to it Continue reading
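
To make the (ε, δ) requirement tangible, the plain Hoeffding bound gives the number of fresh, i.i.d. labelled test examples needed to estimate a model's accuracy to within ±ε with probability at least 1 − δ:

n ≥ ln(2/δ) / (2ε²)

The paper's estimators are more sample-efficient than this baseline, but the calculation shows why test-set size (and test-set reuse) matters.

```python
# How many i.i.d. test examples are needed to estimate accuracy to within ±eps
# with probability at least 1 - delta, using the plain Hoeffding bound. The paper's
# estimators are cleverer (and more sample-efficient); this is just the baseline.
import math

def hoeffding_sample_size(eps: float, delta: float) -> int:
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

for eps, delta in [(0.05, 0.05), (0.01, 0.05), (0.01, 0.001)]:
    print(f"eps={eps}, delta={delta}: n >= {hoeffding_sample_size(eps, delta):,}")
# eps=0.01, delta=0.001 needs ~38,005 labelled examples, and reusing the same test
# set across many commits erodes the guarantee, which is why adaptivity matters.
```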

A case for lease-based, utilitarian resource management on mobile devices

A case for lease-based, utilitarian resource management on mobile devices Hu et al., ASPLOS’19

I’ve chosen another energy-related paper to end the week, addressing a problem many people can relate to: apps that drain your battery. LeaseOS borrows the concept of a lease from distributed systems, but with a rather nice twist, and is able to reduce power wastage by 92% with no disruption to application experience and no changes required to the apps themselves.

So about that twist. LeaseOS injects a transparent proxy between an app and a power-hungry OS resource. The app thinks it has control of the resource until it releases it, but under the covers the proxy is given a lease. In a traditional leasing scheme, it’s up to the borrower to request a lease extension. But here half the problem is that apps are requesting expensive resources they don’t really need. So instead the OS monitors how wisely the leased resource is being used. If an app is making good, legitimate use of the resource then the lease will be transparently extended. If it isn’t, it loses the underlying resource. How you tell whether or not an app is being a wise steward of Continue reading
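
A toy sketch of that utilization-driven decision is below. The lease term, the utility metric, and the threshold are invented for illustration; how LeaseOS actually judges "wise" use is what the paper goes on to describe.

```python
# Toy sketch of a utilization-driven lease: the OS-side proxy renews the lease while
# the app is making good use of the resource, and silently revokes it otherwise.
# The term length, utility metric, and threshold are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Lease:
    resource: str
    term_ms: int
    active: bool = True

class LeaseProxy:
    UTILITY_THRESHOLD = 0.2     # e.g. fraction of held time spent doing real work

    def __init__(self, resource: str, term_ms: int = 5_000):
        self.lease = Lease(resource, term_ms)

    def on_term_expiry(self, observed_utility: float):
        """Called by the OS at the end of each lease term with a measured utility score."""
        if observed_utility >= self.UTILITY_THRESHOLD:
            print(f"renewing lease on {self.lease.resource} (utility={observed_utility:.2f})")
        else:
            self.lease.active = False
            print(f"revoking {self.lease.resource}: app holds it but isn't using it")

proxy = LeaseProxy("wakelock")
proxy.on_term_expiry(observed_utility=0.65)   # busy app keeps the resource
proxy.on_term_expiry(observed_utility=0.02)   # idle-but-holding app loses it
```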

Boosted race trees for low energy classification

Boosted race trees for low energy classification Tzimpragos et al., ASPLOS’19

We don’t talk about energy as often as we probably should on this blog, but it’s certainly true that our data centres and various IT systems consume an awful lot of it. So it’s interesting to see a paper using nano-Joules per prediction as an evaluation metric. The goal is to produce a low-energy hardware classifier for embedded applications doing local processing of sensor data. To get there, the authors question a whole bunch of received wisdom, beginning with this: do we really need to convert the analog sensor data into a digital signal?! Here’s another fun one: what if instead of being something you worked hard to avoid, you had to build your whole application based on the outcomes of data races??!

Typically, a sensor gathers analog information from the physical world and then converts it into a conventional digital signal… While this binary-represented integer is perfectly efficient for storage as bits in memory and for typical general purpose computing operations, it is unclear that this is the most efficient for our target application. One such possible representation is pure analog signalling.

Of course analog signalling comes Continue reading
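
A representation along these lines is race logic, in which a value is encoded by when an edge arrives rather than by a binary number, and computation happens through which signals arrive first or last. The sketch below is only a software model of that temporal encoding, with invented values; the race trees in the paper are built in hardware.

```python
# Tiny software model of race-logic-style primitives, where a value is encoded as
# the arrival time of an edge rather than as a binary number. Real race trees are
# implemented in hardware; this only conveys the temporal encoding.
INF = float("inf")           # an edge that never arrives

def first_arrival(*times):   # behaves like OR: fires when the earliest input fires
    return min(times)

def last_arrival(*times):    # behaves like AND: fires once all inputs have fired
    return max(times)

def delay(t, d):             # adds a constant: a fixed delay element
    return t + d

# A threshold test "x < c" can be modelled as racing x against a reference edge
# injected at time c: whichever arrives first wins.
def less_than(x_time, threshold_time):
    return x_time < threshold_time

x = 3                                      # feature value encoded as an edge at t=3
print(less_than(x, 5))                     # True: x's edge wins the race
print(last_arrival(delay(x, 2), 4))        # composite timing expression -> 5
```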

CheriABI: enforcing valid pointer provenance and minimizing pointer privilege in the POSIX C run-time environment

CheriABI: enforcing valid pointer provenance and minimizing pointer privilege in the POSIX C run-time environment Davis et al., ASPLOS’19

Last week we saw the benefits of rethinking memory and pointer models at the hardware level when it came to object storage and compression (Zippads). CHERI also rethinks the way that pointers and memory work, but the goal here is memory protection. The scope of the work stands out as particularly impressive:

We have adapted a complete C, C++, and assembly-language software stack, including the open source FreeBSD OS (nearly 800 UNIX programs and more than 200 libraries including OpenSSH, OpenSSL, and bsnmpd) and PostgreSQL database, to employ ubiquitous capability-based pointer and virtual-address protection.

The protections are hardware implemented and cannot be forged in software. The process model, user-kernel interactions, dynamic linking, and memory management concerns are all in scope, and the protection spans the OS/DBMS boundary.
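
As a mental model only (the whole point is that capabilities are enforced and made unforgeable by the hardware, which no software object can emulate), a capability can be thought of as a pointer that carries its bounds and permissions and is checked on every access. The class below is an invented illustration, not CHERI's actual representation.

```python
# Mental model only: a capability behaves like a pointer that carries bounds and
# permissions checked on every access. In CHERI this is enforced by the hardware;
# a Python object can merely illustrate the shape of the checks.
class CapabilityError(Exception):
    pass

class Capability:
    def __init__(self, memory: bytearray, base: int, length: int, perms: set):
        self._mem, self.base, self.length, self.perms = memory, base, length, perms

    def load(self, offset: int) -> int:
        self._check(offset, "load")
        return self._mem[self.base + offset]

    def store(self, offset: int, value: int) -> None:
        self._check(offset, "store")
        self._mem[self.base + offset] = value

    def _check(self, offset: int, perm: str) -> None:
        if perm not in self.perms:
            raise CapabilityError(f"missing {perm} permission")
        if not 0 <= offset < self.length:
            raise CapabilityError("out-of-bounds access")

mem = bytearray(64)
cap = Capability(mem, base=16, length=8, perms={"load", "store"})
cap.store(0, 42)
print(cap.load(0))              # 42
try:
    cap.load(9)                 # trapped instead of silently reading past the object
except CapabilityError as e:
    print("trapped:", e)
```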

The basic question here is whether it is practical to support a large-scale C-language software stack with strong pointer-based protection… with only modest changes to existing C code-bases and with reasonable performance cost. We answer this question affirmatively.

That ‘reasonable’ performance cost is a 6.8% slowdown, significantly better than e. Continue reading

Compress objects, not cache lines: an object-based compressed memory hierarchy

Compress objects, not cache lines: an object-based compressed memory hierarchy Tsai & Sanchez, ASPLOS’19

Last time out we saw how Google have been able to save millions of dollars through memory compression enabled via zswap. One of the important attributes of their design was easy and rapid deployment across an existing fleet. Today’s paper introduces Zippads, which compared to a state-of-the-art compressed memory hierarchy is able to achieve a 1.63x higher compression ratio and improve performance by 17%. The big idea behind Zippads is simple and elegant, but the ramifications go deep: all the way down to a modified instruction set architecture (ISA)! So while you probably won’t be using Zippads in practice anytime soon, it’s a wonderful example of what’s possible when you’re prepared to take a fresh look at “the way we’ve always done things.”

The big idea

Existing cache and main memory compression techniques compress data in small fixed-size blocks, typically cache lines. Moreover, they use simple compression algorithms that focus on exploiting redundancy within a block. These techniques work well for scientific programs that are dominated by arrays. However, they are ineffective on object-based programs because objects do not fall neatly Continue reading
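
The intuition is easy to demonstrate in software. In the spirit of cross-object compression, and entirely unrelated to the actual Zippads hardware, the toy below diffs a live object against a representative base object of the same type: most bytes match, so storing only the differences is far cheaper than squeezing each fixed-size block in isolation. The object layout is invented.

```python
# Toy illustration (software only) of why compressing whole objects can beat
# compressing fixed-size blocks: objects of the same type tend to share most of
# their bytes with a representative "base" object. The layout here is invented.
import struct

def encode_object(obj_id, flags, score, name):
    # A small fixed-layout "object": header fields plus a padded name field.
    return struct.pack("<IIf16s", obj_id, flags, score, name.encode())

base = encode_object(0, 1, 0.0, "user")            # representative base object
obj  = encode_object(12345, 1, 0.75, "user")       # a typical live object

# Cross-object idea: store only the bytes that differ from the base.
diff = [(i, b) for i, (a, b) in enumerate(zip(base, obj)) if a != b]
print(f"object size  : {len(obj)} bytes")
print(f"diff vs base : {len(diff)} differing bytes -> store ~{2 * len(diff)} bytes")
# A scheme that compresses each fixed-size block independently cannot exploit
# this cross-object redundancy.
```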

Software-defined far memory in warehouse-scale computers

Software-defined far memory in warehouse-scale computers Lagar-Cavilla et al., ASPLOS’19

Memory (DRAM) remains comparatively expensive, while in-memory computing demands are growing rapidly. This makes memory a critical factor in the total cost of ownership (TCO) of large compute clusters, or as Google like to call them “Warehouse-scale computers (WSCs).”

This paper describes a “far memory” system that has been in production deployment at Google since 2016. Far memory sits in-between DRAM and flash and colder in-memory data can be migrated to it:

Our software-defined far memory is significantly cheaper (67% or higher memory cost reduction) at relatively good access speeds (6µs) and allows us to store a significant fraction of infrequently accessed data (on average, 20%), translating to significant TCO savings at warehouse scale.

With a far memory tier in place operators can choose between packing more jobs onto each machine, or reducing the DRAM capacity, both of which lead to TCO reductions. Google were able to bring about a 4-5% reduction in memory TCO (worth millions of dollars!) while having negligible impact on applications.
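
A simplified sketch of the underlying control problem (not Google's actual controller): choose a cold-age threshold so that the expected rate at which demoted pages have to be promoted back stays under a target. The access histogram and target below are invented.

```python
# Simplified sketch of choosing a cold-age threshold: pick the smallest age T such
# that the fraction of accesses that would land on a demoted page (the promotion
# rate) stays under a target. The histogram and target are invented; the production
# controller described in the paper is considerably more sophisticated.

# hypothetical histogram: seconds-since-last-access -> fraction of accesses landing
# on pages of at least that idle age (older pages are touched less often).
ACCESS_FRACTION_BY_AGE = {30: 0.050, 60: 0.020, 120: 0.008, 300: 0.003, 600: 0.001}
PROMOTION_RATE_TARGET = 0.01    # at most 1% of accesses may require a promotion

def choose_cold_age_threshold():
    for age in sorted(ACCESS_FRACTION_BY_AGE):
        # Demoting every page idle for >= age would cause promotions at this rate:
        promotion_rate = sum(f for a, f in ACCESS_FRACTION_BY_AGE.items() if a >= age)
        if promotion_rate <= PROMOTION_RATE_TARGET:
            return age, promotion_rate
    return None, None

threshold, rate = choose_cold_age_threshold()
print(f"demote pages idle for >= {threshold}s (expected promotion rate {rate:.3%})")
```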

In introducing far memory Google faced a number of challenges: workloads are very diverse and change all the time, both in job Continue reading
