Archive

Category Archives for "The Morning Paper"

Beyond data and model parallelism for deep neural networks


Beyond data and model parallelism for deep neural networks Jia et al., SysML’2019

I’m guessing the authors of this paper were spared some of the XML excesses of the late nineties and early noughties, since they have no qualms putting SOAP at the core of their work! To me that means the “simple” object access protocol, but not here:

We introduce SOAP, a more comprehensive search space of parallelization strategies for DNNs that includes strategies to parallelize a DNN in the Sample, Operator, Attribute, and Parameter dimensions.

The goal here is to reduce the training times of DNNs by finding efficient parallel execution strategies, and even including its search time, FlexFlow is able to increase training throughput by up to 3.3x compared to state-of-the-art approaches.

There are two key ideas behind FlexFlow. The first is to expand the set of possible solutions (and hence also the search space!) in the hope of covering more interesting potential solutions. The second is an efficient execution simulator that makes searching that space possible by giving a quick evaluation of the potential performance of a given parallelisation strategy. Combine those with an off-the-shelf Metropolis-Hastings MCMC search strategy and Bob’s your uncle.
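
The search loop itself is nothing exotic. Here is a minimal Python sketch of Metropolis-Hastings over parallelisation strategies, purely for illustration: simulate_cost stands in for FlexFlow's execution simulator and random_neighbour for its strategy-mutation step (e.g., re-partitioning one operator along a different SOAP dimension) — neither is the paper's actual API.

```python
import math
import random

def mcmc_search(initial_strategy, simulate_cost, random_neighbour,
                beta=0.5, iterations=10_000):
    """Metropolis-Hastings search over parallelisation strategies.

    simulate_cost and random_neighbour are assumed callables standing in
    for FlexFlow's execution simulator and its strategy-mutation step.
    """
    current = best = initial_strategy
    current_cost = best_cost = simulate_cost(current)
    for _ in range(iterations):
        candidate = random_neighbour(current)
        cost = simulate_cost(candidate)
        # Always accept improvements; accept regressions with a probability
        # that decays exponentially in how much worse they are.
        if cost < current_cost or random.random() < math.exp(-beta * (cost - current_cost)):
            current, current_cost = candidate, cost
            if cost < best_cost:
                best, best_cost = candidate, cost
    return best, best_cost
```

The beta parameter controls how willing the search is to accept temporarily worse strategies (which is what lets it escape local minima); the cheap simulator is what makes the thousands of cost evaluations affordable.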

Continue reading

PyTorch-BigGraph: a large-scale graph embedding system

PyTorch-BigGraph: a large-scale graph embedding system Lerer et al., SysML’19

We looked at graph neural networks earlier this year, which operate directly over a graph structure. Via graph autoencoders or other means, another approach is to learn embeddings for the nodes in the graph, and then use these embeddings as inputs into a (regular) neural network:

Working with graph data directly is difficult, so a common technique is to use graph embedding methods to create vector representations for each node so that distances between these vectors predict the occurrence of edges in the graph.

When you’re Facebook, the challenge in learning embeddings is that the graph is big: over two billion user nodes, and over a trillion edges. Alibaba’s graph has more than one billion users and two billion items; Pinterest’s graph has more than 2 billion entities and over 17 billion edges. At this scale we have to find a way to divide-and-conquer. We’ll need to find some parallelism to embed graphs with trillions of edges in reasonable time, and a way of partitioning the problem so that we don’t need all of the embeddings in memory at each node (‘many standard methods exceed the memory Continue reading
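
To make the embedding idea concrete before the scaling tricks, here is a toy numpy sketch (decidedly not PyTorch-BigGraph's API or training regime) of margin-based training with negative sampling: endpoints of real edges are pulled together, randomly sampled non-edges are pushed apart.

```python
import numpy as np

def train_embeddings(edges, num_nodes, dim=64, epochs=5, lr=0.05, margin=1.0, rng=None):
    """Toy margin-based embedding trainer: pull endpoints of real edges
    together, push randomly sampled (negative) pairs apart."""
    rng = rng or np.random.default_rng(0)
    emb = rng.normal(scale=0.1, size=(num_nodes, dim))
    for _ in range(epochs):
        for u, v in edges:
            neg = rng.integers(num_nodes)          # negative sample
            pos_d = emb[u] - emb[v]
            neg_d = emb[u] - emb[neg]
            # Hinge condition: a true edge should be closer (by the margin)
            # than a random non-edge; only update when it is violated.
            if np.dot(pos_d, pos_d) + margin > np.dot(neg_d, neg_d):
                emb[u] -= lr * (pos_d - neg_d)
                emb[v] += lr * pos_d
                emb[neg] -= lr * neg_d
    return emb
```

The point of the sketch is the shape of the problem: the trainable state is one vector per node, which is exactly what blows up to terabytes at Facebook scale and motivates the partitioning scheme.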

Towards federated learning at scale: system design

Towards federated learning at scale: system design Bonawitz et al., SysML 2019

This is a high level paper describing Google’s production system for federated learning. One of the most interesting things to me here is simply to know that Google are working on this, have a first version in production working with tens of millions of devices, and see significant future expansion for the technology (‘we anticipate uses where the number of devices reaches billions’).

So what exactly is federated learning?

Federated Learning (FL) is a distributed machine learning approach which enables training on a large corpus of decentralized data residing on devices like mobile phones. FL is one instance of the more general approach of “bringing the code to the data, instead of the data to the code” and addresses the fundamental problems of privacy, ownership, and locality of data.

Note that this is beyond using an on-device model to make predictions based on local data. Here we’re actually training the model in a distributed fashion, using data collected on the devices, without the data ever leaving those devices. The FL system contains a number of privacy-enhancing building blocks, but the privacy guarantees of any end-to-end system Continue reading
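
As a mental model only — this is not Google's protocol, and it omits secure aggregation, pace steering, and all the systems machinery the paper is actually about — one round of federated averaging looks roughly like this in Python, where local_update is an assumed callable that trains on a single device's data and returns new model weights:

```python
import numpy as np

def federated_round(global_weights, device_datasets, local_update,
                    rng=None, sample_size=100):
    """One round of federated averaging (a sketch): a sample of devices
    trains locally on its own data, and only the resulting model updates
    -- never the raw data -- come back to be weight-averaged."""
    rng = rng or np.random.default_rng(0)
    selected = rng.choice(len(device_datasets),
                          size=min(sample_size, len(device_datasets)),
                          replace=False)
    updates, weights = [], []
    for i in selected:
        data = device_datasets[i]                  # stays on the device
        updates.append(local_update(global_weights, data))
        weights.append(len(data))
    weights = np.asarray(weights, dtype=float)
    weights /= weights.sum()
    # Weighted average of locally trained models becomes the new global model.
    return sum(w * u for w, u in zip(weights, updates))
```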

Data validation for machine learning

Data validation for machine learning Breck et al., SysML’19

Last time out we looked at continuous integration testing of machine learning models, but arguably even more important than the model is the data. Garbage in, garbage out.

In this paper we focus on the problem of validating the input data fed to ML pipelines. The importance of this problem is hard to overstate, especially for production pipelines. Irrespective of the ML algorithms used, data errors can adversely affect the quality of the generated model.

Breck et al. describe for us the data validation pipeline deployed in production at Google, “used by hundreds of product teams to continuously monitor and validate several petabytes of production data per day.” That’s trillions of training and serving examples per day, across more than 700 machine learning pipelines. More than enough to have accumulated some hard-won experience on what can go wrong and the kinds of safeguards it is useful to have in place!
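
For intuition only — this is a toy, not the paper's schema language or the TensorFlow Data Validation API — the core of schema-driven validation can be sketched in a few lines of Python: declare what each feature should look like, and flag every example that disagrees.

```python
def validate_batch(batch, schema):
    """Check a batch of training examples against a simple schema:
    expected features present, types match, and values in range.
    schema maps feature name -> (type, (min, max) range or allowed set)."""
    errors = []
    for i, example in enumerate(batch):
        for feature, (expected_type, domain) in schema.items():
            if feature not in example:
                errors.append(f"example {i}: missing feature '{feature}'")
                continue
            value = example[feature]
            if not isinstance(value, expected_type):
                errors.append(f"example {i}: '{feature}' has type {type(value).__name__}")
            elif isinstance(domain, tuple):
                lo, hi = domain
                if not (lo <= value <= hi):
                    errors.append(f"example {i}: '{feature}'={value} outside [{lo}, {hi}]")
            elif value not in domain:
                errors.append(f"example {i}: '{feature}'={value} not in allowed set")
    return errors

# e.g. schema = {"age": (int, (0, 130)), "country": (str, {"US", "UK", "DE"})}
```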

What could possibly go wrong?

The motivating example is based on an actual production outage at Google, and demonstrates a couple of the trickier issues: feedback loops caused by training on corrupted data, and distance between data Continue reading

Continuous integration of machine learning models with ease.ml/ci

Continuous integration of machine learning models with ease.ml/ci: towards a rigorous yet practical treatment Renggli et al., SysML’19

Developing machine learning models is no different from developing traditional software, in the sense that it is also a full life cycle involving design, implementation, tuning, testing, and deployment. As machine learning models are used in more task-critical applications and are more tightly integrated with traditional software stacks, it becomes increasingly important for the ML development life cycle also to be managed following systematic, rigorous engineering discipline.

I didn’t find this an easy paper to follow at all points, but the question it addresses is certainly interesting: what does a continuous integration testing environment look like for a machine learning model? ease.ml/ci is a CI system for machine learning, and it has to take into account two main differences from a regular CI test suite:

  1. Machine learning is inherently probabilistic, so test conditions are evaluated with respect to an (ε, δ)-reliability requirement, where (1 − δ) is the probability of a valid test and ε is the error tolerance (a back-of-envelope sketch of the sample-size implication follows this list).
  2. By getting pass/fail feedback from the CI server, the model can adapt to the test set used in the CI environment, leading to it Continue reading
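
That (ε, δ) requirement has a very practical consequence: test-set size. A back-of-envelope Hoeffding bound (my sketch, not the paper's tighter analysis, which also has to deal with the adaptivity problem above) already shows why the authors care so much about sample efficiency:

```python
import math

def test_set_size(epsilon, delta):
    """Hoeffding bound: number of labelled examples needed so that the
    empirical accuracy deviates from the true accuracy by more than
    epsilon with probability at most delta."""
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))

# e.g. estimating accuracy to within +/-1% with 99.9% confidence:
print(test_set_size(0.01, 0.001))   # -> 38005 examples
```

Since the bound scales as 1/ε², tightening the tolerance from 1% to 0.5% roughly quadruples the number of labelled examples you need for every CI run.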

A case for lease-based, utilitarian resource management on mobile devices

A case for lease-based, utilitarian resource management on mobile devices Hu et al., ASPLOS’19

I’ve chosen another energy-related paper to end the week, addressing a problem many people can relate to: apps that drain your battery. LeaseOS borrows the concept of a lease from distributed systems, but with a rather nice twist, and is able to reduce power wastage by 92% with no disruption to application experience and no changes required to the apps themselves.

So about that twist. LeaseOS injects a transparent proxy between an app and a power-hungry OS resource. The app thinks it has control of the resource until it releases it, but under the covers the proxy is given a lease. In a traditional leasing scheme, it’s up to the borrower to request a lease extension. But here half the problem is that apps are requesting expensive resources they don’t really need. So instead the OS monitors how wisely the leased resource is being used. If an app is making good, legitimate use of the resource then the lease will be transparently extended. If it isn’t, it loses the underlying resource. How you tell whether or not an app is being a wise steward of Continue reading
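
As a sketch of the renewal idea only — the utility measure and threshold below are invented for illustration, not LeaseOS's actual criteria — the OS-side lease decision might look like this in Python:

```python
import time

class Lease:
    """Sketch of utility-driven lease renewal in the LeaseOS spirit.
    The policy details here are illustrative, not the paper's."""
    def __init__(self, duration_s, min_utility=0.1):
        self.duration_s = duration_s
        self.min_utility = min_utility
        self.expires = time.monotonic() + duration_s

    def on_expiry(self, work_done, energy_used):
        """The OS, not the app, decides whether to renew: if the resource
        produced enough useful work per unit of energy, renew transparently;
        otherwise revoke the underlying resource. The app keeps its handle
        via the proxy and never sees the lease machinery."""
        utility = work_done / max(energy_used, 1e-9)
        if utility >= self.min_utility:
            self.expires = time.monotonic() + self.duration_s
            return "renewed"
        return "revoked"
```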

Boosted race trees for low energy classification

Boosted race trees for low energy classification Tzimpragos et al., ASPLOS’19

We don’t talk about energy as often as we probably should on this blog, but it’s certainly true that our data centres and various IT systems consume an awful lot of it. So it’s interesting to see a paper using nano-Joules per prediction as an evaluation metric. The goal is to produce a low-energy hardware classifier for embedded applications doing local processing of sensor data. To get there, the authors question a whole bunch of received wisdom, beginning with this: do we really need to convert the analog sensor data into a digital signal?! Here’s another fun one: what if instead of being something you worked hard to avoid, you had to build your whole application based on the outcomes of data races??!

Typically, a sensor gathers analog information from the physical world and then converts it into a conventional digital signal… While this binary-represented integer is perfectly efficient for storage as bits in memory and for typical general purpose computing operations, it is unclear that this is the most efficient for our target application. One such possible representation is pure analog signalling.

Of course analog signalling comes Continue reading

CheriABI: enforcing valid pointer provenance and minimizing pointer privilege in the POSIX C run-time environment

CheriABI: enforcing valid pointer provenance and minimizing pointer privilege in the POSIX C run-time environment Davis et al., ASPLOS’19

Last week we saw the benefits of rethinking memory and pointer models at the hardware level when it came to object storage and compression (Zippads). CHERI also rethinks the way that pointers and memory work, but the goal here is memory protection. The scope of the work stands out as particularly impressive:

We have adapted a complete C, C++, and assembly-language software stack, including the open source FreeBSD OS (nearly 800 UNIX programs and more than 200 libraries including OpenSSH, OpenSSL, and bsnmpd) and PostgreSQL database, to employ ubiquitous capability-based pointer and virtual-address protection.

The protections are hardware implemented and cannot be forged in software. The process model, user-kernel interactions, dynamic linking, and memory management concerns are all in scope, and the protection spans the OS/DBMS boundary.

The basic question here is whether it is practical to support a large-scale C-language software stack with strong pointer-based protection… with only modest changes to existing C code-bases and with reasonable performance cost. We answer this question affirmatively.

That ‘reasonable’ performance cost is a 6.8% slowdown, significantly better than e. Continue reading
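
To see why in-address-space protection of this kind works, it helps to have a software model of a capability, however crude. The Python sketch below is my illustration (nothing like CHERI's actual 128-bit hardware encoding) of the two invariants the entry leans on: every access is bounds- and permission-checked, and a derived capability can only ever shrink its rights.

```python
from dataclasses import dataclass

MEMORY = bytearray(1024)              # stand-in for the address space

@dataclass(frozen=True)
class Capability:
    base: int
    length: int
    perms: frozenset                   # e.g. frozenset({"load", "store"})

    def restrict(self, offset, length, perms):
        """Derive a narrower capability; rights can never be widened."""
        if offset < 0 or offset + length > self.length or not perms <= self.perms:
            raise ValueError("cannot widen a capability")
        return Capability(self.base + offset, length, frozenset(perms))

def load(cap, offset):
    """Every dereference is checked (in CHERI this is a hardware trap)."""
    if "load" not in cap.perms or not 0 <= offset < cap.length:
        raise MemoryError("capability violation")
    return MEMORY[cap.base + offset]
```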

Compress objects, not cache lines: an object-based compressed memory hierarchy

Compress objects, not cache lines: an object-based compressed memory hierarchy Tsai & Sanchez, ASPLOS’19

Last time out we saw how Google have been able to save millions of dollars through memory compression enabled via zswap. One of the important attributes of their design was easy and rapid deployment across an existing fleet. Today’s paper introduces Zippads, which compared to a state of the art compressed memory hierarchy is able to achieve a 1.63x higher compression ratio and improve performance by 17%. The big idea behind Zippads is simple and elegant, but the ramifications go deep: all the way down to a modified instruction set architecture (ISA)! So while you probably won’t be using Zippads in practice anytime soon, it’s a wonderful example of what’s possible when you’re prepared to take a fresh look at “the way we’ve always done things.”

The big idea

Existing cache and main memory compression techniques compress data in small fixed-size blocks, typically cache lines. Moreover, they use simple compression algorithms that focus on exploiting redundancy within a block. These techniques work well for scientific programs that are dominated by arrays. However, they are ineffective on object-based programs because objects do not fall neatly Continue reading

Software-defined far memory in warehouse scale computers

Software-defined far memory in warehouse-scale computers Lagar-Cavilla et al., ASPLOS’19

Memory (DRAM) remains comparatively expensive, while in-memory computing demands are growing rapidly. This makes memory a critical factor in the total cost of ownership (TCO) of large compute clusters, or as Google like to call them “Warehouse-scale computers (WSCs).”

This paper describes a “far memory” system that has been in production deployment at Google since 2016. Far memory sits in between DRAM and flash, and colder in-memory data can be migrated to it:

Our software-defined far memory is significantly cheaper (67% or higher memory cost reduction) at relatively good access speeds (6µs) and allows us to store a significant fraction of infrequently accessed data (on average, 20%), translating to significant TCO savings at warehouse scale.

With a far memory tier in place operators can choose between packing more jobs onto each machine, or reducing the DRAM capacity, both of which lead to TCO reductions. Google were able to bring about a 4-5% reduction in memory TCO (worth millions of dollars!) while having negligible impact on applications.
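
The quoted numbers invite a quick back-of-envelope check. The sketch below is mine, with illustrative figures: the 67% cost reduction corresponds to roughly 3x zswap compression, and it ignores compression CPU cost, promotion traffic, and the data that cannot safely be migrated — which is part of why Google's net 4-5% figure is lower than this simple upper bound.

```python
def memory_tco_saving(cold_fraction, far_cost_ratio, compression_ratio):
    """Back-of-envelope: fraction of memory spend saved by moving cold data
    to a compressed far-memory tier. Illustrative only -- ignores CPU cost
    of compression, promotion traffic, and fleet heterogeneity."""
    # Cold data now occupies (1 / compression_ratio) of its former size,
    # priced at far_cost_ratio of DRAM cost per byte.
    far_cost = cold_fraction * far_cost_ratio / compression_ratio
    return cold_fraction - far_cost

# e.g. 20% of data is cold, far memory is zswap-backed DRAM (same $/byte)
# but compresses ~3x: saving ≈ 0.20 - 0.20/3 ≈ 13% of memory spend,
# before subtracting the overheads noted above.
print(memory_tco_saving(0.20, 1.0, 3.0))
```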

In introducing far memory Google faced a number of challenges: workloads are very diverse and change all the time, both in job Continue reading

RPCValet: NI-driven tail-aware balancing of µs-scale RPCs

RPCValet: NI-driven tail-aware balancing of µs-scale RPCs Daglis et al., ASPLOS’19

Last week we learned about the increased tail-latency sensitivity of microservices-based applications with high RPC fan-outs. Seer uses estimates of queue depths to mitigate latency spikes on the order of 10-100ms, in conjunction with a cluster manager. Today’s paper choice, RPCValet, operates at latencies 3 orders of magnitude lower, targeting reduction in tail latency for services that themselves have service times on the order of a small number of µs (e.g., the average service time for memcached is approximately 2µs).

The net result of rapid advancements in the networking world is that inter-tier communications latency will approach the fundamental lower bound of speed-of-light propagation in the foreseeable future. The focus of optimization hence will completely shift to efficiently handling RPCs at the endpoints as soon as they are delivered from the network.

Furthermore, the evaluation shows that “RPCValet leaves no significant room for improvement” when compared against the theoretical ideal (it comes within 3-15%). So what we have here is a glimpse of the limits for low-latency RPCs under load. When it’s no longer physically possible to go meaningfully faster, further application-level performance Continue reading

Understanding real-world concurrency bugs in Go

Understanding real-world concurrency bugs in Go Tu, Liu et al., ASPLOS’19

The design of a programming (or data) model not only makes certain problems easier (or harder) to solve, but also makes certain classes of bugs easier (or harder) to create, detect, and subsequently fix. Today’s paper choice studies concurrency mechanisms in Go. Before we dive in, it might be interesting to pause for a moment and consider your own beliefs about Go, which may well include some of the following:

  • Go was explicitly designed to make concurrent programming easier and less error-prone
  • Go makes concurrent programming easier and less error-prone
  • Go programs make heavy use of message passing via channels, which is less error prone than shared memory synchronisation
  • Go programs have fewer concurrency bugs
  • Go’s built-in deadlock and data race detectors will catch any (most?) bugs you do let slip into your code

The first of those statements is true. For the remaining statements, you can use the data from this research to re-evaluate how strongly you want to hold those opinions…

We perform the first systematic study on concurrency bugs in real Go programs. We studied six popular Go software [projects] including Docker, Kubernetes, and Continue reading

Seer: leveraging big data to navigate the complexity of performance debugging in cloud microservices

Seer: leveraging big data to navigate the complexity of performance debugging in cloud microservices Gan et al., ASPLOS’19

Last time around we looked at the DeathStarBench suite of microservices-based benchmark applications and learned that microservices systems can be especially latency sensitive, and that hotspots can propagate through a microservices architecture in interesting ways. Seer is an online system that observes the behaviour of cloud applications (using the DeathStarBench microservices for the evaluation) and predicts when QoS violations may be about to occur. By cooperating with a cluster manager it can then take proactive steps to avoid a QoS violation occurring in practice.

We show that Seer correctly anticipates QoS violations 91% of the time, and avoids the QoS violation to begin with in 84% of cases. Finally, we show that Seer can identify application level design bugs, and provide insights on how to better architect microservices to achieve predictable performance.

Seer uses a lightweight RPC-level tracing system to collect request traces and aggregate them in a Cassandra database. A DNN model is trained to recognise patterns in space and time that lead to QoS violations. This model makes predictions at runtime based on real-time streaming trace inputs. When a Continue reading

An open-source benchmark suite for microservices and their hardware-software implications for cloud & edge systems

An open-source benchmark suite for microservices and their hardware-software implications for cloud & edge systems Gan et al., ASPLOS’19

Microservices are well known for producing ‘death star’ interaction diagrams, in which each point on the circumference represents an individual service and the lines between them represent interactions.

Systems built with lots of microservices have different operational characteristics to those built from a small number of monoliths, and we’d like to study and better understand those differences. That’s where ‘DeathStarBench’ comes in: a suite of five different microservices-based applications (one of which, a drone coordination platform called Swarm, has two variations – one doing most compute in the cloud, and one offloading as much as possible to the edge). It’s a pretty impressive effort to pull together such a suite and make it available in open source (not yet available as I write this), and I’m sure that explains much of the long list of 24 authors on this paper.

The suite is built using popular OSS applications and representative technologies, deliberately using a mix of languages (C/C++, Java, Javascript, node.js, Python, Ruby, Go, Scala, …) and both RESTful and RPC (Thrift, gRPC) style service interfaces. There’s a nice Continue reading

Distributed consensus revised – Part III

Distributed consensus revised (part III) Howard, PhD thesis

With all the ground work laid, the second half of the thesis progressively generalises the Paxos algorithm: weakening the quorum intersection requirements; reusing intersections to allow decisions to be reached with fewer participants; weakening the value selection rules; and sharing phases to take best advantage of the generalisation.

The result of this thesis is a family of approaches to achieving distributed consensus, which generalise over the most popular existing algorithms such as Paxos and Fast Paxos.

Quorum intersection revised

Classic Paxos requires all quorums to intersect, but this turns out to be a stronger condition than is actually required to guarantee safety and progress.

Our first finding is that it is only necessary for phase one quorums and phase two quorums to intersect. There is no need to require that phase one quorums intersect with each other nor that phase two quorums intersect with each other.

This finding (‘revision A’) was also discussed in the Flexible Paxos paper that we’ve covered in a previous edition of The Morning Paper. So long as one quorum member is around to carry the learnings from phase one into phase two, we’re good (the thesis itself Continue reading
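
The intersection requirement is easy to check mechanically. Here is a small Python sketch (my illustration, not from the thesis) that verifies the revised condition — every phase-1 quorum intersects every phase-2 quorum — for both classic majorities and an asymmetric split over four acceptors:

```python
from itertools import combinations

def quorums_intersect(phase1_quorums, phase2_quorums):
    """Flexible-Paxos-style check: every phase-1 quorum must share at least
    one acceptor with every phase-2 quorum. Quorums within the same phase
    need not intersect each other."""
    return all(q1 & q2 for q1 in phase1_quorums for q2 in phase2_quorums)

acceptors = {"a", "b", "c", "d"}

# Classic Paxos majorities: any two 3-of-4 quorums intersect.
majorities = [set(q) for q in combinations(acceptors, 3)]
print(quorums_intersect(majorities, majorities))   # True

# Asymmetric split: phase 1 needs 3 of 4, phase 2 only 2 of 4 -- still safe,
# because any 3-set and any 2-set drawn from the same 4 acceptors overlap.
phase2 = [set(q) for q in combinations(acceptors, 2)]
print(quorums_intersect(majorities, phase2))       # True
```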

Distributed consensus revised – Part II

Distributed consensus revised (part II) Howard, PhD thesis

In today’s post we’re going to be looking at chapter 3 of Dr Howard’s thesis, which is a tour (“systematisation of knowledge”, SoK) of some of the major known revisions to the classic Paxos algorithm.

Negative responses (NACKs)

In classic Paxos acceptors only send replies to proposer messages with an epoch greater than or equal to the acceptor’s last promised epoch (Property 6). The algorithm relies on timeouts to determine when a proposer abandons the current phase and retries with a new epoch number. We can eliminate the timeout delays by adding negative responses, for example no_promise(e) and no_accept(e), to be sent by the acceptor in response to prepare or propose messages with an invalid epoch number. These negative acknowledgements (NACKs) can also include further information such as the acceptor’s last promised epoch and last accepted proposal value. (There’s no point a proposer retrying with a new epoch less than the acceptor’s last promised one, for example.)

NACKs have replaced timeouts as we assume that messages are eventually delivered. We can therefore remove the synchrony assumptions from our progress proof.
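
A minimal sketch of an acceptor extended with NACKs might look like the Python below (the message names follow the thesis's no_promise/no_accept, but the structure is my own simplification):

```python
class Acceptor:
    """Sketch of a classic Paxos acceptor extended with negative responses."""
    def __init__(self):
        self.promised = -1          # last promised epoch
        self.accepted = None        # (epoch, value) last accepted, if any

    def on_prepare(self, epoch):
        if epoch >= self.promised:
            self.promised = epoch
            return ("promise", epoch, self.accepted)
        # Instead of staying silent and forcing the proposer to time out,
        # tell it immediately which epoch it needs to beat.
        return ("no_promise", self.promised)

    def on_propose(self, epoch, value):
        if epoch >= self.promised:
            self.promised = epoch
            self.accepted = (epoch, value)
            return ("accept", epoch)
        return ("no_accept", self.promised)
```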

Bypassing phase two

If a proposer learns during phase one that a value Continue reading

Distributed consensus revised – Part I

Distributed consensus revised Howard, PhD thesis

Welcome back to a new term of The Morning Paper! To kick things off, I’m going to start by taking a look at Dr Howard’s PhD thesis, ‘Distributed consensus revised’. This is obviously longer than a standard paper, so we’ll break things down over a few days. As the title suggests, the topic in hand is distributed consensus:

Single-valued agreement is often overlooked in the literature as already solved or trivial and is seldom considered at length, despite being a vital component in distributed systems which is infamously poorly understood… we undertake an extensive examination of how to achieve consensus over a single value.

What makes this much harder than it might at first appear of course, is the possibility of failures and asynchronous communication. In the face of this, an algorithm for consensus must meet three safety requirements and two progress requirements:

  • Non-triviality: the decided value must have been proposed by a participant (so for example, solutions which always choose a fixed pre-determined value are not acceptable)
  • Safety: if a value has been decided, no other value will be decided
  • Safe learning: if a participant learns a value, it must Continue reading

End of term

We’ve reached the end of term again on The Morning Paper, and I’ll be taking a two week break. The Morning Paper will resume on Tuesday 7th May (since Monday 6th is a public holiday in the UK).

My end of term tradition is to highlight a few of the papers from the term that I especially enjoyed, but this time around I want to let one work stand alone:

You might also enjoy “The Mess We’re In,” and Joe’s seven deadly sins of programming:

  1. Code even you cannot understand a week after you wrote it – no comments
  2. Code with no specifications
  3. Code that is shipped as soon as it runs and before it is beautiful
  4. Code with added features
  5. Code that is very very fast very very very obscure and incorrect
  6. Code that is not beautiful
  7. Code that you wrote without understanding the problem

We’re in an even bigger mess without you Joe. Thank you for everything. RIP.

Keeping master green at scale

Keeping master green at scale Ananthanarayanan et al., EuroSys’19

This paper provides a fascinating look at a key part of Uber’s software delivery machine. With a monorepo, and many thousands of engineers concurrently committing changes, keeping the build green, and keeping commit-to-live latencies low, is a major challenge.

This paper introduces a change management system called SubmitQueue that is responsible for continuous integration of changes into the mainline at scale while always keeping the mainline green.

The challenge: build fails at scale

Each individual submitted change will have passed all local tests, but when you put large numbers of concurrent changes together conflicts can still happen. Finding out what’s gone wrong is a tedious and error-prone task often requiring human intervention. Meanwhile, new features are blocked from rolling out.

So the goal is to keep it green:

…the monorepo mainline needs to remain green at all times. A mainline is called green if all build steps (e.g., compilation, unit tests, UI tests) can successfully execute for every commit point in the history. Keeping the mainline green allows developers to (i) instantly release new features from any commit point in the mainline, (ii) roll back to any Continue reading
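
To make the problem concrete, here is the naive always-green baseline in Python (my sketch, with assumed apply_change and build_is_green helpers): serialise all pending changes and build each one on top of the current mainline before committing. It gives exactly the guarantee we want, but throughput collapses to one build at a time — the bottleneck that SubmitQueue's speculative execution is designed to remove.

```python
def submit_serially(mainline, pending_changes, apply_change, build_is_green):
    """Naive always-green baseline: take pending changes one at a time,
    build each on top of the current mainline, and commit only if green.
    apply_change and build_is_green are assumed helpers, not Uber's API."""
    committed, rejected = [], []
    for change in pending_changes:
        candidate = apply_change(mainline, change)
        if build_is_green(candidate):
            mainline = candidate
            committed.append(change)
        else:
            rejected.append(change)
    return mainline, committed, rejected
```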

Teaching rigorous distributed systems with efficient model checking

Teaching rigorous distributed systems with efficient model checking Michael et al., EuroSys’19

On the surface you might think today’s paper selection an odd pick. It describes the labs environment, DSLabs, developed at the University of Washington to accompany a course in distributed systems. During the ten week course, students implement four different assignments: an exactly-once RPC protocol; a primary-backup system; Paxos; and a scalable, transactional key-value storage system. 175 undergraduates a year currently go through this course. Enabling students to build running performant versions of all of those systems in the time available is one challenge. Testing and grading their solutions is another!

Although we added tests to catch specific issues as we learned of them, we found it difficult to keep up with the diversity of possible student errors.

What I like about the paper is that (a) the DSLabs framework and all assignments are available in open source, and it looks like it would be a lot of fun to play with, and (b) the challenges students have to face in building reliable, testable, distributed systems under time pressure look pretty much like the challenges many practitioners have to face to me! So by focusing Continue reading
