
Compress objects, not cache lines: an object-based compressed memory hierarchy

Compress objects, not cache lines: an object-based compressed memory hierarchy Tsai & Sanchez, ASPLOS’19

Last time out we saw how Google have been able to save millions of dollars through memory compression enabled via zswap. One of the important attributes of their design was easy and rapid deployment across an existing fleet. Today’s paper introduces Zippads, which compared to a state-of-the-art compressed memory hierarchy is able to achieve a 1.63x higher compression ratio and improve performance by 17%. The big idea behind Zippads is simple and elegant, but the ramifications go deep: all the way down to a modified instruction set architecture (ISA)! So while you probably won’t be using Zippads in practice anytime soon, it’s a wonderful example of what’s possible when you’re prepared to take a fresh look at “the way we’ve always done things.”

The big idea

Existing cache and main memory compression techniques compress data in small fixed-size blocks, typically cache lines. Moreover, they use simple compression algorithms that focus on exploiting redundancy within a block. These techniques work well for scientific programs that are dominated by arrays. However, they are ineffective on object-based programs because objects do not fall neatly Continue reading
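To make the critique concrete, here’s a toy sketch (mine, not the paper’s) of the kind of block-based scheme in question: a base+delta compressor that only ever sees one 64-byte cache line at a time. Whatever redundancy exists between objects, or within an object that straddles line boundaries, is invisible to it.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// compressLine attempts a toy base+delta encoding of one 64-byte cache
// line viewed as eight uint64 words: the first word is stored verbatim as
// a base, and each remaining word as a single-byte delta from that base.
// It returns nil when any delta doesn't fit, i.e. the line stays
// uncompressed. Crucially, it can only ever exploit redundancy *within*
// these 64 bytes.
func compressLine(line [64]byte) []byte {
	words := make([]uint64, 8)
	for i := range words {
		words[i] = binary.LittleEndian.Uint64(line[i*8:])
	}
	base := words[0]
	out := make([]byte, 8, 8+7)
	binary.LittleEndian.PutUint64(out, base)
	for _, w := range words[1:] {
		d := int64(w - base) // wraparound makes this a signed delta
		if d < -128 || d > 127 {
			return nil // delta too large: give up on this line
		}
		out = append(out, byte(d))
	}
	return out // 15 bytes instead of 64
}

func main() {
	var line [64]byte
	for i := 0; i < 8; i++ {
		// Pointer-like values clustered around a common base compress well.
		binary.LittleEndian.PutUint64(line[i*8:], uint64(0x1000+i))
	}
	fmt.Println(len(compressLine(line))) // 15
}
```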

Software-defined far memory in warehouse-scale computers

Software-defined far memory in warehouse-scale computers Lagar-Cavilla et al., ASPLOS’19

Memory (DRAM) remains comparatively expensive, while in-memory computing demands are growing rapidly. This makes memory a critical factor in the total cost of ownership (TCO) of large compute clusters, or as Google like to call them, “warehouse-scale computers” (WSCs).

This paper describes a “far memory” system that has been in production deployment at Google since 2016. Far memory sits in between DRAM and flash, and colder in-memory data can be migrated to it:

Our software-defined far memory is significantly cheaper (67% or higher memory cost reduction) at relatively good access speeds (6µs) and allows us to store a significant fraction of infrequently accessed data (on average, 20%), translating to significant TCO savings at warehouse scale.

With a far memory tier in place operators can choose between packing more jobs onto each machine, or reducing the DRAM capacity, both of which lead to TCO reductions. Google were able to bring about a 4-5% reduction in memory TCO (worth millions of dollars!) while having negligible impact on applications.
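The excerpt doesn’t spell out the mechanics, but the core policy question is easy to state: which data is cold enough to demote? Here’s a minimal sketch under my own simplifying assumptions (per-page timestamps and a fixed idle threshold; the production policy is adaptive and far more careful):

```go
package main

import (
	"fmt"
	"time"
)

// page tracks the last access time of a (hypothetical) page of data.
type page struct {
	id         int
	lastAccess time.Time
}

// demotionCandidates returns the pages idle for longer than coldAge: a toy
// version of the cold-data identification that decides what can migrate
// from DRAM to far memory. A fixed threshold is a deliberate
// simplification here.
func demotionCandidates(pages []page, now time.Time, coldAge time.Duration) []int {
	var cold []int
	for _, p := range pages {
		if now.Sub(p.lastAccess) > coldAge {
			cold = append(cold, p.id)
		}
	}
	return cold
}

func main() {
	now := time.Now()
	pages := []page{
		{id: 1, lastAccess: now.Add(-10 * time.Minute)}, // cold
		{id: 2, lastAccess: now.Add(-30 * time.Second)}, // recently touched
	}
	fmt.Println(demotionCandidates(pages, now, 2*time.Minute)) // [1]
}
```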

In introducing far memory Google faced a number of challenges: workloads are very diverse and change all the time, both in job Continue reading

RPCValet: NI-driven tail-aware balancing of µs-scale RPCs

RPCValet: NI-driven tail-aware balancing of µs-scale RPCs Daglis et al., ASPLOS’19

Last week we learned about the increased tail-latency sensitivity of microservices-based applications with high RPC fan-outs. Seer uses estimates of queue depths to mitigate latency spikes on the order of 10-100ms, in conjunction with a cluster manager. Today’s paper choice, RPCValet, operates at latencies three orders of magnitude lower, targeting reduction in tail latency for services that themselves have service times on the order of a small number of µs (e.g., the average service time for memcached is approximately 2µs).

The net result of rapid advancements in the networking world is that inter-tier communications latency will approach the fundamental lower bound of speed-of-light propagation in the foreseeable future. The focus of optimization hence will completely shift to efficiently handling RPCs at the endpoints as soon as they are delivered from the network.

Furthermore, the evaluation shows that “RPCValet leaves no significant room for improvement” when compared against the theoretical ideal (it comes within 3-15%). So what we have here is a glimpse of the limits for low-latency RPCs under load. When it’s no longer physically possible to go meaningfully faster, further application-level performance Continue reading
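The essence of the endpoint problem is queueing policy. As a rough illustration (mine, not RPCValet’s actual NI design), contrast statically hashing requests to per-core queues with single-queue dispatch, which as I read it is the behaviour RPCValet aims to deliver at the network interface without a centralized queue’s overheads:

```go
package main

import "fmt"

// Two toy dispatch policies for incoming RPCs across cores. Statically
// hashing requests to per-core queues means a burst that lands on one
// core queues behind itself even while other cores sit idle; a single
// logical queue hands each request to the first idle core instead, which
// is the gold standard for tail latency.

// hashDispatch statically assigns request req to a core, busy or not.
func hashDispatch(req, cores int) int { return req % cores }

// singleQueueDispatch hands the request to any idle core if one exists.
func singleQueueDispatch(idle []bool) (core int, ok bool) {
	for i, free := range idle {
		if free {
			return i, true
		}
	}
	return 0, false // all cores busy: the request waits in the one shared queue
}

func main() {
	idle := []bool{false, true, false, true}
	if c, ok := singleQueueDispatch(idle); ok {
		fmt.Println("dispatch to core", c) // core 1
	}
	fmt.Println("hash dispatch:", hashDispatch(7, 4)) // core 3, idle or not
}
```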

Understanding real-world concurrency bugs in Go

Understanding real-world concurrency bugs in Go Tu, Liu et al., ASPLOS’19

The design of a programming (or data) model not only makes certain problems easier (or harder) to solve, but also makes certain classes of bugs easier (or harder) to create, detect, and subsequently fix. Today’s paper choice studies concurrency mechanisms in Go. Before we dive in, it might be interesting to pause for a moment and consider your own beliefs about Go, which may well include some of the following:

  • Go was explicitly designed to make concurrent programming easier and less error-prone
  • Go makes concurrent programming easier and less error-prone
  • Go programs make heavy use of message passing via channels, which is less error-prone than shared memory synchronisation
  • Go programs have fewer concurrency bugs
  • Go’s built-in deadlock and data race detectors will catch any (most?) bugs you do let slip into your code

The first of those statements is true. For the remaining statements, you can use the data from this research to re-evaluate how strongly you want to hold those opinions…
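To make that concrete, here’s a reconstruction (mine, not lifted verbatim from the paper) of a message-passing blocking bug pattern of the kind the study finds to be common in real Go code: a goroutine leaked forever on an unbuffered channel send once its caller has timed out and gone away.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// lookup spawns a worker and waits for its result, but gives up after a
// timeout. Because ch is unbuffered, on timeout the worker blocks forever
// on its send: a goroutine leak.
func lookup(ctx context.Context) (string, error) {
	ch := make(chan string) // bug: unbuffered; make(chan string, 1) fixes it
	go func() {
		time.Sleep(50 * time.Millisecond) // simulate slow work
		ch <- "result"                    // blocks forever if the caller timed out
	}()
	select {
	case r := <-ch:
		return r, nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()
	fmt.Println(lookup(ctx)) // context deadline exceeded; the worker leaks
}
```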

We perform the first systematic study on concurrency bugs in real Go programs. We studied six popular Go software [projects] including Docker, Kubernetes, and Continue reading

Seer: leveraging big data to navigate the complexity of performance debugging in cloud microservices

Seer: leveraging big data to navigate the complexity of performance debugging in cloud microservices Gan et al., ASPLOS’19

Last time around we looked at the DeathStarBench suite of microservices-based benchmark applications and learned that microservices systems can be especially latency sensitive, and that hotspots can propagate through a microservices architecture in interesting ways. Seer is an online system that observes the behaviour of cloud applications (using the DeathStarBench microservices for the evaluation) and predicts when QoS violations may be about to occur. By cooperating with a cluster manager it can then take proactive steps to avoid a QoS violation occurring in practice.

We show that Seer correctly anticipates QoS violations 91% of the time, and avoids the QoS violation to begin with in 84% of cases. Finally, we show that Seer can identify application level design bugs, and provide insights on how to better architect microservices to achieve predictable performance.

Seer uses a lightweight RPC-level tracing system to collect request traces and aggregate them in a Cassandra database. A DNN model is trained to recognise patterns in space and time that lead to QoS violations. This model makes predictions at runtime based on real-time streaming trace inputs. When a Continue reading
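As a rough sketch of the shape of the model’s input (illustrative only; the names and shapes are my assumptions, not Seer’s actual schema), think of per-service queue-depth time series flattened into a sliding window:

```go
package main

import "fmt"

// window builds a (services x timesteps) matrix of recent queue depths,
// the kind of space/time snapshot a model could consume. Everything here
// is a toy stand-in for Seer's real trace pipeline.
func window(depths map[string][]int, services []string, steps int) [][]int {
	m := make([][]int, len(services))
	for i, s := range services {
		series := depths[s]
		if len(series) > steps {
			series = series[len(series)-steps:] // keep only the most recent steps
		}
		m[i] = series
	}
	return m
}

func main() {
	depths := map[string][]int{
		"frontend": {1, 2, 2, 3, 8},
		"cache":    {0, 0, 1, 1, 1},
	}
	fmt.Println(window(depths, []string{"frontend", "cache"}, 3))
	// [[2 3 8] [1 1 1]]: a rising frontend queue depth is the kind of
	// pattern the DNN is trained to flag as a looming QoS violation.
}
```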

An open-source benchmark suite for microservices and their hardware-software implications for cloud & edge systems

An open-source benchmark suite for microservices and their hardware-software implications for cloud & edge systems Gan et al., ASPLOS’19

Microservices are well known for producing ‘death star’ interaction diagrams like those shown below, where each point on the circumference represents an individual service, and the lines between them represent interactions.

Systems built with lots of microservices have different operational characteristics to those built from a small number of monoliths, and we’d like to study and better understand those differences. That’s where ‘DeathStarBench’ comes in: a suite of five different microservices-based applications (one of which, a drone coordination platform called Swarm, has two variations – one doing most compute in the cloud, and one offloading as much as possible to the edge). It’s a pretty impressive effort to pull together such a suite and make it available in open source (not yet available as I write this), and I’m sure that explains much of the long list of 24 authors on this paper.

The suite is built using popular OSS applications and representative technologies, deliberately using a mix of languages (C/C++, Java, JavaScript, Node.js, Python, Ruby, Go, Scala, …) and both RESTful and RPC (Thrift, gRPC) style service interfaces. There’s a nice Continue reading

Distributed consensus revised – Part III

Distributed consensus revised (part III) Howard, PhD thesis

With all the groundwork laid, the second half of the thesis progressively generalises the Paxos algorithm: weakening the quorum intersection requirements; reusing intersections to allow decisions to be reached with fewer participants; weakening the value selection rules; and sharing phases to take best advantage of the generalisation.

The result of this thesis is a family of approaches to achieving distributed consensus, which generalise over the most popular existing algorithms such as Paxos and Fast Paxos.

Quorum intersection revised

Classic Paxos requires all quorums to intersect, but this turns out to be a stronger condition than is actually required to guarantee safety and progress.

Our first finding is that it is only necessary for phase one quorums and phase two quorums to intersect. There is no need to require that phase one quorums intersect with each other nor that phase two quorums intersect with each other.

This finding (‘revision A’) was also discussed in the Flexible Paxos paper that we’ve covered in a previous edition of The Morning Paper. So long as one quorum member is around to carry the learnings from phase one into phase two, we’re good (the thesis itself Continue reading
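For the simple count-based case over n acceptors, the revised requirement boils down to a one-line check, sketched here (my illustration):

```go
package main

import "fmt"

// quorumsIntersect checks the revised requirement for simple count-based
// quorums over n acceptors: every phase-one quorum of size q1 shares at
// least one acceptor with every phase-two quorum of size q2 iff
// q1 + q2 > n. Neither q1 nor q2 needs to be a majority on its own.
func quorumsIntersect(n, q1, q2 int) bool {
	return q1+q2 > n
}

func main() {
	// With n=10, paying for large phase-one quorums (8) buys cheap
	// phase-two quorums (3):
	fmt.Println(quorumsIntersect(10, 8, 3)) // true
	fmt.Println(quorumsIntersect(10, 8, 2)) // false: two disjoint quorums fit in 10
}
```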

Distributed consensus revised – Part II

Distributed consensus revised (part II) Howard, PhD thesis

In today’s post we’re going to be looking at chapter 3 of Dr Howard’s thesis, which is a tour (“systematisation of knowledge”, SoK) of some of the major known revisions to the classic Paxos algorithm.

Negative responses (NACKs)

In classic Paxos, acceptors only send replies to proposer messages with an epoch greater than or equal to the acceptor’s last promised epoch (Property 6). The algorithm relies on timeouts to determine when a proposer abandons the current phase and retries with a new epoch number. We can eliminate the timeout delays by adding negative responses, for example no_promise(e) and no_accept(e), to be sent by the acceptor in response to prepare or propose messages with an invalid epoch number. These negative acknowledgements (NACKs) can also include further information such as the acceptor’s last promised epoch and last accepted proposal value. (There’s no point a proposer retrying with a new epoch less than the acceptor’s last promised one, for example.)

NACKs have replaced timeouts as we assume that messages are eventually delivered. We can therefore remove the synchrony assumptions from our progress proof.
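A minimal sketch of phase-one handling with NACKs (the message names follow the text above; the Go structure and field names are my own, not the thesis’s):

```go
package main

import "fmt"

// A toy acceptor for classic Paxos phase one with NACKs.
type acceptor struct {
	promised int // last promised epoch
}

type reply struct {
	ok            bool
	promisedEpoch int // on a NACK, tells the proposer which epoch to beat
}

// prepare handles a phase-one message for epoch e. Rather than staying
// silent on a stale epoch and forcing the proposer to time out, it sends
// an explicit no_promise carrying the last promised epoch.
func (a *acceptor) prepare(e int) reply {
	if e < a.promised {
		return reply{ok: false, promisedEpoch: a.promised} // no_promise(e)
	}
	a.promised = e
	return reply{ok: true, promisedEpoch: e} // promise(e)
}

func main() {
	a := &acceptor{promised: 5}
	fmt.Println(a.prepare(3)) // {false 5}: retry with an epoch greater than 5
	fmt.Println(a.prepare(7)) // {true 7}
}
```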

Bypassing phase two

If a proposer learns during phase one that a value Continue reading

Distributed consensus revised – Part I

Distributed consensus revised Howard, PhD thesis

Welcome back to a new term of The Morning Paper! To kick things off, I’m going to start by taking a look at Dr Howard’s PhD thesis, ‘Distributed consensus revised’. This is obviously longer than a standard paper, so we’ll break things down over a few days. As the title suggests, the topic in hand is distributed consensus:

Single-valued agreement is often overlooked in the literature as already solved or trivial and is seldom considered at length, despite being a vital component in distributed systems which is infamously poorly understood… we undertake an extensive examination of how to achieve consensus over a single value.

What makes this much harder than it might at first appear, of course, is the possibility of failures and asynchronous communication. In the face of this, an algorithm for consensus must meet three safety requirements and two progress requirements:

  • Non-triviality: the decided value must have been proposed by a participant (so for example, solutions which always choose a fixed pre-determined value are not acceptable)
  • Safety: if a value has been decided, no other value will be decided
  • Safe learning: if a participant learns a value, it must Continue reading

End of term

We’ve reached the end of term again on The Morning Paper, and I’ll be taking a two week break. The Morning Paper will resume on Tuesday 7th May (since Monday 6th is a public holiday in the UK).

My end of term tradition is to highlight a few of the papers from the term that I especially enjoyed, but this time around I want to let one work stand alone:

You might also enjoy “The Mess We’re In,” and Joe’s seven deadly sins of programming:

  1. Code even you cannot understand a week after you wrote it – no comments
  2. Code with no specifications
  3. Code that is shipped as soon as it runs and before it is beautiful
  4. Code with added features
  5. Code that is very very fast, very very very obscure, and incorrect
  6. Code that is not beautiful
  7. Code that you wrote without understanding the problem

We’re in an even bigger mess without you Joe. Thank you for everything. RIP.

Keeping master green at scale

Keeping master green at scale Ananthanarayanan et al., EuroSys’19

This paper provides a fascinating look at a key part of Uber’s software delivery machine. With a monorepo, and many thousands of engineers concurrently committing changes, keeping the build green, and keeping commit-to-live latencies low, is a major challenge.

This paper introduces a change management system called SubmitQueue that is responsible for continuous integration of changes into the mainline at scale while always keeping the mainline green.

The challenge: build failures at scale

Each individual submitted change will have passed all local tests, but when you put large numbers of concurrent changes together, conflicts can still happen. Finding out what’s gone wrong is a tedious and error-prone task, often requiring human intervention. Meanwhile, new features are blocked from rolling out.

So the goal is to keep it green:

…the monorepo mainline needs to remain green at all times. A mainline is called green if all build steps (e.g., compilation, unit tests, UI tests) can successfully execute for every commit point in the history. Keeping the mainline green allows developers to (i) instantly release new features from any commit point in the mainline, (ii) roll back to any Continue reading
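The excerpt hasn’t yet revealed how SubmitQueue works, but the strawman it must improve on is easy to state: serialize every change through a single queue, building and testing each one on top of the last accepted commit. A sketch of that baseline (mine, purely for illustration):

```go
package main

import "fmt"

// A strawman change queue: build and test each pending change on top of
// the current mainline head, in strict arrival order. This trivially
// keeps the mainline green, but throughput is capped at one full
// build+test cycle at a time, which is exactly what doesn't scale to
// thousands of engineers on a monorepo.
type change struct{ id int }

// buildAndTest stands in for the real build steps (compilation, unit
// tests, UI tests) run against the change applied on top of head.
func buildAndTest(head int, c change) bool {
	return true // placeholder: assume the change builds cleanly
}

func serialSubmit(head int, pending []change) int {
	for _, c := range pending {
		if buildAndTest(head, c) {
			head = c.id // land the change; it becomes the new head
		}
		// On failure the change is rejected; the mainline stays green either way.
	}
	return head
}

func main() {
	fmt.Println(serialSubmit(100, []change{{101}, {102}, {103}})) // 103
}
```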

Teaching rigorous distributed systems with efficient model checking

Teaching rigorous distributed systems with efficient model checking Michael et al., EuroSys’19

On the surface you might think today’s paper selection an odd pick. It describes the labs environment, DSLabs, developed at the University of Washington to accompany a course in distributed systems. During the ten week course, students implement four different assignments: an exactly-once RPC protocol; a primary-backup system; Paxos; and a scalable, transactional key-value storage system. 175 undergraduates a year currently go through this course. Enabling students to build running performant versions of all of those systems in the time available is one challenge. Testing and grading their solutions is another!

Although we added tests to catch specific issues as we learned of them, we found it difficult to keep up with the diversity of possible student errors.

What I like about the paper is that (a) the DSLabs framework and all assignments are available in open source, and it looks like it would be a lot of fun to play with, and (b) the challenges students have to face in building reliable, testable, distributed systems under time pressure look pretty much like the challenges many practitioners have to face to me! So by focusing Continue reading

Time protection: the missing OS abstraction

Time protection: the missing OS abstraction Ge et al., EuroSys’19

Ever since the prominent emergence of timing-based microarchitectural attacks (e.g. Spectre, Meltdown, and friends) I’ve been wondering what we can do about them. When a side-channel is based on observing improved performance, a solution that removes the improved performance can work, but is clearly undesirable. In today’s paper choice, for which the authors won a best paper award at EuroSys’19 last month, Ge et al., set out a principled basis for protecting against this class of attacks. Just as today’s systems offer memory protection, they call this time protection. The paper sets out what we can do in software given today’s hardware, and along the way also highlights areas where cooperation from hardware will be needed in the future.

Timing channels, and in particular microarchitectural channels, which exploit timing variations due to shared use of caches and other hardware, remain a fundamental OS security challenge that has eluded a comprehensive solution to date… We argue that it is time to take temporal isolation seriously, and make the OS responsible for time protection, the prevention of temporal inference, just as memory protection prevents spatial inference.

Continue reading

Master of web puppets: abusing web browsers for persistent and stealthy computation

Master of web puppets: abusing web browsers for persistent and stealthy computation Papadopoulos et al., NDSS’19

You’ve probably heard about crypto-currency mining and the like in hijacked browsers.

From a security perspective, a fundamental problem of web applications is that by default their publisher is considered as trusted, and thus allowed to run JavaScript code (even from third parties) on the user side without any restrictions… On the positive side JavaScript execution so far has been constrained chronologically to the lifetime of the browser window or tab that rendered the compromised or malicious website.

Not any more! This paper shows how modern browsers with support for Service Workers can be stealthily connected into a botnet, with a connection that persists until the user closes the browser completely: “in contrast to previous approaches for browser hijacking, a key feature of MarioNet is that it remains operational even after the user browses away from the malicious webpage.”

MarioNet building blocks: Service Workers and WebRTC

Service Workers are non-blocking modules that reside in the user’s browser. Once registered they can run in the background without requiring the user to continue browsing on the originating site. In addition, service workers have Continue reading

Don’t trust the locals: investigating the prevalence of persistent client-side cross-site scripting in the wild

Don’t trust the locals: investigating the prevalence of persistent client-side cross-site scripting in the wild Steffens et al., NDSS’19

Does your web application make use of local storage? If so, then like many developers you may well be making the assumption that when you read from local storage, it will only contain the data that you put there. As Steffens et al. show in this paper, that’s a dangerous assumption! The storage aspect of local storage makes possible a particularly nasty form of attack known as a persistent client-side cross-site scripting attack. Such an attack, once it has embedded itself in your browser one time (e.g. that one occasion you quickly had to jump on the coffee shop wifi), continues to work on all subsequent visits to the target site (e.g., once you’re back home on a trusted network).

In an analysis of the top 5000 Alexa domains, 21% of sites that make use of data originating from storage were found to contain vulnerabilities, of which at least 70% were directly exploitable using the models described in this paper.

Our analysis shows that more than 8% of the top 5,000 domains are potentially susceptible to a Continue reading

How bad can it git? Characterizing secret leakage in public GitHub repositories

How bad can it git? Characterizing secret leakage in public GitHub repositories Meli et al., NDSS’19

On the one hand you might say there’s no new news here. We know that developers shouldn’t commit secrets, and we know that secrets leaked to GitHub can be discovered and exploited very quickly. On the other hand, this study goes much deeper, and also provides us with some very actionable information.

…we go far beyond noting that leakage occurs, providing a conservative longitudinal analysis of leakage, as well as analyses of root causes and the limitations of current mitigations.

In my opinion, the best time to catch secrets is before they are ever committed in the first place. A git pre-commit hook using the regular expressions from this paper’s appendix looks like a pretty good investment to me. The pre-commit hook approach is taken by TruffleHog, though as of the time this paper was written, TruffleHog’s secret detection mechanisms were notably inferior (detecting only 25-29%) to those developed in this work (§ VII.D). You might also want to look at git-secrets which does this for AWS keys, and is extensible with additional patterns. For a belt and braces approach, also Continue reading
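As an illustration of the pre-commit idea, here’s a minimal hook in Go using one AWS access key ID pattern of the kind the paper’s appendix catalogues; treat it as a starting point, not a complete scanner:

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"os"
	"os/exec"
	"regexp"
	"strings"
)

// A minimal pre-commit secret scan: installed as .git/hooks/pre-commit,
// it greps the staged diff for key-shaped strings and blocks the commit
// on a match. A real hook would carry the full set of regular
// expressions from the paper's appendix, not just this one.
var awsKeyID = regexp.MustCompile(`AKIA[0-9A-Z]{16}`)

func main() {
	out, err := exec.Command("git", "diff", "--cached", "--unified=0").Output()
	if err != nil {
		fmt.Fprintln(os.Stderr, "pre-commit: git diff failed:", err)
		os.Exit(2)
	}
	sc := bufio.NewScanner(bytes.NewReader(out))
	for sc.Scan() {
		line := sc.Text()
		// Only inspect lines being added by this commit.
		if strings.HasPrefix(line, "+") && awsKeyID.MatchString(line) {
			fmt.Fprintln(os.Stderr, "pre-commit: possible AWS key in staged changes; commit blocked")
			os.Exit(1)
		}
	}
	os.Exit(0)
}
```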

Ginseng: keeping secrets in registers when you distrust the operating system

Ginseng: keeping secrets in registers when you distrust the operating system Yun & Zhong, NDSS’19

Suppose you did go to the extreme length of establishing an unconditional root of trust for your system. Even then, unless every subsequent piece of code you load is also fully trusted (e.g., formally verified), you’re open to post-boot attacks. This is especially true in a context where lots of third-party application code (e.g. apps on a mobile phone) gets loaded.

Many mobile and IoT apps nowadays contain sensitive data, or secrets, such as passwords, learned models, and health information. Such secrets are often protected by encryption in the storage. However, to use a secret, an app must decrypt it and usually store it as cleartext in memory. In doing so, the app assumes that the operating system (OS) is trustworthy. OSes are complex software and have a large attack surface… Increasingly abundant evidence suggests that prudent apps should not trust the OS with their secrets.

Instead of trying to protect absolutely everything, Ginseng assumes that some data matters more than others. It arranges things such that this sensitive data is only ever in the clear in registers Continue reading

Establishing software root of trust unconditionally

Establishing software root of trust unconditionally Gligor & Woo, NDSS’19

The authors won a best paper award for this work at NDSS this year. The main result is quite something, but as you might expect the lines of argument are detailed and not always easy to follow (and certainly not easy to assess critically!) for non-experts like me. I’ll use today’s write-up to convey the big ideas as I understand them. Close reading of the original paper (and its accompanying technical report) will be essential if you then want to dig deeper.

The problem: establishing a root of trust

Root of trust (RoT) establishment on an untrusted system ensures that a system state comprises all and only content chosen by the user, and the user’s code begins execution in that state. All implies that no content is missing, and only that no extra content exists. If a system state is initialized to content that satisfies security invariants and RoT establishment succeeds, a user’s code begins execution in a secure initial state.

That is to say, even in the presence of persistent malware e.g. in the firmware of peripheral controllers, network interface cards, disk and USB controllers, and so on, root of Continue reading

The crux of voice (in)security: a brain study of speaker legitimacy detection

The crux of voice (in)security: a brain study of speaker legitimacy detection Neupane et al., NDSS’19

The key results of this paper are easy to understand, but the implications are going to take us a long time to unravel. Speech morphing (voice morphing) is the process of translating a speaker’s voice to sound like a given impersonation target. This capability is now available off the shelf (this paper uses the CMU Festvox voice converter) and is getting better all the time. Being able to impersonate someone’s voice takes things like social engineering attacks to a whole new level…

…voice imitation is an emerging class of threats, especially given the advancement in speech synthesis technology seen in a variety of contexts that can harm a victim’s reputation and her security/safety. For instance, the attacker could publish the morphed voice samples on social media, impersonate the victim in phone conversations, leave fake voice messages to the victim’s contacts, and even launch man-in-the-middle attacks against end-to-end encryption technologies that require users to verify the voices of the callers, to name a few instances of such attacks.

So voice should sit alongside images and video as a source we can’t trust in our new Continue reading

Calvin: fast distributed transactions for partitioned database systems

Calvin: fast distributed transactions for partitioned database systems Thomson et al., SIGMOD’12

Earlier this week we looked at Amazon’s Aurora. Today it’s the turn of Calvin, which is notably used by FaunaDB (strictly, “FaunaDB uses patent-pending technology inspired by Calvin…”). As the paper title suggests, the goal of Calvin is to put the ACID back into distributed databases. I’ve put these systems back-to-back because they are both vying for attention in the serverless world, but note that really this is an apples-to-oranges comparison: Aurora scales out MySQL and PostgreSQL with a single-writer, multi-reader model, whereas Calvin supports distributed transactions over distributed databases. A closer comparison with Calvin in the AWS world would be DynamoDB, which recently added limited transaction support (single-region tables). Calvin is more commonly compared against Google’s Spanner. See Daniel Abadi on this topic (note that Abadi is one of the authors of the Calvin paper, and an advisor to FaunaDB), and also on “It’s time to move on from two-phase commit.”

Calvin [is] a transaction processing and replication layer designed to transform a generic, non-transactional, un-replicated data store into a fully ACID, consistently replicated distributed database system. Calvin supports horizontal scalability Continue reading
