My children broke up from school this past weekend, which seems as good a reason as any to call this ‘end of term’ for The Morning Paper. I’ll be taking a break until the New Year, topping up my reading lists and getting ready for a whole new crop of papers and discoveries. The Morning Paper will resume on Monday 6th January.
Since term began on the 19th August we’ve looked at 50 different papers, and I had the pleasure of attending VLDB and HPTS in person as well. I learned a ton! I hope you found something you enjoyed in the paper selections as well.
Here’s a small selection of my personal highlights from the term, in case you missed any of them (in the order in which they originally appeared on the blog):
How do committees invent?, Conway, Datamation magazine 1968
With thanks to Chris Frost for recommending this paper – another great example of a case where we all know the law (Conway’s law in this case), but many of us have not actually read the original ideas behind it.
We’re back in 1968, a time when it was taken for granted that before building a system, it was necessary to design it. The systems under discussion are not restricted to computer systems either by the way – one of example of a system is "the public transport network." Designs are produced by people, and the set of people working on a design are part of a design organisation.
The definition of design itself is quite interesting:
That kind of intellectual activity which creates a whole from its diverse parts may be called the design of a system.
When I think about design, I more naturally think about it the other way around: how to decompose the whole into a set of parts that will work together to accomplish the system goals. But of course Conway is right that those parts do have to fit together to produce the intended Continue reading
A tale of two abstractions: the case for object space, Bittman et al., HotStorage 2019.
This is a companion paper to the "persistent problem" piece that we looked at earlier this week, going a little deeper into the object pointer representation choices and the mapping of a virtual object space into physical address spaces.
…software operating on persistent data structures requires "global" pointers that remain valid after a process terminates, while hardware requires that a diverse set of devices all have the same mappings they need for bulk transfers to and from memory, and that they be able to do so for a potentially heterogeneous memory system. Both abstractions must be implemented in a way that is efficient using existing hardware.
In-memory data structures are notable for the rich inter-weaving of pointer references between them. If we take those data structures and make them also be the persistent representation, "then applications need a way to refer to data such that references have the same lifetime as the referenced data." Epheremal virtual addresses don’t cut it as the basis for persistent pointers.
Applications running on BNVM (byte-addressable non-volatile memory) must have a way Continue reading
A persistent problem: managing pointers in NVM Bittman et al., PLOS’19
At the start of November I was privileged to attend HPTS (the High Performance Transaction Systems) conference in Asilomar. If you ever get the chance to go, I highly recommend it. It’s a comparatively small gathering with a great mix of people, and fabulous discussions. A big thank you to everyone that I met there for making me feel so welcome.
On the last morning of the conference Daniel Bittman presented some of the work being done in the context of the Twizzler OS project to explore new programming models for NVM. It’s a really bold project (‘let’s rethink the OS from the ground up’) and generated a lot of lively discussion.
(Byte-addressable non-volatile memory,) NVM will fundamentally change the way hardware interacts, the way operating systems are designed, and the way applications operate on data.
The starting point is a set of three asumptions for an NVM-based programming model:
Benchmarking spreadsheet systems Rahman et al., Preprint
A recent TwThread drew my attention to this pre-print paper. When spreadsheets were originally conceived, data and formula were input by hand and so everything operated at human scale. Increasingly we’re dealing with larger and larger datasets — for example, data imported via csv files — and spreadsheets are creaking. I’m certainly familiar with the sinking feeling on realising I’ve accidentally asked a spreadsheet to open up a file with 10s of thousands of rows, and that my computer is now going to be locked up for an age. Rahman et al. construct a set of benchmarks to try and understand what might be going on under the covers in Microsoft Excel, Google Sheets, and LibreOffice Calc.
Spreadsheets claim to support pretty large datasets these days – e.g. five million cells for Google Sheets, and even more than that for Excel. But in practice, they struggle at sizes well below this.
With increasing data sizes… spreadsheets have started to break down to the point of being unusable, displaying a number of scalability problems. They often freeze during computation, and are unable to import datasets well below the size limits posed by Continue reading
Declarative assembly of web applications from predefined concepts De Rosso et al., Onward! 2019
I chose this paper to challenge my own thinking. I’m not really a fan of low-code / no-code / just drag-and-drop-from-our-catalogue forms of application development. My fear is that all too often it’s like jumping on a motorbike and tearing off at great speed (rapid initial progress), only to ride around a bend and find a brick wall across the road in front of you. That doesn’t normally end well. I’ve seen enough generations of CASE (remember that acronym?), component-based software development, reusable software catalogues etc. to develop a healthy scepticism: lowest-common denominators, awkward or missing round-tripping behaviour, terrible debugging experiences, catalogues full of junk components, inability to accommodate custom behaviours not foreseen by the framework/component developers, limited reuse opportunities in practice compared to theory, and so on.
The thing is, on one level I know that I’m wrong. To start with, there’s Grady Booch’s observation that “the whole history of computer science is one of ever rising levels of abstraction” 1. Then there’s the changing demographic of software building. Heather Miller recently gave a great presentation on this topic, ‘The Continue reading
Efficient lock-free durable sets Zuriel et al., OOPSLA’19
Given non-volatile memory (NVRAM), the naive hope for persistence is that it would be a no-op: what happens in memory, stays in memory. Unfortunately, a very similar set of issues to those concerned with flushing volatile memory to persistent disk exist here too, just at another level. Memory might be durable, but…
…it is expected that caches and registers will remain volatile. Therefore the state of data structures underlying standard algorithms might not be complete in the NVRAM view, and after a crash this view might not be consistent because of missed writes that were in the caches but did not reach the memory. Moreover, for better performance, the processor may change the order in which writes reach the NVRAM, making it difficult for the NVRAM to even reflect a consistent prefix of the computation.
Plus ça change, plus c’est la même chose.
So, we’re going to need to take care that everything we say is committed is truly durable, and that we can recover to a consistent state following a crash. The traditional way to accomplish this is with a write-ahead log. You’ll no doubt be familiar with the phrase Continue reading
TLA+ model checking made symbolic Konnov et al., OOPSLA’19
TLA+ is a formal specification language (Temporal Logic of Actions) particularly well suited to reasoning about distributed algorithms. In addition to the specification language, the TLA+ toolset includes a model checker (TLC) and a theorem prover (TLAPS).
Given the huge state spaces involved in many real-world settings, the TLC model checker can take a long time / a lot of resources to run.
While progress towards proof automation in TLAPS has been made in the last years, writing interactive proofs is still a demanding task. Hence, the users prefer to run TLC for days, rather than writing proofs.
Like many people (?!), I often find myself wishing I had the time (and skills!) to model some of the algorithms in the papers I read and taken them for a spin in a checker. So anything that can help make that a little bit more tractable is interesting to me.
This paper introduces an alternative symbolic model checker for TLA+ called APALACHE:
Unlike TLC, APALACHE translates the underlying transition relation into quantifier-free SMT constraints, which allows us to exploit the power of SMT solvers.
The Continue reading
Mergeable replicated data types – part II Kaki et al., OOPLSA ’19
Last time out we saw how Mergeable Replicated Data Types (MRDTs) use a bijection between the natural domain of a data type and relational sets to define merge semantics between two concurrently modified versions given their lowest common ancestor (LCA). Today we’re picking things up in §4 of the paper, starting with how to derive a merge function for an arbitrary data type.
The basic approach so far has been to take the difference between each version and the LCA, and add those differences to the LCA state. But using an example of a pair data type holding pairs of integers, the authors show that what we’ve been doing so far isn’t quite good enough. We need to merge the pair state, and also the state of the first and second members of the pair.
The pair example demonstrates the need and opportunity to make merges compositional. The specification of such a composite merge function is invariably compositional in terms of the merge specifications of the types involved.
Given a merge function for counters, we can construct a merge function for a pair of counters. Given a Continue reading
Mergeable replicated data types Kaki et al., OOPSLA’19
This paper was published at OOPSLA, but perhaps it’s amongst the distributed systems community that I expect there to be the greatest interest. Mergeable Replicated Data Types (MRDTs) are in the same spirit as CRDTs but with the very interesting property that they compose. Furthermore, a principled approach for deriving MRDTs is provided, allowing a wide range of data-types to be lifted to their mergeable, replicated counterparts with very little effort on the part of the programmer. The paper sets the discussion in the context of geo-replicated distributed systems, but of course the same mechanisms could be equally useful in the context of local-first applications.
There’s a lot of material in this paper’s 29 pages, and to make sure I have the space to properly understand the ideas (the process of summary writing is a critical part of that for me), I’m going to spread my coverage over two posts.
We’ll begin by looking at the context and the high-level idea behind MRDTs, then we’ll investigate how merge specifications can be derived and composed, and how merge implementations can be derived too. Finally we’ll touch on the implementation of MRDTs in Continue reading
PlanAlyzer: assessing threats to the validity of online experiments Tosch et al., OOPSLA’19
It’s easy to make experimental design mistakes that invalidate your online controlled experiments. At an organisation like Facebook (who kindly supplied the corpus of experiments used in this study), the state of art is to have a pool of experts carefully review all experiments. PlanAlyzer acts a bit like a linter for online experiment designs, where those designs are specified in the PlanOut language.
We present the first approach for statically checking the internal validity of online experiments. Our checks are based on well-known problems that arise in experimental design and causal inference… PlanAlyzer checks PlanOut programs for a variety of threats to internal validity, including failures of randomization, treatment assignment, and causal sufficiency.
As well as pointing out any bugs in the experiment design, PlanAlyzer will also output a set of contrasts — comparisons that you can safely make given the design of the experiment. Hopefully the comparison you wanted to make when you set up the experiment is in that set!
PlanOut is a open source framework for online field experiments, developed by and extensively used at Facebook. To quote Continue reading
Local-first software: you own your data, in spite of the cloud Kleppmann et al., Onward! ’19
Watch out! If you start reading this paper you could be lost for hours following all the interesting links and ideas, and end up even more dissatisfied than you already are with the state of software today. You might also be inspired to help work towards a better future. I’m all in :).
On the one-hand we have ‘cloud apps’ which make it easy to access our work from multiple devices and to collaborate online with others (e.g. Google Docs, Trello, …). On the other hand we have good old-fashioned native apps that you install on your operating system (a dying breed? See e.g. Brendan Burns’ recent tweet). Somewhere in the middle, but not-quite perfect, are online (browser-based) apps with offline support.
The primary issue with cloud apps (the SaaS model) is ownership of the data.
Unfortunately, cloud apps are problematic in this regard. Although they let you access your data anywhere, all data access must go via the server, and you can only do the things that the server will let you do. Continue reading
Formal foundations of serverless computing Jangda et al., OOPSLA’19
Jangda et al. won a distinguished paper award at OOPSLA this year for their work on ‘Formal foundations of serverless computing.’ Central to the paper is their observation that the serverless execution environment has a number of unique properties (such as warm starting / reuse of execution environments) that can make it harder to build correct applications. They show the conditions under which a serverless function can safely ignore these peculiarities, and thus become much simpler to reason about. They also introduce a composition language based on arrows for function composition.
The serverless computing abstraction, despite its many advantages, exposes several low-level operational details that make it hard for programmers to write and reason about their code. For example, to reduce latency, serverless platforms try to reuse the same function instance to process multiple requests. However, this behaviour is not transparent and it is easy to write a serverless function that produces incorrect results or leaks confidential data when reused. A related problem is that serverless platforms abruptly terminate function instances when they are idle…
These first problems are related to Continue reading
It’s another networking paper to close out the week (and our coverage of SOSP’19), but whereas [Snap][Snap] looked at traffic routing within the datacenter, Taiji is concerned with routing traffic from the edge to a datacenter. It’s been in production deployment at Facebook for the past four years.
When a user makes a request to
http://www.facebook.com, DNS will route the request to one of dozens of globally deployed edge nodes. Within the edge node, a load balancer (the Edge LB) is responsible for routing requests through to frontend machines in datacenters. The question Taiji addresses is a simple one on the surface: what datacenter should a given request be routed to?
There’s one thing that Taiji doesn’t have to worry about: backbone capacity between the edge nodes and datacenters— this is provisioned in abundance such that it is not a consideration in balancing decisions. However, there are plenty of other things going on that make the decision challenging:
Scaling symbolic evaluation for automated verification of systems code with Serval Nelson et al., SOSP’19
Serval is a framework for developing automated verifiers of systems software. It makes an interesting juxtaposition to the approach Google took with Snap that we looked at last time out. I’m sure that Google engineers do indeed take extreme care when building the low level networking code that powers Google’s datacenters, but their fundamental design point was to enable frequent releases for fast iteration, feedback on their designs, and yes, early detection of problems.
Formal verification is at the other end of the spectrum. In theory it enables you to eliminate whole classes of problems and vulnerabilities entirely (in practice perfection is still hard to come by), and so it can be especially valuable in security sensitive situations. But it comes with a very high price tag:
Writing proofs requires a time investment that is usually measured in person-years, and the size of the proofs can be several times or even more that an order of magnitude larger than that of implementation code.
That’s both very expensive and an incredibly long wait for feedback. To invest in formally modelling something, you really Continue reading
Snap: a microkernel approach to host networking Marty et al., SOSP’19
This paper describes the networking stack, Snap, that has been running in production at Google for the last three years+. It’s been clear for a while that software designed explicitly for the data center environment will increasingly want/need to make different design trade-offs to e.g. general-purpose systems software that you might install on your own machines. But wow, I didn’t think we’d be at the point yet where we’d be abandoning TCP/IP! You need a lot of software engineers and the willingness to rewrite a lot of software to entertain that idea. Enter Google!
I’m jumping ahead a bit here, but the component of Snap which provides the transport and communications stack is called Pony Express. Here are the bombshell paragraphs:
Our datacenter applications seek ever more CPU-efficient and lower-latency communication, which Pony Express delivers. It implements reliability, congestion control, optional ordering, flow control, and execution of remote data access operations. Rather than reimplement TCP/IP or refactor an existing transport, we started Pony Express from scratch to innovate on more efficient interfaces, architecture, and protocol. (Emphasis mine).
and later on “we are seeking to grow Continue reading
It’s been a while since we looked a debugging and troubleshooting on The Morning Paper (here’s a sample of earlier posts on the topic). Today’s paper introduces a root cause of failure detector for those hard-to-pin-down bugs. Whereas most root cause analysis systems start from the failure and work backwards, Kairux starts from the beginning of the program execution and works forwards. It’s based on the simple idea that the root cause is likely to be found where the failing execution deviates from successful executions.
If we model an execution as a totally ordered sequence of instructions, then the root cause can be identified by the first instruction where the failure execution deviates from the non-failure execution that has the longest instruction sequence prefix in common with that of the failure execution.
This point where the failing run deviates from the longest successful instruction sequence prefix is called the inflection point, and the Inflection Point Hypothesis says that the inflection point is likely to be the root cause.
The definition of root cause itself is taken from Continue reading
File systems unfit as distributed storage backends: lessons from 10 years of Ceph evolution Aghayev et al., SOSP’19
Ten years of hard-won lessons packed into just 17 pages (13 if you don’t count the references!) makes this paper extremely good value for your time. It’s also a fabulous example of recognising and challenging implicit assumptions. In this case, the assumption that a distributed storage backend should clearly be layered on top of a local file system. Breaking that assumption allowed Ceph to introduce a new storage backend called BlueStore with much better performance and predictability, and the ability to support the changing storage hardware landscape. In the two years since it’s release, 70% of all Ceph users had switched to running BlueStore in production.
Ceph is a widely-used, open-source distributed file system that followed this convention [of building on top of a local file system] for a decade. Hard lessons that the Ceph team learned using several popular file systems led them to question the fitness of file systems as storage backends. This is not surprising in hindsight.
Sometimes, things that aren’t surprising in hindsight can be the very hardest of things to spot!
An analysis of performance evolution of Linux’s core operations Ren et al., SOSP’19
I was drawn in by the headline results here:
This paper presents an analysis of how Linux’s performance has evolved over the past seven years… To our surprise, the study shows that the performance of many core operations has worsened or fluctuated significantly over the years.
When you get into the details I found it hard to come away with any strongly actionable takeaways though. Perhaps the most interesting lesson/reminder is this: it takes a lot of effort to tune a Linux kernel. For example:
Meanwhile, Linux releases a new kernel every 2-3 months, with between 13,000 and 18,000 commits per release.
Clearly, performance comes at a high cost, and unfortunately, this cost is difficult to get around. Most Linux users cannot afford Continue reading
Optimized risk scores Ustun & Rudin, KDD’17
On Monday we looked at the case for interpretable models, and in Wednesday’s edition of The Morning Paper we looked at CORELS which produces provably optimal rule lists for categorical assessments. Today we’ll be looking at RiskSLIM, which produces risk score models together with a proof of optimality.
A risk score model is a a very simple points-based system designed to be used (and understood by!) humans. Such models are widely used in e.g. medicine and criminal justice. Traditionally they have been built by panels of experts or by combining multiple heuristics . Here’s an example model for the CHADS2 score assessing stroke risk.
Even when you don’t consider interpretability as a goal (which you really should!), “doing the simplest thing which could possibly work” is always a good place to start. The fact that CORELS and RiskSLIM come with an optimality guarantee given the constraints fed to them on model size etc. also means you can make informed decisions about model complexity vs performance trade-offs if a more complex model looks like it may perform better. It’s a refreshing change of mindset to shift from “finding an Continue reading