Narrowing the gap between serverless and its state with storage functions, Zhang et al., SoCC’19
"Narrowing the gap" was runner-up in the SoCC’19 best paper awards. While being motivated by serverless use cases, there’s nothing especially serverless about the key-value store, Shredder, this paper reports on. Shredder’s novelty lies in a new implementation of an old idea. On the mainframe we used to call it function shipping. In databases you probably know it as stored procedures. The advantages of function shipping (as opposed to the data shipping we would normally do in a serverless application) are that (a) you can avoid moving potentially large amounts of data over the network in order to process it, and (b) you might be able to collapse multiple remote calls into one if the function traverses structures that otherwise could not be fetched in a single call.
Shredder is "a low-latency multi-tenant cloud store that allows small units of computation to be performed directly within storage nodes."
Running end-user compute inside the datastore is not without its challenges of course. From an operator perspective it makes it harder to follow the classic cloud-native design in which a global storage Continue reading
Reverb: speculative debugging for web applications, Netravali & Mickens, SoCC’19
This week we’ll be looking at a selection of papers from the 2019 edition of the ACM Symposium of Cloud Computing (SoCC). First up is Reverb, which won a best paper award for its record and replay debugging framework that accommodates speculative edits (i.e., candidate bug-fixes) during replay. In the context of the papers we’ve been looking at recently, and for a constrained environment, Reverb is helping its users to form an accurate mental model of the system state, and to form and evaluate hypotheses in-situ.
Reverb has three features which enable a fundamentally more powerful debugging experience. First, Reverb tracks precise value provenance, allowing a developer to quickly identify the reads and writes to JavaScript state that affected a particular variable’s value. Second, Reverb enables speculative bug fix analysis… Third, Reverb supports wide-area debugging for applications whose server-side components use event-driven architectures.
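To give a flavour of what ‘precise value provenance’ means, here’s a toy sketch of my own (in Python rather than JavaScript, and nothing like Reverb’s actual mechanism): every write records the earlier writes its inputs came from, so you can walk backwards from a suspect value to everything that influenced it.

```python
# Toy illustration of value provenance tracking (not Reverb's mechanism):
# each write remembers which prior writes its inputs came from, giving a
# provenance graph we can walk backwards from a suspicious value.

class ProvenanceLog:
    def __init__(self):
        self.writes = []                       # (seq, name, value, depends_on)

    def write(self, name, value, depends_on=()):
        seq = len(self.writes)
        self.writes.append((seq, name, value, tuple(depends_on)))
        return seq                             # handle for later dependencies

    def explain(self, seq, depth=0):
        """Print the chain of writes that led to the value written at `seq`."""
        s, name, value, deps = self.writes[seq]
        print("  " * depth + f"#{s} {name} = {value!r}")
        for d in deps:
            self.explain(d, depth + 1)

log = ProvenanceLog()
a = log.write("price", 100)
b = log.write("taxRate", 0.2)
c = log.write("total", 120.0, depends_on=(a, b))   # total derived from price, taxRate
log.explain(c)   # walks back to the writes that affected 'total'
```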
Reverb’s goal is to aid in debugging the client-side of JavaScript web applications. These are "pervasively asynchronous and event-driven" which makes it notoriously difficult to figure out what’s going on. See e.g. "Debugging data flows Continue reading
Trade-offs under pressure: heuristics and observations of teams resolving internet service outages, Allspaw, Masters thesis, Lund University 2015
This is part 2 of our look at Allspaw’s 2015 master’s thesis (here’s part 1). Today we’ll be digging into the analysis of an incident that took place at Etsy on December 4th, 2014.
Trade-offs under pressure: heuristics and observations of teams resolving internet service outages, Allspaw, Masters thesis, Lund University, 2015
Following on from the STELLA report, today we’re going back to the first major work to study the human and organisational side of incident management in business-critical Internet services: John Allspaw’s 2015 Masters thesis. The document runs to 87 pages, so I’m going to cover the material across two posts. Today we’ll be looking at the background and literature review sections, which place the activity in a rich context and provide many jumping off points for going deeper in areas of interest to you. In the next post we’ll look at the detailed analysis of how a team at Etsy handled a particular incident on December 4th 2014, to see what we can learn from it.
Perhaps it seems obvious that incident management is hard. But it’s worth recapping some of the reasons why this is the case, and what makes it an area worthy of study.
The operating environment of Internet services contains many of the ingredients necessary for ambiguity and high consequences for mistakes in the diagnosis of and response to an adverse Continue reading
STELLA: report from the SNAFU-catchers workshop on coping with complexity, Woods 2017, Coping with Complexity workshop
“Coping with complexity” is about as good a three-word summary of the systems and software challenges facing us over the next decade as I can imagine. Today’s choice is a report from a 2017 workshop convened with that title, and recommended to me by John Allspaw – thank you John!
The workshop brought together about 20 experts from a variety of different companies to share and analyse the details of operational incidents (and their postmortems) that had taken place at their respective organisations. Six themes emerged from those discussions that sit at the intersection of resilience engineering and IT. These are all very much concerned with the interactions between humans and complex software systems, along the lines we examined in Ironies of Automation and Ten challenges for making automation a ‘team player’ in joint human-agent activity.
There’s a great quote on the very front page of the report that is worth the price of admission on its own:
Woods’ Theorem: As the complexity of a system increases, the accuracy of any single agent’s own model of that system decreases rapidly.
Remember Continue reading
Synthesizing data structure transformations from input-output examples, Feser et al., PLDI’15
The Programmatically Interpretable Reinforcement Learning paper that we looked at last time out contained this passing comment coupled with a link to today’s paper choice:
It is known from prior work that such [functional] languages offer natural advantages in program synthesis.
That certainly caught my interest. The context for the quote is synthesis of programs by machines, but when I’m programming, I’m also engaged in the activity of program synthesis! So a work showing that functional languages have an advantage when machines are doing the synthesising might also contain the basis of an argument for natural advantages to the functional style of programming. I didn’t find that argument here. We can, however, say that this paper shows “functional languages are well suited to program synthesis.”
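To ground what synthesis from input-output examples actually means, here’s a deliberately crude enumerative sketch of my own in Python — nothing like the type-directed, deduction-guided search in the paper, just the problem statement made concrete: search a tiny DSL of list combinators for a program consistent with the given examples.

```python
# A crude enumerative sketch of synthesis from input-output examples:
# search a tiny DSL of list combinators for a pipeline consistent with the
# examples. Purely illustrative; the paper's system uses a far smarter,
# type- and deduction-guided search over a richer language.

from itertools import product

DSL = {
    "map(+1)":      lambda xs: [x + 1 for x in xs],
    "map(*2)":      lambda xs: [x * 2 for x in xs],
    "filter(even)": lambda xs: [x for x in xs if x % 2 == 0],
    "reverse":      lambda xs: list(reversed(xs)),
    "sort":         lambda xs: sorted(xs),
}

def synthesize(examples, max_depth=2):
    """Return the shortest pipeline of DSL operators matching all examples."""
    for depth in range(1, max_depth + 1):
        for names in product(DSL, repeat=depth):
            def run(xs, names=names):
                for n in names:          # operators applied left to right
                    xs = DSL[n](xs)
                return xs
            if all(run(i) == o for i, o in examples):
                return " -> ".join(names)
    return None

# "double every element" fits both examples
print(synthesize([([1, 2, 3], [2, 4, 6]), ([5], [10])]))   # e.g. 'map(*2)'
```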
Never mind, because the ideas in the paper are still very connected to a question I’m fascinated by at the moment: “how will we be developing software systems over this coming decade?”. There are some major themes to be grappled with: system complexity, the consequences of increasing automation and societal integration, privacy, ethics, security, trust (especially in supply chains), interpretability vs black box models, Continue reading
Programmatically interpretable reinforcement learning, Verma et al., ICML 2018
Being able to trust (interpret, verify) a controller learned through reinforcement learning (RL) is one of the key challenges for real-world deployments of RL that we looked at earlier this week. It’s also an essential requirement for agents in human-machine collaborations (i.e., all deployments at some level) as we saw last week. Since reading some of Cynthia Rudin’s work last year I’ve been fascinated with the notion of interpretable models. I believe there are a large set of use cases where an interpretable model should be the default choice. There are so many deployment benefits, even putting aside any ethical or safety concerns.
So how do you make an interpretable model? Today’s paper choice is the third paper we’ve looked at along these lines (following CORELS and RiskSlim), enough for a recognisable pattern to start to emerge. The first step is to define a language — grammar and associated semantics — in which the ultimate model to be deployed will be expressed. For CORELS this consists of simple rule-based expressions, and for RiskSlim it is scoring sheets. For Programmatically Interpretable Reinforcement Learning (PIRL) as we shall Continue reading
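As a sketch of the ‘policy as a program in a small DSL’ idea (my own toy grammar and observation names, not PIRL’s actual language), the policy space is whatever an expression grammar over named observations can express, so anything found by the search is readable by construction:

```python
# Sketch of 'define a grammar, then search for a policy inside it'.
# The grammar, observation names, and semantics here are hypothetical
# illustrations, not PIRL's language or its search procedure.

import random

OBS = ["speed", "track_pos", "angle"]          # hypothetical observation names

def random_policy(depth=2):
    """Sample a random expression tree from a tiny arithmetic grammar."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(OBS + [round(random.uniform(-1, 1), 2)])
    op = random.choice(["+", "-", "*"])
    return (op, random_policy(depth - 1), random_policy(depth - 1))

def evaluate(expr, obs):
    """Semantics: evaluate an expression tree against an observation dict."""
    if isinstance(expr, tuple):
        op, l, r = expr
        a, b = evaluate(l, obs), evaluate(r, obs)
        return a + b if op == "+" else a - b if op == "-" else a * b
    return obs[expr] if isinstance(expr, str) else expr

def pretty(expr):
    """The payoff: every candidate policy is a small, readable program."""
    if isinstance(expr, tuple):
        op, l, r = expr
        return f"({pretty(l)} {op} {pretty(r)})"
    return str(expr)

policy = random_policy()
print(pretty(policy))                                    # human-readable policy
print(evaluate(policy, {"speed": 0.8, "track_pos": -0.1, "angle": 0.05}))
```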
Challenges of real-world reinforcement learning, Dulac-Arnold et al., ICML’19
Last week we looked at some of the challenges inherent in automation and in building systems where humans and software agents collaborate. When we start talking about agents, policies, and modelling the environment, my thoughts naturally turn to reinforcement learning (RL). Today’s paper choice sets out some of the current (additional) challenges we face getting reinforcement learning to work well in many real-world systems.
We consider control systems grounded in the physical world, optimization of software systems, and systems that interact with users such as recommender systems and smart phones. … RL methods have been shown to be effective on a large set of simulated environments, but uptake in real-world problems has been much slower.
Why is this? The authors posit that there’s a meaningful gap between the tightly-controlled and amenable to simulation research settings where many RL systems do well, and the messy realities and constraints of real-world systems. For example, there may be no good simulator available, exploration may be curtailed by strong safety constraints, and feedback cycles for learning may be slow.
This lack of available simulators means learning must be done using Continue reading
Ten challenges for making automation a ‘team player’ in joint human-agent activity, Klein et al., IEEE Computer Nov/Dec 2004
With thanks to Thomas Depierre for the paper suggestion.
Last time out we looked at some of the difficulties inherent in automating control systems. However much we automate, we’re always ultimately dealing with some kind of human/machine collaboration. Today’s choice looks at what it takes for machines to participate productively in collaborations with humans. Written in 2004, the ideas remind me very much of Mark Burgess’ promise theory, which was initially being developed around the same time.
If a group of people (or people and machines) are going to coordinate with each other to achieve a set of shared ends then there are four basic requirements that must be met to underpin their joint activity:
A basic compact is…
… an agreement (often tacit) to facilitate coordination, work toward shared goals, and prevent breakdowns in team coordination. This Compact involves a commitment Continue reading
Ironies of automation, Bainbridge, Automatica, Vol. 19, No. 6, 1983
With thanks to Thomas Depierre for the paper recommendation.
Making predictions is a dangerous game, but as we look forward to the next decade a few things seem certain: increasing automation, increasing system complexity, faster processing, more inter-connectivity, and an even greater human and societal dependence on technology. What could possibly go wrong? Automation is supposed to make our lives easier, but when (not if) it goes wrong it can put us in a very tight spot indeed. Today’s paper choice, ‘Ironies of Automation’, explores these issues. Originally published in this form in 1983, its lessons are just as relevant today as they were then.
The central irony (‘combination of circumstances, the result of which is the direct opposite of what might be expected’) referred to in this paper is that the more we automate, and the more sophisticated we make that automation, the more we become dependent on a highly skilled human operator.
Why do we automate?
The designer’s view of the human operator may be that the operator is unreliable and inefficient, so should be eliminated from the system.
An automated system Continue reading
Welcome to another year of The Morning Paper! Over the holidays I spent some time mapping out a partial conference calendar for the year, and thinking about the kinds of papers I want to be reading. In a typical year, I’ll cover somewhere north of 120 papers on this blog. That’s a tiny drop in the ocean compared to the amount of research published. And then as well as dipping my toes into the new, I also want to make more space for papers that have stood the test of time. Following the Lindy effect, these are the ones most likely to continue giving ten years or more into the future. Where have we come from? Where are we now? And where are we heading? My only firm rule for paper selection is that I must find it interesting. As regular readers of The Morning Paper will know though, my interests are pretty broad and will no doubt take many twists and turns over the course of the year.
Through the course of a year I often have the pleasure of bumping into many readers of The Morning Paper. And very often they tell me apologetically that they don’t always Continue reading
My children broke up from school this past weekend, which seems as good a reason as any to call this ‘end of term’ for The Morning Paper. I’ll be taking a break until the New Year, topping up my reading lists and getting ready for a whole new crop of papers and discoveries. The Morning Paper will resume on Monday 6th January.
Since term began on the 19th August we’ve looked at 50 different papers, and I had the pleasure of attending VLDB and HPTS in person as well. I learned a ton! I hope you found something you enjoyed in the paper selections as well.
Here’s a small selection of my personal highlights from the term, in case you missed any of them (in the order in which they originally appeared on the blog):
How do committees invent?, Conway, Datamation magazine 1968
With thanks to Chris Frost for recommending this paper – another great example of a case where we all know the law (Conway’s law in this case), but many of us have not actually read the original ideas behind it.
We’re back in 1968, a time when it was taken for granted that before building a system, it was necessary to design it. The systems under discussion are not restricted to computer systems either by the way – one of example of a system is "the public transport network." Designs are produced by people, and the set of people working on a design are part of a design organisation.
The definition of design itself is quite interesting:
That kind of intellectual activity which creates a whole from its diverse parts may be called the design of a system.
When I think about design, I more naturally think about it the other way around: how to decompose the whole into a set of parts that will work together to accomplish the system goals. But of course Conway is right that those parts do have to fit together to produce the intended Continue reading
A tale of two abstractions: the case for object space, Bittman et al., HotStorage 2019.
This is a companion paper to the "persistent problem" piece that we looked at earlier this week, going a little deeper into the object pointer representation choices and the mapping of a virtual object space into physical address spaces.
…software operating on persistent data structures requires "global" pointers that remain valid after a process terminates, while hardware requires that a diverse set of devices all have the same mappings they need for bulk transfers to and from memory, and that they be able to do so for a potentially heterogeneous memory system. Both abstractions must be implemented in a way that is efficient using existing hardware.
In-memory data structures are notable for the rich inter-weaving of pointer references between them. If we take those data structures and make them also be the persistent representation, "then applications need a way to refer to data such that references have the same lifetime as the referenced data." Ephemeral virtual addresses don’t cut it as the basis for persistent pointers.
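A conceptual sketch of what such a persistent pointer could look like (my own simplification in Python; Twizzler’s real design is richer than this): references name an object and an offset within it, and only get resolved to a process-local address at the moment of dereference.

```python
# A conceptual sketch of persistent, position-independent pointers:
# a reference names (object id, offset) rather than a virtual address,
# and is resolved lazily against whatever mapping exists in the current
# process. Purely illustrative, not Twizzler's actual pointer format.

from dataclasses import dataclass

@dataclass(frozen=True)
class PersistentPtr:
    object_id: int        # stable across reboots and processes
    offset: int           # location within the object

class ObjectSpace:
    """Maps object ids to this process's (ephemeral) view of each object."""
    def __init__(self):
        self.objects = {}                  # object_id -> bytearray

    def map_object(self, object_id, data):
        self.objects[object_id] = bytearray(data)

    def deref(self, ptr: PersistentPtr) -> int:
        # Late binding: only here do we touch an address valid in this process.
        return self.objects[ptr.object_id][ptr.offset]

space = ObjectSpace()
space.map_object(42, b"hello, persistent world")
p = PersistentPtr(object_id=42, offset=7)      # meaningful across restarts
assert chr(space.deref(p)) == "p"
```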
Applications running on BNVM (byte-addressable non-volatile memory) must have a way Continue reading
A persistent problem: managing pointers in NVM, Bittman et al., PLOS’19
At the start of November I was privileged to attend HPTS (the High Performance Transaction Systems conference) in Asilomar. If you ever get the chance to go, I highly recommend it. It’s a comparatively small gathering with a great mix of people, and fabulous discussions. A big thank you to everyone that I met there for making me feel so welcome.
On the last morning of the conference Daniel Bittman presented some of the work being done in the context of the Twizzler OS project to explore new programming models for NVM. It’s a really bold project (‘let’s rethink the OS from the ground up’) and generated a lot of lively discussion.
NVM (byte-addressable non-volatile memory) will fundamentally change the way hardware interacts, the way operating systems are designed, and the way applications operate on data.
The starting point is a set of three assumptions for an NVM-based programming model:
Benchmarking spreadsheet systems, Rahman et al., Preprint
A recent Twitter thread drew my attention to this pre-print paper. When spreadsheets were originally conceived, data and formulas were input by hand and so everything operated at human scale. Increasingly we’re dealing with larger and larger datasets — for example, data imported via csv files — and spreadsheets are creaking. I’m certainly familiar with the sinking feeling on realising I’ve accidentally asked a spreadsheet to open up a file with tens of thousands of rows, and that my computer is now going to be locked up for an age. Rahman et al. construct a set of benchmarks to try and understand what might be going on under the covers in Microsoft Excel, Google Sheets, and LibreOffice Calc.
Spreadsheets claim to support pretty large datasets these days – e.g. five million cells for Google Sheets, and even more than that for Excel. But in practice, they struggle at sizes well below this.
With increasing data sizes… spreadsheets have started to break down to the point of being unusable, displaying a number of scalability problems. They often freeze during computation, and are unable to import datasets well below the size limits posed by Continue reading
Declarative assembly of web applications from predefined concepts, De Rosso et al., Onward! 2019
I chose this paper to challenge my own thinking. I’m not really a fan of low-code / no-code / just drag-and-drop-from-our-catalogue forms of application development. My fear is that all too often it’s like jumping on a motorbike and tearing off at great speed (rapid initial progress), only to ride around a bend and find a brick wall across the road in front of you. That doesn’t normally end well. I’ve seen enough generations of CASE (remember that acronym?), component-based software development, reusable software catalogues etc. to develop a healthy scepticism: lowest-common denominators, awkward or missing round-tripping behaviour, terrible debugging experiences, catalogues full of junk components, inability to accommodate custom behaviours not foreseen by the framework/component developers, limited reuse opportunities in practice compared to theory, and so on.
The thing is, on one level I know that I’m wrong. To start with, there’s Grady Booch’s observation that “the whole history of computer science is one of ever rising levels of abstraction”. Then there’s the changing demographic of software building. Heather Miller recently gave a great presentation on this topic, ‘The Continue reading
Efficient lock-free durable sets, Zuriel et al., OOPSLA’19
Given non-volatile memory (NVRAM), the naive hope for persistence is that it would be a no-op: what happens in memory, stays in memory. Unfortunately, a very similar set of issues to those concerned with flushing volatile memory to persistent disk exist here too, just at another level. Memory might be durable, but…
…it is expected that caches and registers will remain volatile. Therefore the state of data structures underlying standard algorithms might not be complete in the NVRAM view, and after a crash this view might not be consistent because of missed writes that were in the caches but did not reach the memory. Moreover, for better performance, the processor may change the order in which writes reach the NVRAM, making it difficult for the NVRAM to even reflect a consistent prefix of the computation.
Plus ça change, plus c’est la même chose.
So, we’re going to need to take care that everything we say is committed is truly durable, and that we can recover to a consistent state following a crash. The traditional way to accomplish this is with a write-ahead log. You’ll no doubt be familiar with the phrase Continue reading
TLA+ model checking made symbolic, Konnov et al., OOPSLA’19
TLA+ is a formal specification language (Temporal Logic of Actions) particularly well suited to reasoning about distributed algorithms. In addition to the specification language, the TLA+ toolset includes a model checker (TLC) and a theorem prover (TLAPS).
Given the huge state spaces involved in many real-world settings, the TLC model checker can take a long time / a lot of resources to run.
While progress towards proof automation in TLAPS has been made in the last years, writing interactive proofs is still a demanding task. Hence, the users prefer to run TLC for days, rather than writing proofs.
Like many people (?!), I often find myself wishing I had the time (and skills!) to model some of the algorithms in the papers I read and take them for a spin in a checker. So anything that can help make that a little bit more tractable is interesting to me.
This paper introduces an alternative symbolic model checker for TLA+ called APALACHE:
Unlike TLC, APALACHE translates the underlying transition relation into quantifier-free SMT constraints, which allows us to exploit the power of SMT solvers.
The Continue reading
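To make ‘translating the transition relation into quantifier-free SMT constraints’ a little more concrete, here’s a toy bounded check using Z3’s Python bindings (my own example; APALACHE’s encoding of TLA+’s rich data language is of course vastly more involved): unroll the transition relation for a fixed number of steps and ask the solver for a path that violates an invariant.

```python
# Toy illustration of symbolic (bounded) model checking with an SMT solver:
# unroll a transition relation for K steps and ask the solver for a path
# that violates the invariant. Requires the z3-solver package; this is my
# own example, not APALACHE's encoding of TLA+.

from z3 import Ints, Solver, Or, And, sat

K = 10                                     # bound on the number of steps
vars_per_step = [Ints(f"x_{i} y_{i}") for i in range(K + 1)]

s = Solver()
x0, y0 = vars_per_step[0]
s.add(x0 == 0, y0 == 0)                    # Init: x = 0, y = 0

for i in range(K):
    x, y = vars_per_step[i]
    xn, yn = vars_per_step[i + 1]
    # Next: either increment x, or increment y
    s.add(Or(And(xn == x + 1, yn == y),
             And(xn == x, yn == y + 1)))

# Invariant to check: x + y <= 5. Ask for a reachable violation.
s.add(Or(*[x + y > 5 for (x, y) in vars_per_step]))

if s.check() == sat:
    m = s.model()
    print("counterexample:",
          [(m[x].as_long(), m[y].as_long()) for x, y in vars_per_step])
else:
    print(f"invariant holds within {K} steps")
```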
Mergeable replicated data types – part II, Kaki et al., OOPSLA’19
Last time out we saw how Mergeable Replicated Data Types (MRDTs) use a bijection between the natural domain of a data type and relational sets to define merge semantics between two concurrently modified versions given their lowest common ancestor (LCA). Today we’re picking things up in §4 of the paper, starting with how to derive a merge function for an arbitrary data type.
The basic approach so far has been to take the difference between each version and the LCA, and add those differences to the LCA state. But using an example of a pair data type holding two integers, the authors show that what we’ve been doing so far isn’t quite good enough. We need to merge the pair state, and also the state of the first and second members of the pair.
The pair example demonstrates the need and opportunity to make merges compositional. The specification of such a composite merge function is invariably compositional in terms of the merge specifications of the types involved.
Given a merge function for counters, we can construct a merge function for a pair of counters. Given a Continue reading
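The counter case fits in a couple of lines, and the compositional step then falls out naturally. Here’s a sketch of my own (eliding the relational encoding the paper actually uses to derive these merges):

```python
# Sketch of three-way merges and their composition (my own simplification;
# the paper derives these via a relational encoding rather than writing
# them by hand).

def merge_counter(lca, a, b):
    # Apply both sides' deltas relative to the lowest common ancestor.
    return lca + (a - lca) + (b - lca)

def merge_pair(merge_fst, merge_snd):
    # A merge for pairs is composed from merges for the components.
    def merge(lca, a, b):
        return (merge_fst(lca[0], a[0], b[0]),
                merge_snd(lca[1], a[1], b[1]))
    return merge

merge_counter_pair = merge_pair(merge_counter, merge_counter)

lca = (10, 10)
v1  = (12, 10)     # replica 1 added 2 to the first counter
v2  = (10, 17)     # replica 2 added 7 to the second counter
assert merge_counter_pair(lca, v1, v2) == (12, 17)
```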