#### Archive

Category Archives for "The Morning Paper"

## Robust learning from untrusted sources

Robust learning from untrusted sources Konstantinov & Lampert, ICML’19

Welcome back to a new term of The Morning Paper! Just before the break we were looking at selected papers from ICML’19, including “Data Shapley.” I’m going to pick things up pretty much where we left off with a few more ICML papers…

Data Shapley provides us with one way of finding and correcting or eliminating low-value (and potentially harmful) data points from a training set. In today’s paper choice, Konstantinov & Lampert provide a way of assessing the value of datasets as a whole. The idea is that you might be learning e.g. a classifier by combining data from multiple sources. By assigning a value (weighting) to each of those data sources we can intelligently combine them to get the best possible performance out of the resulting trained model. So if you need more data in order to improve the performance of your model, ‘Robust learning from untrusted sources’ provides an approach that lets you tap into additional, potentially noisier, sources.

It’s similar in spirit to Snorkel which we looked at last year, and is designed to let you incorporate data from multiple ‘weakly supervised’ (i.e. noisy)

## End of term

I can’t believe we’ve arrived at the end-of-term again already! I’ll be taking a four-week break from writing The Morning Paper, normal service resumes on Monday 19th August. A big milestone will slip quietly by during this recess – it was five years ago on the 30th July 2014 that I read and shared the very first paper in this current streak of paper reading. In case you’re wondering, that paper was "Why functional programming matters" (revisited again on the blog 2 years later). In terms of published posts, we’re also rapidly approaching the 1,000 posts/papers mark! I wonder what amazing research developments the next five years might bring us??!

There are so many interesting papers published all the time, and I can only cover the tiniest fraction of them on The Morning Paper. If you still feel in need of your regular paper fix over the next few weeks, then a great exercise is to think back to a paper you particularly enjoyed, see where it was published, and then go look through the proceedings to discover what else is there you might like.

For example, let’s say you enjoyed ‘Designing far memory data structures: think

## Meta-learning neural Bloom filters

Meta-learning neural bloom filters Rae et al., ICML’19

Bloom filters are wonderful things, enabling us to quickly ask whether a given set could possibly contain a certain value. They produce this answer while using minimal space and offering O(1) inserts and lookups. It’s no wonder Bloom filters and their derivatives (the family of approximate set membership algorithms) are used everywhere. Hash functions are the key to so many great algorithms, and Bloom filters are one of my favourite applications of them.

But Rae et al. think we can do even better, especially when it comes to the space required by an approximate set membership data structure. Being an ICLR paper of course, you won’t be surprised to learn that the solution involves neural networks. This puts us in the same territory as SageDB and ‘The case for learned index structures.’ Probably my favourite sentence in the whole paper is this one, which crisply sets out where machine learning might be able to find an advantage over traditional algorithms:

We build upon the recently growing literature on using neural networks to replace algorithms that are configured by heuristics, or do not take advantage of the data distribution.

Bloom

## Challenging common assumptions in the unsupervised learning of disentangled representations

Challenging common assumptions in the unsupervised learning of disentangled representations Locatello et al., ICML’19

Today’s paper choice won a best paper award at ICML’19. The ‘common assumptions’ that the paper challenges seem to be: “unsupervised learning of disentangled representations is possible, and useful!”

The key idea behind the unsupervised learning of disentangled representations is that real-world data is generated by a few explanatory factors of variation which can be recovered by unsupervised learning algorithms. In this paper, we provide a sober look at recent progress in the field and challenge some common assumptions.

### What exactly is a ‘disentangled representation’ and why might we want one?

Put the ‘disentangled’ part to one side for a moment, and let’s start out by revisiting what we mean by a representation. Given a real-world observation $\mathbf{x}$ (e.g. of an image or video), a representation $r(\mathbf{x})$ is a transformation of $\mathbf{x}$ (typically to a lower dimensional space in order to be useful) that somehow preserves the salient information in the $\mathbf{x}$ so that we can still use $r(\mathbf{x})$ to extract useful information about the input (e.g. for building classifiers). As a trivial example, suppose we had real world observations consisting of 1000 points sampled from a

## Data Shapley: equitable valuation of data for machine learning

Data Shapley: equitable valuation of data for machine learning Ghorbani & Zou et al., ICML’19

It’s incredibly difficult from afar to make sense of the almost 800 papers published at ICML this year! In practical terms I was reduced to looking at papers highlighted by others (e.g. via best paper awards), and scanning the list of paper titles looking for potentially interesting topics. For the next few days we’ll be looking at some of the papers that caught my eye during this process.

The now somewhat tired phrase “data is the new oil” (something we can consume in great quantities to eventually destroy the world as we know it???) suggests that data has value. But pinning down that value can be tricky – how much is a given data point worth, and what framework can we use for thinking about that question?

As data becomes the fuel driving technological and economic growth, a fundamental challenge is how to quantify the value of data in algorithmic predictions and decisions…. In this work we develop a principled framework to address data valuation in the context of supervised machine learning.

One of the nice outcomes is that once you’ve understood

## View-centric performance optimization for database-backed web applications

View-centric performance optimization for database-backed web applications Yang et al., ICSE 2019

The problem set-up in this paper discusses the importance of keeping web page load times low as a fundamental contributor to user satisfaction (See e.g. ‘Why performance matters’). Between client-side tools such as Google’s Lighthouse, back-end tools that can analyse ORM usage and database queries and point out issues such as N+1 selects, and the information provided by your favourite APM I was initially wondering what ground there was left to tread here. So I was pleasantly surprised when it turned out the authors were looking at the problem in a different way to most of these approaches.

Rather than accepting the current rendered view (web page) as seen by the end-user as fixed, and then asking what can be done to optimise the end-to-end loading time of that page, this paper examines the question of what changes to the current view could dramatically reduce its load time? I.e., small (or sometimes not so small) changes to what the end user ultimately sees on the page, that can have a net benefit on the overall user experience.

Empirical studies have found

## Three key checklists and remedies for trustworthy analysis of online controlled experiments at scale

Last time out we looked at machine learning at Microsoft, where we learned among other things that using an online controlled experiment (OCE) approach to rolling out changes to ML-centric software is important. Prior to that we learned in ‘Automating chaos experiments in production’ of the similarities between running a chaos experiment and many other online controlled experiments. And going further back on The Morning Paper we looked at a model for evolving online controlled experiment capabilities within an organisation. Today’s paper choice builds on that by distilling wisdom collected from Microsoft, Airbnb, Snap, Skyscanner, Outreach.io, Intuit, Netflix, and Booking.com into a series of checklists that you can use as a basis for your own processes.

Online Controlled Experiments (OCEs) are becoming a standard operating procedure in data-driven software companies. When executed and analyzed correctly, OCEs deliver many benefits…

The challenge with OCEs though, as we’ve seen before, is that they’re really tricky to get right. When the output of those experiments is guiding product direction, that can be a problem.

Despite their great power

## Software engineering for machine learning: a case study

Software engineering for machine learning: a case study Amershi et al., ICSE’19

Previously on The Morning Paper we’ve looked at the spread of machine learning through Facebook and Google and some of the lessons learned together with processes and tools to address the challenges arising. Today it’s the turn of Microsoft. More specifically, we’ll be looking at the results of an internal study with over 500 participants designed to figure out how product development and software engineering is changing at Microsoft with the rise of AI and ML.

… integration of machine learning components is happening all over the company, not just on teams historically known for it.

A list of application areas includes search, advertising, machine translation, predicting customer purchases, voice recognition, image recognition, identifying customer leads, providing design advice for presentations and word processing documents, creating unique drawing features, healthcare, improving gameplay, sales forecasting, decision optimisation, incident reporting, bug analysis, fraud detection, and security monitoring.

As you might imagine, these are underpinned by a wide variety of different ML models. The teams doing the work are also varied in their make-up, some containing data scientists with many years of experience, and others just starting out. In a

## Automating chaos experiments in production

Automating chaos experiments in production Basiri et al., ICSE 2019

Are you ready to take your system assurance programme to the next level? This is a fascinating paper from members of Netflix’s Resilience Engineering team describing their chaos engineering initiatives: automated controlled experiments designed to verify hypotheses about how the system should behave under gray failure conditions, and to probe for and flush out any weaknesses. The ‘controlled’ part is important here because given the scale and complexity of the environment under test, the only meaningful place to do this is in production with real users.

Maybe that sounds scary, but one of the interesting perspectives this paper brings is to make you realise that it’s really not so different from any other change you might be rolling out into production (e.g. a bug fix, configuration change, new feature, or A/B test). In all cases we need to be able to carefully monitor the impact on the system, and back out if things start going badly wrong. Moreover, just like an A/B test, we’ll be collecting metrics while the experiment is underway and performing statistical analysis at the end to interpret the results.

Netflix’s system is deployed on

## One SQL to rule them all: an efficient and syntactically idiomatic approach to management of streams and tables

In data processing it seems, all roads eventually lead back to SQL! Today’s paper choice is authored by a collection of experts from the Apache Beam, Apache Calcite, and Apache Flink projects, outlining their experiences building SQL interfaces for streaming. The net result is a set of proposed extensions to the SQL standard itself, being worked on under the auspices of the international SQL standardization body.

The thesis of this paper, supported by experience developing large open-source frameworks supporting real-world streaming use cases, is that the SQL language and relational model as-is and with minor non-intrusive extensions, can be very effective for manipulation of streaming data.

Many of the ideas presented here are already implemented by Apache Beam, Calcite, and Flink in some form, as one option amongst several. The streaming SQL interface has been adopted by Alibaba, Hauwei, Lyft, Uber and others, with the following feedback presented to the authors as to why they made this choice:

• Development and adoption costs are significantly lower compared to non-declarative stream processing APIs
• Familiarity with standard SQL eases adoption

## The convoy phenomenon

The convoy phenomenon Blasgen et al., IBM Research Report 1977 (revised 1979)

Today we’re jumping from HotOS topics of 2019, to hot topics of 1977! With thanks to Pat Helland for the recommendation, and with Jim Gray as one of the authors, we have a combination that’s very hard to ignore :).

Here’s the set-up as relayed to me by Pat (with permission):

At work, I am part of a good sized team working on a large system implementation. One of the very senior engineers with 25+ years experience mentioned a problem with the system. It seems that under test load, it would behave beautifully until the performance just fell to the floor. The system just crawled forever and never seemed to get out of this state. Work was getting done but at a pathetic rate. I said: “You have a convoy.” His response was; “Huh?”. I forwarded him the paper on “The Convoy Phenomenon”…

I have to confess I hadn’t heard of the convoy phenomenon either! Before we go on, take a moment to think of possible causes for the system behaviour described above. Lots of things can cause a performance cliff, but the interesting thing

## Machine learning systems are stuck in a rut

Machine learning systems are stuck in a rut Barham & Isard, HotOS’19

In this paper we argue that systems for numerical computing are stuck in a local basin of performance and programmability. Systems researchers are doing an excellent job improving the performance of 5-year old benchmarks, but gradually making it harder to explore innovative machine learning research ideas.

The thrust of the argument is that there’s a chain of inter-linked assumptions / dependencies from the hardware all the way to the programming model, and any time you step outside of the mainstream it’s sufficiently hard to get acceptable performance that researchers are discouraged from doing so.

Take a simple example: it would be really nice if we could have named dimensions instead of always having to work with indices.

Named dimensions improve readability by making it easier to determine how dimensions in the code correspond to the semantic dimensions described in, .e.g., a research paper. We believe their impact could be even greater in improving code modularity, as named dimensions would enable a language to move away from fixing an order on the dimensions of a given tensor, which in turn would make function lifting more convenient…

For

## Designing far memory data structures: think outside the box

Designing far memory data structures: think outside the box Aguilera et al., HotOS’19

Last time out we looked at some of the trade-offs between RInKs and LInKs, and the advantages of local in-memory data structures. There’s another emerging option that we didn’t talk about there: the use of far-memory, memory attached to the network that can be remotely accessed without mediation by a local processor. For many data center applications this looks to me like it could be a compelling future choice.

Far memory brings many potential benefits over near memory: higher memory capacity through disaggregation, separate scaling between processing and far memory, better availability due to separate fault domains for far memory, and better shareability among processors.

It’s not all straightforward though. As we’ve seen a number of times before, there’s a trade-off between fast one-sided access that doesn’t involve the remote CPU, and a more traditional RPC style that does. In particular, if you end up needing to make multiple one-sided requests to get to the data you really need, it’s often faster to just go the RPC route.

Therefore, if we want to make full use of one-sided far memory, we need to think

## Fast key-value stores: an idea whose time has come and gone

Fast key-value stores: an idea whose time has come and gone Adya et al., HotOS’19

No controversy here! Adya et al. would like you to stop using Memcached and Redis, and start building 11-factor apps. Factor VI in the 12-factor app manifesto, “Execute the app as one or more stateless processes,” to be dropped and replaced with “Execute the app as one or more stateful processes.”

It all makes for a highly engaging position paper (even if that engagement doesn’t necessarily take the form of agreement on all points)! It’s healthy to challenge the status-quo from time to time…

Remote, in-memory key-value (RInK) stores such as Memcached and Redis are widely used in industry and are an active area of academic research. Coupled with stateless application servers to execute business logic and a database-like system to provide persistent storage, they form a core component of popular data center service archictectures. We argue that the time of the RInK store has come and and gone…

### What on earth are they thinking? ;)

Why are developers using RInK systems as part of their design? Generally to cache data (including non-persistent data that never sees a backing store), to share

## What bugs cause cloud production incidents?

What bugs cause production cloud incidents? Liu et al., HotOS’19

Last time out we looked at SLOs for cloud platforms, today we’re looking at what causes them to be broken! This is a study of every high severity production incident at Microsoft Azure services over a span of six months, where the root cause of that incident was a software bug. In total, there were 112 such incidents over the period March – September 2018 (not all of them affecting external customers). Software bugs are the most common cause of incidents during this period, accounting for around 40% of all incidents (so we can infer there were around 280 incidents total in the pool).

The 112 incidents caused by software bugs are further broken down into categories, with data-format bugs, fault-related bugs, timing bugs, and constant_value bugs being the largest categories. Interestingly, outages caused by configuration errors represented only a small number of incidents in this study. This could be an artefact of that data set in some way, or it might be due to the tool chain that Microsoft uses:

The types of bugs we observed in production are biased by the fact that Microsoft uses effective

## Nines are not enough: meaningful metrics for clouds

Nines are not enough: meaningful metrics for clouds Mogul & Wilkes, HotOS’19

It’s hard to define good SLOs, especially when outcomes aren’t fully under the control of any single party. The authors of today’s paper should know a thing or two about that: Jeffrey Mogul and John Wilkes at Google1! John Wilkes was also one of the co-authors of chapter 4 “Service Level Objectives” in the SRE book, which is good background reading for the discussion in this paper.

The opening paragraph of the abstract does a great job of framing the problem:

Cloud customers want strong, understandable promises (Service Level Objectives, or SLOs) that their applications will run reliably and with adequate performance, but cloud providers don’t want to offer them, because they are technically hard to meet in the face of arbitrary customer behavior and the hidden interactions brought about by statistical multiplexing of shared resources.

When it comes to SLOs, the interests of the customer and the cloud provider are at odds, and so we end up with SLAs (Service Level Agreements) that tie SLOs to contractual agreements.

### What are we talking about

Let’s start out by getting some terms straight: SLIs, SLOs, SLAs, and

## Towards multiverse databases

Towards multiverse databases Marzoev et al., HotOS’19

A typical backing store for a web application contains data for many users. The application makes queries on behalf of an authenticated user, but it is up to the application itself to make sure that the user only sees data they are entitled to see.

Any frontend can access the whole store, regardless of the application user consuming the results. Therefore, frontend code is responsible for permission checks and privacy-preserving transformations that protect user’s data. This is dangerous and error-prone, and has caused many real-world bugs… the trusted computing base (TCB) effectively includes the entire application.

The central idea behind multiverse databases is to push the data access and privacy rules into the database itself. The database takes on responsibility for authorization and transformation, and the application retains responsibility only for authentication and correct delegation of the authenticated principal on a database call. Such a design rules out an entire class of application errors, protecting private data from accidentally leaking.

It would be safer and easier to specify and transparently enforce access policies once, at the shared backend store interface. Although state-of-the-are databases have security features designed for exactly this purpose,

## A case for managed and model-less inference serving

A case for managed and model-less inference serving Yadwadkar et al., HotOS’19

HotOS’19 is presenting me with something of a problem as there are so many interesting looking papers in the proceedings this year it’s going to be hard to cover them all! As a transition from the SysML papers we’ve been looking at recently, I’ve chosen a HotOS position paper from the Stanford Platform Lab to kick things off. As we saw with the SOAP paper last time out, even with a fixed model variant and hardware there are a lot of different ways to map a training workload over the available hardware. In “A case for managed and model-less inference serving” Yadwadkar et al. look at a similar universe of possibilities for model serving at inference time, and conclude that it’s too much to expect users to navigate this by themselves.

Making queries to an inference engine has many of the same throughput, latency, and cost considerations as making queries to a datastore, and more and more applications are coming to depend on such queries. “For instance, Facebook applications issue tens-of-trillions of inference queries per day with varying performance, accuracy, and cost constraints.”

If

## Beyond data and model parallelism for deep neural networks

Beyond data and model parallelism for deep neural networks Jia et al., SysML’2019

I’m guessing the authors of this paper were spared some of the XML excesses of the late nineties and early noughties, since they have no qualms putting SOAP at the core of their work! To me that means the “simple” object access protocol, but not here:

We introduce SOAP, a more comprehensive search space of parallelization strategies for DNNs that includes strategies to parallelize a DNN in the Sample, Operator, Attribute, and Parameter dimensions.

The goal here is to reduce the training times of DNNs by finding efficient parallel execution strategies, and even including its search time, FlexFlow is able to increase training throughput by up to 3.3x compared to state-of-the-art approaches.

There are two key ideas behind FlexFlow. The first is to expand the set of possible solutions (and hence also the search space!) in the hope of covering more interesting potential solutions. The second is an efficient execution simulator that makes searching that space possible by giving a quick evaluation of the potential performance of a given parallelisation strategy. Combine those with an off-the-shelf Metropolis-Hastings MCMC search strategy and Bob’s your uncle.

## PyTorch-BigGraph: a large-scale graph embedding system

PyTorch-BigGraph: a large-scale graph embedding system Lerer et al., SysML’19

We looked at graph neural networks earlier this year, which operate directly over a graph structure. Via graph autoencoders or other means, another approach is to learn embeddings for the nodes in the graph, and then use these embeddings as inputs into a (regular) neural network:

Working with graph data directly is difficult, so a common technique is to use graph embedding methods to create vector representations for each node so that distances between these vectors predict the occurrence of edges in the graph.

When you’re Facebook, the challenge in learning embeddings is that the graph is big: over two billion user nodes, and over a trillion edges. Alibaba’s graph has more than one billion users and two billion items; Pinterest’s graph has more than 2 billion entities and over 17 billion edges. At this scale we have to find a way to divide-and-conquer. We’ll need to find some parallelism to embed graphs with trillions of edges in reasonable time, and a way of partitioning the problem so that we don’t need all of the embeddings in memory at each node (‘many standard methods exceed the memory