Category Archives for "The Morning Paper"

Applying deep learning to Airbnb search

Applying deep learning to Airbnb search Haldar et al., KDD’19

Last time out we looked at Booking.com’s lessons learned from introducing machine learning to their product stack. Today’s paper takes a look at what happened at Airbnb when they moved from standard machine learning approaches to deep learning. It’s written in a very approachable style and packed with great insights. I hope you enjoy it as much as I did!

Ours is a story of the elements we found useful in applying neural networks to a real life product. Deep learning was steep learning for us. To other teams embarking on similar journeys, we hope an account of our struggles and triumphs will provide some useful pointers.

An ecosystem of models

The core application of machine learning discussed in this paper is the model which orders available listings according to a guest’s likelihood of booking. This is one of a whole ecosystem of models which contribute towards search rankings when a user searches on Airbnb. New models are tested online through an A/B testing framework to compare their performance to previous generations.

The time is right

The very first version of search ranking at Airbnb was a hand-written Continue reading

150 successful machine learning models: 6 lessons learned at Booking.com

150 successful machine learning models: 6 lessons learned at Booking.com Bernardi et al., KDD’19

Here’s a paper that will reward careful study for many organisations. We’ve previously looked at the deep penetration of machine learning models in the product stacks of leading companies, and also some of the pre-requisites for being successful with it. Today’s paper choice is a wonderful summary of lessons learned integrating around 150 successful customer-facing applications of machine learning at Booking.com. Oddly enough given the paper title, the six lessons are never explicitly listed or enumerated in the body of the paper, but they can be inferred from the division into sections. My interpretation of them is as follows:

  1. Projects introducing machine learned models deliver strong business value
  2. Model performance is not the same as business performance
  3. Be clear about the problem you’re trying to solve
  4. Prediction serving latency matters
  5. Get early feedback on model quality
  6. Test the business impact of your models using randomised controlled trials (follows from #2)

There are way more than 6 good pieces of advice contained within the paper though!

We found that driving true business impact is amazingly hard, plus it is difficult to isolate Continue reading

Detecting and characterizing lateral phishing at scale

Detecting and characterizing lateral phishing at scale Ho et al., USENIX Security Symposium 2019

This is an investigation into the phenomenon of lateral phishing attacks. A lateral phishing attack is one where a compromised account within an organisation is used to send out further phishing emails (typically to other employees within the same organisation). So ‘alice at example.com’ might receive a phishing email that has genuinely been sent by ‘bob at example.com’, and Alice is thus more likely to trust it.

In recent years, work from both industry and academia has pointed to the emergence and growth of lateral phishing attacks: a new form of phishing that targets a diverse range of organizations and has already incurred billions of dollars in financial harm…. This attack proves particularly insidious because the attacker automatically benefits from the implicit trust in the hijacked account: trust from both human recipients and conventional email protection systems.

A dataset of 113 million emails…

The study is conducted in conjunction with Barracuda Networks, who obtained customer permission to use email data from the Office 365 employee mailboxes of 92 different organisations. 69 of these organisations were selected through random sampling across all organisations, and 23 Continue reading

In-toto: providing farm-to-table guarantees for bits and bytes

in-toto: providing farm-to-table guarantees for bits and bytes Torres-Arias et al., USENIX Security Symposium 2019

Small world with high risks did a great job of highlighting the absurd risks we’re currently carrying in many software supply chains. There are glimmers of hope though. This paper describes in-toto, an end-to-end system for ensuring the integrity of a software supply chain. To be a little more precise, in-toto secures the end-to-end delivery pipeline for one product or package. But it’s only a small step from there to imagine using in-toto to also verify the provenance of every third-party dependency included in the build, and suddenly you’ve got something that starts to look very interesting indeed.

In-toto is much more than just a research project: it’s already deployed and integrated into a number of different projects and ecosystems, quietly protecting artefacts used by millions of people daily. You can find the in-toto website at https://in-toto.io.

In-toto has about a dozen different integrations that protect software supply chains for millions of end-users.

  • If you install a Debian package using apt, in-toto is protecting it.
  • If you use kubesec to analyze your Kubernetes configurations, in-toto is protecting it.
  • If you use the Continue reading

Small world with high risks: a study of security threats in the npm ecosystem

Small world with high risks: a study of security threats in the npm ecosystem Zimmermann et al., USENIX Security Symposium 2019

This is a fascinating study of the npm ecosystem, looking at the graph of maintainers and packages and its evolution over time. It’s packed with some great data, and also helps us quantify something we’ve probably all had an intuition for: the high risks involved in depending on an open and fast-moving ecosystem. One of the key takeaways for me is the concentration of reach in a comparatively small number of packages and maintainers, making these both very high-value targets (event-stream, it turns out, wouldn’t even have made the top-1000 in a list of ranked targets!), but also high-leverage points for defence. We have to couple this of course with an exceedingly long tail.

The npm ecosystem

As the primary source of third-party JavaScript packages for the client-side, server-side, and other platforms, npm is the centerpiece of a large and important software ecosystem.

Npm is an open ecosystem hosting a collection of over 800,000 packages as of February 2019, and it continues to grow rapidly.

To share a package on npm, a maintainer creates Continue reading

Wireless attacks on aircraft instrument landing systems

Wireless attacks on aircraft instrument landing systems Sathaye et al., USENIX Security Symposium 2019

It’s been a while since we last looked at security attacks against connected real-world entities (e.g., industrial machinery, light-bulbs, and cars). Today’s paper is a good reminder of just how important it is becoming to consider cyber threat models in what are primarily physical systems, especially if you happen to be flying on an aeroplane – which I am right now as I write this!

The first fully operational Instrument Landing System (ILS) for planes was deployed in 1932. But assumptions we’ve been making since then (and until the present day, it appears!) no longer hold:

Security was never considered by design as historically the ability to transmit and receive wireless signals required considerable resources and knowledge. However, the widespread availability of powerful and low-cost software-defined radio platforms has altered the threat landscape. In fact, today the majority of wireless systems employed in modern aviation have been shown to be vulnerable to some form of cyber-physical attacks.

Both sections 1 and 6 in the paper give some eye-opening details of known attacks against aviation systems, but to date no-one Continue reading

50 ways to leak your data: an exploration of apps’ circumvention of the Android permissions system

50 ways to leak your data: an exploration of apps’ circumvention of the Android permissions system Reardon et al., USENIX Security Symposium 2019


The problem is all inside your app, she said to me / The answer is easy if you take it logically / I’d like to help data in its struggle to be free / There must be fifty ways to leak their data.

You just slip it out the back, Jack / Make a new plan, Stan / You don’t need to be coy, Roy / Just get the data free.

Hop it on the bus, Gus / You don’t need to discuss much / Just drop off the key, Lee / And get the data free…

— Lyrics adapted from “50 ways to leave your lover” by Paul Simon (fabulous song btw., you should definitely check it out if you don’t already know it!).


This paper is a study of Android apps in the wild that leak permission-protected data (identifiers which can be used for tracking, and location information), where those apps should not have been able to see such data due to a lack of granted permissions. By detecting Continue reading

The secret sharer: evaluating and testing unintended memorization in neural networks

The secret sharer: evaluating and testing unintended memorization in neural networks Carlini et al., USENIX Security Symposium 2019

This is a really important paper for anyone working with language or generative models, and just in general for anyone interested in understanding some of the broader implications and possible unintended consequences of deep learning. There’s also a lovely sense of the human drama accompanying the discoveries that just creeps through around the edges.

Disclosure of secrets is of particular concern in neural network models that classify or predict sequences of natural language text… even if sensitive or private training data text is very rare, one should assume that well-trained models have paid attention to its precise details…. The users of such models may discover— either by accident or on purpose— that entering certain text prefixes causes the models to output surprisingly revealing text completions.

Take a system trained to make predictions using a language (word or character) model – an example you’re probably familiar with is Google Smart Compose. Now feed it a prefix such as “My social security number is ”. Can you guess what happens next?

As a small scale demonstration, the authors trained a model on Continue reading

Even more amazing papers at VLDB 2019 (that I didn’t have space to cover yet)

We’ve been covering papers from VLDB 2019 for the last three weeks, and next week it will be time to mix things up again. There were so many interesting papers at the conference this year though that I haven’t been able to cover nearly as many as I would like. So today’s post is a short summary of things that caught my eye that I haven’t covered so far. A few of these might make it onto The Morning Paper in weeks to come, you never know!

Industry papers

  • Tunable consistency in MongoDB. MongoDB is an important database, and this paper explains the tunable (per-operation) consistency models that MongoDB provides and how they are implemented under the covers. I really do want to cover this one at some point, along with Implementation of cluster-wide logical clock and causal consistency in MongoDB from SIGMOD 2019.
  • We hear a lot from Google and Microsoft about their cloud platforms, but not quite so much from the other key industry players. So it’s great to see some papers from Alibaba and Tencent here. AliGraph covers Alibaba’s distributed graph engine supporting the development of new GNN applications. Their dataset has about 7B edges… Meanwhile, AnalyticDB Continue reading

Updating graph databases with Cypher

Updating graph databases with Cypher Green et al., VLDB’19

This is the story of a great collaboration between academia, industry, and users of the Cypher graph querying language as created by Neo4j. Beyond Neo4j, Cypher is also supported in SAP HANA Graph, RedisGraph, Agens Graph, and Memgraph. The Cypher for Apache Spark and Cypher over Gremlin projects are both available as open source. The openCypher project brings together Cypher implementors across different projects and products, and aims to produce a full and open specification of the language. There is also a Graph Query Language (GQL) standards organisation.

Cypher is used in hundreds of production applications across many industry vertical domains, such as financial services, telecommunications, manufacturing and retail, logistics, government, and healthcare.

Personally I would have expected that number to be in the thousands by now, and there are some suggestions that it is; however, Neo4j are still only claiming ‘hundreds of customers’ on their own website.

The read-only core of the Cypher language has already been fully formalised. But when it came time to extend that formalism to include the update mechanisms, the authors ran into difficulties.

Our understanding of updates in the popular graph Continue reading

Fine-grained, secure and efficient data provenance on blockchain systems

Fine-grained, secure and efficient data provenance on blockchain systems Ruan et al., VLDB’19

We haven’t covered a blockchain paper on The Morning Paper for a while, and today’s choice won the best paper award at VLDB’19. The goal here is to enable smart contracts to be written in which the contract logic depends on the history, or provenance, of its inputs. For example, a contract that sends a reward amount of tokens to a user based on that user’s average balance per day over some period.

That’s hard to do in today’s blockchain systems for two reasons:

  1. Provenance can only be determined by querying and replaying all on-chain transactions, which is inefficient and an offline activity.
  2. As a consequence the computation of provenance and issuing a subsequent transaction are decoupled and hence there are no serializability guarantees. “In blockchains with native currencies, serializability violations can be exploited for Transaction-Ordering attacks that cause substantial financial loss to the users.”

In other words, smart contracts cannot access historical blockchain states in a tamper-evident manner.

In designing a blockchain-friendly provenance mechanism, three aspects unique to the blockchain environment differentiate the problem from that of traditional database provenance:

  1. There are no Continue reading

Declarative recursive computation on an RDBMS

Declarative recursive computation on an RDBMS… or, why you should use a database for distributed machine learning Jankov et al., VLDB’19

If you think about a system like Procella that’s combining transactional and analytic workloads on top of a cloud-native architecture, extensions to SQL for streaming, dataflow based materialized views (see e.g. Naiad, Noria, Multiverses, and also check out what Materialize are doing here), the ability to use SQL interfaces to query over semi-structured and unstructured data, and so on, then a picture begins to emerge of a unifying large-scale data platform with a SQL query engine on top that addresses all of the data needs of an organisation in a one-stop shop. Except there’s one glaring omission from that list: handling all of the machine learning use cases.

Machine learning inside a relational database has been done before, most notably in the form of MADlib, which was integrated into Greenplum during my time at Pivotal. The Apache MADlib project is still going strong, and the recent (July 2019) 1.16 release even includes some support for deep learning.

To make that vision of a one-stop shop for all of an organisation’s data Continue reading

Procella: unifying serving and analytical data at YouTube

Procella: unifying serving and analytical data at YouTube Chattopadhyay et al., VLDB’19

Academic papers aren’t usually set to music, but if they were the chorus of Queen’s “I want it all (and I want it now…)” seems appropriate here. Anchored in the primary use case of supporting Google’s YouTube business, what we’re looking at here could well be the future of data processing at Google. Well, I say the future, but “Procella has now been in production for multiple years. Today, it is deployed in over a dozen data centers and serves hundreds of billions of queries per day over tens of petabytes of data…” So maybe what we’re looking at is the future of data processing for the rest of us!

Google already has Dremel, Mesa, Photon, F1, PowerDrill, and Spanner, so why did they need yet another data processing system? Because they had too many data processing systems! ;)

Large organizations… are dealing with exploding data volume and increasing demand for data driven applications. Broadly, these can be categorized as: reporting and dashboarding, embedded statistics in pages, time-series monitoring, and ad-hoc analysis. Typically, organizations build specialized infrastructure for each Continue reading

Experiences with approximating queries in Microsoft’s production big-data clusters

Experiences with approximating queries in Microsoft’s production big-data clusters Kandula et al., VLDB’19

I’ve been excited about the potential for approximate query processing in analytic clusters for some time, and this paper describes its use at scale in production. Microsoft’s big data clusters have tens of thousands of machines, and are used by thousands of users to run some pretty complex queries. These clusters are in high demand, and approximate query processing both saves their users time and lightens the overall load on the cluster.

What’s especially nice about this paper is that we get a glimpse into the practical adoption issues of persuading users that it’s ok to approximate too.

We have implemented support for query time sampling in production big-data clusters at Microsoft: these clusters consist of tens of thousands of multi-core, multi-disk servers and are used by developers from many different businesses including Bing, Azure, and Windows. In total, the clusters store a few exabytes of data and are primarily responsible for all of the batch analytics at Microsoft.

Approximate query support

Control over approximation is put into the hands of the user via extensions to the query language. In particular, support for expressing sampling requirements Continue reading

DDSketch: a fast and fully-mergeable quantile sketch with relative-error guarantees

DDSketch: a fast and fully-mergeable quantile sketch with relative-error guarantees Masson et al., VLDB’19

Datadog handles a ton of metrics – some customers have endpoints generating over 10M points per second! For response times (latencies), reporting a simple metric such as ‘average’ is next to useless. Instead we want to understand what’s happening at different latency percentiles (e.g. p99).

The ability to compute quantiles over aggregated metrics has been recognized to be an essential feature of any monitoring system… Given how expensive calculating exact quantiles can be for both storage and network bandwidth, most monitoring systems will compress the data into sketches and compute approximate quantiles.

Fortunately there are plenty of quantile sketching algorithms available including the GK-sketch, the t-digest, the HDR histogram, and the Moments sketch that we looked at last year. For reasons we’ll see shortly though, none of those were good enough for Datadog, so they developed their own sketching data structure, DDSketch. Officially in the paper DDSketch stands for ‘Distributed Distribution Sketch’, but that seems a bit of a stretch… surely it’s the ‘Datadog Sketch’! A glance at the code repository for the Python implementation confirms my suspicion: there are several references to Continue reading

SLOG: serializable, low-latency, geo-replicated transactions

SLOG: serializable, low-latency, geo-replicated transactions Ren et al., VLDB’19

SLOG is another research system motivated by the needs of the application developer (aka, user!). Building correct applications is much easier when the system provides strict serializability guarantees.

Strict serializability reduces application code complexity and bugs, since it behaves like a system that is running on a single machine processing transactions sequentially.

The challenge with strict serializability (or even just serializability on a regular DBMS) is that it requires coordination, and as we know, coordination kills performance. Weaker consistency models can give better performance yet “expose applications to potential race condition bugs, and typically require skilled application programmers.” But developers are the kingmakers (I think it’s been enough time now that we can drop the ‘new’ in that phrase?? ;) ), and thus:

… the demand for systems that support strict serializability has only increased.

So starting with strict serializability as a given, how do we claw back some of that performance? That’s where SLOG (Serializable LOw-latency, Geo-replicated transactions) comes in.

SLOG achieves high throughput, strictly serializable ACID transactions at geo-replicated scale for all transactions submitted across the world, all the while achieving low latency for transactions Continue reading

IPA: invariant-preserving applications for weakly consistent replicated databases

IPA: invariant-preserving applications for weakly consistent replicated databases Balegas et al., VLDB’19

IPA for developers, happy days!

Last week we looked at automating checks for invariant confluence, and extending the set of cases where we can show that an object is indeed invariant confluent. I’m not going to re-cover that background in this write-up, so I suggest you head over there for a quick catch-up before reading on if you missed it first time around.

Today’s paper is very much in the same spirit, building on the same foundation of invariant confluence (I-Confluence), and also on Indigo, which introduced an annotation model for application invariants, an invariant-violation avoidance mechanism using lock reservations and escrows, and limited support for repairing violations that do happen.

With Invariant-Preserving Applications (IPAs), Balegas et al. introduce new mechanisms for avoiding invariant violations and for repairing them when detected, based on CRDTs. There’s also a very nice looking developer workflow to help ensure you’ve got all the bases covered. At the end of the day, you get the dual benefit of higher throughput and lower latency (as compared to coordination-based approaches) coupled with knowing that there isn’t some nasty invariant-violating concurrency bug waiting Continue reading

Choosing a cloud DBMS: architectures and tradeoffs

Choosing a cloud DBMS: architectures and tradeoffs Tan et al., VLDB’19

If you’re moving an OLAP workload to the cloud (AWS in the context of this paper), what DBMS setup should you go with? There’s a broad set of choices including where you store the data, whether you run your own DBMS nodes or use a service, the kinds of instance types to go for if you do run your own, and so forth. Tan et al. use the TPC-H benchmark to assess Redshift, Redshift Spectrum, Athena, Presto, Hive, and Vertica to find out what works best and the trade-offs involved.

We focused on OLAP-oriented parallel data warehouse products available for AWS and restricted our attention to commercially available systems. As it is infeasible to test every OLAP system runnable on AWS, we chose widely-used systems that represented a variety of architectures and cost models.

My key takeaways as a TL;DR:

  • Store your data in S3
  • Use a portable data format that gives you future flexibility to process it with multiple different systems, e.g. ORC or Parquet (see the sketch after this list)
  • Use Athena for workloads it can support (Athena could not run 4 of the 22 TPC-H queries, and Spectrum could not run Continue reading

Interactive checks for coordination avoidance

Interactive checks for coordination avoidance Whittaker & Hellerstein, VLDB’19

I am so pleased to see a database systems paper addressing the concerns of the application developer!

To the developer, a strongly consistent system behaves exactly like a single-threaded system running on a single node, so reasoning about the behaviour of the system is simple. Unfortunately strong consistency is at odds with performance… On the other hand weak consistency models… put a tremendous burden on the application designer to reason about the complex interleavings of operations that are allowed by these weak consistency models.

To be able to reason effectively about a system, it’s tremendously helpful to be able to rely on some things that we know will always be true. The fancy word for those is invariants, because whatever happens they never vary. An example class of application invariants is database integrity constraints. Unfortunately, under weak consistency models finding solid ground to reason about invariants is really hard:

Even if every transaction executing in a weakly consistent system individually maintains an application invariant, the system as a whole can produce invariant-violating states.

In a distributed setting with weak consistency, an object is replicated across a Continue reading

Snuba: automating weak supervision to label training data

Snuba: automating weak supervision to label training data Varma & Ré, VLDB 2019

This week we’re moving on from ICML to start looking at some of the papers from VLDB 2019. VLDB is a huge conference, and once again I have a problem because my shortlist of “that looks really interesting, I’d love to read it” papers runs to 54 at the moment! As a special bonus for me, I’m actually going to be at VLDB this year, where no doubt I’ll learn about even more interesting things! By the time you get to read this, it should be the first (workshop) day of the conference…

The conference may have changed, but to bridge from ICML to VLDB I’m going to start with a paper on very much the same theme as we’ve been dipping into over the past couple of weeks: how to combine and learn from multiple noisy sources of data and labels. Snuba is from the same Stanford line as Snorkel which we looked at last year. It’s tackling the same fundamental problem: how to gather enough labeled data to train a model, and how to effectively use it in a weak supervision setting (supervised learning Continue reading
