adriancolyer

Author Archives: adriancolyer

Even more amazing papers at VLDB 2019 (that I didn’t have space to cover yet)

We’ve been covering papers from VLDB 2019 for the last three weeks, and next week it will be time to mix things up again. There were so many interesting papers at the conference this year though that I haven’t been able to cover nearly as many as I would like. So today’s post is a short summary of things that caught my eye that I haven’t covered so far. A few of these might make it onto The Morning Paper in weeks to come, you never know!

Industry papers

  • Tunable consistency in MongoDB. MongoDB is an important database, and this paper explains the tunable (per-operation) consistency models that MongoDB provides and how they are implemented under the covers. I really do want to cover this one at some point, along with Implementation of cluster-wide logical clock and causal consistency in MongoDB from SIGMOD 2019.
  • We hear a lot from Google and Microsoft about their cloud platforms, but not quite so much from the other key industry players. So it’s great to see some papers from Alibaba and Tencent here. AliGraph covers Alibaba’s distributed graph engine supporting the development of new GNN applications. Their dataset has about 7B edges… Meanwhile, AnalyticDB Continue reading

Updating graph databases with Cypher

Updating graph databases with Cypher Green et al., VLDB’19

This is the story of a great collaboration between academia, industry, and users of the Cypher graph querying language as created by Neo4j. Beyond Neo4j, Cypher is also supported in SAP HANA Graph, RedisGraph, Agnes Graph, and Memgraph. Cypher for Apache Spark, and Cypher over Gremlin projects are also both available in open source. The openCypher project brings together Cypher implementors across different projects and products, and aims to produce a full and open specification of the language. There is also a Graph Query Language (GQL) standards organisation.

Cypher is used in hundreds of production applications across many industry vertical domains, such as financial services, telecommunications, manufacturing and retails, logistics, government, and healthcare.

Personally I would have expected that number to be in the thousands by now and there are some suggestions that it is, however Neo4j are still only claiming ‘hundreds of customers’ on their own website.

The read-only core of the Cypher language has already been fully formalised. But when it came time to extend that formalism to include the update mechanisms, the authors ran into difficulties.

Our understanding of updates in the popular graph Continue reading

Fine-grained, secure and efficient data provenance on blockchain systems

Fine-grained, secure and efficient data provenance on blockchain systems Ruan et al., VLDB’19

We haven’t covered a blockchain paper on The Morning Paper for a while, and today’s choice won the best paper award at VLDB’19. The goal here is to enable smart contracts to be written in which the contract logic depends on the history, or provenance of its inputs.
For example, a contract that sends a reward amount of tokens to a user based on that user’s average balance per day over some period.

That’s hard to do in today’s blockchain systems for two reasons:

  1. Provenance can only be determined by querying and replaying all on-chain transactions, which is inefficient and an offline activity.
  2. As a consequence the computation of provenance and issuing a subsequent transaction are decoupled and hence there are no serializability guarantees. “In blockchains with native currencies, serializabiliy violations can be exploited for Transaction-Ordering attacks that cause substantial financial loss to the users.”

In other words, smart contracts cannot access historical blockchain states in a tamper-evident manner.

In designing a blockchain-friendly provenance mechanism, three aspects unique to the blockchain environment differentiate the problem from that of traditional database provenance:

  1. There are no Continue reading

Declarative recursive computation on an RDBMS

Declarative recursive computation on an RDBMS… or, why you should use a database for distributed machine learing Jankov et al., VLDB’19

If you think about a system like Procella that’s combining transactional and analytic workloads on top of a cloud-native architecture, extensions to SQL for streaming, dataflow based materialized views (see e.g. Naiad, Noria, Multiverses, and also check out what Materialize are doing here), the ability to use SQL interfaces to query over semi-structured and unstructured data, and so on, then a picture begins to emerge of a unifying large-scale data platform with a SQL query engine on top that addresses all of the data needs of an organisation in a one-stop shop. Except there’s one glaring omission from that list: handling all of the machine learning use cases.

Machine learning inside a relational database has been done before, most notably in the form of MADlib, which was integrated into Greenplum during my time at Pivotal. The Apache MADLib project is still going strong, and the recent (July 2019) 1.16 release even includes some support for deep learning.

To make that vision of a one-stop shop for all of an organisation’s data Continue reading

Procella: unifying serving and analytical data at YouTube

Procella: unifying serving and analytical data at YouTube Chattopadhyay et al., VLDB’19

Academic papers aren’t usually set to music, but if they were the chorus of Queen’s “I want it all (and I want it now…)” seems appropriate here. Anchored in the primary use case of supporting Google’s YouTube business, what we’re looking at here could well be the future of data processing at Google. Well, I say the future, but “Procella has now been in production for multiple years. Today, it is deployed in over a dozen data centers and serves hundreds of billions of queries per day over tens of petabytes of data…” So maybe what we’re looking at is the future of data processing for the rest of us!

Google already has Dremel, Mesa, Photon, F1, PowerDrill, and Spanner, so why did they need yet another data processing system? Because they had too many data processing systems! ;)

Large organizations… are dealing with exploding data volume and increasing demand for data driven applications. Broadly, these can be categorized as: reporting and dashboarding, embedded statistics in pages, time-series monitoring, and ad-hoc analysis. Typically, organizations build specialized infrastructure for each Continue reading

Experiences with approximating queries in Microsoft’s production big-data clusters

Experiences with approximating queries in Microsoft’s production big-data clusters Kandula et al., VLDB’19

I’ve been excited about the potential for approximate query processing in analytic clusters for some time, and this paper describes its use at scale in production. Microsoft’s big data clusters have 10s of thousands of machines, and are used by thousands of users to run some pretty complex queries. These clusters are in high demand and approximate query processing both saves their users time and lightens the overall load on the cluster.

What’s especially nice about this paper is that we get a glimpse into the practical adoption issues of persuading users that it’s ok to approximate too.

We have implemented support for query time sampling in production big-data clusters at Microsoft: these clusters consist of tens of thousands of multi-core, multi-disk servers and are used by developers from many different businesses including Bing, Azure, and Windows. In total, the clusters store a few exabytes of data and are primarily responsible for all of the batch analytics at Microsoft.

Approximate query support

Control over approximation is put into the hands of the user via extensions to the query language. In particular, support for expressing sampling requirements Continue reading

DDSketch: a fast and fully-mergeable quantile sketch with relative-error guarantees

DDSketch: a fast and fully-mergeable quantile sketch with relative-error guarantees Masson et al., VLDB’19

Datadog handles a ton of metrics – some customers have endpoints generating over 10M points per second! For response times (latencies) reporting a simple metric such as ‘average’ is next to useless. Instead we want to understand what’s happening at different latency percentiles (e.g p99).

The ability to compute quantiles over aggregated metrics has been recognized to be an essential feature of any monitoring system… Given how expensive calculating exact quantiles can be for both storage and network bandwidth, most monitoring system will compress the data into sketches and compute approximate quantiles.

Fortunately there are plenty of quantile sketching algorithms available including the GK-sketch, the t-digest, the HDR histogram, and the Moments sketch that we looked at last year. For reasons we’ll see shortly though, none of those were good enough for Datadog, so they developed their own sketching data structure, DDSketch. Officially in the paper DDSketch stands for ‘Distributed Distribution Sketch’ but that seems a bit of a stretch… surely it’s the ‘Datadog Sketch’ ! A glance at the code repository for the Python implementation confirms my suspicion: there are several references to Continue reading

SLOG: serializable, low-latency, geo-replicated transactions

SLOG: serializable, low-latency, geo-replicated transactions Ren et al., VLDB’19

SLOG is another research system motivated by the needs of the application developer (aka, user!). Building correct applications is much easier when the system provides strict serializability guarantees.

Strict serializability reduces application code complexity and bugs, since it behaves like a system that is running on a single machine processing transactions sequentially.

The challenge with strict serializability (or even just serializability on a regular DBMS) is that it requires coordination, and as we know, coordination kills performance. Weaker consistency models can give better performance yet “expose applications to potential race condition bugs, and typically require skilled application programmers.” But developers are the kingmakers (I think it’s been enough time now that we can drop the ‘new’ in that phrase?? ;) ), and thus:

… the demand for systems that support strict serializability has only increased.

So starting with strict serializability as a given, how do we claw back some of that performance? That’s where SLOG (Serializable LOw-latency, Geo-replicated transactions) comes in.

SLOG achieves high throughput, strictly serializable ACID transactions at geo-replicated scale for all transactions submitted across the world, all the while achieving low latency for transactions Continue reading

IPA: invariant-preserving applications for weakly consistent replicated databases

IPA: invariant-preserving applications for weakly consistent replicated databases Balegas et al., VLDB’19

IPA for developers, happy days!

Last we week looked at automating checks for invariant confluence, and extending the set of cases where we can show that an object is indeed invariant confluent. I’m not going to re-cover that background in this write-up, so I suggest you head over there for a quick catch-up before reading on if you missed it first time around.

Today’s paper is very much in same spirit, building on the same foundation of invariant confluence (I-Confluence), and also on Indigo which introduced an annotation model for application invariants, a invariant violation avoidance mechanism using lock reservations and escrows, and limited support for repairing violations that do happen.

With Invariant-Preserving Applications (IPAs), Balegas et al. introduce new mechanisms for avoiding invariant violations and for repairing them when detected, based on CRDTs. There’s also a very nice looking developer workflow to help ensure you’ve got all the bases covered. At the end of the day, you get the dual benefit of higher throughput and lower latency (as compared to coordination-based approaches) coupled with knowing that there isn’t some nasty invariant-violating concurrency bug waiting Continue reading

Choosing a cloud DBMS: architectures and tradeoffs

Choosing a cloud DBMS: architectures and tradeoffs Tan et al., VLDB’19

If you’re moving an OLAP workload to the cloud (AWS in the context of this paper), what DBMS setup should you go with? There’s a broad set of choices including where you store the data, whether you run your own DBMS nodes or use a service, the kinds of instance types to go for if you do run your own, and so forth. Tan et al. use the TPC-H benchmark to assess Redshift, Redshift Spectrum, Athena, Presto, Hive, and Vertica to find out what works best and the trade-offs involved.

We focused on OLAP-oriented parallel data warehouse products available for AWS and restricted our attention to commercially available systems. As it is infeasible to test every OLAP system runnable on AWS, we chose widely-used systems that represented a variety of architectures and cost models.

My key takeaways as a TL;DR:

  • Store your data in S3
  • Use portable data format that gives you future flexibility to process it with multiple different systems (e.g. ORC or Parquet)
  • Use Athena for workloads it can support (Athena could not run 4 of the 22 TPC-H queries, and Spectrum could not run Continue reading

Interactive checks for coordination avoidance

Interactive checks for coordination avoidance Whittaker & Hellerstein et al., VLDB’19

I am so pleased to see a database systems paper addressing the concerns of the application developer!

To the developer, a strongly consistent system behaves exactly like a single-threaded system running on a single node, so reasoning about the behaviour of the system is simple1. Unfortunately strong consistency is at odds with performance… On the other hand weak consistency models… put a tremendous burden on the application designer to reason about the complex interleavings of operations that are allowed by these weak consistency models.

To be able to reason effectively about a system, it’s tremendously helpful to be able to rely on some things that we know will always be true. The fancy word for those is invariants, because whatever happens they never vary. A example class of application invariants is database integrity constraints. Unfortunately, under weak consistency models finding solid ground to reason about invariants is really hard:

Even if every transaction executing in a weakly consistent system individually maintains an application invariant, the system as a whole can produce invariant-violating states.

In a distributed setting with weak consistency, an object is replicated across a Continue reading

Snuba: automating weak supervision to label training data

Snuba: automating weak supervision to label training data Varma & Ré, VLDB 2019

This week we’re moving on from ICML to start looking at some of the papers from VLDB 2019. VLDB is a huge conference, and once again I have a problem because my shortlist of “that looks really interesting, I’d love to read it” papers runs to 54 long at the moment! As a special bonus for me, I’m actually going to be at VLDB this year, where no doubt I’ll learn about even more interesting things! By the time you get to read this, it should be the first (workshop) day of the conference…

The conference may have changed, but to bridge from ICML to VLDB I’m going to start with a paper on very much the same theme as we’ve been dipping into over the past couple of weeks: how to combine and learn from multiple noisy sources of data and labels. Snuba is from the same Stanford line as Snorkel which we looked at last year. It’s tackling the same fundamental problem: how to gather enough labeled data to train a model, and how to effectively use it in a weak supervision setting (supervised learning Continue reading

Learning to prove theorems via interacting with proof assistants

Learning to prove theorems via interacting with proof assistants Yang & Deng, ICML’19

Something a little different to end the week: deep learning meets theorem proving! It’s been a while since we gave formal methods some love on The Morning Paper, and this paper piqued my interest.

You’ve probably heard of Coq, a proof management system backed by about 30 years of research and developed out of INRIA. Here’s how the Coq home page introduces it:

Coq is a formal proof management system. It provides a formal language to write mathematical definitions, executable algorithms and theorems together with an environment for semi-interactive development of machine-checked proofs.

Certified programs can then be extracted to languages like OCaml, Haskell, and Scheme.

In fully automated theorem proving (ATP), after providing a suitable formalism and a theorem, the goal is to be able to push a button and have the system prove that the theorem is true (or false!). The state-of-the-art in ATP is still some way behind human experts though it two key areas: the scale of systems it can tackle, and the interpretability of the generated proofs.

What a typical theorem prover does… is to prove by resolution refutation: it Continue reading

Statistical foundations of virtual democracy

Statiscal foundations of virtual democracy Kahng et al., ICML’19

This is another paper on the theme of combining information and making decisions in the face of noise and uncertainty – but the setting is quite different to those we’ve been looking at recently. Consider a food bank that receives donations of food and distributes it to those in need. The goal is to implement an automated decision making system such that when a food donation is received, the system outputs the organisation (e.g. housing authority or food pantry) that should receive it. We could hard code a set of rules, but what should they be? And who gets to decide?

A democratic solution to this would be to give each of the stakeholders a vote on every decision. In the food bank setting, identified classes of stakeholders include the donors, the recipients, the volunteers (who pick up food from the donor and deliver it to the recipient), and employees. Their votes encode their own preferences and biases, perhaps in a way that even the voters themselves couldn’t neatly codify in a set of explicit rules.

It’s not really practical to have an actual vote with all stakeholders participating Continue reading

Robust learning from untrusted sources

Robust learning from untrusted sources Konstantinov & Lampert, ICML’19

Welcome back to a new term of The Morning Paper! Just before the break we were looking at selected papers from ICML’19, including “Data Shapley.” I’m going to pick things up pretty much where we left off with a few more ICML papers…

Data Shapley provides us with one way of finding and correcting or eliminating low-value (and potentially harmful) data points from a training set. In today’s paper choice, Konstantinov & Lampert provide a way of assessing the value of datasets as a whole. The idea is that you might be learning e.g. a classifier by combining data from multiple sources. By assigning a value (weighting) to each of those data sources we can intelligently combine them to get the best possible performance out of the resulting trained model. So if you need more data in order to improve the performance of your model, ‘Robust learning from untrusted sources’ provides an approach that lets you tap into additional, potentially noisier, sources.

It’s similar in spirit to Snorkel which we looked at last year, and is designed to let you incorporate data from multiple ‘weakly supervised’ (i.e. noisy) Continue reading

End of term

I can’t believe we’ve arrived at the end-of-term again already! I’ll be taking a four-week break from writing The Morning Paper, normal service resumes on Monday 19th August. A big milestone will slip quietly by during this recess – it was five years ago on the 30th July 2014 that I read and shared the very first paper in this current streak of paper reading. In case you’re wondering, that paper was "Why functional programming matters" (revisited again on the blog 2 years later). In terms of published posts, we’re also rapidly approaching the 1,000 posts/papers mark! I wonder what amazing research developments the next five years might bring us??!

There are so many interesting papers published all the time, and I can only cover the tiniest fraction of them on The Morning Paper. If you still feel in need of your regular paper fix over the next few weeks, then a great exercise is to think back to a paper you particularly enjoyed, see where it was published, and then go look through the proceedings to discover what else is there you might like.

For example, let’s say you enjoyed ‘Designing far memory data structures: think Continue reading

Meta-learning neural Bloom filters

Meta-learning neural bloom filters Rae et al., ICML’19

Bloom filters are wonderful things, enabling us to quickly ask whether a given set could possibly contain a certain value. They produce this answer while using minimal space and offering O(1) inserts and lookups. It’s no wonder Bloom filters and their derivatives (the family of approximate set membership algorithms) are used everywhere. Hash functions are the key to so many great algorithms, and Bloom filters are one of my favourite applications of them.

But Rae et al. think we can do even better, especially when it comes to the space required by an approximate set membership data structure. Being an ICLR paper of course, you won’t be surprised to learn that the solution involves neural networks. This puts us in the same territory as SageDB and ‘The case for learned index structures.’ Probably my favourite sentence in the whole paper is this one, which crisply sets out where machine learning might be able to find an advantage over traditional algorithms:

We build upon the recently growing literature on using neural networks to replace algorithms that are configured by heuristics, or do not take advantage of the data distribution.

Bloom Continue reading

Challenging common assumptions in the unsupervised learning of disentangled representations

Challenging common assumptions in the unsupervised learning of disentangled representations Locatello et al., ICML’19

Today’s paper choice won a best paper award at ICML’19. The ‘common assumptions’ that the paper challenges seem to be: “unsupervised learning of disentangled representations is possible, and useful!”

The key idea behind the unsupervised learning of disentangled representations is that real-world data is generated by a few explanatory factors of variation which can be recovered by unsupervised learning algorithms. In this paper, we provide a sober look at recent progress in the field and challenge some common assumptions.

What exactly is a ‘disentangled representation’ and why might we want one?

Put the ‘disentangled’ part to one side for a moment, and let’s start out by revisiting what we mean by a representation. Given a real-world observation \mathbf{x} (e.g. of an image or video), a representation r(\mathbf{x}) is a transformation of \mathbf{x} (typically to a lower dimensional space in order to be useful) that somehow preserves the salient information in the \mathbf{x} so that we can still use r(\mathbf{x}) to extract useful information about the input (e.g. for building classifiers). As a trivial example, suppose we had real world observations consisting of 1000 points sampled from a Continue reading

Data Shapley: equitable valuation of data for machine learning

Data Shapley: equitable valuation of data for machine learning Ghorbani & Zou et al., ICML’19

It’s incredibly difficult from afar to make sense of the almost 800 papers published at ICML this year! In practical terms I was reduced to looking at papers highlighted by others (e.g. via best paper awards), and scanning the list of paper titles looking for potentially interesting topics. For the next few days we’ll be looking at some of the papers that caught my eye during this process.

The now somewhat tired phrase “data is the new oil” (something we can consume in great quantities to eventually destroy the world as we know it???) suggests that data has value. But pinning down that value can be tricky – how much is a given data point worth, and what framework can we use for thinking about that question?

As data becomes the fuel driving technological and economic growth, a fundamental challenge is how to quantify the value of data in algorithmic predictions and decisions…. In this work we develop a principled framework to address data valuation in the context of supervised machine learning.

One of the nice outcomes is that once you’ve understood Continue reading

View-centric performance optimization for database-backed web applications

View-centric performance optimization for database-backed web applications Yang et al., ICSE 2019

The problem set-up in this paper discusses the importance of keeping web page load times low as a fundamental contributor to user satisfaction (See e.g. ‘Why performance matters’). Between client-side tools such as Google’s Lighthouse, back-end tools that can analyse ORM usage and database queries and point out issues such as N+1 selects, and the information provided by your favourite APM I was initially wondering what ground there was left to tread here. So I was pleasantly surprised when it turned out the authors were looking at the problem in a different way to most of these approaches.

Rather than accepting the current rendered view (web page) as seen by the end-user as fixed, and then asking what can be done to optimise the end-to-end loading time of that page, this paper examines the question of what changes to the current view could dramatically reduce its load time? I.e., small (or sometimes not so small) changes to what the end user ultimately sees on the page, that can have a net benefit on the overall user experience.

Empirical studies have found Continue reading

1 3 4 5 6 7 15