Author Archives: adriancolyer

SageDB: a learned database system

SageDB: a learned database system Kraska et al., CIDR’19

About this time last year, a paper entitled ‘The case for learned index structures’ (part I, part II) generated a lot of excitement and debate. Today’s paper choice builds on that foundation, putting forward a vision where learned models pervade every aspect of a database system.

The core idea behind SageDB is to build one or more models about the data and workload distribution and based on them automatically build the best data structures and algorithms for all components of the database system. This approach, which we call “database synthesis” will allow us to achieve unprecedented performance by specializing the implementation of every database component to the specific database, query workload, and execution environment.

For the want of a model

In the absence of runtime learning and adaptation, database systems are engineered for general purpose use and do not take full advantage of the specific characteristics of the workload and data at hand. The size of the opportunity for SageDB is the gap between such an approach and what is possible when designing a specialised solution with full knowledge of the data distribution and workload.

Consider an Continue reading
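
As a reminder of the learned-index idea this vision builds on, here's a minimal sketch in Python (my own illustration, not SageDB code): a model that approximates the CDF of the stored keys predicts where a record lives, and a small bounded search corrects any prediction error.

```python
import bisect
import numpy as np

# Toy "learned index": fit a linear model to the key -> position mapping
# (an approximation of the keys' CDF), then correct the model's prediction
# with a bounded local search.
keys = np.sort(np.random.default_rng(0).uniform(0, 1_000_000, size=100_000))
positions = np.arange(len(keys))

# Train: least-squares fit of position as a function of key.
slope, intercept = np.polyfit(keys, positions, deg=1)

# Maximum error of the model over the data, used to bound the search window.
predictions = np.clip((slope * keys + intercept).astype(int), 0, len(keys) - 1)
max_err = int(np.max(np.abs(predictions - positions)))

def lookup(key):
    """Predict a position, then binary-search only within the error bound."""
    guess = int(np.clip(slope * key + intercept, 0, len(keys) - 1))
    lo, hi = max(0, guess - max_err), min(len(keys), guess + max_err + 1)
    i = lo + bisect.bisect_left(keys[lo:hi], key)
    return i if i < len(keys) and keys[i] == key else None

print(lookup(keys[1234]) == 1234)  # True
```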

Serverless computing: one step forward, two steps back

Serverless computing: one step forward, two steps back Hellerstein et al., CIDR’19

The biennial Conference on Innovative Data Systems Research has come round again. Today’s paper choice is sure to generate some healthy debate, and it’s a good set of questions to spend some time thinking over as we head into 2019: Where do you think serverless is heading? What is it good for today? What’s the end-goal here?

The authors see ‘critical gaps’ in current first-generation serverless offerings from the major cloud vendors (AWS, Azure, GCP). I’m sure some will read the paper as serverless-bashing, but I read into it more of an appeal from the heart to not stop where we are today, but to continue to pursue infrastructure and programming models truly designed for cloud platforms. Platforms that offer ‘unlimited’ data storage, ‘unlimited’ distributed processing power, and the ability to harness these only as needed.

We hope this paper shifts the discussion from ‘What is serverless?’ or ‘Will serverless win?’ to a rethinking of how we design infrastructure and programming models to spark real innovation in data-rich, cloud-scale systems. We see the future of cloud programming as far, far brighter than the promise of Continue reading

Unsupervised learning of artistic styles with archetypal style analysis

Unsupervised learning of artistic styles with archetypal style analysis Wynen et al., NeurIPS’18

I’ve always enjoyed following work on artistic style transfer. The visual nature makes it easy to gain an appreciation for what is going on, and the results are very impressive. It’s also something that’s been unfolding within the timespan of The Morning Paper, if we peg the beginning to the work of Gatys et al. in 2015. See for example the posts on ‘Texture Networks’ and ‘Deep photo style transfer.’

Beyond direct style transfer, the objective of the work described in today’s paper choice is to uncover representations of styles (archetypes) themselves. Given a large collection of paintings,…

… Our objective is to automatically discover, summarize, and manipulate artistic styles present in the collection.

This is achieved using an unsupervised learning technique called archetypal analysis. We can recover archetypes from a collection of paintings, and we can also go the other way; taking a painting and decomposing it into a combination of archetypes. And of course if we then manipulate the composition of archetypes, we can manipulate the style of an image.
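
Here's a minimal sketch of the decomposition step (my own illustration, not the authors' code): given archetype vectors, a style representation is expressed as a convex combination of them, i.e. non-negative weights that sum to one, and manipulating those weights is what manipulates the style.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Hypothetical archetypes: k style vectors of dimension d (rows of the matrix).
k, d = 4, 16
archetypes = rng.normal(size=(k, d))

# A painting's style vector, synthesised here as a noisy mixture so that we
# know roughly what decomposition to expect.
true_weights = np.array([0.6, 0.3, 0.1, 0.0])
style = true_weights @ archetypes + 0.01 * rng.normal(size=d)

# Decompose: find convex weights (non-negative, summing to one) whose weighted
# combination of archetypes best reconstructs the style vector.
result = minimize(
    fun=lambda w: np.sum((w @ archetypes - style) ** 2),
    x0=np.full(k, 1.0 / k),
    bounds=[(0.0, 1.0)] * k,
    constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}],
    method="SLSQP",
)
print(np.round(result.x, 2))  # roughly [0.6, 0.3, 0.1, 0.0]
# Style manipulation then amounts to editing these weights and re-synthesising
# the image from the adjusted combination.
```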

To visualise what an archetype ‘looks like’ the authors synthesise Continue reading

Neural Ordinary Differential Equations

Neural ordinary differential equations Chen et al., NeurIPS’18

‘Neural Ordinary Differential Equations’ won a best paper award at NeurIPS last month. It’s not an easy piece (at least not for me!), but in the spirit of ‘deliberate practice’ that doesn’t mean there isn’t something to be gained from trying to understand as much as possible.

In addition to the paper itself, I found the following additional resources to be helpful:

Neural networks as differential equations

Consider a multi-layered neural network. We have an input layer and an output layer, and in between them some number of hidden layers. As an input feeds forward through the network, it is progressively transformed, one layer at a time, from the input to the ultimate output. Each network layer is a step on that journey. If we take a small number of big steps, we end up with a rough approximation to the true transformation function we’d like to learn. If we take a much Continue reading
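
A minimal sketch of that view (my own illustration): a residual network repeatedly applies h ← h + f(h), which is exactly Euler integration of dh/dt = f(h, t) with step size 1; taking more, smaller steps approaches the continuous limit the paper works with.

```python
import numpy as np

def f(h, t):
    """Some learned transformation of the hidden state; a fixed toy one here."""
    return np.tanh(h) * 0.1 - 0.05 * h

def residual_net(h, n_layers):
    """Residual layers: h_{t+1} = h_t + f(h_t, t). One layer == one Euler step of size 1."""
    for t in range(n_layers):
        h = h + f(h, t)
    return h

def euler_ode(h, t0, t1, n_steps):
    """Euler integration of dh/dt = f(h, t) using n_steps small steps."""
    dt = (t1 - t0) / n_steps
    t = t0
    for _ in range(n_steps):
        h = h + dt * f(h, t)
        t += dt
    return h

h0 = np.array([1.0, -2.0, 0.5])
print(residual_net(h0, n_layers=4))   # a few big steps: coarse approximation
print(euler_ode(h0, 0.0, 4.0, 400))   # many small steps: closer to the true ODE solution
```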

The tradeoffs of large scale learning

The tradeoffs of large scale learning Bottou & Bousquet, NIPS’07

Welcome to another year of The Morning Paper. As usual we’ll be looking at a broad cross-section of computer science research (I have over 40 conferences/workshops on my list to keep an eye on as a start!). I’ve no idea yet what papers we’ll stumble across, but if previous years are anything to go by I’m sure there’ll be plenty of great material to keep interest levels high.

To start us off, today’s paper choice is “The tradeoffs of large scale learning,” which won the ‘test of time’ award at NeurIPS last month.

this seminal work investigated the interplay between data and computation in ML, showing that if one is limited by computing power but can make use of a large dataset, it is more efficient to perform a small amount of computation on many individual training examples rather than to perform extensive computation on a subset of the data. [Google AI blog: The NeurIPS 2018 Test of Time Award].

For a given time/computation budget, are we better off performing a computationally cheaper (e.g., approximate) computation over lots of data, or a more accurate computation Continue reading
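
To make the budget explicit, here's a toy sketch of the two regimes the paper compares (my own illustration, not from the paper; which strategy wins depends on the constants, and the paper's analysis is what makes that precise): cheap per-example updates over lots of data versus exact batch gradients over a subset, both spending the same number of per-example gradient evaluations.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic linear-regression data: lots of examples, a fixed gradient budget.
n, d = 100_000, 20
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

BUDGET = 100_000  # total per-example gradient evaluations allowed

def sgd_over_all_data(lr=0.01):
    """Cheap computation on many examples: one pass of SGD over the full dataset."""
    w = np.zeros(d)
    for i in range(BUDGET):                  # one gradient evaluation per example
        xi, yi = X[i % n], y[i % n]
        w -= lr * (xi @ w - yi) * xi
    return w

def batch_gd_on_subset(m=1_000, lr=0.1):
    """Accurate computation on a subset: full-batch gradient descent, same budget."""
    Xs, ys = X[:m], y[:m]
    w = np.zeros(d)
    for _ in range(BUDGET // m):             # each step costs m gradient evaluations
        w -= lr * Xs.T @ (Xs @ w - ys) / m
    return w

for name, w in [("SGD over all data   ", sgd_over_all_data()),
                ("Batch GD on a subset", batch_gd_on_subset())]:
    print(name, "MSE over all data:", np.mean((X @ w - y) ** 2))
```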

Towards a theory of software development expertise

Towards a theory of software development expertise Baltes et al., ESEC/FSE’18

This is the last paper we’ll be looking at this year, so I’ve chosen something a little more reflective to leave you with (The Morning Paper will return on Monday 7th January, 2019). The question Baltes and Diehl tackle is this: “How do you get better as a software developer?” What does expert performance look like?

We present a first conceptual theory of software development expertise that is grounded in data from a mixed-methods survey with 335 software developers and in literature on expertise and expert performance…. [the theory] describes central properties of software development expertise and important factors influencing its formation.

In essence, ask a bunch of practitioners what they think, use a disciplined coding scheme to interpret the answers (a “grounded theory”), and then layer in what we know about expertise and expert performance in general. The end result is a “conceptual theory” that shows the various contributors to expert performance and the relationships between them. “Software Development” in the current work is synonymous with “programming.”

To make the paper come alive you need to engage with it a little: Does the theory developed Continue reading

Identifying impactful service system problems via log analysis

Identifying impactful service system problems via log analysis He et al., ESEC/FSE’18

If something is going wrong in your system, chances are you’ve got two main sources to help you detect and resolve the issue: logs and metrics. You’re unlikely to be able to get to the bottom of a problem using metrics alone (though you might well detect one that way), so that leaves logs as the primary diagnosis tool. The online service at Microsoft used as the main case study in the paper produces dozens of Terabytes of logs every day.

Logs play a crucial role in the diagnosis of modern cloud-based online service systems. Clearly, manual problem diagnosis is very time-consuming and error-prone due to the increasing scale and complexity of large-scale systems.

Log3C analyses logs to look for indications of impactful problems, using correlated KPIs as a guide. It finds these needles in the haystack with an average precision of 0.877 and an average recall of 0.883. A distributed version of Log3C has been deployed and used in production at Microsoft for several years, both to support a massive online service (we are not told which one), and integrated into “Product B” where Continue reading
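
As a rough sketch of the overall idea as I read it (my own illustration, not the authors' implementation): group log sequences into clusters, then use correlation with a degraded KPI to rank which clusters point at impactful problems.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy setup: for each time interval, counts of log sequences falling into each
# of k clusters (in Log3C these come from a cascading clustering step), plus a
# KPI such as failure rate for the same interval.
intervals, k = 200, 6
cluster_counts = rng.poisson(lam=5.0, size=(intervals, k)).astype(float)

# Make cluster 2 the "impactful" one: when it spikes, the KPI degrades.
kpi = 0.01 + 0.02 * cluster_counts[:, 2] + 0.005 * rng.normal(size=intervals)

# Correlate each cluster's occurrence counts with the KPI. (The paper fits a
# regression over all clusters; per-cluster correlation is the simplest stand-in.)
correlations = [np.corrcoef(cluster_counts[:, j], kpi)[0, 1] for j in range(k)]
ranked = np.argsort(correlations)[::-1]
print("clusters ranked by KPI correlation:", ranked)  # cluster 2 should rank first
```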

Applied machine learning at Facebook: a datacenter infrastructure perspective

Applied machine learning at Facebook: a datacenter infrastructure perspective Hazelwood et al., HPCA’18

This is a wonderful glimpse into what it’s like when machine learning comes to pervade nearly every part of a business, with implications top-to-bottom through the whole stack. It’s amazing to step back and think just how fundamentally software systems have changed over the last decade in this regard.

Just how pervasive is machine learning at Facebook?

  • “At Facebook, machine learning provides key capabilities in driving nearly all aspects of user experience… Machine learning is applied pervasively across nearly all services.”
  • “Facebook funnels a large fraction of all stored data through machine learning pipelines, and this fraction is increasing over time to improve model quality.”
  • “Looking forward, Facebook expects rapid growth in machine learning across existing and new services…. Over time, most services indicate a trend toward leveraging increased amounts of user data… the training data sets are trending towards continued and sometimes dramatic growth.”

The modern user-experience is increasingly powered by machine learning models, and the quality of those models depends directly on the volume and quality of the data powering them: “For many machine learning models Continue reading

Darwinian data structure selection

Darwinian data structure selection Basios et al., FSE’18

GraphIt may have caught your attention for the success of its approach, but I suspect for many readers it’s not something you’ll be immediately applying. Darwinian Data Structures (DDSs) on the other hand looks to be of immediate interest to many Java and C++ projects (and generalises beyond those languages).

What I would have called an ADT (e.g., a List), the authors call Darwinian Data Structures. The ‘Darwinian’ part comes from the fact that ADTs have multiple concrete implementations, and Artemis, “a multi-objective, cloud-based search-based optimisation framework” finds the best implementation class (e.g. ArrayList, LinkedList) for your specific use case. It does this using the NSGA-II genetic algorithm-based optimiser in the current implementation.

In brief, Artemis finds the places in your code where you are using an ADT, and explores the possible concrete instantiation space for those ADTs using your test suite as a guide to performance. Then it outputs the transformed source. You might be wondering whether e.g. LinkedList vs ArrayList makes that big a difference in most real world projects:

Artemis achieves substantial performance improvements for every project in Continue reading
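
The search itself is easy to picture. Here's a minimal sketch in Python rather than Java (my own illustration, not Artemis): enumerate candidate implementations of an ADT, run a representative workload against each, and keep the one that optimises the measured objective. Artemis does this at scale, over many ADT instances at once, with NSGA-II exploring the combinatorial space rather than brute force.

```python
import time
from collections import deque

# Candidate concrete implementations of a "sequence" ADT.
CANDIDATES = {"list": list, "deque": deque}

def workload(seq_factory):
    """Stand-in for the project's test suite / representative workload:
    lots of appends followed by pops from the front."""
    s = seq_factory()
    for i in range(20_000):
        s.append(i)
    while s:
        s.pop(0) if isinstance(s, list) else s.popleft()

def select_implementation():
    timings = {}
    for name, factory in CANDIDATES.items():
        start = time.perf_counter()
        workload(factory)
        timings[name] = time.perf_counter() - start
    return min(timings, key=timings.get), timings

best, timings = select_implementation()
print(timings)           # deque should win: O(1) pops from the front vs O(n) for list
print("selected:", best)
```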

GraphIt: A high-performance graph DSL

GraphIt: a high-performance graph DSL Zhang et al., OOPSLA’18

See also: http://graphit-lang.org/.

The problem with finding the optimal algorithm and data structures for a given problem is that so often it depends. This is especially true when it comes to graph algorithms.

It is difficult to implement high-performance graph algorithms. The performance bottlenecks of these algorithms depend not only on the algorithm and the underlying hardware, but also on the size and structure of the graph. As a result, different algorithms running on the same machine, or even the same algorithm running with different types of graph on the same machine, can exhibit different performance bottlenecks.

What we’d like therefore, is some way of expressing a graph algorithm at a high level such that we can map it into different implementations, each applying different optimisations as needed. For bonus points, we could then automate the search within the optimisation space to find the best performing combination for the circumstances at hand.

This is exactly what GraphIt does. GraphIt combines a DSL for specifying graph algorithms with a separate scheduling language that determines implementation policy. You can specify a schedule yourself, or use autotuning to discover optimal schedules for Continue reading
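
A minimal sketch of the algorithm/schedule separation (my own illustration in Python, not GraphIt syntax): the same PageRank-style update can be executed with a ‘push’ schedule (each vertex scatters its contribution along out-edges) or a ‘pull’ schedule (each vertex gathers from its in-neighbours); which one performs better depends on the graph.

```python
import numpy as np

# A small directed graph as an edge list (src, dst).
edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]
n = 4
out_deg = np.bincount([s for s, _ in edges], minlength=n)

def pagerank_step_push(rank):
    """'Push' schedule: iterate over sources, scatter contributions to destinations."""
    new = np.full(n, 0.15 / n)
    for s, d in edges:
        new[d] += 0.85 * rank[s] / out_deg[s]
    return new

def pagerank_step_pull(rank):
    """'Pull' schedule: iterate over destinations, gather from in-neighbours."""
    in_neighbours = {v: [s for s, d in edges if d == v] for v in range(n)}
    new = np.full(n, 0.15 / n)
    for v in range(n):
        new[v] += 0.85 * sum(rank[s] / out_deg[s] for s in in_neighbours[v])
    return new

rank = np.full(n, 1.0 / n)
for _ in range(20):
    rank = pagerank_step_push(rank)
print(np.round(rank, 3))
# The pull version reaches the same fixed point with a different traversal order;
# GraphIt's scheduling language switches between such strategies (and many more)
# without touching the algorithm specification.
```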

MadMax: surviving out-of-gas conditions in Ethereum smart contracts

MadMax: surviving out-of-gas conditions in ethereum smart contracts Grech et al., OOPSLA’18

We’re transitioning to look at a selection of papers from the recent OOPSLA conference this week. MadMax won a distinguished paper award, and makes a nice bridge from the CCS blockchain papers we were looking at last week.

Analysis and verification of smart contracts is a high-value task, possibly more so than in any other programming setting. The combination of monetary value and public availability makes the early detection of vulnerabilities a task of paramount importance. (Detection may occur after contract deployment. Despite the code immutability, which prevents bug fixes, discovering a vulnerability before an attacker may exploit it could enable a trusted third party to move vulnerable funds to safety).

MadMax is in the same vein as Securify, performing EVM bytecode analysis using Datalog (also with Soufflé) to infer security issues in contracts. In this instance, MadMax focuses on detecting vulnerabilities caused by out-of-gas conditions. The paper touches on some nice reusable building blocks (e.g. Vandal). I could easily see Vandal + Soufflé becoming a standard foundation for powerful EVM-based smart contract analysis.

MadMax is available on GitHub at https://github.com/nevillegreech/MadMax.

MadMax Continue reading

RapidChain: scaling blockchain via full sharding

RapidChain: scaling blockchain via full sharding Zamani et al., CCS’18

RapidChain is a sharding-based public blockchain protocol along the lines of OmniLedger that we looked at earlier in the year. RapidChain is resilient to Byzantine faults from up to 1/3 of its participants, requires no trusted setup, and can achieve more than 7,300 tx/sec with an expected confirmation latency of roughly 8.7 seconds in a network of 4,000 nodes with a time-to-failure of more than 4,500 years. Those are pretty interesting numbers!



RapidChain partitions the set of nodes into multiple smaller groups of nodes called committees that operate in parallel on disjoint blocks of transactions and maintain disjoint ledgers.

With n nodes, each committee is of size m = c log n, where c is a security parameter typically set to around 20. E.g., with 1,000 nodes we’ll have around 17 committees of roughly 60 nodes each. To make all this work we’ll need a number of different parts:

  • A way to bootstrap the initial system
  • A way of forming (and re-forming) committees, and allowing nodes to join and leave
  • A way of reaching consensus within a committee (shard)
  • A way of verifying transactions that cross shards
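
A quick sanity check of the committee sizing above (my own arithmetic; the worked example only comes out at 60-node committees if the logarithm is base 10, so that's the assumption here):

```python
import math

def committee_size(n, c=20):
    # m = c * log(n); base-10 log assumed, to match the worked example above.
    return int(c * math.log10(n))

for n in (1_000, 4_000):
    m = committee_size(n)
    print(f"n={n}: committee size ~{m}, ~{n // m} committees")
# n=1000 gives committees of ~60 nodes and roughly 17 of them, as above.
```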

Bootstrapping

The initial set of participants start Continue reading

FairSwap: how to fairly exchange digital goods

FairSwap: how to fairly exchange digital goods Dziembowski et al., CCS’18

(Preprint)

This is a transactions paper with a twist. The transactions we’re talking about are purchases of digital assets. More specifically, the purchase of a file (document, movie, archive of a dataset, …). The property we strongly care about is atomicity: either the seller receives payment and the buyer receives a valid file or neither of these things happen. The buyer and seller don’t trust each other (so e.g., “you send me the payment and then I’ll send you the file” is not an acceptable solution, nor is “you send me the file and then I’ll send you the payment”). This is known as the fair exchange problem.

Fair exchange is a well studied research problem. It has been shown that without further assumptions fair exchange cannot be achieved without a Trusted Third Party (TTP). To circumvent this impossibility, research has studied weaker security models— most notably, the optimistic model in which a TTP is consulted only in case one party deviates from the expected behavior.

In many real-world scenarios, escrow services play the role of the trusted third party. Unfortunately this means you Continue reading

Securify: practical security analysis of smart contracts

Securify: practical security analysis of smart contracts Tsankov et al., CCS’18

Sometimes the perfect is the enemy of the good. When we’re talking about securing smart contracts, we need all the help we can get! Bugs can cost millions of dollars. Securify uses a set of expert heuristics (patterns) to help identify issues in smart contracts. It’s available at https://securify.ch, has analysed over 18K uploaded contracts, and is used by security auditors as part of their arsenal.

The increased adoption of smart contracts demands strong security guarantees. Unfortunately, it is challenging to create smart contracts that are free of security bugs. As a consequence, critical vulnerabilities in smart contracts are discovered and exploited every few months. In turn, these exploits have led to losses reaching millions worth of USD in the past few years…. Despite their potential, repeated security concerns have shaken the trust in handling billions of USD by smart contracts.

Too right! We’ve examined some of the challenges involved in creating correct smart contracts in previous editions of The Morning Paper, as well as tools such as Zeus that help with verification.

It’s not a solvable problem in the general case (i.e., ‘perfect’ Continue reading

LEMNA: explaining deep learning based security applications

LEMNA: explaining deep learning based security applications Guo et al., CCS’18

Understanding why a deep learning model produces the outputs it does is an important part of gaining trust in the model, and in some situations being able to explain decisions is a strong requirement. Today’s paper shows that by carefully considering the architectural features of a given model, it’s possible to co-design an explanatory model. The idea is applied to deep learning models in the security domain (to detect the start of functions within binaries, and to detect malware) where for reasons we’ll look at next, the assumptions made by black-box explainers such as LIME don’t apply.

Like LIME, LEMNA approximates a local area of a complex deep learning decision boundary using a simple interpretable model. Unlike LIME, LEMNA can handle non-linear local boundaries and feature dependencies (e.g., for sequences fed into RNNs, which explicitly model dependencies in sequential data).
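
To make the LIME-style starting point concrete, here's a minimal sketch (my own illustration, not LEMNA itself): sample perturbations around an input, query the black-box model, and fit a simple surrogate to the local behaviour; the surrogate's coefficients rank features by local influence. LEMNA swaps the plain linear surrogate for a mixture regression model fitted with fused lasso, which is what lets it cope with non-linear local boundaries and dependencies between adjacent features.

```python
import numpy as np

rng = np.random.default_rng(3)

def black_box(X):
    """Stand-in for a deep model's scoring function (unknown to the explainer)."""
    return 1.0 / (1.0 + np.exp(-(2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.5 * X[:, 0] * X[:, 1])))

x0 = rng.normal(size=8)   # the instance to explain

# Sample perturbations in a local neighbourhood of x0 and query the model.
perturbations = x0 + 0.1 * rng.normal(size=(500, x0.size))
scores = black_box(perturbations)

# Fit a simple linear surrogate to the local behaviour (LIME-style).
A = np.hstack([perturbations - x0, np.ones((500, 1))])
coef, *_ = np.linalg.lstsq(A, scores, rcond=None)
importance = np.abs(coef[:-1])
print("most influential features:", np.argsort(importance)[::-1][:3])
```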

Why explainability matters

While intrigued by the high accuracy, security practitioners are concerned about the lack of transparency of deep learning models, and thus hesitate to widely adopt deep learning classifiers in security and safety-critical areas.

Explanations that are understandable by security analysts can help Continue reading

Towards usable checksums: automating the integrity verification of web downloads for the masses

Towards usable checksums: automating the integrity verification of web downloads for the masses Cherubini et al., CCS’18

If you tackled Monday’s paper on BEAT you deserve something a little easier to digest today, and ‘Towards usable checksums’ fits the bill nicely! There’s some great data-driven product management going on here as the authors set out to quantify current attitudes and behaviours regarding downloading files from the Internet, design a solution to improve security and ease-of-use, and then test their solution to gather feedback and prepare for a more widely deployed beta version.

When I was growing up we were all taught “Don’t talk to strangers”, and “Never get in a stranger’s car”. As has been well noted by others, so much for that advice! Perhaps the modern equivalent is “Don’t download unknown files from the Internet!” This paper specifically looks at applications made directly available from developer websites (vs downloads made through app stores).

A popular and convenient way to download programs is to use official app stores such as Apple’s Mac App Store and Microsoft’s Windows Store. Such platforms, however, have several drawbacks for developers, including long review and validation times, technical restrictions (e.g., sandboxing), Continue reading

BEAT: asynchronous BFT made practical

BEAT: asynchronous BFT made practical Duan et al., CCS’18

Reaching agreement (consensus) is hard enough, doing it in the presence of active adversaries who can tamper with or destroy your communications is much harder still. That’s the world of Byzantine fault tolerance (BFT). We’ve looked at Practical BFT (PBFT) and HoneyBadger on previous editions of The Morning Paper. Today’s paper, BEAT, builds on top of HoneyBadger to offer BFT with even better latency and throughput.

Asynchronous BFT protocols are arguably the most appropriate solutions for building high-assurance and intrusion-tolerant permissioned blockchains in wide-area network (WAN) environments, as these asynchronous protocols are inherently more robust against timing and denial-of-service (DoS) attacks that can be mounted over an unprotected network such as the Internet.

The best performing asynchronous BFT protocol, HoneyBadger, still lags behind the partially synchronous PBFT protocol in terms of throughput and latency. BEAT is actually a family of five different asynchronous BFT protocols that start from the HoneyBadger baseline and make improvements targeted at different application scenarios.

Unlike HoneyBadgerBFT, which was designed to optimize throughput only, BEAT aims to be flexible and versatile, providing protocol instances optimized for latency, throughput, bandwidth, or scalability (in terms of the number Continue reading

Uncertainty propagation in data processing systems

Uncertainty propagation in data processing systems Manousakis et al., SoCC’18

When I’m writing an edition of The Morning Paper, I often imagine a conversation with a hypothetical reader sat in a coffee shop somewhere at the start of their day. There are three levels of takeaway from today’s paper choice:

  • If you’re downing a quick espresso, then it’s good to know that uncertainty can creep into our data in lots of different ways, and if you compute with those uncertain values as if they were precise, errors can compound quickly leading to incorrect results or false confidence.
  • If you’re savouring a cortado, then you might also want to dip into the techniques we can use to propagate uncertainty through a computation.
  • If you’re lingering over a latte, then the UP (Uncertainty Propagation) framework additionally shows how to integrate these techniques into a dataflow framework.

We implement this framework in a system called UP-MapReduce, and use it to modify ten applications, including AI/ML, image processing, and trend analysis applications to process uncertain data. Our evaluation shows that UP-MapReduce propagates uncertainties with high accuracy and, in many cases, low performance overheads.
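
A minimal sketch of first-order (delta-method) propagation, one of the standard techniques this line of work builds on (my own illustration, not UP-MapReduce code): carry (mean, variance) pairs through a computation, linearising each function around the mean.

```python
import numpy as np

# Represent an uncertain value as (mean, variance) and propagate it through a
# differentiable function using a first-order Taylor expansion:
#   f(X) has mean ~ f(mu) and variance ~ f'(mu)^2 * var   (independent inputs).

def propagate_unary(f, dfdx, mean, var):
    return f(mean), (dfdx(mean) ** 2) * var

def propagate_sum(mean_a, var_a, mean_b, var_b):
    # Sum of independent uncertain values: means and variances both add.
    return mean_a + mean_b, var_a + var_b

# Example: y = log(a) + b, with uncertain a and b.
a_mean, a_var = 10.0, 0.25
b_mean, b_var = 3.0, 0.04

log_mean, log_var = propagate_unary(np.log, lambda x: 1.0 / x, a_mean, a_var)
y_mean, y_var = propagate_sum(log_mean, log_var, b_mean, b_var)
print(f"delta method: y ~ {y_mean:.3f} +/- {np.sqrt(y_var):.3f}")

# Monte Carlo check of the same computation.
samples = np.log(np.random.normal(a_mean, np.sqrt(a_var), 100_000)) \
          + np.random.normal(b_mean, np.sqrt(b_var), 100_000)
print(f"Monte Carlo:  y ~ {samples.mean():.3f} +/- {samples.std():.3f}")
```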

Are you sure?

Uncertainty can arise from a number of Continue reading

Continuum: a platform for cost-aware low-latency continual learning

Continuum: a platform for cost-aware low-latency continual learning Tian et al., SoCC’18

Let’s start with some broad approximations. Batching leads to higher throughput at the cost of higher latency. Processing items one at a time leads to lower latency and often reduced throughput. We can recover throughput to a degree by throwing horizontally scalable resources at the problem, but it’s hard to recover latency. In many business scenarios latency matters, so we’ve been seeing a movement over time from batching through micro-batching to online streaming.

Continuum looks at the same issues from the perspective of machine learning models. Offline (batch) trained models can suffer from concept drift (loss of accuracy over time) as a result of not incorporating the latest data. I.e., there’s a business cost incurred for higher latency of update incorporation. Online models support incremental updates. Continuum determines the optimum time to retrain models in the presence of incoming data, based on user policy (best effort, cost-aware, or user-defined). There’s some great data here about the need for and benefit of continual learning, and a surprising twist in the tale where it turns out that even if you can afford it, updating the model on Continue reading
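
A minimal sketch of a cost-aware retraining policy in that spirit (my own illustration; Continuum's actual policies and cost model are more refined than this): trigger retraining when the estimated cost of continuing to serve a stale model outweighs the cost of a retraining run.

```python
def should_retrain(pending_samples, seconds_since_update, retrain_cost_s,
                   staleness_cost_per_sample_s=1e-4):
    """Retrain when accumulated staleness 'cost' (pending data volume weighted
    by how long it has been waiting) exceeds the cost of a retraining run.
    The cost constants here are made-up illustrative values."""
    accumulated_cost = pending_samples * seconds_since_update * staleness_cost_per_sample_s
    return accumulated_cost >= retrain_cost_s

# Example: 20k new samples have waited 30s; retraining costs ~45s of resources.
print(should_retrain(pending_samples=20_000, seconds_since_update=30,
                     retrain_cost_s=45))   # True -> kick off an update now
```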

ScootR: scaling R dataframes on dataflow systems

ScootR: scaling R dataframes on dataflow systems Kunft et al., SoCC’18

The language of big data is Java ( / Scala). The languages of data science are Python and R. So what do you do when you want to run your data science analysis over large amounts of data?

…programming languages with rich support for data manipulation and statistics, such as R and Python, have become increasingly popular… [but]… they are typically designed for single machine and in-memory usage…. In contrast, parallel dataflow systems, such as Apache Flink and Apache Spark, are able to handle large amounts of data. However, data scientists are often unfamiliar with the systems’ native language and programming abstraction, which is crucial to achieve good performance.

A tempting solution is to embed Python / R support within the dataflow engine. There are two basic approaches to this today:

  1. Keep the guest language components in a separate process and use IPC (inter-process communication) to exchange input and output data between the dataflow engine and the guest language process. This approach can support the full power of the guest language, but pays a heavy price in IPC and serialisation costs.
  2. Use source-to-source (STS) translation to translate guest Continue reading