TensorFlow.js: machine learning for the web and beyond Smilkov et al., SysML’19
If machine learning and ML models are to pervade all of our applications and systems, then they’d better go to where the applications are rather than the other way round. Increasingly, that means JavaScript – both in the browser and on the server.
TensorFlow.js brings TensorFlow and Keras to the JavaScript ecosystem, supporting both Node.js and browser-based applications. As well as programmer accessibility and ease of integration, running on-device means that in many cases user data never has to leave the device.
On-device computation has a number of benefits, including data privacy, accessibility, and low-latency interactive applications.
TensorFlow.js isn’t just for model serving; you can run training with it as well. Since its launch in March 2018, people have done lots of creative things with it. And since it runs in the browser, these are all accessible to you with just one click! Some examples: …
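To give a flavour of just how accessible this is, here’s a minimal training sketch of my own (not from the paper), using the @tensorflow/tfjs package; the same code runs in the browser and under Node.js:

```typescript
import * as tf from '@tensorflow/tfjs';

async function main() {
  // A one-neuron model learning y ≈ 2x - 1, trained entirely on-device.
  const model = tf.sequential();
  model.add(tf.layers.dense({ units: 1, inputShape: [1] }));
  model.compile({ optimizer: 'sgd', loss: 'meanSquaredError' });

  const xs = tf.tensor2d([0, 1, 2, 3], [4, 1]);
  const ys = tf.tensor2d([-1, 1, 3, 5], [4, 1]);
  await model.fit(xs, ys, { epochs: 200 });

  // Inference stays local too: user data never has to leave the device.
  (model.predict(tf.tensor2d([5], [1, 1])) as tf.Tensor).print();
}

main();
```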
Fixed it for you: protocol repair using lineage graphs Oldenburg et al., CIDR’19
This is a cool paper on a number of levels. Firstly, the main result that catches my eye is that it’s possible to build a distributed systems ‘debugger’ that can suggest protocol-level fixes. E.g. say you have a system that sometimes sends acks before it really should, resulting in the possibility of state loss. Nemo can uncover this and suggest a change to the protocol that removes the window. Secondly, it uses an obscure (from the perspective of most readers of this blog) programming language called Dedalus. Dedalus is a great example of how a different programming paradigm can help us to think about a problem differently and generate new insights and possibilities. (Dedalus is a temporal logic programming language based on datalog). Now, it would be easy for practitioner readers to immediately dismiss the work as not relevant to them given they won’t be coding systems in Dedalus anytime soon. The third thing I want to highlight in this introduction therefore is the research strategy:
Nemo operates on an idealized model in which distributed executions are centrally simulated, record-level provenance of these…
Veritas: shared verifiable databases and tables in the cloud Allen et al., CIDR’19
Two (or more) parties want to transact based on the sharing of information (e.g. current offers). In order to have trust in the system and provide a foundation for resolving disputes, we’d like a tamperproof and immutable audit log of all shared data and actions, such that an independent auditor can reconstruct the state of the system at any point in time.
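Veritas’s actual design is considerably richer than this, but the basic tamper-evidence building block behind such audit logs is a hash chain: each entry commits to its predecessor, so rewriting any record invalidates every hash that follows. A toy sketch (mine, not the paper’s), using Node’s crypto module:

```typescript
import { createHash } from 'crypto';

interface LogEntry { seq: number; payload: string; prevHash: string; hash: string; }

const sha256 = (s: string) => createHash('sha256').update(s).digest('hex');
const GENESIS = '0'.repeat(64);

class AuditLog {
  private entries: LogEntry[] = [];

  // Append-only: each new entry's hash covers its sequence number,
  // its payload, and the hash of the previous entry.
  append(payload: string): LogEntry {
    const prevHash = this.entries.length > 0
      ? this.entries[this.entries.length - 1].hash
      : GENESIS;
    const seq = this.entries.length;
    const entry = { seq, payload, prevHash, hash: sha256(`${seq}|${prevHash}|${payload}`) };
    this.entries.push(entry);
    return entry;
  }

  // An independent auditor replays the log and checks every link.
  verify(): boolean {
    let prev = GENESIS;
    return this.entries.every((e, i) => {
      const ok = e.prevHash === prev && e.hash === sha256(`${i}|${prev}|${e.payload}`);
      prev = e.hash;
      return ok;
    });
  }
}

const log = new AuditLog();
log.append('offer: 100 units @ $5');
log.append('accept offer #0');
console.log(log.verify()); // true, until anyone edits an earlier entry
```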
Enter the blockchain?! Not so fast, say Allen et al.: blockchain technology as we know it today is ‘one step forward, two steps back’ ;).
Today, for gaining immutability and auditability with new blockchain platforms, we give up decades of research in data management – and hardened, enterprise-ready code that implements these ideas.
We’d still like to be able to use SQL, for example. We want transaction throughput much closer to a traditional database, and we want to take advantage of query optimisation and sophisticated query processing engines. We could try adding database-like features to blockchain systems, but that looks to be a long road:
There are now a gazillion start-ups that are adding these basic database features to blockchains,…
The case for network-accelerated query processing Lerner et al., CIDR’19
Datastores continue to advance on a number of fronts. Some of those that come to mind are adapting to faster networks (e.g. ‘FaRM: Fast Remote Memory’) and persistent memory (see e.g. ‘Let’s talk about storage and recovery methods for non-volatile memory database systems’), deeply integrating approximate query processing (e.g. ‘ApproxHadoop: Bringing approximations to MapReduce frameworks’ and ‘BlinkDB’), embedding machine learning in the core of the system (e.g. ‘SageDB’), and offloading processing into the network (e.g. KV-Direct) — one particular example of exploiting hardware accelerators. Today’s paper gives us an exciting look at the untapped potential for network-accelerated query processing. We’re going to need all that data structure synthesis and cost-model based exploration coupled with self-learning to unlock the potential that arises from all of these advances in tandem!
NetAccel uses programmable network devices to offload some query patterns for MPP databases into the switch.
Thus, for the first time, moving data through networking equipment can contribute to query execution. Our preliminary results show that we can improve response times on even the best…
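As a much simplified flavour of one offloadable pattern, here’s a toy model (mine, not NetAccel’s) of in-network partial aggregation: worker nodes emit partial GROUP BY aggregates, and a switch on the path to the coordinator merges them in flight, so the coordinator receives one pre-combined stream rather than one per worker. (The real work programs switch pipelines, e.g. in P4, not a host language.)

```typescript
// Partial results for: SELECT region, SUM(sales) FROM t GROUP BY region
type PartialAgg = Map<string, number>; // group key -> partial SUM

// What the switch's aggregation stage would do to results passing through.
function switchCombine(streams: PartialAgg[]): PartialAgg {
  const out: PartialAgg = new Map();
  for (const stream of streams)
    for (const [key, partialSum] of stream)
      out.set(key, (out.get(key) ?? 0) + partialSum);
  return out;
}

const worker1 = new Map([['eu', 10], ['us', 5]]);
const worker2 = new Map([['eu', 7]]);
const worker3 = new Map([['us', 2], ['apac', 9]]);
console.log(switchCombine([worker1, worker2, worker3]));
// Map { 'eu' => 17, 'us' => 7, 'apac' => 9 }
```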
Programming paradigms for dummies: what every programmer should know Peter Van Roy, 2009
We’ll get back to CIDR’19 next week, but chasing the thread starting with the Data Continuum paper led me to this book chapter by Peter Van Roy mapping out the space of programming language designs. (Thanks to TuringTest for posting a reference to it in a HN thread). It was too good not to take a short detour to cover it! If you like the chapter, you’ll probably enjoy the book, ‘Concepts, Techniques, and Models of Computer Programming’ by Van Roy & Haridi, on which much of this chapter was based.
This chapter gives an introduction to all the main programming paradigms, their underlying concepts, and the relationships between them… We give a taxonomy of about 30 useful programming paradigms and how they are related.
Programming paradigms are approaches based on a mathematical theory or particular set of principles, each paradigm supporting a set of concepts. Van Roy is a believer in multi-paradigm languages: solving a programming problem requires choosing the right concepts, and many problems require different sets of concepts for different parts. Moreover, many programs have to solve more than one problem!
The Data Calculator: data structure design and cost synthesis from first principles and learned cost models Idreos et al., SIGMOD’18
This paper preceded the work on design continuums that we looked at last time, and takes a more general look at interactive and semi-automated design of data structures. A data structure here is defined as a combination of (1) a data layout describing how the data is stored, and (2) algorithms that describe how its basic functionality is achieved over the specific data layout. For data structures with just two different types of nodes (e.g., leaf-nodes and non-leaf nodes in a tree), the authors estimate there are more than 10^32 possible valid data structure designs! Dozens of new data structures are published each year, but we’re still only scratching the surface.
Our intuition is that most designs (and even inventions) are about combining a small set of fundamental concepts in different ways or tunings… Our vision is to build the periodic table of data structures so we can express their massive design space. We take the first step in the paper, presenting a set of first principles that can synthesize an order of magnitude more data structure designs…
Design continuums and the path toward self-designing key-value stores that know and learn Idreos et al., CIDR’19
We’ve seen systems that help to select the best data structure from a pre-defined set of choices (e.g. ‘Darwinian data structure selection’), systems that synthesise data structure implementations given an abstract specification (‘Generalized data structure synthesis’), systems that use learning to tune configurations given a pre-defined set of configuration options (e.g, ‘BOAT: Building auto-tuners with structured Bayesian optimization’), systems that use learned models as key inputs to algorithms (e.g. ‘SageDB: a learned database system’), and systems that use reinforcement learning to discover fit-for-workload policies (‘Towards a hands-free query optimizer through deep learning’). Today’s paper choice jumps right into the middle of this mix, showing how families of data structures can be organised into design continuums with well-understood design parameters and cost models informing selection of a given data-structure design from within the space. As well as helping us to understand and reason about data structures within the space, the constrained feature space with quick-to-evaluate cost models opens the way to practical machine learning guided data structure synthesis and selection.
Towards a hands-free query optimizer through deep learning Marcus & Papaemmanouil, CIDR’19
Where the SageDB paper stopped (at the exploration of learned models to assist in query optimisation), today’s paper choice picks up, looking exclusively at the potential to apply learning (in this case deep reinforcement learning) to build a better optimiser.
Query optimisers are traditionally composed of carefully tuned and complex heuristics based on years of experience. Feedback from the actual execution of query plans can be used to update cardinality estimates. Database cracking, adaptive indexing, and adaptive query processing all incorporate elements of feedback as well.
In this vision paper, we argue that recent advances in deep reinforcement learning (DRL) can be applied to query optimization, resulting in a “hands-free” optimizer that (1) can tune itself for a particular database automatically without requiring intervention from expert DBAs, and (2) tightly incorporates feedback from past query optimizations and executions in order to improve the performance of query execution plans generated in the future.
If we view query optimisation as a DRL problem, then in reinforcement learning terminology the optimiser is the agent, the current query plan is the state, and each available action…
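To pin that framing down, here’s a toy sketch of my own (with a made-up cost model): the state is a partial left-deep plan, an action joins in one more relation, and exhaustive enumeration stands in for the policy that a DRL agent would instead learn from execution feedback:

```typescript
// State: relations already in the (left-deep) plan, and those still to join.
type State = { joined: string[]; remaining: string[] };

function applyAction(s: State, rel: string): State {
  return { joined: [...s.joined, rel], remaining: s.remaining.filter(r => r !== rel) };
}

// Hypothetical cardinalities and cost model (sum of intermediate sizes);
// a real optimiser's reward would come from estimates or actual latencies.
const cardinality: Record<string, number> = { orders: 1e6, lineitem: 6e6, customer: 150e3 };
function planCost(order: string[]): number {
  let cost = 0, rows = 1;
  for (const rel of order) {
    rows = Math.min(rows * cardinality[rel], 1e7);
    cost += rows;
  }
  return cost;
}

// Exhaustive rollouts over this tiny episode space; the learned policy
// would replace this search, choosing actions state-by-state.
function bestPlan(s: State): { order: string[]; cost: number } {
  if (s.remaining.length === 0) return { order: s.joined, cost: planCost(s.joined) };
  return s.remaining
    .map(rel => bestPlan(applyAction(s, rel)))
    .reduce((best, cand) => (cand.cost < best.cost ? cand : best));
}

console.log(bestPlan({ joined: [], remaining: ['orders', 'lineitem', 'customer'] }));
```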
SageDB: a learned database system Kraska et al., CIDR’19
About this time last year, a paper entitled ‘The case for learned index structures’ (part I, part II) generated a lot of excitement and debate. Today’s paper choice builds on that foundation, putting forward a vision where learned models pervade every aspect of a database system.
The core idea behind SageDB is to build one or more models about the data and workload distribution and based on them automatically build the best data structures and algorithms for all components of the database system. This approach, which we call “database synthesis” will allow us to achieve unprecedented performance by specializing the implementation of every database component to the specific database, query workload, and execution environment.

In the absence of runtime learning and adaptation, database systems are engineered for general purpose use and do not take full advantage of the specific characteristics of the workload and data at hand. The size of the opportunity for SageDB is the gap between such an approach and what is possible when designing a specialised solution with full knowledge of the data distribution and workload.
Consider an…
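To recall the flavour of the learned-index work this builds on (my illustration of the general idea, not SageDB code): replace a B-tree over sorted keys with a model that predicts a key’s position, plus a bounded local search to absorb the model’s worst-case error:

```typescript
// Learned index over a sorted array: fit key -> position, record the
// worst-case prediction error, then search only within that window.
function buildLearnedIndex(keys: number[]): (k: number) => number {
  const n = keys.length;
  // Least-squares fit of position i against key value k.
  const meanK = keys.reduce((a, b) => a + b, 0) / n;
  const meanI = (n - 1) / 2;
  let num = 0, den = 0;
  keys.forEach((k, i) => { num += (k - meanK) * (i - meanI); den += (k - meanK) ** 2; });
  const slope = num / den, intercept = meanI - slope * meanK;
  const predict = (k: number) => Math.round(slope * k + intercept);

  let maxErr = 0;
  keys.forEach((k, i) => { maxErr = Math.max(maxErr, Math.abs(predict(k) - i)); });

  return (k: number) => {
    const lo = Math.max(0, predict(k) - maxErr);
    const hi = Math.min(n - 1, predict(k) + maxErr);
    for (let i = lo; i <= hi; i++) if (keys[i] === k) return i;
    return -1; // not present
  };
}

const keys = Array.from({ length: 1000 }, (_, i) => i * 3); // near-linear CDF
const lookup = buildLearnedIndex(keys);
console.log(lookup(2997)); // 999
```

The SageDB pitch is that this kind of specialisation, with data structures and algorithms synthesised from models of the data and workload, applies across the whole system, not just to indexes.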
Serverless computing: one step forward, two steps back Hellerstein et al., CIDR’19
The biennial Conference on Innovative Data Systems Research has come round again. Today’s paper choice is sure to generate some healthy debate, and it’s a good set of questions to spend some time thinking over as we head into 2019: Where do you think serverless is heading? What is it good for today? What’s the end-goal here?
The authors see ‘critical gaps’ in current first-generation serverless offerings from the major cloud vendors (AWS, Azure, GCP). I’m sure some will read the paper as serverless-bashing, but I read into it more of an appeal from the heart to not stop where we are today, but to continue to pursue infrastructure and programming models truly designed for cloud platforms. Platforms that offer ‘unlimited’ data storage, ‘unlimited’ distributed processing power, and the ability to harness these only as needed.
We hope this paper shifts the discussion from ‘What is serverless?’ or ‘Will serverless win?’ to a rethinking of how we design infrastructure and programming models to spark real innovation in data-rich, cloud-scale systems. We see the future of cloud programming as far, far brighter than the promise of…
Unsupervised learning of artistic styles with archetypal style analysis Wynen et al., NeurIPS’18
I’ve always enjoyed following work on artistic style transfer. The visual nature makes it easy to gain an appreciation for what is going on, and the results are very impressive. It’s also something that’s been unfolding within the timespan of The Morning Paper, if we peg the beginning to the work of Gatys et al. in 2015. See for example the posts on ‘Texture Networks’ and ‘Deep photo style transfer.’
Beyond direct style transfer, the objective of the work described in today’s paper choice is to uncover representations of styles (archetypes) themselves. Given a large collection of paintings,…
… Our objective is to automatically discover, summarize, and manipulate artistic styles present in the collection.
This is achieved using an unsupervised learning technique called archetypal analysis. We can recover archetypes from a collection of paintings, and we can also go the other way; taking a painting and decomposing it into a combination of archetypes. And of course if we then manipulate the composition of archetypes, we can manipulate the style of an image.
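For reference, archetypal analysis (due to Cutler & Breiman) has a compact formulation. With the n style representations stacked as rows of a matrix X, we seek k archetypes Z = BX that are convex combinations of the inputs, such that each input is in turn well approximated by a convex combination of the archetypes:

```latex
\min_{A \ge 0,\; B \ge 0} \; \lVert X - A B X \rVert_F^2
\quad \text{s.t.} \quad A\mathbf{1}_k = \mathbf{1}_n, \;\; B\mathbf{1}_n = \mathbf{1}_k
```

The rows of B define the archetypes; the rows of A give each painting’s decomposition over them, which is exactly the handle used for style manipulation.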
To visualise what an archetype ‘looks like’, the authors synthesise…
Neural ordinary differential equations Chen et al., NeurIPS’18
‘Neural Ordinary Differential Equations’ won a best paper award at NeurIPS last month. It’s not an easy piece (at least not for me!), but in the spirit of ‘deliberate practice’ that doesn’t mean there isn’t something to be gained from trying to understand as much as possible.
In addition to the paper itself, I found the following additional resources to be helpful:
Consider a multi-layered neural network. We have an input layer and an output layer, and in between them, some number of hidden layers. As an input feeds forward through the network, it is progressively transformed, one layer at a time, from the input to the ultimate output. Each network layer is a step on that journey. If we take a small number of big steps, we end up with a rough approximation to the true transformation function we’d like to learn. If we take a much…
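The core move of the paper can be written out directly: a residual layer takes one discrete step, and in the limit of ever more, ever smaller steps the hidden state follows an ordinary differential equation whose solution at ‘time’ T is the network’s output:

```latex
h_{t+1} = h_t + f(h_t, \theta_t)
\qquad \longrightarrow \qquad
\frac{\mathrm{d}h(t)}{\mathrm{d}t} = f\big(h(t), t, \theta\big),
\quad h(T) = h(0) + \int_0^T f\big(h(t), t, \theta\big)\,\mathrm{d}t
```

The output can then be computed with an off-the-shelf ODE solver, trading compute for accuracy as desired.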
The tradeoffs of large scale learning Bottou & Bousquet, NIPS’07
Welcome to another year of The Morning Paper. As usual we’ll be looking at a broad cross-section of computer science research (I have over 40 conferences/workshops on my list to keep an eye on as a start!). I’ve no idea yet what papers we’ll stumble across, but if previous years are anything to go by I’m sure there’ll be plenty of great material to keep interest levels high.
To start us off, today’s paper choice is “The tradeoffs of large scale learning,” which won the ‘test of time’ award at NeurIPS last month.
this seminal work investigated the interplay between data and computation in ML, showing that if one is limited by computing power but can make use of a large dataset, it is more efficient to perform a small amount of computation on many individual training examples rather than to perform extensive computation on a subset of the data. [Google AI blog: The NeurIPS 2018 Test of Time Award].
For a given time/computation budget, are we better off performing a computationally cheaper (e.g., approximate) computation over lots of data, or a more accurate computation…
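The analytical device behind this is the paper’s decomposition of the excess error (relative to the best possible predictor) into three parts:

```latex
\mathcal{E} \;=\; \mathcal{E}_{\mathrm{app}} \;+\; \mathcal{E}_{\mathrm{est}} \;+\; \mathcal{E}_{\mathrm{opt}}
% E_app: approximation error from restricting to a model family
% E_est: estimation error from training on a finite sample
% E_opt: optimisation error from settling for an approximate minimiser
```

Under a fixed time budget, tolerating a larger optimisation error (say, by using stochastic gradient descent) frees time to process more examples, driving down the estimation error; that is the sense in which cheap computation over more data wins.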
Towards a theory of software development expertise Baltes et al., ESEC/FSE’18
This is the last paper we’ll be looking at this year, so I’ve chosen something a little more reflective to leave you with (The Morning Paper will return on Monday 7th January, 2019). The question Baltes and Diehl tackle is this: “How do you get better as a software developer?” What does expert performance look like?
We present a first conceptual theory of software development expertise that is grounded in data from a mixed-methods survey with 335 software developers and in literature on expertise and expert performance…. [the theory] describes central properties of software development expertise and important factors influencing its formation.
In essence, ask a bunch of practitioners what they think, use a disciplined coding scheme to interpret the answers (a “grounded theory”), and then layer in what we know about expertise and expert performance in general. The end result is a “conceptual theory” that shows the various contributors to expert performance and the relationships between them. “Software Development” in the current work is synonymous with “programming.”

To make the paper come alive you need to engage with it a little: Does the theory developed…
Identifying impactful service system problems via log analysis He et al., ESEC/FSE’18
If something is going wrong in your system, chances are you’ve got two main sources to help you detect and resolve the issue: logs and metrics. You’re unlikely to be able to get to the bottom of a problem using metrics alone (though you might well detect one that way), so that leaves logs as the primary diagnosis tool. The online service at Microsoft used as the main case study in the paper produces dozens of terabytes of logs every day.
Logs play a crucial role in the diagnosis of modern cloud-based online service systems. Clearly, manual problem diagnosis is very time-consuming and error-prone due to the increasing scale and complexity of large-scale systems.
Log3C analyses logs to look for indications of impactful problems, using correlated KPIs as a guide. It finds these needles in the haystack with an average precision of 0.877 and an average recall of 0.883. A distributed version of Log3C has been deployed and used in production at Microsoft for several years, both to support a massive online service (we are not told which one), and integrated into “Product B” where…
Applied machine learning at Facebook: a datacenter infrastructure perspective Hazelwood et al., HPCA’18
This is a wonderful glimpse into what it’s like when machine learning comes to pervade nearly every part of a business, with implications top-to-bottom through the whole stack. It’s amazing to step back and think just how fundamentally software systems have changed over the last decade in this regard.
Just how pervasive is machine learning at Facebook?
The modern user-experience is increasingly powered by machine learning models, and the quality of those models depends directly on the volume and quality of the data powering them: “For many machine learning models…”
Darwinian data structure selection Basios et al., FSE’18
GraphIt may have caught your attention for the success of its approach, but I suspect for many readers it’s not something you’ll be immediately applying. Darwinian Data Structures (DDSs) on the other hand looks to be of immediate interest to many Java and C++ projects (and generalises beyond those languages).
What I would have called an ADT (e.g., a List), the authors call Darwinian Data Structures. The ‘Darwinian’ part comes from the fact that ADTs have multiple concrete implementations, and Artemis, “a multi-objective, cloud-based search-based optimisation framework” finds the best implementation class (e.g. ArrayList, LinkedList) for your specific use case. It does this using the NSGA-II genetic algorithm-based optimiser in the current implementation.

In brief, Artemis finds the places in your code where you are using an ADT, and explores the possible concrete instantiation space for those ADTs using your test suite as a guide to performance. Then it outputs the transformed source. You might be wondering whether e.g. LinkedList vs ArrayList makes that big a difference in most real world projects:
Artemis achieves substantial performance improvements for every project in…
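The core search loop is easy to caricature (a single-objective miniature of mine; Artemis proper searches with NSGA-II over multiple objectives): hold the ADT fixed, swap the concrete implementation behind it, and let the workload decide:

```typescript
// The ADT ('a set of numbers') is fixed; the concrete implementation
// behind it is the variable we search over.
interface NumSet { add(v: number): void; has(v: number): boolean; }

const candidates: Record<string, () => NumSet> = {
  'hash-set': () => new Set<number>(),
  'array-scan': () => {
    const xs: number[] = [];
    return { add: v => { xs.push(v); }, has: v => xs.includes(v) };
  },
};

// Stand-in for the project's test suite, used as the fitness function.
function workload(s: NumSet): void {
  for (let i = 0; i < 20_000; i++) s.add(i % 5_000);
  for (let i = 0; i < 20_000; i++) s.has(i);
}

for (const [name, make] of Object.entries(candidates)) {
  const start = performance.now();
  workload(make());
  console.log(`${name}: ${(performance.now() - start).toFixed(1)} ms`);
}
```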
GraphIt: a high-performance graph DSL Zhang et al., OOPSLA’18
See also: http://graphit-lang.org/.
The problem with finding the optimal algorithm and data structures for a given problem is that so often it depends. This is especially true when it comes to graph algorithms.
It is difficult to implement high-performance graph algorithms. The performance bottlenecks of these algorithms depend not only on the algorithm and the underlying hardware, but also on the size and structure of the graph. As a result, different algorithms running on the same machine, or even the same algorithm running with different types of graph on the same machine, can exhibit different performance bottlenecks.
What we’d like therefore, is some way of expressing a graph algorithm at a high level such that we can map it into different implementations, each applying different optimisations as needed. For bonus points, we could then automate the search within the optimisation space to find the best performing combination for the circumstances at hand.
This is exactly what GraphIt does. GraphIt combines a DSL for specifying graph algorithms with a separate scheduling language that determines implementation policy. You can specify a schedule yourself, or use autotuning to discover optimal schedules for…
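Here’s the algorithm/schedule split in miniature (my sketch in plain TypeScript; GraphIt’s actual scheduling language also controls parallelism, data layout, and cache optimisations). The algorithm, one BFS level, is written once; whether the frontier pushes to out-neighbours or unvisited vertices pull from in-neighbours is a schedule choice:

```typescript
type Graph = { out: number[][]; in: number[][] }; // adjacency lists
type Schedule = { direction: 'push' | 'pull' };   // the tunable policy

function bfsLevel(g: Graph, frontier: Set<number>, visited: boolean[], sched: Schedule): Set<number> {
  const next = new Set<number>();
  if (sched.direction === 'push') {
    // Sparse frontier: cheaper for frontier vertices to push to neighbours.
    for (const u of frontier)
      for (const v of g.out[u])
        if (!visited[v]) { visited[v] = true; next.add(v); }
  } else {
    // Dense frontier: cheaper for unvisited vertices to pull from neighbours.
    visited.forEach((seen, v) => {
      if (!seen && g.in[v].some(u => frontier.has(u))) {
        visited[v] = true;
        next.add(v);
      }
    });
  }
  return next;
}
```

Which direction wins depends on the frontier’s density, which is exactly why baking the choice into the algorithm text is a mistake.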
MadMax: surviving out-of-gas conditions in Ethereum smart contracts Grech et al., OOPSLA’18
We’re transitioning to look at a selection of papers from the recent OOPSLA conference this week. MadMax won a distinguished paper award, and makes a nice bridge from the CCS blockchain papers we were looking at last week.
Analysis and verification of smart contracts is a high-value task, possibly more so than in any other programming setting. The combination of monetary value and public availability makes the early detection of vulnerabilities a task of paramount importance. (Detection may occur after contract deployment. Despite the code immutability, which prevents bug fixes, discovering a vulnerability before an attacker may exploit it could enable a trusted third party to move vulnerable funds to safety).
MadMax is in the same vein as Securify, performing EVM bytecode analysis using Datalog (also with Soufflé) to infer security issues in contracts. In this instance, MadMax focuses on detecting vulnerabilities caused by out-of-gas conditions. The paper touches on some nice reusable building blocks (e.g. Vandal). I could easily see Vandal + Soufflé becoming a standard foundation for powerful EVM-based smart contract analysis.
MadMax is available on GitHub at https://github.com/nevillegrech/MadMax.
RapidChain: scaling blockchain via full sharding Zamani et al., CCS’18
RapidChain is a sharding-based public blockchain protocol along the lines of OmniLedger that we looked at earlier in the year. RapidChain is resilient to Byzantine faults from up to 1/3 of its participants, requires no trusted setup, and can achieve more than 7,300 tx/sec with an expected confirmation latency of roughly 8.7 seconds in a network of 4,000 nodes with a time-to-failure of more than 4,500 years. Those are pretty interesting numbers!

RapidChain partitions the set of nodes into multiple smaller groups of nodes called committees that operate in parallel on disjoint blocks of transactions and maintain disjoint ledgers.
With $n$ nodes, each committee is of size $m = c \log n$, where $c$ is a security parameter typically set around 20. E.g. with 1,000 nodes we’ll have around 17 committees of 60 nodes each. To make all this work we’ll need a number of different parts:
The initial set of participants start…