The Morning Paper Archives - Page 14 of 16

Debugging data flows in reactive programs

Debugging data flows in reactive programs Banken et al., ICSE’18

To round off our look at papers from ICSE, here’s a really interesting look at the challenges of debugging reactive applications (with a certain Erik Meijer credited among the authors).

… in recent years the use of Reactive Programming (RP) has exploded. Languages such as Elm and libraries such as Reactor, Akka, and Rx are being used by companies such as Netflix, Microsoft, and Google, to build highly responsive and scalable systems.

The rise of reactive programming fits well with the increasing need to process streams of data. In a reactive program, you set up a data processing pipeline and then wait for input to arrive.

Many RP implementations share a notion of a collection that abstracts over time, in contrast to space like standard collections. This collection comes in different flavors, such as Observable (Rx)… the implementations differ in the precise semantics of their collections, their execution model (push/pull) , and the set of available operators.

While in theory events and threads are duals, in practice the RP abstraction works very well for expressing streaming pipelines. If you have an abstraction over time though, Continue reading

How not to structure your database-backed web applications: a study of performance bugs in the wild

How not to structure your database-backed web applications: a study of performance bugs in the wild Yang et al., ICSE’18

This is a fascinating study of the problems people get into when using ORMs to handle persistence concerns in their web applications. The authors study real-world applications and distil a catalogue of common performance anti-patterns. There are a bunch of familiar things in the list, and a few that surprised me with the amount of difference they can make. By fixing many of the issues that they find, Yang et al., are able to quantify how many lines of code it takes to address the issue, and what performance improvement the fix delivers.

To prove our point, we manually fix 64 performance issues in [the latest versions of the applications under study] and obtain a median speed-up of 2x (and up to 39x max) with fewer than 5 lines of code change in most cases.

The Hyperloop website provides access to a tool you can use to identify and solve some of the common performance issues in your own (Rails) apps.

I’m going to skip the intro parts about what ORMs do and how a typical web app Continue reading

Secure coding practices in Java: challenges and vulnerabilities

Secure coding practices in Java: challenges and vulnerabilities Meng et al., ICSE’18

TL;DR : don’t trust everything you read on Stack Overflow.

Meng et al. conduct a study of Stack Overflow posts relating to secure coding practices in Java to find out the hot topics, what people struggle with, and whether or not the accepted answers are actually following security best practices.

We conducted an empirical study on Stack Overflow posts, aiming to understand developer’s concerns on Java secure coding, their programming obstacles, and insecure coding practices. We observed a wide adoption of the authentication and authorization features provided by Spring Security — a third-party framework designed to secure enterprise applications…

Well, how could I resist reading that! (Some readers may know that I was for many years the CTO of SpringSource). Spring Security does come in for some flak in this paper for the high volume of questions that are asked relating to it. There’s no calibration though for underlying popularity. One of the reasons there are a lot of questions, I posit, is that there are an awful lot of users of Spring Security. Spring Boot applications will use Spring Security, and Spring Boot has been growing Continue reading

Deep code search

Deep code search Gu et al., ICSE’18

The problem with searching for code is that the query, e.g. “read an object from xml,” doesn’t look very much like the source code snippets that are the intended results, e.g.:

That’s why we have Stack Overflow! Stack Overflow can help with ‘how to’ style queries, but it can’t help with searches inside codebases you care about. For example, “where in this codebase are events queued on a thread?”

…an effective code search engine should be able to understand the semantic meanings of natural language queries and source code in order to improve the accuracy of code search.

DeepCS is just such a search engine for code, based on the CODEnn (Code-Description Embedding Neural Network) network model. During training, it takes code snippets (methods) and corresponding natural language descriptions (from the method comments) and learns a joint-embedding. I.e., it learns embeddings such that a method description and its corresponding code snippet are both mapped to a similar point in the same shared embedding space. Then given a natural language query, it can embed the query in vector space and look for nearby code snippets. Compared Continue reading

To distribute or not to distribute? Why licensing bugs matter

To distribute or not to distribute? Why licensing bugs matter Vendome et al., ICSE’18

Software licensing can quickly get quite complicated, with over 100 known open source licenses out there, and distributions often including components with a mix of licenses. Unsurprisingly, developers find it hard to determine appropriate licenses for their work, and to interpret the implications of including third-party software under different licenses.

We present a large-scale qualitative study aimed at characterizing licensing bugs, with the goal of understanding the types of licensing bugs developers face, their legal and technical implications, and how such bugs are fixed.

The result is a helpful catalogue of seven different categories of licensing bugs, with 21 sub-categories in total between them. Although the authors are not lawyers (as far as I can tell), it still constitutes a very useful list of things to think about. “Our proposed catalog can serve as a reference for developers and lawyers dealing with potential licensing issues.”

The catalogue is drawn from an open coding exercise based on a statistically significant sample of 1,200 discussions randomly selected from a population of 59,426 discussions across a collection of issue trackers and mailing lists. The mailing lists Continue reading

Automated localization for unreproducible builds

Automated localization for unreproducible builds Ren et al., ICSE’18

Reproducible builds are an important component of integrity in the software supply chain. Attacks against package repositories and build environments may compromise binaries and produce packages with backdoors (see this report for a recent prominent example of compromised packages on DockerHub). If the same source files always lead to the same binary packages, then an infected binary can be much more easily detected. Unfortunately, reproducible builds have not traditionally been the norm. Non-determinism creeping into build processes means that rebuilding an application from the exact same source, even within a secure build environment, can often lead to a different binary.

Due to the significant benefits, many open-source software repositories have initiated their validation processes. These repositories include GNU/Linux distributions such as Debian and Guix, as well as software systems like Bitcoin.

If you have a non-reproducible build, finding out why can be non-trivial. It takes time and a lot of effort to hunt down and eradicate the causes. For example, Debian unstable for AMD64 still had 2,342 packages with non-reproducible builds as of August 2017. (The number today as I’m writing this is 2,826). You can see a stubbornly persistent Continue reading

Generalized data structure synthesis

Generalized data structure synthesis Loncaric et al., ICSE’18

Many systems have a few key data structures at their heart. Finding correct and efficient implementations for these data structures is not always easy. Today’s paper introduces Cozy (https://cozy.uwplse.org), which can handle this task for you given a high-level specification of the state, queries, and update operations that need to be supported.

Cozy has three goals: to reduce programmer effort, to produce bug-free code, and to match the performance of handwritten code. We found that using Cozy requires an order of magnitude fewer lines of code than manual implementation, makes no mistakes even when human programmers do, and often matches the performance of handwritten code.

Let’s start out by looking at four case studies from the evaluation, to get a feel for where Cozy applies.

ZTopo is a topological map viewer implemented in C++. A core data structure is the map tile cache for map tiles that are asynchronously loaded over the network and cached on disk or in memory.
Sat4j is a Boolean sat solver implemented in Java. A core data structure is the variable store.
Openfire is a large scalable IRC server implemented in Java. Continue reading

ConflictJS: finding and understanding conflicts between JavaScript libraries

ConflictJS: finding and understanding conflicts between JavaScript libraries Patra et al., ICSE’18

The JavaScript ecosystem is fertile ground for dependency hell. With so many libraries being made available and the potential for global namespace clashes, it’s easy for libraries to break each other. Sometimes in an obvious to spot way (that’s a good day!), and sometimes in subtle ways that are harder to detect.

ConflictJS is a tool for finding conflicting JavasScript libraries. It’s available as open source and nicely documented, so you can try it for yourself from https://github.com/sola-da/ConflictJS.

We use ConflictJS to analyze and study conflicts among 951 real-world libraries. The results show that one out of four libraries is potentially conflicting and that 166 libraries are involved in at least one certain conflict.

Why do conflicts happen?

At a language level, until ES6 modules at least, there was no built-in namespacing mechanism (though we do have a number of conventions and module libraries). In principle developers can follow a ‘single API object’ pattern where the entire API of a library is encapsulated behind a single object. In practice, many of them don’t (71% of libraries did not do this, from 951 studied for this Continue reading

Debugging with intelligence via probabilistic inference

Debugging with intelligence via probabilistic inference Xu et al., ICSE’18

Xu et al. have built a automated debugger that can take a single failing test execution, and with minimal interaction from a human, pinpoint the root cause of the failure. What I find really exciting about it, is that instead of brute force there’s a certain encoded intelligence in the way the analysis is undertaken which feels very natural. The first IDE / editor to integrate a tool like this wins!

The authors don’t give a name to their tool in the paper, which is going to make it awkward to refer to during this write-up. So I shall henceforth refer to it as the PI Debugger. PI here stands for probabilistic inference.

We model debugging as a probabilistic inference problem, in which the likelihood of each executed statement instance and variable being correct/faulty is modeled by a random variable. Human knowledge, human-like reasoning rules and program semantics are modeled as conditional probability distributions, also called probabilistic constraints. Solving these constraints identifies the most likely faulty statements.

In the evaluation, when debugging problems in large projects, it took on average just 3 interactions with a developer to find the Continue reading

DeepTest: automated testing of deep-neural-network-driven autonomous cars

DeepTest: automated testing of deep-neural-network-driven autonomous cars Tian et al., ICSE’18

How do you test a DNN? We’ve seen plenty of examples of adversarial attacks in previous editions of The Morning Paper, but you couldn’t really say that generating adversarial images is enough to give you confidence in the overall behaviour of a model under all operating conditions. Adversarial images approach things from a ‘think like an attacker’ mindset. We want to ‘think like a tester.’ For example, the work on DeepXplore which uses model ensembles to find differences in outputs that suggest bugs. The importance of testing DNNs is especially obvious when it comes to applications such as autonomous driving. Several of the ideas from DeepXplore are used in DeepTest, which looks specifically at testing of autonomous driving system. I think you could apply the DeepTest techniques to test other kinds of DNNs as well.

…despite the tremendous progress, just like traditional software, DNN-based software, including the ones used for autonomous driving, often demonstrate incorrect/unexpected corner-case behaviours that lead to dangerous consequences like a fatal collision.

DeepTest is a system designed to aid in the testing of autonomous driving models. When used to test three of Continue reading

Popular is cheaper: curtailing memory costs in interactive analytics engines

(If you don’t have ACM Digital Library access, the paper can be accessed either by following the link above directly from The Morning Paper blog site).

We’re sticking with the optimisation of data analytics today, but at the other end of the spectrum to the work on smart arrays that we looked at yesterday. Getafix (extra points for the Asterix-inspired name, especially as it works with Yahoo!’s Druid cluster) is aimed at reducing the memory costs for large-scale in-memory data analytics, without degrading performance of course. It does this through an intelligent placement strategy that decides on replication level and data placement for data segments based on the changing popularity of those segments over time. Experiments with workloads from Yahoo!’s production Druid cluster that Getafix can reduce memory footprint by 1.45-2.15x while maintaining comparable average and tail latencies. If you translate that into a public cloud setting, and assuming a 100TB hot dataset size — a conservative estimate in the Yahoo! case — we’re looking at savings on the order of $10M per year.

Real-time analytics is projected to Continue reading

Analytics with smart arrays: adaptive and efficient language-independent data

Analytics with smart arrays: adaptive and efficient language-independent data Psaroudakis et al., EuroSys’18

(If you don’t have ACM Digital Library access, the paper can be accessed either by following the link above directly from The Morning Paper blog site).

We’re going lower-level today, with a look at some work on adaptive data structures by Oracle. It’s motivated by a desire to speed up big data analytic workloads that are “increasingly limited by simple bottlenecks within the machine.” The initial focus is on array processing, but the ambition is to extend the work to more data types in time.

Modern servers have multiple interconnected sockets of multi-core processors. Each socket has local memory, accessible via a cache-coherent non-uniform memory access (ccNUMA) architecture. In the NUMA world the following hold true:

remote memory accesses are slower than local accesses
bandwidth to a socket’s memory and interconnect can be separately saturated
the bandwidth of an interconnect is often much lower than a socket’s local memory bandwidth

If we want to crunch through an array as fast as possible in a NUMA world, the optimum way of doing it depends on the details of the machine, and on the application Continue reading

Medea: scheduling of long running applications in shared production clusters

Medea: scheduling of long running applications in shared production clusters Garefalakis et al., EuroSys’18

(If you don’t have ACM Digital Library access, the paper can be accessed either by following the link above directly from The Morning Paper blog site).

We’re sticking with schedulers today, and a really interesting system called Medea which is designed to support the common real world use case of mixed long running applications (LRAs) and shorter duration tasks within the same cluster. The work is grounded in production cluster workloads at Microsoft and is now part of the Apache Hadoop 3.1 release. In the evaluation, when compared to the Kubernetes’ scheduling algorithm Medea reduces median runtimes by up to 32%, and by 2.1x compared to the previous generation YARN scheduler.

…a substantial portion of production clusters today is dedicated to LRAs…. placing LRAs, along with batch jobs, in shared clusters is appealing to reduce cluster operational costs, avoid unnecessary data movement, and enable pipelines involving both classes of applications. Despite these observations, support for LRAs in existing schedulers is rudimentary.

The challenges of scheduling long-running applications

Example uses of long running application containers include streaming systems, interactive data-intensive applications (maintaining Continue reading

Optimus: an efficient dynamic resource scheduler for deep learning clusters

Optimus: an efficient dynamic resource scheduler for deep learning clusters Peng et al., EuroSys’18

(If you don’t have ACM Digital Library access, the paper can be accessed either by following the link above directly from The Morning Paper blog site).

It’s another paper promising to reduce your deep learning training times today. But instead of improving the programming model and/or dataflow engine, Optimus improves the scheduling of jobs within a cluster. You can run it on top of Kubernetes, and the authors claim about a 1.6x reduction in makespan compared to the mostly widely used schedulers today.

Deep learning clusters

We’re using ever larger models, with ever increasing amounts of data (at least, whenever we can get our hands on it). In general this improves the learning accuracy, but it also increases the training time. The most common approach is parallel training using a machine learning cluster. Typically a model is partitioned among multiple parameter servers, and training data is spread across multiple workers. Workers compute parameter updates and push them to the respective parameter server.

Training is an iterative process with a dataset divided into chunks, and each chunk further divided into mini-batches. A Continue reading

Improving the expressiveness of deep learning frameworks with recursion

Improving the expressiveness of deep learning frameworks with recursion Jeong, Jeong et al., EuroSys’18

(If you don’t have ACM Digital Library access, the paper can be accessed either by following the link above directly from The Morning Paper blog site).

Last week we looked at the embedded dynamic control flow operators in TensorFlow. In today’s paper choice, Jeong et al. make the case for support of an additional control flow construct: recursion. A little recursion it turns out, can go a long way. Implemented on top of TensorFlow (and with a design that should also work for other embedded control flow machine learning frameworks e.g. Theano, Caffe, MXNet), support for recursion enables cleaner expression of a class of model architectures, and improved performance. The performance gains come from the increased opportunities to exploit parallelism within the recursive definitions.

In this paper, we introduce recursive definitions into the programming model of existing embedded control flow frameworks, adding first-class support for recursion. By allowing users to directly express recursive definitions in application code with enhanced programmability, models with recursive data structures such as trees or graphs can be written without requiring users to use a separate complex API Continue reading

BDS: A centralized near-optimal overlay network for inter-datacenter data replication

BDS: A centralized near-optimal overlay network for inter-datacenter data replication Zhang et al., EuroSys’18

(If you don’t have ACM Digital Library access, the paper can be accessed either by following the link above directly from The Morning Paper blog site).

This is the story of how inter-datacenter multicast transfers at Baidu were sped-up by a factor of 3-5x. That’s a big deal!

For large-scale online service providers, such as Google, Facebook, and Baidu, an important data communication pattern is inter-DC multicast of bulk data — replicating massive amounts of data (e.g., user logs, web search indexes, photo sharing, blog posts) from one DC to multiple DCs in geo-distributed locations.

To set the scene, the authors study inter-DC traffic at Baidu over a period of seven days. Nearly all inter-DC traffic is multicast (91.1%), highlighting the importance of optimising the multicast use case.

When looking at the individual transfers, there is great diversity in the source and destination DCs. Thus it’s not going to suffice to pre-configure a few select routes: “we need a system to automatically route and schedule any given inter-DC multicast transfers.”

60% of the transferred files are over 1TB Continue reading

Dynamic control flow in large-scale machine learning

Dynamic control flow in large-scale machine learning Yu et al., EuroSys’18

(If you don’t have ACM Digital Library access, the paper can be accessed either by following the link above directly from The Morning Paper blog site).

In 2016 the Google Brain team published a paper giving an overview of TensorFlow, “TensorFlow: a system for large-scale machine learning.” This paper is a follow-up, taking a much deeper look at how TensorFlow supports dynamic control flow, including extending automatic differentiation to control flow constructs.

Embedding control flow within the dataflow graph

With a wide range of machine learning models in use, and rapid exploration of new techniques, a machine learning system needs to be expressive and flexible to support both research and production use cases. Given the ever larger models and training sets, a machine learning system also needs to be scalable. These means both using individual devices efficiently (anything from phones to custom ASCIs in datacenters), and also supporting parallel execution over multiple devices.

Both the building blocks of machine learning and the architectures built up using these blocks have been changing rapidly. This pace appears likely to continue. Therefore, rather than defining RNNs, MoEs Continue reading

Reducing DRAM footprint with NVM in Facebook

Reducing DRAM footprint with NVM in Facebook Eisenman et al., EuroSys’18

(If you don’t have ACM Digital Library access, the paper can be accessed either by following the link above directly from The Morning Paper blog site).

…to the best of our knowledge, this is the first study on the usage of NVM devices in a commercial data center environment.

We’ve been watching NVM coming for some time now, so it’s exciting to see a paper describing its adoption within Facebook. MyRocks is Facebook’s primary MySQL database, and is used to store petabytes of data and to serve real-time user activities. MyRocks uses RocksDB as the storage engine, and a typical server consumes 128GB of DRAM and 3 TB of flash. It all seems to work well, so what’s the problem? Spiralling costs!

As DRAM facing major scaling challenges, its bit supply growth rate has experienced a historic low. Together with the growing demand for DRAM, these trends have led to problems in global supply, increasing total cost of ownership (TCO) for data center providers. Over the last year, for example, the average DRAM DDR4 price has increased by 2.3x.

Just using less DRAM per server Continue reading

ServiceFabric: a distributed platform for building microservices in the cloud

ServiceFabric: a distributed platform for building microservices in the cloud Kakivaya et al., EuroSys’18

(If you don’t have ACM Digital Library access, the paper can be accessed either by following the link above directly from The Morning Paper blog site).

Microsoft’s Service Fabric powers many of Azure’s critical services. It’s been in development for around 15 years, in production for 10, and was made available for external use in 2015.

ServiceFabric (SF) enables application lifecycle management of scalable and reliable applications composed of microservices running at very high density on a shared pool of machines, from development to deployment to management.

Some interesting systems running on top of SF include:

Azure SQL DB (100K machines, 1.82M DBs containing 3.48PB of data)
Azure Cosmos DB (2 million cores and 100K machines)
Skype
Azure Event Hub
Intune
Azure IoT suite
Cortana

SF runs in multiple clusters each with 100s to many 100s of machines, totalling over 160K machines with over 2.5M cores.

Positioning & Goals

Service Fabric defies easy categorisation, but the authors describe it as “Microsoft’s platform to support microservice applications in cloud settings.” What particularly makes it stand out from the crowd Continue reading

Hyperledger fabric: a distributed operating system for permissioned blockchains

Hyperledger fabric: a distributed operating system for permissioned blockchains Androulaki et al., EuroSys’18

(If you don’t have ACM Digital Library access, the paper can be accessed either by following the link above directly from The Morning Paper blog site).

This very well written paper outlines the design of HyperLedger Fabric and the rationales for many of the key design decisions. It’s a great introduction and overview. Fabric is a permissioned blockchain system with the following key features:

A modular design allows many components to be pluggable, including the consensus algorithm
Instead of the order-execute architecture used by virtually all existing blockchain systems, Fabric uses an execute-order-validate paradigm which enables a combination of passive and active replication. (We’ll be getting into this in much more detail shortly).
Smart contracts can be written in any language.

…in popular deployment configurations, Fabric achieves throughput of more than 3500 tps, achieving finality with latency of a few hundred ms and scaling well to over 100 peers.

Examples of use cases powered by Fabric include foreign exchange netting in which a blockchain is used to resolve trades that aren’t settling; enterprise asset management tracking hardware assets as they move from manufacturing to Continue reading

« Previous 1 … 12 13 14 15 16 Next »