Author Archives: adriancolyer
Privacy risks with Facebook’s PII-based targeting: auditing a data broker’s advertising interface Venkatadri et al., IEEE Security and Privacy 2018
This is one of those jaw-hits-the-floor, can’t-quite-believe-what-I’m-reading papers. The authors describe an attack exploiting Facebook’s custom audience feature that can leak your PII.
Specifically, we show how the adversary can infer users’ full phone numbers knowing just their email address, determine whether a particular user visited a website, and de-anonymize all the visitors to a website by inferring their phone numbers en masse. These attacks can be conducted without any interaction with the victim(s), cannot be detected by the victim(s), and do not require the adversary to spend money or actually place an ad.
Following responsible disclosure of the attack vectors to Facebook, Facebook acknowledged the vulnerability and has put a fix in place (not giving audience size estimates under certain scenarios). The experiments conducted by the authors were performed between January and March 2017, and presumably the disclosure happened around that time or shortly afterwards. That probably means your PII on Facebook was vulnerable from when the custom audiences feature was first introduced until early 2017. Someone with more time could probably put Continue reading
The rise of the citizen developer: assessing the security impact of online app generators Oltrogge et al., IEEE Security & Privacy 2018
“Low code”, “no code”, “citizen developers”, call it what you will, there’s been a big rise in platforms that seek to make it easy for non-expert developers to build applications. Today’s paper choice studies the online application generator (OAG) market for Android applications. What used to be a web site (with many successful web site templating and building options around) is now in many cases also, or instead, a mobile app, so it makes sense that the same kind of templating and building approach should exist there too. For a brief period at the end of last year, Apple flirted with banning such apps from their app store, before back-tracking just a couple of weeks after the initial announcement. After reading today’s paper I can’t help but feel that perhaps they were on to something. Not that templated apps are bad per se, but when the generated apps contain widespread vulnerabilities and privacy issues, then that is bad.
With the increasing use of OAGs the duty of generating secure code shifts away from the Continue reading
Debugging data flows in reactive programs Banken et al., ICSE’18
To round off our look at papers from ICSE, here’s a really interesting look at the challenges of debugging reactive applications (with a certain Erik Meijer credited among the authors).
… in recent years the use of Reactive Programming (RP) has exploded. Languages such as Elm and libraries such as Reactor, Akka, and Rx are being used by companies such as Netflix, Microsoft, and Google, to build highly responsive and scalable systems.
The rise of reactive programming fits well with the increasing need to process streams of data. In a reactive program, you set up a data processing pipeline and then wait for input to arrive.
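If you haven’t met this style before, here’s a minimal sketch of such a pipeline using RxJava (one of the Rx-family libraries mentioned above). The pipeline and its data are purely illustrative, not an example taken from the paper.

```java
import io.reactivex.rxjava3.core.Observable;

public class PipelineSketch {
    public static void main(String[] args) {
        // Declare the pipeline up front; processing happens as events flow through it.
        Observable<String> pageViews = Observable.just("home", "cart", "checkout", "cart");

        pageViews
            .filter(page -> !page.equals("home"))    // drop events we don't care about
            .map(String::toUpperCase)                // transform each event
            .distinct()                              // suppress duplicates
            .subscribe(page -> System.out.println("visited: " + page));
    }
}
```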
Many RP implementations share a notion of a collection that abstracts over time, in contrast to space like standard collections. This collection comes in different flavors, such as Observable (Rx)… the implementations differ in the precise semantics of their collections, their execution model (push/pull), and the set of available operators.
While in theory events and threads are duals, in practice the RP abstraction works very well for expressing streaming pipelines. If you have an abstraction over time though, Continue reading
How not to structure your database-backed web applications: a study of performance bugs in the wild Yang et al., ICSE’18
This is a fascinating study of the problems people get into when using ORMs to handle persistence concerns in their web applications. The authors study real-world applications and distil a catalogue of common performance anti-patterns. There are a bunch of familiar things in the list, and a few that surprised me with the amount of difference they can make. By fixing many of the issues that they find, Yang et al. are able to quantify how many lines of code it takes to address each issue, and what performance improvement the fix delivers.
To prove our point, we manually fix 64 performance issues in [the latest versions of the applications under study] and obtain a median speed-up of 2x (and up to 39x max) with fewer than 5 lines of code change in most cases.
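The applications studied are Rails apps, but many of the anti-patterns have the same shape in any ORM. As a purely illustrative sketch (not an example from the paper), here’s the classic N+1 query problem and its few-line fix expressed in JPA-style Java:

```java
import javax.persistence.*;
import java.util.List;

@Entity class Author { @Id Long id; String name; }
@Entity class Post { @Id Long id; @ManyToOne(fetch = FetchType.LAZY) Author author; }

class NPlusOneSketch {
    // Anti-pattern: 1 query to load the posts, then 1 additional query per post
    // when its lazily-loaded author is first touched.
    static void slow(EntityManager em) {
        List<Post> posts = em.createQuery("SELECT p FROM Post p", Post.class).getResultList();
        for (Post p : posts) {
            System.out.println(p.author.name);   // triggers a lazy load per post
        }
    }

    // Fix (a one-line change to the query): fetch the association up front.
    static void fast(EntityManager em) {
        List<Post> posts = em.createQuery(
                "SELECT p FROM Post p JOIN FETCH p.author", Post.class).getResultList();
        for (Post p : posts) {
            System.out.println(p.author.name);   // already loaded, no extra queries
        }
    }
}
```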
The Hyperloop website provides access to a tool you can use to identify and solve some of the common performance issues in your own (Rails) apps.
I’m going to skip the intro parts about what ORMs do and how a typical web app Continue reading
Secure coding practices in Java: challenges and vulnerabilities Meng et al., ICSE’18
TL;DR : don’t trust everything you read on Stack Overflow.
Meng et al. conduct a study of Stack Overflow posts relating to secure coding practices in Java to find out the hot topics, what people struggle with, and whether or not the accepted answers are actually following security best practices.
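A classic example of the kind of insecure practice that shows up in accepted answers is disabling TLS certificate validation to make a connection error ‘go away’. The snippet below is my own illustration of the pattern, not code from the paper:

```java
import javax.net.ssl.*;
import java.security.SecureRandom;
import java.security.cert.X509Certificate;

class CertValidation {
    // INSECURE: a "trust everything" manager, frequently pasted in to silence
    // certificate errors. It disables server authentication entirely, opening
    // the door to man-in-the-middle attacks.
    static SSLContext insecureContext() throws Exception {
        TrustManager[] trustAll = { new X509TrustManager() {
            public void checkClientTrusted(X509Certificate[] chain, String authType) {}
            public void checkServerTrusted(X509Certificate[] chain, String authType) {}
            public X509Certificate[] getAcceptedIssuers() { return new X509Certificate[0]; }
        }};
        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(null, trustAll, new SecureRandom());
        return ctx;
    }

    // The secure default: rely on the platform trust store and hostname verification.
    static SSLContext secureContext() throws Exception {
        return SSLContext.getDefault();
    }
}
```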
We conducted an empirical study on Stack Overflow posts, aiming to understand developers’ concerns on Java secure coding, their programming obstacles, and insecure coding practices. We observed a wide adoption of the authentication and authorization features provided by Spring Security — a third-party framework designed to secure enterprise applications…
Well, how could I resist reading that! (Some readers may know that I was for many years the CTO of SpringSource). Spring Security does come in for some flak in this paper for the high volume of questions that are asked relating to it. There’s no calibration though for underlying popularity. One of the reasons there are a lot of questions, I posit, is that there are an awful lot of users of Spring Security. Spring Boot applications will use Spring Security, and Spring Boot has been growing Continue reading
Deep code search Gu et al., ICSE’18
The problem with searching for code is that the query, e.g. “read an object from xml,” doesn’t look very much like the source code snippets that are the intended results, e.g.:
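(As an illustration, here’s a sketch of the kind of Java method such a query should surface, assuming a JAXB-based approach; the snippet shown in the paper may well differ.)

```java
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import java.io.File;

class XmlReader {
    // Read (deserialize) an object of the given type from an XML file.
    static <T> T readFromXml(Class<T> type, File xml) throws Exception {
        JAXBContext context = JAXBContext.newInstance(type);
        Unmarshaller unmarshaller = context.createUnmarshaller();
        return type.cast(unmarshaller.unmarshal(xml));
    }
}
```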
That’s why we have Stack Overflow! Stack Overflow can help with ‘how to’ style queries, but it can’t help with searches inside codebases you care about. For example, “where in this codebase are events queued on a thread?”
…an effective code search engine should be able to understand the semantic meanings of natural language queries and source code in order to improve the accuracy of code search.
DeepCS is just such a search engine for code, based on the CODEnn (Code-Description Embedding Neural Network) network model. During training, it takes code snippets (methods) and corresponding natural language descriptions (from the method comments) and learns a joint-embedding. I.e., it learns embeddings such that a method description and its corresponding code snippet are both mapped to a similar point in the same shared embedding space. Then given a natural language query, it can embed the query in vector space and look for nearby code snippets. Compared Continue reading
To distribute or not to distribute? Why licensing bugs matter Vendome et al., ICSE’18
Software licensing can quickly get quite complicated, with over 100 known open source licenses out there, and distributions often including components with a mix of licenses. Unsurprisingly, developers find it hard to determine appropriate licenses for their work, and to interpret the implications of including third-party software under different licenses.
We present a large-scale qualitative study aimed at characterizing licensing bugs, with the goal of understanding the types of licensing bugs developers face, their legal and technical implications, and how such bugs are fixed.
The result is a helpful catalogue of seven different categories of licensing bugs, with 21 sub-categories in total between them. Although the authors are not lawyers (as far as I can tell), it still constitutes a very useful list of things to think about. “Our proposed catalog can serve as a reference for developers and lawyers dealing with potential licensing issues.”
The catalogue is drawn from an open coding exercise based on a statistically significant sample of 1,200 discussions randomly selected from a population of 59,426 discussions across a collection of issue trackers and mailing lists. The mailing lists Continue reading
Automated localization for unreproducible builds Ren et al., ICSE’18
Reproducible builds are an important component of integrity in the software supply chain. Attacks against package repositories and build environments may compromise binaries and produce packages with backdoors (see this report for a recent prominent example of compromised packages on DockerHub). If the same source files always lead to the same binary packages, then an infected binary can be much more easily detected. Unfortunately, reproducible builds have not traditionally been the norm. Non-determinism creeping into build processes means that rebuilding an application from the exact same source, even within a secure build environment, can often lead to a different binary.
Due to the significant benefits, many open-source software repositories have initiated their validation processes. These repositories include GNU/Linux distributions such as Debian and Guix, as well as software systems like Bitcoin.
If you have a non-reproducible build, finding out why can be non-trivial. It takes time and a lot of effort to hunt down and eradicate the causes. For example, Debian unstable for AMD64 still had 2,342 packages with non-reproducible builds as of August 2017. (The number today as I’m writing this is 2,826). You can see a stubbornly persistent Continue reading
Generalized data structure synthesis Loncaric et al., ICSE’18
Many systems have a few key data structures at their heart. Finding correct and efficient implementations for these data structures is not always easy. Today’s paper introduces Cozy (https://cozy.uwplse.org), which can handle this task for you given a high-level specification of the state, queries, and update operations that need to be supported.
Cozy has three goals: to reduce programmer effort, to produce bug-free code, and to match the performance of handwritten code. We found that using Cozy requires an order of magnitude fewer lines of code than manual implementation, makes no mistakes even when human programmers do, and often matches the performance of handwritten code.
Let’s start out by looking at four case studies from the evaluation, to get a feel for where Cozy applies.
ConflictJS: finding and understanding conflicts between JavaScript libraries Patra et al., ICSE’18
The JavaScript ecosystem is fertile ground for dependency hell. With so many libraries being made available and the potential for global namespace clashes, it’s easy for libraries to break each other. Sometimes in ways that are obvious to spot (that’s a good day!), and sometimes in subtle ways that are harder to detect.
ConflictJS is a tool for finding conflicting JavaScript libraries. It’s available as open source and nicely documented, so you can try it for yourself from https://github.com/sola-da/ConflictJS.
We use ConflictJS to analyze and study conflicts among 951 real-world libraries. The results show that one out of four libraries is potentially conflicting and that 166 libraries are involved in at least one certain conflict.
At a language level, until ES6 modules at least, there was no built-in namespacing mechanism (though we do have a number of conventions and module libraries). In principle developers can follow a ‘single API object’ pattern where the entire API of a library is encapsulated behind a single object. In practice, many of them don’t (71% of libraries did not do this, from 951 studied for this Continue reading
Debugging with intelligence via probabilistic inference Xu et al., ICSE’18
Xu et al. have built an automated debugger that can take a single failing test execution, and with minimal interaction from a human, pinpoint the root cause of the failure. What I find really exciting about it is that instead of brute force there’s a certain encoded intelligence in the way the analysis is undertaken, which feels very natural. The first IDE / editor to integrate a tool like this wins!
The authors don’t give a name to their tool in the paper, which is going to make it awkward to refer to during this write-up. So I shall henceforth refer to it as the PI Debugger. PI here stands for probabilistic inference.
We model debugging as a probabilistic inference problem, in which the likelihood of each executed statement instance and variable being correct/faulty is modeled by a random variable. Human knowledge, human-like reasoning rules and program semantics are modeled as conditional probability distributions, also called probabilistic constraints. Solving these constraints identifies the most likely faulty statements.
In the evaluation, when debugging problems in large projects, it took on average just 3 interactions with a developer to find the Continue reading
DeepTest: automated testing of deep-neural-network-driven autonomous cars Tian et al., ICSE’18
How do you test a DNN? We’ve seen plenty of examples of adversarial attacks in previous editions of The Morning Paper, but you couldn’t really say that generating adversarial images is enough to give you confidence in the overall behaviour of a model under all operating conditions. Adversarial images approach things from a ‘think like an attacker’ mindset. We want to ‘think like a tester.’ Consider, for example, the work on DeepXplore, which uses model ensembles to find differences in outputs that suggest bugs. The importance of testing DNNs is especially obvious when it comes to applications such as autonomous driving. Several of the ideas from DeepXplore are used in DeepTest, which looks specifically at testing of autonomous driving systems. I think you could apply the DeepTest techniques to test other kinds of DNNs as well.
…despite the tremendous progress, just like traditional software, DNN-based software, including the ones used for autonomous driving, often demonstrate incorrect/unexpected corner-case behaviours that lead to dangerous consequences like a fatal collision.
DeepTest is a system designed to aid in the testing of autonomous driving models. When used to test three of Continue reading
Popular is cheaper: curtailing memory costs in interactive analytics engines Ghosh et al., EuroSys’18
(If you don’t have ACM Digital Library access, the paper can be accessed by following the link above directly from The Morning Paper blog site).
We’re sticking with the optimisation of data analytics today, but at the other end of the spectrum to the work on smart arrays that we looked at yesterday. Getafix (extra points for the Asterix-inspired name, especially as it works with Yahoo!’s Druid cluster) is aimed at reducing the memory costs for large-scale in-memory data analytics, without degrading performance of course. It does this through an intelligent placement strategy that decides on replication level and data placement for data segments based on the changing popularity of those segments over time. Experiments with workloads from Yahoo!’s production Druid cluster show that Getafix can reduce memory footprint by 1.45-2.15x while maintaining comparable average and tail latencies. If you translate that into a public cloud setting, and assuming a 100TB hot dataset size — a conservative estimate in the Yahoo! case — we’re looking at savings on the order of $10M per year.
Real-time analytics is projected to Continue reading
Analytics with smart arrays: adaptive and efficient language-independent data Psaroudakis et al., EuroSys’18
(If you don’t have ACM Digital Library access, the paper can be accessed by following the link above directly from The Morning Paper blog site).
We’re going lower-level today, with a look at some work on adaptive data structures by Oracle. It’s motivated by a desire to speed up big data analytic workloads that are “increasingly limited by simple bottlenecks within the machine.” The initial focus is on array processing, but the ambition is to extend the work to more data types in time.
Modern servers have multiple interconnected sockets of multi-core processors. Each socket has local memory, accessible via a cache-coherent non-uniform memory access (ccNUMA) architecture. In the NUMA world, the cost of a memory access depends on whether the memory is local or remote to the socket doing the accessing.
If we want to crunch through an array as fast as possible in a NUMA world, the optimum way of doing it depends on the details of the machine, and on the application Continue reading
Medea: scheduling of long running applications in shared production clusters Garefalakis et al., EuroSys’18
(If you don’t have ACM Digital Library access, the paper can be accessed by following the link above directly from The Morning Paper blog site).
We’re sticking with schedulers today, and a really interesting system called Medea which is designed to support the common real-world use case of mixed long running applications (LRAs) and shorter duration tasks within the same cluster. The work is grounded in production cluster workloads at Microsoft and is now part of the Apache Hadoop 3.1 release. In the evaluation, when compared to Kubernetes’ scheduling algorithm, Medea reduces median runtimes by up to 32%, and by 2.1x compared to the previous-generation YARN scheduler.
…a substantial portion of production clusters today is dedicated to LRAs…. placing LRAs, along with batch jobs, in shared clusters is appealing to reduce cluster operational costs, avoid unnecessary data movement, and enable pipelines involving both classes of applications. Despite these observations, support for LRAs in existing schedulers is rudimentary.
Example uses of long running application containers include streaming systems, interactive data-intensive applications (maintaining Continue reading
Optimus: an efficient dynamic resource scheduler for deep learning clusters Peng et al., EuroSys’18
(If you don’t have ACM Digital Library access, the paper can be accessed by following the link above directly from The Morning Paper blog site).
It’s another paper promising to reduce your deep learning training times today. But instead of improving the programming model and/or dataflow engine, Optimus improves the scheduling of jobs within a cluster. You can run it on top of Kubernetes, and the authors claim about a 1.6x reduction in makespan compared to the most widely used schedulers today.
We’re using ever larger models, with ever increasing amounts of data (at least, whenever we can get our hands on it). In general this improves the learning accuracy, but it also increases the training time. The most common approach is parallel training using a machine learning cluster. Typically a model is partitioned among multiple parameter servers, and training data is spread across multiple workers. Workers compute parameter updates and push them to the respective parameter server.
Training is an iterative process with a dataset divided into chunks, and each chunk further divided into mini-batches. A Continue reading
Improving the expressiveness of deep learning frameworks with recursion Jeong, Jeong et al., EuroSys’18
(If you don’t have ACM Digital Library access, the paper can be accessed by following the link above directly from The Morning Paper blog site).
Last week we looked at the embedded dynamic control flow operators in TensorFlow. In today’s paper choice, Jeong et al. make the case for support of an additional control flow construct: recursion. A little recursion, it turns out, can go a long way. Implemented on top of TensorFlow (and with a design that should also work for other embedded control flow machine learning frameworks, e.g. Theano, Caffe, MXNet), support for recursion enables cleaner expression of a class of model architectures, and improved performance. The performance gains come from the increased opportunities to exploit parallelism within the recursive definitions.
In this paper, we introduce recursive definitions into the programming model of existing embedded control flow frameworks, adding first-class support for recursion. By allowing users to directly express recursive definitions in application code with enhanced programmability, models with recursive data structures such as trees or graphs can be written without requiring users to use a separate complex API Continue reading
BDS: A centralized near-optimal overlay network for inter-datacenter data replication Zhang et al., EuroSys’18
(If you don’t have ACM Digital Library access, the paper can be accessed by following the link above directly from The Morning Paper blog site).
This is the story of how inter-datacenter multicast transfers at Baidu were sped-up by a factor of 3-5x. That’s a big deal!
For large-scale online service providers, such as Google, Facebook, and Baidu, an important data communication pattern is inter-DC multicast of bulk data — replicating massive amounts of data (e.g., user logs, web search indexes, photo sharing, blog posts) from one DC to multiple DCs in geo-distributed locations.
To set the scene, the authors study inter-DC traffic at Baidu over a period of seven days. Nearly all inter-DC traffic is multicast (91.1%), highlighting the importance of optimising the multicast use case.

When looking at the individual transfers, there is great diversity in the source and destination DCs. Thus it’s not going to suffice to pre-configure a few select routes: “we need a system to automatically route and schedule any given inter-DC multicast transfers.”
60% of the transferred files are over 1TB Continue reading
Dynamic control flow in large-scale machine learning Yu et al., EuroSys’18
(If you don’t have ACM Digital Library access, the paper can be accessed by following the link above directly from The Morning Paper blog site).
In 2016 the Google Brain team published a paper giving an overview of TensorFlow, “TensorFlow: a system for large-scale machine learning.” This paper is a follow-up, taking a much deeper look at how TensorFlow supports dynamic control flow, including extending automatic differentiation to control flow constructs.
With a wide range of machine learning models in use, and rapid exploration of new techniques, a machine learning system needs to be expressive and flexible to support both research and production use cases. Given ever larger models and training sets, a machine learning system also needs to be scalable. This means both using individual devices efficiently (anything from phones to custom ASICs in datacenters), and also supporting parallel execution over multiple devices.
Both the building blocks of machine learning and the architectures built up using these blocks have been changing rapidly. This pace appears likely to continue. Therefore, rather than defining RNNs, MoEs Continue reading
Reducing DRAM footprint with NVM in Facebook Eisenman et al., EuroSys’18
(If you don’t have ACM Digital Library access, the paper can be accessed by following the link above directly from The Morning Paper blog site).
…to the best of our knowledge, this is the first study on the usage of NVM devices in a commercial data center environment.
We’ve been watching NVM coming for some time now, so it’s exciting to see a paper describing its adoption within Facebook. MyRocks is Facebook’s primary MySQL database, and is used to store petabytes of data and to serve real-time user activities. MyRocks uses RocksDB as the storage engine, and a typical server consumes 128GB of DRAM and 3 TB of flash. It all seems to work well, so what’s the problem? Spiralling costs!
As DRAM is facing major scaling challenges, its bit supply growth rate has experienced a historic low. Together with the growing demand for DRAM, these trends have led to problems in global supply, increasing total cost of ownership (TCO) for data center providers. Over the last year, for example, the average DRAM DDR4 price has increased by 2.3x.
Just using less DRAM per server Continue reading