adriancolyer

Author Archives: adriancolyer

Delayed impact of fair machine learning

Delayed impact of fair machine learning Liu et al., ICML’18

“Delayed impact of fair machine learning” won a best paper award at ICML this year. It’s not an easy read (at least it wasn’t for me), but fortunately it’s possible to appreciate the main results without following all of the proof details. The central question is how to ensure fair treatment across demographic groups in a population when using a score-based machine learning model to decide who gets an opportunity (e.g. is offered a loan) and who doesn’t. Most recently we looked at the equal opportunity and equalized odds models.

The underlying assumption of course for studied fairness models is that the fairness criteria promote the long-term well-being of those groups they aim to protect. The big result in this paper is that you can easily up end ‘killing them with kindness’ instead. The potential for this to happen exists when there is a feedback loop in place in the overall system. By overall system here, I mean the human system of which the machine learning model is just a small part. Using the loan/no-loan decision that is a popular study vehicle in fairness papers, we need to Continue reading

Bounding data races in space and time – part II

Bounding data races in space and time Dolan et al., PLDI’18

Yesterday we looked at the case for memory models supporting local data-race-freedom (local DRF). In today’s post we’ll push deeper into the paper and look at a memory model which does just that.

Consider a memory store S which maps locations to values. For non-atomic locations it’s not enough just to store a single memory value at the location. Instead non-atomic locations are mapped to histories H, where a history is a finite map from timestamps to values, and timestamps are totally ordered.

Every thread has a frontier, F, which is a map from non-atomic locations to timestamps.

Intuitively, each thread’s frontier records, for each nonatomic location, the latest write known to the thread. More recent writes may have occurred, but are not guaranteed to be visible.

For atomic locations, the memory store records a pair (F,x), with a single data value x rather than a history. F is a frontier, which will be merged with the frontiers of threads that operate on the location. This is expressed in the rules READ-AT and WRITE-AT in the operational semantics below. The READ-AT rule, for example, should be Continue reading

Bounding data races in space and time – part I

Bounding data races in space and time Dolan et al., PLDI’18

Are you happy with your programming language’s memory model? In this beautifully written paper, Dolan et al. point out some of the unexpected behaviours that can arise in mainstream memory models (C++, Java) and why we might want to strive for something better. Then they show a comprehensible (!) memory model that offers good performance and supports local reasoning. The work is being done to provide a foundation for the multicore implementation of OCaml, but should be of interest much more broadly. There’s so much here that it’s worth taking our time over it, so I’m going to spread my write-up over a number of posts.

Today we’ll be looking at the concept of local data-race-freedom (local DRF) and why we might want this property in a programming language.

Mainstream memory models don’t support local reasoning

Modern processors and compilers have all sorts of trickery they can deploy to make your program run faster. The optimisations don’t always play well with parallel execution though.

To benefit from these optimisations, mainstream languages such as C++ and Java have adopted complicated memory models which specify which of these relaxed Continue reading

HHVM JIT: A profile-guided, region-based compiler for PHP and Hack

HHVM JIT: A profile-guided, region-based compiler for PHP and Hack Ottoni, PLDI’18

HHVM is a virtual machine for PHP and Hack (a PHP extension) which is used to power Facebook’s website among others. Today’s paper choice describes the second generation HHVM implementation, which delivered a 21.7% performance boost when running the Facebook website compared to the previous HHVM implementation.

…the PHP code base that runs the Facebook website includes tens of millions of lines of source code, which are translated to hundreds of megabytes of machine code during execution.

I’m clearly suffering from an over-simplified understanding of what the Facebook web application actually does, but at the same time if I asked you to write a Facebook clone for just the website (not the backing services, not the mobile apps, etc.), would your initial estimate be on the order of tens of millions of lines of code???!

HHVM high-level overview

The starting point for HHVM is source code in PHP or Hack. Hack is a PHP dialect used by Facebook and includes support for a richer set of type hints. From the perspective of HHVM though the two languages are fundamentally equivalent. In particular, Hack’s type hints are Continue reading

BLeak: automatically debugging memory leaks in web applications

BLeak: Automatically debugging memory leaks in web applications Vilk & Berger, PLDI’18

BLeak is a Browser Leak debugger that finds memory leaks in web applications. You can use BLeak to test your own applications by following the instructions at http://bleak-detector.org.

Guided by BLeak, we identify and fix over 50 memory leaks in popular libraries and apps including Airbnb, AngularJS, Google Analytics, Google Maps SDK, and jQuery. BLeak’s median precision is 100%; fixing the leaks it identifies reduces heap growth by an average of 94%, saving from 0.5MB to 8MB per round trip.

Why are web application memory leaks so problematic?

Memory leaks in web applications are a pervasive problem. They lead to higher garbage collection frequency and overhead, reduced application responsiveness, and even browser tab crashes. Existing memory leak detection approaches don’t work well in the browser environment though:

  • Staleness detection assumes leaked memory is rarely touched, but web apps regularly interact with leaked state (e.g. via event listeners).
  • Growth-based technique assume that leaked objects are uniquely owned, or that leaked objects from strongly connected components in the heap graph. In a web application, objects frequently have multiple owners, and all roads lead back to window.
  • Techniques Continue reading

Here we go again!

It’s time to start a new term on #themorningpaper. I read my very first #themorningpaper on the 30th July 2014 (“Why functional programming matters”, Hughes 1990) and since then, bar three scheduled breaks a year, I’ve been reading a research paper every weekday. Since the 8th October 2014, I’ve also been posting a write-up of the day’s paper on The Morning Paper blog. There’s not always an exact 1:1 correspondence between a post and a paper, but it’s pretty good. On that basis, you can now find somewhere in the order of 840 paper write-ups on the blog, and we’re racing towards the 1,000 mark!

People often ask, and yes, it’s a lot of work to curate papers, read them, and post the write-ups! Somewhere on the order of 15-20 hours a week (and all outside of my regular work commitments). I’ve got a tremendous amount of value out of this habit and I intend to keep it going. However, starting again this term it suddenly felt like a big load to carry – might be something to do with the fact that we’re moving house in a couple of weeks! So to keep things feeling fun for me and Continue reading

End of term

It’s time to take my regular summer break from writing The Morning Paper – normal service will resume again on Monday 6th August. I’ll be topping up my paper backlog and scouting out interesting research during the break. If you’ve seen any great papers I haven’t already covered and that you think ‘The Morning Paper’ readers would enjoy, please do send them my way.

In the meantime, here are a few picks from papers we’ve covered this term in case you missed any of them. (In order of publication on the blog)

Many thanks to all of you that follow along and support the blog – the accountability keeps me sticking at it when I’m snowed under, and Continue reading

Oblix: an efficient oblivious search index

Oblix: an efficient oblivious search index Mishra et al., IEEE Security & Privacy 2018

Unfortunately, many known schemes that enable search queries on encrypted data achieve efficiency at the expense of security, as they reveal access patterns to the encrypted data. In this paper we present Oblix, a search index for encrypted data that is oblivious (provably hides access patterns) is dynamic (supports inserts and deletes), and has good efficiency.

There’s a lot to this paper! Starting with a recap of existing work on Path ORAM (Oblivious RAM) and Oblivious Data Structures (ODS), Mishra introduce an extension for an Oblivious Sorted Multimap (OSM) (such that you can look up a key and find a set of associated values, handy for building indexes!). Then because their design runs a client-proxy process inside an enclave at the server, and enclaves still leak some information, they also design “doubly-oblivious” versions of all of the above that hide the client access patterns in addition to those at the server process. It’s all topped off with an implementation in Rust (nice to see Rust being used for systems research), and an evaluation with three prime use cases: private contact discovery in Signal, Continue reading

EnclaveDB: a secure database using SGX

EnclaveDB: A secure database using SGX Priebe et al., IEEE Security & Privacy 2018

This is a really interesting paper (if you’re into this kind of thing I guess!) bringing together the security properties of Intel’s SGX enclaves with the Hekaton SQL Server database engine. The result is a secure database environment with impressive runtime performance. (In the read-mostly TATP benchmarks, overheads are down around 15%, which is amazing for this level of encryption and security). The paper does a great job showing us all of the things that needed to be considered to make EnclaveDB work so well in this environment.

One of my favourite takeaways is that we don’t always have to think of performance and security as trade-offs:

In this paper, we show that the principles behind the design of a high performance database engine are aligned with security. Specifically, in-memory tables and indexes are ideal data structures for securely hosting and querying sensitive data in enclaves.

Motivation and threat model

We host databases in all sorts of untrusted environments, potentially with unknown database administrators, server administrators, OS and hypervisors. How can we guarantee data security and integrity in such a world? Or even how Continue reading

Grand Pwning Unit: Accelerating microarchitectural attacks with the GPU

Grand Pwning Unit: Accelerating microarchitectural attacks with the GPU Frigo et al., IEEE Security & Privacy

The general awareness of microarchitectural attacks is greatly increased since meltdown and spectre earlier this year. A lot of time and energy has been spent in defending against such attacks, with a threat model that assumes attacks originate from the CPU. Frigo et al. open up an entirely new can of worms – modern SoCs contain a variety of special purpose accelerator units, chief among which is the GPU. GPUs are everywhere these days.

Unfortunately, the inclusion of these special-purpose units in the processor today appears to be guided by a basic security model that mainly governs access control, while entirely ignoring the threat of more advanced microarchitectural attacks.

I’m sure you know where this is heading…

It turns out the accelerators can also be used to “accelerate” microarchitectural attacks. Once more we find ourselves in a situation with widespread vulnerabilities. The demonstration target in the paper is a mobile phone running on the ARM platform, with all known defences, including any applicable advanced research defences, employed. Using WebGL from JavaScript, Frigo et al. show how to go from e.g. an advert Continue reading

Privacy risks with Facebook’s PII-based targeting: auditing a data broker’s advertising interface

Privacy risks with Facebook’s PII-based targeting: auditing a data broker’s advertising interface Venkatadri et al., IEEE Security and Privacy 2018

This is one of those jaw-hits-the-floor, can’t quite believe what I’m reading papers. The authors describe an attack exploiting Facebook’s custom audience feature, that can leak your PII.

Specifically, we show how the adversary can infer user’s full phone numbers knowing just their email address, determine whether a particular user visited a website, and de-anonymize all the visitors to a website by inferring their phone numbers en masse. These attacks can be conducted without any interaction with the victim(s), cannot be detected by the victim(s), and do not require the adversary to spend money or actually place an ad.

Following responsible disclosure of the attack vectors to Facebook, Facebook acknowledged the vulnerability and have put in place a fix (not giving audience size estimates under certain scenarios). The experiments conducted by the authors were performed between January and March 2017, and presumably the disclosure happened around that time or shortly afterwards. That probably means your PII on Facebook was vulnerable from when the custom audiences feature was first introduced, until early 2017. Someone with more time could probably put Continue reading

The rise of the citizen developer: assessing the security impact of online app generators

The rise of the citizen developer: assessing the security impact of online app generators Oltrogge et al., IEEE Security & Privacy 2018

“Low code”, “no code”, “citizen developers”, call it what you will, there’s been a big rise in platforms that seek to make it easy to develop applications for non-export developers. Today’s paper choice studies the online application generator (OAG) market for Android applications. When what used to be a web site (with many successful web site templating and building options around) is often in many cases now also or instead a mobile app, so it makes sense that the same kind of templating and building approach should exist there too. For a brief period at the end of last year, Apple flirted with banning such apps from their app store, before back-tracking just a couple of weeks after the initial announcement. After reading today’s paper I can’t help but feel that perhaps they were on to something. Not that templated apps are bad per se, but when the generated apps contain widespread vulnerabilities and privacy issues, then that is bad.

With the increasing use of OAGs the duty of generating secure code shifts away from the Continue reading

Debugging data flows in reactive programs

Debugging data flows in reactive programs Banken et al., ICSE’18

To round off our look at papers from ICSE, here’s a really interesting look at the challenges of debugging reactive applications (with a certain Erik Meijer credited among the authors).

… in recent years the use of Reactive Programming (RP) has exploded. Languages such as Elm and libraries such as Reactor, Akka, and Rx are being used by companies such as Netflix, Microsoft, and Google, to build highly responsive and scalable systems.

The rise of reactive programming fits well with the increasing need to process streams of data. In a reactive program, you set up a data processing pipeline and then wait for input to arrive.

Many RP implementations share a notion of a collection that abstracts over time, in contrast to space like standard collections. This collection comes in different flavors, such as Observable (Rx)… the implementations differ in the precise semantics of their collections, their execution model (push/pull) , and the set of available operators.

While in theory events and threads are duals, in practice the RP abstraction works very well for expressing streaming pipelines. If you have an abstraction over time though, Continue reading

How not to structure your database-backed web applications: a study of performance bugs in the wild

How not to structure your database-backed web applications: a study of performance bugs in the wild Yang et al., ICSE’18

This is a fascinating study of the problems people get into when using ORMs to handle persistence concerns in their web applications. The authors study real-world applications and distil a catalogue of common performance anti-patterns. There are a bunch of familiar things in the list, and a few that surprised me with the amount of difference they can make. By fixing many of the issues that they find, Yang et al., are able to quantify how many lines of code it takes to address the issue, and what performance improvement the fix delivers.

To prove our point, we manually fix 64 performance issues in [the latest versions of the applications under study] and obtain a median speed-up of 2x (and up to 39x max) with fewer than 5 lines of code change in most cases.

The Hyperloop website provides access to a tool you can use to identify and solve some of the common performance issues in your own (Rails) apps.

I’m going to skip the intro parts about what ORMs do and how a typical web app Continue reading

Secure coding practices in Java: challenges and vulnerabilities

Secure coding practices in Java: challenges and vulnerabilities Meng et al., ICSE’18

TL;DR : don’t trust everything you read on Stack Overflow.

Meng et al. conduct a study of Stack Overflow posts relating to secure coding practices in Java to find out the hot topics, what people struggle with, and whether or not the accepted answers are actually following security best practices.

We conducted an empirical study on Stack Overflow posts, aiming to understand developer’s concerns on Java secure coding, their programming obstacles, and insecure coding practices. We observed a wide adoption of the authentication and authorization features provided by Spring Security — a third-party framework designed to secure enterprise applications…

Well, how could I resist reading that! (Some readers may know that I was for many years the CTO of SpringSource). Spring Security does come in for some flak in this paper for the high volume of questions that are asked relating to it. There’s no calibration though for underlying popularity. One of the reasons there are a lot of questions, I posit, is that there are an awful lot of users of Spring Security. Spring Boot applications will use Spring Security, and Spring Boot has been growing Continue reading

Deep code search

Deep code search Gu et al., ICSE’18

The problem with searching for code is that the query, e.g. “read an object from xml,” doesn’t look very much like the source code snippets that are the intended results, e.g.:

*

That’s why we have Stack Overflow! Stack Overflow can help with ‘how to’ style queries, but it can’t help with searches inside codebases you care about. For example, “where in this codebase are events queued on a thread?”

…an effective code search engine should be able to understand the semantic meanings of natural language queries and source code in order to improve the accuracy of code search.

DeepCS is just such a search engine for code, based on the CODEnn (Code-Description Embedding Neural Network) network model. During training, it takes code snippets (methods) and corresponding natural language descriptions (from the method comments) and learns a joint-embedding. I.e., it learns embeddings such that a method description and its corresponding code snippet are both mapped to a similar point in the same shared embedding space. Then given a natural language query, it can embed the query in vector space and look for nearby code snippets. Compared Continue reading

To distribute or not to distribute? Why licensing bugs matter

To distribute or not to distribute? Why licensing bugs matter Vendome et al., ICSE’18

Software licensing can quickly get quite complicated, with over 100 known open source licenses out there, and distributions often including components with a mix of licenses. Unsurprisingly, developers find it hard to determine appropriate licenses for their work, and to interpret the implications of including third-party software under different licenses.

We present a large-scale qualitative study aimed at characterizing licensing bugs, with the goal of understanding the types of licensing bugs developers face, their legal and technical implications, and how such bugs are fixed.

The result is a helpful catalogue of seven different categories of licensing bugs, with 21 sub-categories in total between them. Although the authors are not lawyers (as far as I can tell), it still constitutes a very useful list of things to think about. “Our proposed catalog can serve as a reference for developers and lawyers dealing with potential licensing issues.”

The catalogue is drawn from an open coding exercise based on a statistically significant sample of 1,200 discussions randomly selected from a population of 59,426 discussions across a collection of issue trackers and mailing lists. The mailing lists Continue reading

Automated localization for unreproducible builds

Automated localization for unreproducible builds Ren et al., ICSE’18

Reproducible builds are an important component of integrity in the software supply chain. Attacks against package repositories and build environments may compromise binaries and produce packages with backdoors (see this report for a recent prominent example of compromised packages on DockerHub). If the same source files always lead to the same binary packages, then an infected binary can be much more easily detected. Unfortunately, reproducible builds have not traditionally been the norm. Non-determinism creeping into build processes means that rebuilding an application from the exact same source, even within a secure build environment, can often lead to a different binary.

Due to the significant benefits, many open-source software repositories have initiated their validation processes. These repositories include GNU/Linux distributions such as Debian and Guix, as well as software systems like Bitcoin.

If you have a non-reproducible build, finding out why can be non-trivial. It takes time and a lot of effort to hunt down and eradicate the causes. For example, Debian unstable for AMD64 still had 2,342 packages with non-reproducible builds as of August 2017. (The number today as I’m writing this is 2,826). You can see a stubbornly persistent Continue reading

Generalized data structure synthesis

Generalized data structure synthesis Loncaric et al., ICSE’18

Many systems have a few key data structures at their heart. Finding correct and efficient implementations for these data structures is not always easy. Today’s paper introduces Cozy (https://cozy.uwplse.org), which can handle this task for you given a high-level specification of the state, queries, and update operations that need to be supported.

Cozy has three goals: to reduce programmer effort, to produce bug-free code, and to match the performance of handwritten code. We found that using Cozy requires an order of magnitude fewer lines of code than manual implementation, makes no mistakes even when human programmers do, and often matches the performance of handwritten code.

Let’s start out by looking at four case studies from the evaluation, to get a feel for where Cozy applies.

  • ZTopo is a topological map viewer implemented in C++. A core data structure is the map tile cache for map tiles that are asynchronously loaded over the network and cached on disk or in memory.
  • Sat4j is a Boolean sat solver implemented in Java. A core data structure is the variable store.
  • Openfire is a large scalable IRC server implemented in Java. Continue reading

ConflictJS: finding and understanding conflicts between JavaScript libraries

ConflictJS: finding and understanding conflicts between JavaScript libraries Patra et al., ICSE’18

The JavaScript ecosystem is fertile ground for dependency hell. With so many libraries being made available and the potential for global namespace clashes, it’s easy for libraries to break each other. Sometimes in an obvious to spot way (that’s a good day!), and sometimes in subtle ways that are harder to detect.

ConflictJS is a tool for finding conflicting JavasScript libraries. It’s available as open source and nicely documented, so you can try it for yourself from https://github.com/sola-da/ConflictJS.

We use ConflictJS to analyze and study conflicts among 951 real-world libraries. The results show that one out of four libraries is potentially conflicting and that 166 libraries are involved in at least one certain conflict.

Why do conflicts happen?

At a language level, until ES6 modules at least, there was no built-in namespacing mechanism (though we do have a number of conventions and module libraries). In principle developers can follow a ‘single API object’ pattern where the entire API of a library is encapsulated behind a single object. In practice, many of them don’t (71% of libraries did not do this, from 951 studied for this Continue reading