Author Archives: adriancolyer

Unveiling and quantifying Facebook exploitation of sensitive personal data for advertising purposes

Unveiling and quantifying Facebook exploitation of sensitive personal data for advertising purposes Cabañas et al., USENIX Security 2018

Earlier this week we saw how the determined can still bypass most browser and tracker-blocking extension protections to track users around the web. Today’s paper is a great example of why you should care about that. Cabañas et al. examine the extent to which the profile Facebook builds on its users includes sensitive personal data made available to advertisers for targeting. The work was done just prior to the GDPR coming into force, which makes for very interesting timing from a legal perspective. The headline result is that it looks like Facebook is holding sensitive data on about 40% of the EU population, and that this data can be used by third-parties to target individuals in sensitive demographics and even identify them at a cost of as little as €0.015 per user.

What kinds of sensitive data are we talking about?

The GDPR definition of sensitive personal data is “data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, and the processing of genetic data, biometric data for the purpose of uniquely identifying Continue reading

Who left open the cookie jar? A comprehensive evaluation of third-party cookie policies

Who left open the cookie jar? A comprehensive evaluation of third-party cookie policies Franken et al., USENIX Security 2018

This paper won a ‘Distinguished paper’ award at USENIX Security 2018, as well as the 2018 Internet Defense Prize. It’s an evaluation of the defense mechanisms built into browsers (and via extensions / add-ons) that seek to protect against user tracking and cross-site attacks. Testing across 7 browsers and 46 browser extensions, the authors find that for virtually every browser and extension combination there is a way to bypass the intended security policies.

Despite their significant merits, the way cookies are implemented in most modern browsers also introduces a variety of attacks and other unwanted behavior. More precisely, because cookies are attached to every request, including third-party requests, it becomes more difficult for websites to validate the authenticity of a request. Consequently, an attacker can trigger requests with a malicious payload from the browser of an unknowing victim… Next to cross-site attacks, the inclusion of cookies in third-party requests also allows for users to be tracked across the various websites they visit.
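Because cookies ride along on third-party requests by default, one widely supported application-side mitigation is the SameSite cookie attribute — and whether browsers actually honour such policies is exactly what this paper tests. Here's a minimal sketch using Python's standard http.cookies module to show the two flavours of cookie (the attribute names follow the cookie RFCs; enforcement is up to the browser):

```python
from http import cookies

# A classic cookie: no SameSite attribute, so browsers attach it to
# third-party requests too, enabling CSRF payloads and cross-site tracking.
legacy = cookies.SimpleCookie()
legacy["session"] = "abc123"

# A defended cookie: SameSite=Strict asks the browser to withhold it from
# all cross-site requests; Secure and HttpOnly narrow exposure further.
defended = cookies.SimpleCookie()
defended["session"] = "abc123"
defended["session"]["samesite"] = "Strict"
defended["session"]["secure"] = True
defended["session"]["httponly"] = True

print(legacy.output())
print(defended.output())
```

The paper's finding, of course, is that for virtually every browser/extension combination there is some request type that slips past policies like this one.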

When you visit a site A, it can set a cookie to be included in Continue reading

Fear the reaper: characterization and fast detection of card skimmers

Fear the reaper: characterization and fast detection of card skimmers Scaife et al., USENIX Security 2018

Until I can get my hands on a Skim Reaper I’m not sure I’ll ever trust an ATM or other exposed card reading device (e.g., at garages) again!

Scaife et al. conduct a study of skimming devices found by the NYPD Financial Crimes Task Force over a 16-month period. The bad news is that you and I don’t really have much chance of detecting a deployed card skimming device (most of the folk wisdom about how to do so doesn’t really work). The good news is that the Skim Reaper detection device developed in this research project was able to effectively detect 100% of the devices supplied by the NYPD. That’s great if you happen to have a Skim Reaper handy to test with before using an ATM. The NYPD are now actively using a set of such devices in the field.

Card skimmers and why they’re so hard for end-users to detect

Almost as well-known as (credit and debit) cards themselves is the ease with which fraud can be committed against them. Attackers often acquire card data using skimmers Continue reading

STTR: A system for tracking all vehicles all the time at the edge of the network

STTR: A system for tracking all vehicles all the time at the edge of the network Xu et al., DEBS’18

With apologies for only bringing you two paper write-ups this week: we moved house, which turns out to be not at all conducive to quiet study of research papers!

Today’s smart camera surveillance systems are largely alert based, which gives two main modes of operation: either you know in advance the vehicles of interest so that you can detect them in real time, or you have to trawl through lots of camera footage post-facto (expensive and time-consuming). STTR is a system designed to track all of the vehicles all of the time, and store their trajectories forever. I certainly have mixed feelings about the kinds of state surveillance and privacy invasions that enables (it’s trivial to link back to individuals given trajectories over time), but here we’ll just focus on the technology. Since the system is designed with pluggable detection and matching algorithms, given some calculations around volume it ought to be possible to use it to track objects other than vehicles. People, for example?

Assuming the availability of suitable detection and matching (figuring out if Continue reading

Learning the structure of generative models without labeled data

Learning the structure of generative models without labeled data Bach et al., ICML’17

For the last couple of posts we’ve been looking at Snorkel and BabbleLabble which both depend on data programming – the ability to intelligently combine the outputs of a set of labelling functions. The core of data programming is developed in two papers, ‘Data programming: creating large training sets, quickly’ (Ratner 2016) and today’s paper choice, ‘Learning the structure of generative models without labeled data’ (Bach 2017).

The original data programming paper works explicitly with input pairs (x,y) (e.g. the chemical and disease word pairs we saw from the disease task in Snorkel) which (for me at least) confuses the presentation a little compared to the later ICML paper which just assumes inputs x (which could of course have pair structure, but we don’t care about that at this level of detail). Also in the original paper dependencies between labelling functions are explicitly specified by end users (as one of four types: similar, fixing, reinforcing, and exclusive) and built into a factor graph. In the ICML paper dependencies are learned. So I’m going to work mostly from ‘Learning the structure of generative Continue reading

Training classifiers with natural language explanations

Training classifiers with natural language explanations Hancock et al., ACL’18

We looked at Snorkel earlier this week, which demonstrates that maybe AI isn’t going to take over all of our programming jobs. Instead, we’ll be writing labelling functions to feed the machine! Perhaps we could call this task label engineering. To me, it feels a bit like programming a quick-and-dirty expert system, where the downstream generative model deals with all the inaccuracies and inconsistencies so that we don’t have to be perfect, just useful. Given the success of the approach, a natural question to ask is how we can enable end users to more easily create useful labelling functions in their domain. This is where BabbleLabble comes in!

In this work, we propose BabbleLabble, a framework for training classifiers in which an annotator provides a natural language explanation for each labeling decision. A semantic parser converts these explanations into programmatic labeling functions that generate noisy labels for an arbitrary amount of unlabeled data, which is used to train a classifier.
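To make that concrete, here is a hand-written sketch of the kind of labelling function BabbleLabble's semantic parser might emit for an explanation like "True, because the words 'his wife' appear between person 1 and person 2". The function and field names below are illustrative, not BabbleLabble's actual API:

```python
# Label values: the function may also abstain when the explanation's
# condition doesn't fire, rather than forcing a guess.
ABSTAIN, FALSE, TRUE = 0, -1, 1

def lf_spouse_phrase(example):
    """Label a candidate (person1, person2) pair as spouses if a marriage
    keyword appears between the two person mentions."""
    between = example["between"]  # tokens between the two person mentions
    if "wife" in between or "husband" in between:
        return TRUE
    return ABSTAIN

example = {"p1": "Ann", "p2": "Bob", "between": ["and", "his", "wife"]}
print(lf_spouse_phrase(example))  # 1
```

The point of the semantic parser is that an annotator writes only the English sentence; functions like this fall out automatically, noise and all, and the downstream model cleans up after them.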

So much for those programming jobs!?

Working with BabbleLabble, it takes users about twice as long per example to provide a label plus an explanation as it does just Continue reading

Snorkel: rapid training data creation with weak supervision

Snorkel: rapid training data creation with weak supervision Ratner et al., VLDB’18

Earlier this week we looked at Sparser, which comes from the Stanford Dawn project, “a five-year research project to democratize AI by making it dramatically easier to build AI-powered applications.” Today’s paper choice, Snorkel, is from the same stable. It tackles one of the central questions in supervised machine learning: how do you get a large enough set of training data to power modern deep models?

…deep learning has a major upfront cost: these methods need massive training sets of labeled examples to learn from – often tens of thousands to millions to reach peak predictive performance. Such training sets are enormously expensive to create…

Snorkel lets you throw everything you’ve got at the problem. Heuristics, external knowledge bases, crowd-sourced workers, you name it. These are known as weak supervision sources because they may be limited in accuracy and coverage. All of these get combined in a principled manner to produce a set of probability-weighted labels. The authors call this process ‘data programming’. The end model is then trained on the generated labels.
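A toy illustration of the idea: several noisy labelling functions vote on unlabeled examples. Snorkel itself fits a generative model over the functions' agreements and disagreements to estimate their accuracies and produce probabilistic labels; the unweighted vote below is a deliberately simplified stand-in, with function names invented for the disease-mention flavour of task:

```python
ABSTAIN, NEG, POS = 0, -1, 1

# Three weak supervision sources: a heuristic, a negation rule, and a
# stand-in for an external knowledge base lookup.
def lf_keyword(x):   return POS if "cause" in x else ABSTAIN
def lf_negation(x):  return NEG if "not" in x else ABSTAIN
def lf_knowledge(x): return POS if "aspirin" in x else ABSTAIN

def vote(x, lfs):
    """Combine labelling function outputs; abstentions don't count."""
    votes = [lf(x) for lf in lfs if lf(x) != ABSTAIN]
    if not votes:
        return ABSTAIN
    score = sum(votes) / len(votes)
    return POS if score > 0 else NEG if score < 0 else ABSTAIN

lfs = [lf_keyword, lf_negation, lf_knowledge]
print(vote("aspirin may cause reye syndrome", lfs))  # 1
```

Replacing the naive vote with a learned model of each function's accuracy (and their dependencies) is precisely the 'principled manner' the authors refer to.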

Snorkel is the first system to implement our recent work Continue reading

Filter before you parse: faster analytics on raw data with Sparser

Filter before you parse: faster analytics on raw data with Sparser Palkar et al., VLDB’18

We’ve been parsing JSON for over 15 years. So it’s surprising and wonderful that with a fresh look at the problem the authors of this paper have been able to deliver an order-of-magnitude speed-up with Sparser in about 4Kloc.

The classic approach to JSON parsing is to use a state-machine based parsing algorithm. This is the approach used by e.g. RapidJSON. Such algorithms are sequential and can’t easily exploit the SIMD capabilities of modern CPUs. State-of-the-art JSON parsers such as Mison are designed to match the capabilities of modern hardware. Mison uses SIMD instructions to find special characters such as brackets and colons and build a structural index over a raw JSON string.
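A scalar sketch of the first step of a Mison-style structural index: record where the structural characters sit in the raw bytes, without interpreting anything. Mison does this with SIMD instructions and per-character bitmaps (one bit per input byte); the Python loop below shows only the logical result, not the speed:

```python
def structural_index(raw: bytes):
    """Map each structural byte to the positions where it occurs."""
    specials = b'{}[]:,"'
    return {c: [i for i, b in enumerate(raw) if b == c]
            for c in specials}

idx = structural_index(b'{"id":7,"tags":["a","b"]}')
print(idx[ord(":")])  # [5, 14] -- positions of the two colons
```

With an index like this, a parser can jump straight to the fields a query needs instead of walking the string character by character.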

… we found that Mison can parse highly nested in-memory data at over 2GB/s per core, over 5x faster than RapidJSON, the fastest traditional state-machine based parser available.

How can we parse JSON even faster? The key lies in re-framing the question. The fastest way to parse a JSON file is not to parse it at all. Zero ms is a hard lower bound ;). In other Continue reading
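The shape of the idea fits in a few lines: run a cheap raw-byte test over each record and only hand survivors to the full JSON parser. Sparser actually compiles a cascade of SIMD-friendly "raw filters" from the query predicate; the bytes containment test below is a scalar stand-in, with invented field names:

```python
import json

def query(lines, needle: bytes, field: str, value: str):
    """Yield parsed records matching field == value, pre-filtering on raw bytes."""
    for raw in lines:
        if needle not in raw:      # cheap reject; false positives are fine,
            continue               # the real parse below catches them
        record = json.loads(raw)   # exact check only on the survivors
        if record.get(field) == value:
            yield record

lines = [b'{"lang": "en", "text": "hello"}',
         b'{"lang": "fr", "text": "bonjour"}']
hits = list(query(lines, b'"en"', "lang", "en"))
print(len(hits))  # 1
```

The win comes when the predicate is selective: most records fail the cheap test and never pay the parsing cost at all.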

Fairness without demographics in repeated loss minimization

Fairness without demographics in repeated loss minimization Hashimoto et al., ICML’18

When we train machine learning models and optimise for average loss it is possible to obtain systems with very high overall accuracy, but which perform poorly on under-represented subsets of the input space. For example, a speech recognition system that performs poorly with minority accents.

We refer to this phenomenon of high overall accuracy but low minority accuracy as a representation disparity… This representation disparity forms our definition of unfairness, and has been observed in face recognition, language identification, dependency parsing, part-of-speech tagging, academic recommender systems, and automatic video captioning.

For systems that are continually trained and evolved based on data collected from their users, the poor performance for a minority group can set in place a vicious cycle in which members of such a group use the system less (because it doesn’t work as well for them), causing them to provide less data and hence to be further under-represented in the training set…

… this problem of disparity amplification is a possibility in any machine learning system that is retrained on user data.
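The feedback loop in the quoted passage can be seen in a toy simulation: suppose a group's accuracy tracks its share of the training data, and its users stick around in proportion to that accuracy. The functional forms below are invented purely for illustration (the paper's formal model is considerably more careful), but the qualitative behaviour — a shrinking minority share — is the disparity amplification being described:

```python
majority, minority = 900.0, 100.0  # user counts per group
shares = []
for _ in range(5):
    total = majority + minority
    acc_maj = majority / total       # crude proxy: more data, better accuracy
    acc_min = minority / total
    majority *= 0.5 + 0.5 * acc_maj  # retention rises with accuracy
    minority *= 0.5 + 0.5 * acc_min
    shares.append(minority / (majority + minority))

print([round(s, 3) for s in shares])  # minority share shrinks each round
```

Average-loss retraining happily follows this spiral down; the paper's contribution is a training criterion that doesn't.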

An interesting twist in the problem is that the authors assume neither the Continue reading

Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples

Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples Athalye et al., ICML’18

There has been a lot of back and forth in the research community on adversarial attacks and defences in machine learning. Today’s paper examines a number of recently proposed defences and shows that most of them rely on forms of gradient masking. The authors develop attack techniques to overcome such defences, and analyse 9 defences from ICLR 2018 claiming to protect against white-box attacks. 7 of these turn out to rely on obfuscated gradients, and 6 of these fall to the new attacks (and the other one partially succumbs). Athalye et al. won a best paper award at ICML’18 for this work.

One of the great things about work on adversarial attacks and defences, as we’ve looked at before, is that they illuminate the strengths and weaknesses of current technology. Depending on the threat model you choose, for my own part I’m currently of the opinion that we’re unlikely to find a robust adversarial defence without a more radical re-think of how we’re doing image classification. If we’re talking about the task of ‘find an image that doesn’t fool a human, but Continue reading

Delayed impact of fair machine learning

Delayed impact of fair machine learning Liu et al., ICML’18

“Delayed impact of fair machine learning” won a best paper award at ICML this year. It’s not an easy read (at least it wasn’t for me), but fortunately it’s possible to appreciate the main results without following all of the proof details. The central question is how to ensure fair treatment across demographic groups in a population when using a score-based machine learning model to decide who gets an opportunity (e.g. is offered a loan) and who doesn’t. Most recently we looked at the equal opportunity and equalized odds models.

The underlying assumption, of course, for the studied fairness models is that the fairness criteria promote the long-term well-being of the groups they aim to protect. The big result in this paper is that you can easily end up ‘killing them with kindness’ instead. The potential for this to happen exists when there is a feedback loop in place in the overall system. By overall system here, I mean the human system of which the machine learning model is just a small part. Using the loan/no-loan decision that is a popular study vehicle in fairness papers, we need to Continue reading

Bounding data races in space and time – part II

Bounding data races in space and time Dolan et al., PLDI’18

Yesterday we looked at the case for memory models supporting local data-race-freedom (local DRF). In today’s post we’ll push deeper into the paper and look at a memory model which does just that.

Consider a memory store S which maps locations to values. For non-atomic locations it’s not enough just to store a single memory value at the location. Instead non-atomic locations are mapped to histories H, where a history is a finite map from timestamps to values, and timestamps are totally ordered.

Every thread has a frontier, F, which is a map from non-atomic locations to timestamps.

Intuitively, each thread’s frontier records, for each nonatomic location, the latest write known to the thread. More recent writes may have occurred, but are not guaranteed to be visible.
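The histories-and-frontiers machinery can be sketched in plain data structures. The sketch below models only non-atomic locations (the paper's atomic locations, which carry a frontier alongside a single value, are omitted), and it resolves the non-determinism by always returning exactly the frontier write, whereas the real semantics permits any write no older than the frontier:

```python
# location -> (timestamp -> value); timestamps are totally ordered
history = {"x": {0: 0, 1: 5, 2: 7}}
# thread 1's frontier: the latest write to x it is guaranteed to see
frontier_t1 = {"x": 1}

def read(loc, frontier):
    """Return the frontier value; later writes (t=2 here) may or may not
    be visible yet, so this picks the single guaranteed answer."""
    return history[loc][frontier[loc]]

def write(loc, value, frontier):
    """Add the write at a fresh, later timestamp and advance the writer's
    own frontier; other threads' frontiers are untouched."""
    t = max(history[loc]) + 1
    history[loc][t] = value
    frontier[loc] = t

print(read("x", frontier_t1))  # 5
write("x", 9, frontier_t1)
print(read("x", frontier_t1))  # 9
```

Notice that a second thread whose frontier still maps x to timestamp 0 is perfectly entitled to keep reading the old value — that gap between frontiers is where data races live.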

For atomic locations, the memory store records a pair (F,x), with a single data value x rather than a history. F is a frontier, which will be merged with the frontiers of threads that operate on the location. This is expressed in the rules READ-AT and WRITE-AT in the operational semantics below. The READ-AT rule, for example, should be Continue reading

Bounding data races in space and time – part I

Bounding data races in space and time Dolan et al., PLDI’18

Are you happy with your programming language’s memory model? In this beautifully written paper, Dolan et al. point out some of the unexpected behaviours that can arise in mainstream memory models (C++, Java) and why we might want to strive for something better. Then they show a comprehensible (!) memory model that offers good performance and supports local reasoning. The work is being done to provide a foundation for the multicore implementation of OCaml, but should be of interest much more broadly. There’s so much here that it’s worth taking our time over it, so I’m going to spread my write-up over a number of posts.

Today we’ll be looking at the concept of local data-race-freedom (local DRF) and why we might want this property in a programming language.

Mainstream memory models don’t support local reasoning

Modern processors and compilers have all sorts of trickery they can deploy to make your program run faster. The optimisations don’t always play well with parallel execution though.

To benefit from these optimisations, mainstream languages such as C++ and Java have adopted complicated memory models which specify which of these relaxed Continue reading

HHVM JIT: A profile-guided, region-based compiler for PHP and Hack

HHVM JIT: A profile-guided, region-based compiler for PHP and Hack Ottoni, PLDI’18

HHVM is a virtual machine for PHP and Hack (a PHP extension) which is used to power Facebook’s website among others. Today’s paper choice describes the second generation HHVM implementation, which delivered a 21.7% performance boost when running the Facebook website compared to the previous HHVM implementation.

…the PHP code base that runs the Facebook website includes tens of millions of lines of source code, which are translated to hundreds of megabytes of machine code during execution.

I’m clearly suffering from an over-simplified understanding of what the Facebook web application actually does, but at the same time if I asked you to write a Facebook clone for just the website (not the backing services, not the mobile apps, etc.), would your initial estimate be on the order of tens of millions of lines of code???!

HHVM high-level overview

The starting point for HHVM is source code in PHP or Hack. Hack is a PHP dialect used by Facebook and includes support for a richer set of type hints. From the perspective of HHVM though the two languages are fundamentally equivalent. In particular, Hack’s type hints are Continue reading

BLeak: automatically debugging memory leaks in web applications

BLeak: Automatically debugging memory leaks in web applications Vilk & Berger, PLDI’18

BLeak is a Browser Leak debugger that finds memory leaks in web applications. You can use BLeak to test your own applications by following the instructions at http://bleak-detector.org.

Guided by BLeak, we identify and fix over 50 memory leaks in popular libraries and apps including Airbnb, AngularJS, Google Analytics, Google Maps SDK, and jQuery. BLeak’s median precision is 100%; fixing the leaks it identifies reduces heap growth by an average of 94%, saving from 0.5MB to 8MB per round trip.

Why are web application memory leaks so problematic?

Memory leaks in web applications are a pervasive problem. They lead to higher garbage collection frequency and overhead, reduced application responsiveness, and even browser tab crashes. Existing memory leak detection approaches don’t work well in the browser environment though:

  • Staleness detection assumes leaked memory is rarely touched, but web apps regularly interact with leaked state (e.g. via event listeners).
  • Growth-based techniques assume that leaked objects are uniquely owned, or that leaked objects form strongly connected components in the heap graph. In a web application, objects frequently have multiple owners, and all roads lead back to window.
  • Techniques Continue reading

Here we go again!

It’s time to start a new term on #themorningpaper. I read my very first #themorningpaper on the 30th July 2014 (“Why functional programming matters”, Hughes 1990) and since then, bar three scheduled breaks a year, I’ve been reading a research paper every weekday. Since the 8th October 2014, I’ve also been posting a write-up of the day’s paper on The Morning Paper blog. There’s not always an exact 1:1 correspondence between a post and a paper, but it’s pretty good. On that basis, you can now find somewhere in the order of 840 paper write-ups on the blog, and we’re racing towards the 1,000 mark!

People often ask, and yes, it’s a lot of work to curate papers, read them, and post the write-ups! Somewhere on the order of 15-20 hours a week (and all outside of my regular work commitments). I’ve got a tremendous amount of value out of this habit and I intend to keep it going. However, starting again this term it suddenly felt like a big load to carry – might be something to do with the fact that we’re moving house in a couple of weeks! So to keep things feeling fun for me and Continue reading

End of term

It’s time to take my regular summer break from writing The Morning Paper – normal service will resume again on Monday 6th August. I’ll be topping up my paper backlog and scouting out interesting research during the break. If you’ve seen any great papers I haven’t already covered and that you think ‘The Morning Paper’ readers would enjoy, please do send them my way.

In the meantime, here are a few picks from papers we’ve covered this term in case you missed any of them. (In order of publication on the blog)

Many thanks to all of you that follow along and support the blog – the accountability keeps me sticking at it when I’m snowed under, and Continue reading

Oblix: an efficient oblivious search index

Oblix: an efficient oblivious search index Mishra et al., IEEE Security & Privacy 2018

Unfortunately, many known schemes that enable search queries on encrypted data achieve efficiency at the expense of security, as they reveal access patterns to the encrypted data. In this paper we present Oblix, a search index for encrypted data that is oblivious (provably hides access patterns), is dynamic (supports inserts and deletes), and has good efficiency.

There’s a lot to this paper! Starting with a recap of existing work on Path ORAM (Oblivious RAM) and Oblivious Data Structures (ODS), Mishra et al. introduce an extension for an Oblivious Sorted Multimap (OSM) (such that you can look up a key and find a set of associated values, handy for building indexes!). Then because their design runs a client-proxy process inside an enclave at the server, and enclaves still leak some information, they also design “doubly-oblivious” versions of all of the above that hide the client access patterns in addition to those at the server process. It’s all topped off with an implementation in Rust (nice to see Rust being used for systems research), and an evaluation with three prime use cases: private contact discovery in Signal, Continue reading

EnclaveDB: a secure database using SGX

EnclaveDB: A secure database using SGX Priebe et al., IEEE Security & Privacy 2018

This is a really interesting paper (if you’re into this kind of thing I guess!) bringing together the security properties of Intel’s SGX enclaves with the Hekaton SQL Server database engine. The result is a secure database environment with impressive runtime performance. (In the read-mostly TATP benchmarks, overheads are down around 15%, which is amazing for this level of encryption and security). The paper does a great job showing us all of the things that needed to be considered to make EnclaveDB work so well in this environment.

One of my favourite takeaways is that we don’t always have to think of performance and security as trade-offs:

In this paper, we show that the principles behind the design of a high performance database engine are aligned with security. Specifically, in-memory tables and indexes are ideal data structures for securely hosting and querying sensitive data in enclaves.

Motivation and threat model

We host databases in all sorts of untrusted environments, potentially with unknown database administrators, server administrators, OS and hypervisors. How can we guarantee data security and integrity in such a world? Or even how Continue reading

Grand Pwning Unit: Accelerating microarchitectural attacks with the GPU

Grand Pwning Unit: Accelerating microarchitectural attacks with the GPU Frigo et al., IEEE Security & Privacy

The general awareness of microarchitectural attacks has greatly increased since Meltdown and Spectre earlier this year. A lot of time and energy has been spent defending against such attacks, with a threat model that assumes attacks originate from the CPU. Frigo et al. open up an entirely new can of worms – modern SoCs contain a variety of special-purpose accelerator units, chief among which is the GPU. GPUs are everywhere these days.

Unfortunately, the inclusion of these special-purpose units in the processor today appears to be guided by a basic security model that mainly governs access control, while entirely ignoring the threat of more advanced microarchitectural attacks.

I’m sure you know where this is heading…

It turns out the accelerators can also be used to “accelerate” microarchitectural attacks. Once more we find ourselves in a situation with widespread vulnerabilities. The demonstration target in the paper is a mobile phone running on the ARM platform, with all known defences, including any applicable advanced research defences, employed. Using WebGL from JavaScript, Frigo et al. show how to go from e.g. an advert Continue reading