Author Archives: adriancolyer
Rosetta: large scale system for text detection and recognition in images Borisyuk et al., KDD’18
Rosetta is Facebook’s production system for extracting text (OCR) from uploaded images.
In the last several years, the volume of photos being uploaded to social media platforms has grown exponentially to the order of hundreds of millions every day, presenting technological challenges for processing increasing volumes of visual information… our problem can be stated as follows: to build a robust and accurate system for optical character recognition capable of processing hundreds of millions of images per day in realtime.
Images uploaded by clients are added to a distributed processing queue from which Rosetta inference machines pull jobs. Online image processing consists of the following steps:
Columnstore and B+ tree – are hybrid physical designs important? Dziedzic et al., SIGMOD’18
Earlier this week we looked at the design of column stores and their advantages for analytic workloads. What should you do though if you have a mixed workload including transaction processing, decision support, and operational analytics? Microsoft SQL Server supports hybrid physical design combining both column store and B+ tree indexes in the same database.
It is generally understood that columnstores are crucial to achieving high performance for analytic queries and that B+ tree indexes are key to supporting transactional workloads efficiently. However, it is not well understood whether hybrid physical designs – both columnstore and B+ tree indices on the same database and potentially the same table – are important for any of the above workloads.
Through a series of benchmarks the authors show that hybrid physical designs can result in more than an order of magnitude lower execution costs for many workloads when compared to alternatives using B+ tree-only or columnstore-only. The Database Engine Tuning Advisor (DTA) for SQL Server is extended to analyze and recommend the appropriate indices for a given workload. Support for columnstore indices and the new DTA functionality was Continue reading
The design and implementation of modern column-oriented database systems Abadi et al., Foundations and trends in databases, 2012
I came here by following the references in the Smoke paper we looked at earlier this week. “The design and implementation of modern column-oriented database systems” is a longer piece at 87 pages, but it’s good value-for-time. What we have here is a very readable overview of the key techniques behind column stores.
Column stores are relational databases that store data by column rather than by row. Whereas a traditional row-based store stores all attributes of one row together, followed by the attributes of the next row, and so on, a column-based store uses one logical file per attribute (column). The column-oriented layout makes it efficient to read just the columns you need for a query, without pulling in lots of redundant data.
Data for a column may be stored in an array with implicit ids (a), or in some format with explicit ids (b).
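To make the layout difference concrete, here is a minimal sketch (my own illustration, not code from the paper) of the same three-attribute table stored row-wise and column-wise in Python; a query touching only one attribute reads a single column array rather than walking every tuple:

```python
# Minimal sketch (not from the paper): a three-attribute table laid out row-wise
# versus column-wise, with implicit ids given by array position.

rows = [                      # row store: all attributes of a row stored together
    (1, "alice", 34),
    (2, "bob",   27),
    (3, "carol", 41),
]

columns = {                   # column store: one array ("logical file") per attribute
    "id":   [1, 2, 3],
    "name": ["alice", "bob", "carol"],
    "age":  [34, 27, 41],
}

# SELECT avg(age) FROM t -- the column layout touches only the 'age' array,
# while the row layout walks every tuple and discards 'id' and 'name' on the way.
avg_age_row_store = sum(r[2] for r in rows) / len(rows)
avg_age_col_store = sum(columns["age"]) / len(columns["age"])
assert avg_age_row_store == avg_age_col_store
```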
Since data transfer costs from storage (or through a storage hierarchy) are often the major performance bottlenecks in database systems, while at the same time database schemas are becoming more and Continue reading
Smoke: fine-grained lineage at interactive speed Psallidas et al., VLDB’18
Data lineage connects the input and output data items of a computation. Given a set of output records, a backward lineage query selects a subset of the output records and asks “which input records contributed to these results?” A forward lineage query selects a subset of the input records and asks, “which output records depend on these inputs?”. Lineage-enabled systems capture record-level relationships throughout a workflow and support lineage queries.
Data lineage is useful in lots of different applications; this paper uses interactive visualisation systems as its main example. This domain requires fast answers to queries and is typically dominated by hand-written implementations. Consider the two views in the figure below. When the user selects a set of marks in one view, marks derived from the same records are highlighted in the other (linked brushing).
A typical visualisation system implements this manually, but it can equally be viewed as a backward lineage query from the selection points in the first view, followed by a forward lineage query from the resulting input records to the second.
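To make the two query directions concrete, here is a minimal sketch (my own illustration, not Smoke's implementation) of backward and forward lineage over a captured mapping from output records to the input records that produced them:

```python
# Minimal sketch (not Smoke's implementation): lineage captured as a mapping
# from each output record id to the set of input record ids that produced it.
lineage = {
    "out1": {"in1", "in2"},
    "out2": {"in2", "in3"},
    "out3": {"in4"},
}

def backward(output_ids):
    """Which input records contributed to these outputs?"""
    return set().union(*(lineage[o] for o in output_ids))

def forward(input_ids):
    """Which output records depend on these inputs?"""
    return {o for o, ins in lineage.items() if ins & set(input_ids)}

# Linked brushing expressed as lineage: selecting 'out1' in one view highlights
# every mark in the other view derived from the same input records.
selected = {"out1"}
highlighted = forward(backward(selected))   # {'out1', 'out2'}
print(highlighted)
```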
(See ‘Explaining outputs in modern data analytics’ which we looked at last year for an introduction Continue reading
Same-different problems strain convolutional neural networks Ricci et al., arXiv 2018
Since we’ve been looking at the idea of adding structured representations and relational reasoning to deep learning systems, I thought it would be interesting to finish off the week with an example of a problem that seems to require it: detecting whether objects in a scene are the same or different.
This image containing a flute was correctly classified by a CNN trained on millions of photographs. On ImageNet the network even surpassed the accuracy of a human observer.
This image contains two shapes that are the same, a relationship that is immediately obvious to a human observer. “Yet, the CNN failed to learn this relation even after seeing millions of training examples.”
The above is an example of a same-different (SD) visual relation problem (output whether the objects in the scene are the same, or different). Spatial relation (SR) problems ask whether objects follow a certain spatial relation, e.g. in a line, horizontally stacked, vertically stacked, and so on. For example:
The synthetic visual reasoning test (SVRT) contains a collection of 23 binary classification problems along these lines. In each case opposing classes differ Continue reading
Relational inductive biases, deep learning, and graph networks Battaglia et al., arXiv’18
Earlier this week we saw the argument that causal reasoning (where most of the interesting questions lie!) requires more than just associational machine learning. Structural causal models have at their core a graph of entities and relationships between them. Today we’ll be looking at a position paper with a wide team of authors from DeepMind, Google Brain, MIT, and the University of Edinburgh, which also makes the case for graph networks as a foundational building block of the next generation of AI. In other words, bringing back and re-integrating some of the techniques from the AI toolbox that were prevalent when resources were more limited.
We argue that combinatorial generalization must be a top priority for AI to achieve human-like abilities, and that structured representation and computations are key to realizing this objective… We explore how using relational inductive biases within deep learning architectures can facilitate learning about entities, relations, and the rules for composing them.
Humans represent complex systems as compositions of entities and their interactions. We use hierarchies to abstract away fine-grained differences, manage part-whole associations and other more Continue reading
The seven tools of causal inference with reflections on machine learning Pearl, CACM 2018
With thanks to @osmandros for sending me a link to this paper on twitter.
In this technical report Judea Pearl reflects on some of the limitations of machine learning systems that are based solely on statistical interpretation of data. To understand why? and to answer what if? questions, we need some kind of a causal model. In the social sciences and especially epidemiology, a transformative mathematical framework called ‘Structural Causal Models’ (SCM) has seen widespread adoption. Pearl presents seven example tasks which the model can handle, but which are out of reach for associational machine learning systems.
A useful insight unveiled by the theory of causal models is the classification of causal information in terms of the kind of questions that each class is capable of answering. This classification forms a 3-level hierarchy in the sense that questions at level i (i = 1, 2, 3) can only be answered if information from level j (j ≥ i) is available.
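As a reference point, here is my summary (not a quote from the paper) of the three levels and the canonical form of the questions each can answer:

```latex
% Sketch of the three-level causal hierarchy and canonical query forms (my paraphrase).
\begin{align*}
\textbf{1. Association:}\quad     & P(y \mid x)         && \text{``What does observing } x \text{ tell me about } y\text{?''} \\
\textbf{2. Intervention:}\quad    & P(y \mid do(x))     && \text{``What happens to } y \text{ if I do } x\text{?''} \\
\textbf{3. Counterfactuals:}\quad & P(y_x \mid x', y')  && \text{``Would } y \text{ have occurred had } x \text{ been different?''}
\end{align*}
```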
The lowest (first) layer is called Association and it involves purely statistical relationships defined by the naked data. This Continue reading
An empirical analysis of anonymity in Zcash Kappos et al., USENIX Security’18
As we’ve seen before, in practice Bitcoin offers little in the way of anonymity. Zcash on the other hand was carefully designed with privacy in mind. It offers strong theoretical guarantees concerning privacy. So in theory users of Zcash can remain anonymous. In practice though it depends on the way those users interact with Zcash. Today’s paper choice, ‘An empirical analysis of anonymity in Zcash’ studies how identifiable transaction participants are in practice based on the 2,242,847 transactions in the blockchain at the time of the study.
We conclude that while it is possible to use Zcash in a private way, it is also possible to shrink its anonymity set considerably by developing simple heuristics based on identifiable patterns of usage.
The analysis also provides some interesting insights into who is using Zcash, and for what. Founders and miners combined account for around 66% of the value drawn from the shielded pool.
The code for the analysis is available online at https://github.com/manganese/zcash-empirical-analysis
Zcash is based on highly regarded research including a cryptographic proof of the main privacy feature Continue reading
QSYM: a practical concolic execution engine tailored for hybrid fuzzing Yun et al., USENIX Security 2018
There are two main approaches to automated test case generation for uncovering bugs and vulnerabilities: fuzzing and concolic execution. Fuzzing is good at quickly exploring the input space, but can get stuck when trying to get past more complex conditional checks (i.e., when randomly generated inputs are unlikely to satisfy them). Concolic execution, which we saw in action earlier in the week, uses symbolic execution to uncover constraints and pass them to a solver. It can handle complex branch conditions, but it’s much slower. Hybrid fuzzers combine both coverage-guided fuzzing and concolic execution, bringing in the big guns (concolic) when the fuzzer gets stuck. In non-trivial real-world applications though, even the hybrid approach has been too slow. Until now.
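The division of labour can be sketched roughly as follows (a simplified illustration of the general hybrid approach, not QSYM's actual scheduling logic; the mutator, coverage oracle, and concolic engine are assumed to be supplied by the caller): the fuzzer runs on the cheap fast path, and only when it stops finding new coverage is the expensive concolic engine invoked to solve the branch conditions blocking progress.

```python
# Rough sketch of a hybrid fuzzing loop (illustration only, not QSYM's code).
# The caller supplies the components: mutate(input) -> input, coverage_of(input) -> set
# of covered edges, and concolic_explore(queue) -> new inputs from solved branches.
import random

def hybrid_fuzz(seeds, mutate, coverage_of, concolic_explore,
                rounds=10_000, stuck_threshold=100):
    queue = list(seeds)
    seen_coverage = set()
    no_progress = 0
    for _ in range(rounds):
        candidate = mutate(random.choice(queue))        # cheap fuzzing fast path
        new_edges = coverage_of(candidate) - seen_coverage
        if new_edges:
            seen_coverage |= new_edges
            queue.append(candidate)
            no_progress = 0
        else:
            no_progress += 1
        if no_progress > stuck_threshold:
            # Fuzzer is stuck behind complex branch conditions: hand the current
            # queue to the (slow) concolic engine and re-seed with its solutions.
            queue.extend(concolic_explore(queue))
            no_progress = 0
    return queue, seen_coverage
```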
For me, the attention-grabbing paragraph in this paper is to be found on page 8 (752) in section 5.1. Google’s OSS-Fuzz was previously used to test a number of important real-world applications and libraries including libjpeg, libpng, libtiff, lepton, openjpeg, tcpdump, file, libarchive, audiofile, ffmpeg, and binutils.
It is worth noting that Google’s OSS-Fuzz generated 10 trillion test inputs Continue reading
NAVEX: Precise and scalable exploit generation for dynamic web applications Alhuzali et al., USENIX Security 2018
NAVEX (https://github.com/aalhuz/navex) is a very powerful tool for finding executable exploits in dynamic web applications. It combines static and dynamic analysis (to cope with dynamically generated web content) to find vulnerable points in web applications and determine whether inputs to those points are appropriately sanitised; it then builds a navigation graph for the application and uses it to construct a series of HTTP requests that trigger the vulnerability.
It also works at real-world scale: NAVEX was used on 26 PHP applications with a total of 3.2M SLOC and 22.7K PHP files. It generated 204 concrete exploits across these applications in a total of 6.5 hours. While the current implementation of NAVEX targets PHP applications, the approach could be generalised to other languages and frameworks.
In this paper, our main contribution is a precise approach for vulnerability analysis of multi-tier web applications with dynamic features… our approach combines dynamic analysis of web applications with static analysis to automatically identify vulnerabilities and generate concrete exploits as proof of those vulnerabilities.
Here’s an example of what NAVEX can do. From the 64K Continue reading
Unveiling and quantifying Facebook exploitation of sensitive personal data for advertising purposes Cabañas et al., USENIX Security 2018
Earlier this week we saw how the determined can still bypass most browser and tracker-blocking extension protections to track users around the web. Today’s paper is a great example of why you should care about that. Cabañas et al. examine the extent to which the profile Facebook builds on its users includes sensitive personal data made available to advertisers for targeting. The work was done just prior to the GDPR coming into force, which makes for very interesting timing from a legal perspective. The headline result is that it looks like Facebook is holding sensitive data on about 40% of the EU population, and that this data can be used by third-parties to target individuals in sensitive demographics and even identify them at a cost of as little as €0.015 per user.
The GDPR definition of sensitive personal data is “data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, and the processing of genetic data, biometric data for the purpose of uniquely identifying Continue reading
Who left open the cookie jar? A comprehensive evaluation of third-party cookie policies Franken et al., USENIX Security 2018
This paper won a ‘Distinguished paper’ award at USENIX Security 2018, as well as the 2018 Internet Defense Prize. It’s an evaluation of the defense mechanisms built into browsers (and via extensions / add-ons) that seek to protect against user tracking and cross-site attacks. Testing across 7 browsers and 46 browser extensions, the authors find that for virtually every browser and extension combination there is a way to bypass the intended security policies.
Despite their significant merits, the way cookies are implemented in most modern browsers also introduces a variety of attacks and other unwanted behavior. More precisely, because cookies are attached to every request, including third-party requests, it becomes more difficult for websites to validate the authenticity of a request. Consequently, an attacker can trigger requests with a malicious payload from the browser of an unknowing victim… Next to cross-site attacks, the inclusion of cookies in third-party requests also allows for users to be tracked across the various websites they visit.
When you visit a site A, it can set a cookie to be included in Continue reading
Fear the reaper: characterization and fast detection of card skimmers Scaife et al., USENIX Security 2018
Until I can get my hands on a Skim Reaper I’m not sure I’ll ever trust an ATM or other exposed card reading device (e.g., at garages) again!
Scaife et al. conduct a study of skimming devices found by the NYPD Financial Crimes Task Force over a 16 month period. The bad news is that you and I don’t really have much chance of detecting a deployed card skimming device (most of the folk wisdom about how to do so doesn’t really work). The good news is that the Skim Reaper detection device developed in this research project was able to effectively detect 100% of the devices supplied by the NYPD. That’s great if you happen to have a Skim Reaper handy to test with before using an ATM. The NYPD are now actively using a set of such devices in the field.
Almost as well-known as (credit and debit) cards themselves is the ease with which fraud can be committed against them. Attackers often acquire card data using skimmers Continue reading
STTR: A system for tracking all vehicles all the time at the edge of the network Xu et al., DEBS’18
With apologies for only bringing you two paper write-ups this week: we moved house, which turns out to be not at all conducive to quiet study of research papers!
Today’s smart camera surveillance systems are largely alert based, which gives two main modes of operation: either you know in advance the vehicles of interest so that you can detect them in real time, or you have to trawl through lots of camera footage post-facto (expensive and time-consuming). STTR is a system designed to track all of the vehicles all of the time, and store their trajectories for ever. I certainly have mixed feelings about the kinds of state surveillance and privacy invasions that enables (it’s trivial to link back to individuals given trajectories over time), but here we’ll just focus on the technology. Since the system is designed with pluggable detection and matching algorithms, given some calculations around volume it ought to be possible to use it to track objects other than vehicles. People, for example?
Assuming the availability of suitable detection and matching (figuring out if Continue reading
Learning the structure of generative models without labeled data Bach et al., ICML’17
For the last couple of posts we’ve been looking at Snorkel and BabbleLabble which both depend on data programming – the ability to intelligently combine the outputs of a set of labelling functions. The core of data programming is developed in two papers, ‘Data programming: creating large training sets, quickly’ (Ratner 2016) and today’s paper choice, ‘Learning the structure of generative models without labeled data’ (Bach 2017).
The original data programming paper works explicitly with input pairs (x,y) (e.g. the chemical and disease word pairs we saw from the disease task in Snorkel) which (for me at least) confuses the presentation a little compared to the latter ICML paper which just assumes inputs (which could of course have pair structure, but we don’t care about that at this level of detail). Also in the original paper dependencies between labelling functions are explicitly specified by end users (as one of four types: similar, fixing, reinforcing, and exclusive) and built into a factor graph. In the ICML paper dependencies are learned. So I’m going to work mostly from ‘Learning the structure of generative Continue reading
Training classifiers with natural language explanations Hancock et al., ACL’18
We looked at Snorkel earlier this week, which demonstrates that maybe AI isn’t going to take over all of our programming jobs. Instead, we’ll be writing labelling functions to feed the machine! Perhaps we could call this task label engineering. To me, it feels a bit like programming a quick-and-dirty expert system, where the downstream generative model deals with all the inaccuracies and inconsistencies so that we don’t have to be perfect, just useful. Given the success of the approach, a natural question to ask is how we can enable end users to more easily create useful labelling functions in their domain. This is where BabbleLabble comes in!
In this work, we propose BabbleLabble, a framework for training classifiers in which an annotator provides a natural language explanation for each labeling decision. A semantic parser converts these explanations into programmatic labeling functions that generate noisy labels for an arbitrary amount of unlabeled data, which is used to train a classifier.
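To give a flavour of the idea, here is a hypothetical example of mine (not output from the paper's semantic parser): the kind of labelling function that might be generated from an explanation such as "True, because the word 'wife' appears between the two person mentions."

```python
# Hypothetical illustration (not the paper's parser output): a labelling function
# derived from the explanation "True, because the word 'wife' appears between
# the two person mentions." The example format below is assumed for illustration.

def lf_wife_between(example):
    """Return +1 (true), -1 (false) or 0 (abstain) for a candidate spouse pair."""
    start = example["person1_end"]        # end of the first person mention
    end = example["person2_start"]        # start of the second person mention
    between = example["tokens"][start:end]
    return 1 if "wife" in between else 0

example = {
    "tokens": ["Barack", "and", "his", "wife", "Michelle", "met", "in", "1989"],
    "person1_end": 1,
    "person2_start": 4,
}
print(lf_wife_between(example))   # -> 1
```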
So much for those programming jobs!
Working with BabbleLabble, it takes users about twice as long per example to provide a label plus an explanation as it does just Continue reading
Snorkel: rapid training data creation with weak supervision Ratner et al., VLDB’18
Earlier this week we looked at Sparser, which comes from the Stanford Dawn project, “a five-year research project to democratize AI by making it dramatically easier to build AI-powered applications.” Today’s paper choice, Snorkel, is from the same stable. It tackles one of the central questions in supervised machine learning: how do you get a large enough set of training data to power modern deep models?
…deep learning has a major upfront cost: these methods need massive training sets of labeled examples to learn from – often tens of thousands to millions to reach peak predictive performance. Such training sets are enormously expensive to create…
Snorkel lets you throw everything you’ve got at the problem. Heuristics, external knowledge bases, crowd-sourced workers, you name it. These are known as weak supervision sources because they may be limited in accuracy and coverage. All of these get combined in a principled manner to produce a set of probability-weighted labels. The authors call this process ‘data programming’. The end model is then trained on the generated labels.
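As a very rough sketch of the combination step (simplified to unweighted voting; Snorkel's actual generative model learns per-source accuracies and correlations), here's how a few noisy labelling functions might vote on an unlabelled example and be merged into a probabilistic label:

```python
# Simplified sketch of combining weak supervision sources (not Snorkel's code):
# each labelling function returns +1, -1, or 0 (abstain); votes are merged by
# simple voting rather than the learned generative model used in the paper.

KNOWN_FACTS = {"smoking causes cancer"}        # stand-in for an external knowledge base

def lf_contains_causes(x):                     # heuristic: 'causes' suggests positive
    return 1 if "causes" in x else 0

def lf_contains_not(x):                        # heuristic: negation suggests negative
    return -1 if " not " in x else 0

def lf_knowledge_base(x):                      # distant supervision from the KB
    return 1 if x in KNOWN_FACTS else 0

LFS = [lf_contains_causes, lf_contains_not, lf_knowledge_base]

def probabilistic_label(x):
    """Merge the non-abstaining votes into P(label = positive)."""
    votes = [v for v in (lf(x) for lf in LFS) if v != 0]
    if not votes:
        return 0.5                             # every source abstained
    return sum(1 for v in votes if v == 1) / len(votes)

print(probabilistic_label("smoking causes cancer"))                   # -> 1.0
print(probabilistic_label("stress causes headaches but not cancer"))  # -> 0.5 (conflict)
```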
Snorkel is the first system to implement our recent work Continue reading
Filter before you parse: faster analytics on raw data with Sparser Palkar et al., VLDB’18
We’ve been parsing JSON for over 15 years. So it’s surprising and wonderful that with a fresh look at the problem the authors of this paper have been able to deliver an order-of-magnitude speed-up with Sparser in about 4Kloc.
The classic approach to JSON parsing is to use a state-machine based parsing algorithm. This is the approach used by e.g. RapidJSON. Such algorithms are sequential and can’t easily exploit the SIMD capabilities of modern CPUs. State of the art JSON parsers such as Mison are designed to match the capabilities of modern hardware. Mison uses SIMD instructions to find special characters such as brackets and colons and build a structural index over a raw json string.
… we found that Mison can parse highly nested in-memory data at over 2GB/s per core, over 5x faster than RapidJSON, the fastest traditional state-machine based parser available.
How can we parse JSON even faster? The key lies in re-framing the question. The fastest way to parse a JSON file is not to parse it at all. Zero ms is a hard lower bound ;). In other Continue reading
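The 'filter before you parse' idea can be sketched like this (my simplification in Python; Sparser's raw filters actually run SIMD operations over raw bytes and are chosen by an optimizer): run a cheap, false-positive-tolerant check over the raw string first, and only pay for a full parse when the record might satisfy the query predicate.

```python
# Simplified sketch of filter-before-parse (not Sparser's SIMD raw filters):
# a cheap substring test discards most records before the expensive json.loads.
import json

def count_matching(raw_lines, keyword="Putin"):
    """Count records whose 'text' field mentions the keyword."""
    hits = 0
    for line in raw_lines:
        if keyword not in line:        # raw filter: may pass false positives,
            continue                   # but never drops a true match
        record = json.loads(line)      # full parse only for candidate records
        if keyword in record.get("text", ""):
            hits += 1
    return hits

data = [
    '{"text": "Putin met with ...", "lang": "en"}',
    '{"text": "weather update", "lang": "en"}',
    '{"text": "nothing here", "user": "Putin_fan"}',   # raw-filter false positive
]
print(count_matching(data))   # -> 1
```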
Fairness without demographics in repeated loss minimization Hashimoto et al., ICML’18
When we train machine learning models and optimise for average loss it is possible to obtain systems with very high overall accuracy, but which perform poorly on under-represented subsets of the input space. For example, a speech recognition system that performs poorly with minority accents.
We refer to this phenomenon of high overall accuracy but low minority accuracy as a representation disparity… This representation disparity forms our definition of unfairness, and has been observed in face recognition, language identification, dependency parsing, part-of-speech tagging, academic recommender systems, and automatic video captioning.
For systems that are continually trained and evolved based on data collected from their users, the poor performance for a minority group can set in place a vicious cycle in which members of such a group use the system less (because it doesn’t work as well for them), causing them to provide less data and hence to be further under-represented in the training set…
… this problem of disparity amplification is a possibility in any machine learning system that is retrained on user data.
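In rough terms (my paraphrase using standard distributionally robust optimization notation, not necessarily the paper's exact formulation): rather than minimizing the average loss over the observed distribution, one minimizes the worst-case expected loss over a neighbourhood of that distribution, which upper-bounds the loss on any group whose proportion is not too small, without ever needing group labels.

```latex
% Sketch of the objectives (standard DRO form; a paraphrase, not the paper's exact notation).
\begin{align*}
\text{Average loss (ERM):}\qquad & \min_{\theta}\; \mathbb{E}_{Z \sim P}\big[\ell(\theta; Z)\big] \\
\text{Worst-case group loss:}\qquad & \min_{\theta}\; \max_{k}\; \mathbb{E}_{Z \sim P_k}\big[\ell(\theta; Z)\big] \\
\text{DRO surrogate (no group labels):}\qquad & \min_{\theta}\; \sup_{Q:\, D(Q \,\|\, P) \le r}\; \mathbb{E}_{Z \sim Q}\big[\ell(\theta; Z)\big]
\end{align*}
```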
An interesting twist in the problem is that the authors assume neither the Continue reading
Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples Athalye et al., ICML’18
There has been a lot of back and forth in the research community on adversarial attacks and defences in machine learning. Today’s paper examines a number of recently proposed defences and shows that most of them rely on forms of gradient masking. The authors develop attack techniques to overcome such defences, and analyse 9 defences from ICLR 2018 claiming to protect against white-box attacks. 7 of these turn out to rely on obfuscated gradients, and 6 of these fall to the new attacks (and the other one partially succumbs). Athalye et al. won a best paper award at ICML’18 for this work.
One of the great things about work on adversarial attacks and defences, as we’ve looked at before, is that they illuminate the strengths and weaknesses of current technology. Depending on the threat model you choose, for my own part I’m currently of the opinion that we’re unlikely to find a robust adversarial defence without a more radical re-think of how we’re doing image classification. If we’re talking about the task of ‘find an image that doesn’t fool a human, but Continue reading