Author Archives: adriancolyer
Invisible mask: practical attacks on face recognition with infrared Zhou et al., arXiv’18
You might have seen selected write-ups from The Morning Paper appearing in ACM Queue. The editorial board there are also kind enough to send me paper recommendations when they come across something that sparks their interest. So this week things are going to get a little bit circular as we’ll be looking at three papers originally highlighted to me by the ACM Queue board!
‘Invisible Mask’ looks at the very topical subject of face authentication systems. We’ve looked at adversarial attacks on machine learning systems before, including those that can be deployed in the wild, such as decorating stop signs. Most adversarial attacks against image recognition systems require you to have pixel-level control over the input image though. Invisible Mask is different: it’s a practical attack, in that the techniques described in this paper could be used to subvert face authentication systems deployed in the wild, without there being any obvious visual difference (e.g. specially printed glasses frames) in the face of the attacker to a casual observer. That’s the invisible part: to the face recognition system it’s as if you are wearing a mask, Continue reading
Learning a unified embedding for visual search at Pinterest Zhai et al., KDD’19
Last time out we looked at some great lessons from Airbnb as they introduced deep learning into their search system. Today’s paper choice highlights an organisation that has been deploying multiple deep learning models in search (visual search) for a while: Pinterest.
With over 600 million visual searches per month and growing, visual search is one of the fastest growing products at Pinterest and of increasing importance.
Visual search is pretty fundamental to the Pinterest experience. The paper focuses on three search-based products: Flashlight, Lens, and Shop-the-Look.
In Flashlight search the search query is a source image either from Pinterest or the web, and the search results are relevant pins. In Lens the search query is a photograph taken by the user with their camera, and the search results are relevant pins. In Shop-the-Look the search query is a source image from Pinterest or the web, and the results are products which match items in the image.
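Mechanically, all three products reduce to the same operation at serving time: embed the query image, then find its nearest neighbours in a shared (“unified”) embedding space. Here’s a minimal sketch of that retrieval step; the dimensions, data, and the `embed` stand-in are mine for illustration, not Pinterest’s.

```python
import numpy as np

# Minimal sketch of embedding-based retrieval: one shared embedding space,
# with each product surface issuing the same kind of nearest-neighbour query.
# Dimensions and data are made up for illustration.
rng = np.random.default_rng(0)
EMBED_DIM = 64

# Pretend index of pin embeddings, L2-normalised so dot product = cosine similarity.
pin_embeddings = rng.normal(size=(10_000, EMBED_DIM))
pin_embeddings /= np.linalg.norm(pin_embeddings, axis=1, keepdims=True)

def embed(image) -> np.ndarray:
    """Stand-in for the shared visual embedding model (a CNN in practice)."""
    v = rng.normal(size=EMBED_DIM)   # placeholder: a real system runs the image through the model
    return v / np.linalg.norm(v)

def search(query_image, k: int = 5):
    """Return indices of the k most similar pins to the query image."""
    q = embed(query_image)
    scores = pin_embeddings @ q          # cosine similarities against the whole index
    return np.argsort(-scores)[:k]       # top-k most similar pins

print(search(query_image=None))
```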
Models are like microservices in one sense: it seems that they have a tendency to proliferate within organisations once they start to take hold! (Aside, I wonder if there’s Continue reading
Applying deep learning to Airbnb search Haldar et al., KDD’19
Last time out we looked at Booking.com’s lessons learned from introducing machine learning to their product stack. Today’s paper takes a look at what happened in Airbnb when they moved from standard machine learning approaches to deep learning. It’s written in a very approachable style and packed with great insights. I hope you enjoy it as much as I did!
Ours is a story of the elements we found useful in applying neural networks to a real life product. Deep learning was steep learning for us. To other teams embarking on similar journeys, we hope an account of our struggles and triumphs will provide some useful pointers.
The core application of machine learning discussed in this paper is the model which orders available listings according to a guest’s likelihood of booking. This is one of a whole ecosystem of models which contribute towards search rankings when a user searches on Airbnb. New models are tested online through an A/B testing framework to compare their performance to previous generations.
The very first version of search ranking at Airbnb was a hand-written Continue reading
150 successful machine learning models: 6 lessons learned at Booking.com Bernardi et al., KDD’19
Here’s a paper that will reward careful study for many organisations. We’ve previously looked at the deep penetration of machine learning models in the product stacks of leading companies, and also some of the pre-requisites for being successful with it. Today’s paper choice is a wonderful summary of lessons learned integrating around 150 successful customer facing applications of machine learning at Booking.com. Oddly enough given the paper title, the six lessons are never explicitly listed or enumerated in the body of the paper, but they can be inferred from the division into sections. My interpretation of them is as follows:
There are way more than 6 good pieces of advice contained within the paper though!
We found that driving true business impact is amazingly hard, plus it is difficult to isolate Continue reading
Detecting and characterizing lateral phishing at scale Ho et al., USENIX Security Symposium 2019
This is an investigation into the phenomenon of lateral phishing attacks. A lateral phishing attack is one where a compromised account within an organisation is used to send out further phishing emails (typically to other employees within the same organisation). So ‘alice at example.com’ might receive a phishing email that has genuinely been sent by ‘bob at example.com’, and thus is more likely to trust it.
In recent years, work from both industry and academia has pointed to the emergence and growth of lateral phishing attacks: a new form of phishing that targets a diverse range of organizations and has already incurred billions of dollars in financial harm…. This attack proves particularly insidious because the attacker automatically benefits from the implicit trust in the hijacked account: trust from both human recipients and conventional email protection systems.
The study is conducted in conjunction with Barracuda Networks, who obtained customer permission to use email data from the Office 365 employee mailboxes of 92 different organisations. 69 of these organisations were selected through random sampling across all organisations, and 23 Continue reading
in-toto: providing farm-to-table guarantees for bits and bytes Torres-Arias et al., USENIX Security Symposium 2019
Small world with high risks did a great job of highlighting the absurd risks we’re currently carrying in many software supply chains. There are glimmers of hope though. This paper describes in-toto, an end-to-end system for ensuring the integrity of a software supply chain. To be a little more precise, in-toto secures the end-to-end delivery pipeline for one product or package. But it’s only a small step from there to imagine using in-toto to also verify the provenance of every third-party dependency included in the build, and suddenly you’ve got something that starts to look very interesting indeed.
In-toto is much more than just a research project, it’s already deployed and integrated into a number of different projects and ecosystems, quietly protecting artefacts used by millions of people daily. You can find the in-toto website at https://in-toto.io.
In-toto has about a dozen different integrations that protect software supply chains for millions of end-users.
If you use kubesec to analyze your Kubernetes configurations, for example, in-toto is protecting it.

Small world with high risks: a study of security threats in the npm ecosystem Zimmermann et al., USENIX Security Symposium 2019
This is a fascinating study of the npm ecosystem, looking at the graph of maintainers and packages and its evolution over time. It’s packed with some great data, and also helps us quantify something we’ve probably all had an intuition for: the high risks involved in depending on an open and fast-moving ecosystem. One of the key takeaways for me is the concentration of reach in a comparatively small number of packages and maintainers, making these both very high value targets (event-stream, it turns out, wouldn’t even have made the top-1000 in a list of ranked targets!), but also high leverage points for defence. We have to couple this of course with an exceedingly long tail.
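The ‘reach’ of a package here is essentially the number of packages that transitively depend on it. A quick sketch of how you might compute that over a dependency graph; the toy graph below is made up, not npm data.

```python
from collections import deque

# Toy dependency graph: package -> set of packages it directly depends on.
# (Real npm data would come from package.json manifests.)
deps = {
    "app-a": {"left-pad", "request"},
    "app-b": {"request"},
    "request": {"left-pad"},
    "left-pad": set(),
}

# Invert the graph: package -> packages that directly depend on it.
dependents = {p: set() for p in deps}
for pkg, ds in deps.items():
    for d in ds:
        dependents.setdefault(d, set()).add(pkg)

def package_reach(pkg: str) -> int:
    """Number of packages that transitively depend on pkg."""
    seen, queue = set(), deque([pkg])
    while queue:
        for parent in dependents.get(queue.popleft(), ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return len(seen)

print(package_reach("left-pad"))   # -> 3 (request, app-a, app-b)
```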
As the primary source of third-party JavaScript packages for the client-side, server-side, and other platforms, npm is the centerpiece of a large and important software ecosystem.
Npm is an open ecosystem hosting a collection of over 800,000 packages as of February 2019, and it continues to grow rapidly.
To share a package on npm, a maintainer creates Continue reading
Wireless attacks on aircraft instrument landing systems Sathaye et al., USENIX Security Symposium 2019
It’s been a while since we last looked at security attacks against connected real-world entities (e.g., industrial machinery, light-bulbs, and cars). Today’s paper is a good reminder of just how important it is becoming to consider cyber threat models in what are primarily physical systems, especially if you happen to be flying on an aeroplane – which I am right now as I write this!
The first fully operational Instrument Landing System (ILS) for planes was deployed in 1932. But assumptions we’ve been making since then (and until the present day, it appears!) no longer hold:
Security was never considered by design as historically the ability to transmit and receive wireless signals required considerable resources and knowledge. However, the widespread availability of powerful and low-cost software-defined radio platforms has altered the threat landscape. In fact, today the majority of wireless systems employed in modern aviation have been shown to be vulnerable to some form of cyber-physical attacks.
Both sections 1 and 6 in the paper give some eye-opening details of known attacks against aviation systems, but to date no-one Continue reading
50 ways to leak your data: an exploration of apps’ circumvention of the Android permissions system Reardon et al., USENIX Security Symposium 2019
The problem is all inside your app, she said to me / The answer is easy if you take it logically / I’d like to help data in its struggle to be free / There must be fifty ways to leak their data.
You just slip it out the back, Jack / Make a new plan, Stan / You don’t need to be coy, Roy / Just get the data free.
Hop it on the bus, Gus / You don’t need to discuss much / Just drop off the key, Lee / And get the data free…
— Lyrics adapted from “50 ways to leave your lover” by Paul Simon (fabulous song btw., you should definitely check it out if you don’t already know it!).
This paper is a study of Android apps in the wild that leak permission protected data (identifiers which can be used for tracking, and location information), where those apps should not have been able to see such data due to a lack of granted permissions. By detecting Continue reading
The secret sharer: evaluating and testing unintended memorization in neural networks Carlini et al., USENIX Security Symposium 2019
This is a really important paper for anyone working with language or generative models, and just in general for anyone interested in understanding some of the broader implications and possible unintended consequences of deep learning. There’s also a lovely sense of the human drama accompanying the discoveries that just creeps through around the edges.
Disclosure of secrets is of particular concern in neural network models that classify or predict sequences of natural language text… even if sensitive or private training data text is very rare, one should assume that well-trained models have paid attention to its precise details…. The users of such models may discover— either by accident or on purpose— that entering certain text prefixes causes the models to output surprisingly revealing text completions.
Take a system trained to make predictions on a language (word or character) model – an example you’re probably familiar with is Google Smart Compose. Now feed it a prefix such as “My social security number is “. Can you guess what happens next?
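If the secret appeared in the training data, the model may well complete it verbatim, because a memorised sequence ends up with an unusually high likelihood under the model. Here’s a toy sketch of that effect, using a character bigram model as a stand-in for a real neural language model; the training text and the planted ‘canary’ are made up.

```python
from collections import defaultdict
import math

# Toy illustration: a model trained on data containing a rare secret can
# assign that exact secret a suspiciously high likelihood. A character
# bigram "language model" stands in for a real neural LM here.
training_text = (
    "the quick brown fox jumps over the lazy dog. " * 50
    + "my social security number is 078-05-1120. "      # planted canary
)

counts = defaultdict(lambda: defaultdict(int))
for a, b in zip(training_text, training_text[1:]):
    counts[a][b] += 1

def log_likelihood(text: str) -> float:
    """Log-probability of text under the bigram model (add-one smoothing)."""
    total = 0.0
    for a, b in zip(text, text[1:]):
        row = counts[a]
        total += math.log((row[b] + 1) / (sum(row.values()) + 256))
    return total

prefix = "my social security number is "
candidates = ["078-05-1120.", "123-45-6789.", "000-00-0000."]
# Rank candidate completions: the memorised one scores best.
for c in sorted(candidates, key=lambda c: -log_likelihood(prefix + c)):
    print(f"{log_likelihood(prefix + c):8.2f}  {c}")
```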
As a small scale demonstration, the authors trained a model on Continue reading
We’ve been covering papers from VLDB 2019 for the last three weeks, and next week it will be time to mix things up again. There were so many interesting papers at the conference this year though that I haven’t been able to cover nearly as many as I would like. So today’s post is a short summary of things that caught my eye that I haven’t covered so far. A few of these might make it onto The Morning Paper in weeks to come, you never know!
Updating graph databases with Cypher Green et al., VLDB’19
This is the story of a great collaboration between academia, industry, and users of the Cypher graph querying language as created by Neo4j. Beyond Neo4j, Cypher is also supported in SAP HANA Graph, RedisGraph, Agens Graph, and Memgraph. The Cypher for Apache Spark and Cypher over Gremlin projects are also both available in open source. The openCypher project brings together Cypher implementors across different projects and products, and aims to produce a full and open specification of the language. There is also a Graph Query Language (GQL) standards organisation.
Cypher is used in hundreds of production applications across many industry vertical domains, such as financial services, telecommunications, manufacturing and retail, logistics, government, and healthcare.
Personally I would have expected that number to be in the thousands by now, and there are some suggestions that it is; however, Neo4j are still only claiming ‘hundreds of customers’ on their own website.
The read-only core of the Cypher language has already been fully formalised. But when it came time to extend that formalism to include the update mechanisms, the authors ran into difficulties.
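For anyone who hasn’t seen them, the update mechanisms in question are clauses such as CREATE, MERGE, SET, and DELETE. Here’s a quick illustration of the kind of statement whose semantics need formalising, sent via the Python neo4j driver; the connection details and the Person/KNOWS schema are placeholders of mine.

```python
from neo4j import GraphDatabase

# Sketch only: connection details are placeholders, and the Person/KNOWS
# schema is invented for illustration. The point is the update clauses --
# CREATE, MERGE, SET, DELETE -- whose semantics the paper formalises.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

update_query = """
MERGE (a:Person {name: $a})          // create the node only if it doesn't exist
MERGE (b:Person {name: $b})
MERGE (a)-[r:KNOWS]->(b)             // idempotently create the relationship
ON CREATE SET r.since = date()       // set a property only on first creation
RETURN a.name, b.name, r.since
"""

with driver.session() as session:
    record = session.run(update_query, a="Alice", b="Bob").single()
    print(record)

driver.close()
```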
Our understanding of updates in the popular graph Continue reading
Fine-grained, secure and efficient data provenance on blockchain systems Ruan et al., VLDB’19
We haven’t covered a blockchain paper on The Morning Paper for a while, and today’s choice won the best paper award at VLDB’19. The goal here is to enable smart contracts to be written in which the contract logic depends on the history, or provenance of its inputs.
For example, a contract that sends a reward amount of tokens to a user based on that user’s average balance per day over some period.
That’s hard to do in today’s blockchain systems for two reasons:
In other words, smart contracts cannot access historical blockchain states in a tamper-evident manner.
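To make the motivating example concrete, here’s the sort of computation such a contract wants to perform, written as ordinary Python over hypothetical (date, balance) history records. The hard part the paper addresses is making exactly this kind of history available to a contract, with tamper-evidence.

```python
from datetime import date

# Hypothetical provenance records for one account: the balance in effect
# from each date onwards. Today a contract can only see the *current*
# state; the paper's goal is to let it query history like this.
history = [
    (date(2019, 7, 1), 100),
    (date(2019, 7, 10), 250),
    (date(2019, 7, 20), 50),
]

def average_daily_balance(history, start: date, end: date) -> float:
    """Average balance per day over [start, end), assuming history is sorted."""
    total_days = (end - start).days
    weighted = 0
    for i, (since, balance) in enumerate(history):
        until = history[i + 1][0] if i + 1 < len(history) else end
        lo, hi = max(since, start), min(until, end)
        if hi > lo:
            weighted += balance * (hi - lo).days
    return weighted / total_days

avg = average_daily_balance(history, date(2019, 7, 1), date(2019, 8, 1))
print(avg)   # the contract would then pay out some function of this average
```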
In designing a blockchain-friendly provenance mechanism, three aspects unique to the blockchain environment differentiate the problem from that of traditional database provenance:
Declarative recursive computation on an RDBMS… or, why you should use a database for distributed machine learning Jankov et al., VLDB’19
If you think about a system like Procella that’s combining transactional and analytic workloads on top of a cloud-native architecture, extensions to SQL for streaming, dataflow based materialized views (see e.g. Naiad, Noria, Multiverses, and also check out what Materialize are doing here), the ability to use SQL interfaces to query over semi-structured and unstructured data, and so on, then a picture begins to emerge of a unifying large-scale data platform with a SQL query engine on top that addresses all of the data needs of an organisation in a one-stop shop. Except there’s one glaring omission from that list: handling all of the machine learning use cases.
Machine learning inside a relational database has been done before, most notably in the form of MADlib, which was integrated into Greenplum during my time at Pivotal. The Apache MADlib project is still going strong, and the recent (July 2019) 1.16 release even includes some support for deep learning.
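To give a flavour of what ‘machine learning inside the database’ means at its very simplest (a toy of my own, not the paper’s system or MADlib), here’s batch gradient descent for a linear regression in which every gradient step is computed by a SQL aggregate query, using sqlite3.

```python
import sqlite3

# Toy flavour of in-database ML (not the paper's system): fit y ~ w*x + b
# by gradient descent, where each gradient is a SQL aggregate query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (x REAL, y REAL)")
conn.executemany("INSERT INTO points VALUES (?, ?)",
                 [(x, 3.0 * x + 1.0) for x in range(10)])

w, b, lr = 0.0, 0.0, 0.05
for _ in range(2000):
    # Gradients of the squared error, computed inside the database.
    grad_w, grad_b = conn.execute(
        "SELECT AVG((? * x + ? - y) * x), AVG(? * x + ? - y) FROM points",
        (w, b, w, b)).fetchone()
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 2), round(b, 2))   # converges towards 3.0 and 1.0
```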
To make that vision of a one-stop shop for all of an organisation’s data Continue reading
Procella: unifying serving and analytical data at YouTube Chattopadhyay et al., VLDB’19
Academic papers aren’t usually set to music, but if they were the chorus of Queen’s “I want it all (and I want it now…)” seems appropriate here. Anchored in the primary use case of supporting Google’s YouTube business, what we’re looking at here could well be the future of data processing at Google. Well, I say the future, but “Procella has now been in production for multiple years. Today, it is deployed in over a dozen data centers and serves hundreds of billions of queries per day over tens of petabytes of data…” So maybe what we’re looking at is the future of data processing for the rest of us!
Google already has Dremel, Mesa, Photon, F1, PowerDrill, and Spanner, so why did they need yet another data processing system? Because they had too many data processing systems! ;)
Large organizations… are dealing with exploding data volume and increasing demand for data driven applications. Broadly, these can be categorized as: reporting and dashboarding, embedded statistics in pages, time-series monitoring, and ad-hoc analysis. Typically, organizations build specialized infrastructure for each Continue reading
Experiences with approximating queries in Microsoft’s production big-data clusters Kandula et al., VLDB’19
I’ve been excited about the potential for approximate query processing in analytic clusters for some time, and this paper describes its use at scale in production. Microsoft’s big data clusters have 10s of thousands of machines, and are used by thousands of users to run some pretty complex queries. These clusters are in high demand and approximate query processing both saves their users time and lightens the overall load on the cluster.
What’s especially nice about this paper is that we get a glimpse into the practical adoption issues of persuading users that it’s ok to approximate too.
We have implemented support for query time sampling in production big-data clusters at Microsoft: these clusters consist of tens of thousands of multi-core, multi-disk servers and are used by developers from many different businesses including Bing, Azure, and Windows. In total, the clusters store a few exabytes of data and are primarily responsible for all of the batch analytics at Microsoft.
Control over approximation is put into the hands of the user via extensions to the query language. In particular, support for expressing sampling requirements Continue reading
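Even without the details of those language extensions, the basic contract is easy to sketch: sample a fraction of the rows, scale the aggregate back up, and report an error bound alongside the estimate. A toy version follows; the data, sampling rate, and estimator below are mine for illustration.

```python
import math
import random
import statistics

# Toy illustration of query-time sampling: estimate SUM(x) over a big table
# from a 1% uniform row sample and attach a rough 95% confidence interval.
# The production system exposes sampling via query-language extensions instead.
random.seed(42)
rows = [random.expovariate(1 / 100) for _ in range(1_000_000)]   # pretend numeric column
p = 0.01

sample = [x for x in rows if random.random() < p]                # uniform (Bernoulli) sample
n = len(sample)

estimate = len(rows) * statistics.mean(sample)                   # scale the sample mean back up
std_err = len(rows) * statistics.stdev(sample) / math.sqrt(n)    # ignores finite-population correction

print(f"true sum ~ {sum(rows):,.0f}")
print(f"estimate ~ {estimate:,.0f} +/- {1.96 * std_err:,.0f} (95% CI)")
```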
DDSketch: a fast and fully-mergeable quantile sketch with relative-error guarantees Masson et al., VLDB’19
Datadog handles a ton of metrics – some customers have endpoints generating over 10M points per second! For response times (latencies), reporting a simple metric such as ‘average’ is next to useless. Instead we want to understand what’s happening at different latency percentiles (e.g. p99).
The ability to compute quantiles over aggregated metrics has been recognized to be an essential feature of any monitoring system… Given how expensive calculating exact quantiles can be for both storage and network bandwidth, most monitoring systems will compress the data into sketches and compute approximate quantiles.
Fortunately there are plenty of quantile sketching algorithms available including the GK-sketch, the t-digest, the HDR histogram, and the Moments sketch that we looked at last year. For reasons we’ll see shortly though, none of those were good enough for Datadog, so they developed their own sketching data structure, DDSketch. Officially in the paper DDSketch stands for ‘Distributed Distribution Sketch’ but that seems a bit of a stretch… surely it’s the ‘Datadog Sketch’! A glance at the code repository for the Python implementation confirms my suspicion: there are several references to Continue reading
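The core idea is simple enough to sketch in a few lines: logarithmically-spaced buckets give a relative-error guarantee on quantiles, and two sketches merge by just adding bucket counts. Here’s a stripped-down version of my own; it omits the bucket-count limit that the real DDSketch enforces by collapsing its smallest buckets.

```python
import math
from collections import Counter

class MiniDDSketch:
    """Stripped-down DDSketch: relative-error quantiles over positive values.

    Values map to logarithmically-spaced buckets, so any reported quantile is
    within a factor of (1 +/- alpha) of the true value. The real DDSketch also
    bounds the number of buckets by collapsing the smallest ones; omitted here.
    """

    def __init__(self, alpha: float = 0.01):
        self.alpha = alpha
        self.gamma = (1 + alpha) / (1 - alpha)
        self.buckets = Counter()   # bucket index -> count
        self.count = 0

    def add(self, value: float) -> None:
        assert value > 0
        self.buckets[math.ceil(math.log(value, self.gamma))] += 1
        self.count += 1

    def merge(self, other: "MiniDDSketch") -> None:
        """Merging two sketches is just adding their bucket counts."""
        self.buckets.update(other.buckets)
        self.count += other.count

    def quantile(self, q: float) -> float:
        rank = q * (self.count - 1)
        seen = 0
        for i in sorted(self.buckets):
            seen += self.buckets[i]
            if seen > rank:
                # Midpoint of bucket i in value space: within alpha relative error.
                return 2 * self.gamma ** i / (self.gamma + 1)
        raise ValueError("empty sketch")

# e.g. two hosts each sketch their own latencies, then the sketches are merged centrally
s1, s2 = MiniDDSketch(), MiniDDSketch()
for v in range(1, 1001):
    s1.add(v)
    s2.add(10 * v)
s1.merge(s2)
print(round(s1.quantile(0.99), 1))   # true p99 over the merged data is ~9800
```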
SLOG: serializable, low-latency, geo-replicated transactions Ren et al., VLDB’19
SLOG is another research system motivated by the needs of the application developer (aka, user!). Building correct applications is much easier when the system provides strict serializability guarantees.
Strict serializability reduces application code complexity and bugs, since it behaves like a system that is running on a single machine processing transactions sequentially.
The challenge with strict serializability (or even just serializability on a regular DBMS) is that it requires coordination, and as we know, coordination kills performance. Weaker consistency models can give better performance yet “expose applications to potential race condition bugs, and typically require skilled application programmers.” But developers are the kingmakers (I think it’s been enough time now that we can drop the ‘new’ in that phrase?? ;) ), and thus:
… the demand for systems that support strict serializability has only increased.
So starting with strict serializability as a given, how do we claw back some of that performance? That’s where SLOG (Serializable LOw-latency, Geo-replicated transactions) comes in.
SLOG achieves high throughput, strictly serializable ACID transactions at geo-replicated scale for all transactions submitted across the world, all the while achieving low latency for transactions Continue reading
IPA: invariant-preserving applications for weakly consistent replicated databases Balegas et al., VLDB’19
IPA for developers, happy days!
Last week we looked at automating checks for invariant confluence, and extending the set of cases where we can show that an object is indeed invariant confluent. I’m not going to re-cover that background in this write-up, so I suggest you head over there for a quick catch-up before reading on if you missed it first time around.
Today’s paper is very much in the same spirit, building on the same foundation of invariant confluence (I-Confluence), and also on Indigo, which introduced an annotation model for application invariants, an invariant violation avoidance mechanism using lock reservations and escrows, and limited support for repairing violations that do happen.
With Invariant-Preserving Applications (IPAs), Balegas et al. introduce new mechanisms for avoiding invariant violations and for repairing them when detected, based on CRDTs. There’s also a very nice looking developer workflow to help ensure you’ve got all the bases covered. At the end of the day, you get the dual benefit of higher throughput and lower latency (as compared to coordination-based approaches) coupled with knowing that there isn’t some nasty invariant-violating concurrency bug waiting Continue reading
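To give a feel for the reservation/escrow idea that this line of work builds on, here’s a sketch of the general technique only, not IPA’s actual mechanism: to preserve an invariant like stock ≥ 0 without coordinating on every operation, each replica is handed a share of the allowed decrements up front, and only needs to coordinate when its share runs out.

```python
# Sketch of the escrow/reservation idea behind coordination-avoiding systems
# like Indigo and IPA (the general idea, not their implementation): to keep
# stock >= 0 without coordinating on every decrement, each replica is granted
# a share of the stock up front and only coordinates (re-balancing shares)
# when its local share is exhausted.

class EscrowReplica:
    def __init__(self, name: str, share: int):
        self.name = name
        self.share = share          # decrements this replica may apply locally

    def try_decrement(self, amount: int = 1) -> bool:
        """Apply a decrement locally iff the local share covers it."""
        if amount <= self.share:
            self.share -= amount
            return True
        return False                # would risk violating stock >= 0: must coordinate

# Total stock of 10 split across two replicas.
eu, us = EscrowReplica("eu", 6), EscrowReplica("us", 4)

print(eu.try_decrement(5))   # True  -- purely local, no cross-replica coordination
print(us.try_decrement(5))   # False -- share exhausted; needs to borrow from eu
```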
Choosing a cloud DBMS: architectures and tradeoffs Tan et al., VLDB’19
If you’re moving an OLAP workload to the cloud (AWS in the context of this paper), what DBMS setup should you go with? There’s a broad set of choices including where you store the data, whether you run your own DBMS nodes or use a service, the kinds of instance types to go for if you do run your own, and so forth. Tan et al. use the TPC-H benchmark to assess Redshift, Redshift Spectrum, Athena, Presto, Hive, and Vertica to find out what works best and the trade-offs involved.
We focused on OLAP-oriented parallel data warehouse products available for AWS and restricted our attention to commercially available systems. As it is infeasible to test every OLAP system runnable on AWS, we chose widely-used systems that represented a variety of architectures and cost models.
My key takeaways as a TL;DR: