Author Archives: adriancolyer
Author Archives: adriancolyer
Orca: differential bug localization in large-scale services Bhagwan et al., OSDI’18
Earlier this week we looked at REPT, the reverse debugging tool deployed live in the Windows Error Reporting service. Today it’s the turn of Orca, a bug localisation service that Microsoft have in production usage for six of their large online services. The focus of this paper is on the use of Orca with ‘Orion,’ where Orion is a codename given to a ‘large enterprise email and collaboration service that supports several millions of users, run across hundreds of thousands of machines, and serves millions of requests per second.’ We could it ‘Office 365’ perhaps? Like REPT, Orca won a best paper award (meaning MR scooped 2 out of the three awards at OSDI this year!).
Orca is designed to support on-call engineers (OCEs) in quickly figuring out the change (commit) that introduced a bug to a service so that it can be backed out. (Fixes can come later!). That’s a much harder task than it sounds in highly dynamic and fast moving environments. In ‘Orion’ for example there are many developers concurrently committing code. Post review the changes are eligible for inclusion in a Continue reading
REPT: reverse debugging of failures in deployed software Cui et al., OSDI’18
REPT (‘repeat’) won a best paper award at OSDI’18 this month. It addresses the problem of debugging crashes in production software, when all you have available is a memory dump. In particular, we’re talking about debugging Windows binaries. To effectively understand and fix bugs, a developer wants to be able to follow the path leading up to the point of failure. Not just the control flow, but also the data values involved. This is known as reverse debugging. What’s so clever about REPT is that it enables reverse debugging without the high overheads of record/replay systems (typically up to 200%). It combines low overhead hardware tracing (Intel PT) to record a programs control flow, with a novel binary analysis technique to recover (a very good percentage of) data flow information. Evaluated on 16 real-world bugs in software such as Chrome, Apache, PHP, and Python, REPT enabled effective reverse debugging for 14 of them, including 2 concurrency bugs.
REPTs offline binary analysis and reverse debugging is integrated into WinDbg, and the Windows Error Reporting service (WER) is enhanced to support REPT so that developers can request Intel PT Continue reading
Capturing and enhancing in situ system observability for failure detection Huang et al., OSDI’18
The central idea in this paper is simple and brilliant. The place where we have the most relevant information about the health of a process or thread is in the clients that call it. Today the state of the practice is to log and try to recover from a failed call at the client, while a totally separate failure detection infrastructure is responsible for figuring out whether or not things are working as desired. What Panorama does is turn clients into observers and reporters of the components they call, using these observations to determine component health. It works really well!
Panorama can easily integrate with popular distributed systems and detect all 15 real-world gray failures that we reproduced in less than 7s, whereas existing approaches detect only one of them in under 300s.
Panaroma is open source and available at https://github.com/ryanphuang/panorama.
Panaroma is primarily design to catch gray failures, in which components and systems offer degraded performance but typically don’t crash-stop. One example of such a failure is a ZooKeeper cluster that could no longer service write Continue reading
Automatic discovery of tactics in spatio-temporal soccer match data Decroos et al., KDD’18
Here’s a fun paper to end the week. Data collection from sporting events is now widespread. This fuels an endless thirst for team and player statistics. In terms of football (which shall refer to the game of soccer throughout this write-up) that leads to metrics such as completed passes, distance covered, intercepts, shots-on-goal, and so on. Decroos et al. want to go one level deeper though, and use the data to uncover team tactics. The state of the art today for tactical analysis still involves watching hours of video footage.
This paper focuses on the problem of detecting tactics from professional soccer matches based on spatio-temporal data.
Now when I think of tactics, a key component in my mind is the team shape and movement of players off the ball. Unfortunately Decroos et al., don’t have the data available to analyse that. So they have to do what they can based on more limited information.
Our dataset consists of event data for the English Premier League for the 2015/2016 season. This event data was manually collected by humans who watch video feeds of the matches Continue reading
Detecting spacecraft anomalies using LSTMs and nonparametric dynamic thresholding Hundman et al., KDD’18
How do you effectively monitor a spacecraft? That was the question facing NASA’s Jet Propulsion Laboratory as they looked forward towards exponentially increasing telemetry data rates for Earth Science satellites (e.g., around 85 terabytes/day for a Synthetic Aperture Radar satellite).
Spacecraft are exceptionally complex and expensive machines with thousands of telemetry channels detailing aspects such as temperature, radiation, power, instrumentation, and computational activities. Monitoring these channels is an important and necessary component of spacecraft operations given their complexity and cost. In an environment where a failure to detect and respond to potential hazards could result in the full or partial loss of spacecraft, anomaly detection is a critical tool to alert operations engineers of unexpected behavior.
Anomaly detection systems deployed today typically consist of tiered alarms indicating when values fall outside of pre-defined limits. There are also limited instances of expert systems and nearest-neighbour based approaches being tried, but their limitations prevented widespread adoption. A more accurate and scalable approach to anomaly detection that makes better use of limited engineering resources is required.
Any such system needs to work with data that is highly Continue reading
Online parameter selection for web-based ranking problems Agarwal et al., KDD’18
Last week we looked at production systems from Facebook, Airbnb, and Snap Inc., today it’s the turned of LinkedIn. This paper describes the system and model that LinkedIn use to determine the items to be shown in a user’s feed:
It replaces previous hand-tuning of the feature mix and ranking:
In the past, a developer spent days to accurately estimate [the feature weights] with desired behavior with every significant drift and intervention. Using the approach described in this paper, we have completely eliminated the laborious manual tuning, significantly increased developer productivity as well as obtained better Pareto optimal solutions.
Here are representative results comparing the impact on three key metrics for a setting found by the new algorithm over a period of 36 hours, vs the best setting that could be found in 7 days of hand-tuning:
Viral Actions is the most important metric for the LinkedIn feed, ‘with far reaching consequences from more subsequent sessions to more revenue.’ The algorithm is tasked with maximising this metric while not allowing others to drop by more than 1%. (Increases in the other metrics as well are Continue reading
Churn rates (how fast users abandon your app / service) are really important in modelling a business. If the churn rate is too high, it’s hard to maintain growth. Since acquiring new customers is also typically much more expensive than expanding in existing accounts, churn hits you twice over. So it’s really important to understand what causes users to churn in your business, and ideally to be able to predict users likely to churn so that you can intervene. This paper describes ClusChurn, the churn prediction system deployed at Snapchat.
It has been argued for decades that acquiring new users is often more expensive than keeping old ones. Surprisingly, however, user retention and its core component, churn prediction, have received much less attention from the research community… While this paper focuses on the Snapchat data as a comprehensive example, the techniques developed here can be readily leveraged for other online platforms, where users interact with the platform functions as well as other users.
The central idea in ClusChurn is to cluster users into clusters that are meaningful Continue reading
Customized regression model for Airbnb dynamic pricing Ye et al., KDD’18
This paper details the methods that Airbnb use to suggest prices to listing hosts (hosts ultimately remain in control of pricing on the Airbnb platform).
The proposed strategy model has been deployed in production for more than 1 year at Airbnb. The launch of the first iteration of the strategy model yielded significant gains on bookings and booking values for hosts who have adopted our suggestions… multiple iterations of the strategy model have been experimented [with] and launched into production to further improve the quality of our price suggestions.
Figuring out the right price for a night in a given Airbnb listing is challenging because no two listings are the same. Even when we constrain to e.g. similar sized properties in the same region, factors such as the number of five star reviews can influence price. Furthermore demand is time-varying due to seasonality and regional events (with different seasonality patterns for different countries). And then of course, how far in advance a booking is being made also factors into the price (“as lead time reduces, there are less opportunities for this night to be booked, which Continue reading
Rosetta: large scale system for text detection and recognition in images Borisyuk et al., KDD’18
Rosetta is Facebook’s production system for extracting text (OCR) from uploaded images.
In the last several years, the volume of photos being uploaded to social media platforms has grown exponentially to the order of hundreds of millions every day, presenting technological challenges for processing increasing volumes of visual information… our problem can be stated as follows: to build a robust and accurate system for optical character recognition capable of processing hundreds of millions of images per day in realtime.
Images uploaded by clients are added to a distributed processing queue from which Rosetta inference machines pull jobs. Online image processing consists of the following steps:
Columnstore and B+ tree – are hybrid physical designs important? Dziedzic et al., SIGMOD’18
Earlier this week we looked at the design of column stores and their advantages for analytic workloads. What should you do though if you have a mixed workload including transaction processing, decision support, and operational analytics? Microsoft SQL Server supports hybrid physical design combining both column store and B+ tree indexes in the same database.
It is generally understood that columnstores are crucial to achieving high performance for analytic queries and that B+ tree indexes are key to supporting transactional workloads efficiently. However, it is not well understood whether hybrid physical designs – both columnstore and B+ tree indices on the same database and potentially the same table – are important for any of the above workloads.
Through a series of benchmarks the authors show that hybrid physical designs can result in more than an order of magnitude lower execution costs for many workloads when compared to alternatives using B+ tree-only or columnstore-only. The Database Engine Tuning Advisor (DTA) for SQL Server is extended to analyze and recommend the appropriate indices for a given workload. Support for columnstore indices and the new DTA functionality was Continue reading
The design and implementation of modern column-oriented database systems Abadi et al., Foundations and trends in databases, 2012
I came here by following the references in the Smoke paper we looked at earlier this week. “The design and implementation of modern column-oriented database systems” is a longer piece at 87 pages, but it’s good value-for-time. What we have here is a very readable overview of the key techniques behind column stores.
Column stores are relational databases that store data by column rather than by row. Whereas a traditional row-based store stores all attributes of one row together, followed by the attributes of the next row, and so on, a column-based stored uses one logical file per attribute (column). The column-oriented layout makes it efficient to read just the columns you need for a query, without pulling in lots of redundant data.
Data for a column may be stored in an array with implicit ids (a), or in some format with explicit ids (b).
Since data transfer costs from storage (or through a storage hierarchy) are often the major performance bottlenecks in database systems, while at the same time database schemas are becoming more and Continue reading
Smoke: fine-grained lineage at interactive speed Psallidas et al., VLDB’18
Data lineage connects the input and output data items of a computation. Given a set of output records, a backward lineage query selects a subset of the output records and asks “which input records contributed to these results?” A forward lineage query selects a subset of the input records and asks, “which output records depend on these inputs?”. Lineage-enabled systems capture record-level relationships throughout a workflow and support lineage queries.
Data lineage is useful in lots of different applications; this paper uses as its main example interactive visualisation systems. This domain requires fast answers to queries and is typically dominated by hand-written implementations. Consider the two views in the figure below. When the user selects a set of marks in , marks derived from the same records are highlighted in (linked brushing).
A typical visualisation system implements this manually, but it can equally be viewed as a backward lineage query from the selection points in , followed by a forward lineage query from the resulting input records to .
Same-different problems strain convolutional neural networks Ricci et al., arXiv 2018
Since we’ve been looking at the idea of adding structured representations and relational reasoning to deep learning systems, I thought it would be interesting to finish off the week with an example of a problem that seems to require it: detecting whether objects in a scene are the same or different.
This image containing a flute was correctly classified by a CNN trained on millions of photographs. On ImageNet the network even surpassed the accuracy of a human observer.
This image contains two shapes that are the same, a relationship that is immediately obvious to a human observer. “Yet, the CNN failed to learn this relation even after seeing millions of training examples.”
The above is an example of a same-different (SD) visual relation problem (output whether the objects in the scene are the same, or different). Spatial relation (SR) problems ask whether objects follow a certain spatial relation, e.g. in a line, horizontally stacked, vertically stacked, and so on. For example:
The synthetic visual reasoning test (SVRT) contains a collection of 23 binary classification problems along these lines. In each case opposing classes differ Continue reading
Relational inductive biases, deep learning, and graph networks Battaglia et al., arXiv’18
Earlier this week we saw the argument that causal reasoning (where most of the interesting questions lie!) requires more than just associational machine learning. Structural causal models have at their core a graph of entities and relationships between them. Today we’ll be looking at a position paper with a wide team of authors from DeepMind, Google Brain, MIT, and the University of Edinburgh, which also makes the case for graph networks as a foundational building block of the next generation of AI. In other words, bringing back and re-integrating some of the techniques from the AI toolbox that were prevalent when resources were more limited.
We argue that combinatorial generalization must be a top priority for AI to achieve human-like abilities, and that structured representation and computations are key to realizing this objective… We explore how using relational inductive biases within deep learning architectures can facilitate learning about entities, relations, and the rules for composing them.
Human’s represent complex systems as compositions of entities and their interactions. We use hierarchies to abstract away fine-grained differences, manage part-whole associations and other more Continue reading
With thanks to @osmandros for sending me a link to this paper on twitter.
In this technical report Judea Pearl reflects on some of the limitations of machine learning systems that are based solely on statistical interpretation of data. To understand why? and to answer what if? questions, we need some kind of a causal model. In the social sciences and especially epidemiology, a transformative mathematical framework called ‘Structural Causal Models’ (SCM) has seen widespread adoption. Pearl presents seven example tasks which the model can handle, but which are out of reach for associational machine learning systems.
A useful insight unveiled by the theory of causal models is the classification of causal information in terms of the kind of questions that each class is capable of answering. This classification forms a 3-level hierarchy in the sense that questions at level i (i = 1, 2 ,3 ) can only be answered if information from level j (j ≥ i) is available.
The lowest (first) layer is called Association and it involves purely statistical relationships defined by the naked data. This Continue reading
An empirical analysis of anonymity in Zcash Kappos et al., USENIX Security’18
As we’ve seen before, in practice Bitcoin offers little in the way of anonymity. Zcash on the other hand was carefully designed with privacy in mind. It offers strong theoretical guarantees concerning privacy. So in theory users of Zcash can remain anonymous. In practice though it depends on the way those users interact with Zcash. Today’s paper choice, ‘An empirical analysis of anonymity in Zcash’ studies how identifiable transaction participants are in practice based on the 2,242,847 transactions in the blockchain at the time of the study.
We conclude that while it is possible to use Zcash in a private way, it is also possible to shrink its anonymity set considerably by developing simple heuristics based on identifiable patterns of usage.
The analysis also provides some interesting insights into who is using Zcash and for what as well. Founders and miners combined account for around 66% of the value drawn from the shielded pool.
The code for the analysis is available online at https://github.com/manganese/zcash-empirical-analysis
Zcash is based on highly regarded research including a cryptographic proof of the main privacy feature Continue reading
QSYM: a practical concolic execution engine tailored for hybrid fuzzing Yun et al., USENIX Security 2018
There are two main approaches to automated test case generated for uncovering bugs and vulnerabilities: fuzzing and concolic execution. Fuzzing is good at quickly exploring the input space, but can get stuck when trying to get past more complex conditional causes (i.e., when randomly generated inputs are unlikely to satisfy them). Concolic execution, which we saw in action earlier in the week, uses symbolic execution to uncover constraints and pass them to a solver. It can handle complex branch conditions, but it’s much slower. Hybrid fuzzers combine both coverage-guided fuzzing and concolic execution, bringing in the big guns (concolic) when the fuzzer gets stuck. In non-trivial real-world applications though, even the hybrid approach has been too slow. Until now.
For me, the attention grabbing paragraph in this paper is to be found on page 8 (752) in section 5.1. Google’s OSS-Fuzz was previously used to test a number of important real-world applications and libraries including libjpeg, libpng, libtiff, lepton, openjpge, tcpdump, file, libarchive, audiofile, ffmpeg, and binutils.
It is worth noting that Google’s OSS-Fuzz generated 10 trillion test inputs Continue reading
NAVEX: Precise and scalable exploit generation for dynamic web applications Alhuzali et al., USENIX Security 2018
NAVEX (https://github.com/aalhuz/navex) is a very powerful tool for finding executable exploits in dynamic web applications. It combines static and dynamic analysis (to cope with dynamically generated web content) to find vulnerable points in web applications, determine whether inputs to those are appropriately sanitised, and then builds a navigation graph for the application and uses it to construct a series of HTTP requests that trigger the vulnerability.
It also works at real-world scale: NAVEX was used on 26 PHP applications with a total of 3.2M SLOC and 22.7K PHP files. It generated 204 concrete exploits across these applications in a total of 6.5 hours. While the current implementation of NAVEX targets PHP applications, the approach could be generalised to other languages and frameworks.
In this paper, our main contribution is a precise approach for vulnerability analysis of multi-tier web applications with dynamic features… our approach combines dynamic analysis of web applications with static analysis to automatically identify vulnerabilities and generate concrete exploits as proof of those vulnerabilities.
Here’s a example of what NAVEX can do. From the 64K Continue reading
Unveiling and quantifying Facebook exploitation of sensitive personal data for advertising purposes Cabañas et al., USENIX Security 2018
Earlier this week we saw how the determined can still bypass most browser and tracker-blocking extension protections to track users around the web. Today’s paper is a great example of why you should care about that. Cabañas et al. examine the extent to which the profile Facebook builds on its users includes sensitive personal data made available to advertisers for targeting. The work was done just prior to the GPDR coming into force, which makes for very interesting timing from a legal perspective. The headline result is that it looks like Facebook is holding sensitive data on about 40% of the EU population, and that this data can be used by third-parties to target individuals in sensitive demographics and even identify them at a cost of as little as €0.015 per user.
The GDPR definition of sensitive personal data is “data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, and the processing of genetic data, biometric data for the purpose of uniquely identifying Continue reading
Who left open the cookie jar? A comprehensive evaluation of third-party cookie policies from the Franken et al., USENIX Security 2018
This paper won a ‘Distinguished paper’ award at USENIX Security 2018, as well as the 2018 Internet Defense Prize. It’s an evaluation of the defense mechanisms built into browsers (and via extensions / add-ons) that seek to protect against user tracking and cross-site attacks. Testing across 7 browsers and 46 browser extensions, the authors find that for virtually every browser and extension combination there is a way to bypass the intended security policies.
When you visit a site A, it can set a cookie to be included in Continue reading