An overview of end-to-end entity resolution for big data

An overview of end-to-end entity resolution for big data, Christophides et al., ACM Computing Surveys, Dec. 2020, Article No. 127

The ACM Computing Surveys are always a great way to get a quick orientation in a new subject area, and hot off the press is this survey on the entity resolution (aka record linking) problem. It’s an important part of many modern data workflows, and an area I’ve been wrestling with in one of my own projects.

Entity Resolution (ER) aims to identify different descriptions that refer to the same real-world entity appearing either within or across data sources, when unique entity identifiers are not available.

When ER is applied to records from the same data source it can be used for deduplication; when it is used to join records across data sources we call it record linking. Doing this well at scale is non-trivial: at its core, the problem requires comparing each entity to every other, i.e., it is quadratic in input size.
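
To make the quadratic cost concrete, here is a minimal sketch of naive deduplication in Python. It is not the survey's method: the string-similarity measure (`difflib.SequenceMatcher`), the `0.85` threshold, the helper names, and the sample records are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (an illustrative choice)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def naive_dedup(descriptions, threshold=0.85):
    """Compare every description with every other one.

    Returns the index pairs judged to match and the number of comparisons
    made. The threshold is a hypothetical value, not taken from the survey.
    """
    comparisons = 0
    matches = []
    # combinations(..., 2) yields each unordered pair once: n * (n - 1) / 2 pairs.
    for (i, a), (j, b) in combinations(enumerate(descriptions), 2):
        comparisons += 1
        if similarity(a, b) >= threshold:
            matches.append((i, j))
    return matches, comparisons


# Made-up sample records for illustration only.
records = [
    "Jon Smith, 12 Main St",
    "John Smith, 12 Main St",
    "Alice Wong, 4 Elm Rd",
    "Bob Lee, 9 Oak Ave",
]
matches, comparisons = naive_dedup(records)
print(matches, comparisons)
```

With n records the loop runs n(n-1)/2 times, so doubling the input roughly quadruples the work. That growth is why doing this at scale needs something smarter than all-pairs comparison.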

An individual record/document for an entity is called an entity description. A set of such descriptions is an entity collection. Two descriptions that correspond to the same real world entity are called matches or Continue reading

Privacy and Compliance Reading List


Privacy matters. Privacy and Compliance are at the heart of Cloudflare's products and solutions. We are committed to providing built-in data protection and privacy throughout our global network and for every product in our portfolio. This is why we have dedicated a whole week to highlight important aspects of how we are working to make sure privacy will stay at the core of all we do as a business.

In case you missed any of the blog posts this week addressing the topics of Privacy and Compliance, you’ll find a summary below.

Welcome to Privacy & Compliance Week: Reflecting Values at Cloudflare’s Core

We started the week with this introduction by Matthew Prince. The blog post summarizes the early decisions that the founding team made to make sure customer data is kept private, that we do not sell or rent this data to third parties, and why trust is the foundation of our business. > Read the full blog post.

Introducing the Cloudflare Data Localization Suite

Cloudflare’s network is private and compliant by design. Preserving end-user privacy is core to our mission of helping to build a better Internet; we’ve never sold personal data about customers or end-users of our Continue reading

Supporting Jurisdictional Restrictions for Durable Objects


Over the past week, you’ve heard how Cloudflare is making it easy for our customers to control where their data is stored and protected.

We’re not the only ones building these data controls. Around the world, companies are working to figure out where and how to store customer data in a way that is compliant with data localization obligations. For developers, this means new deployment models and new headaches — wrangling infrastructure in multiple regions, partitioning user data based on location, and staying on top of the latest rules from regulators.

Durable Objects, currently in limited beta, already make it easy for customers to manage state on Cloudflare Workers without worrying about provisioning infrastructure. Today, we’re announcing Jurisdictional Restrictions for Durable Objects, which ensure that a Durable Object only stores and processes data in a given geographical region. Jurisdictional Restrictions make it easy for developers to build serverless, stateful applications that not only comply with today’s regulations, but can handle new and updated policies as new regulations are added.

How Jurisdictional Restrictions Work

When creating a Durable Object, developers generate a unique ID that lets a Cloudflare Worker communicate with the Object.

Let’s say I want to create a Durable Continue reading

Learning To Make The Machine Part Of AI Invisible And Easy

About 15 years ago, as Swami Sivasubramanian was making his way from grad school back into the working world, he saw that developers and builders at enterprises were being held back not by their skills or their ideas, but by their inability to access the technology needed to bring those ideas to the fore.

Learning To Make The Machine Part Of AI Invisible And Easy was written by Jeffrey Burt at The Next Platform.

DNS Flag Day 2020

One of the outcomes of the 'stacked' architecture of network protocol design is that upper level protocols should not try to do the job of the lower layers. Packet adaptation through fragmentation is an IP layer 'problem' and applications do not have to concern themselves with this. We've come some distance from this position and these days many applications need to be highly aware of transport layer and IP layer properties, and the DNS is no exception. There have been some recent steps in the DNS with the DNS Flag Day 2020 to try to tune the DNS to avoid packet fragmentation. How bad is the problem with packet fragmentation and do the DNS Flag Day measures address the issue?
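
As a rough illustration of what "tuning the DNS to avoid fragmentation" looks like on the wire, the sketch below builds a DNS query whose EDNS(0) OPT record advertises a UDP payload size of 1232 bytes. That figure is the default recommended by the DNS Flag Day 2020 effort (1280-byte IPv6 minimum MTU, minus 40 bytes of IPv6 header and 8 bytes of UDP header); it is not stated in the excerpt above. The function names, query ID, and domain are hypothetical, and the code only constructs the message; it does not send it.

```python
import struct

# Default EDNS buffer size recommended by DNS Flag Day 2020:
# 1280 (IPv6 minimum MTU) - 40 (IPv6 header) - 8 (UDP header).
EDNS_UDP_SIZE = 1280 - 40 - 8

OPT_RR_TYPE = 41  # EDNS(0) pseudo-record type


def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        out += struct.pack("!B", len(raw)) + raw
    return out + b"\x00"


def build_query(name: str, qtype: int = 1, query_id: int = 0x1234,
                udp_size: int = EDNS_UDP_SIZE) -> bytes:
    """Build a DNS query with one question and one OPT record."""
    # Header: ID, flags (RD set), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=1.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 1)
    # Question: QNAME, QTYPE (1 = A), QCLASS (1 = IN).
    question = encode_name(name) + struct.pack("!HH", qtype, 1)
    # OPT RR: root name, TYPE=41, CLASS carries the UDP payload size,
    # TTL holds extended RCODE/version/flags (all zero), RDLENGTH=0.
    opt = b"\x00" + struct.pack("!HHIH", OPT_RR_TYPE, udp_size, 0, 0)
    return header + question + opt


def advertised_udp_size(message: bytes) -> int:
    """Read the payload size back from a message produced by build_query."""
    rr_type, udp_size = struct.unpack("!HH", message[-10:-6])
    if rr_type != OPT_RR_TYPE:
        raise ValueError("message does not end with an empty OPT record")
    return udp_size


query = build_query("example.com")
print(len(query), advertised_udp_size(query))
```

A responder that honours the advertised size keeps its UDP answers within 1232 bytes and signals truncation for anything larger, which pushes big responses to TCP instead of relying on IP fragmentation.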

How Long Should You Practice

A reporter once asked boxing legend Muhammad Ali how many sit-ups he did each day. I’m sure the reporter wasn’t expecting Ali’s answer. Ali replied with:

“I don’t know. I don’t start counting them until it hurts. Those are the only ones that count. That’s what makes you a champion.”

Ali knew that counting things is just a numbers game. Five hundred poor sit-ups don’t count as much as fifty done the right way. With any practice that you do, the only things that count are the things that teach you something or that push you to be better.

Don’t Practice Until It’s Right

People used to ask me how long I would spend at night studying for the CCIE lab. I told them I usually spent between five and seven hours depending on what I was studying. Sometimes those people would say things like “I’m not talking about setup time. I’m talking about actual lab work.” I always countered by making them explain why the setup isn’t part of the “real” work. That’s usually when they went quiet.

It’s far too easy to fall into the trap of overlooking things that you think are unimportant. A task Continue reading

NS1: Avoid the Trap of DNS Single-Point-of-Failure

Third-party DNS providers have seen tremendous consolidation during the past few years, resulting in dependence on a smaller pool of providers that maintain the world’s largest website lookups. Reliance on only one of a few single DNS providers also represents a heightened risk in the event of a Carnegie Mellon University, 89.2% of the CDN MaxCDN, the researchers noted. A

JUNOS | Layer 2 Circuit | MPLS-TE | PSN-Tunnel Endpoint

In this post we’re going to explore a technique for steering Layer 2 Circuit traffic onto a dedicated MPLS-TE LSP using JUNOS. The use case is fairly popular amongst Service Providers where special treatment is desired for certain Layer 2 Circuits. This special treatment could be the need for the traffic to follow a certain explicit path through the network, or perhaps there are other traffic-engineering constraints that are required. A good example of this is to create a deterministic state through the network in order to guarantee path diversity or a low latency path. This technique can be used alongside LDP, RSVP or SR.

Requirements
– Layer 2 Circuit traffic between CE4 and CE1 must use a dedicated traffic-engineered LSP via the P routers.
– No other traffic is permitted to use the LSP.
– All other traffic must continue to use LDP to reach the egress PE.

Lab Overview
The IGP is based on OSPF and LDP is used as the default label distribution protocol.
PE1 vSRX1 (Ingress PE): 20.1R1.11
PE2 CSR1000V1 (Egress PE): 16.11.01b

Layer 2 Circuit
Firstly, let’s create Layer 2 Circuits between PE1 and PE2 and observe the normal default behaviour.

Continue reading

Technology Short Take 135

Welcome to Technology Short Take #135! This will likely be the last Technology Short Take of 2020, so it’s a tad longer than usual. Sorry about that! You know me—I just want to make sure everyone has plenty of technical content to read during the holidays. And speaking of holidays…whatever holidays you do (or don’t) celebrate, I hope that the rest of the year is a good one for you. Now, on to the content!

Networking

  • Arthur Chiao cracks open kube-proxy, a key part of Kubernetes networking, to expose the internals, and along the way exposes readers to a few different technologies. This is a good read if you’re trying to better understand some aspects of Kubernetes networking.
  • Gian Paolo takes a look at using tools like curl and jq when working with networking-related APIs.
  • It’s not unusual to see “networking professionals need to learn developer tools,” but how often do you see “developers need to learn these networking tools”? Martin Heinz discusses that very topic in this post.

Servers/Hardware

Continue reading

Is LinkedIn Still Relevant to Your Career?

We all know LinkedIn has been around since 2002. It is widely known as a place where professionals from almost every field can be found. These days, people wonder whether joining this social media platform is still worthwhile. If you don’t use LinkedIn properly, your messages, resumes, and other activity on the site can eat up hours that you could have spent being productive in other ways.

Do You Need a LinkedIn Profile for Your Career?

The simple answer to this question is “yes.” Even if you rarely or never use it, it’s worth creating an account. It costs nothing but time. You can then put a reminder on your calendar to check the site every 6 months or so and add anything new. Make sure you include the major accomplishments of your career. During these scheduled checks, you should also update your current job listing whenever you change jobs.

Hiring managers and recruiters make use of LinkedIn to look Continue reading

Encrypting your WAF Payloads with Hybrid Public Key Encryption (HPKE)


The Cloudflare Web Application Firewall (WAF) blocks more than 72B malicious requests per day from reaching our customers’ applications. Typically, our users can easily confirm these requests were not legitimate by checking the URL, the query parameters, or other metadata that Cloudflare provides as part of the security event log in the dashboard.

Sometimes investigating a WAF event requires a bit more research and a trial-and-error approach, as the WAF may have matched against a field that is not logged by default.

Not logging all parts of a request is intentional: HTTP headers and payloads often contain sensitive data, including personally identifiable information, which we consider a toxic asset. Request headers may contain cookies, and POST payloads may contain username and password pairs submitted during a login attempt, among other sensitive data.

We recognize that providing clear visibility in any security event is a core feature of a firewall, as this allows users to better fine tune their rules. To accomplish this, while ensuring end-user privacy, we built encrypted WAF matched payload logging. This feature will log only the specific component of the request the WAF has deemed malicious — and it is encrypted using a customer-provided key Continue reading

Developing NetBox Plugin – Part 1 – Setup and initial build

This is the first post in my series showing how to develop a NetBox plugin. We'll talk about what NetBox plugins are and why you would want one. Then I'll show you how to set up a development environment. We'll finish by building a base version of our custom plugin.


What are NetBox plugins?

NetBox plugins are small, self-contained applications that add new functionality. This could range from adding a new API endpoint to fully fledged apps. These apps can provide their own data models, views, background tasks and more. We can also inject content Continue reading