Leveraging open technologies to monitor packet drops in AI cluster fabrics

In this talk from the recent OCP Global Summit, Aldrin Isaac of eBay describes the challenge: AI clusters operate most efficiently over lossless networks, because dropped packets can significantly lengthen job completion times. Although networks can be designed to minimize packet loss by choosing the right topology and optimizing network devices and protocols, an effective tool for monitoring and troubleshooting network performance is still required. Such a tool should capture packet drops, raise notifications, identify the drop reasons, and pinpoint where the drops caused congestion. In turn, it allows the governing management application to tune the configuration of relevant infrastructure components, including switches, NICs, and GPU servers.
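To make the "capture, identify, pinpoint" idea concrete, here is a minimal sketch of how a monitoring application might aggregate drop notifications by device, port, and reason. The `DropEvent` record, its field names, the reason strings, and the device names are all hypothetical simplifications for illustration; they are not the sFlow wire format or eBay's implementation.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class DropEvent:
    """Hypothetical, simplified drop record (not the sFlow structure)."""
    device: str  # switch that reported the drop
    port: str    # interface where the packet was dropped
    reason: str  # drop reason carried with the notification


def summarize_drops(events):
    """Count drops per (device, port, reason) so hot spots stand out."""
    return Counter((e.device, e.port, e.reason) for e in events)


def hot_spots(events, threshold):
    """Return (key, count) pairs at or above the threshold, busiest first."""
    counts = summarize_drops(events)
    return [(key, n) for key, n in counts.most_common() if n >= threshold]


# Illustrative events; names and reasons are made up.
events = [
    DropEvent("leaf1", "Ethernet8", "buffer_full"),
    DropEvent("leaf1", "Ethernet8", "buffer_full"),
    DropEvent("leaf1", "Ethernet8", "buffer_full"),
    DropEvent("spine2", "Ethernet0", "acl_deny"),
]

for (device, port, reason), n in hot_spots(events, threshold=2):
    print(f"{device} {port}: {n} drops ({reason})")
```

A management application could use a summary like this to decide which switch or port configuration to revisit first.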

The talk shares the results and best practices of a TAM (Telemetry and Monitoring) solution being prepared for deployment at eBay. It leverages OCP’s SAI and open sFlow drop notification technologies as part of eBay’s ongoing initiatives to adopt open networking hardware and community SONiC for its data centers.

The sFlow Dropped Packet Notification Structures extension mentioned in the talk adds real-time packet drop notifications (including dropped packet header and drop reason) as part of an industry standard sFlow telemetry feed, making the data available to open source and commercial sFlow analytics Continue reading

Building Vectorize, a distributed vector database, on Cloudflare’s Developer Platform

Vectorize is a globally distributed vector database that enables you to build full-stack, AI-powered applications with Cloudflare Workers. Vectorize makes querying embeddings — representations of values or objects like text, images, audio that are designed to be consumed by machine learning models and semantic search algorithms — faster, easier and more affordable.

In this post, we dive deep into how we built Vectorize on Cloudflare’s Developer Platform, leveraging Cloudflare’s global network, Cache, Workers, R2, Queues, Durable Objects, and container platform.

What is a vector database?

A vector database is a queryable store of vectors. A vector is a large array of numbers called vector dimensions.

A vector database has a similarity search query: given an input vector, it returns the vectors that are closest according to a specified metric, potentially filtered on their metadata.
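The query described above can be sketched as a brute-force search. This is only an illustration of the concept, not how Vectorize is implemented: the record layout (`id`, `values`, `metadata`), the `query` function, and the choice of cosine similarity as the metric are assumptions, and the vectors are assumed to be non-zero.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def query(index, vector, top_k=2, metadata_filter=None):
    """Score every stored vector and return the ids of the top_k closest."""
    scored = []
    for item in index:
        # Optional exact-match filter on metadata fields.
        if metadata_filter and any(
            item["metadata"].get(k) != v for k, v in metadata_filter.items()
        ):
            continue
        scored.append((cosine_similarity(vector, item["values"]), item["id"]))
    scored.sort(reverse=True)  # highest similarity first
    return [item_id for _, item_id in scored[:top_k]]


# Tiny two-dimensional index; real embeddings have far more dimensions.
index = [
    {"id": "a", "values": [1.0, 0.0], "metadata": {"lang": "en"}},
    {"id": "b", "values": [0.0, 1.0], "metadata": {"lang": "en"}},
    {"id": "c", "values": [0.9, 0.1], "metadata": {"lang": "de"}},
]

print(query(index, [1.0, 0.0]))
print(query(index, [1.0, 0.0], metadata_filter={"lang": "en"}))
```

Scanning every vector costs time proportional to the size of the index, which is one reason dedicated vector databases rely on specialized index structures instead.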

Vector databases are used to power semantic search, document classification, and recommendation and anomaly detection, as well as contextualizing answers generated by LLMs (Retrieval Augmented Generation, RAG).

Why do vectors require special database support?

Conventional data structures like B-trees or binary search trees expect the data they index to be cheap to compare and to follow Continue reading

Is this thing on? Using OpenBMC and ACPI power states for reliable server boot

Introduction

At Cloudflare, we provide a range of services through our global network of servers, located in 330 cities worldwide. When you interact with our long-standing application services, or newer services like Workers AI, you’re in contact with one of our fleet of thousands of servers which support those services.

The servers that provide Cloudflare services are managed by a Baseboard Management Controller (BMC). The BMC is a special-purpose processor, separate from the server's Central Processing Unit (CPU), whose sole purpose is ensuring the smooth operation of the server.

Regardless of the server vendor, each server has this BMC. The BMC runs independently of the CPU and has its own embedded operating system, usually referred to as firmware. At Cloudflare, we customize and deploy a server-specific version of the BMC firmware. The BMC firmware we deploy at Cloudflare is based on the Linux Foundation Project for BMCs, OpenBMC. OpenBMC is an open-sourced firmware stack designed to work across a variety of systems including enterprise, telco, and cloud-scale data centers. The open-source nature of OpenBMC gives us greater flexibility and ownership of this critical server subsystem, instead of the closed nature of proprietary firmware. Continue reading

Caddy Reverse Proxy With Docker

I currently run multiple Docker containers across two hosts, each hosting various applications on different ports. UniFi Controller exposes the web GUI on 8443, Pi-hole on 8080, and Memos on 5230. Remembering each port number for every application started to become a hassle.

Additionally, most of these applications, like Pi-hole and Memos, do not support HTTPS out of the box. After searching for a solution to simplify this setup, I found that Caddy Reverse Proxy offers one of the simplest and most effective ways to manage these services. In this blog post, we’ll look at how to use Caddy Reverse Proxy with my Docker containers running across two hosts.

What is a Reverse Proxy?

A reverse proxy is a server that sits in front of one or more web servers and forwards client requests to them. It acts as an intermediary, handling incoming traffic and distributing it to the appropriate server. This setup can help improve security, manage SSL/TLS encryption, and simplify network traffic management by consolidating multiple services under a single domain.
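The core routing decision of a reverse proxy can be sketched in a few lines. This is a conceptual illustration only, not Caddy's implementation or configuration syntax: the hostnames and backend addresses below are hypothetical, while the ports are the ones mentioned earlier for UniFi Controller, Pi-hole, and Memos.

```python
from typing import Optional

# Hypothetical hostnames and backend addresses; only the ports come from the post.
ROUTES = {
    "unifi.home.example": "https://10.0.0.10:8443",
    "pihole.home.example": "http://10.0.0.10:8080",
    "memos.home.example": "http://10.0.0.11:5230",
}


def pick_upstream(host_header: str) -> Optional[str]:
    """Map the request's Host header to a backend, ignoring case and any port.

    Simplification: IPv6 literals in the Host header are not handled.
    """
    host = host_header.split(":", 1)[0].strip().lower()
    return ROUTES.get(host)


print(pick_upstream("pihole.home.example"))
print(pick_upstream("Memos.home.example:443"))
print(pick_upstream("unknown.home.example"))
```

With a mapping like this, clients only need to remember a name per application, and the proxy takes care of the port and of terminating HTTPS in front of backends that do not support it.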

What is Caddy?

Caddy is an open-source web server and reverse proxy software that is known for its simplicity and ease of use. It automatically handles HTTPS Continue reading

Lab: Configure IS-IS on Point-to-Point Links

From a very high-level perspective, OSPF and IS-IS are quite similar. Both were created in the Stone Age of networking, and both differentiate between multi-access LAN segments and point-to-point serial interfaces. Unfortunately, that approach no longer works in the Ethernet Everywhere world where most of the point-to-point links look like LAN segments, so we always have to change the default settings to make an IGP work better.

That’s what you’ll do in today’s lab exercise, which also explains the behind-the-scenes differences between point-to-point and multi-access links and the intricate world of the three-way handshake.

ADCS Cert Templates for ISE Lab

In my ISE lab I’m going to be using EAP-TLS and TEAP, which means I’ll be needing user and computer certificates. The goal is to be able to enable the 802.1X supplicant via GPO and to distribute certificates automatically without requiring any user input. Another post will cover GPO; in this post I’ll cover creating the certificate templates in ADCS.

When opening the CA app, there are a number of templates provided by default:

There are already templates for User and Computer, but it’s better to leave the default templates alone and create new ones. First, we’ll create a template for user certificates. Start by right-clicking Certificate Templates and selecting Manage:

Then we’re going to right-click the User template and select Duplicate Template:

This is going to open up a new window with properties of the template:

Go to General and give the template a name:

Don’t select the Do not automatically reenroll option or it won’t be possible to renew certs before they expire.

Then go to Request Handling. We’re going to uncheck the Allow private key to be exported option as this is considered more secure:

Make sure Enroll subject without requiring any Continue reading

NB500: SolarWinds and MacOS Vulnerabilities Get Attention; Amazon Invests in Nuclear to Meet Power, Carbon Goals

This week’s Network Break discusses a CISA warning that a serious SolarWinds vulnerability is being exploited, Microsoft turning the tables by discovering a MacOS vulnerability, and Amazon investing in small modular nuclear reactors to meet growing power demands and reduce carbon output. T-Mobile releases a new device using the 5G Reduced Capability (RedCap) spec, Palo Alto... Read more »

Tech Bytes: From AWS Topography to On-Prem Flows, Cisco ThousandEyes Boosts Network Visibility (Sponsored)

The Tech Bytes podcast welcomes back sponsor Cisco ThousandEyes to talk about new features that improve visibility into both the public cloud and your on-prem network. We’ll get details on the new topographical mapping feature for AWS, as well as ThousandEyes’ new capability to consume flow records from on prem and correlate those records with... Read more »

Unlocking The Future of AI Infrastructure: Breaking Through Bottlenecks For Profitability And Performance

As the last several years have shown, scaling up AI systems to train larger models with more parameters across more data is a very expensive proposition, and one that has made Nvidia fabulously rich.

Unlocking The Future of AI Infrastructure: Breaking Through Bottlenecks For Profitability And Performance was written by Timothy Prickett Morgan at The Next Platform.

AI for Network Engineers: Multi-Class Classification

Introduction

This chapter explains the multi-class classification training process. It begins with an introduction to the MNIST dataset (Modified National Institute of Standards and Technology dataset). Next, it describes how the SoftMax activation function computes the probability of the image fed into the model during the forward pass and how the weight parameters are adjusted during the backward pass to improve training results. Additionally, the chapter discusses the data parallelization strategy from a network perspective.
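As a rough illustration of the forward-pass step mentioned above, the sketch below applies SoftMax to ten output values, one per digit. The input values are made up for the example, and subtracting the maximum before exponentiating is a common numerical-stability choice rather than something taken from the chapter.

```python
import math


def softmax(logits):
    """Turn raw output values into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up pre-activation values for the ten output neurons (digits 0-9).
logits = [0.0] * 10
logits[7] = 2.0

probs = softmax(logits)
predicted_digit = probs.index(max(probs))

print([round(p, 3) for p in probs])
print("predicted digit:", predicted_digit)
```

The neuron with the largest input receives the largest probability, and the model's prediction is the index of that neuron.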


MNIST Dataset

We will use the MNIST dataset [1], which consists of handwritten digits, to demonstrate the training process. The MNIST dataset includes four files: (1) a training set with 60,000 grayscale images (28x28 pixels), (2) the labels for the training images, (3) a test set with 10,000 images (28x28 pixels), and (4) the labels for the test images. Figure 3-1 illustrates the structure and dependencies between the test dataset and the labels.

The file train-images-idx3-ubyte contains metadata describing how the images are ordered, along with the image pixel order. The file train-labels-idx1-ubyte defines which label (the digits 0-9) corresponds to which image in the image file. Since we have ten possible outputs, we use ten output neurons.
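The sketch below shows what that metadata might look like in practice, assuming the standard IDX layout: two zero bytes, a data-type byte, a dimension-count byte, then one big-endian 32-bit size per dimension. The header bytes are built synthetically rather than read from the real files, and the helper names `parse_idx_header` and `one_hot` are illustrative. The second helper shows how a digit label maps onto the ten output neurons.

```python
import struct


def parse_idx_header(data: bytes):
    """Return (data_type_code, dims) from the start of an IDX file."""
    zero, dtype, ndim = struct.unpack(">HBB", data[:4])
    if zero != 0:
        raise ValueError("not an IDX file")
    dims = struct.unpack(f">{ndim}I", data[4:4 + 4 * ndim])
    return dtype, dims


def one_hot(label, num_classes=10):
    """Encode a digit label as a vector with a single 1 at the label's index."""
    return [1 if i == label else 0 for i in range(num_classes)]


# Synthetic headers standing in for the two training files (0x08 = unsigned byte).
images_header = struct.pack(">HBBIII", 0, 0x08, 3, 60000, 28, 28)
labels_header = struct.pack(">HBBI", 0, 0x08, 1, 60000)

print(parse_idx_header(images_header))
print(parse_idx_header(labels_header))
print(one_hot(3))
```

The image file carries three dimensions (image count, rows, columns), while the label file carries one (label count), so the n-th label lines up with the n-th image.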

Before the training process begins, the labels for each image-label pair are one-hot Continue reading

NOG.HR: A NOG Meeting Worth Attending

I never know what to expect when I’m invited to speak at a regional (or in-country) Network Operator Group (NOG) meeting. Sometimes, it turns out to be a large conference (PLNOG and ITNOG come to mind); other times, it’s just a few people gathered around free donuts and coffee. Last week’s Croatian NOG (NOG.HR) meeting was in the Goldilocks zone between the extremes: plenty of interested networking engineers, but not large enough to be overpowering.

Also, it was such a nice experience ;)

The IPv6 Transition

I wrote an article in May 2022, asking “Are we there yet?” about the transition to IPv6. At the time I concluded the article on an optimistic note, observing that we may not be ending the transition just yet, but we are closing in. I thought at the time that we won’t reach the end of this transition to IPv6 with a bang, but with a whimper. A couple of years later, I’d like to revise these conclusions with some different thoughts about where we are heading and why.

Global Protect VPN SAML SSO with Entra-ID

In this blog post, we will look at how to use Entra-ID SAML SSO with GlobalProtect VPN. This guide assumes you are already familiar with GlobalProtect VPN and have an existing VPN solution with other forms of authentication. If you are new to GlobalProtect VPN, feel free to check out my other blog post, which is linked below.

Palo Alto Global Protect VPN Configuration Example
In this blog post, we will cover how to configure Palo Alto Global Protect VPN. We’ll go through setting up the portal, gateway, authentication profile, IP pools, split-tunnel, security policy, NAT policy and other necessary components.

Adding GlobalProtect to the Admin Centre

  1. Sign in to the Microsoft Entra admin centre and navigate to Identity > Applications > Enterprise applications > New application.
  2. Add the Palo Alto Networks - GlobalProtect application.
  3. Once added, select Palo Alto Networks - GlobalProtect > Single sign-on.

On the Set up single sign-on with SAML page, click the pencil icon in the Basic SAML Configuration section to edit the settings.

💡
In the Basic SAML Configuration section, make sure the Entity ID and Reply URL include :443 after the URL; otherwise, it won't work. I Continue reading

What Is the Future of the .io Domain?

The .io domain was originally created for the British Indian Ocean Territory but eventually became popular with the tech sector, for obvious reasons. Part of the reason for this is that ‘io’ is similar in appearance to I/O (aka input/output), which is why the tech sector started gobbling up the .io domains. There were issues soon after the creation of the domain that had to do with the distribution of profit. A lot of app developers use the .io domain. The New Stack uses the .io domain. It’s everywhere. But there’s a problem, and it’s one that could have a cascading effect within the realm of the tech sector. What has happened is that the