Archive

Category Archives for "Networking"

Cloudflare incident on November 14, 2024, resulting in lost logs

On November 14, 2024, Cloudflare experienced an incident which impacted the majority of customers using Cloudflare Logs. During the roughly 3.5 hours that these services were impacted, about 55% of the logs we normally send to customers were not sent and were lost. We’re very sorry this happened, and we are working to ensure that a similar issue doesn't happen again.

This blog post explains what happened and what we’re doing to prevent recurrences. Also, the systems involved and the particular class of failure we experienced will hopefully be of interest to engineering teams beyond those specifically using these products.

Failures within systems at scale are inevitable, and it’s essential that subsystems protect themselves from failures in other parts of the larger system to prevent cascades. In this case, a misconfiguration in one part of the system caused a cascading overload in another part of the system, which was itself misconfigured. Had it been properly configured, it could have prevented the loss of logs.
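
One common way for a subsystem to protect itself from an overloaded neighbor is to bound its buffers and shed excess load instead of growing without limit. The sketch below is a generic illustration of that idea, not Cloudflare's implementation; the class name, function name, and all numbers are hypothetical.

```python
from collections import deque


class BoundedBuffer:
    """Fixed-capacity buffer that rejects new items when full instead of growing."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()
        self.dropped = 0  # count of items shed because the buffer was full

    def offer(self, item):
        """Return True if the item was accepted, False if it was shed."""
        if len(self.items) >= self.capacity:
            self.dropped += 1
            return False
        self.items.append(item)
        return True

    def drain(self, n):
        """Remove and return up to n items in arrival order."""
        out = []
        while self.items and len(out) < n:
            out.append(self.items.popleft())
        return out


def simulate(arrivals_per_tick, drains_per_tick, ticks, capacity):
    """Feed a buffer faster than it drains and report what happened."""
    buf = BoundedBuffer(capacity)
    delivered = 0
    for _ in range(ticks):
        for i in range(arrivals_per_tick):
            buf.offer(i)
        delivered += len(buf.drain(drains_per_tick))
    return delivered, buf.dropped, len(buf.items)


delivered, dropped, remaining = simulate(
    arrivals_per_tick=10, drains_per_tick=4, ticks=5, capacity=8
)
```

Under sustained overload the buffer never exceeds its capacity, so the consumer keeps working at its own pace while the excess is counted and dropped rather than cascading into memory exhaustion.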

Background

Cloudflare’s network is a globally distributed system enabling and supporting a wide variety of services. Every part of this system generates event logs which contain detailed metadata about what’s happening with our systems around Continue reading

Palo Alto Auto-Tagging to Automate Security Actions

Auto-tagging allows the firewall to tag a policy object when it receives a log that matches specific criteria, creating an IP-to-tag or user-to-tag mapping. For example, when the firewall generates a traffic or threat log, you can configure it to tag the source IP address or User associated with that log using a specific tag name. These tags can then be used to automatically populate policy objects like Dynamic User Groups or Dynamic Address Groups, which in turn can automate security actions within security policies.

For example, let's say I have a policy that denies traffic from the Internet to the firewall's public IP or subnet whenever someone attempts to access random ports. This policy blocks the traffic and generates a traffic log. Now, if someone tries to target our public IP on port 22 (SSH), we might want to add them to a blacklist, which is a Dynamic Address Group. We can then create another policy that references this Dynamic Address Group to block any further traffic from this IP address.
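
The flow described above (log matches criteria, source IP gets a tag, a Dynamic Address Group picks up the tag, a policy blocks the group) can be modeled in a few lines. This is a minimal in-memory sketch, not the PAN-OS API; the function names, the tag name `ssh-probe`, and the log field names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical IP-to-tag mapping table; a real firewall maintains this itself.
ip_tags = defaultdict(set)


def auto_tag(log, match, tag):
    """Tag the log's source IP when the log satisfies every field in `match`."""
    if all(log.get(key) == value for key, value in match.items()):
        ip_tags[log["src"]].add(tag)


def dynamic_address_group(tag):
    """Group membership is computed from tags rather than configured statically."""
    return {ip for ip, tags in ip_tags.items() if tag in tags}


def is_blocked(src, tag="ssh-probe"):
    """A second policy can reference the group to block further traffic."""
    return src in dynamic_address_group(tag)


# Two denied connections from documentation-range addresses; only one hits port 22.
logs = [
    {"src": "203.0.113.10", "dport": 22, "action": "deny"},
    {"src": "198.51.100.7", "dport": 8080, "action": "deny"},
]
for entry in logs:
    auto_tag(entry, match={"dport": 22, "action": "deny"}, tag="ssh-probe")
```

After the loop, `is_blocked("203.0.113.10")` is true and `is_blocked("198.51.100.7")` is false, because only the first log matched the tagging criteria.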

A Realistic Use Case

A realistic use case is when you want to block a source IP after multiple failed authentication attempts to GlobalProtect. Typically, you can use Continue reading

Topology aware flow analytics with NVIDIA NetQ

NVIDIA Cumulus Linux 5.11 for AI / ML describes how NVIDIA 400/800G Spectrum-X switches combined with the latest Cumulus Linux release deliver enhanced real-time telemetry that is particularly relevant to the AI / machine learning workloads that Spectrum-X switches are designed to handle.

This article shows how to extract topology from an NVIDIA fabric in order to perform advanced fabric-aware analytics, for example: detect flow collisions, trace flow paths, and de-duplicate traffic.

In this example, we will use NVIDIA NetQ, a highly scalable, modern network operations toolset that provides visibility, troubleshooting, and validation of your Cumulus and SONiC fabrics in real time.

netq show lldp json

The NetQ Link Layer Discovery Protocol (LLDP) service simplifies the task of gathering neighbor data from switches in the network, and the json option makes the output easy to process with a Python script such as lldp-rt.py.
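
To give a feel for what such a script might do, here is a rough sketch that turns LLDP neighbor records into a de-duplicated set of links. It is not the actual lldp-rt.py: the sample data, the JSON field names (`hostname`, `ifname`, `peerHostname`, `peerIfname`), and the output layout are all assumptions made for illustration, so check them against real `netq show lldp json` output before relying on them.

```python
import json

# Sample data with ASSUMED field names; real NetQ output may differ.
SAMPLE = """
{"lldp": [
  {"hostname": "leaf01", "ifname": "swp51", "peerHostname": "spine01", "peerIfname": "swp1"},
  {"hostname": "spine01", "ifname": "swp1", "peerHostname": "leaf01", "peerIfname": "swp51"},
  {"hostname": "leaf02", "ifname": "swp51", "peerHostname": "spine01", "peerIfname": "swp2"}
]}
"""


def lldp_to_links(text):
    """Build a links map, keeping one entry per physical link."""
    seen = set()
    links = {}
    for row in json.loads(text)["lldp"]:
        a = (row["hostname"], row["ifname"])
        b = (row["peerHostname"], row["peerIfname"])
        # Each link is reported from both ends; sort the endpoints so both
        # reports produce the same key.
        key = tuple(sorted((a, b)))
        if key in seen:
            continue
        seen.add(key)
        (node1, port1), (node2, port2) = key
        links[f"L{len(links) + 1}"] = {
            "node1": node1, "port1": port1,
            "node2": node2, "port2": port2,
        }
    return {"links": links}


topology = lldp_to_links(SAMPLE)
```

In practice the input would come from the command's output rather than an inline string, for example by piping it into the script and reading standard input.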

The simplest way to try sFlow-RT is to use the pre-built sflow/topology Docker image that packages sFlow-RT with additional applications that are useful for monitoring network topologies.

docker run -p 6343:6343/udp -p 8008:8008 sflow/topology

Configure Cumulus Linux to stream sFlow telemetry to sFlow-RT on UDP port 6343 (the default for Continue reading

Deciphering the Open Systems Interconnection Model

Unless you’ve studied for a network cert, the Open Systems Interconnection (OSI) model is probably somewhat of a mystery to you. Maybe you heard of it from a coworker, or maybe you saw it in a marketing campaign for something on AWS. Maybe you thought “Layer 3” was just some new buzzword. Such shorthand references to the OSI model, however, can be useful if you can decode them, as they can help you understand where in your network stack a tool could fit or where to look for a problem during an incident call.

Before we get too far, let me address a point of contention. Many people will say the theoretical OSI model is outdated. The model is theoretical, true, and the real world is certainly more complex than it may lead you to believe. Its layers don’t neatly map to specific devices, and other models exist that more accurately reflect the real world, such as the Transmission Control Protocol/Internet Protocol (TCP/IP) model.

It’s useful to think of the OSI model as an abstraction that allows us to reason about the separation of concerns on a network. We use it to think through troubleshooting steps should Continue reading

EVPN Designs: EVPN IBGP over IPv4 EBGP

We’ll conclude the EVPN designs saga with the “most creative” design promoted by some networking vendors: running an IBGP session (carrying EVPN address family) between loopbacks advertised with EBGP IPv4 address family.

Oversimplified IBGP-over-EBGP design

There’s just a tiny gotcha in the above Works Best in PowerPoint diagram. IBGP assumes the BGP neighbors are in the same autonomous system while EBGP assumes they are in different autonomous systems. The usual way out of that OMG, I painted myself into a corner situation is to use BGP local AS functionality on the underlay EBGP session:

From Python to Go 004. Arrays, Lists, Slices.

Hello my friend,

In this post we will start exploring how to store multiple values in a single variable. There are several ways to achieve this in both Python and Go (Golang). The first one we’ll cover is an ordered “something” of elements. We use the word “something” because in Python this is called a list, whilst Go (Golang) uses the term slice. Let’s see what is similar and what is not between Python and Go (Golang).
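
As a quick taste of the Python side, here is a small sketch of the list behaviors that matter most when comparing with Go: ordering, indexing, appending, slicing, and aliasing. The variable names and values are made up for illustration.

```python
# A list keeps its elements in insertion order.
ports = [22, 80, 443]
ports.append(8080)  # lists grow on demand

# Indexing is zero-based; negative indexes count from the end.
first, last = ports[0], ports[-1]

# Slicing a list produces a new list, so changing the copy
# leaves the original untouched.
middle = ports[1:3]
middle[0] = 8000

# Plain assignment does not copy: both names refer to the same list.
alias = ports
alias[0] = 2222

# A single list can hold values of different types.
mixed = [22, "ssh", True]
```

Two differences are worth keeping in mind for the comparison: a Go slice holds elements of one declared type, and slicing a Go slice shares the underlying array with the original instead of copying it as Python does here.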

Who Else Deals With Network Automation?

Just this week there was a massive event, Autocon2, which is the biggest gathering of network automation enthusiasts from all over the world. People from different industries shared their experience, success stories and challenges. Whilst many of those insights are different, all of them revolve around the central idea that network automation is a must.

We started doing network automation trainings before it became mainstream, yet we constantly update their content with new things. Start your training today to onboard the training, which is just leaving the platform:

We offer the following training programs in network automation for you:

Cisco CML Free Tier (No Kidding)

I first read on LinkedIn that Cisco is introducing a free tier of its Cisco CML platform (starting from version 2.8). My initial reaction was, 'Cisco giving something away for free? That can't be true!' But it turns out, it is. I couldn't wait to try it. I believe it’s limited to five nodes, but I’ll take it.

I could never justify paying $199 per year when there are free alternatives available (EVE-NG, GNS3, Containerlab). If it were a one-time purchase, I might have gone for it, but $199 every year is just too much for me.

In this post, I'll cover how to get the free version and how to install it on VMware Workstation.

💡
Downloading the CML image requires a Cisco account, which you can create for free. Also, make sure to add your address to your profile, or you might not be able to download it.

Downloading the Image

Head over to the Cisco download page and select CML version 2.8 Free Tier. As I mentioned earlier, you need a Cisco account, which is easy to create. Depending on your installation method, you can choose either the bare-metal or OVA Continue reading

P99 Conf: 3 Ways to Squash Application Latency

We’ve all been frustrated by latency, either as users of an application, or as developers building such apps. At ScyllaDB’s annual P99 Conf, Pekka Enberg, founder and CTO of […], shared his favorite tips for spotting and removing latency from systems. “Latency lurks everywhere,” said Enberg, who also has authored a […]. […] once estimated that it loses 1% of sales for every 100ms of latency. Enberg has thought plenty about ways of reducing latency and has boiled down his solutions into three different approaches: Reduce data movement Continue reading

TNO009: From Network Monitoring to Observability: Make the Leap for Better NetOps

Traditional network monitoring was built around SNMP and logs. And while there’s still a role for these sources, network observability aims to incorporate more data to help you build a holistic picture of the network and its behavior and performance. These sources can include flows, streaming telemetry, APIs, NETCONF, the CLI, deep packet inspection, synthetic... Read more »

Dynamic BGP Peers

You might have an environment where a route reflector (or a route server) has dozens or hundreds of BGP peers. Configuring them by hand is a nightmare; you should either build a decent automation platform or use dynamic BGP neighbors – a feature you can practice in the next lab exercise.
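
The core idea behind dynamic neighbors is that the router accepts sessions from any address inside a configured range instead of from an explicit list of peers. The following sketch models only that matching step; it is not any vendor's configuration syntax, and the prefixes, AS numbers, and group names are hypothetical.

```python
import ipaddress

# Hypothetical listen ranges, each mapped to a peer group and the AS number
# that peers in the range are expected to announce.
LISTEN_RANGES = [
    (ipaddress.ip_network("10.0.0.0/24"), {"group": "rr-clients", "remote_as": 65000}),
    (ipaddress.ip_network("2001:db8::/64"), {"group": "rr-clients-v6", "remote_as": 65000}),
]


def match_dynamic_peer(source_ip, remote_as):
    """Return the peer group for an incoming session, or None to reject it."""
    addr = ipaddress.ip_address(source_ip)
    for network, policy in LISTEN_RANGES:
        # An address only matches a network of the same IP version.
        if addr in network and remote_as == policy["remote_as"]:
            return policy["group"]
    return None
```

With these ranges, `match_dynamic_peer("10.0.0.42", 65000)` returns `"rr-clients"`, while a session from `192.0.2.1`, or one announcing an unexpected AS number, is rejected with `None`.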

Click here to start the lab in your browser using GitHub Codespaces (or set up your own lab infrastructure). After starting the lab environment, change the directory to session/9-dynamic and execute netlab up.

SC24 Over 10 Terabits per Second of WAN Traffic

The SC24 WAN Stress Test chart shows 10.3 terabits per second of WAN traffic to The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC24), held this week in Atlanta. The conference network used in the demonstration, SCinet, is described as the most powerful and advanced network on Earth, connecting the SC community to the world.

SC24 Real-time RoCEv2 traffic visibility describes a demonstration of wide area network bulk data transmission using RDMA over Converged Ethernet (RoCEv2) flows typically seen in AI/ML data centers. In the example, 3.2Tbits/second sustained transmissions from sources geographically distributed around the United States were demonstrated.

SC24 Dropped packet visibility demonstration shows how the sFlow data model integrates three telemetry streams: counters, packet samples, and packet drop notifications. Each type of data is useful on its own, but together they provide the comprehensive network-wide observability needed to drive automation. Real-time network visibility is particularly relevant to AI / ML data center networks, where congestion and dropped packets can result in serious performance degradation. In this screen capture you can see multiple 400Gbits/s RoCEv2 flows.

SC24 SCinet traffic describes the architecture of the real-time monitoring system used to Continue reading
