Archive

Category Archives for "Networking"

Quantile Regression

The Median Isn’t the Message - Stephen Jay Gould

When we think of regression, the most common one, which we all know, is linear regression. It is a fairly popular and simple technique for estimating the mean of some variable conditional on the values of independent variables.

Linear Regression

Now imagine if you are a grocery delivery or ride-hailing service and want to show the customer the estimated delivery or wait times. If the distance is smaller, there will be less variability in the waiting time, but if the distance is longer, many things can go wrong, and due to that there can be a lot of variability in the estimate time. If we have to create a model to predict that, we may not want to apply linear regression as that will only tell us the average time.

It’s important to note that one of the key assumptions for applying linear regression is a constant variance (Homoskedasticity). However, many times this is often not the case. The variability is not constant (Heteroscedastic), which violates the linear regression assumption (Linear Regression Notes).

Motivation

Let’s look at a running data for the distance vs. the time it takes to finish. We clearly know Continue reading

What is new in Calico 3.28

TL/DR

  • A new Grafana dashboard that helps you monitor Calico Typha’s performance and troubleshoot issues.
  • Calico eBPF dataplane IPv6 is now GA. It supports true IPv6-only clusters as well as dual-stack clusters. 🐝
  • Optional Pod startup delay to ensure networking is up in high-churn scenarios.
  • Tigera operator now supports multiple IP pools, IP pool modification, affinity for operator pods, priorityclassname, and more!
  • Improved policy performance in both eBPF and iptables.
  • Calico now ships with a pprof server. Activate the performance server for real-time views of Typha and Felix components and real-time debugging.

🚨 Important changes 🚨

Calico 3.28 now has enabled VXLAN checksum offload by default for environments with the kernel version of 5.8 or above. In the past, offloading was disabled due to kernel bugs.

Please keep in mind, if you are upgrading to 3.28 this change will take effect after node restarts.

If you encounter unexpected performance issues, you can use the following command to revert to the previous method by using the following command:

kubectl patch felixconfiguration default --type="merge" -p='{"spec":{"FeatureDetectOverride":"ChecksumOffloadBroken=true"}'

Please keep in mind that you can report any issues via GitHub tickets or Slack and include a detailed description of the environment (NIC hardware, kernel, distro, Continue reading

Network Observability in K8s Clusters for Better Troubleshooting

For DevOps and platform teams working with containers and Kubernetes, reducing downtime and improving security posture is crucial. A clear understanding of network topology, service interactions and workload dependencies is required in cloud native applications. This is essential for securing and optimizing your Kubernetes deployment and minimizing response time in the event of failure. Network observability can highlight gaps in network policies for applications that require network policy controls, thus reducing the risk of attack from unsecured egress access or lateral movement of threats within the Kubernetes cluster. However, visualizing workload communication, service dependencies, and active and inactive network security policies presents significant challenges due to the distributed and dynamic nature of

Case Study: IPng’s mail servers

Intro

I have seen companies achieve great successes in the space of consumer internet and entertainment industry. I’ve been feeling less enthusiastic about the stronghold that these corporations have over my digital presence. I am the first to admit that using “free” services is convenient, but these companies are sometimes taking away my autonomy and exerting control over society. To each their own of course, but for the last few years, I’ve been more and more inclined to take back a little bit of responsibility for my online social presence, away from centrally hosted services and to privately operated ones.

GMail

First off - I love Google’s Workspace products. I started using GMail just after it launched, back in 2004. Its user interface is sleek, performant, and very intuitive. Its filtering, granted, could be a bit less … robotic, but that’s made up by labels and an incredibly comprehensive search function. I would dare say that between GMail and Photos, those are my absolute favorite products on the internet.

That said, I have been running e-mail servers since well before Google existed as a company. I started off at M.C.G.V. Stack, the computer club of the University of Continue reading

Amazon EKS networking options

When setting up a Kubernetes environment with Amazon Elastic Kubernetes Service (EKS), it is crucial to understand your available networking options. EKS offers a range of networking choices that allow you to build a highly available and scalable cloud environment for your workloads.

In this blog post, we will explore the networking and policy enforcement options provided by AWS for Amazon EKS. By the end, you will have a clear understanding of the different networking options and network policy enforcement engines, and other features that can help you create a functional and secure platform for your Kubernetes workloads and services.

Amazon Elastic Kubernetes Service (EKS)

Amazon Elastic Kubernetes Service (EKS) is a managed Kubernetes service that simplifies routine operations, such as cluster deployment and maintenance, by automating tasks such as patching and updating operating systems and their underlying components. EKS enhances scalability through AWS Auto Scaling groups and other AWS service integrations and offers a highly available control plane to manage your cluster.

Amazon EKS in the cloud has two options:

  • Managed
  • Self-managed

Managed clusters rely on the AWS control plane node, which AWS hosts and controls separately from your cluster. This node operates in isolation and cannot be directly Continue reading

Must Read: OSPF Protocol Analysis (RFC 1245)

Daniel Dib found the ancient OSPF Protocol Analysis (RFC 1245) that includes the Router CPU section. Please keep in mind the RFC was published in 1991 (35 years ago):

Steve Deering presented results for the Dijkstra calculation in the “MOSPF meeting report” in [3]. Steve’s calculation was done on a DEC 5000 (10 mips processor), using the Stanford internet as a model. His graphs are based on numbers of networks, not number of routers. However, if we extrapolate that the ratio of routers to networks remains the same, the time to run Dijkstra for 200 routers in Steve’s implementation was around 15 milliseconds.

Must Read: OSPF Protocol Analysis (RFC 1245)

Daniel Dib found the ancient OSPF Protocol Analysis (RFC 1245) that includes the Router CPU section. Please keep in mind the RFC was published in 1991 (35 years ago):

Steve Deering presented results for the Dijkstra calculation in the “MOSPF meeting report” in [3]. Steve’s calculation was done on a DEC 5000 (10 mips processor), using the Stanford internet as a model. His graphs are based on numbers of networks, not number of routers. However, if we extrapolate that the ratio of routers to networks remains the same, the time to run Dijkstra for 200 routers in Steve’s implementation was around 15 milliseconds.

Rule 11 Academy is (Somewhat) Live

The Academy does not replace this blog, the Hedge, etc. Instead, it’s a place for me to recreate all the training materials I’ve taught in the past, put them in one place, and adding new training material besides. It’s light right now, but I plan to post about once or twice a week.

Note this is a subscription site with paid content and two memberships–six months and yearly.

Get six months free using the coupon code BEAG2DRUP0TORNSKUT.

https://rule11.ac/

Thwart Ops Sprawl With a Unified Data Plane 

When updating a critical infrastructure element for application teams takes weeks due to coordination between NetOps, SecOps, PlatformOps and FinOps, you have a problem: ops sprawl. First was the technology ops team. Then came network operations and security operations. Then, arising from the site reliability engineering (SRE) movement and the goal of pushing more ops decisions into the development environment, came

New Consent and Bot Management features for Cloudflare Zaraz

Managing consent online can be challenging. After you’ve figured out the necessary regulations, you usually need to configure some Consent Management Platform (CMP) to load all third-party tools and scripts on your website in a way that respects these demands. Cloudflare Zaraz manages the loading of all of these third-party tools, so it was only natural that in April 2023 we announced the Cloudflare Zaraz CMP: the simplest way to manage consent in a way that seamlessly integrates with your third-party tools manager.

As more and more third-party tool vendors are required to handle consent properly, our CMP has evolved to integrate with these new technologies and standardization efforts. Today, we’re happy to announce that the Cloudflare Zaraz CMP is now compatible with the Interactive Advertising Bureau Transparency and Consent Framework (IAB TCF) requirements, and fully supports Google’s Consent Mode v2 signals. Separately, we’ve taken efforts to improve the way Cloudflare Zaraz handles traffic coming from online bots.

IAB TCF Compatibility

Earlier this year, Google announced that websites that would like to use AdSense and other advertising solutions in the European Economic Area (EEA), the UK, and Switzerland, will be required to use a CMP that is approved by Continue reading

MLAG Deep Dive: LAG Member Failures in VXLAN Fabrics

In the Dealing with LAG Member Failures blog post, we figured out how easy it is to deal with a LAG member failure in a traditional MLAG cluster. The failover could happen in hardware, and even if it’s software-driven, it does not depend on the control plane.

Let’s add a bit of complexity and replace a traditional layer-2 fabric with a VXLAN fabric. The MLAG cluster members still use an MLAG peer link and an anycast VTEP IP address (more details).

MLAG Deep Dive: LAG Member Failures in VXLAN Fabrics

In the Dealing with LAG Member Failures blog post, we figured out how easy it is to deal with a LAG member failure in a traditional MLAG cluster. The failover could happen in hardware, and even if it’s software-driven, it does not depend on the control plane.

Let’s add a bit of complexity and replace a traditional layer-2 fabric with a VXLAN fabric. The MLAG cluster members still use an MLAG peer link and an anycast VTEP IP address (more details).

Cisco vPC in VXLAN/EVPN Network – Part 4 – Fabric Peering

Like I mentioned in a previous post, normally leafs don’t connect to leafs, but for vPC this is required. What if we don’t want to use physical interfaces for this interconnection? This is where fabric peering comes into play. Now, unfortunately my lab, which is virtual, does not support fabric peering so I will just introduce you to the concept. Let’s compare the traditional vPC to fabric peering, starting with traditional vPC:

The traditional vPC has the following pros and cons:

  • Pros:
    • No dependency on other devices for peer link and peer keepalive link.
    • No contention for bandwidth on interfaces as they are dedicated.
    • This also means no QoS configuration is required.
    • Intent of configuration is clear with dedicated interfaces.
  • Cons:
    • Requires dedicated interfaces that could be used for something else.
    • Interfaces have a cost, both from perspective of buying the switch, but also SFPs.

Now let’s compare that to fabric peering:

Fabric peering has the following pros and cons:

  • Pros:
    • No dedicated interfaces required.
    • Thus reducing cost.
    • Resiliency as there are multiple paths between the two switches.
  • Cons:
    • Dependency to other devices.
    • Dependency to underlay.
    • Contention for bandwidth with other traffic.
    • May require QoS.
    • May be more difficult to Continue reading