What To Do When You Can’t Get Nvidia H100 GPUs

In a world where allocations of “Hopper” H100 GPUs coming out of Nvidia’s factories are going out well into 2024, and the allocations for the impending “Antares” MI300X and MI300A GPUs are probably long since spoken for, anyone trying to build a GPU cluster to power a large language model for training or inference has to think outside of the box.

The post What To Do When You Can’t Get Nvidia H100 GPUs first appeared on The Next Platform.

What To Do When You Can’t Get Nvidia H100 GPUs was written by Timothy Prickett Morgan at The Next Platform.

KU041: Running Kubernetes On Nomad

In this episode, Michael, Kristina, and Adriana Villela discuss the challenges and benefits of running Kubernetes on Nomad. Adriana shares her experience of using Nomad in a data center, highlighting its simplicity and ease of deployment compared to Kubernetes. The speakers also discuss the differences between the two platforms, the concept of vendor lock-in, and...

The post KU041: Running Kubernetes On Nomad appeared first on Packet Pushers.

SC23 Data Transfer Node TCP Metrics

The dashboard shown above is based on the open source sflow-rt/dtn project. The dashboard shows data captured from The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC23) being held this week in Denver.

The dashboard displays data gathered from open source Host sFlow agents installed on Data Transfer Nodes (DTNs) run by the Caltech High Energy Physics Department and used for handling transfer of large scientific data sets (for example, accessing experiment data from the CERN particle accelerator). Network performance monitoring describes how the Host sFlow agents augment standard sFlow telemetry with measurements that the Linux kernel maintains as part of the normal operation of the TCP protocol stack.

The dashboard shows five large flows (each greater than 50 gigabits per second). For each large flow being tracked, additional TCP performance metrics are displayed:

  • RTT: The round trip time observed between DTNs.
  • RTT Wait: The amount of time that data waits on the sender before it can be sent.
  • RTT Sdev: The standard deviation of the observed RTT. This variation is a measure of jitter.
  • Avg. Packet Size: The average packet size used to send data.
  • Packets in Flight: The number of unacknowledged packets.
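
The metrics listed above can be illustrated with a minimal sketch. It assumes the RTT samples, byte counts, and flow rates are already available as plain numbers; the function names and the 50 Gbit/s default threshold are hypothetical choices for illustration, and this is not how Host sFlow or the Linux kernel necessarily computes its values.

```python
import statistics


def summarize_rtt(samples_ms):
    """Return (mean RTT, RTT standard deviation) for a list of samples in ms.

    The standard deviation is used here as a simple jitter estimate.
    """
    mean = statistics.fmean(samples_ms)
    sdev = statistics.pstdev(samples_ms)  # population standard deviation
    return mean, sdev


def average_packet_size(total_bytes, packets):
    """Average bytes per packet; returns 0.0 when no packets were seen."""
    return total_bytes / packets if packets else 0.0


def is_large_flow(bits_per_second, threshold_bps=50e9):
    """True when a flow exceeds the (assumed) 50 Gbit/s threshold."""
    return bits_per_second > threshold_bps


if __name__ == "__main__":
    mean, sdev = summarize_rtt([10.0, 12.0, 14.0])
    print(f"RTT mean={mean:.2f} ms, sdev={sdev:.2f} ms")
    print("avg packet size:", average_packet_size(9000 * 4, 4))
    print("large flow:", is_large_flow(60e9))
```

A larger standard deviation relative to the mean RTT would suggest more jitter on the path between the two DTNs.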

See Defining Flows for full range of ...

Introducing advanced session audit capabilities in Cloudflare One

The basis of Zero Trust is defining granular controls and authorization policies per application, user, and device. Having a system with a sufficient level of granularity to do this is crucial to meet both regulatory and security requirements. But there is a potential downside to so many controls: in order to troubleshoot user issues, an administrator has to consider a complex combination of variables across applications, user identity, and device information, which may require painstakingly sifting through logs.

We think there’s a better way — which is why, starting today, administrators can easily audit all active user sessions and associated data used by their Cloudflare One policies. This enables the best of both worlds: extremely granular controls, while maintaining an improved ability to troubleshoot and diagnose Zero Trust deployments in a single, simple control panel. Information that previously lived in a user’s browser or changed dynamically is now available to administrators without the need to bother an end user or dig into logs.

A quick primer on application authentication and authorization

Authentication and Authorization are the two components that a Zero Trust policy evaluates before allowing a user access to a resource.

Authentication is the process of verifying the identity ...
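
As a rough illustration of the two-step evaluation described above, the sketch below checks authentication first and authorization second. The `Session` and `Policy` types, their fields, and the `evaluate` function are hypothetical and are not Cloudflare One's data model or API; they only show the order of the checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    # Hypothetical view of what a policy might inspect for one user session.
    user: str
    authenticated: bool
    groups: frozenset = frozenset()
    device_compliant: bool = False


@dataclass(frozen=True)
class Policy:
    # Hypothetical per-application policy.
    application: str
    allowed_groups: frozenset
    require_compliant_device: bool = True


def evaluate(policy, session):
    """Return (allowed, reason). Authentication is checked before authorization."""
    if not session.authenticated:
        return False, "authentication failed"
    if not (session.groups & policy.allowed_groups):
        return False, "user not in an allowed group"
    if policy.require_compliant_device and not session.device_compliant:
        return False, "device not compliant"
    return True, "allowed"


if __name__ == "__main__":
    policy = Policy("wiki", frozenset({"engineering"}))
    session = Session("alice", True, frozenset({"engineering"}), True)
    print(evaluate(policy, session))
```

Returning a reason alongside the decision is one way to make a denied request easier to troubleshoot without reading raw logs.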

IPB139: Avoiding Typical IPv6 Pitfalls

Network engineers and architects considering IPv6 can benefit from the experiences of those who have gone before them by avoiding the problems that have bedeviled other deployments. On today’s show, your hosts discuss three typical pitfalls and how to get over or around them without falling in. Those IPv6 pitfalls include: IPv4 thinking, deploying ULA...

Billion-Dollar AI Promise a Bright Spot in Gloomy Quarter for Cisco

Cisco navigated a rocky road in its first quarter of the year as evidenced by the dips in share price for the networking giant this morning.

The post Billion-Dollar AI Promise a Bright Spot in Gloomy Quarter for Cisco first appeared on The Next Platform.

Billion-Dollar AI Promise a Bright Spot in Gloomy Quarter for Cisco was written by Nicole Hemsoth Prickett at The Next Platform.

The First Ever Network Automation Conference – AutoCon0

First, let me just say that you have got to love a zero-indexed conference! If you are a network engineer and you don’t know what that means, we need to chat, and that situation was a key topic of the conference. In my mind, the goal of the conference was to assess the state of ...

The post The First Ever Network Automation Conference – AutoCon0 appeared first on The Gratuitous Arp.

Microsoft partners with Nvidia, Synopsys for genAI services

Microsoft has announced that it is partnering with chipmaker Nvidia and chip-design software provider Synopsys to provide enterprises with foundry services and a new chip-design assistant. The announcement was made at the ongoing Microsoft Ignite conference. The foundry services from Nvidia, which will be deployed on Microsoft Azure, will combine three Nvidia elements: its foundation models, its NeMo framework, and its DGX Cloud service.

Worth Reading: Cloudflare Control Plane Outage

Cloudflare experienced a significant outage in early November 2023 and published a detailed post-mortem report. You should read the whole report; here are my CliffsNotes:

Also (unrelated to Cloudflare outage):

IPv6, the DNS and Happy Eyeballs

If we are going to update RFC 3901, "DNS IPv6 Transport Guidelines," and offer revised guidelines that are more positive about the use of IPv6 in the DNS, then what should those updated guidelines say?

Microsoft Holds Chip Makers’ Feet To The Fire With Homegrown CPU And AI Chips

After many years of rumors, Microsoft has finally confirmed that it is following rivals Amazon Web Services and Google into the design of custom processors and accelerators for their clouds.

The post Microsoft Holds Chip Makers’ Feet To The Fire With Homegrown CPU And AI Chips first appeared on The Next Platform.

Microsoft Holds Chip Makers’ Feet To The Fire With Homegrown CPU And AI Chips was written by Timothy Prickett Morgan at The Next Platform.

SC23 WiFi Traffic Heatmap

Real-time WiFi-Traffic Heatmap (source code GitHub: cod3monk/showfloor-heatmap) displays real-time WiFi traffic from The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC23) being held this week in Denver.
The conference network used in the demonstration, SCinet, is described as the most powerful and advanced network on Earth, connecting the SC community to the world.
In this example, the sFlow-RT real-time analytics engine receives sFlow telemetry from switches, routers, and servers in the SCinet network and creates metrics to drive the real-time heatmap. Getting Started provides a quick introduction to deploying and using sFlow-RT for real-time network-wide flow analytics.

Additional use cases being demonstrated this week include the SC23 Dropped packet visibility demonstration and SC23 SCinet traffic.

Cisco: Generative AI expectations outstrip enterprise readiness

While 95% of businesses are aware that AI will increase infrastructure workloads, only 17% have networks that are flexible enough to handle the complex requirements of AI. Given that disconnect, it’s too early to see widespread deployment of AI at scale, despite the hype. That's one of the key takeaways from Cisco’s inaugural AI Readiness Index, a survey of 8,000 global companies aimed at measuring corporate interest in and ability to utilize AI technologies.

D2C220: KubeConversations Part 1 – Platform Engineering

Welcome to a special edition of Day Two Cloud. Host Ned Bellavance traveled to KubeCon Chicago 2023 and spoke to vendors and open source maintainers about what’s going on in the cloud-native ecosystem. This episode features conversations on platform engineering. Part 2 will focus on security. Episode Guests: Cole Morrison, Developer Advocate at HashiCorp LinkedIn...