Streaming and longer context lengths for LLMs on Workers AI

Workers AI is our serverless GPU-powered inference platform running on top of Cloudflare’s global network. It provides a growing catalog of off-the-shelf models that run seamlessly with Workers and enable developers to build powerful and scalable AI applications in minutes. We’ve already seen developers doing amazing things with Workers AI, and we can’t wait to see what they do as we continue to expand the platform. To that end, today we’re excited to announce some of our most-requested new features: streaming responses for all Large Language Models (LLMs) on Workers AI, larger context and sequence windows, and a full-precision Llama-2 model variant.

If you’ve used ChatGPT before, then you’re familiar with the benefits of response streaming, where responses flow in token by token. LLMs work internally by generating responses sequentially using a process of repeated inference — the full output of an LLM is essentially a sequence of hundreds or thousands of individual prediction tasks. For this reason, while it only takes a few milliseconds to generate a single token, generating the full response takes longer, on the order of seconds. The good news is we can start displaying the response as soon as the first tokens are generated… Continue reading
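Streamed LLM responses are commonly delivered as server-sent events, one small JSON payload per token. The sketch below parses such a stream; the exact field name (`"response"`) and the `data: [DONE]` terminator are assumptions modeled on common SSE conventions, not necessarily Workers AI's exact wire format.

```python
import json

def parse_sse_tokens(raw_stream: str) -> list:
    """Extract response tokens from an SSE-style event stream.

    Assumes each event line looks like 'data: {"response": "..."}'
    and the stream ends with 'data: [DONE]' -- a common convention,
    not a documented Workers AI format.
    """
    tokens = []
    for line in raw_stream.splitlines():
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        tokens.append(json.loads(payload)["response"])
    return tokens

# Each token can be appended to the UI the moment it arrives,
# instead of waiting seconds for the whole completion.
sample = 'data: {"response": "Hello"}\ndata: {"response": ", world"}\ndata: [DONE]\n'
print("".join(parse_sse_tokens(sample)))  # Hello, world
```

In a real client the same loop would iterate over chunks of an HTTP response body rather than a pre-built string, but the token-by-token assembly is identical.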

Is Anyone Using netlab on Windows?

Tomas wants to start netlab with PowerShell, but it doesn’t work for him, and I don’t know anyone running netlab directly on Windows (I know people running it in a Ubuntu VM on Windows, but that’s a different story).

In theory, netlab (and Ansible) should work fine with Windows Subsystem for Linux. In practice, there’s often a gap between theory and practice – if you run netlab on Windows (probably using VirtualBox with Vagrant), I’d love to hear from you. Please leave a comment, email me, add a comment to Tomas’ GitHub issue, or fix the documentation and submit a PR. Thank you!

NB455: Extreme Announces ZTNA Offering; Palo Alto Networks Spends Big On A Browser Startup

Extreme Networks is jumping into Zero Trust Network Access, Palo Alto Networks is reportedly spending more than half a billion dollars to acquire a corporate browser startup, and Forrester predicts as much as 20% of VMware’s customers may jump ship after the Broadcom acquisition completes. Arista touts a strong third quarter, while F5 forecasts a…

The post NB455: Extreme Announces ZTNA Offering; Palo Alto Networks Spends Big On A Browser Startup appeared first on Packet Pushers.

Top500 Supercomputers: Who Gets The Most Out Of Peak Performance?

The most exciting thing about the Top500 rankings of supercomputers that come out each June and November is not who is on the top of the list.

The post Top500 Supercomputers: Who Gets The Most Out Of Peak Performance? first appeared on The Next Platform.

Top500 Supercomputers: Who Gets The Most Out Of Peak Performance? was written by Timothy Prickett Morgan at The Next Platform.

LiquidStack expands into single-phase liquid cooling

LiquidStack, one of the first major players in the immersion cooling business, has entered the single-phase liquid cooling market with an expansion of its DataTank product portfolio.

Immersion cooling is the process of dunking the motherboard in a nonconductive liquid to cool it. It's primarily centered around the CPU but, in this case, involves the entire motherboard, including the memory and other chips.

Immersion cooling has been around for a while but has been something of a fringe technology. With server technology growing hotter and denser, immersion has begun to creep into the mainstream.

Aurora enters TOP500 supercomputer ranking at No. 2 with a challenge for reigning champ Frontier

Frontier maintained its top spot in the latest edition of the TOP500 for the fourth consecutive time and is still the only exascale machine on the list of the world's most powerful supercomputers. Newcomer Aurora debuted at No. 2 in the ranking, and it’s expected to surpass Frontier once the system is fully built.

Frontier, housed at the Oak Ridge National Laboratory (ORNL) in Tenn., landed the top spot with an HPL score of 1.194 quintillion floating point operations per second (FLOPS), the same score it posted earlier this year. A quintillion is 10^18, or one exaFLOPS (EFLOPS). The speed measurement used in evaluating the computers is the High Performance Linpack (HPL) benchmark, which measures how well systems solve a dense system of linear equations.

BrandPost: Prioritizing the human behind the screen through end-user experience scoring

As digital landscapes evolve, so does the definition of network performance. It's no longer just about metrics; it's about the human behind the screen. Businesses are recognizing the need to zoom in on the actual experiences of end-users. This emphasis has given rise to advanced tools that delve deeper, capturing the essence of user interactions and painting a clearer picture of network health.

The rise of End-User Experience (EUE) scoring

End-User Experience (EUE) scoring has emerged as a game-changer in the realm of network monitoring. Rather than solely relying on traditional metrics like latency or bandwidth, EUE scoring provides a holistic measure of how a user perceives the performance of a network or application. By consolidating various key performance indicators into a single, comprehensible metric, businesses can gain actionable insights into the true quality of their digital services, ensuring that their users' experiences are nothing short of exceptional.
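The idea of consolidating several KPIs into one score can be sketched as a weighted average. Real EUE products use proprietary models; the metric names and weights below are purely illustrative assumptions.

```python
def eue_score(metrics: dict, weights: dict = None) -> float:
    """Consolidate per-user KPIs into a single experience score.

    Each metric is assumed to be pre-normalized to a 0-100 scale,
    where 100 is best. The weights here are illustrative, not a
    vendor's actual model.
    """
    weights = weights or {"latency": 0.4, "availability": 0.4, "throughput": 0.2}
    total = sum(weights.values())
    return sum(metrics[name] * w for name, w in weights.items()) / total

# A user with great availability, good latency, and mediocre
# throughput collapses to one comprehensible number:
score = eue_score({"latency": 90, "availability": 100, "throughput": 70})
print(score)  # 90.0
```

The single number makes trend lines and alert thresholds simple, at the cost of hiding which underlying KPI moved; dashboards typically show both.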

Tech Bytes: Why AI Workloads Require Optimized Ethernet Fabrics (Sponsored)

Network engineers have a good grasp on how to build data center networks to support all kinds of apps, from traditional three-tier designs to applications built around containers and microservices. But what about building a network fabric to support AI? Today on the Tech Bytes podcast, sponsored by Nokia, we talk about the special requirements to build a data center fabric for AI use cases such as training and inference.

The post Tech Bytes: Why AI Workloads Require Optimized Ethernet Fabrics (Sponsored) appeared first on Packet Pushers.

Why Is IPv6 Adoption Slow?

IPv6, the most recent version of the Internet Protocol, was designed to overcome the address-space limitations of IPv4, which has been overwhelmed by the explosion of the digital ecosystem. Although major companies like Google, Meta, Microsoft and YouTube are gradually adopting IPv6, the overall adoption of this technologically superior protocol has been slow. As of September, only 22% of websites have made the switch. What is slowing the adoption of IPv6? Let’s take a walk through the possible causes and potential solutions.

Why IPv6?

IPv6 has a 128-bit address format that allows for a vastly larger number of unique IP addresses than its predecessor, IPv4, whose 32-bit format tops out at roughly 4.3 billion addresses. The 128-bit space yields about 340 undecillion (3.4 × 10^38) addresses, more than enough to accommodate the projected surge of devices. In addition to expanding the address space, IPv6 offers these improvements: Streamlined network management: Unlike IPv4, which requires manual configuration or external servers like DHCP (Dynamic Host Configuration Protocol), IPv6 supports stateless… Continue reading

Nvidia Pushes Hopper HBM Memory, And That Lifts GPU Performance

For very sound technical and economic reasons, processors of all kinds have been overprovisioned on compute and underprovisioned on memory bandwidth – and sometimes memory capacity depending on the device and depending on the workload – for decades.

The post Nvidia Pushes Hopper HBM Memory, And That Lifts GPU Performance first appeared on The Next Platform.

Nvidia Pushes Hopper HBM Memory, And That Lifts GPU Performance was written by Timothy Prickett Morgan at The Next Platform.

SC23 SCinet traffic

The real-time dashboard shows total network traffic at SC23, The International Conference for High Performance Computing, Networking, Storage, and Analysis, being held this week in Denver. The dashboard shows that 31 Petabytes of data have been transferred already and the conference hasn't even started.
The conference network used in the demonstration, SCinet, is described as the most powerful and advanced network on Earth, connecting the SC community to the world.
In this example, the sFlow-RT real-time analytics engine receives sFlow telemetry from switches, routers, and servers in the SCinet network and creates metrics to drive the real-time charts in the dashboard. Getting Started provides a quick introduction to deploying and using sFlow-RT for real-time network-wide flow analytics.
The dashboard above trends SC23 Total Traffic. The dashboard was constructed using the Prometheus time series database to store metrics retrieved from sFlow-RT and Grafana to build the dashboard. Deploy real-time network dashboards using Docker compose demonstrates how to deploy and configure these tools to create custom dashboards like the one shown here.

Finally, check out the SC23 Dropped packet visibility demonstration to learn about one of the newest developments in sFlow monitoring and see a live demonstration.

LAN Data Link Layer Addressing

Last week, we discussed Fibre Channel addressing. This time, we’ll focus on data link layer technologies used in multi-access networks: Ethernet, Token Ring, FDDI, and other local-area and Wi-Fi technologies.

The first local area networks (LANs) ran on a physical multi-access medium. The first one (original Ethernet) started as a thick coaxial cable that you had to drill into to connect a transceiver to the cable core.

Later versions of Ethernet used thinner cables with connectors that you put together to build whole network segments out of pieces of cable. However, even in that case, we were dealing with a single multi-access physical network – disconnecting a cable would bring down the whole network.
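The data link layer addresses these LAN technologies share are 48-bit MAC addresses, whose first octet carries two flag bits: I/G (individual/group, i.e. unicast vs. multicast) and U/L (universally vs. locally administered). A minimal decoder, as a sketch:

```python
def mac_bits(mac: str) -> dict:
    """Decode the two flag bits in the first octet of a 48-bit MAC.

    Bit 0 (I/G): 0 = unicast, 1 = multicast/group address.
    Bit 1 (U/L): 0 = globally unique (OUI-assigned), 1 = locally administered.
    """
    first_octet = int(mac.split(":")[0], 16)
    return {
        "multicast": bool(first_octet & 0b01),
        "locally_administered": bool(first_octet & 0b10),
    }

print(mac_bits("01:00:5e:00:00:01"))  # IPv4 multicast range: multicast=True
print(mac_bits("02:42:ac:11:00:02"))  # locally administered (Docker-style)
```

Switches and NICs test the I/G bit in hardware to decide whether a frame needs flooding to a group or delivery to a single station.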

Cisco Intent-Based Networking: Part I – Introduction

 Introduction

This chapter introduces Cisco's approach to Intent-based Networking (IBN) through their centralized SDN controller, Cisco DNA Center, rebranded as Cisco Catalyst Center (from now on, I am using the abbreviation C3 for Cisco Catalyst Center). We focus on a greenfield network installation, showing workflows, configuration parameters, and the relationships and dependencies between building blocks. The C3 workflow is divided into four main entities: 1) Design, 2) Policy, 3) Provision, and 4) Assurance, each having its own sub-processes. This chapter introduces the Design phase, focusing on Network Hierarchy, Network Settings, and Network Profile with Configuration Templates.

This post deprecates the previous post, "Cisco Intent-Based Networking: Part I, Overview."

Network Hierarchy

Network Hierarchy is a logical structure for organizing network devices. At the root of this hierarchy is the Global Area, where you establish your desired network structure. In our example, the hierarchy consists of four layers: Area (country - Finland), Sub-area (city - Joensuu), Building (JNS01), and Floor (JNS01-FLR01). Areas and Buildings indicate the location, while Floors provide environmental information relevant to wireless networks, such as floor type, measurements, and wall properties.
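The Global → Area → Building → Floor structure is essentially a tree. The sketch below models the example hierarchy (Finland / Joensuu / JNS01 / JNS01-FLR01) with a plain dataclass; it is an illustration of the concept, not C3's actual data model or API.

```python
from dataclasses import dataclass, field

@dataclass
class Site:
    name: str
    kind: str                      # "area", "building", or "floor"
    children: list = field(default_factory=list)

    def add(self, child: "Site") -> "Site":
        self.children.append(child)
        return child

    def path(self, target: str, trail: tuple = ()):
        """Return the hierarchy path down to `target`, or None."""
        trail = trail + (self.name,)
        if self.name == target:
            return "/".join(trail)
        for child in self.children:
            found = child.path(target, trail)
            if found:
                return found
        return None

# Build the example hierarchy from the text:
root = Site("Global", "area")
finland = root.add(Site("Finland", "area"))
joensuu = finland.add(Site("Joensuu", "area"))
building = joensuu.add(Site("JNS01", "building"))
building.add(Site("JNS01-FLR01", "floor"))

print(root.path("JNS01-FLR01"))  # Global/Finland/Joensuu/JNS01/JNS01-FLR01
```

Settings applied at an Area conceptually cascade to everything beneath it, which is why getting the hierarchy right is the first Design-phase task.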


Network Settings

Network settings define device credentials (CLI, HTTP(S), SNMP, and NETCONF) required for accessing devices… Continue reading

Debian on Mellanox SN2700 (32x100G)

Introduction

I’m still hunting for a set of machines with which I can generate 1Tbps and 1Gpps of VPP traffic, and considering a 100G network interface can do at most 148.8Mpps, I will need 7 or 8 of these network cards. Doing a loadtest like this with DACs back-to-back is definitely possible, but it’s a bit more convenient to connect them all to a switch. However, for this to work I would need (at least) fourteen or more HundredGigabitEthernet ports, and these switches tend to get expensive, real quick.
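The back-of-the-envelope numbers above check out: 148.8 Mpps is the 64-byte line rate of a 100G port once the 20 bytes of per-frame overhead (preamble, start-of-frame delimiter, and inter-frame gap) are counted.

```python
import math

LINE_RATE_BPS = 100e9            # 100G Ethernet
MIN_FRAME_WIRE_BYTES = 64 + 20   # 64B frame + 8B preamble/SFD + 12B IFG

# Maximum packet rate of one 100G port with minimum-size frames:
max_pps = LINE_RATE_BPS / (MIN_FRAME_WIRE_BYTES * 8)
print(f"{max_pps / 1e6:.1f} Mpps")  # 148.8 Mpps

# Ports needed to source 1 Gpps, plus as many again to sink it:
senders = math.ceil(1e9 / max_pps)
print(senders, "senders +", senders, "receivers =", 2 * senders, "ports")
```

With 7 senders and 7 receivers, that is the fourteen HundredGigabitEthernet ports mentioned above.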

Or do they?

Hardware

SN2700

I thought I’d ask the #nlnog IRC channel for advice, and of course the usual suspects came past, such as Juniper, Arista, and Cisco. But somebody mentioned “How about Mellanox, like SN2700?” and I remembered my buddy Eric was a fan of those switches. I looked them up on the refurbished market and I found one for EUR 1’400,- for 32x100G which felt suspiciously low priced… but I thought YOLO and I ordered it. It arrived a few days later via UPS from Denmark to Switzerland.

The switch specs are pretty impressive, with 32x100G QSFP28 ports, which can be broken out to a set of… Continue reading