lindsay

Author Archives: lindsay

EX3400 Disk Space and Upgrades

The Juniper EX3400 switch series is a decent access switch. But a Product Manager chose to save $0.50 on COGS by choosing a 2GB disk. That’s just not enough space to handle normal Junos upgrades. This has wasted untold engineer hours on busywork. I hope that person (A) got a bonus, and (B) is never allowed to under-spec hardware again.

Here’s some tips I’ve learnt for manual and automated upgrades for EX3400s.

Manual Upgrades

Search for “Juniper EX3400 disk space” and you’ll find plenty of people complaining about this, and some suggestions. Juniper KB31198 looks like a good place to start. But it starts with request system storage cleanup and request system snapshot delete snap*.

Those might work if you’re upgrading from 15.1X -> 18.2. Maybe if you’re lucky it will be enough for upgrades within the 18.4 train. But it almost certainly won’t work if you’re going from 18.4.x -> 20.2.x.

There have been PRs that are supposed to fix this, and they might help around the edges, but they don’t help a lot.

With certain version combinations, you could get away with copying the new verson to /mfs, and Continue reading

Juniper ARP Policer on PTX

I’ve written before about the default ARP policer on Juniper MX. It can create some odd failure conditions when you’re connected to noisy networks such as large Internet Exchanges. Junos OS Evolved, as used on platforms like the PTX10003 has low default values for ARP and ICMPv6 ND DDoS protections. It will cause the same problems, but is easier to diagnose and mitigate.

Juniper DDoS Protection

Platforms like MX, QFX, PTX have Control Plane DDoS protections built in. These will automatically rate-limit various traffic types that hit the CPU. This is generally a Good Thing. Certain packet types get punted from the ASIC to the CPU, but the CPU can’t handle anywhere near the traffic levels that the forwarding ASIC can. Send enough special packets to a router, choke the CPU, and you might be able to knock things offline. So having default policies to rate-limit traffic makes sense.

Platform Defaults

Juniper might have “One Junos” but we know it’s not that simple. Behavior varies between platforms. Check these default values for some DDoS protections for different platforms:

Protocol MX QFX PTX
ARP 20,000 500 500
NDPv6 20,000 N/A 500
ICMP 20,000 N/A 500
BGP 20,000 3,000 5,000

Note Continue reading

Juniper i40e NVM Firmware Upgrade

Juniper Routing Engines with VM Host need an i40e NVM firmware upgrade. The procedure is a pain in the ass, and documentation is not great. But you can’t avoid the upgrade any more. New Junos versions need the firmware upgrade, and replacement REs will ship with it already installed. Here’s some tips on doing the upgrade.

Background

Newer Juniper Routing Engines use a Linux-based hypervisor, and Junos (still BSD-based) runs as a guest VM. This is mostly transparent for day to day operations. When you do a Junos upgrade, it will upgrade the underlying hypervisor if required.

Upcoming Junos versions ship with a new version of Wind River Linux that needs i40e firmware version 6.01. Older versions used v4.26. You need the new i40e firmware installed first, before you can install the latest Junos versions. You can’t put this upgrade off forever. Sooner or later you’ll want to ugprade to a Junos version that only supports the new firmware. Or you’ll get a replacement RE delivered with new firmware, and you can’t downgrade it.

For the last couple of years, Juniper has been shipping Junos versions that will work with both old & new firmware versions. You Continue reading

Juniper Direct vs Local Routes

Juniper routers consider a directly configured IP as a “local” route, except when you use /32 mask. Then it is a “direct” route. This caused me some confusion when creating a policy to redistribute loopback IP addresses into BGP.

Route Protocol Types

A router learns routes from a variety of sources - networks configured on the box, those learned from IS-IS, rumors of prefixes from BGP or RIP, etc. You can see the full list here.

When routes are learned from different sources, Junos uses “Route Preference Values” to decide which route source to prefer. (Cisco refers to this as Administrative Distance). If routes are otherwise identical, the route with the lowest preference will be installed into the FIB.

If you’re looking at the route table, you can narrow down displayed routes to look at a specific type, e.g. show route protocol direct to see locally connected networks:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
[email protected]> show route protocol direct

inet.0: 7 destinations, 7 routes (7 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active,  Continue reading

Juniper Direct vs Local Routes

Juniper routers consider a directly configured IP as a “local” route, except when you use /32 mask. Then it is a “direct” route. This caused me some confusion when creating a policy to redistribute loopback IP addresses into BGP.

Route Protocol Types

A router learns routes from a variety of sources - networks configured on the box, those learned from IS-IS, rumors of prefixes from BGP or RIP, etc. You can see the full list here.

When routes are learned from different sources, Junos uses “Route Preference Values” to decide which route source to prefer. (Cisco refers to this as Administrative Distance). If routes are otherwise identical, the route with the lowest preference will be installed into the FIB.

If you’re looking at the route table, you can narrow down displayed routes to look at a specific type, e.g. show route protocol direct to see locally connected networks:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
[email protected]> show route protocol direct

inet.0: 7 destinations, 7 routes (7 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active,  Continue reading

Juniper Default ARP Policer

Juniper devices have a default ARP policer that drops ARP requests and responses over 150kbps. By default, this is an aggregate policer that applies to all interfaces. This can lead to unexpected behavior when high levels of ARP on one interface lead to BGP session drops on another interface. You can’t change the default policer limits, but you can create a new policer, with higher limits.

Problem: IPv4 BGP Session Flaps on PNI

I was investigating a problem reported by one of our Transit providers. Once a day or so, our IPv4 BGP session with them would flap. The interface itself was stable, and the IPv6 session remained up. One particular site was seeing this more than others. The sites used different platforms, but were running the same code version.

The curious thing was the logs - we saw log messages saying that we had a notification message saying NOTIFICATION received from 192.0.2.188 (External AS 64498): code 4 (Hold Timer Expired Error). The syslog included this hold timer 30s, hold timer remain 0s, last sent 2s. So our router thought it was sending regular KEEPALIVE messages, but the remote end thought it had missed too many.

Looking Continue reading

Juniper Default ARP Policer

Juniper devices have a default ARP policer that drops ARP requests and responses over 150kbps. By default, this is an aggregate policer that applies to all interfaces. This can lead to unexpected behavior when high levels of ARP on one interface lead to BGP session drops on another interface. You can’t change the default policer limits, but you can create a new policer, with higher limits.

Problem: IPv4 BGP Session Flaps on PNI

I was investigating a problem reported by one of our Transit providers. Once a day or so, our IPv4 BGP session with them would flap. The interface itself was stable, and the IPv6 session remained up. One particular site was seeing this more than others. The sites used different platforms, but were running the same code version.

The curious thing was the logs - we saw log messages saying that we had a notification message saying NOTIFICATION received from 192.0.2.188 (External AS 64498): code 4 (Hold Timer Expired Error). The syslog included this hold timer 30s, hold timer remain 0s, last sent 2s. So our router thought it was sending regular KEEPALIVE messages, but the remote end thought it had missed too many.

Looking Continue reading

Juniper Branch SRX LACP Weirdness

Juniper SRX 300 Series firewalls may stop forwarding traffic in some situations. The firewall says it is forwarding the traffic, but it doesn’t work. Monitoring traffic looks OK, ARP entries are present, but traffic never gets to the destination, until you clear ARP. Turns out the problem comes from using LACP with fast timers and active mode. Luckily the fix is simple.

Alert: Firewall Offline

Here’s the situation we saw: Our NMS reported a Juniper SRX320 offline. All other devices at the site were still working, but the firewall was unreachable. Traffic from the firewall to the NMS goes via the firewall’s default gateway. Firewall A in this diagram was unreachable, but Firewall B was fine.

network_overview

OK, what’s happening? Why is my firewall unreachable?

Firewall says its fine?

Try to ping Firewall A, no response. From the default gateway, we can see an ARP entry for the firewall, but no response to ping. We can log in to Firewall B, and we see an ARP entry for Firewall A. Crucially: we can ping Firewall A from Firewall B. Hmmm. That’s strange. Why can we ping it from one locally connected device but not another?

From Firewall B, we SSH across Continue reading

Juniper Branch SRX LACP Weirdness

Juniper SRX 300 Series firewalls may stop forwarding traffic in some situations. The firewall says it is forwarding the traffic, but it doesn’t work. Monitoring traffic looks OK, ARP entries are present, but traffic never gets to the destination, until you clear ARP. Turns out the problem comes from using LACP with fast timers and active mode. Luckily the fix is simple.

Alert: Firewall Offline

Here’s the situation we saw: Our NMS reported a Juniper SRX320 offline. All other devices at the site were still working, but the firewall was unreachable. Traffic from the firewall to the NMS goes via the firewall’s default gateway. Firewall A in this diagram was unreachable, but Firewall B was fine.

network_overview

OK, what’s happening? Why is my firewall unreachable?

Firewall says its fine?

Try to ping Firewall A, no response. From the default gateway, we can see an ARP entry for the firewall, but no response to ping. We can log in to Firewall B, and we see an ARP entry for Firewall A. Crucially: we can ping Firewall A from Firewall B. Hmmm. That’s strange. Why can we ping it from one locally connected device but not another?

From Firewall B, we SSH across Continue reading

Juniper QFX10K IPFIX Gotchas

IPFIX is problematic on the Juniper QFX10K switches. Documentation is sparse, and doesn’t have a complete configuration. Behavior changes between versions in undocumented ways. Here’s a couple of things I noticed when upgrading from Junos 17.3 to 17.4. These also apply if you are running 18.4 code. I hit more problems with 18.4, and ended up rolling back to 17.4.

Big Changes in Reported Throughput

Here’s a graph showing total reported throughput for a QFX10K I upgraded:

ipfix traffic report

There’s a few things going on there. First the reported traffic drops to zero after I upgraded. Then it starts coming up, after I fixed the first problem. But then after that the reported traffic is flat, and lower than it should be. Then it starts coming up again after I made the second fix.

First Problem: Chassis Sample Instance

The first configuration change I needed to add was this: set chassis fpc 0 sampling-instance sample-border, where sample-border is the name of the sampling instance I have configured under forwarding-options. This was not required with 17.3. If you don’t do it with 17.4, you won’t get any data.

Second Problem: DDoS-Protection

Some Juniper platforms implement Continue reading

Juniper QFX10K IPFIX Gotchas

IPFIX is problematic on the Juniper QFX10K switches. Documentation is sparse, and doesn’t have a complete configuration. Behavior changes between versions in undocumented ways. Here’s a couple of things I noticed when upgrading from Junos 17.3 to 17.4. These also apply if you are running 18.4 code. I hit more problems with 18.4, and ended up rolling back to 17.4.

Big Changes in Reported Throughput

Here’s a graph showing total reported throughput for a QFX10K I upgraded:

ipfix traffic report

There’s a few things going on there. First the reported traffic drops to zero after I upgraded. Then it starts coming up, after I fixed the first problem. But then after that the reported traffic is flat, and lower than it should be. Then it starts coming up again after I made the second fix.

First Problem: Chassis Sample Instance

The first configuration change I needed to add was this: set chassis fpc 0 sampling-instance sample-border, where sample-border is the name of the sampling instance I have configured under forwarding-options. This was not required with 17.3. If you don’t do it with 17.4, you won’t get any data.

Second Problem: DDoS-Protection

Some Juniper platforms implement Continue reading

Juniper MX Upgrades Causing Overheating

Juniper changed the way they do temperature management on MX240 and MX480 chassis devices, somewhere between 15.1 and 17.3. The net result is that your chassis might run hotter after you upgrade, which can lead to the system shutting down some optics. Probably not what you want. Luckily there’s a few hidden commands you can use to change this behavior

“Optics will be disabled…”

Post upgrade, you might see higher temperatures reported by show chassis fpc. This system was reporting temperatures in the low 30s, now it reports 50:

1
2
3
4
5
6
7
8
9
[email protected]> show chassis fpc
                     Temp  CPU Utilization (%)   CPU Utilization (%)  Memory    Utilization (%)
Slot State            (C)  Total  Interrupt      1min   5min   15min  DRAM (MB) Heap     Buffer
  0  Empty
  1  Online            50     22          1       22     22     22    2048       38         21
  2  Empty

{master}
[email protected]>

On its own, that’s OK, until you start seeing log messages like this:

1
FPC 1 temperature over 50 degrees C; non-high-temperature tolerant optics will be disabled in 58 seconds if condition persists

Yeah that’s not good, especially when it carries out the threat, and Continue reading

Juniper MX Upgrades Causing Overheating

Juniper changed the way they do temperature management on MX240 and MX480 chassis devices, somewhere between 15.1 and 17.3. The net result is that your chassis might run hotter after you upgrade, which can lead to the system shutting down some optics. Probably not what you want. Luckily there’s a few hidden commands you can use to change this behavior

“Optics will be disabled…”

Post upgrade, you might see higher temperatures reported by show chassis fpc. This system was reporting temperatures in the low 30s, now it reports 50:

1
2
3
4
5
6
7
8
9
[email protected]> show chassis fpc
                     Temp  CPU Utilization (%)   CPU Utilization (%)  Memory    Utilization (%)
Slot State            (C)  Total  Interrupt      1min   5min   15min  DRAM (MB) Heap     Buffer
  0  Empty
  1  Online            50     22          1       22     22     22    2048       38         21
  2  Empty

{master}
[email protected]>

On its own, that’s OK, until you start seeing log messages like this:

1
FPC 1 temperature over 50 degrees C; non-high-temperature tolerant optics will be disabled in 58 seconds if condition persists

Yeah that’s not good, especially when it carries out the threat, and Continue reading

QFX Upgrades – Check Host Version

I came across a situation where a software upgrade failed for some members in a Juniper QFX Virtual Chassis. There is a known issue with upgrades with a certain configuration + version combination, but I thought it didn’t apply to me. Turns out that the key was the host OS version, not the Junos VM version. Your host and guest versions can be out of sync with Juniper QFX 5K devices, and this can lead to confusing behavior, especially in a virtual chassis where host OS versions might vary.

Upgrade Failures - post-install error

When upgrading an old Juniper QFX5100, you might see these messages when running the upgrade:

1
2
Error: jinstall-vjunos fails post-install
Error: jinstall-vjunos-14.1X53-D34-domestic-signed fails post-install

In my case, I saw it for some nodes in a Virtual Chassis. Some worked, some failed. KB31923 says that this error is due to this configuration:

1
2
3
# show system internet-options
tcp-drop-synfin-set;
no-tcp-reset drop-tcp-with-syn-only;

Easy enough to change:

1
2
3
4
5
6
7
8
9
10
# delete system internet-options
{master:0}[edit]
root# show |compare
[edit system]
  internet-options {
      tcp-drop-synfin-set;
     no-tcp-reset drop-tcp-with-syn-only;
 
{master:0}[edit]
root# commit

Continue reading

QFX Upgrades – Check Host Version

I came across a situation where a software upgrade failed for some members in a Juniper QFX Virtual Chassis. There is a known issue with upgrades with a certain configuration + version combination, but I thought it didn’t apply to me. Turns out that the key was the host OS version, not the Junos VM version. Your host and guest versions can be out of sync with Juniper QFX 5K devices, and this can lead to confusing behavior, especially in a virtual chassis where host OS versions might vary.

Upgrade Failures - post-install error

When upgrading an old Juniper QFX5100, you might see these messages when running the upgrade:

1
2
Error: jinstall-vjunos fails post-install
Error: jinstall-vjunos-14.1X53-D34-domestic-signed fails post-install

In my case, I saw it for some nodes in a Virtual Chassis. Some worked, some failed. KB31923 says that this error is due to this configuration:

1
2
3
# show system internet-options
tcp-drop-synfin-set;
no-tcp-reset drop-tcp-with-syn-only;

Easy enough to change:

1
2
3
4
5
6
7
8
9
10
# delete system internet-options
{master:0}[edit]
root# show |compare
[edit system]
  internet-options {
      tcp-drop-synfin-set;
     no-tcp-reset drop-tcp-with-syn-only;
 
{master:0}[edit]
root# commit

Continue reading

Junos SNMP via Routing Instance

Juniper routing instances are very useful when you need separate routing tables on the one device, for example to separate customers. Junos lets you configure SNMP polling of routing instances, so customers can poll “their” interfaces using 'instance_name'@'community'. All very useful. But it wasn’t obvious to me how to poll the default table via an interface in a routing instance. The trick is to just use @'community'. Here’s an example.

Network Overview

To demo this I have a simple network. I’m using a Virtual QFX plus Vagrant setup, based on the Vagrantfiles in this repo. I’m running one vqfx10k, connected to one server. The key here is that the server has two connections to the vqfx. One interface is in the default instance, one is in a “Customer” routing instance:

network_overview

Here’s the routing-instance config:

1
2
3
4
5
6
7
8
[email protected]> show configuration routing-instances
Customer {
    instance-type virtual-router;
    interface xe-0/0/1.0;
}

{master:0}
[email protected]>

And here’s my SNMP configuration:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
[email protected]> show  Continue reading

Junos SNMP via Routing Instance

Juniper routing instances are very useful when you need separate routing tables on the one device, for example to separate customers. Junos lets you configure SNMP polling of routing instances, so customers can poll “their” interfaces using 'instance_name'@'community'. All very useful. But it wasn’t obvious to me how to poll the default table via an interface in a routing instance. The trick is to just use @'community'. Here’s an example.

Network Overview

To demo this I have a simple network. I’m using a Virtual QFX plus Vagrant setup, based on the Vagrantfiles in this repo. I’m running one vqfx10k, connected to one server. The key here is that the server has two connections to the vqfx. One interface is in the default instance, one is in a “Customer” routing instance:

network_overview

Here’s the routing-instance config:

1
2
3
4
5
6
7
8
[email protected]> show configuration routing-instances
Customer {
    instance-type virtual-router;
    interface xe-0/0/1.0;
}

{master:0}
[email protected]>

And here’s my SNMP configuration:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
[email protected]>  Continue reading

New Role with Valve

I have started a new role as a Network Engineer with Valve Corporation. My period of unemployment was short-lived, and I am gainfully employed once more.

Not Another Vendor?

Did I think about going to work for another vendor? Yes, I did. I thought a lot about what I want to do, and what type of company I want to work for. Small/medium/large, vendor/customer, Product Manager vs Engineer, etc.

For now, I decided I want to solve business problems using whichever tools are appropriate, rather than building and selling a single product. I didn’t want to work for a company that just consumes technology though. I want to work somewhere that has interesting problems, and will do whatever is needed to solve those problems - build/buy/cobble together.

Why Valve?

Valve is big enough to offer the right level of challenge, but also small enough that I can make a difference. I’m not lost in the machine, but I am working on a global network.

Valve is also quite a different company. Check out the Employee Handbook to get a sense of Continue reading

New Role with Valve

I have started a new role as a Network Engineer with Valve Corporation. My period of unemployment was short-lived, and I am gainfully employed once more.

Not Another Vendor?

Did I think about going to work for another vendor? Yes, I did. I thought a lot about what I want to do, and what type of company I want to work for. Small/medium/large, vendor/customer, Product Manager vs Engineer, etc.

For now, I decided I want to solve business problems using whichever tools are appropriate, rather than building and selling a single product. I didn’t want to work for a company that just consumes technology though. I want to work somewhere that has interesting problems, and will do whatever is needed to solve those problems - build/buy/cobble together.

Why Valve?

Valve is big enough to offer the right level of challenge, but also small enough that I can make a difference. I’m not lost in the machine, but I am working on a global network.

Valve is also quite a different company. Check out the Employee Handbook to get a sense of Continue reading

Netlify Migration

This blog is now hosted via Netlify, rather than GitHub Pages. It is still built using Jekyll, but I updated the theme to Mediumish.

I looked at switching to Hugo for site generation, but I hit several bugs trying to do the import, and theme setup was a pain. So I stuck with Jekyll, because it’s doing what I need. Using Netlify gives me a few more options around build and deploy, and moves away from Cloudflare.

All URLs and RSS feed should remain the same. Let me know if you see any issues.

1 2 3 5