Author Archives: Lindsay Hill

Christmas Change Freeze – Good or Bad?

We’re approaching Christmas, and for many of us that means an extended change freeze: a period when we’re not supposed to change anything, in the hope of improving stability. ITIL Change Management tells us this is good. I’m not convinced.

The Christmas Change Freeze

Many businesses impose some form of change freeze across all production systems during the Christmas/New Year period. In theory, all network/compute/storage changes are deferred until January. In practice, high-priority changes will still be made if you jump up and down enough. The rate of change should still be lower during this period, though.

Some change freezes may only run from just before Christmas until early January. Other businesses will go into a change freeze for as long as five weeks. My experience is that Southern Hemisphere businesses have a longer change freeze than Northern Hemisphere ones. I assume this is because many staff take extended leave over the Austral summer.

Aside: In New Zealand, the term ‘Brown out’ is often used when referring to the Christmas Change Freeze. I have no idea why this term is used, as a ‘brownout’ normally refers to something quite different.

Why Have One?

There are differing opinions about the usefulness Continue reading

War Stories: Unix Security

A different kind of war story this time: Unix security blunders. Old-school Unix types will mutter about how much more secure Unix systems are than Windows, but that glosses over a lot. In a former life I worked as an HP-UX sysadmin, and I saw some shocking default configurations. I liked HP-UX – so much better laid out than Solaris – but it was very insecure by default. Here are a few things I’ve come across:

Gaining Root

We’d lost the root password for a test HP-UX server. We had user access, but not root. The server was located in a different DC, and we didn’t really feel like going and plugging in a console cable to reset the root password. So we started looking around at how we might get access. After a while I found these two things:

  1. Root’s home directory was ‘/‘ – this was the default on HP-UX
  2. The Remote Login service was running

And now for the kicker:

hpux lhill$ ls -ld /
drwxrwxrwx 30 root wheel 1020 1 Nov 13:57 /

Put those together, and you can see it’s easy to gain root. All we needed to do was create /.rhosts, and add whatever Continue reading
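The post cuts off there, but the mechanics are simple enough to sketch. A hypothetical reconstruction (the ‘+ +’ entry is the classic rhosts wildcard meaning ‘trust every user from every host’; commands from memory, so treat as illustrative):

hpux lhill$ echo "+ +" > /.rhosts
workstation$ rlogin -l root hpux

Because ‘/’ was world-writable, any unprivileged user could create root’s .rhosts file, and the Remote Login service would then happily let us in as root – no password required.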

Outsourcing Mistakes

Outsourcing is complex, and there are lots of ways it can go wrong, or simply fail to deliver. I’ve put together a list of things that I see going wrong with outsourcing arrangements. Of course it’s not exhaustive!

There are a few different types of outsourcing. It might mean procuring a commodity service – e.g. IaaS – or it might mean handing over your existing environment and staff to a third party. There’s also a whole range of levels in between, but the usual deal is: some part of your environment gets managed or delivered by someone else, according to the terms of a fixed agreement.

Here are a few things I’ve learnt to watch out for (NB: not all of these items apply to all types of agreement):

Not keeping up to date

If your outsourcer is managing your software, the contract usually covers applying security patches and bug fixes. But what gets missed are the larger upgrades – e.g. ESXi 4.1 to 5.x. Everything goes OK for a while…and then your version goes End of Support.

It then becomes a major drama to get the upgrades sorted out. For financial purposes, you may not be able to do major Continue reading

Juniper SRX-110H EoL

Somehow I missed this when it was announced, but the Juniper SRX-110H-VA is End of Life, and is no longer supported for new software releases.

The End of Life announcement is here, with extra detail in this PDF. The announcement was Dec 10 2013, with a “Last software engineering support” date of Dec 20 2013.

This is now starting to take effect, with 12.1X47 not supported on this platform:

Note: Upgrading to Junos OS Release 12.1X47-D10 or later is not supported on the J Series devices or on the low-memory versions of the SRX100 and SRX200 lines. If you attempt to upgrade one of these devices to Junos OS 12.1X47-D10, installation will be aborted with the following error message:

ERROR: Unsupported platform <platform-name> for 12.1X47 and higher

The replacement hardware is the SRX-110H2-VA, which has 2GB of RAM instead of 1GB. Otherwise it’s exactly the same, which seems a missed opportunity to at least update to local 1Gb switching.

Michael Dale has a little more info here, along with tips for tricking a 240H into installing 12.1X47.

So I decided to see if I could work around this and trick JunOS into installing on my 240H, I Continue reading

Wipebook – A Portable Whiteboard

It is a stereotype, but engineers really do like whiteboards. Problem is, you can’t carry one around with you. Plus there are still a few unenlightened employers who don’t provide whiteboards. Enter the Wipebook, a spiral-bound notebook made of whiteboard-like pages.

I normally carry a notebook for scratching out notes while talking to customers, sketching diagrams, working through problems, etc. I don’t archive these notes – most are just short-term things, and I shred them. Important stuff gets turned into OmniFocus tasks/emails/etc.

So the Wipebook looks perfect for me. My wife bought one for me recently, and I’ve started using it at work. So far, it’s working as expected. I can quickly scribble notes, sketch a diagram, make corrections, etc. When I’m done with it, I wipe the page down.

It’s not perfect – the pages don’t always wipe down perfectly, and obviously it gets bumped around in my bag. So it won’t last forever. But it’s a nice touch that I can open & close the bindings, so I can easily get rid of any pages that are too beaten up.

The pens have a small eraser on the end, but it’s only suitable for very minor corrections. I have a Continue reading

iRules/Tcl – Watch the Comments

It’s pretty common practice to ‘comment out’ lines in scripts. The code stays in place, but doesn’t get executed. Perfect for testing, when you might need more debug output, or you want to run a slightly different set of actions. But you have to be careful when commenting out lines – it might catch you out, and the F5 iRules editor won’t save you.

Normally it’s pretty simple to comment out a line. Here’s a quick Bash example:

#!/bin/bash

FILECOUNT=`ls /tmp|wc -l`

if [ $FILECOUNT -lt 7 ]
 then
        echo "There are fewer than 7 files in /tmp"
        run_command
fi
...

When I’m testing the script, I might not want to actually run that command. So I’ll quickly comment it out like this:

#!/bin/bash

FILECOUNT=`ls /tmp|wc -l`

if [ $FILECOUNT -lt 7 ]
 then
        echo "There are fewer than 7 files in /tmp"
        #run_command
fi
...

The ‘#’ tells the shell to ignore anything else on that line. All pretty straightforward.

Today I was debugging an F5 synchronisation issue, where I got this message when syncing:

BIGpipe parsing error (/config/bigip.conf Line 333):
   012e0054:3: The braced list of attributes is not closed for 'rule'.

The offending section looked like this:

when  Continue reading
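The excerpt is cut short, but the gotcha deserves spelling out: Tcl balances braces even inside comments when working out where a braced block ends, and the BIG-IP config parser does the same when reading bigip.conf. Here’s a minimal reconstruction (not the actual rule from that config) of the sort of thing that triggers exactly this error:

when HTTP_REQUEST {
    # set debug "entering handler {"
    pool production_pool
}

The ‘#’ hides that line from execution, but the parser still counts the stray ‘{’ inside it, so the rule body never appears to close. The same applies when you comment out one line of a multi-line ‘if’ – if the braces on the commented lines don’t balance, the config won’t load.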

Complexity vs Security

Many of the ‘security’ measures in our networks add complexity. That may be an acceptable tradeoff, if we make a meaningful difference to security. But often it feels like we just add complexity for no real benefit.

Here are some examples of what I’m talking about:

  • Multiple Firewall Layers: Many networks use multiple layers of firewalls. If you have a strong policy that says all traffic must go via a server within a DMZ, this makes sense. But often we end up with the same connections going through multiple firewalls. We end up configuring the same rules in multiple places. No security benefit, but increased chance of making mistakes, and added troubleshooting complexity.
  • Chained proxies: It’s pretty common to use a proxy server to enforce HR and security controls on what users browse. But some organisations have chained proxies, where an internal proxy server connects to an upstream proxy server to get Internet access. The upstream proxy doesn’t add anything from a policy or control perspective. It’s just another point to configure and troubleshoot.
  • NAT/Routing: First let me be clear: NAT is not complete security in itself, but it can form a valid part of your overall network security policy. That Continue reading

War Stories: Cursed VLANs

I’ve written before about switch ports being permanently disabled. This time it’s something new to me: VLANs that refuse to forward frames.

A Simple Network

The network was pretty straightforward. A pair of firewalls connecting through a pair of switches to a pair of routers:

[Diagram: Cursed VLAN]

Sub-interfaces were used on the routers and firewalls, with trunks to the switches. VLAN 100 was used for 100.100.100.0/24, and VLAN 200 was used for 200.200.200.0/24. The switches were configured to pass VLANs 100 & 200.
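Something like this, if you assume Cisco-style kit (the platforms weren’t named, so treat this as a sketch):

! Router A – sub-interface tagging VLAN 200
interface GigabitEthernet0/1.200
 encapsulation dot1Q 200
 ip address 200.200.200.1 255.255.255.0
!
! Switch – trunk port towards Router A
interface GigabitEthernet0/10
 switchport mode trunk
 switchport trunk allowed vlan 100,200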

All was working as expected. All devices could see each other on all VLANs.

Until it stopped

We received reports that we’d lost reachability to Router A’s VLAN 200 sub-interface. After doing some investigation, we could see that Firewall-A could no longer see Router A’s MAC address on G0.200. But everything else was fine – the VLAN 100 interface worked perfectly. So we knew it couldn’t be a physical interface issue.

Hmmm. What’s going on? First instinct: check the switch port configuration. Has anything changed? Nope. VLAN 200 still there, configured as expected. The router & firewall were still tagging frames with VLAN 200. But they couldn’t see each other, and the Continue reading

Ops Work vs Project Work

There’s a constant tension between delivering new services, and running the existing services well. How do you figure out how to prioritise work between Operations tasks and Project work? Skewing too far either way leads to problems. Maybe the answer is in how we structure Operations tasks?

Definitions

  • Operations work: Dealing with outages, trouble tickets, support requests, etc. System monitoring – reviewing data for capacity planning, and identifying new areas to monitor. Automating repetitive tasks. Patches, upgrades, and minor changes to existing services. Accountants would call this work OpEx.
  • Project work: Design, test and deployment of new services. Major upgrades or enhancements to existing services. This is usually classified as CapEx. For some businesses, this work is customer-billable.

What happens when you’re imbalanced?

  • Too much Project work: If you’re flat out deploying new systems (and dealing with the fallout), it’s easy to let Operations work slip. Maybe you don’t get around to automating that log rotation script, or paying attention to the slope of that consumption graph. It’s OK for a while too…things seem to be trucking along. But then you start having outages due to simple things like logs filling directories, or you hit a capacity limit, and there’s a 6-week Continue reading

Meeting Rules

Years ago a wise engineer gave me these rules for meetings:

  1. Never go into a meeting unless you know what the outcome will be.
  2. Plan to leave the meeting with less work than when you went in.

Stick to those rules, and you’ll do well.

OK, so maybe the second rule’s not so serious, but the first one has a grain of truth. You don’t need to know exactly what the decision should be, but you should be clear about what you want to get decided. If it’s particularly important, you should have already discussed it with the key attendees, and you should know what they’re thinking. You don’t want any surprises.

Too many meetings have no clear purpose, or they can only agree that ‘a decision needs to be made…pending further research.’ Avoid those sorts of meetings. Otherwise it ends up like…well…Every Meeting Ever.

Cumulus in the Campus?

Recently I’ve been idly speculating about how campus networking could be shaken up, with different cost and management models. A few recent podcasts have inspired some thoughts on how Cumulus Networks might fit into this.

In response to a PacketPushers podcast on HP Network Management, featuring yours truly, Kanat asks:

For me the benchmark of network management so far is Meraki Dashboard – stupid simple and feature rich…
Yes – it’s a niche product that only focuses on Campus scenarios, Yes – it only supports proprietary HW. But it offers pretty much everything network operator needs – detailed visibility, traffic policy engine with L7 capability, MDM and you can hit it and go full speed right away.

How long will it take HP to achieve that level of simplicity/usability?

He’s right about the Meraki dashboard. It’s fantastic. Fast to get set up, easy to use, it’s what others should aspire to. But there’s a catch: It only works with Meraki hardware. Keep paying your monthly bills, and all is well. But what if you’ve got non-Meraki hardware? Or what if you decide you don’t want to pay Meraki any more? What if Meraki goes out of business (unlikely, but still Continue reading

Accurate Dependency Mapping – One Day?

Recently I’ve been thinking about Root Cause Analysis (RCA), and how it’s not perfect, but there may be hope for the future.

The challenge is that Automated RCA needs an accurate, complete picture of how everything connects together if it’s to work well. You need to know all the dependencies between networks, storage, servers, applications, etc. If you have a full dependency mapping, you can start to figure out the underlying cause of a fault, or you can start doing ‘What If?’ scenario planning.

But once your network gets past a moderate size, it’s hard to maintain this sort of dependency mapping. Manual methods break down, and we look for automated means instead – but they have gaps and limitations.

Automated Mapping – Approaches & Limitations

Tools such as HP’s CMS suite attempt to discover all objects and dependencies using a combination of network scanning and agents. They’ll use things like ping, SNMP, WMI, nmap to identify systems and running services. Agents can then report more data about installed applications, configurations, etc.

Network sniffing can also be used to identify traffic flows. Most tools will also connect to common orchestration points, such as vCenter, or the AWS console, to Continue reading
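As a toy illustration of the traffic-based approach, you can approximate one host’s dependencies from its connection table (a rough sketch using standard Linux netstat output; real discovery tools do this network-wide with flows or sniffing):

netstat -tn | awk '$6 == "ESTABLISHED" {print $4, "->", $5}' | sort | uniq -c | sort -rn

Each line of output is a crude dependency edge – local socket talking to remote socket – weighted by session count. The hard part, as ever, is doing this continuously, across thousands of hosts, and turning sockets into named services.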

Fixed-Price, or T&M?

Recently I posted about Rewarding Effort vs Results, how different contract structures can have different outcomes. This post covers Time & Materials vs Fixed-Price a little more, looking at pros & cons, and where each one is better suited.

Definitions:

  • Time & Materials: Client & supplier agree on the requirements, and an hourly rate. The client is billed based upon the number of hours spent completing the job. Any costs for materials are also passed on. If the job takes 8 hours, the client pays for 8 hours. If it takes 800 hours, the client pays for 800 hours. To prevent bill shock, there will usually be review points to measure progress & time spent. Risk lies with the client.
  • Fixed-Price: Client & supplier agree beforehand on what outcomes the client needs. It is crucial that this is well-documented, so there are no misunderstandings. The supplier will estimate how long the job will take, allow some extra margin, and quote a figure. The client pays the same amount, regardless of how long the job takes. Risk lies with the supplier.

Comparison:

Time & Materials

Pros: Little time/energy wasted on quoting – engineers can get to work faster. Customer saves money if job Continue reading

Andrisoft Wanguard: Cost-Effective Network Visibility

Andrisoft Wansight and Wanguard are tools for network traffic monitoring, visibility, anomaly detection and response. I’ve used them, and think they do a good job for a reasonable price.

Wanguard Overview

There are two flavours to what Andrisoft does: Wansight for network traffic monitoring, and Wanguard for monitoring and response. They both use the same underlying components; the main difference is that Wanguard can actively respond to anomalies (DDoS, etc).

Andrisoft monitors traffic in several ways – it can do flow monitoring using NetFlow/sFlow/IPFIX, or it can work in inline mode and do full packet inspection. Once everything is set up, all configuration and reporting is done from a console. This can be on the same server you’re using for flow collection, or you can use a distributed setup.

The software is released as packages that can run on pretty much any mainstream Linux distro. It can run on a VM or on physical hardware. If you’re processing a lot of data, you will need plenty of RAM and fast disk. VMs are fine for this, provided you have the right underlying resources. Don’t listen to those who still cling to their physical boxes. They lost.

Anomaly Detection

You Continue reading

Non-Functional Requirements

I’m currently reading and enjoying “The Practice of Cloud System Administration.” It doesn’t go into great depth in any one area, but it covers a range of design patterns and implementation considerations for large-scale systems. It works for two audiences: as a primer for junior engineers who need a broad overview, or as a reference for more experienced engineers. It doesn’t cover all the implementation specifics, nor should it: it would date very quickly if it tried.

I’ve long disliked the term “non-functional requirements,” so I enjoyed this passage:

Rather than the term “operational requirements,” some organizations use the term “non-functional requirements.” We consider this term misleading. While these features are not directly responsible for the function of the application or service, the term “non-functional” implies that these features do not have a function. A service cannot exist without the support of these features; they are essential.

It is all the fashion today to separate requirements into ‘functional’ and ‘non-functional,’ but the authors are right to point out that this can be misleading. Perhaps it’s the old Operations Engineer in me, but if a product doesn’t have things like Backup & Restore, or Configuration Management, then it’s a Continue reading

Keep an Open Mind

We all know that IT changes rapidly, but we still don’t always accept what that means. Companies and technologies change over time, and good engineers recognise this. Poor engineers cling to past beliefs, refusing to accept change. Try to keep an open mind, and periodically re-evaluate your opinions.

Consider the Linux vs Microsoft debate. I’ve been an Open Source fan for a long time, and have plenty of experience running Linux on servers and desktops. Today I use OS X as my primary desktop. I’ve cursed at Microsoft many times over the years, usually when dealing with some crash, security issue, or odd design choice.

But it annoys the hell out of me when I hear engineers spouting tired old lines about Microsoft products crashing, or having poor security. This is usually accompanied by some smug look: “Hur hur hur…Microsoft crash…Blue Screen of Death…hur hur hur.”

I get frustrated because these people aren’t paying attention to what Microsoft has been doing. They have come a very long way since the 2002 Bill Gates email setting security as the top priority. It’s a big ship to turn, and it took time. Their overall security model and practices are far better than they were, Continue reading

Rewarding Effort vs Results

Sometimes we confuse effort with outcome. We think that hours spent are more important than outcomes achieved. Or we unintentionally create a system where effort is rewarded, rather than outcomes.

Consider a situation where you work for a consulting firm, doing capped Time & Materials jobs. The client gets charged for the amount of time actually worked. Any amount of time up to the cap will be accepted. If more time is needed to complete a task, you’ll need to go back to the client to negotiate for more time/money. Occasionally you’ll need to do that, but usually the job will be completed under the cap.

As a consultant, you’re normally measured on your utilisation, and the amount you bill. So what’s the optimum amount of work to do for each job? Funnily enough, it is very close to the amount estimated – no matter what the estimate was. Maximise revenue & utilisation, while still doing the work under budget. There’s no incentive to do the job quicker.

Look at it from the perspective of two different consultants, Alice & Bob:

  • Alice is a diligent worker, who gets through her work as quickly as possible. Repeatable tasks are scripted. She doesn’t muck around.
  • Bob is a Continue reading

APIs Alone Aren’t Enough

Yes, we know: Your product has an API. Yawn. Sorry for not getting excited. That’s just table stakes now. What I’m interested in is the pre-written integrations and code you have that does useful things with that API.

Because sure, an API lets me integrate my various systems however I want. Theoretically. Just the same way that Bunnings probably sells me all the pieces I need to build a complete house.

Random aside: If your “open API” requires signing an NDA to view details, then maybe it’s not so open after all? 

If I’m running a small company staffed by developers, then just giving me an API is acceptable. But in a larger company, or one without developer resources, an API alone isn’t enough. I want to see standard, obvious integrations already available, and supported by the vendor.
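To make ‘just use the API’ concrete: a do-it-yourself alert integration starts as something like the sketch below, and then grows error handling, retries, deduplication, and ongoing maintenance whenever the API changes. (This assumes the shape of PagerDuty’s generic events endpoint from around that time – treat the URL and fields as illustrative, not gospel.)

curl -X POST https://events.pagerduty.com/generic/2010-04-15/create_event.json \
  -H "Content-Type: application/json" \
  -d '{"service_key": "YOUR-SERVICE-KEY", "event_type": "trigger", "description": "Alert from monitoring system"}'

Multiply that by every pair of tools you own, and the appeal of vendor-built, vendor-supported integrations becomes obvious.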

In this spirit, I’m very pleased to see that ThousandEyes now has a standard integration with PagerDuty:

ThousandEyes appears as a partner integration from which you can receive notifications; and, within ThousandEyes we now have a link to easily add alerts to your PagerDuty account.

You can read more at the ThousandEyes blog.

This is exactly the sort of obvious integration I Continue reading

Increased MTTR is Good?

In Episode 167 of The Cloudcast – “Bringing Advanced Analytics to DevOps”, Dave Hayes brings up an interesting point about Mean Time to Resolution (MTTR). At about 8:30 in, he states:

“In a counter-intuitive sense, you actually want this to be going up…If you’re removing false alerts, and you’re getting better about the quantity of alerts, you’re going to be solving far fewer, more difficult problems, so you should see a slight trend upwards in Mean Time to Resolution”

This is a really interesting way of looking at things. Obviously you don’t want to set your goal as “Increase our MTTR,” but this could be a positive side-effect of improved processes.
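To see why, consider a made-up example: a team resolves 90 false alarms a month at 10 minutes each, plus 10 real faults at 4 hours each. MTTR is (90×10 + 10×240)/100 = 33 minutes. Eliminate the false alarms, and MTTR jumps to 240 minutes – even though workload has dropped and service quality has improved.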

I recommend listening to the whole episode. PagerDuty is a very cool product in itself, but this is a broader discussion about operations, analytics, and best practices.

Subscribe to the podcast while you’re there, too – lots of interesting technology gets discussed.

Using Firewalls for Policy Has Been a Disaster

Almost every SDN vendor today talks about policy – how they make it easy to express and enforce network policies. Cisco ACI, VMware NSX, Nuage Networks, OpenStack Congress, etc. This all sounds fantastic. Who wouldn’t want a better, simpler way to get the network to apply the policies we want? But maybe it’s worth taking a look at how we manage policy today with firewalls, and why it doesn’t work.

In traditional networks, we’ve used firewalls as network policy enforcement points – they were the only practical place to do so. But…it’s been a disaster. The typical modern enterprise firewall has hundreds (or thousands) of overlapping, inconsistent rules, refers to decommissioned systems, and probably allows far more access than it should. New rules are almost always just added to the bottom, rather than worked into the existing framework – it’s just too hard to figure out otherwise.

Why have they been a disaster? Here are a few thoughts:

  • Traditional firewalls use IP addresses. But there’s no automated connection between server configuration/IP allocation and firewall policies. So as servers move around or get decommissioned, firewall policies don’t get automatically updated. You end up with many irrelevant objects and Continue reading
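You can see the decay pattern in any long-lived ruleset. A hypothetical example, in iptables terms:

# 2011: app01 (10.1.20.15) needs SQL Server access to db01 (10.1.30.40)
iptables -A FORWARD -s 10.1.20.15 -d 10.1.30.40 -p tcp --dport 1433 -j ACCEPT
# 2013: more app servers added; nobody dares touch the old rule, so a broader
# one goes in underneath - making the first rule redundant. And db01 was
# rebuilt on a new address last year anyway.
iptables -A FORWARD -s 10.1.20.0/24 -d 10.1.30.40 -p tcp --dport 1433 -j ACCEPT

Nothing here is automatically tied to the servers themselves, so the rules quietly rot.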