Yesterday I was investigating the OpenStreetMap elevation tag, when I was surprised to find that the third most common value is ’0.00000000000′! Now I have my suspicions about any value below ten, but here are 13,832 features in OpenStreetMap that have their elevation mapped to within 10 picometres – roughly one sixth of the diameter of a helium atom – of sea level. Seems unlikely, to be blunt.
It could of course be hundreds of hyper-accurate volunteer mappers, but I immediately suspect an import. Given the spurious accuracy and the tendancy to cluster around sea level, I also suspect it’s a broken import where these picometre-accurate readings are more likely to mean “we don’t know” than “exactly at sea level”. Curious, I spent a few minutes using the overpass API and found an example – Almonesson Lake. The NHD prefix on the tags suggests it came from the National Hydrography Dataset and so, as supected, not real people with ultra-accurate micrometres.
But what concerns me most is when I had a quick look at the data layer for that lake – it turns out that there are three separate and overlapping lakes! We have the NHD import from June 2011. We have an “ArcGIS Exporter lake” from October that year, both of which simply ignore the original lake created way back in Feb 2009, almost 5 years ago. There’s no point in having 3 slightly different lakes, and if anyone were to try to fix a misspelling, add an attribute or tweak the outline they would have an unexpectedly difficult task. There is, sadly, a continual stream of imports that are often poorly executed and, worse, rarely revisted and fixed, and this is just one case among many.
Mistakes are made, of course, but it’s clear that data problems like these aren’t being noticed and/or fixed in a reasonable timescale – even just this one small example throws up a score of problems that all need to be addressed. Most of my own editing is now focussed on tending and fixing the data that we already have, rather than adding new information from surveys. And as the size of the OpenStreetMap dataset increases, along with the seemingly perpetual and often troublesome importing of huge numbers of features, OpenStreetMap will need to adjust to the increasing priority for such data-gardening.
If you want some to give out to potential new mapper recruits, you can order the OpenStreetMap leaflets and I’ll post them to you.
A little over a year ago I was plugging through setting up another OpenCycleMap server. I knew what needed installing, and I’d done it many times before, but I suspected that there was a better way than having a terminal open in one screen and my trusty installation notes in the other.
Previously I’d taken a copy of my notes, and tried reworking them into something resembling an automated installation script. I got it to the point where I could work through my notes line-by-line, pasting most of them into the terminal and checking the output, with the occasional note requiring actual typing (typically when I was editing configuration files). But to transform the notes into a robust hands-off script would have been a huge amount of work – probably involving far too many calls to sed and grep – and making everything work when it’s re-run or when I change the script a bit would be hard. I suspected that I would be re-inventing a wheel – but I didn’t know which wheel!
The first thing was to figure out some jargon – what’s the name of this particular wheel? Turns out that it’s known as “configuration management“. The main principle is to write code to describe the server setup, rather than running commands. That twigged with me straight away – every time I was adding more software to the OpenCycleMap servers I had this sinking feeling that I’d need to type the same stuff in over and over on different servers – I’d prefer to write some code once, and run that code over and over instead. The code also needs to be idempotent – i.e. it doesn’t matter how many times you run the code, the end result is the same. That’s about the sum of what configuration management entails.
There’s a few open-source options for configuration management, but one in particular caught my eye. Opscode’s Chef is ruby-based, which works for me since I do a fair amount of ruby development and it’s a language that I enjoy working with. And chef is also what the OpenStreetMap sysadmins use to configure their servers, so having people around who use the same system would simply be a bonus.
What started off as a few days effort turned into a massive multi-week project as I learned chef for the first time, and plugged through creating cookbooks for all the components of my server. It was a massive task and took much longer than I’d initially expected, but 18 months on it was clearly worth it – I’d have never been able to run enough servers for all the styles I have now, nor been able to keep up with the upgrades to the software and hardware without it. It’s awesome.
So here’s some tips, for those who have their own servers and are in a similar position to what I was.
- How many servers before it’s worth it? Configuration management really kicks in to its own when you have dozens of servers, but how few are too few to be worth the hassle? It’s a tough one. Nowadays I’d say if you have only one server it’s still worth it – just – since one server really means three, right? The one you’re running, the VM on your laptop that you’re messing around with for the next big software upgrade, and the next one you haven’t installed yet. If you’re running a server with anything remotely important on it, then having some chef-scripts to get a replacement up and running if the first goes up in smoke is a really good time-critical aid when you need it most.
- How do you get started with chef? Well, it’s tough, the learning curve is like a cliff. Chef setups have three main parts – the server(s) you’re setting up (the “node”), the machine you’re pressing keys on (the “workstation”) and the confusingly-named “chef server” which is where “nodes” grab their scripts (“cookbooks”) from. It makes sense to cut down the learning, so I’d recommend using the free 5-node trial of their Hosted Chef offering. That way you only need to concentrate on the nodes and workstation setup at first – and when you run out of nodes, there’s always the open-source chef-server if the platform is too expensive.
- Which recipes should I use? There are loads available on github, and there’s links all over the chef website. In general, I recommend avoiding them, at least at first. Like I mentioned, the learning curve is cliff-like and while you can do super-complex whizz-bang stuff with chef, the public recipes are almost all vastly overcomplicated, and more importantly, hard to learn from. Start out writing your own – mine were little more than a list of packages to install at first. Then I started adding in some templates, a few scripts resources here and there, and built up from there as I learned new features. Make sure your chef repository is in git, and that you’re committing your cookbook changes as you go along
- Where’s the documentation? I’d recommend following the tutorial to get things all set up, while trying not to worry too much about the details. Then start writing recipes. For that, the resources page on the wiki tells you everything you need to know – start with the package resource, then the template resource, then on to the rest. There’s a whole bunch of stuff that you won’t need for a long time – attributes, tags, searches – so don’t try learning everything in one go.
I’ll be writing more about developing and testing cookbooks in the future – it’s a whole subject in itself!