Crawling/mirroring Browser-Generated Web Pages


When parts of sites are generated on the fly by browser Javascript, they can’t be directly archived. Augmenting a simple crawler with an easy-to-use browser automation subsystem (Selenium) makes this very straightforward.


If you got past the title of this post, chances are you’ve executed a command something like this:

wget --mirror --convert-links --adjust-extension --page-requisites http://example.com/

This crawls the Web site and creates a local copy of the content, so it can be served much like the original remote material.

It’s also likely that you’ve dealt with a site that contains pages largely generated by Javascript in the browser. This is a fine way to build attractive and responsive pages, but it has a major problem: an HTTP client, like wget or a search engine robot, won’t see the material in the same way as it appears in the browser. In the extreme, without a browser running the Javascript, there is no site. If, say, the content is provided by calls to another service, the wget above will just give you an entry page, the CSS and the JS. Granted, if you place that material on another Web server and open it in a browser, it may well give you the same views as the original. But there are various times when this might not be enough.

A search engine depends on being able to see the content in order to index it. If the back end of the system is going to change, it won’t be possible to make an archive of the populated pages. And a cache is of no use if it doesn’t contain the material it’s meant to be caching.

Many years ago AaronSw wrote Bake, Don’t Fry, in which he argues the advantages of having a static [1], Baked, filesystem-served site over having the pages Fried, i.e. dynamically generated from a database.

[1] For the pedants I’ll note that ‘static’ isn’t an absolutely accurate term; the filesystem is acting as the database for the pages. But filesystems are typically a layer or two further down the software stack than, e.g., a MySQL DB, and files are so familiar that they might as well be written in stone.

My Scenario

I’ve just encountered this situation: I have a site (a FooWiki instance, actually running on a home network server) with pages that are assembled on the fly by Javascript calls to a remote SPARQL server. I want to crawl this site to get all the full pages and then push the result to a remote server, to serve as a static site.

A Solution

A Web crawler typically maintains a list of URLs over which to operate, and contains a few core functional units that are used in sequence over each page:

  1. page getter – retrieve a representation of a URL in the list from the Web
  2. link extractor – parse/scrape a given page, pull out the URLs, add them to the list
  3. URL filter – typically only pages within a particular domain or path will be required
  4. URL-to-filename translator
  5. page saver – dump the page, with converted links, to the local filesystem
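
The loop over these units can be sketched in a few lines. This is just an illustration of the shape of the thing, not Clonio itself: the page getter here is stubbed out with a tiny in-memory site so it runs as-is, and the link extractor is a crude regex scrape.

```python
import re
from urllib.parse import urljoin, urlparse

# In-memory stand-in for the page getter (1), so the loop is runnable as-is;
# a real crawler would fetch over HTTP (or via Selenium, as below).
FAKE_SITE = {
    "http://example.org/": '<a href="/page1">one</a> <a href="http://elsewhere.org/">out</a>',
    "http://example.org/page1": '<a href="/">home</a>',
}

def get_page(url):
    return FAKE_SITE.get(url, "")

def extract_links(base_url, html):
    # (2) crude regex scrape; resolve relative hrefs against the current page
    return [urljoin(base_url, href) for href in re.findall(r'href="([^"]+)"', html)]

def url_ok(url):
    # (3) stay within one domain
    return urlparse(url).netloc == "example.org"

def crawl(start):
    todo, seen, pages = [start], set(), {}
    while todo:
        url = todo.pop()
        if url in seen:
            continue
        seen.add(url)
        html = get_page(url)
        pages[url] = html  # (5) a real saver would translate links & write files
        todo.extend(u for u in extract_links(url, html) if url_ok(u))
    return pages

pages = crawl("http://example.org/")
```

The `seen` set is what stops the crawl looping forever when pages link back to each other.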

Incidentally, I’ve written quite a few crawlers in a variety of languages over the years (if you’re involved in Web coding, an archetypal pattern for learning a language probably starts something like: “Hello World!”, TODO List Manager, Web Crawler, Blog Engine…). So I can confidently say the trickiest part is the matching and translating of the URLs; it can get messy.
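
For what it’s worth, the URL-to-filename step (4 above) can start as simple as this sketch (Python 3 here; the mapping rules are my own choice, roughly what wget’s --adjust-extension does, not its exact algorithm):

```python
import os
from urllib.parse import urlparse

def url_to_filename(url, root="mirror"):
    """Map a URL to a local file path suitable for static serving."""
    parsed = urlparse(url)
    path = parsed.path
    if path == "" or path.endswith("/"):
        path += "index.html"            # directory-style URLs get an index file
    elif not os.path.splitext(path)[1]:
        path += ".html"                 # extensionless pages get a .html suffix
    return os.path.join(root, parsed.netloc, path.lstrip("/"))
```

So "http://example.org/wiki/Page" maps to "mirror/example.org/wiki/Page.html". The messiness arrives with query strings, fragments and URLs that differ only in encoding, which this sketch ignores.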

The list above will work fine over static pages, but where it falls down on Javascript-generated pages is at step 2. If the scripts in the page aren’t run, some or all of the links, and hence pages, will be missed. So an engine will be needed to run the scripts, ideally something very close to a browser in the results it produces.

I had a look around at the various options available, and the one that looked easiest was Selenium. From the site:

What is Selenium?

Selenium automates browsers. That’s it! What you do with that power is entirely up to you. Primarily, it is for automating web applications for testing purposes, but is certainly not limited to just that. Boring web-based administration tasks can (and should!) be automated as well.

It’s a sizeable framework which can run on Windows, Linux, and OS X. It has a DSL and bindings for a host of different languages. It even has an IDE! (A Firefox plugin). Apache 2 license.

But only a small (though significant) part of the framework is actually needed here. The work involved is minimal. I’ve been using a lot of Python recently so that’s what I went for – most other popular languages are supported. I did find a snippet that came close to what I needed (alas, lost the link). The key bits of code are just:

from selenium import webdriver

driver = webdriver.Firefox()
# retrieve the page (url being the page to crawl), running its Javascript
driver.get(url)
# extract the links from the rendered DOM
elements = driver.find_elements_by_xpath("//a")
links = [e.get_attribute("href") for e in elements]
# the fully-rendered HTML
content = driver.page_source

That’s it!

It is necessary to put a suitable driver on the system PATH. I used geckodriver, and as I already had Firefox installed, for me it was a simple matter of copying the driver file to /usr/local/bin.


I wasted a few hours with the script: the links were getting garbled, but I couldn’t see why. I tried loads of things – even starting a blog post in the hope that it would clear my thoughts… At some point I deleted one special-case link from the site I was crawling, and soon after noticed it was still showing up in the automated Firefox. D’oh! Firefox was caching.

Unfortunately there’s no way to turn this off programmatically (yet?), but it is straightforward to achieve by creating a custom Firefox profile.

First locate a usable starter profile. If you cd ~/.mozilla/firefox/ and open profiles.ini there will be lines like:
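
(The actual lines went missing from this write-up; reconstructing, a profiles.ini entry has this general shape, with the Path matching the directory copied below – the other fields may differ on your machine.)

```ini
[Profile0]
Name=default
IsRelative=1
Path=70abdmdv.default
Default=1
```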


That’s the only profile on this machine (I generally use the Chrome browser), so hopefully most of the settings I want will be the defaults. So I copy the whole directory:

cp -r 70abdmdv.default profile.Selenium

And edit the new profile:

nano profile.Selenium/prefs.js
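
The edit that presumably went here was to the cache-related preferences. These are the standard Firefox prefs for switching caching off (my reconstruction of the lost bit, so treat as a sketch):

```js
// Standard Firefox cache preferences - set to false so the browser
// (and hence Selenium) stops serving stale copies of crawled pages.
user_pref("browser.cache.disk.enable", false);
user_pref("browser.cache.memory.enable", false);
user_pref("browser.cache.offline.enable", false);
user_pref("network.http.use-cache", false);
```

The custom profile can then be handed to Selenium, e.g. webdriver.Firefox(firefox_profile=webdriver.FirefoxProfile("profile.Selenium")).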


Ew! It looks like I lost a bit of this write-up (or got bored).

Anyhow, the good news is I’ve got some code basically working, so I gave it a name and popped it in a GitHub repo: Clonio.

It’s written in Python (2.*). Right now the configuration is just a few lines at the top of the file, which should be self-explanatory.
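
That configuration is of this general shape – illustrative only, these aren’t necessarily Clonio’s actual variable names:

```python
# Illustrative only - a sketch of the kind of settings involved,
# not Clonio's actual variable names.
START_URL = "http://localhost:8080/index.html"  # entry page of the live site
URL_PREFIX = "http://localhost:8080/"           # only follow URLs under this
OUTPUT_DIR = "site-mirror"                      # where the static copy lands
WAIT_SECONDS = 2                                # let page Javascript finish
```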

As and when I have time, I’ll tidy it up, put together some proper docs (and maybe port it to Node as well).











Doing Nothing to Save Time

I noted recently with my ELFquake project that, at least with a slow brain like mine, there’s a definite efficiency trade-off between procrastinating and just Doing Stuff. In the time it took me to get around to implementation, research and plain old thought had led me to much better, quicker ways of doing things. I’ve just had another episode of this in a shorter timescale.

I’m in the process of setting up a local server on an old laptop, mostly running Web-based apps that in the past I’d have simply put on a remote host. I’ve been rubbish at getting the funds in to pay for hosting. But it’s occurred to me that virtually all the apps could be run locally, with the material being uploaded to a static host. (One exception being WebBeeps – that really needs to be live online; the archived site has the docs.)

A lot of the things I want to run are backed by a Fuseki SPARQL server. I thought a good place to start would be a Wiki I put together, FooWiki. I’ve recently discovered the joy of Docker, and there’s an image of Fuseki available. So I’ve also put together an image containing an nginx Web server and the pages & scripts needed by FooWiki (Dockerfiles here). Today I got that running (some minor bugs in FooWiki, but it’s basically working).
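
For flavour, the nginx side of such an image needs little more than this kind of Dockerfile (the paths here are illustrative, not the actual layout from the repo):

```dockerfile
# Serve the FooWiki static pages & scripts with stock nginx.
FROM nginx:alpine
COPY foowiki/ /usr/share/nginx/html/
EXPOSE 80
```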

Now I want to get the Wiki pages and upload them to GitHub. So this afternoon I made tweaked versions of the Wiki page rendering, leaving out all the editing bits, so it would be suitable for a totally static site. But there’s a snag – the pages are generated on the fly in the browser from the results of SPARQL queries. I’ve just spent another maybe 3 hours putting together a Selenium-based setup to crawl the Wiki, as rendered. And it’s just occurred to me that it would be much, much easier to process the SPARQL results directly – lose the browser entirely.
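
The browser-free approach looks something like this: SPARQL SELECT results come back in a standard JSON format, and static pages can be rendered straight from the bindings. A sketch – the result set is canned here in place of an HTTP call to Fuseki, and the page shape is illustrative, not FooWiki’s actual markup:

```python
import json

# A canned SPARQL SELECT result in the standard JSON results format,
# standing in for what would be fetched from the Fuseki endpoint.
SAMPLE = json.loads("""{
  "head": {"vars": ["title", "content"]},
  "results": {"bindings": [
    {"title": {"value": "HomePage"}, "content": {"value": "Hello"}}
  ]}
}""")

def render_page(binding):
    """Turn one result row into a static HTML page (illustrative markup)."""
    title = binding["title"]["value"]
    content = binding["content"]["value"]
    return "<html><head><title>%s</title></head><body>%s</body></html>" % (title, content)

pages = [render_page(b) for b in SAMPLE["results"]["bindings"]]
```

No browser, no Selenium, no cache to fight – just a query per page and a template.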

So I’ve spent 3 out of maybe 4 total hours on solving a coding problem, only to think of a different approach that will be much better and probably only take about half an hour. I should have watched some art documentaries instead.

Nah, it wasn’t wasted time. Making the necessary tweaks to the FooWiki code has made me familiar with it again, and I hadn’t played with Selenium before – I’ll probably want to use it again before long.


Position Statement

I do not fear AI or robots. I fear humans.
Within the next 30 years (again) there’s a very good chance of computer-generated creatures coming close to thinking like a small dog. Is all good. Another industrial revolution.
Look at those factory jobs robots are doing – brilliant. So everyone has to retrain as a computer programmer. Well, grab your wifi.
Bit stupid in that they are mostly making cars, but you have to start somewhere.
Things have changed massively in my lifetime, for the positive generally.
The AI vision stuff is brilliant, maybe we can get cars off the road.
Bit slow on getting politicians out of the way.
Plenty of human bastards around. They are the problem.
I don’t think we’ll see proper AI in the next 100 years, we are too tied to the ground, a reality that would take a century to learn, even given the raw material.
Predictions: Google will burst, there will be a nasty thing over oil before long. Things don’t usually go bang, whimper is the word.
I hope the humans will pick up on the detritus from the big politicians, am optimistic.

Mental Witches

Had a little personal epiphany just now, walking back from Castiglione (tobacco run…ok, wine too).
tldr If you’re bored, read on, if you’re happy clap your hands, or vice versa.
I’ve been home for a month, done nothing you’d call productive. Each week mostly two-part: half wine bender; half depressed couch potato.
Was putting it down to post-holiday malaise, and missing the girl. But a month is too long. And I was like this when I left. Different having Raven‘s energy, but without that, flatlining again.
But I somehow got to thinking about when I was in early teens, going to a nasty school, home life a bit of a mess. I was in a similar state then. Teenage hormonal cherchez la femme aside, I’d go to a disco or party at weekend, get wasted, lie in bed as much as possible during the week.
But then there was something that got me out of myself. The music bollocks.
Playing ‘Wild Thing’ on the guitar, very badly, but as loud as possible. Later this turned into bleepy stuff (also loud & discordant). Lost in music.
Until a few months ago, I did have a very comfortable home studio setup. Spent hours on end – nothing particularly imaginative – making noises. Until the PC died on me.
Nota bene: I had about a week’s stuff that I (at least) was pleased with, enjoying the process. Machine died. Now I do occasionally get angry, extremely rarely physical, verbal against a person maybe. But I just lost it, smashing a guitar down into pieces, and denting other bits.
Rational Danny tried to recuperate some time later, the guitar bits are in the workshop under epoxy supervision. Bought a new PSU and case. (Very likely the machine had given up the ghost thanks to electrical storms). But I haven’t had the will to put things together and try again.
A Catch-22 perhaps.
In the meantime, I can lose myself in any interesting hard/software project, earthquake prevention seemed reasonable. Lost the enthusiasm when I got home. For anything I enjoy, except bars.
FFS! A guitar and amp were enough for me when I was 14, I’m in a much, much better situation now.
I did come up with an approach that nearly worked for getting out of a shitty frame of mind. First tidy yourself up, then immediate surroundings, and/or dogs. Then expand from there.
Suppose I was missing my soul? The spirit? (As a metaphor, I accept happily).
Anyhow, I’ll sort out some ability to make noises this weekend.

Easter Island – a 21st Century Parable

Easter Island (Rapa Nui: Rapa Nui, Spanish: Isla de Pascua) is a Chilean island in the southeastern Pacific Ocean, at the southeasternmost point of the Polynesian Triangle. Easter Island is famous for its 887 extant monumental statues, called moai, created by the early Rapa Nui people.

A tiny green dot in an ocean of blue. People arrived there, it is reckoned, around 1000 AD. They got there by boat (canoe/catamaran) from one of the other islands of Polynesia, at least 2,600 km away (and probably, before that, from South America). These people knew how to use a boat. Whatever they brought with them, they developed a distinct, rich, highly industrious local culture. The society that developed was highly hierarchical, with class distinctions between a high chief, nine clan chiefs and then, presumably (not quite so well recorded), everyone else.

They thrived. It’s reckoned that the population got up to 15,000 around the 1500s, despite the island being only 163.6 km². The island was biologically diverse, notably with plenty of trees. However, they did experience ecological problems effectively beyond their control, because they brought the Polynesian rat with them. This put paid to a lot of the local vegetation.
Note that this was long before the first recorded arrival of Europeans (Jacob Roggeveen, Dutch, 1722, followed not long after by yon Yorkshireman Cook).

By the time the Europeans got there, the islanders were already in deep crisis; the population had declined to 2,000-3,000. When they were legion, the population had a voracious appetite for resources. They cut down trees (slash & burn, presumably) to make space for agriculture. Without restraint.

What first comes to mind when you or I think of Easter Island are the rows of huge stone heads called moai. For once, when archaeologists handwave about ‘ritual objects’, the role of these is reasonably well known. Tied into the ancestor-worship-based religion, the heads were those of noteworthies: chiefs and deities. Their blank staring eyes were originally bright with coral. One aspect of them that I only found out fairly recently, and which made me do a double-take, is that the Easter Island statues face inland.

So: the population exploded, resources such as trees were hacked down, resources got thin on the ground and the culture changed. It’s hard to know exactly what happened here, but the focus of the religion shifted from ancestor worship to a weird kind of bird veneration. Attenborough has suggested it was because of a particular bird (sorry, I forget its name) that had the power of staying in the air all the time.

They also got more warlike: battles, the overturning of rival groups’ statues. Food started running out. It has been suggested that cannibalism arrived.

Of course the Europeans introduced a few more ecological problems, but by that point this civilization was totally broken of its own accord.

They’d chopped down all the trees – the raw material for boats (and most mod cons). It seems reasonable to assume that human resources that might have been useful in the fields were reallocated to defence or attack.

After a migration from thousands of miles away, our cousins were now in total isolation. More immediately, their ability to fish was compromised, the range of local flora & fauna had suffered serious species extinction, and the whole ecology of the place had been compromised. Broken.

Even without the assistance of the European and his issues, they were doomed.

How stupid could people be?

So imagine you’re from a different planet somewhere out in the galaxy that, despite astronomical odds, happens to be at a similar level of tech & culture to Earth’s humans. Looking at the leading denizens of that tiny blue dot and what they get up to – how stupid could people be?

Shaving Horse

“Dobbins” (as Raven dubbed it).

I enjoy woodcarving using a mallet & gouge on the bench, but with that toolkit things tend towards the decorative. But there’s something very appealing about the bodger‘s approach, with minimal tools used outdoors on green wood, generally making functional products. A key tool there is the drawknife, ideally suited for many jobs, but regular bench clamping is really poorly suited to using one.

Around the Web are loads of pics of shaving horses, the bench/vice optimized for drawknives. Typically they’re a long low bench with a seat at one end, a rest for the work at the other and a foot-operated lever to hold the work down. I’ve got a fair pile of wood offcuts from various house projects, so decided I’d have a go at making one. I have promised to make a friend a wooden spoon, so this makes a great Yak Shaving exercise.

There are two general designs: the Continental dumbhead style, as in this engraving, and the possibly more recent English frame style (it isn’t illustrated prior to the 19th C, unlike the dumbhead, for which there are plenty of 15th C pics). I reckoned the frame style would probably be easier to build and also offer more control, so went English.

The design was arrived at by looking at pics of existing horses, then mostly doing it by eye, given the wood I had. I made the base a few weeks ago by measuring what felt like a comfortable height and guesstimating what kind of reach seemed about right. I angled the legs out about 15°, screwed and glued. At that point I made my first mistake: before gluing I screwed the legs on to check, and unscrewing them splintered off a bit of wood around the holes. It didn’t really affect the structure but was ugly, so after gluing I patched the holes up with filler, making it look worse… Hey ho. Appearance was way down the list of priorities.

Today I added the mechanism, again judging things by eye. The work surface is hinged off the base so:


This could well be another mistake. I didn’t expect the hinge to receive much force so only used one, but I may well have to add another.

Next I needed to figure out the clamp part. I decided to use a length of 12mm (?) threaded rod for the clamp pivot, so needed to figure out where to put that.




I’ve got an upright drill press but there was no way I could manoeuvre the thing into position, so I made a vertical guide hole in an offcut and used that to keep the hand drill vertical.



A bit more trial and error got the clamp frame together. The foot rest is screwed to the uprights, but allowing plenty of adjustment for the location of the pivot and top cross piece seemed a good idea.

When planning this thing I anticipated having to come up with a way of locking the riser wedge in position under the workpiece support. Overthinking: it seems a loose offcut and friction lock it just fine.

And so that’s it, basically done. I’ll give it a coat of varnish next, then have a play. First impressions are that it works!

Time to mount up.


(The hat incidentally was an xmas present from Raven’s mum, after she heard I was a fan of McLeod’s Daughters).

Our Area’s Longest Mural?

From London to Longoio (and Lucca and Beyond) Part Two

At first it was just a plain hoarding put around Borgo a Mozzano’s Istituto Comprensivo (an education centre which can comprise primary and secondary schools, technical colleges and, in Borgo’s case, a fine music school, the ‘M. Salotti’) to fence off a works area. The school needed important structural work done to bring it up to scratch with the latest anti-seismic regulations. Borgo a Mozzano is in seismic area no. 2, which means that quite strong earthquakes can occur (as they have: see my post at

Then the istituto’s pupils started painting the hoardings which stretch quite some way around the building yard.

Finally, yesterday the painting work was completed and the boring hoarding had metamorphosed into a very colourful and lively mural – perhaps the longest we’ve seen yet in our area.

Of course, there was a mastermind behind the scheme: Ilenia Rosati, born in Pisa…
