Crawling/mirroring Browser-Generated Web Pages

tl;dr

When parts of a site are generated on the fly by JavaScript in the browser, they can’t be directly archived. Augmenting a simple crawler with an easy-to-use browser automation subsystem (Selenium) makes this very straightforward.

Background

If you got past the title of this post, chances are you’ve executed a command something like this:

wget --mirror --convert-links --adjust-extension --page-requisites \
     --no-parent http://example.org

This crawls the Web site and creates a local copy of the content, so it can be served much like the original remote material.

It’s also likely that you’ve dealt with a site whose pages are largely generated by JavaScript in the browser. This is a fine way to build attractive and responsive pages, but it has a major problem: an HTTP client, like wget or a search engine robot, won’t see the material in the same way as it appears in the browser. In the extreme, without a browser running the JavaScript, there is no site. If, say, the content is provided by calls to another service, the wget above will just give you an entry page, the CSS and the JS. OK, so if you place that material on another Web server and open it in a browser, it may well give you the same views as the original. But there are various times when this might not be enough.

A search engine has to be able to see the content in order to index it. If the back end of the system is going to change, it won’t be possible to make an archive of the populated pages later. And a cache is of no use if it doesn’t contain the material it’s meant to be caching.

Many years ago AaronSw wrote Bake, Don’t Fry, in which he argues for the advantages of a static [1], Baked, filesystem-served site over pages being Fried, i.e. dynamically generated from a database.

[1] For the pedants I’ll note that ‘static’ isn’t an absolutely accurate term; the filesystem is acting as a database for the pages. But filesystems are typically a layer or two further down the software stack than e.g. a MySQL DB, and files are so familiar that they might as well be written in stone.

My Scenario

I’ve just encountered this situation: I have a site (a FooWiki instance, actually running on a home network server) with pages that are assembled on the fly from JavaScript calls to a remote SPARQL server. I want to crawl this site to get all the full pages and then push the result to a remote server (github.io) to serve as a static site.

A Solution

A Web crawler typically maintains a list of URLs over which to operate, and contains a few core functional units that are used in sequence over each page (sketched in code below the list):

  1. page getter – retrieve a representation of a URL in the list from the Web
  2. link extractor – parse/scrape a given page, pull out the URLs, add them to the list
  3. URL filter – typically only pages within a particular domain or path will be required
  4. URL-to-filename translator
  5. page saver – dump the page, with converted links, to the local filesystem
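
To make the shape concrete, here is a minimal, hypothetical sketch of that loop for static pages (Python 3 for brevity; the function names, the naive regex link extractor and the seed URL are all mine, purely for illustration, and it skips the link conversion in step 5):

import os
import re
import urllib.request
from urllib.parse import urljoin, urlparse

START = "http://example.org/"            # hypothetical seed URL
OUT_DIR = "mirror"                       # where the local copy gets written
HOST = urlparse(START).netloc

def get_page(url):                       # 1. page getter
    return urllib.request.urlopen(url).read().decode("utf-8", "replace")

def extract_links(base_url, html):       # 2. link extractor (naive scrape)
    return [urljoin(base_url, href)
            for href in re.findall(r'href="([^"]+)"', html)]

def wanted(url):                         # 3. URL filter: stay on one host
    return urlparse(url).netloc == HOST

def to_filename(url):                    # 4. URL-to-filename translator
    path = urlparse(url).path.strip("/") or "index"
    return os.path.join(OUT_DIR, path + ".html")

def save_page(url, html):                # 5. page saver
    filename = to_filename(url)
    os.makedirs(os.path.dirname(filename), exist_ok=True)
    with open(filename, "w") as f:
        f.write(html)

todo, seen = [START], set()
while todo:
    url = todo.pop()
    if url in seen or not wanted(url):
        continue
    seen.add(url)
    html = get_page(url)
    save_page(url, html)
    todo.extend(extract_links(url, html))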

Incidentally, I’ve written quite a few crawlers in a variety of languages over the years (if you’re involved in Web coding, an archetypal pattern for learning a language probably starts something like: “Hello World!”, TODO List Manager, Web Crawler, Blog Engine…). So I can confidently say the trickiest part is the matching and translating of the URLs; it can get messy.

The list above will work fine over static pages, but where it falls down on JavaScript-generated pages is at step 2. If the scripts in the page aren’t run, some or all of the links, and hence pages, will be missed. So an engine is needed to run the scripts, ideally something very close to a browser in the results it produces.

I had a look around at the various options available, and the one that looked easiest was Selenium. From the site:

What is Selenium?

Selenium automates browsers. That’s it! What you do with that power is entirely up to you. Primarily, it is for automating web applications for testing purposes, but is certainly not limited to just that. Boring web-based administration tasks can (and should!) be automated as well.

It’s a sizeable framework which can run on Windows, Linux, and OS X. It has a DSL and bindings for a host of different languages. It even has an IDE! (A Firefox plugin). Apache 2 license.

But only a small (though significant) part of the framework is actually needed here, and the work involved is minimal. I’ve been using a lot of Python recently so that’s what I went for; most other popular languages are supported too. I did find a snippet that came close to what I needed (alas, I lost the link). The key bits of code are just:

from selenium import webdriver
...
# start a Firefox instance under Selenium's control
driver = webdriver.Firefox()
...
# retrieve the page, letting the browser run its JavaScript
driver.get(page)
...
# extract the links from the rendered DOM
elements = driver.find_elements_by_xpath("//a")
links = [element.get_attribute("href") for element in elements]
...
# the page source as the browser now sees it, scripts and all
content = driver.page_source
...
driver.quit()

That’s it!

It is necessary to put a suitable driver on the system path. I used geckodriver, and as I already had Firefox installed, for me it was a simple matter of copying the driver file to /usr/local/bin.
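
Plugged into the crawler skeleton sketched earlier, those calls simply replace the page getter and link extractor; a rough sketch (again, the function names are mine, not from Clonio):

from selenium import webdriver

driver = webdriver.Firefox()             # a real browser, so the JavaScript runs

def get_page(url):                       # 1. page getter, browser edition
    driver.get(url)
    return driver.page_source            # the DOM serialised after the scripts ran

def extract_links(base_url, html):       # 2. link extractor, from the live DOM
    hrefs = [a.get_attribute("href")
             for a in driver.find_elements_by_xpath("//a")]
    return [h for h in hrefs if h]       # drop anchors without an href

# ... run the same crawl loop as before, then:
driver.quit()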

Beware!

I wasted a few hours with the script: the links were getting garbled, but I couldn’t see why. I tried loads of things, even starting a blog post in the hope that it would clear my thoughts… At some point I deleted one special-case link from the site I was crawling, and soon after noticed it was still showing up in the automated Firefox. D’oh! Firefox was caching.

Unfortunately there’s no way to turn this off programmatically (yet?), but it is straightforward to do by creating a custom Firefox profile.

First locate a usable starter profile. If you cd to ~/.mozilla/firefox/ and open profiles.ini there will be lines like:

[Profile0]
Name=default
IsRelative=1
Path=70abdmdv.default
Default=1

That’s the only profile on this machine (I generally use Chrome), so hopefully most of the settings I want will already be the defaults. I copy the whole directory:

cp -r 70abdmdv.default profile.Selenium

And edit the new profile:

nano profile.Selenium/prefs.js
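
From here the remaining step is presumably just pointing the webdriver at that profile. A minimal sketch of how that is typically done with the Python bindings (the profile path below is an assumption, adjust it to wherever your copy lives):

from selenium import webdriver

# the path is an assumption: use wherever your profile.Selenium directory is
PROFILE_DIR = "/home/user/.mozilla/firefox/profile.Selenium"

profile = webdriver.FirefoxProfile(PROFILE_DIR)
driver = webdriver.Firefox(firefox_profile=profile)

# crawl as before; the cache settings edited in prefs.js travel with
# the profile, so stale copies of pages shouldn't reappear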

PS.

Ew! It looks like I lost a bit of this write-up (or got bored).

Anyhow, the good news is I’ve got some code basically working, so I gave it a name and popped it in a GitHub repo: Clonio.

It’s written in Python (2.*). Right now the configuration is just a few lines at the top of the file, which should be self-explanatory.

As and when I have time, I’ll tidy it up, put together some proper docs (and maybe port it to Node as well).
