Presented the following slides to a small group of interested people the other night. Would love to hear your feedback. 🙂
Last night was a fun night indeed for the classification database scraper. It had basically completed earlier that morning; however, on closer inspection, a few logic errors and regular expression errors had left some of the data a bit funky.
Luckily, none of the errors were bad enough to break everything, so I simply wrote a verification method and ran it. If only it were that simple. Turns out computers don’t like dust, and the dust was packed in so tightly that the PC kept shutting down from overheating while I was running the verification process.
Part of the way through my deliciously aggravating freedom of information request for the classification database, the AGD launched a ‘new’ website. The search functionality remains basically unchanged, except for one thing: we can now search by date range! This means I can reasonably scrape the website now.
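To show why date-range search makes scraping practical, here is a rough Python sketch (not the actual Clasrip2 code) that splits a long period into small windows, each of which could be submitted as one search. The function name `date_windows` and the seven-day default are my own assumptions for illustration; the real site's parameters and result limits are not covered here.

```python
from datetime import date, timedelta


def date_windows(start, end, days=7):
    """Yield (window_start, window_end) pairs covering start..end inclusive.

    Hypothetical helper: each pair is meant to be used as the "from" and
    "to" dates of a single date-range search.
    """
    if days < 1:
        raise ValueError("days must be at least 1")
    current = start
    while current <= end:
        # Clamp the last window so it never runs past the requested end date.
        window_end = min(current + timedelta(days=days - 1), end)
        yield current, window_end
        current = window_end + timedelta(days=1)


if __name__ == "__main__":
    for window_start, window_end in date_windows(date(2013, 1, 1), date(2013, 1, 31)):
        print(window_start.isoformat(), "to", window_end.isoformat())
```

Keeping each window small means no single search has to return an unmanageable number of results, and a failed window can be retried on its own.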
I started working on a new scraper right away, aptly named Clasrip2 (in loving memory of Clasrip whose life was cut so short by the destruction of the original search mechanism on the Classification website). Each page is abstracted out into different classes to manage the crazy, with a central Scraper class for handling the actual scraping process.
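As a rough idea of what "one class per page plus a central Scraper" can look like, here is a minimal Python sketch. It is not the Clasrip2 source: the class names `SearchResultsPage` and `TitlePage`, the regular expressions, and the HTML they expect are all made up for illustration, and the fetch function is passed in so the sketch runs without touching the real website.

```python
import re


class SearchResultsPage:
    """Hypothetical wrapper around one page of search results."""

    # Assumed markup: each result links to a path like /title/123.
    LINK_PATTERN = re.compile(r'href="(/title/\d+)"')

    def __init__(self, html):
        self.html = html

    def title_links(self):
        return self.LINK_PATTERN.findall(self.html)


class TitlePage:
    """Hypothetical wrapper around a single title's detail page."""

    # Assumed markup: the title's name sits in the first <h1>.
    NAME_PATTERN = re.compile(r"<h1>(.*?)</h1>", re.DOTALL)

    def __init__(self, html):
        self.html = html

    def name(self):
        match = self.NAME_PATTERN.search(self.html)
        return match.group(1).strip() if match else None


class Scraper:
    """Central class: knows the order of requests, not the page layout."""

    def __init__(self, fetch):
        # fetch is any callable that takes a URL or path and returns HTML.
        self.fetch = fetch

    def scrape(self, search_url):
        results = SearchResultsPage(self.fetch(search_url))
        return [TitlePage(self.fetch(link)).name() for link in results.title_links()]


if __name__ == "__main__":
    fake_site = {
        "/search": '<a href="/title/1">One</a> <a href="/title/2">Two</a>',
        "/title/1": "<h1> First Film </h1>",
        "/title/2": "<h1>Second Film</h1>",
    }
    print(Scraper(fake_site.__getitem__).scrape("/search"))
```

The point of the split is that when the site's markup changes, only the affected page class needs fixing, while the Scraper's request flow stays the same.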
This is the general process I used to reverse engineer the website and write the scraper.