It finally happened. Today I received in the mail a USB stick containing a dump of the National Classification Database. I haven’t looked into the data in much detail yet, but it contains more fields than I had expected, such as RC information and other odd fields, which is just great.
This concludes Part A of my freedom of information request, which was lodged on September 17, 2012. It has now taken 74 days to run one SQL query and send me a USB stick containing a 400MB file. I thank the taxpayers for their gift of a 16GB USB stick to carry this file, and I thank the officer in charge of this request for sending me the information for free.
So, I’m at it again. This time, I’ve requested a heavily redacted copy of the TPP, and any documents relating to the positions taken by the DFAT delegation.
The purpose of asking for such a heavily redacted document is simple:
- They have made it very clear they will not give us a full negotiating document, ever
- Redacted means less time spent consulting on which parts we can maybe see, so less chance of an “unreasonable diversion of resources” excuse
- We can hopefully at least see what bureaucrats are doing in our name
There has been a tentative success in the battle for a dump of the National Classification Database. It seems that the AGD has had a change of heart regarding my request for a dump of the database without cost!
For context, see my previous post on difficulties faced attempting to get machine-readable classification data.
Last night was a fun night indeed for the classification database scraper. It had basically completed earlier that morning; however, upon closer inspection, a few logic errors and regular expression errors had left some of the data a bit funky.
Luckily, none of the errors were bad enough to break everything, so I simply wrote a verification method and ran it. If only it were that simple. Turns out computers don’t like dust, and the dust was so caked on that the PC kept shutting down from overheating while I was running the verification process.
Part of the way through my deliciously aggravating freedom of information request for the classification database, the AGD launched a ‘new’ website. The search functionality remains basically unchanged, except for one thing: we can now search by date range! This means I can reasonably scrape the website now.
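To show why a date range search makes scraping reasonable: the whole history can be cut into small windows, and each window becomes one bounded query. The sketch below only generates the windows. The function name `date_windows` and the seven-day default are assumptions for illustration, not anything taken from the site or from Clasrip2.

```python
from datetime import date, timedelta
from typing import Iterator, Tuple


def date_windows(start: date, end: date, days: int = 7) -> Iterator[Tuple[date, date]]:
    """Yield consecutive, non-overlapping (first, last) windows covering start..end inclusive.

    `days` is a hypothetical tuning knob: smaller windows mean fewer
    results per search, at the cost of more requests.
    """
    if days < 1:
        raise ValueError("days must be at least 1")
    one_day = timedelta(days=1)
    cursor = start
    while cursor <= end:
        # Clamp the final window so it never runs past `end`.
        last = min(cursor + timedelta(days=days) - one_day, end)
        yield cursor, last
        cursor = last + one_day


if __name__ == "__main__":
    for first, last in date_windows(date(2012, 1, 1), date(2012, 1, 31)):
        print(first.isoformat(), "to", last.isoformat())
```

Each pair would then be fed into the search form's date fields, whatever those turn out to be called.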
I started working on a new scraper right away, aptly named Clasrip2 (in loving memory of Clasrip whose life was cut so short by the destruction of the original search mechanism on the Classification website). Each page is abstracted out into different classes to manage the crazy, with a central Scraper class for handling the actual scraping process.
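A rough sketch of that shape, in Python purely for illustration (this is not the Clasrip2 source). The page class names, the URL paths and the HTML patterns are all invented here; the real site's markup will differ. The point is only that each page type knows how to parse itself, while the central `Scraper` decides what to fetch next.

```python
import re
from typing import Callable, Dict, List


class SearchResultsPage:
    """One page of search results. The link pattern is a made-up example."""

    LINK = re.compile(r'href="(/title/\d+)"')

    def __init__(self, html: str) -> None:
        self.html = html

    def title_paths(self) -> List[str]:
        # Return the path of every title linked from this results page.
        return self.LINK.findall(self.html)


class TitlePage:
    """One classification record. Assumes hypothetical <dt>/<dd> field markup."""

    FIELD = re.compile(r"<dt>(.*?)</dt>\s*<dd>(.*?)</dd>", re.S)

    def __init__(self, html: str) -> None:
        self.html = html

    def fields(self) -> Dict[str, str]:
        return {name.strip(): value.strip() for name, value in self.FIELD.findall(self.html)}


class Scraper:
    """Drives the scrape; `fetch` maps a path to HTML and is injected so it can be faked."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self.fetch = fetch

    def scrape(self, search_path: str) -> List[Dict[str, str]]:
        results = SearchResultsPage(self.fetch(search_path))
        return [TitlePage(self.fetch(path)).fields() for path in results.title_paths()]


if __name__ == "__main__":
    # Canned pages stand in for the website, so the sketch runs offline.
    fake_site = {
        "/search": '<a href="/title/1">A</a> <a href="/title/2">B</a>',
        "/title/1": "<dt>Title</dt><dd>Foo</dd><dt>Rating</dt> <dd>RC</dd>",
        "/title/2": "<dt>Title</dt><dd>Bar</dd>",
    }
    print(Scraper(fake_site.__getitem__).scrape("/search"))
```

Passing `fetch` in as a plain function keeps the parsing classes testable without touching the network, which also makes it easier to re-run a verification pass over saved pages.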
This is the general process I used to reverse engineer the website and write the scraper.