Monthly Archives: November 2012

Passing Arguments to TOSBack

Over time, I’ve added some features to TOSBack that you access by passing arguments to the script. The quickest way to test out new rules/XPath info is to run the script like this:

$ ruby tosback.rb ../rules/abercrombie.com.xml

Instead of writing the policy to file, it will just print it on the screen for verification. If… Read more »
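As a rough illustration of the idea, here is a minimal sketch of how a script might switch behavior depending on whether a rule file is passed on the command line. The names `parse_mode` and `describe` and the mode symbols are hypothetical, made up for this example; they are not taken from tosback.rb.

```ruby
# Hypothetical sketch, not TOSBack's actual code.
# Decide what to do based on the arguments passed to the script.
def parse_mode(argv)
  rule_path = argv.first
  if rule_path.nil?
    # No argument: assume the normal run over every rule.
    { mode: :crawl_all, rule: nil }
  else
    # One rule file given: test just that rule.
    { mode: :test_rule, rule: rule_path }
  end
end

# Turn the parsed options into a human-readable description.
def describe(options)
  case options[:mode]
  when :test_rule then "Print policy for #{options[:rule]} to the screen"
  else "Crawl all rules and write policies to file"
  end
end

puts describe(parse_mode(ARGV)) if __FILE__ == $PROGRAM_NAME
```

Keeping the argument handling in a small method that takes an array, rather than reading `ARGV` directly everywhere, makes it easy to try out in IRB or in a test.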

Searching Crawl Data for Empty Files

I needed a way to scan the crawl data programmatically and determine if there were blank policies that I didn’t know about. With around 1000 rules in the TOSBack app, I need some ways to double-check the data that comes back from the web scraping. I decided to set up a class method that… Read more »
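One way such a check could look is sketched below. This is only an assumption about the general approach: the class name `CrawlData`, the method name `find_empty`, and the idea that the crawl data lives as plain files under a single directory are all hypothetical, not details from the TOSBack source.

```ruby
require "find"

# Hypothetical sketch, not TOSBack's actual class.
class CrawlData
  # Return the sorted paths of regular files under dir that are
  # empty or contain only whitespace.
  def self.find_empty(dir)
    Find.find(dir).select do |path|
      # binread avoids encoding errors on files with odd bytes.
      File.file?(path) && File.binread(path).strip.empty?
    end.sort
  end
end
```

Calling `CrawlData.find_empty("path/to/crawl")` would return an array of paths that you could print or log for review. Treating whitespace-only files as blank is a design choice here; a stricter check could use `File.zero?` instead.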

Refactoring TOSBack

When it came time to add a new feature to TOSBack, I realized that it was going to be a lot more difficult than it should be. My script was a messy group of methods passing data all around instead of an OOP app with a DRY structure. I refactored everything into a few nice… Read more »

Adding a Default Content Type to Mechanize Agent

So I found a few sites that would return the content of the entire page even if I updated the XPath info in the rule. I couldn’t figure out what was going on so I copy/pasted my script into IRB (is there a better way? can you “require” it?), and I found that when I… Read more »

Fixing Blank Scrape Data

There is an error log that reports documents that could not be opened and crawled. I had assumed that the URLs in this list just needed to be updated, and that my script must be getting a 404 error on the page. But when I started researching, some of the pages were opening fine… Read more »