Over time, I’ve added some features to TOSBack that you access by passing arguments to the script. The quickest way to test new rules or XPath info is to run the script like this:

    $ ruby tosback.rb ../rules/abercrombie.com.xml

Instead of writing the policy to a file, the script prints it to the screen for verification. If… Read more »
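As a rough illustration of the argument handling described above, here is a minimal sketch. It is not taken from tosback.rb: the method name `run_mode` and the mode symbols are hypothetical, and the behavior when no argument is given (a full crawl that writes to file) is an assumption.

```ruby
# Hypothetical sketch, not the actual tosback.rb code.
# Decide what the script should do based on its command-line arguments.
def run_mode(args)
  rule_path = args.first
  if rule_path
    # A rule file was passed: scrape only that rule and print the policy.
    { mode: :preview, rule: rule_path }
  else
    # Assumption: with no arguments, crawl every rule and write to file.
    { mode: :crawl_all, rule: nil }
  end
end
```

Calling `run_mode(ARGV)` at the top of the script would then pick the preview path whenever a rule file such as `../rules/abercrombie.com.xml` is given.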
Monthly Archives: November 2012
Searching Crawl Data for Empty Files
I needed a way to scan the crawl data programmatically and determine if there were blank policies that I didn’t know about. With around 1000 rules in the TOSBack app, I need some ways to double check the data that comes back from the web scraping. I decided to set up a class method that… Read more »
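One way such a scan could look is sketched below. This is not the class method from the post: the helper name `blank_policy_files` is hypothetical, and it assumes the crawl data is stored as plain files under a single directory tree.

```ruby
# Hypothetical helper, not the TOSBack implementation.
# Returns the paths of all files under `dir` that are empty or
# contain only whitespace, sorted alphabetically.
def blank_policy_files(dir)
  Dir.glob(File.join(dir, "**", "*")).select do |path|
    # binread avoids encoding errors on files that are not valid UTF-8.
    File.file?(path) && File.binread(path).strip.empty?
  end.sort
end
```

Running it against the crawl directory returns a list that can be compared with the rule set to find policies that came back blank.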
Refactoring TOSBack
When it came time to add a new feature to TOSBack, I realized that it was going to be a lot more difficult than it should be. My script was a messy group of methods passing data all around instead of an OOP app with a DRY structure. I refactored everything into a few nice… Read more »
Adding a Default Content Type to Mechanize Agent
So I found a few sites that would return the content of the entire page even if I updated the XPath info in the rule. I couldn’t figure out what was going on so I copy/pasted my script into IRB (is there a better way? can you “require” it?), and I found that when I… Read more »
Fixing Blank Scrape Data
There is an error log that reports documents that couldn’t be opened and crawled. I had assumed that the URLs in this list just needed to be updated, and that my script must be getting a 404 error on the page. But when I started researching, some of the pages were opening fine… Read more »