Posts Tagged: ruby

Version Callbacks for ToSBack Policies

Our ToSBack policies now have some automatic versioning when the “detail” attribute is changed! Here’s an example policy in my development environment. Its current version is stored as an attribute in the policy model (detail), but it’s also represented in the versions model: 1.9.3-p327 :001 > pol = Policy.first 1.9.3-p327 :009 > pol.detail => "… Read more »

Troubleshooting ToSBack File Handling

For the past few days, ToSBack has been running so smoothly! Every day, I scroll through the latest “Crawl” commit to see what’s new and to decide which rules need to be updated. And every day this week, I’ve been pleasantly surprised to find that not many files are being modified! ~# tail -f rubytosback/tosback2/logs/run.log… Read more »

Adding a Method to Check for Previous Crawl Data

I added a new feature to the script that will hopefully make the TOSBack crawls more reliable! The app has had some intermittent problems downloading pages; in some cases, the crawl data would come back blank even if the document had downloaded properly before. I decided to write a couple of methods that would check for… Read more »
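As a rough illustration of the idea in this excerpt, here is a minimal sketch of guarding against a blank download when an earlier crawl already produced content. The method names `previous_crawl_exists?` and `save_crawl` and the return symbols are hypothetical, not taken from the TOSBack script, and the real methods may check something different.

```ruby
# Hypothetical helper: true when a non-blank file from an earlier crawl
# already exists at `path`.
def previous_crawl_exists?(path)
  File.file?(path) && !File.binread(path).strip.empty?
end

# Hypothetical helper: writes `new_text` to `path`, unless the new text is
# blank and an earlier non-blank crawl is already on disk. In that case the
# old file is left untouched so a failed download does not erase good data.
def save_crawl(path, new_text)
  text = new_text.to_s
  return :kept_previous if text.strip.empty? && previous_crawl_exists?(path)

  File.write(path, text)
  :written
end
```

The returned symbol lets the caller log which documents came back blank, so they can be retried or investigated later.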

Passing Arguments to TOSBack

Over time, I’ve added some features to TOSBack that you access by passing arguments to the script. The quickest way to test out new rules/XPath info is to run the script like this: rubycode$ ruby tosback.rb ../rules/abercrombie.com.xml Instead of writing the policy to file, it will just print it on the screen for verification. If… Read more »

Searching Crawl Data for Empty Files

I needed a way to scan the crawl data programmatically and determine if there were blank policies that I didn’t know about. With around 1000 rules in the TOSBack app, I need some ways to double check the data that comes back from the web scraping. I decided to set up a class method that… Read more »
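One way a scan like this could look is sketched below. It assumes the crawl data is stored as plain files under a single directory; the class name `CrawlData` and the method `find_blank` are hypothetical and only illustrate the "class method" approach the excerpt mentions.

```ruby
require "find"

# Hypothetical class, used only to illustrate a class-method scan.
class CrawlData
  # Returns a sorted list of paths under `dir` whose files are empty or
  # contain nothing but whitespace.
  def self.find_blank(dir)
    Find.find(dir).select do |path|
      File.file?(path) && File.binread(path).strip.empty?
    end.sort
  end
end
```

Calling `CrawlData.find_blank` on the crawl directory returns the paths worth rechecking, which can then be matched back to the rules that produced them.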

Refactoring TOSBack

When it came time to add a new feature to TOSBack, I realized that it was going to be a lot more difficult than it should be. My script was a messy group of methods passing data all around instead of an OOP app with a DRY structure. I refactored everything into a few nice… Read more »

Adding a Default Content Type to Mechanize Agent

So I found a few sites that would return the content of the entire page even if I updated the XPath info in the rule. I couldn’t figure out what was going on so I copy/pasted my script into IRB (is there a better way? can you “require” it?), and I found that when I… Read more »

Fixing Blank Scrape Data

There is an error log that reports documents that can’t be opened and crawled. I had assumed that the URLs in this list just needed to be updated, and that my script must be getting a 404 error on the page. But when I started researching, some of the pages were opening fine… Read more »

What is TOSBack?

TOSBack is an old EFF project that was designed to track changes in web policies, but it’s currently not in great shape. When I asked how I could help out with the TOS;DR project, they asked me if I could try to rewrite TOSBack. So I forked the old GitHub project and started working on… Read more »