Posts Tagged: ToSBack

Adding a Default Content Type to Mechanize Agent

So I found a few sites that would return the content of the entire page even if I updated the XPath info in the rule. I couldn’t figure out what was going on, so I copied and pasted my script into IRB (is there a better way? can you “require” it?), and I found that when I… Read more »

Fixing Blank Scrape Data

There is an error log that reports documents that couldn’t be opened and crawled. I had assumed that the URLs in this list just needed to be updated, and that my script must be getting a 404 error on the page. But when I started researching, some of the pages were opening fine… Read more »

Adding XPath to TOSBack Rules

I quickly found out that I would have some problems with the old TOSBack rules. Some policies were showing up as being modified daily because the app was downloading the full page instead of just the policy on the page. The XML rules that TOSBack uses to know which policies to download only give us… Read more »

What is TOSBack?

TOSBack is an old EFF project that was designed to track changes in web policies, but it’s currently not in great shape. When I asked how I could help out with the TOS;DR project, they asked me if I could try to rewrite TOSBack. So I forked the old GitHub project and started working on… Read more »