Refactoring TOSBack

When it came time to add a new feature to TOSBack, I realized it was going to be a lot more difficult than it should have been. My script was a messy group of methods passing data around instead of an OOP app with a DRY structure.

I refactored everything into a few nice classes, and now the code is much more organized.
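
Here's a rough sketch of the shape it took; the class and method names below are placeholders of mine, not necessarily the ones in the actual TOSBack code:

    # Hypothetical outline of the refactor (names are placeholders).
    class Rule
      attr_reader :name, :url, :xpath

      def initialize(name, url, xpath)
        @name, @url, @xpath = name, url, xpath
      end
    end

    class Crawler
      def initialize(agent)
        @agent = agent
      end

      # Fetch a rule's page and return just the policy markup.
      def fetch(rule)
        page = @agent.get(rule.url)
        rule.xpath ? page.search(rule.xpath).to_html : page.body
      end
    end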

Adding a Default Content Type to Mechanize Agent

So I found a few sites that would return the content of the entire page even after I updated the XPath info in the rule. I couldn’t figure out what was going on, so I copy/pasted my script into IRB (is there a better way? can you “require” it?) and found that when I ran the following:

    scrapedata = mech_agent.get("http://somebrokenurl.com/privacy")
    scrapedata.class

…My scrape data’s class would be “Mechanize::File” instead of “Mechanize::Page”. I found a couple of articles and made some changes to my agent to set a default content_type if none is returned.

    require 'mechanize'

    mech = Mechanize.new { |agent|
      agent.user_agent_alias = 'Mac FireFox'
      # If the server doesn't send a Content-Type, default to text/html so the
      # response is parsed as a Mechanize::Page instead of a Mechanize::File.
      agent.post_connect_hooks << lambda { |_, _, response, _|
        if response.content_type.nil? || response.content_type.empty?
          response.content_type = 'text/html'
        end
      }
      agent.ssl_version = 'SSLv3'
      agent.verify_mode = OpenSSL::SSL::VERIFY_NONE # less secure. Shouldn't matter for scraping.
      agent.agent.http.reuse_ssl_sessions = false
    }

Now, I’m able to use the “.search()” method on the Mechanize::Page to extract the policy using XPath.
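
For example, something like this works (the URL is the same placeholder as above, and the XPath is just illustrative, not one of our actual rules):

    scrapedata = mech.get("http://somebrokenurl.com/privacy")
    scrapedata.class                                   # => Mechanize::Page
    policy = scrapedata.search("//div[@id='policy']")  # XPath from the rule
    puts policy.to_html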

Fixing Blank Scrape Data

There is an error log that reports documents that can’t be opened and crawled. I had assumed that the URLs in this list just needed to be updated, and that my script must be getting a 404 error on the page. But when I started researching, some of the pages were opening fine in a browser.

I’ve found that the best way to troubleshoot these errors in Ruby is to open up IRB and break the script into smaller pieces. And Google. Lots of Google. It turned out that some of the sites we were trying to access would not respond unless my client sent HTTP headers the way a real browser does (a User-Agent, for example), so I decided to switch to the Mechanize library since it supports HTTP headers and sessions. Here’s what the current agent looks like in TOSBack:

    mech = Mechanize.new { |agent| 
      agent.user_agent_alias = 'Mac FireFox'
    }

You can get a list of user_agent_aliases on GitHub.
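
If you’d rather check from IRB, the aliases are also defined as a constant in the gem itself, so something like this should print them (assuming a reasonably recent Mechanize):

    require 'mechanize'
    puts Mechanize::AGENT_ALIASES.keys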

Adding XPath to TOSBack Rules

I quickly found out that I would have some problems with the old TOSBack rules. Some policies were showing up as being modified daily because the app was downloading the full page instead of just the policy on the page. The XML rules that TOSBack uses to know which policies to download only give us the URL and the policy name. In other words, any site with changing headers, related articles, local time, the current temperature in Nebraska (or anything else that might change!) will show up as modified in the app even if nothing about the policy itself changed.

So to get just the information I want, I added code to the script that reads XPath info from the rule file and uses it to extract just the policy from the page! Now our rule files are beginning to look something like this:

    <docname name="Privacy Policy">
      <url name="http://www.500px.com/privacy" xpath="//div[@id='terms']">
        <norecurse name="arbitrary"/>
      </url>
    </docname>

Now, I’ll see the policy appear as modified only if the policy changes!
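
To sketch how the script can use that attribute (the file path and variable names here are placeholders, not the actual TOSBack code), the idea is roughly:

    require 'mechanize'
    require 'nokogiri'

    # Read a rule file like the one above (placeholder path) and pull out
    # the URL and XPath attributes.
    rule     = Nokogiri::XML(File.read('rules/500px.com.xml'))
    url_node = rule.at_xpath('//url')
    url      = url_node['name']
    xpath    = url_node['xpath']

    # Fetch the page and keep only the policy element(s).
    mech   = Mechanize.new { |agent| agent.user_agent_alias = 'Mac FireFox' }
    page   = mech.get(url)
    policy = xpath ? page.search(xpath).to_html : page.body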

What is TOSBack?

TOSBack is an old EFF project that was designed to track changes in web policies, but it’s currently not in great shape. When I asked how I could help out with the TOS;DR project, they asked me if I could try to rewrite TOSBack. So I forked the old GitHub project and started working on my own version in Ruby.

Right now, the app is structured like this:

The “rules” folder contains around 1,000 different policies for us to monitor for changes. The “crawl” folder contains the policy data that we get from each page. We will use Git to see the changes that occur when a site releases a new version of a policy.
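
Roughly, that layout looks like this:

    rules/   # ~1,000 XML rule files, one for each policy we monitor
    crawl/   # the scraped policy text, committed to Git so diffs show what changed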

This will save us from having to read the entire policy over again when only one minor bullet point has changed, and should help people understand what they are agreeing to whenever they check the “I agree” box. :)

Hey! A blog!

Hey! I’m Jimm. :)

I hope this blog will give you some idea of what I’m working on while I’m traveling. I’ll mostly update this blog with developer/work snippets, and keep posting my photos over on Curious Focus. Both sites are new, but I’ll be working on getting some content up.

Maybe read this about me page? Or chat with me? I like to chat.