Fixing Blank Scrape Data

There is an error log that reports documents that couldn’t be opened and crawled. I had assumed that the URLs in this list just needed to be updated, and that my script was getting a 404 error on those pages. But when I started researching, some of the pages opened fine in a browser.

I’ve found that the best way to troubleshoot these errors in Ruby is to open up IRB and break the script into smaller pieces. And Google. Lots of Google. It turned out that some of the sites we were trying to access would not respond unless my client sent HTTP headers such as a User-Agent, so I decided to switch to the Mechanize library, since it supports custom HTTP headers and sessions. Here’s what the current agent looks like in TOSBack:

    require 'mechanize'

    # Send a browser-like User-Agent header so that picky sites respond.
    mech = Mechanize.new { |agent|
      agent.user_agent_alias = 'Mac FireFox'
    }

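You can find the list of available user_agent_alias values in the Mechanize source on GitHub. The same list is available in IRB as Mechanize::AGENT_ALIASES.keys, which is handy because the exact alias names can differ between Mechanize versions.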
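
If you want to confirm in IRB that a missing header is really the cause before switching libraries, a rough sketch using only Ruby’s standard library might look like the following. The helper names build_request and fetch_status are made up for this example and are not part of TOSBack; the idea is just to request the same URL with and without a browser-like User-Agent and compare the status codes.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper (not from TOSBack): build a GET request and,
# if a user agent string is given, override the User-Agent header.
def build_request(url, user_agent = nil)
  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri)
  request['User-Agent'] = user_agent if user_agent
  request
end

# Hypothetical helper (not from TOSBack): send the request and
# return the HTTP status code as a string, e.g. "200" or "404".
def fetch_status(url, user_agent = nil)
  uri = URI.parse(url)
  request = build_request(url, user_agent)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request).code
  end
end
```

In IRB, call fetch_status once with just a URL from the error log and once more with a browser-style user agent string as the second argument. If the two status codes differ, the site is probably filtering on that header.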
