There is an error log that reports documents that can't be opened and crawled. I had assumed that the URLs in this list just needed to be updated, and that my script was getting a 404 error on those pages. But when I started investigating, some of the pages opened fine in a browser.
I’ve found that the best way to troubleshoot these errors in Ruby is to open up IRB and break the script into smaller pieces. And Google. Lots of Google. It turned out that some of the sites we were trying to access would not respond unless my client sent HTTP header information, so I decided to switch to the Mechanize library, which supports HTTP headers and sessions. Here’s what the current agent looks like in TOSBack:
require 'mechanize'
mech = Mechanize.new { |agent| agent.user_agent_alias = 'Mac FireFox' }
You can find the full list of user_agent_aliases on GitHub.