I added a new feature to the script that should make the TOSBack crawls more reliable! The app has had some intermittent problems downloading pages; in some cases, the crawl data would come back blank even though the document had downloaded properly before. I decided to write a couple of methods that check for previous crawl data and retry the scrape if there is an existing policy in the crawl folder.
Here’s what it looks like:
def scrape(checkprev=true) #see below
  download_full_page()
  if @newdata
    apply_xpath()
    strip_tags()
    format_newdata()
  elsif (!@newdata && (checkprev == true))
    check_prev()
  end
end #scrape
def check_prev
  path = "#{$results_path}#{@site}/#{@name}.txt"
  # A previous crawl only counts if the file exists and holds more than 32 bytes.
  # (File.exists? was removed in Ruby 3.2, so File.exist? is used here.)
  @has_prev = true if File.exist?(path) && File.size(path) > 32
end #check_prev
The “def scrape(checkprev=true)” signature gives “checkprev” a default value of true, which is used whenever no argument is passed to .scrape(). I implemented it this way so I wouldn’t have to change the existing calls, but I can still turn off the check for previous crawl data when I call .scrape() from the retry method:
def retry_docs
  @sites.each do |site|
    site.docs.each do |doc|
      doc.scrape(false) if doc.has_prev == true
    end #@docs
  end #@sites
end #retry_docs
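To see how the pieces fit together, here is a minimal, self-contained sketch of the same flow. It is not the real TOSBack code: the SketchDoc class, the canned "responses" array that stands in for download_full_page(), the MIN_PREV_BYTES constant, and the run_sketch helper are all made up for illustration, and the results folder is a temporary directory. It also assumes that has_prev is exposed through an attr_reader, which retry_docs needs but which isn't shown above.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical stand-in for a TOSBack document; the real class does much more.
class SketchDoc
  attr_reader :has_prev, :newdata, :attempts

  MIN_PREV_BYTES = 32 # same threshold that check_prev uses in the post

  def initialize(results_path, site, name, responses)
    @prev_path = File.join(results_path, site, "#{name}.txt")
    @responses = responses # canned download results, one per attempt
    @attempts = 0
    @has_prev = false
  end

  # checkprev defaults to true, so existing scrape() calls are unaffected.
  def scrape(checkprev = true)
    @attempts += 1
    @newdata = @responses.shift # nil simulates a blank download
    check_prev if @newdata.nil? && checkprev
  end

  def check_prev
    # File.size? returns nil for a missing or empty file, else the size in bytes.
    size = File.size?(@prev_path)
    @has_prev = !size.nil? && size > MIN_PREV_BYTES
  end
end

# Retry only the documents that came back blank but have an earlier crawl on disk.
def retry_docs(docs)
  docs.each { |doc| doc.scrape(false) if doc.has_prev }
end

# Builds a throwaway results folder, runs one crawl pass and one retry pass.
def run_sketch
  Dir.mktmpdir do |dir|
    FileUtils.mkdir_p(File.join(dir, "example.com"))
    File.write(File.join(dir, "example.com", "Privacy Policy.txt"), "x" * 100)

    # First download is blank, second succeeds; a previous crawl exists on disk.
    flaky = SketchDoc.new(dir, "example.com", "Privacy Policy", [nil, "policy text"])
    # First download is blank and there is no previous crawl, so no retry.
    fresh = SketchDoc.new(dir, "example.com", "Terms of Service", [nil, "never fetched"])

    docs = [flaky, fresh]
    docs.each(&:scrape)
    retry_docs(docs)
    docs
  end
end
```

Calling run_sketch returns both documents: the one with a previous crawl on disk gets a second attempt, while the one without is left alone. Passing false on the retry keeps a second blank result from triggering another previous-crawl check.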