I added a new feature to the script that will hopefully make the TOSBack crawls more reliable! The app has had intermittent problems downloading pages; in some cases, the crawl data would come back blank even when the document had downloaded properly before. I decided to write a couple of methods that check for previous crawl data and retry the scrape if an existing policy is already in the crawl folder.
Here’s what it looks like:
    def scrape(checkprev=true) #see below
      download_full_page()
      if @newdata
        apply_xpath()
        strip_tags()
        format_newdata()
      elsif (!@newdata && (checkprev == true))
        check_prev()
      end
    end #scrape

    def check_prev
      prev = (File.exists?("#{$results_path}#{@site}/#{@name}.txt")) ? File.open("#{$results_path}#{@site}/#{@name}.txt") : nil
      unless prev == nil
        if File.size(prev) > 32
          @has_prev = true
        end #if
      end #unless
      prev.close if prev
    end #check_prev
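The key idea in check_prev is the size threshold: a previous result file only counts if it holds more than 32 bytes, which filters out the blank files left behind by a failed crawl. Here’s a standalone sketch of that test (the helper name and temp files are just for illustration, not part of the real script):

    require "tempfile"

    # Standalone sketch of check_prev's size test: a previous result file
    # only counts if it holds more than 32 bytes of policy text.
    def prev_usable?(path)
      File.exist?(path) && File.size(path) > 32
    end

    blank = Tempfile.new("blank")   # empty file, like a blank crawl result
    full  = Tempfile.new("full")
    full.write("x" * 100)           # a real policy would be much larger
    full.flush

    prev_usable?(blank.path)  # => false
    prev_usable?(full.path)   # => true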
The “def scrape(checkprev=true)” gives the “checkprev” parameter a default value, so existing callers of .scrape() don’t need to pass anything. I implemented it this way so I wouldn’t have to change the existing code, but I can still turn off the previous-data check when I call the retry method.
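The default-argument pattern boils down to this (the method body here is a toy stand-in, not the real scrape logic):

    # Minimal sketch of the default-argument pattern used by scrape.
    def scrape(checkprev = true)
      checkprev ? :checked : :skipped
    end

    scrape        # => :checked  (flag defaults to true for existing callers)
    scrape(false) # => :skipped  (the retry path passes false explicitly)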
    def retry_docs
      @sites.each do |site|
        site.docs.each do |doc|
          doc.scrape(false) if doc.has_prev == true
        end #@docs
      end #@sites
    end #retry_docs
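To make the retry pass concrete, here’s a self-contained version of the same iteration; Site and Doc here are stand-in Structs, not the app’s real classes:

    # Stand-in data types; the real Site/Doc classes live elsewhere in the app.
    Doc = Struct.new(:name, :has_prev, :retried) do
      def scrape(checkprev = true)
        self.retried = true unless checkprev  # the retry path passes false
      end
    end
    Site = Struct.new(:docs)

    sites = [Site.new([Doc.new("privacy", true, false),
                       Doc.new("tos", false, false)])]

    # Same shape as retry_docs: only docs flagged has_prev get re-scraped.
    sites.each do |site|
      site.docs.each { |doc| doc.scrape(false) if doc.has_prev }
    end

    sites.first.docs.map(&:retried)  # => [true, false]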