Adding a Method to Check for Previous Crawl Data

I added a new feature to the script that should make the TOSBack crawls more reliable. The app has had intermittent problems downloading pages; in some cases, the crawl data would come back blank even though the document had downloaded properly before. I decided to write a couple of methods that check for previous crawl data and retry the scrape if there is an existing policy in the crawl folder.

Here’s what it looks like:

  def scrape(checkprev=true) #see below
    download_full_page()
    if @newdata
      apply_xpath()
      strip_tags()
      format_newdata()
    elsif checkprev
      # download came back blank, so look for an earlier crawl
      check_prev()
    end
  end #scrape

  def check_prev
    # File.exist? replaces File.exists?, which was removed in Ruby 3.2
    path = "#{$results_path}#{@site}/#{@name}.txt"
    # a previous crawl only counts if the file is larger than 32 bytes
    @has_prev = true if File.exist?(path) && File.size(path) > 32
  end #check_prev

The “def scrape(checkprev=true)” signature gives “checkprev” a default value of true, which is used whenever no argument is passed to .scrape(). I implemented it this way so I wouldn’t have to change the existing calls to .scrape(), while still being able to turn off the previous-data check when I call the retry method.
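
To show the default argument on its own, here is a minimal sketch. DemoDoc and its calls list are hypothetical stand-ins, not part of the TOSBack script; they only record which value “checkprev” ends up with.

```ruby
# Hypothetical stand-in for the real document class; only the
# default-argument behavior is modeled here.
class DemoDoc
  attr_reader :calls

  def initialize
    @calls = []
  end

  # checkprev falls back to true when the caller passes nothing
  def scrape(checkprev = true)
    @calls << checkprev
  end
end

demo = DemoDoc.new
demo.scrape          # existing call sites: checkprev is true
demo.scrape(false)   # retry path: the previous-data check is skipped
p demo.calls         # prints [true, false]
```

Existing callers keep working unchanged, and only the retry code has to pass an explicit false.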

  def retry_docs
    @sites.each do |site|
      site.docs.each do |doc|
        doc.scrape(false) if doc.has_prev == true
      end #@docs
    end #@sites
  end #retry_docs
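
Putting the pieces together, the flow might look roughly like the sketch below. FakeDoc, Site, run_retry_demo and the queue of fake download results are all assumptions made for illustration; the real script downloads pages and builds its paths from $results_path, @site and @name.

```ruby
require "tmpdir"

# Hypothetical stand-in; downloads are faked with a queue of results.
class FakeDoc
  attr_reader :has_prev, :newdata

  def initialize(path, downloads)
    @path = path            # where a previous crawl would be stored
    @downloads = downloads  # values returned by successive "downloads"
    @has_prev = false
  end

  def scrape(checkprev = true)
    @newdata = @downloads.shift
    check_prev if @newdata.nil? && checkprev
  end

  def check_prev
    @has_prev = true if File.exist?(@path) && File.size(@path) > 32
  end
end

Site = Struct.new(:docs)

# Same shape as retry_docs above, but taking the sites as an argument.
def retry_docs(sites)
  sites.each do |site|
    site.docs.each do |doc|
      doc.scrape(false) if doc.has_prev
    end
  end
end

def run_retry_demo
  Dir.mktmpdir do |dir|
    path = File.join(dir, "policy.txt")
    File.write(path, "x" * 64)   # earlier crawl, larger than 32 bytes

    doc = FakeDoc.new(path, [nil, "policy text"])  # first download is blank
    doc.scrape                       # blank result, but previous data is found
    retry_docs([Site.new([doc])])    # second attempt succeeds
    doc.newdata
  end
end

puts run_retry_demo   # prints "policy text"
```

Passing false on the retry keeps a second blank download from triggering another previous-data check.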
