Searching Crawl Data for Empty Files

I needed a way to scan the crawl data programmatically and find blank policies that I didn’t know about. With around 1000 rules in the TOSBack app, I need some way to double-check the data that comes back from the web scraping. I decided to set up a class method that logs any site directory with no text files in it and any file smaller than a byte limit I pass in.

Here’s how it’s used!

If I run the script with the “-empty” argument:

elsif ARGV[0] == "-empty"
  TOSBackApp.find_empty_crawls($results_path, 512)

It calls this method:

  def self.find_empty_crawls(path=$results_path, byte_limit)
    # "#{path}*" only lists the crawl directory's entries if path ends with a slash.
    # Dir.glob never yields "." or ".." for "*", so no check is needed for them.
    Dir.glob("#{path}*") do |filename| # each dir in crawl
      next unless File.directory?(filename)

      files = Dir.glob("#{filename}/*.txt")
      if files.empty?
        TOSBackSite.log_stuff("#{filename} is an empty directory.", $empty_log)
      else
        files.each do |file|
          # Flag any crawled policy that is smaller than the limit
          TOSBackSite.log_stuff("#{file} is below #{byte_limit} bytes.", $empty_log) if File.size(file) < byte_limit
        end # files.each
      end # if files.empty?
    end # Dir.glob(path)
  end # find_empty_crawls
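
If you want to try the idea without the rest of the app, here’s a minimal standalone sketch. It is not the TOSBack code: the method name find_small_crawl_files, the sample directory names and the file names are all made up for illustration, and it returns the messages as an array instead of calling TOSBackSite.log_stuff. It also builds the glob pattern with File.join, so the path doesn’t need a trailing slash.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical standalone variant of the check (not the TOSBack method):
# collects messages in an array instead of writing them to a log.
def find_small_crawl_files(path, byte_limit)
  messages = []

  # Sort so the results come back in a predictable order.
  Dir.glob(File.join(path, "*")).sort.each do |dir|
    next unless File.directory?(dir)

    files = Dir.glob(File.join(dir, "*.txt")).sort
    if files.empty?
      messages << "#{dir} is an empty directory."
    else
      files.each do |file|
        messages << "#{file} is below #{byte_limit} bytes." if File.size(file) < byte_limit
      end
    end
  end

  messages
end

# Build a throwaway crawl tree (made-up names): one site with no files,
# one site with a tiny policy and a normal-sized one.
Dir.mktmpdir do |root|
  FileUtils.mkdir_p(File.join(root, "empty-site.example"))
  FileUtils.mkdir_p(File.join(root, "some-site.example"))
  File.write(File.join(root, "some-site.example", "Privacy Policy.txt"), "x" * 10)
  File.write(File.join(root, "some-site.example", "Terms of Service.txt"), "x" * 2048)

  puts find_small_crawl_files(root, 512)
end
```

Returning the messages rather than logging them makes the check easy to test, and the caller can still write each one to a log afterwards.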

Full code is here :)
