Troubleshooting ToSBack File Handling

For the past few days, ToSBack has been running so smoothly!

Every day, I scroll through the latest “Crawl” commit to see what’s new and to decide which rules need to be updated. And every day this week, I’ve been pleasantly surprised to find that not many files are being modified!

~# tail -f rubytosback/tosback2/logs/run.log
2012-12-05 01:14:30 -0600 - Script finished! Check errors.log for rules to fix :)
2012-12-06 01:06:02 -0600 - Beginning script!
2012-12-06 01:14:43 -0600 - Script finished! Check errors.log for rules to fix :)
2012-12-07 01:06:02 -0600 - Beginning script!
2012-12-07 01:23:40 -0600 - Script finished! Check errors.log for rules to fix :)
2012-12-08 01:06:02 -0600 - Beginning script!
2012-12-09 01:06:02 -0600 - Beginning script!
2012-12-10 01:06:02 -0600 - Beginning script!
2012-12-11 00:06:02 -0600 - Beginning script!

Oh, wait. This is a bad surprise.

In reality, ToSBack had been throwing an exception on every run since December 8, so the script never reached its “Script finished!” line. Each run was stopping at the same file:

tosback.rb:176:in `initialize': No such file or directory - ../crawl/godaddy.com/Trademark and/or Copyright Infringement Policy.txt (Errno::ENOENT)
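The gap in the log above is easy to miss by eye, so here is a minimal sketch of how the check could be automated. It assumes the log format shown in run.log; the helper name unfinished_runs is hypothetical and not part of ToSBack. Note that the most recent entry may simply be a run that is still in progress.

```ruby
# Hypothetical helper (not part of tosback.rb): return every
# "Beginning script!" line that has no "Script finished!" line after it.
def unfinished_runs(lines)
  unfinished = []
  pending = nil # the most recent "Beginning script!" line still awaiting a finish
  lines.each do |line|
    if line.include?("Beginning script!")
      unfinished << pending if pending # previous run never finished
      pending = line.strip
    elsif line.include?("Script finished!")
      pending = nil
    end
  end
  unfinished << pending if pending # last run never finished (or is still running)
  unfinished
end

# Excerpt copied from the run.log output above
sample = [
  "2012-12-07 01:06:02 -0600 - Beginning script!",
  "2012-12-07 01:23:40 -0600 - Script finished! Check errors.log for rules to fix :)",
  "2012-12-08 01:06:02 -0600 - Beginning script!",
  "2012-12-09 01:06:02 -0600 - Beginning script!"
]
puts unfinished_runs(sample)
```

Running something like this from cron after the crawl would have flagged the problem on the first failed night instead of the fourth.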

In the GoDaddy rule file, the document looked like this:

<docname name="Trademark and/or Copyright Infringement Policy">
  <url name="https://www.godaddy.com/agreements/showdoc.aspx?pageid=TRADMARK_COPY" xpath="//td[@class='bodyText']">
    <norecurse name="arbitrary"/>
  </url>
</docname>

And “tosback.rb:176” is referring to line 176 in tosback.rb:

crawl_file = File.open(new_path,"w") # new file or overwrite old file

Because the docname contains a “/” (forward slash), ToSBack tries to open a file called “or Copyright Infringement Policy.txt” in a directory called “Trademark and”. In write mode, File.open() creates the file if it doesn’t exist, but you can see that it won’t create missing directories for you:

$ mkdir testdir
$ cd testdir
$ irb
1.9.3-p327 :001 > path = "new_file.txt"
 => "new_file.txt" 
1.9.3-p327 :002 > new_file = File.open(path,"w")
 => #<File:new_file.txt> 
1.9.3-p327 :003 > new_file.puts "all work and no play"
 => nil 
1.9.3-p327 :004 > new_file.close
 => nil 
1.9.3-p327 :005 > exit
$ ls
new_file.txt
$ tail new_file.txt 
all work and no play
$ irb
1.9.3-p327 :001 > path="new/file.txt"
 => "new/file.txt" 
1.9.3-p327 :003 > File.open(path,"w")
Errno::ENOENT: No such file or directory - new/file.txt
	from (irb):3:in `initialize'
	from (irb):3:in `open'
	from (irb):3
	from /Users/jimmy/.rvm/rubies/ruby-1.9.3-p327/bin/irb:16:in `<main>'
1.9.3-p327 :004 > exit
$ mkdir new
$ irb
1.9.3-p327 :001 > path = "new/file.txt"
 => "new/file.txt" 
1.9.3-p327 :002 > file = File.open(path,"w")
 => #<File:new/file.txt> 
1.9.3-p327 :003 > file.close
 => nil 
1.9.3-p327 :005 > exit
$ ls new
file.txt
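The manual mkdir step in that session can also be done from Ruby. This is a minimal sketch, not what tosback.rb actually does: FileUtils.mkdir_p is from the standard library, while the helper name open_with_parents is made up for illustration.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper (not part of tosback.rb): create any missing
# parent directories, then open the file for writing.
def open_with_parents(path)
  FileUtils.mkdir_p(File.dirname(path)) # does nothing if the directory already exists
  File.open(path, "w")
end

# Try it in a throwaway directory so nothing is left behind
Dir.mktmpdir do |dir|
  path = File.join(dir, "new", "file.txt")
  file = open_with_parents(path)
  file.puts "all work and no play"
  file.close
  puts File.read(path)
end
```

For ToSBack this would have hidden the real problem rather than fixed it: the crawl would have quietly created a “Trademark and” directory, which is not where I want that policy to live.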

So I removed the slash in the docname to resolve the issue.
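Editing the rule by hand works, but the same slip could happen in any future rule. Here is a rough sketch of how a docname could be made filesystem-safe before it becomes a path. The helper names safe_filename and encoded_filename are hypothetical, and replacing separators with “-” is just one possible choice.

```ruby
require "uri"

# Hypothetical helper (not part of tosback.rb): swap path separators
# for a dash so the docname always maps to a single file name.
def safe_filename(docname)
  docname.gsub(%r{[/\\]}, "-") + ".txt"
end

# Alternative sketch: percent-encode the docname so the original name
# can be recovered later with URI.decode_www_form_component.
def encoded_filename(docname)
  URI.encode_www_form_component(docname) + ".txt"
end

docname = "Trademark and/or Copyright Infringement Policy"
puts safe_filename(docname)
puts encoded_filename(docname)
```

The first version keeps file names readable but loses the original character; the second is reversible at the cost of uglier names (spaces become “+” and the slash becomes “%2F”).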

One Response to “Troubleshooting ToSBack File Handling”

  1. I suggest url encoding it. http://ruby-doc.org/stdlib-1.9.3/libdoc/uri/rdoc/URI/Escape.html

    irb(main):006:0> enc_uri = URI.escape("new/file.txt", "/");
    irb(main):007:0* p enc_uri
    "new%2Ffile.txt"

    Brian
