For the past few days, ToSBack has been running so smoothly!
Every day, I scroll through the latest “Crawl” commit to see what’s new and to decide which rules need to be updated. And every day this week, I’ve been pleasantly surprised to find that not many files are being modified!
~# tail -f rubytosback/tosback2/logs/run.log
2012-12-05 01:14:30 -0600 - Script finished! Check errors.log for rules to fix :)
2012-12-06 01:06:02 -0600 - Beginning script!
2012-12-06 01:14:43 -0600 - Script finished! Check errors.log for rules to fix :)
2012-12-07 01:06:02 -0600 - Beginning script!
2012-12-07 01:23:40 -0600 - Script finished! Check errors.log for rules to fix :)
2012-12-08 01:06:02 -0600 - Beginning script!
2012-12-09 01:06:02 -0600 - Beginning script!
2012-12-10 01:06:02 -0600 - Beginning script!
2012-12-11 00:06:02 -0600 - Beginning script!
Oh, wait. This is a bad surprise.
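In reality, ToSBack had been throwing an exception for the past few days: since December 8, the log shows “Beginning script!” each night with no matching “Script finished!”. The script was stopping at a specific file.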
tosback.rb:176:in `initialize': No such file or directory - ../crawl/godaddy.com/Trademark and/or Copyright Infringement Policy.txt (Errno::ENOENT)
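A stall like this can be caught automatically by checking that every “Beginning script!” line in run.log is eventually followed by a “Script finished!” line. Here is a minimal sketch of that idea. It is not part of ToSBack: the method name `unfinished_runs` is hypothetical, and the message strings are simply copied from the log output above.

```ruby
# Hypothetical helper (not part of ToSBack): returns the timestamps of runs
# that logged "Beginning script!" but never logged "Script finished!".
def unfinished_runs(log_lines)
  unfinished = []
  pending = nil # timestamp of a run that has started but not yet finished

  log_lines.each do |line|
    # Log lines look like "<timestamp> - <message>".
    timestamp, message = line.strip.split(" - ", 2)
    next if message.nil?

    if message.start_with?("Beginning script!")
      # A new run began while the previous one never reported finishing.
      unfinished << pending if pending
      pending = timestamp
    elsif message.start_with?("Script finished!")
      pending = nil
    end
  end

  # The last run may have crashed, or may simply still be in progress.
  unfinished << pending if pending
  unfinished
end

sample = [
  "2012-12-07 01:06:02 -0600 - Beginning script!",
  "2012-12-07 01:23:40 -0600 - Script finished! Check errors.log for rules to fix :)",
  "2012-12-08 01:06:02 -0600 - Beginning script!",
  "2012-12-09 01:06:02 -0600 - Beginning script!"
]
puts unfinished_runs(sample)
```

With the sample lines above, this prints the December 8 and December 9 timestamps. Note that the most recent entry could also be a run that is still in progress, so it is worth checking the time before treating it as a failure.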
In the GoDaddy rule file, the document was defined like this:
<docname name="Trademark and/or Copyright Infringement Policy">
<url name="https://www.godaddy.com/agreements/showdoc.aspx?pageid=TRADMARK_COPY" xpath="//td[@class='bodyText']">
<norecurse name="arbitrary"/>
</url>
</docname>
And “tosback.rb:176” refers to line 176 in tosback.rb:
crawl_file = File.open(new_path,"w") # new file or overwrite old file
Because the docname contains a “/” (forward slash), ToSBack tries to open a file called “or Copyright Infringement Policy.txt” in a directory called “Trademark and/”. In write mode, File.open() normally creates the file if it doesn’t exist, but as you can see, it won’t create missing directories for you:
$ mkdir testdir
$ cd testdir
$ irb
1.9.3-p327 :001 > path = "new_file.txt"
 => "new_file.txt"
1.9.3-p327 :002 > new_file = File.open(path,"w")
 => #<File:new_file.txt>
1.9.3-p327 :003 > new_file.puts "all work and no play"
 => nil
1.9.3-p327 :004 > new_file.close
 => nil
1.9.3-p327 :005 > exit
$ ls
new_file.txt
$ tail new_file.txt
all work and no play
$ irb
1.9.3-p327 :001 > path="new/file.txt"
 => "new/file.txt"
1.9.3-p327 :003 > File.open(path,"w")
Errno::ENOENT: No such file or directory - new/file.txt
	from (irb):3:in `initialize'
	from (irb):3:in `open'
	from (irb):3
	from /Users/jimmy/.rvm/rubies/ruby-1.9.3-p327/bin/irb:16:in `<main>'
1.9.3-p327 :004 > exit
$ mkdir new
$ irb
1.9.3-p327 :001 > path = "new/file.txt"
 => "new/file.txt"
1.9.3-p327 :002 > file = File.open(path,"w")
 => #<File:new/file.txt>
1.9.3-p327 :003 > file.close
 => nil
1.9.3-p327 :005 > exit
$ ls new
file.txt
So I removed the slash in the docname to resolve the issue.
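Editing the rule fixes this one document, but any future docname containing a path separator would fail the same way. One defensive option would be to sanitize the name before building the path. The following is only a sketch under assumptions: `safe_filename` and `crawl_path` are hypothetical names, not ToSBack’s actual code, and replacing the separator with “-” is an arbitrary choice.

```ruby
# Hypothetical helpers (not ToSBack's actual code).

# Replace characters that the filesystem would treat as path separators.
def safe_filename(docname)
  docname.gsub(%r{[/\\]}, "-")
end

# Build the path of the crawl file for one document of one site.
def crawl_path(crawl_dir, site, docname)
  File.join(crawl_dir, site, "#{safe_filename(docname)}.txt")
end

puts crawl_path("../crawl", "godaddy.com",
                "Trademark and/or Copyright Infringement Policy")
# prints: ../crawl/godaddy.com/Trademark and-or Copyright Infringement Policy.txt
```

The trade-off is that the file name on disk no longer matches the docname exactly, so anything that looks files up by docname would need to apply the same substitution.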