Adding a Default Content Type to a Mechanize Agent

I found a few sites where my scraper returned the content of the entire page, even after I updated the XPath in the rule. I couldn’t figure out what was going on, so I copied and pasted my script into IRB (is there a better way? can you “require” it?) and ran the following:

    scrapedata = mech_agent.get("http://somebrokenurl.com/privacy")
    scrapedata.class

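…the class of my scrape data was “Mechanize::File” instead of “Mechanize::Page”. Mechanize picks its parser from the response’s Content-Type header, and with no usable header it falls back to Mechanize::File, which is never parsed as HTML. I found a couple of articles and changed my agent to set a default content type when none is returned: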

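    require 'mechanize'

    mech_agent = Mechanize.new { |agent|
      agent.user_agent_alias = 'Mac FireFox'
      # Runs after every response; the hook receives (agent, uri, response, body).
      agent.post_connect_hooks << lambda { |_, _, response, _|
        # Fall back to HTML when the server sends no Content-Type header.
        if response.content_type.nil? || response.content_type.empty?
          response.content_type = 'text/html'
        end
      }
      agent.ssl_version = 'SSLv3' # obsolete protocol; current OpenSSL builds may refuse it
      agent.verify_mode = OpenSSL::SSL::VERIFY_NONE # less secure. Shouldn't matter for scraping.
      agent.agent.http.reuse_ssl_sessions = false
    }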
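To check the fallback logic without hitting a real site, here is a minimal sketch that runs the same condition against hand-built responses from Ruby’s standard library. It assumes the hook only ever touches the response argument; the name default_to_html and the two sample responses are made up for illustration, and no Mechanize agent is involved.

```ruby
require 'net/http'

# Same condition as the hook in the agent setup, pulled out into a lambda
# (hypothetical name) so it can be called directly.
default_to_html = lambda { |_agent, _uri, response, _body|
  type = response.content_type
  response.content_type = 'text/html' if type.nil? || type.empty?
}

# A bare 200 response with no Content-Type header, standing in for the broken site.
missing = Net::HTTPOK.new('1.1', '200', 'OK')
default_to_html.call(nil, nil, missing, nil)

# A response that already declares a type should be left alone.
declared = Net::HTTPOK.new('1.1', '200', 'OK')
declared['Content-Type'] = 'application/pdf'
default_to_html.call(nil, nil, declared, nil)

puts missing.content_type   # text/html
puts declared.content_type  # application/pdf
```

Because the hook only fills in a missing header, responses that already declare a type (PDFs, images and so on) keep whatever parser Mechanize would normally pick for them.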

Now the response comes back as a Mechanize::Page, so I can use its “.search()” method to extract the policy using XPath.
