Screen scraping challenge in Ruby

Please provide the code to crawl Bountify.co and gather data on all the bounties that have been posted thus far.

Please provide a ruby program that does the following:

  • Visits https://bountify.co/bounties and creates an instance of a Bounty class (see next point) for every bounty currently on the site.
  • There should be an object representing a Bounty. It should have the following instance variables: title, tags (an array of tag titles), and amount (the bounty amount as an integer in dollars).
  • It should go through all the pagination links automatically.
  • It shouldn't visit the individual bounty pages, it should just scrape from the /bounties page.
  • The crawl function should return an array with all of the bounty objects.
  • It should use the hpricot gem for screen scraping.
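
For reference, the Bounty object described in the list above could be sketched roughly as follows. This only illustrates the requested shape (title, tags, amount); the constructor signature is an assumption, not something the question specifies.

```ruby
# Hypothetical sketch of the Bounty object from the requirements above.
# The argument order of the constructor is an assumption.
class Bounty
  attr_reader :title, :tags, :amount

  def initialize(title, tags, amount)
    @title  = title   # String, e.g. "Screen scraping challenge in Ruby"
    @tags   = tags    # Array of tag titles, e.g. ["ruby"]
    @amount = amount  # Integer number of dollars, e.g. 10
  end
end

bounty = Bounty.new("Screen scraping challenge in Ruby", ["ruby"], 10)
```

The crawl function would then collect objects like `bounty` into the array it returns.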

Let me know in the comments if you have any questions.

By the way, I'm going to edit and use this code for another project (this isn't just a busywork question).

@bevan, should this only get the active bounties or every bounty?
alex 9 years ago
@alex, all bounties are fine. Thanks!
bevan 9 years ago
@bevan, could you clarify the random delay requests? Do you want the requests for pagination to be different?
alex 9 years ago
@alex, no worries about the delays, I'm going to remove that from the question. I originally posted it because some servers have crawl policies that specify a maximum number of requests per second (so I should have asked for fixed delays in between requests, not delays of random durations), but I can just set that myself later if it becomes an issue.
bevan 9 years ago
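
If a fixed delay between page requests does become necessary, one way to add it is sketched below. `each_page_with_delay` is a hypothetical helper, not part of any solution in this thread, and the page count and delay values are placeholders.

```ruby
# Hypothetical helper: yields each page number in turn and sleeps a fixed
# number of seconds between iterations (but not after the last one).
def each_page_with_delay(page_count, delay_seconds)
  1.upto(page_count) do |page|
    yield page
    sleep(delay_seconds) if page < page_count
  end
end

visited = []
each_page_with_delay(3, 0.01) do |page|
  # In a real crawler, this is where the page would be fetched and parsed.
  visited << "https://bountify.co/bounties?page=#{page}"
end
```

A larger `delay_seconds` value (for example 1) would keep the crawler under a one-request-per-second policy.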
awarded to alex
Tags
ruby


1 Solution

Winning solution

The script below should work for you.

#!/usr/bin/ruby
require "rubygems"
require "hpricot"
require "open-uri"
require "pp"
require "openssl"

# Workaround: this turns off SSL certificate verification for the whole process.
OpenSSL::SSL::VERIFY_PEER = OpenSSL::SSL::VERIFY_NONE

def crawl()
    pagination = []
    bounties = []
    # get number of pages
    document = Hpricot(open("https://bountify.co/bounties"))
    document.search("div.pagination li a").each do |pages|
        theVal = pages.inner_html.gsub(/\n/, "")
        pagination.push(theVal)
    end
    pagination.pop()
    pagination.shift()
    for a in 0..pagination.length
        # get each question
        document2 = Hpricot(open("https://bountify.co/bounties?page=#{a+1}"))
        document2.search("div.question").each do |question|
            amountlive = question.search("div.bounty-amount-live")
            if amountlive.any?
                theAmount = question.search("div.bounty-amount-live").inner_html.gsub(/\n/, "")
            else
                theAmount = question.search("div.bounty-amount-dead").inner_html.gsub(/\n/, "")
            end
            theTitle = question.search("div.question-partial-title a").inner_html.gsub(/\n/, "")
            # array push
            theTags = []
            question.search("div\#badges a:nth-child(even)").each do |tags|
                tempTags = tags.inner_html
                # String#any? only exists on Ruby 1.8; empty? works on every version
                unless tempTags.empty?
                    theTags.push(tempTags)
                end
            end
            newBounty = [theTitle, theTags, theAmount]
            bounties.push(newBounty)
        end
    end
    # print the array, then return it so callers can use the result
    pp bounties
    bounties
end
crawl()

The output will look like a longer version of this array:

[
    ["Screen scraping challenge in Ruby", ["ruby"], "$10"],
     ["SecondLife Transaction History - Take 2", ["PHP", "java"], "$10"],
     ["Ruby Raise Exception Error", ["ruby", " Error "], "$1"],
     ["Script to sepia tone and watermark images in a folder",
      ["ruby", "ImageMagick"],
      "$10"],
]

Enjoy!

Notes

I used the pretty print (pp) library so the output would look better, and I required rubygems because that is probably the easiest way to install hpricot.
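
The sample output shows amounts as strings such as "$10" and one tag with stray spaces (" Error "). If integer dollar amounts are needed, as the question asks, one possible post-processing step is sketched here. `normalize_row` is a hypothetical helper, not part of the script above, and it assumes each row has the [title, tags, amount] layout shown in the sample output.

```ruby
# Hypothetical helper: cleans up one [title, tags, amount] row.
# Assumes the amount is a string like "$10".
def normalize_row(row)
  title, tags, amount = row
  [
    title.strip,
    tags.map { |tag| tag.strip },   # " Error " becomes "Error"
    amount.delete("^0-9").to_i      # drop everything except digits: "$10" becomes 10
  ]
end

row = ["Ruby Raise Exception Error", ["ruby", " Error "], "$1"]
clean = normalize_row(row)
# clean is ["Ruby Raise Exception Error", ["ruby", "Error"], 1]
```

Mapping the whole result through it, for example `bounties.map { |row| normalize_row(row) }`, would clean every row at once.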

Edits

  1. Streamlined code
  2. Added OpenSSL code. (Thanks @bevan.)
Cool, thanks! Works fine, just had to add this to the top for some reason: require 'openssl'; OpenSSL::SSL::VERIFY_PEER = OpenSSL::SSL::VERIFY_NONE
bevan 9 years ago
@bevan Thanks for pointing that out, I'll edit the code.
alex 9 years ago