Charts of last 100 posts to HN; by username and by domain name.

Description

Using the HN Search API, retrieve a list of the last one hundred articles. (I only want the domain names and the posters' names.)

Make one chart of the domains (along X axis) posted, and number of times (along Y axis) each was posted.

Make another chart of the usernames (along X axis) and the number of articles (along Y axis) they've posted.

Save the charts to a file. Finish.

Notes

Python 2.7 is preferred, but any language is acceptable. Solution must be able to run on OS X (snow leopard); Linux and Windows compatibility is great.

Must use API available at http://www.hnsearch.com/api

If possible: have a commented variable in the script to include / exclude dead items.

If possible: have a commented variable in the script to include / exclude items that are "self posts" (posts directly to HN rather than to an external website.) Minimum behaviour is to handle these posts gracefully; whether they're included or not included.

If possible: have a commented variable in the script to increase / decrease the quantity of articles returned.
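
The optional switches in these notes could look roughly like the sketch below. It assumes each item is a dict with "username" and "domain" keys and that self posts have a domain of None (as the solutions further down report); the names INCLUDE_SELF_POSTS, MAX_ITEMS, SELF_POST_LABEL and count_posts are made up for illustration. A dead-item switch is left out because the field that marks dead items is not named anywhere here.

```python
from collections import Counter

# Commented switches, as the notes ask for (illustrative names).
INCLUDE_SELF_POSTS = True        # False drops posts made directly to HN
MAX_ITEMS = 100                  # how many of the fetched items to count
SELF_POST_LABEL = "(self post)"  # shown instead of a missing domain


def count_posts(items, include_self_posts=INCLUDE_SELF_POSTS, max_items=MAX_ITEMS):
    """Return (users, domains) Counters for a list of item dicts.

    Assumption: each item looks like {"username": ..., "domain": ...},
    with "domain" set to None for self posts.
    """
    users, domains = Counter(), Counter()
    for item in items[:max_items]:
        domain = item.get("domain")
        if domain is None and not include_self_posts:
            continue  # skip self posts entirely when they are excluded
        users[item["username"]] += 1
        domains[domain or SELF_POST_LABEL] += 1
    return users, domains


# Made-up sample data, not real API output.
sample = [
    {"username": "bob", "domain": "example.com"},
    {"username": "ann", "domain": None},
    {"username": "bob", "domain": "example.com"},
]
users, domains = count_posts(sample)
print(users.most_common())
print(domains.most_common())
```

Either way, self posts are handled gracefully: they are counted under a placeholder label or skipped, never passed on as None.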

Sample Output

The charts can be text or graphics. A simple example of a text chart is shown below, but any format is fine as long as the charts are readable and clear. The charts do not have to be saved as a single file.

 10             *
 09             *         *
 08             *         *
 07             *         *
 06             *       * *
 05   *         *   *   * *   *
 04   *         *   * * * *   *
 03   *   *     *   * * * *   *
 02   *   *     * * * * * * * *
 01 * * * * * * * * * * * * * *
    A B C D E F G H I J K L M N


    KEY:

    A -- Bob
    B -- Ann
    C -- Mary
    D -- John

(Or domains) etc etc.

Or they can be graphic, using matplotlib. See this URL for very rough example charts http://imgur.com/a/DWMcV But any readable chart is acceptable.
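
For the text variant, a lettered chart like the sample above could be rendered along these lines. This is only a sketch: render_text_chart and save_chart are hypothetical helper names, the input is assumed to be a dict of name to count, and it stops at 26 columns because it uses one letter per column.

```python
import string


def render_text_chart(counts):
    """Return a text bar chart with lettered columns and a key.

    `counts` maps a name (user or domain) to a positive integer.
    Names are sorted alphabetically and labelled A, B, C, ...
    """
    if not counts:
        return ""
    names = sorted(counts)
    if len(names) > len(string.ascii_uppercase):
        raise ValueError("this sketch only labels up to 26 columns")
    letters = string.ascii_uppercase[:len(names)]

    lines = []
    # One row per count level, highest first.
    for level in range(max(counts.values()), 0, -1):
        marks = ["*" if counts[name] >= level else " " for name in names]
        lines.append(("%02d " % level + " ".join(marks)).rstrip())
    lines.append("   " + " ".join(letters))
    lines.append("")
    lines.append("KEY:")
    lines.extend("%s -- %s" % pair for pair in zip(letters, names))
    return "\n".join(lines)


def save_chart(counts, path):
    """Write the rendered chart to a text file at `path`."""
    with open(path, "w") as handle:
        handle.write(render_text_chart(counts) + "\n")


# Made-up counts for illustration.
print(render_text_chart({"Bob": 1, "Ann": 3, "Mary": 2}))
```

Calling save_chart once per chart (for example with "users.txt" and "domains.txt") would cover the "save the charts to a file" step.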

Is spitting out a Google Charts URL okay?
skram over 6 years ago
Yes, spitting out a Google Charts URL is fine.
danbc over 6 years ago
awarded to jlengrand


3 Solutions


Here is a working solution:

from collections import Counter
import requests

LIMIT = 50  # API max: 100
IGNORE_SELF_POSTS = False

def print_chart(counter):
    max_value = max(counter.values())
    keys = counter.keys()
    longest_key = max(len(key) for key in keys)

    for i in reversed(xrange(1, max_value + 1)):
        print '%02d' % i,

        for j in xrange(len(counter)):
            print '*' if counter[keys[j]] >= i else ' ',

        print

    for i in xrange(longest_key):
        print ' ' * 2,

        for j in xrange(len(counter)):
            print keys[j][i] if i + 1 <= len(keys[j]) else ' ',

        print

params = {
    'sortby': 'create_ts desc',
    'limit': LIMIT,
    'filter[fields][type]': 'submission'
    }

r = requests.get('http://api.thriftdb.com/api.hnsearch.com/items/_search',
                 params=params)

users_counter = Counter()
domains_counter = Counter()

for result in r.json()['results']:
    item = result['item']

    # self posts have no domain
    if IGNORE_SELF_POSTS and item['domain'] is None:
        continue
    users_counter[item['username']] += 1
    domains_counter[item['domain'] or 'none'] += 1

print 'DOMAINS'
print '======='
print

print_chart(domains_counter)

print 'USERS'
print '====='
print

print_chart(users_counter)

Output: http://pastie.org/pastes/5521137/text


This solution is in Ruby, primarily because Ruby is succinct and effective for small scripts like this. It should work on any platform where Ruby 1.9.3 runs (it works on Ubuntu, OS X and Windows). It also requires the native library ImageMagick, which is available on Ubuntu via apt-get and on OS X via brew; I think you will also find an executable for Windows.

http://www.imagemagick.org/script/binary-releases.php

This solution requires you to create two files. The first file is a Gemfile with the following content:

Gemfile (name of the first file)

source :rubygems  
gem 'gruff'  
gem 'rmagick'

Gruff is a Ruby library for generating pretty graphs, and it offers several options for doing so.

Finally, the content of the script:

crawler.rb (placed in the same directory as the Gemfile)

require 'net/http'  
require 'json'
require 'gruff'

GRAPH_RESOLUTION = "800x500"  
MINIMUM_Y_AXIS_VALUE = 1  
MAXIMUM_Y_AXIS_VALUE = 5  
RECORD_COUNT = 50  

uri = URI.parse("http://api.thriftdb.com/api.hnsearch.com/items/_search?limit=#{RECORD_COUNT}&sortby=create_ts%20desc")
response = Net::HTTP.get_response(uri)  
hn_results = JSON.parse(response.body)["results"]  

user_count_hash = hn_results.inject({}) {|hash, post| uname = post["item"]["username"]; hash[uname] = (hash[uname]||0)+1; hash }  
domain_count_hash = hn_results.inject({}) {|hash, post| domain = post["item"]["domain"]; hash[domain] = (hash[domain]||0)+1; hash}

{"UserGraph" => user_count_hash, "DomainGraph" => domain_count_hash}.each do |graph_name, graph_data|  
    graph_data.delete_if {|name,count| name.nil? }
    count_graph = Gruff::Bar.new(GRAPH_RESOLUTION)  
    count_graph.title = graph_name  
    count_graph.maximum_value = [graph_data.values.max, MAXIMUM_Y_AXIS_VALUE].max  
    count_graph.minimum_value = [graph_data.values.min, MINIMUM_Y_AXIS_VALUE].min  
    graph_data.each { |name, count| count_graph.data(name, count) }  
    count_graph.write("#{graph_name}.png")  
end  

I really tried to fix my code formatting here, but the formatting guide is too limited to make sense of. Anyhow, running the code above should generate the graphs attached below:

ruby crawler.rb

Before you run the script, you may have to install these libraries, either with Bundler or manually.

UserGraph

DomainGraph

A much better formatted Gist is here!

Winning solution

Hi,

Here is a better version, with classes that separate the grabbing part from the drawing part.
It contains roughly the same code, but I added an "offline" mode so that I could work on the train :).
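
The offline mode mentioned above boils down to caching the last API response on disk. A minimal sketch of that idea follows; it uses a JSON file rather than pickle, and load_or_fetch, fake_fetch and the cache file name are hypothetical names that do not appear in the solution itself.

```python
import json
import os
import tempfile


def load_or_fetch(fetch, cache_path, offline=False):
    """Return API data, reading it from `cache_path` when offline.

    `fetch` is any zero-argument callable returning JSON-serialisable
    data, for example a function wrapping the HTTP request.
    """
    if offline:
        if not os.path.exists(cache_path):
            raise IOError("no cached data at %s; run once online first" % cache_path)
        with open(cache_path) as handle:
            return json.load(handle)

    data = fetch()
    # Save the fresh response so a later offline run can reuse it.
    with open(cache_path, "w") as handle:
        json.dump(data, handle)
    return data


def fake_fetch():
    """Stand-in for the real API call (made-up data)."""
    return {"results": [{"item": {"username": "bob", "domain": None}}]}


cache = os.path.join(tempfile.mkdtemp(), "hn_cache.json")
online_data = load_or_fetch(fake_fetch, cache)
offline_data = load_or_fetch(fake_fetch, cache, offline=True)
print(online_data == offline_data)
```

A JSON cache stays readable across machines and Python versions, whereas a pickle file is best reused only on the computer that wrote it.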

What is good here is that each step is clearly separated.
So if you want to enhance the graph, just modify HackerPlotter, otherwise look at HackerCharter.

The number of items you want is passed as an argument to HackerCharter.

You can see later on that self posts (posts made directly to HN) have a domain of None in the JSON objects. I decided to display them as Ask/Show HN, but you can also remove them if you want.

And I am sorry but I don't understand what you mean by "dead items" :S

#!/usr/bin/python
# -*- coding: utf-8 -*-
# Copyright 2012 - Julien Lengrand-Lambert
import requests
import pickle
import json
import sys
import pylab as plt

class HackerCharter():
    """
    Class crawling the recent history from Hacker News and extracting
    information about usernames and domain names posted on it
    Also draws relevant information
    """
    def __init__(self, nb_items=100, offline=False):
        """
        nb_items is the actual number of elements that will be retrieved during
        the call to HN API.
        offline is a mode that will use a file saved on your local disk the
        last time you were online
        (I used it on the train)
        """
        self.json_data = self.get_data(offline, nb_items)

    def get_data(self, offline, nb_items):
        if offline:
            json_data = self.load_offline_data()
        else:
            json_data = self.get_last_posts(nb_items)

        return json_data

    def get_last_posts(self, nb_items):
        try:
            base_url = "http://api.thriftdb.com/api.hnsearch.com/items/_search"
            # orders by creation date; 'asc' returns the oldest items first
            # (use 'create_ts desc' to get the latest ones)
            params = {'limit': nb_items, 'sortby': 'create_ts asc'}

            r = requests.get(url=base_url, params=params)
            json_data = json.loads(r.text)

            # saves file for eventual offline use
            # (to be used only on same computer)
            pickle.dump(json_data, open("hn.dump", "wb"))

            return json_data

        except Exception:
            print "ERROR: Impossible to retrieve data. Exiting. . . "
            sys.exit(1)

    def load_offline_data(self):
        try:
            json_data = pickle.load(open("hn.dump", "rb"))
            return json_data
        except Exception:
            print "ERROR: Impossible to load offline data from hn.dump. Exiting. . . "
            sys.exit(1)

    def extract_items(self):
        usernames = []
        domains = []

        res = self.json_data["results"]
        for val in res:
            item = val['item']
            usernames.append(item['username'])
            domains.append(item['domain'])
        ut = Utils()
        usernames = ut.occurDict(usernames)
        domains = ut.occurDict(domains)

        return usernames, domains

    def plot(self):
        self.users, self.domains = self.extract_items()
        hp = HackerPlotter()
        hp.plot(self.users, "USERS")
        hp.plot(self.domains, "DOMAINS")

class HackerPlotter():
    """
    Plots the data we retrieved from Hacker News
    """
    def __init__(self):
        pass

    def plot(self, mydict, label):
        """
        Given a dictionary of values / occurrences and a label
        of the type of data used,
        prints a 2d chart and the corresponding labels.
        """
        mylist = self.get_list(mydict)
        self.print_data(label, mydict)

        # gets corresponding range for x axis
        my_range = range(1, len(mylist) + 1)
        plt.plot(my_range, mylist)
        plt.show()

    def print_data(self, label, mydict):
        """
        Prints the label of the data to be processed,
        and a listing of the dictionary keys.
        The keys will match the x axis of the generated plot
        """
        print "========================="
        print label
        print "========================="

        id = 1
        for key, value in mydict.iteritems():
            if key is None:
                key = 'Ask/Show HN'
            print "%d : %s" % (id, key)
            id += 1

    def get_list(self, mydict):
        """
        Returns a list of the input dictionary values
        Used for plotting data
        """
        mylist = []
        for key, value in mydict.iteritems():
            mylist.append(value)

        return mylist

class Utils():
    """
    Dummy class used to contain all utilities I needed during development
    """
    def __init__(self):
        pass

    def occurDict(self, listing):
        """
        Given a list of elements, returns a dictionary of each element with its
         number of occurrences
        TODO: Check that it removes duplicates
        """
        d = {}
        for i in listing:
            if i in d:
                d[i] = d[i] + 1
            else:
                d[i] = 1
        return d

if __name__ == '__main__':
    nb_items = 100
    # set offline=True to reuse the hn.dump file saved by a previous online run
    hn = HackerCharter(nb_items=nb_items, offline=False)
    hn.plot()

I used requests for the query to the API, and matplotlib for the drawing.
When the drawing appears, a simple print shows the domain/user corresponding to the plot.
Closing the first one makes the second one appear.

One thing I noticed is that some of the submissions don't contain URLs, Ask HN posts for example. They were shown as None, so I replaced them with something better.

Let me know if you want to change something.

That is actually the first time I have used a web API :).

Just added two pictures: one for domains and one for users.

Surprisingly, only 26 users are responsible for the last 100 posts!

Raaa, updating puts me down the line when I was first to post ^^. Not good to keep working on it :)
jlengrand over 6 years ago
@jlengrand BTW the history of all edits is visible by clicking on "View Timeline" beneath the last solution, and your first two edits are on record as being the first solutions posted (the bounty poster should select a solution based on that timeline).
bevan over 6 years ago
@bevan. Indeed, I didn't see that :). What is the advantage of ordering solutions by last edit, though? Imagine having 4 or 5 solutions and discussing with the guy: you would have to keep searching for the solution as they are being updated?
jlengrand over 6 years ago
That's a good point, in some cases it could be inconvenient to locate the updated solutions. In practice, there have been a manageable number of solutions per bounty, and users typically leave a single solution that they will edit several times rather than posting a second solution if they have another idea. Before, the first answer would stay on top, but that encouraged users to post a placeholder solution and later fill it in. The current system favors the user who posts the first "complete" solution which seems more fair. I'll make it more clear. Always open to suggestions too. Thanks!
bevan over 6 years ago
Thx! The reason is clearer to me now, and it totally makes sense actually. Kudos to you :).
jlengrand over 6 years ago
Is OP aware that this not only returns comments, but also the oldest ones?
sprt over 6 years ago
Can you elaborate? The oldest what? OP has not been really active, I must say :s. I don't know if the answer suits him, just that he chose me :)
jlengrand over 6 years ago
You're basically retrieving the 100 oldest (as opposed to latest) submissions and comments.
sprt over 6 years ago
Hum, in this case create_ts desc solves the problem, right? I'd say that a more proper implementation would even let the user choose :). Thx for the warning though
jlengrand over 6 years ago
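
A sketch of the fix discussed in these comments: request submissions only, newest first, and let the caller choose the order. It only builds the request URL (no network call), reuses the endpoint and parameter names that appear in the solutions above, and assumes they behave as described in this thread; build_search_url is a hypothetical helper name.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode        # Python 2

# Endpoint taken from the solutions in this thread.
BASE_URL = "http://api.thriftdb.com/api.hnsearch.com/items/_search"


def build_search_url(limit=100, newest_first=True):
    """Return the search URL for the latest (or oldest) submissions."""
    order = "desc" if newest_first else "asc"
    params = [
        ("limit", limit),
        ("sortby", "create_ts " + order),         # creation-date ordering
        ("filter[fields][type]", "submission"),   # leave comments out
    ]
    return BASE_URL + "?" + urlencode(params)


print(build_search_url())
print(build_search_url(limit=50, newest_first=False))
```

The resulting URL can be passed straight to requests.get in place of the separate params dictionary.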
View Timeline