Web Scraping & Hacker News

After covering basic web scraping in class, I thought it would be fun to get some additional practice scraping a live website and to build a simple, useful app with the result. The premise is straightforward, but it surfaces something you can't easily glean from simply visiting the Hacker News website - the program weighs each article's Reddit-style point count against its comment count to find the single "must read" article, then launches that article's webpage automatically from the console. See below for a repository link to 'Must Read Hacker News.'
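
Launching the page from the console is a small convenience worth sketching. The command for opening a URL differs by OS, so one approach (the helper names here are my own, not from the repo) is to pick the command from the host platform and shell out:

```ruby
# Hypothetical sketch: choose the OS-appropriate "open a URL" command,
# then hand it the article's link. Not the repo's actual implementation.
def opener_command(host_os = RbConfig::CONFIG["host_os"])
  case host_os
  when /darwin/      then "open"      # macOS
  when /mswin|mingw/ then "start"     # Windows
  else                    "xdg-open"  # most Linux desktops
  end
end

def launch_article(url)
  # Passing command and argument separately avoids shell-quoting issues.
  system(opener_command, url)
end
```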

Utilizing the Nokogiri gem, I parse four pages of Hacker News, pull out the elements matching the necessary CSS selectors, and collect the results into an array of hashes. From there, a simple each loop in the 'calculator' method identifies the one article with the highest point score and comment count.
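
The calculator method itself isn't shown in this post, but as a rough sketch (method name and sample data are hypothetical), ranking by the sum of points and comments is one way to pick out the article that leads on both counts:

```ruby
# Minimal sketch of the 'calculator' step, assuming a reading list shaped
# like the hashes the scraper builds. An article that tops both rankings
# will also top the combined score.
def must_read(reading_list)
  reading_list.max_by { |article| article[:points] + article[:comments] }
end

list = [
  { article_title: "Story A", link: "https://example.com/a", points: 120, comments: 45 },
  { article_title: "Story B", link: "https://example.com/b", points: 300, comments: 90 },
]
must_read(list) # => the hash for "Story B"
```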

The main challenge I ran into was finding a way to navigate the structure of the website - articles on Hacker News are arranged in a flat, non-nested table, meaning getting from the title/link to the point count and comments wasn't as easy as I had anticipated. To illustrate: the table row that holds the subtext is not nested inside the title's table row; instead, it is simply the next row in the table.

I got some excellent pointers from Dave Flaherty on how to do just that, utilizing Nokogiri's node method '.parent'. See the scraper method below:

require 'open-uri'   # URI.open fetches each page over HTTP
require 'nokogiri'

def page_scraper
  @reading_list = []
  @url_list.each do |url|
    doc = Nokogiri::HTML(URI.open(url))
    titles = doc.css(".title > a")
    score = nil
    comment_num = nil
    titles.each do |title_data|
      title = title_data.text
      link = title_data.attribute("href").value
      # The subtext row is the next sibling of the title's row, so walk
      # up from the <a> to its <tr>, then over to the neighboring row.
      subtexts = title_data.parent.parent.next.css(".subtext")
      subtexts.each do |subtext_line|
        score = subtext_line.css(".score").text.gsub(" points", "").to_i
        comment_num = subtext_line.css("a").text[/\d+(?= comments)/].to_i
      end
      @reading_list << { article_title: title, link: link, points: score, comments: comment_num }
    end
  end
end


Future improvements could include:
- Building out an interactive CLI app that allows for user engagement (e.g. “how many pages would you like to parse?”, “Would you like to see the article with the most points/comments/both?”, etc.).
- Investigating Hacker News' official API as a sturdier alternative to scraping, with access to richer article data.
- Integrating other relevant news sites (e.g. top Reddit post, top io9 article, etc).

For the curious, you can find the repo here: Must Read Hacker News. Feedback and suggestions are welcome!