Scrape Pages Using Python3/BeautifulSoup4

Deliverable

Python 3 code

Task

The Python 3 script should use BeautifulSoup4 to scrape the pages at the following URLs:

urls = ['https://www.hsx.com/security/view/ASTRA',
        'https://www.hsx.com/security/view/SWAR2',
        'https://www.hsx.com/security/view/MISS6',
        'https://www.hsx.com/security/view/DNDGN',
        'https://www.hsx.com/security/view/BIOS']

Ideally, the scraping activity would be callable from a single function (which could call other functions as necessary), where the input is a URL and the output is a dictionary (a minimal sketch of this interface follows the example). As an example, the dictionary for https://www.hsx.com/security/view/ASTRA would look like this:

{
    "url": "https://www.hsx.com/security/view/ASTRA",
    "movie_title": "Ad Astra",
    "symbol": "ASTRA",
    "status": "Active",
    "ipo_date": "May 4, 2017",
    "genre": "Sci-Fi",
    "mpaa_rating": "n/a",
    "phase": "Wrap",
    "release_date": "Jan 11, 2019",
    "release_pattern": "wide",
    "gross": "n/a",
    "theaters": "n/a",
    "description": "Brad Pitt toplines the sci-fi film Ad Astra. Twenty years after his father left on a one-way mission to Neptune, a space engineer searches the solar system to uncover why the mission failed. James Gray will direct from a script he co-wrote with Ethan Gross.",
    "distributor": "20th Century Fox",
    "director": "",
    "cast": {"actor1": {"name": "Brad Pitt", "starbond_symbol": "BPITT"},
             "actor2": {"name": "Donald Sutherland", "starbond_symbol": "DSUTH"},
             "actor3": {"name": "Ruth Negga", "starbond_symbol": "RNEGG"},
             "actor4": {"name": "Tommy Lee Jones", "starbond_symbol": "TLJON"}},
    "last_price": "H$39.80",
    "shares_held_long": 100023468,
    "shares_held_short": 2982120,
    "trading_volume_today": 550000,
    "delist_date": ""
}
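
A minimal sketch of that interface (the function name and parsing details are illustrative only, not part of the spec):

import urllib.request
from bs4 import BeautifulSoup


def scrape_security(url):
    # Illustrative stub only: fetch an HSX security page and return its
    # fields as a dictionary shaped like the example above.
    html = urllib.request.urlopen(url).read().decode('utf-8')
    soup = BeautifulSoup(html, 'html.parser')
    data = {'url': url}
    # ... populate movie_title, symbol, status, etc. from soup here;
    # the two solutions below are complete implementations ...
    return data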

Notes

The pages will take two forms: movies already released and movies set for future release. The primary difference between these two types of pages is that movies already released display a "Delist Price" and "Delist Date", whereas movies set for future release display "Current Value" and "Change Today" in their place, respectively.
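
One way to tell the two forms apart at runtime is to probe for the "Delist Date" label cell, which only released movies have. A sketch; the selector is an assumption borrowed from the solutions below and should be verified against a live page:

def is_released(soup):
    # Released movies carry a "Delist Date:" label cell; unreleased ones
    # show "Current Value" / "Change Today" in its place.
    return soup.find('td', {'class': 'label'}, string='Delist Date:') is not None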

Visual Aids

Here are some pictures where I have highlighted the places on the page from which the data should be scraped:
[Three annotated screenshots (hosted on Imgur) marking the regions to scrape; images not reproduced here.]

awarded to CyteBode


2 Solutions


Here's my solution. It doesn't require any external dependencies other than BeautifulSoup, and it works with all of the URLs:

import re
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup
from pprint import pprint


def get_data_column(title, soup):
    # Find the <td class="label"> whose text is e.g. "Symbol:", then
    # return the text of the <td> that follows it. The double-space
    # replace collapses the extra space HSX puts in its dates.
    label = soup.find('td', {'class': 'label'}, string='%s:' % title)

    if label:
        return label.findNext('td').text.strip().replace("  ", " ")

    return ""


def fetch_page(url):
    f = urllib.request.urlopen(url)
    html = f.read().decode('utf-8')
    html = html.replace("\xa0", " ")  # normalize non-breaking spaces
    soup = BeautifulSoup(html, 'html.parser')

    results = {}

    results['url'] = url
    results['movie_title'] = soup.select_one(
        '.security.movie + h1').text.strip()
    results['symbol'] = get_data_column('Symbol', soup)
    results['status'] = get_data_column('Status', soup)
    results['ipo_date'] = get_data_column('IPO Date', soup)
    results['genre'] = get_data_column('Genre', soup)
    results['mpaa_rating'] = get_data_column('MPAA Rating', soup)
    results['phase'] = get_data_column('Phase', soup)
    results['release_date'] = get_data_column('Release Date', soup)
    results['release_pattern'] = get_data_column('Release Pattern', soup)
    results['gross'] = get_data_column('Gross', soup)
    results['theaters'] = get_data_column('Theaters', soup)

    # Collapse newlines and runs of whitespace into single spaces
    results['description'] = re.sub(
        r'\s+', ' ',
        soup.find('h3', string="Description").findNext('p').text.strip())

    results['distributor'] = soup.find(
        'h4', string="Distributor").find_next_sibling('p') or ""

    if results['distributor'] != "":
        results['distributor'] = results['distributor'].text.strip()

    cast = soup.find(
        'h4', string="Cast").find_next_sibling('ul').find_all('li')

    results['cast'] = []

    for index, actor in enumerate(cast):
        name = re.sub(r'\s\(.*\)', '', actor.text.strip())
        name = name.replace("\n", "")

        if name:
            results['cast'].append({
                'name': name,
                'starbond_symbol': actor.find('a').text.strip(),
                'billing': index
            })

    directors = soup.find(
        'h4', string="Director").find_next_sibling('ul').find_all('li')

    results['directors'] = []

    for index, director in enumerate(directors):
        name = re.sub(r'\s\(.*\)', '', director.text.strip())
        name = name.replace("\n", "")

        if name:
            results['directors'].append({
                'name': name,
                'starbond_symbol': director.find('a').text.strip(),
                'billing': index
            })

    last_price = list(soup.find('p', {'class': 'value'}).strings)
    results['last_price'] = last_price[0]

    holdings_summary = soup.select('.holdings_summary span')

    results['shares_held_long'] = int(holdings_summary[0].text.strip().replace(
        ",", ""))
    results['shares_held_short'] = int(
        holdings_summary[1].text.strip().replace(",", ""))
    results['trading_volume_today'] = int(
        holdings_summary[2].text.strip().replace(",", ""))

    results['delist_date'] = get_data_column('Delist Date', soup)

    return results


urls = [
    'https://www.hsx.com/security/view/FH119',
    'https://www.hsx.com/security/view/ASTRA',
    'https://www.hsx.com/security/view/SWAR2',
    'https://www.hsx.com/security/view/MISS6',
    'https://www.hsx.com/security/view/DNDGN',
    'https://www.hsx.com/security/view/BIOS'
]

for url in urls:
    pprint(fetch_page(url))

I will test this after testing CyteBode's submission and compare the two.
armalcolite 29 days ago

Your solution crashes when the list of actors is empty (try it with https://www.hsx.com/security/view/FH119). This is the same bug I had to fix ;).
CyteBode 29 days ago

Thanks, fixed it.
weslly 29 days ago
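
For reference, the crash both solvers hit came from blank <li> placeholders on pages with no cast listed; the essence of the fix is to skip those entries, e.g. (hypothetical helper, both solutions now do the equivalent inline):

def nonempty_credits(ul):
    # Collect only non-blank entries from a <ul class="credit">, skipping
    # the empty placeholder <li> that appears when a movie has no cast.
    return [li.get_text(strip=True) for li in ul.find_all('li')
            if li.get_text(strip=True)]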
Winning solution

This is a quick and dirty solution which might break with some pages, but it works fine with the provided list of URLs.

Requirements

  • Python 3
  • bs4
  • requests
  • html5lib

I used requests to easily make the GET requests, but it could be replaced with urllib.
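
For example, the same GET with a spoofed User-Agent using only the standard library might look roughly like this (a sketch; assumed equivalent, untested):

import urllib.request
import bs4

url = 'https://www.hsx.com/security/view/ASTRA'
req = urllib.request.Request(url, headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) "
                  "Gecko/20100101 Firefox/40.1"
})
with urllib.request.urlopen(req) as f:
    html = f.read().decode('utf-8')
soup = bs4.BeautifulSoup(html, 'html5lib')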

I needed to use html5lib as a parser instead of lxml or html.parser because the latter two didn't like the website's HTML.

Script

from collections import OrderedDict
import itertools

import bs4
import requests


def parse_crew(ul):
    # Parse a <ul class="credit"> into a list of name/symbol dicts,
    # skipping the blank placeholder <li> found on cast-less pages.
    lst = []
    n = 0
    for li in ul.find_all("li"):
        string = li.find("p")
        if string is None:
            continue
        string = string.text.strip()
        if string == "":
            continue

        # Entries look like "Name (SYMBOL)"
        name, symbol = string.split(" (")
        name = name.lstrip()
        symbol = symbol.strip().rstrip(")")

        lst.append({"name": name, "billing": n, "starbond_symbol": symbol})
        n += 1
    return lst


def scrape_page(url):
    ret = OrderedDict({
        "url": url
    })

    response = requests.get(url, headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) "
                      "Gecko/20100101 Firefox/40.1"
    })
    soup = bs4.BeautifulSoup(response.text, "html5lib")

    whitebox = soup.find("div", class_="whitebox_content")

    security_data = whitebox.find("div", class_="security_data")
    ret["movie_title"] = security_data.find("h1").text.strip()

    left, right = security_data.find_all("div", class_="data_column")

    left_column = left.find("table")
    left_trs = left_column.find_all("tr")
    for i, name in enumerate(["symbol", "status", "ipo_date", "genre",
                              "mpaa_rating"]):
        ret[name] = left_trs[i].find_all("td")[1].text.strip()

    right_column = right.find("table")
    right_trs = right_column.find_all("tr")
    for i, name in enumerate(["phase", "release_date", "release_pattern",
                              "gross", "theaters"]):
        ret[name] = right_trs[i].find_all("td")[1].text.strip()

    ret["description"] = whitebox.find("p").text.strip()

    credits = (whitebox.find("div", class_="inner_columns")
                       .find("div", class_="column"))

    ret["distributor"] = credits.find("p").text.strip()

    directors, cast = credits.find_all("ul", class_="credit")
    ret["directors"] = parse_crew(directors)
    ret["cast"] = parse_crew(cast)

    right_column = soup.find("div", class_="four columns last")
    whiteboxes = right_column.find_all("div", "whitebox_content")
    top, bottom = itertools.islice(whiteboxes, 2)

    # contents[0] is the price text; contents[1] holds the delist date
    # (released movies) or the change today (active movies)
    price, other = top.find("p", class_="value").contents

    ret["last_price"] = price.strip()

    long_, short, volume = bottom.find_all("span")

    ret["shares_held_long"] = int(long_.text.replace(",", "").strip())
    ret["shares_held_short"] = int(short.text.replace(",", "").strip())
    ret["trading_volume_today"] = int(volume.text.replace(",", "").strip())

    if ret["status"] != "Active":
        ret["delist_date"] = other.text.strip()
    else:
        ret["delist_date"] = ""

    # Fix the double space in the dates
    for date in ["ipo", "release", "delist"]:
        date = "%s_date" % date
        ret[date] = ret[date].replace("  ", " ")

    return ret


def main():
    import json

    urls = [
        'https://www.hsx.com/security/view/ASTRA',
        'https://www.hsx.com/security/view/SWAR2',
        'https://www.hsx.com/security/view/MISS6',
        'https://www.hsx.com/security/view/DNDGN',
        'https://www.hsx.com/security/view/BIOS'
    ]

    for url in urls:
        dictionary = scrape_page(url)

        # Do whatever you want with it
        print(json.dumps(dictionary, indent=4))


if __name__ == '__main__':
    main()

Edit 1: Fixed the crash that occurred when the cast was empty, and made it so the double space in the dates gets removed.

Edit 2: Refactored the parsing of the crew lists into a function. Changed the director entry to directors whose value is now the list of directors. Changed the list format to be the same as the dict in the example (to go back to just getting a list, remove the calls to list_to_dict).

Edit 3: Changed the actors/directors list format to one that includes the billing order.

Note: I implemented the scraping of the director entry assuming there could possibly be movies with multiple directors (hence the accumulation into a list), but in the end I still only take the name of the first director to match the specs. I tried looking for movies with more than one director on the website (such as Avengers 4 by the Russo brothers), but it seems like they all get listed with no director when that's the case.
CyteBode 29 days ago

I am testing this now. Your suggested method for handling director in your comment is correct and an oversight on my part. You must be a movie buff ;)
armalcolite 29 days ago

All right, do you want me to change the script so the directors entry is the whole list? Also, do you want the cast list to follow the {"actor1": ..., "actor2": ..., ...} format or just a list as I did?
CyteBode 29 days ago

I think you've handled the cast list correctly. My original thought was that the order of the actors connotes billing order (i.e. biggest star first). I actually do believe this is the meaning of the order of actors listed. However, upon reflection, I'm not sure that the data structure {"actor1": ..., "actor2": ..., ...} is very wise. Maybe we could handle it like [{'name': 'Channing Tatum', 'billing': 0, 'starbond_symbol': 'CTATU'}, {'name': 'Ice Cube', 'starbond_symbol': 'ICUBE', 'billing': 1}, {'name': 'Jonah Hill', 'starbond_symbol': 'JHILL', 'billing': 2}]? In terms of directors, yes, it would be awesome if you could change the script so the directors entry is a whole list.
armalcolite 29 days ago

Thanks for awarding me the bounty! I'm sorry I didn't reply to your last comment earlier; I didn't receive any notification for some reason. I've changed the list format to how you want it. However, I'm pretty sure the actors are just shown in alphabetical order of their first names, not in billing order...
CyteBode 28 days ago

Awesome. Thank you!
armalcolite 28 days ago