Scrape Earnings Conference Call Transcripts

Scrape Earnings Conference Call Transcripts from seekingalpha.com

I am doing a research project and need to build a database of earnings conference call transcripts. This data will not be used for profit.

Deliverables

  • Python code which can scrape transcript documents.
  • A set of UTF-8 encoded text files (.txt), one per earnings call transcript, containing the transcript text and article metadata. I estimate these will number approximately 135,000.

Conditions

  • This is not a time-sensitive application; the code can take its time scraping. A window of up to 24 hours to scrape all results is acceptable.
  • The solution should use only Python 3.x and only free, open-source libraries. I will tip extra if only scrapy is used, but other packages and frameworks will be accepted.

Tutorial

The seed page is https://seekingalpha.com/earnings/earnings-call-transcripts. This page contains a list of links, and the documents those links point to are the target data. The list of links is paginated into ~4,500 pages, with 30 links per page, which means approximately 135,000 documents need to be scraped.

Following each link in the list leads to an earnings call transcript. The transcripts are often paginated into multiple subdocuments. This pagination can be avoided by appending ?part=single to the end of the URL. It is possible this functionality is only enabled by creating a trial "pro" account, but I haven't confirmed this.
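To make the URL shapes concrete, here is a minimal sketch (no requests are made; the article slug is hypothetical and the page count is my estimate):

BASE = "https://seekingalpha.com/earnings/earnings-call-transcripts"

# Index pages: the list of links is paginated into roughly 4,500 pages of 30 links each.
index_urls = ["%s/%d" % (BASE, page) for page in range(1, 4501)]

def single_page(article_url):
    # Appending ?part=single *may* collapse a multi-page transcript into one
    # document; this might only work with a trial "pro" account.
    return article_url + "?part=single"

print(single_page("https://seekingalpha.com/article/1234567-example-co-q1-earnings-call-transcript"))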

I don't want any information from the site banner, and I do not want information from the advertisements and links in the right and left margins. I do not want anything from after the transcript ends: no social media links at the end of the document, and nothing from the comments section.

Here are images that give some guidance on the information I want from each transcript page. Areas surrounded by a red rectangle are information I DO NOT want; areas surrounded by a green rectangle are information I DO want (a selector sketch follows the images):

Start of transcript

https://s22.postimg.cc/g4oix4xgd/start_of_page_guidance.png

End of transcript

https://s22.postimg.cc/t8u39tx7x/end_of_page_guidance.png
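In Scrapy terms, restricting extraction to the green areas might look like the sketch below. The #a-body selector matches the accepted solution further down; the p::text part assumes the transcript paragraphs are plain <p> elements, so treat both as assumptions that may need adjusting if the markup changes.

def extract_transcript(response):
    # Select only the article body; the banner, side rails, social links and
    # comments live outside this element and are never selected.
    paragraphs = response.css("div#content-rail article #a-body p::text").extract()
    return "\n".join(p.strip() for p in paragraphs if p.strip())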

The site appears to take anti-scraping actions based on at least the following signals (and certainly others not listed here):

  • frequency of page loads
  • legitimacy of user agent
  • header differences in requests

Also, as I mentioned previously, the site may require logging in with a registered trial "pro" account in order to access pages and/or avoid captchas or 403s.
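For solvers: the first and third signals map onto ordinary Scrapy settings. A hedged sketch follows; the values are guesses, not something I have tested against the site.

# Hedged settings sketch: slow down requests and send browser-like headers.
CUSTOM_SETTINGS = {
    'DOWNLOAD_DELAY': 4,               # pause between requests to the same domain
    'RANDOMIZE_DOWNLOAD_DELAY': True,  # jitter the delay (0.5x to 1.5x)
    'AUTOTHROTTLE_ENABLED': True,      # back off automatically when responses slow down
    'DEFAULT_REQUEST_HEADERS': {       # keep headers close to what a real browser sends
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
    },
}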

Here are links to a few examples of the pages from which I want data scraped:

awarded to kostasx


1 Solution


Here's what I've come up with:

import scrapy
from urllib.parse import urlparse
from slugify import slugify

# RANDOMIZE USER AGENTS ON EACH REQUEST:
import random
# SRC: https://developers.whatismybrowser.com/useragents/explore/
user_agent_list = [
   #Chrome
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
    'Mozilla/5.0 (Windows NT 5.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    # Internet Explorer (MSIE / Trident)
    'Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1)',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)',
    'Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (Windows NT 6.2; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0)',
    'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)',
    'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)'
]
debug_mode = True

class QuotesSpider(scrapy.Spider):

    name = "quotes"
    custom_settings = {
            # 'LOG_LEVEL': 'CRITICAL', # 'DEBUG'
            # 'LOG_ENABLED': False,
            'DOWNLOAD_DELAY': 4 # seconds between requests; 0.25 == 250 ms of delay, 1 == 1000 ms of delay, etc.
    }

    def start_requests(self):
        # GET LAST INDEX PAGE NUMBER
        urls = [ 'https://seekingalpha.com/earnings/earnings-call-transcripts/9999' ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        data = response.css("#paging > ul.list-inline > li:last-child a::text")
        last_page = data.extract()
        last_page = int(last_page[0])
        for x in range(0, last_page+1):
            # DEBUGGING: CHECK ONLY FIRST ELEMENT
            if debug_mode and x > 0:
                break
            url = "https://seekingalpha.com/earnings/earnings-call-transcripts/%d" % (x)
            yield scrapy.Request(url=url, callback=self.parse_link)

    # SAVE CONTENTS TO AN HTML FILE 
    def save_contents(self, response):
        data = response.css("div#content-rail article #a-body")
        data = data.extract()
        url = urlparse(response.url)
        url = url.path
        filename = slugify(url) + ".html"
        # Write as UTF-8 explicitly; the with-block closes the file for us.
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(data[0])

    def parse_link(self, response):
        print("Parsing results for: " + response.url)
        links = response.css("a[sasource='earnings-center-transcripts_article']")
        for index, link in enumerate(links):
            # DEBUGGING MODE: Parse only first link
            if debug_mode and index > 0:
                break
            url = link.xpath('@href').extract()
            # Build an absolute URL from the relative article path.
            data = urlparse(response.url)
            data = data.scheme + "://" + data.netloc + url[0]  # .scheme, .path, .params, .query
            user_agent = random.choice(user_agent_list)
            print("======------======")
            print("Getting Page:")
            print("URL: " + data)
            print("USER AGENT: " + user_agent)
            print("======------======")
            request = scrapy.Request(data,callback=self.save_contents,headers={'User-Agent': user_agent})
            yield request

Before running the scrapy crawl quotes command, open settings.py and add the following line:

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'

In order to run the full script, you will need to change the debug_mode variable to False.

In case you want to run scrapy programmatically:
Create a file named earnings.py and paste the following:

#!/usr/bin/env python3
import scrapy
from scrapy.crawler import CrawlerProcess   # Programmatically execute scrapy
from urllib.parse import urlparse
from slugify import slugify

# RANDOMIZE USER AGENTS ON EACH REQUEST:
import random
# SRC: https://developers.whatismybrowser.com/useragents/explore/
user_agent_list = [
   #Chrome
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
    'Mozilla/5.0 (Windows NT 5.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    # Internet Explorer (MSIE / Trident)
    'Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1)',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)',
    'Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (Windows NT 6.2; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0)',
    'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)',
    'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)'
]
debug_mode = True

class QuotesSpider(scrapy.Spider):

    name = "quotes"
    custom_settings = {
            # 'LOG_LEVEL': 'CRITICAL', # 'DEBUG'
            'LOG_ENABLED': False,
            'DOWNLOAD_DELAY': 4 # 0.25 == 250 ms of delay, 1 == 1000ms of delay, etc.
    }

    def start_requests(self):
        # GET LAST INDEX PAGE NUMBER
        urls = [ 'https://seekingalpha.com/earnings/earnings-call-transcripts/9999' ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_last_page)

    def parse_last_page(self, response):
        data = response.css("#paging > ul.list-inline > li:last-child a::text")
        last_page = data.extract()
        last_page = int(last_page[0])
        for x in range(0, last_page+1):
            # DEBUGGING: CHECK ONLY FIRST ELEMENT
            if debug_mode and x > 0:
                break
            url = "https://seekingalpha.com/earnings/earnings-call-transcripts/%d" % (x)
            yield scrapy.Request(url=url, callback=self.parse)

    # SAVE CONTENTS TO AN HTML FILE 
    def save_contents(self, response):
        data = response.css("div#content-rail article #a-body")
        data = data.extract()
        url = urlparse(response.url)
        url = url.path
        filename = slugify(url) + ".html"
        # Write as UTF-8 explicitly; the with-block closes the file for us.
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(data[0])

    def parse(self, response):
        print("Parsing results for: " + response.url)
        links = response.css("a[sasource='earnings-center-transcripts_article']")
        for index, link in enumerate(links):
            # DEBUGGING MODE: Parse only first link
            if debug_mode and index > 0:
                break
            url = link.xpath('@href').extract()
            # Build an absolute URL from the relative article path.
            data = urlparse(response.url)
            data = data.scheme + "://" + data.netloc + url[0]  # .scheme, .path, .params, .query
            user_agent = random.choice(user_agent_list)
            print("======------======")
            print("Getting Page:")
            print("URL: " + data)
            print("USER AGENT: " + user_agent)
            print("======------======")
            request = scrapy.Request(data,callback=self.save_contents,headers={'User-Agent': user_agent})
            yield request

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
})
c.crawl(QuotesSpider)
c.start()

Notes: You will need to install the slugify package:
pip install python-slugify
I've tested this with debug_mode set to True, which scrapes only the first link from the first results page.
To scrape all results pages and all links inside them, just set the debug_mode variable to False.

I've set DOWNLOAD_DELAY to 4 seconds so the scraper does not get blocked for rapid consecutive requests, and I've also added a user-agent randomizer to further reduce the chance of being banned by the target site.

The script saves each link's content to a file named after a slugified version of the URL.
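For example (the article path is hypothetical, and the exact slug is my assumption about python-slugify's output):

from urllib.parse import urlparse
from slugify import slugify

# Hypothetical transcript URL, for illustration only.
url = "https://seekingalpha.com/article/1234567-example-co-q1-earnings-call-transcript"
filename = slugify(urlparse(url).path) + ".html"
print(filename)  # something like: article-1234567-example-co-q1-earnings-call-transcript.html

If the .txt deliverable is needed instead of raw HTML, the same save_contents callback could join the ::text nodes of #a-body and write that string to a .txt file.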

I start with the URL https://seekingalpha.com/earnings/earnings-call-transcripts/9999, since this returns the pagination links that include the last page (~4,983), and then loop over all the result pages from 0 up to that number.

Further notes on avoiding bans, taken from the Scrapy documentation:

Here are some tips to keep in mind when dealing with these kinds of sites:

  • Rotate your user agent from a pool of well-known ones from browsers (google around to get a list of them); see the middleware sketch after this list.
    REFS: https://www.scrapehero.com/how-to-fake-and-rotate-user-agents-using-python-3/

  • Disable cookies (see COOKIES_ENABLED) as some sites may use cookies to spot bot behaviour

  • Use download delays (2 or higher). See DOWNLOAD_DELAY setting.

  • If possible, use Google cache to fetch pages, instead of hitting the sites directly

  • Use a pool of rotating IPs. For example, the free Tor project or paid services like ProxyMesh. An open source alternative is scrapoxy, a super proxy that you can attach your own proxies to.

  • Use a highly distributed downloader that circumvents bans internally, so you can just focus on parsing clean pages. One example of such downloaders is Crawlera
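As a sketch of how the first three tips could be wired into this project, here is a downloader middleware that rotates user agents, plus the matching settings; the project/module path in the settings comment is hypothetical and the values are suggestions, untested against the site.

import random

# A short stand-in for the user_agent_list defined in the spider above,
# kept here so the sketch is self-contained.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
]

class RandomUserAgentMiddleware:
    # Downloader middleware alternative to setting headers on every Request
    # inside the spider: each outgoing request gets a random user agent.
    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENTS)

# settings.py additions:
# COOKIES_ENABLED = False
# DOWNLOAD_DELAY = 4
# DOWNLOADER_MIDDLEWARES = {
#     'yourproject.middlewares.RandomUserAgentMiddleware': 400,
# }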

Let me know what you think.

You will need to chmod u+x the script file (Linux, Mac) and then execute it by running: $ ./earnings.py. Additionally, this package could be used to automate the fake user agent headers: https://pypi.org/project/fake-useragent/
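A rough sketch of that, assuming fake-useragent is installed (pip install fake-useragent):

from fake_useragent import UserAgent

ua = UserAgent()
print(ua.random)  # a different real-world user agent string on each access
# In the spider, ua.random could stand in for random.choice(user_agent_list).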
kostasx 5 months ago
So this should not be run in the traditional scrapy manner i.e. scrapy crawl quotes?
armalcolite 5 months ago
Yes, you can. I have updated the script. You can use the first version to run it via the scrapy crawl command.
kostasx 5 months ago
I am testing this now. For the 1st option, why does user agent need to be set in settings.py if you are rotating user agents in earnings.py?
armalcolite 5 months ago
Just a leftover; I added the rotating user agent strings at the end. Also, I think the site rejects unknown agents, even for the first call.
kostasx 5 months ago