
Scrape Earnings Conference Call Transcripts from seekingalpha.com
I am doing a research project and need to build a database of earnings conference call transcripts. This data will not be used for profit.

Deliverables
- a working version of the following code: https://github.com/RCJansonVTFL/SeekingAlphaWebScrape/blob/main/MGMT%20Final%20w%20Entities_Compiled_Final.ipynb
- Python code which can scrape transcript documents
- A set of UTF-8 encoded text files (.txt), one per earnings call transcript, containing the transcript text and article metadata (sketched below). I estimate these will number approximately 135,000.
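
As a minimal sketch of the intended output format, assuming each scraped record carries the metadata and transcript text as plain strings (the record fields and file naming here are illustrative, not prescribed):

// Sketch: write one UTF-8 .txt file per transcript.
// ASSUMPTION: the "id", "meta", and "text" fields are hypothetical names;
// adapt them to however the scraper actually stores its records.
const fs = require("fs");

fs.mkdirSync("transcripts", { recursive: true }); // output directory

function writeTranscript(record) {
  const body = `${record.meta}\n\n${record.text}`;
  fs.writeFileSync(`transcripts/${record.id}.txt`, body, "utf-8");
}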

Conditions
This is not a time-sensitive application; the code can take its time in scraping. A window of up to 24 hours to scrape all results is acceptable.
I think the old code no longer works due to anti-scraping countermeasures. Note that you can also use a Seeking Alpha premium account, since the site offers a 14-day premium free trial. I also have access to a premium account whose credentials I could fill into the Python code.

Tutorial
The seed page is https://seekingalpha.com/earnings/earnings-call-transcripts. This page contains a list of links; the documents those links point to are the target data. The list is paginated into ~4,500 pages with 30 links per page, which means approximately 135,000 documents need to be scraped.
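
As a rough sketch, generating the list-page URLs might look like this; the ?page=N query parameter is an assumption about how the site exposes pagination and should be verified against the live list:

// Sketch: build candidate URLs for every page of the transcript list.
// ASSUMPTION: pagination uses a "?page=N" query parameter; confirm this
// against the live site before relying on it.
const SEED = "https://seekingalpha.com/earnings/earnings-call-transcripts";
const TOTAL_PAGES = 4500; // ~4,500 pages x 30 links/page ≈ 135,000 documents

function listPageUrls() {
  const urls = [];
  for (let page = 1; page <= TOTAL_PAGES; page++) {
    urls.push(page === 1 ? SEED : `${SEED}?page=${page}`);
  }
  return urls;
}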

Following each link in the list leads to an earnings call transcript. The transcripts are often paginated into multiple subdocuments. This pagination can be avoided by appending ?part=single to the end of the URL. It is possible this functionality is only enabled by creating a trial "pro" account, but I haven't confirmed this.
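
As a small sketch, normalizing each transcript link to its single-page variant could be done like this (this just encodes the ?part=single behavior described above):

// Sketch: force the single-page view of a transcript URL by setting the
// "part=single" query parameter (works whether or not the URL already
// has a query string).
function singlePageUrl(articleUrl) {
  const url = new URL(articleUrl);
  url.searchParams.set("part", "single");
  return url.toString();
}

// e.g. singlePageUrl("https://seekingalpha.com/article/3279-overstock-q1-2004-earnings-conference-call-transcript-ostk")
// appends "?part=single" to the URL.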

I don't want any information from the site banner, and I do not want information from advertisements or links in the right- and left-hand margins. I do not want anything from after the transcript ends: no social media links at the end of the document, and nothing from the comments section.

Here are images that give some guidance on the information I want from each transcript page. Areas surrounded by a red rectangle are information I DO NOT want; areas surrounded by a green rectangle are information I DO want:

Start of transcript

https://s22.postimg.cc/g4oix4xgd/start_of_page_guidance.png

End of transcript

https://s22.postimg.cc/t8u39tx7x/end_of_page_guidance.png

The site appears to take anti-scraping actions based on the following signals (and likely others not listed here):

- frequency of page loads
- legitimacy of the user agent
- header differences between requests
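
As an illustration of how a scraper might soften these signals, here is a hedged sketch using Puppeteer's header APIs and the random-useragent package; the specific delay values are guesses, not verified thresholds:

// Sketch: randomize fingerprint signals and pace the page loads.
// ASSUMPTION: a random real-browser user agent plus jittered delays
// reduces (but does not eliminate) blocks; tune against real responses.
const randomUseragent = require("random-useragent");

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function preparePage(page) {
  // Replace the headless default with a plausible real-browser user agent.
  await page.setUserAgent(randomUseragent.getRandom());
  // Send a realistic Accept-Language header with every request.
  await page.setExtraHTTPHeaders({ "Accept-Language": "en-US,en;q=0.9" });
}

async function politeGoto(page, url) {
  // Jitter load frequency: wait 3-8 seconds before each navigation.
  await sleep(3000 + Math.random() * 5000);
  await page.goto(url, { waitUntil: "domcontentloaded" });
}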

Also, as mentioned above, the site may require logging in with a trial "pro" registered account in order to access pages and/or avoid captchas or 403s.
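
If login does turn out to be necessary, a Puppeteer sketch could look like the following; the login URL and all three selectors are placeholders that must be checked against the real login form:

// Sketch: authenticate with a trial "pro" account before scraping.
// ASSUMPTIONS: the /login path and the #email, #password, and submit
// selectors are hypothetical; inspect the actual page and adjust.
async function login(page, email, password) {
  await page.goto("https://seekingalpha.com/login", {
    waitUntil: "domcontentloaded",
  });
  await page.type("#email", email, { delay: 50 });       // hypothetical selector
  await page.type("#password", password, { delay: 50 }); // hypothetical selector
  await Promise.all([
    page.waitForNavigation(),
    page.click('button[type="submit"]'),                 // hypothetical selector
  ]);
}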

Here are links to a few examples of the pages from which I want data scraped:

https://seekingalpha.com/article/4188882-wipros-wit-ceo-abidali-neemuchwala-q1-2019-results-earnings-call-transcript?part=single
https://seekingalpha.com/article/3279-overstock-q1-2004-earnings-conference-call-transcript-ostk
https://seekingalpha.com/article/4188876-skanska-abb-sksbf-ceo-jim-craigie-q3-2014-results-earnings-call-transcript?part=single

I will also add a big tip if the code works properly. Thank you!

All the article links are already with me; now I just need the articles themselves. Do you mind using Node.js instead of Python?
ganganimaulik 4 months ago
Hello @bencol. I am the author of the old version. I also have a new version, which does work. Feel free to DM me if you are interested!
VladimirMikulic 4 months ago
Hi @VladimirMikulic, thank you for your comment. I sent you a DM.
bencol 4 months ago
Sent you an email :)
bencol 4 months ago


1 Solution


I have come up with a solution using Puppeteer. You can check the commented link to see how it works.
It also handles simple captcha security.

Right now I have fetched all the links and 1000 articles; you can download the output file here:
https://drive.google.com/file/d/1PvO22WaYyohrUGLFQqL-aauRmfkhVons/view?usp=sharing

all other code files can be found here:
https://drive.google.com/drive/folders/1fxtDAbGsGsOWe77UCOJyifFejlG6vduB?usp=sharing

const puppeteer = require("puppeteer-extra");
const { Cluster } = require("puppeteer-cluster");

const randomUseragent = require("random-useragent"); // imported for UA rotation (not wired up below)
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

const util = require("util");
const Datastore = require("nedb");
const db = new Datastore({ filename: "./linksfile", autoload: true });
// Promisify nedb's callback-based find() so it can be awaited below.
db.find = util.promisify(db.find);

// add stealth plugin and use defaults (all evasion techniques)
// const StealthPlugin = require("puppeteer-extra-plugin-stealth");
// puppeteer.use(StealthPlugin());

const AdblockerPlugin = require("puppeteer-extra-plugin-adblocker");
const adblocker = AdblockerPlugin({
  blockTrackers: true, // default: false
});
puppeteer.use(adblocker);

// install-mouse-helper (from the shared folder) visualizes mouse movement
// in the non-headless browser window.
const { installMouseHelper } = require("./install-mouse-helper");
let skips = 0; // counts pages checked for a captcha
async function captchaResolve(page) {
  await installMouseHelper(page);

  // PerimeterX block pages use this title; anything else means no captcha.
  if ((await page.title()) != "Access to this page has been denied.")
    return console.log(
      "captcha skipped",
      skips++,
      await page.title(),
      await page.url()
    );
  console.log("captcha solving", skips++, await page.title(), await page.url());
  await page.waitForSelector("#px-captcha");
  const example = await page.$("#px-captcha");
  const bounding_box = await example.boundingBox();
  await sleep(2000);
  await page.mouse.move(
    bounding_box.x + bounding_box.width / 5,
    bounding_box.y + bounding_box.height / 2
  );
  console.log(bounding_box);
  await page.mouse.down();
  await sleep(6000);
  await page.mouse.up();

  await page.waitForNavigation();
}

(async () => {
  const threads = 5;

  let links = await db.find({});

  // CONCURRENCY_BROWSER gives each worker its own isolated browser instance.
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_BROWSER,
    maxConcurrency: threads,
    puppeteerOptions: {
      headless: false,
      defaultViewport: null,
      args: [
        "--start-maximized", // you can also use '--start-fullscreen'
      ],
    },
    puppeteer,
  });

  await cluster.task(async ({ page, data }) => {
    await page.goto(data.url, { waitUntil: "domcontentloaded" });
    await captchaResolve(page);
    console.log("waiting for loader to disappear");
    await page.waitFor(
      () =>
        !document.querySelector(
          '[data-test-id="card-container"] [data-test-id="loader"]'
        )
    );
    console.log("waiting for card-container");
    articleElement = await page.waitForSelector(
      '[data-test-id="card-container"]'
    );
    console.log("finding html");
    let articleHtml = await page.evaluate(
      (articleElement) => articleElement.innerHTML,
      articleElement
    );
    console.log("updating db for html");
    db.update(
      { link: data.url },
      { $set: { data: articleHtml } },
      { multi: true }
    );
    console.log(data.url, "done");
  });

  // Queue every article that has not been scraped yet.
  for (let i = 0; i < links.length; i++) {
    let link = links[i];
    if (!link.data) {
      cluster.queue({
        url: `${link.link}`,
      });
    }
  }

  // Wait for all queued pages to finish, then shut the browsers down.
  await cluster.idle();
  await cluster.close();
})();
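
For reference, a plausible way to run this (assuming the script above is saved as index.js next to the linksfile database and the install-mouse-helper helper from the shared folder): install the dependencies with npm install puppeteer puppeteer-extra puppeteer-cluster puppeteer-extra-plugin-adblocker random-useragent nedb, then start it with node index.js. Note that headless is set to false, so the browsers need a display to open in.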
Hi @ganganimaulik, Python would be preferred, but if your solution works and is scalable to all 135k+ articles then it is awesome. Could you please add step-by-step instructions on how to apply your code?
bencol 4 months ago
In addition, do you see a way to list all scraped transcripts in an index file (xls) providing ticker, company name, date, fiscal quarter, and time for each transcript (similar to the code on GitHub)?
bencol 4 months ago
My solution is multi-threaded, so it can be scaled up as far as required by adding more proxies. Right now I am using free proxies available on the internet and it's doing one article per 4 seconds. I will upload the data file just before the end of this contest so you can check the data. Regarding the xls creation, that can be done as a separate project.
ganganimaulik 4 months ago
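
For context, routing each launched browser through a proxy with puppeteer-cluster could look roughly like this; the proxy address below is a placeholder, and real use would rotate a pool of working proxies:

// Sketch: pass a proxy to every browser via Chromium's --proxy-server flag.
// PROXY is a placeholder address; rotate real proxies per worker in practice.
const PROXY = "http://127.0.0.1:8080";
const cluster = await Cluster.launch({
  concurrency: Cluster.CONCURRENCY_BROWSER,
  maxConcurrency: 5,
  puppeteerOptions: {
    headless: false,
    args: [`--proxy-server=${PROXY}`],
  },
});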
Okay, could you please give instructions on how to run your set of code in Visual Studio Code?
bencol 4 months ago