Scrape Crunchbase for Acquisitions section

Crunchbase has a small section for acquisitions a company has made:

Sample URL for General Motors:

I need a command-line script for *nix (preferably Python or bash) that will print this data to the command line in a readable and parseable format.

Crunchbase has an API (https://data.crunchbase.com/docs); it is probably better than scraping.
iurisilvio 4 months ago
The site is protected against scraping, so although the scraping itself would be easy, getting the HTML is significantly harder than it normally is. One way I can think of doing it is through browser automation, but it wouldn't be as convenient as a simple command line script.
CyteBode 4 months ago
I did not know that. I've increased the bounty to reflect the difficulty. Anything that can get the job done and get the output back to the command line would work; instrumenting something like PhantomJS, maybe...?
azod 4 months ago
Would any method of obtaining a list of subsidiaries, not just from Crunchbase, be acceptable?
jduplessis294 4 months ago
Crunchbase is using Distil Networks to protect their site from bots and scraping, and Distil's algorithm seems to use some sort of machine learning to detect them. In fact, I don't think even browser automation would work in this scenario.
Codeword 4 months ago
And I believe their acquisitions API is paid.
jduplessis294 4 months ago
Yeah, in this scenario I suggest using their paid API, or some other API that's free of charge.
Codeword 4 months ago
awarded to Wuddrum


3 Solutions


A Crunchbase Pro account can export this data to Excel.

Sign up for a Crunchbase Pro account: https://about.crunchbase.com/products/pricing/
Get the Excel export: https://api.crunchbase.com/v3.1/excel_export/crunchbase_export.xlsx?user_key=user_key

More details about how it works: https://data.crunchbase.com/docs/excel-export
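
For example, a minimal sketch of downloading the export from the command line (the user_key value below is a placeholder for your own Pro API key):

#!/usr/bin/env python3
# Minimal sketch: download the Crunchbase Pro Excel export.
# Assumes a valid Pro user_key ('user_key' below is a placeholder).
import urllib.request

USER_KEY = 'user_key'
url = ('https://api.crunchbase.com/v3.1/excel_export/'
       'crunchbase_export.xlsx?user_key=' + USER_KEY)

with urllib.request.urlopen(url) as response:
  with open('crunchbase_export.xlsx', 'wb') as f:
    f.write(response.read())
print('Saved crunchbase_export.xlsx')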


This does not use Crunchbase (but it is similar)

It uses an API called CorpWatch. The script uses Node; if that's not suitable, let me know and I can port it to Python or bash (a rough sketch of such a port follows the usage examples below).

Do an npm install:

npm i --save request-promise request

then save this to corpwatch.js:

const request = require('request-promise')

// Look up the company's CorpWatch ID by name.
request('http://api.corpwatch.org/companies.json?company_name=' + encodeURIComponent(process.argv[2]))
.then(content => {
  const list = JSON.parse(content);
  const id = list.result.companies[Object.keys(list.result.companies)[0]].cw_id;
  console.log(id);
  // Fetch the company's subsidiaries (children).
  return request(`http://api.corpwatch.org/companies/${id}/children.json?limit=5000`);
})
.then(content => {
  const children = JSON.parse(content).result.companies;
  for (let obj in children) {
    console.log(children[obj].company_name);
  }
})

Run it like this:

node corpwatch.js "general electric co"

or

node corpwatch.js "berkshire hathaway inc"
Is this suitable? I can improve if not
jduplessis294 3 months ago
Winning solution

Here's my solution in Python 3.x:

#!/usr/bin/env python3
import random
import time
import json
import zlib
import urllib.parse
import urllib.request
import http.client
from argparse import ArgumentParser

# Set these to 1 to dump raw HTTP traffic for debugging.
http.client.HTTPConnection.debuglevel = 0
http.client.HTTPSConnection.debuglevel = 0

parser = ArgumentParser()
parser.add_argument(dest='organization', help='organization to fetch acquisitions for')
parser.add_argument('-t', '--throttle', dest='throttle', action='store_true',
                    help='enable http request throttling')
parser.add_argument('-m', '--throttle_min', dest='throttle_min', default=0.8, type=float,
                    metavar='SECONDS', help='minimum amount of seconds to throttle for')
parser.add_argument('-M', '--throttle_max', dest='throttle_max', default=3.7, type=float,
                    metavar='SECONDS', help='maximum amount of seconds to throttle for')
args = parser.parse_args()

throttle_min = int(args.throttle_min * 100)
throttle_max = int(args.throttle_max * 100)

def throttle():
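  # Sleep for a random interval between throttle_min and throttle_max seconds.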
  throttle_amount = random.randint(throttle_min, throttle_max) / 100
  time.sleep(throttle_amount)

def decode_response(response):
  # Transparently decompress gzip/deflate-encoded responses.
  if response.headers['Content-Encoding'] in ('gzip', 'deflate'):
    return zlib.decompress(response.read(), 15 + 32)
  return response.read()

def add_common_headers(request):
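  # Browser-like headers to help requests get past bot detection.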
  request.add_header('Accept', 'application/json, text/plain, */*')
  request.add_header('Accept-Encoding', 'gzip, deflate, br')
  request.add_header('Accept-Language', 'en-US,en;q=0.5')
  request.add_header('Cache-Control', 'no-cache')
  request.add_header('Connection', 'keep-alive')
  request.add_header('Pragma', 'no-cache')
  request.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0')
  request.add_header('X-Requested-With', 'XMLHttpRequest')

def fetch_organization(organization_query):
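  # Resolve the organization's name and permalink via Crunchbase's autocomplete endpoint.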
  organization_query = urllib.parse.quote_plus(organization_query)
  organization_search_url = 'https://www.crunchbase.com/v4/data/autocompletes?query=' + organization_query + '&collection_ids=organization.companies'
  request = urllib.request.Request(organization_search_url)
  add_common_headers(request)
  request.add_header('Referer', 'https://www.crunchbase.com/')
  response = urllib.request.urlopen(request)
  response_body = decode_response(response)
  data = json.loads(response_body)

  organization_name = data['entities'][0]['identifier']['value']
  organization_id = data['entities'][0]['identifier']['permalink']
  return (organization_name, organization_id)

def fetch_acquisitions(organization_id):
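  # Pull the organization's acquisitions_list card from Crunchbase's internal v4 API.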
  acquisitions_url = 'https://www.crunchbase.com/v4/data/entities/organizations/' + organization_id + '/overrides?field_ids=["identifier","layout_id","facet_ids","title","short_description","is_locked"]&card_ids=["acquisitions_list"]'
  acquisitions_body = {
    'card_lookups': [
      {
        'card_id': 'acquisitions_list',
        'limit': 10000
      }
    ]
  }
  json_body = json.dumps(acquisitions_body)
  json_body_bytes = json_body.encode('utf-8')
  request = urllib.request.Request(acquisitions_url)
  add_common_headers(request)
  request.add_header('Referer', 'https://www.crunchbase.com/organization/%s' % organization_id)
  request.add_header('Content-Type', 'application/json; charset=utf-8')
  request.add_header('Content-Length', len(json_body_bytes))
  response = urllib.request.urlopen(request, json_body_bytes)
  response_body = decode_response(response)
  data = json.loads(response_body)

  acquisitions = []
  for acquisition in data['cards']['acquisitions_list']:
    acquiree_name = acquisition['acquiree_identifier']['value']
    date = acquisition['announced_on']['value']
    transaction_name = acquisition['identifier']['value']
    if 'price' in acquisition:
      price_usd = acquisition['price']['value_usd']
    else:
      price_usd = 0

    acquisitions.append({
      'acquired_organization': acquiree_name,
      'date': date,
      'price_usd': price_usd,
      'transaction_name': transaction_name
    })

  return acquisitions

if args.throttle:
  throttle()
organization_name, organization_id = fetch_organization('"' + args.organization + '"')
if args.throttle:
  throttle()
acquisitions = fetch_acquisitions(organization_id)

print('Acquisitions for: %s' % organization_name)
print('Acquisition count: %i' % len(acquisitions))
for acquisition in acquisitions:
  print('Acquired Organization: %(acquired_organization)s, Date: %(date)s, Price(USD): %(price_usd)s, Transaction Name: %(transaction_name)s' % acquisition)

Usage:

$ ./crunchbaseacquisitions.py "General Electric"

I've also added a throttling mode in case you happen to hit their anti-scraping protection. I never managed to trigger it while testing the script, but throttling should help throw off their bot detection.

Throttling mode simply waits a random number of seconds before making an HTTP request.
You can turn it on with the -t argument. E.g.:

$ ./crunchbaseacquisitions.py "Tesla" -t

You can also provide custom min and max seconds for the throttle mode to use, with -m for min seconds and -M for max seconds. E.g.:

$ ./crunchbaseacquisitions.py "AT&T" -t -m 1.7 -M 5.5

Here's an example output:

Acquisitions for: Tesla
Acquisition count: 4
Acquired Organization: Perbix, Date: 2017-11-06, Price(USD): 0, Transaction Name: Perbix acquired by Tesla
Acquired Organization: Grohmann Engineering, Date: 2016-11-08, Price(USD): 0, Transaction Name: Grohmann Engineering acquired by Tesla
Acquired Organization: SolarCity, Date: 2016-06-22, Price(USD): 2600000000, Transaction Name: SolarCity acquired by Tesla
Acquired Organization: Riviera Tool LLC, Date: 2015-05-08, Price(USD): 0, Transaction Name: Riviera Tool LLC acquired by Tesla
Love this solution, getting an error though:
azod 3 months ago
Looks like the 405 code signals a captcha page. I'm looking into why it works perfectly fine on my home machine and how to fix it.
Wuddrum 3 months ago
I've updated the script to appear more browser-like. I managed to get the 405 code on my home connection, and it resumed working with this update. I also tested it on a fresh IP, and that also worked straight away. Why I think it worked at first for me even with the old script:
I believe my IP got whitelisted when I first explored the site using my default browser. Afterwards, the system recognized my already-whitelisted IP and allowed me to scrape the site with the script. So if you have the chance to browse the site normally from the IP you intend to run the script from, I'd suggest doing so (or even leaving it open for 20-30 minutes, so that their fingerprinting call is made several times).
Wuddrum 3 months ago
I also suggest using Firefox for that. Lastly, if the user agent of the browser you're using differs from the one specified in the script (search for User-Agent), it'd be very advisable to replace it with the one your browser uses.
You can view your user agent by visiting a user-agent checker site, or by running navigator.userAgent in your browser's console.
Wuddrum 3 months ago