Scrape BuiltWith Mobile endpoint for technologies from the command line

BuiltWith (https://builtwith.com) is a technology profiling website. They have a browser extension that returns:

  • Analytics and Tracking
  • Widgets
  • Frameworks
  • CDN
  • JavaScript Libraries
  • Advertising
  • SSL
  • etc.

I need a command-line script for *nix (preferably Python or Bash) that prints this output on the command line in a readable and parse-able format.

Here is the request for the extension:

GET /mobile.aspx?https://bountify.co/ HTTP/1.1
Host: builtwith.com
Connection: close
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36
Accept: */*
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.9
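
For reference, the request above can be rebuilt from Python roughly like this. It is only a minimal sketch using the standard library: build_request is a made-up helper name, and Accept-Encoding is left out on purpose because urllib does not decompress gzip/deflate responses by itself.

```python
from urllib.request import Request

# User-Agent copied from the extension request quoted above.
USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36")


def build_request(target):
    """Build (but do not send) the GET request for the mobile endpoint.

    build_request is a hypothetical helper name, not part of any library.
    """
    url = "https://builtwith.com/mobile.aspx?" + target
    headers = {
        "User-Agent": USER_AGENT,
        "Accept": "*/*",
        "Accept-Language": "en-US,en;q=0.9",
        # Accept-Encoding is deliberately omitted: urllib does not
        # decompress gzip/deflate bodies automatically.
    }
    return Request(url, headers=headers)


if __name__ == "__main__":
    req = build_request("https://bountify.co/")
    print(req.get_method(), req.full_url)
    for name, value in req.header_items():
        print("%s: %s" % (name, value))
```

Passing the returned object to urllib.request.urlopen() would actually send it; the sketch stops short of that so it can be inspected offline.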

Awarded to Wuddrum


5 Solutions


1) Get a free API key: https://api.builtwith.com/free-api

2) Edit buildwith.py:

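#!/usr/bin/env python
import json
import sys

import requests

# Replace the key= value below with your own free API key.
site = sys.argv[1]
url = 'https://api.builtwith.com/free1/api.json?key=37b9029c-aba0-4535-be6f-10a151027caa&lookup=' + site

r = requests.get(url)
# Emit real JSON (not a Python dict repr) so tools like jq can parse it.
print(json.dumps(r.json(), indent=2))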

3) $> chmod u+x buildwith.py
4) $> ./buildwith.py https://bountify.co

You must replace the key= portion in the script with the API key you will get when you register.

NOTES: The output is in JSON format. You can use a tool like jq to parse or filter the JSON output of the python script or implement some JSON manipulation methods in the script.
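
If you would rather filter inside the script than pipe to jq, something along these lines could work. It is only a sketch: find_values is a hypothetical helper, and the sample document and the "name" key are made up for illustration. They are not the real shape of the API response, which you should check for yourself.

```python
import json


def find_values(node, key):
    """Recursively collect every value stored under `key` in nested JSON."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            # Keep descending so deeper occurrences are found as well.
            found.extend(find_values(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_values(item, key))
    return found


if __name__ == "__main__":
    # Made-up sample, NOT the real API response shape.
    sample = json.loads('{"groups": [{"name": "cdn"}, {"name": "ssl"}]}')
    print(find_values(sample, "name"))
```

In the script above you would call it on r.json() with whichever key holds the values you care about.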


Here is my solution, using Python (2 or 3), requests and BeautifulSoup.

I wasn't exactly sure of what you needed in the output, so I put everything that seemed relevant.

The script

#!/usr/bin/python
import collections

from bs4 import BeautifulSoup
import requests


def builtWith(url):
    url = "https://builtwith.com/mobile.aspx?%s" % url

    req = requests.get(url)
    soup = BeautifulSoup(req.text, "html.parser")

    output = collections.OrderedDict()

    categories = []
    for category_ul in soup.find_all("ul", class_="nav-pills"):
        categories.append(category_ul.text.strip())

    for table in soup.find_all("table"):
        titles = []
        for title in table.find_all("tr", class_=None):
            titles.append(title.text.strip())

        descs = []
        for desc in table.find_all("tr", class_="id"):
            descs.append(desc.text.strip())

        output[categories.pop(0)] = ([(t, d) for t, d in zip(titles, descs)])

    return output


def prettyPrintBW(url):
    bw = builtWith(url)

    for category, technologies in bw.items():
        print(category)
        for technology, description in technologies:
            print("  %s" % technology)
            print("    %s" % description)
        print("")


if __name__ == '__main__':
    import sys

    if len(sys.argv) <= 1:
        prettyPrintBW("https://bountify.co")
    else:
        prettyPrintBW(sys.argv[1])

Sample output

Analytics and Tracking
  Airbrake
    Airbrake collects errors generated by other applications, and aggregates the results for review.
  Facebook Domain Insights
    This website contains tracking information that allows admins to see Facebook Insights out of Facebook to this domain.
  Google Analytics
    Google Analytics offers a host of compelling features and benefits for everyone from senior executives and advertising and marketing professionals to site owners and content developers.

Widgets
  Facebook Like Button
    The code to implement a Facebook Like Button on the page.
  Facebook Like
    Allows users to Like items they find on the web, similar to how you Like items within Facebook.
  Google Plus One Platform
    Google+ API functionality.

Ecommerce
  Cart Functionality
    The site has a link to a shopping cart which is not categorized under any of the cart technologies we track (custom implementation or not tracked yet).

Frameworks
  Ruby on Rails Token
    Ruby on Rails is an open-source web framework that is optimized for programmer happiness and sustainable productivity. Note that Ruby on Rails has two detection techniques and this is one of them.
  Ruby on Rails
    Ruby on Rails is an open-source web framework that is optimized for programmer happiness and sustainable productivity.
  Pusher
    Pusher is a realtime service that complements your existing server architecture.
  Heroku Vegur Proxy
    Content from this page is being sent via the Heroku Vegur Proxy.

Content Delivery Network
  Content Delivery Network
    This page contains links that give the impression that some of the site contents are stored on a content delivery network.

Payment
  Stripe
    Stripe makes it easy for developers to accept credit cards on the web.

JavaScript Libraries and Functions
  jQuery
    JQuery is a fast, concise, JavaScript Library that simplifies how you traverse HTML documents, handle events, perform animations, and add Ajax interactions to your web pages. jQuery is designed to change the way that you write JavaScript.
  html5shiv
    HTML5 IE enabling script shim.
  Facebook for Websites
    Allows a user to make a website more sociable and connected with integrations from the hugely popular Facebook website.
  Facebook SDK
    JavaScript SDK enables you to access all of the features of the Graph API via JavaScript, and it provides a rich set of client-side functionality for authentication and sharing. It differs from Facebook Connect.
  Twitter Platform
    The page embeds the Twitter platform in one method or another.
  Google API
    The website uses some form of Google APIs to provide interaction with the many API's Google Providers.
  Typeahead.js
    typeahead.js is an autocomplete library from Twitter.
  Bootstrap.js
    Twitter Bootstrap JS components.

Name Server
  DNSimple
    Domain name services made simple.

Email Hosting Providers
  Google Apps for Business
    Web-based email, calendar, and documents for teams. Renamed to Google Apps for Work, but now known as G Suite From Google Cloud.
  SPF
    The Sender Policy Framework is an open standard specifying a technical method to prevent sender address forgery.

Web Hosting Providers
  Amazon
    This site is hosted on Amazon AWS EC2 Infrastructure.

SSL Certificates
  SSL by Default
    The website redirects traffic to an HTTPS/SSL version by default.
  LetsEncrypt
    Lets Encrypt is a free open Certificate Authority.

Web Servers
  Rack Cache
    Rack::Cache is a component to enable HTTP caching for Rack-based applications such as Rails.

Edit: Added the command line argument and the shebang.
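
Since the question asks for parse-able output, the mapping returned by builtWith() could also be dumped as JSON instead of pretty-printed. A minimal sketch, assuming the same {category: [(technology, description), ...]} shape as in the script; to_json and the key names "technology" and "description" are illustrative choices, not something BuiltWith defines.

```python
import json
from collections import OrderedDict


def to_json(report):
    """Serialize {category: [(technology, description), ...]} as JSON."""
    data = OrderedDict()
    for category, technologies in report.items():
        data[category] = [
            {"technology": tech, "description": desc}
            for tech, desc in technologies
        ]
    return json.dumps(data, indent=2)


if __name__ == "__main__":
    # Sample entry taken from the output shown above.
    sample = OrderedDict([
        ("Payment", [("Stripe", "Stripe makes it easy for developers to "
                                "accept credit cards on the web.")]),
    ])
    print(to_json(sample))
```

The result can be piped straight into jq or loaded with json.loads() in another script.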


This solution uses Python 2 and PyQuery. It doesn't need an API key, so just run it with the URL you want. Pass any second argument (e.g. full) to print the descriptions as well as the list.

import sys
import requests
from pyquery import PyQuery as pq

base_url = "https://builtwith.com/"

if __name__ == "__main__":
  if len(sys.argv) < 2:
    print("usage: python bw.py <url>") 
    sys.exit(0)
  full = False if len(sys.argv) < 3 else True

  print "\n\nGetting builtwith info for", sys.argv[1]
  print "\t", base_url + sys.argv[1], "\n\n"

  p = pq(requests.get(base_url + sys.argv[1]).text)

  for c in p(".span8").children():
    if "title" in c.attrib["class"]:
      print "="*5, c.getchildren()[0].getchildren()[0].getchildren()[0].text, "="*5
    elif "tech" in c.attrib["class"]:
      print c.getchildren()[0].getchildren()[1].text
      if full:
        print "\t", c.getchildren()[2].text

Here's the full output for bountify.co:

$ python bw.py bountify.co full


Getting builtwith info for bountify.co
    https://builtwith.com/bountify.co 


===== Analytics and Tracking =====
Airbrake
    Airbrake collects errors generated by other applications, and aggregates the results for review.
Facebook Domain Insights
    This website contains tracking information that allows admins to see Facebook Insights out of Facebook to this domain.
Google Analytics
    Google Analytics offers a host of compelling features and benefits for everyone from senior executives and advertising and marketing professionals to site owners and content developers.
===== Widgets =====
Facebook Like Button
    The code to implement a Facebook Like Button on the page.
Facebook Like
    Allows users to Like items they find on the web, similar to how you Like items within Facebook.
Google Plus One Platform
    Google+ API functionality.
===== Ecommerce =====
Cart Functionality
    The site has a link to a shopping cart which is not categorized under any of the cart technologies we track (custom implementation or not tracked yet).
===== Frameworks =====
Ruby on Rails Token
    Ruby on Rails is an open-source web framework that is optimized for programmer happiness and sustainable productivity. Note that Ruby on Rails has two detection techniques and this is one of them.
Ruby on Rails
    Ruby on Rails is an open-source web framework that is optimized for programmer happiness and sustainable productivity.
Pusher
    Pusher is a realtime service that complements your existing server architecture.
Heroku Vegur Proxy
    Content from this page is being sent via the Heroku Vegur Proxy.
===== Content Delivery Network =====
GStatic Google Static Content
    Google has off-loaded static content (Javascript/Images/CSS) to a different domain name in an effort to reduce bandwidth usage and increase network performance for the end user.
CloudFront
    Amazon CloudFront is a web service for content delivery. It integrates with other Amazon Web Services to give developers and businesses an easy way to distribute content to end users with low latency, high data transfer speeds, and no commitments. 
Twitter CDN
    This page contains content sourced from the Twitter CDN, either by the use of Widgets or linking to image content on twimg.com currently hosted by Akamai and Amazon.
Facebook CDN
    This page has content that links to the Facebook content delivery network.
Fastly CDN
    Links to fastly CDN based content.
===== Payment Providers =====
Stripe
    Stripe makes it easy for developers to accept credit cards on the web.
===== JavaScript Libraries =====
jQuery
    JQuery is a fast, concise, JavaScript Library that simplifies how you traverse HTML documents, handle events, perform animations, and add Ajax interactions to your web pages. jQuery is designed to change the way that you write JavaScript.
html5shiv
    HTML5 IE enabling script shim.
Facebook for Websites
    Allows a user to make a website more sociable and connected with integrations from the hugely popular Facebook website.
Facebook SDK
     JavaScript SDK enables you to access all of the features of the Graph API via JavaScript, and it provides a rich set of client-side functionality for authentication and sharing. It differs from Facebook Connect.
Twitter Platform
    The page embeds the Twitter platform in one method or another.
Google API
    The website uses some form of Google APIs to provide interaction with the many API's Google Providers.
Typeahead.js
    typeahead.js is an autocomplete library from Twitter.
Bootstrap.js
    Twitter Bootstrap JS components.
===== Nameserver Providers =====
DNSimple
    Domain name services made simple.
===== Email Services =====
Google Apps for Business
    Web-based email, calendar, and documents for teams. Renamed to Google Apps for Work, but now known as G Suite From Google Cloud.
SPF
    The Sender Policy Framework is an open standard specifying a technical method to prevent sender address forgery.
===== Hosting Providers =====
Amazon
    This site is hosted on Amazon AWS EC2 Infrastructure.
===== SSL Certificate =====
SSL by Default
    The website redirects traffic to an HTTPS/SSL version by default.
LetsEncrypt
    Let’s Encrypt is a free open Certificate Authority.
===== Web Server =====
Rack Cache
    Rack::Cache is a component to enable HTTP caching for Rack-based applications such as Rails.

Here's the output for bountify.co with just the list:

$ python bw.py bountify.co


Getting builtwith info for bountify.co
    https://builtwith.com/bountify.co 


===== Analytics and Tracking =====
Airbrake
Facebook Domain Insights
Google Analytics
===== Widgets =====
Facebook Like Button
Facebook Like
Google Plus One Platform
===== Ecommerce =====
Cart Functionality
===== Frameworks =====
Ruby on Rails Token
Ruby on Rails
Pusher
Heroku Vegur Proxy
===== Content Delivery Network =====
GStatic Google Static Content
CloudFront
Twitter CDN
Facebook CDN
Fastly CDN
===== Payment Providers =====
Stripe
===== JavaScript Libraries =====
jQuery
html5shiv
Facebook for Websites
Facebook SDK
Twitter Platform
Google API
Typeahead.js
Bootstrap.js
===== Nameserver Providers =====
DNSimple
===== Email Services =====
Google Apps for Business
SPF
===== Hosting Providers =====
Amazon
===== SSL Certificate =====
SSL by Default
LetsEncrypt
===== Web Server =====
Rack Cache
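
The runs above pass a bare domain (bountify.co). If the script might be handed a full URL instead, the argument could be reduced to its host name first. A rough sketch for Python 3 (on Python 2 the import would be from urlparse import urlsplit); to_domain is a hypothetical helper, and whether builtwith.com also accepts full URLs in that path is not something I have checked.

```python
from urllib.parse import urlsplit


def to_domain(value):
    """Return the bare host name from a URL or a plain domain."""
    # Without a scheme, urlsplit() treats the text as a path, so prefix "//"
    # to make it parse as a network location.
    if "://" not in value:
        value = "//" + value
    # .hostname is lower-cased and has any port stripped; None becomes "".
    return urlsplit(value).hostname or ""


if __name__ == "__main__":
    for arg in ("bountify.co", "https://bountify.co/", "http://Bountify.co/tasks"):
        print(to_domain(arg))
```

Each of the three sample arguments comes out as bountify.co.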

This also uses PyQuery:

pip install pyquery

Save this in buildwith.py:

import sys
import requests
from pyquery import PyQuery as pq

base_url = "https://builtwith.com/"

if __name__ == "__main__":
  if len(sys.argv) < 2:
    print("usage: python buildwith.py <url>") 
    sys.exit(0)
  full = False if len(sys.argv) < 3 else True

  p = pq(requests.get(base_url + sys.argv[1]).text)

  for c in p(".span8").children():
    if "title" in c.attrib["class"]:
        print '\n', c.getchildren()[0].getchildren()[0].getchildren()[0].text
    elif "tech" in c.attrib["class"]:
        print '-', c.getchildren()[0].getchildren()[1].text
        # Print descriptions only for technology rows.
        if full:
            print "\t", c.getchildren()[2].text

Alternatively, get the code from here

And run as

python buildwith.py bountify.co

Winning solution

Here's my solution in Python 3.x. It uses the lxml library to scrape BuiltWith's extension/mobile site, which keeps scraping fast and bandwidth usage low.

#!/usr/bin/env python3
import lxml.html
from urllib.request import urlopen
from argparse import ArgumentParser

parser = ArgumentParser()
parser.add_argument(dest='url', help='url to use with builtwith.com', metavar='URL')
parser.add_argument('-s', '--simple', dest='simple', action='store_true',
                    help='simple mode that excludes tech descriptions')
args = parser.parse_args()

html = lxml.html.parse(urlopen('https://builtwith.com/app/mobile/?%s' % args.url))
elements = html.xpath('//*[self::ul[contains(@class, "nav-pills")] or self::tr]')

for el in elements:
    if el.tag == 'ul':
        print('')
        print('[%s]' % el.text_content())
    elif len(el.classes) == 0:
        text = el.text_content()
        if text:
            if 'padding-left' in el[0].attrib.get('style', ''):
                print('Subname: %s' % text)
            else:
                print('Name: %s' % text)
    elif not args.simple and 'id' in el.classes:
        print('Description: %s' % el.text_content())

Usage

You'll first need to install the lxml library

pip install lxml

then simply launch it

$ ./builtwith.py bountify.co

I've also included a simple mode argument, -s, that excludes technology descriptions if those are not needed. E.g.:

$ ./builtwith.py bountify.co -s

Example output:

[Analytics and Tracking]
Name: Airbrake
Description: Airbrake collects errors generated by other applications, and aggregates the results for review.
Name: Facebook Domain Insights
Description: This website contains tracking information that allows admins to see Facebook Insights out of Facebook to this domain.
Name: Google Analytics
Description: Google Analytics offers a host of compelling features and benefits for everyone from senior executives and advertising and marketing professionals to site owners and content developers.
Subname: Google Analytics Classic

[Widgets]
Name: Facebook Like Button
Description: The code to implement a Facebook Like Button on the page.
Name: Facebook Like
Description: Allows users to Like items they find on the web, similar to how you Like items within Facebook.
Name: Google Plus One Platform
Description: Google+ API functionality.

[Ecommerce]
Name: Cart Functionality
Description: The site has a link to a shopping cart which is not categorized under any of the cart technologies we track (custom implementation or not tracked yet).

[Frameworks]
Name: Ruby on Rails Token
Description: Ruby on Rails is an open-source web framework that is optimized for programmer happiness and sustainable productivity. Note that Ruby on Rails has two detection techniques and this is one of them.
Name: Ruby on Rails
Description: Ruby on Rails is an open-source web framework that is optimized for programmer happiness and sustainable productivity.
Name: Pusher
Description: Pusher is a realtime service that complements your existing server architecture.
Name: Heroku Vegur Proxy
Description: Content from this page is being sent via the Heroku Vegur Proxy.

[Content Delivery Network]
Name: GStatic Google Static Content
Description: Google has off-loaded static content (Javascript/Images/CSS) to a different domain name in an effort to reduce bandwidth usage and increase network performance for the end user.
Name: CloudFront
Description: Amazon CloudFront is a web service for content delivery. It integrates with other Amazon Web Services to give developers and businesses an easy way to distribute content to end users with low latency, high data transfer speeds, and no commitments.
Name: Twitter CDN
Description: This page contains content sourced from the Twitter CDN, either by the use of Widgets or linking to image content on twimg.com currently hosted by Akamai and Amazon.
Name: Facebook CDN
Description: This page has content that links to the Facebook content delivery network.
Name: Fastly CDN
Description: Links to fastly CDN based content.

[Payment]
Name: Stripe
Description: Stripe makes it easy for developers to accept credit cards on the web.

[JavaScript Libraries and Functions]
Name: jQuery
Description: JQuery is a fast, concise, JavaScript Library that simplifies how you traverse HTML documents, handle events, perform animations, and add Ajax interactions to your web pages. jQuery is designed to change the way that you write JavaScript.
Name: html5shiv
Description: HTML5 IE enabling script shim.
Name: Facebook for Websites
Description: Allows a user to make a website more sociable and connected with integrations from the hugely popular Facebook website.
Name: Facebook SDK
Description:  JavaScript SDK enables you to access all of the features of the Graph API via JavaScript, and it provides a rich set of client-side functionality for authentication and sharing. It differs from Facebook Connect.
Name: Twitter Platform
Description: The page embeds the Twitter platform in one method or another.
Name: Google API
Description: The website uses some form of Google APIs to provide interaction with the many API's Google Providers.
Name: Typeahead.js
Description: typeahead.js is an autocomplete library from Twitter.
Name: Bootstrap.js
Description: Twitter Bootstrap JS components.

[Name Server]
Name: DNSimple
Description: Domain name services made simple.

[Email Hosting Providers]
Name: Google Apps for Business
Description: Web-based email, calendar, and documents for teams. Renamed to Google Apps for Work, but now known as G Suite From Google Cloud.
Name: SPF
Description: The Sender Policy Framework is an open standard specifying a technical method to prevent sender address forgery.

[Web Hosting Providers]
Name: Amazon
Description: This site is hosted on Amazon AWS EC2 Infrastructure.
Subname: Amazon Virginia Region

[SSL Certificates]
Name: SSL by Default
Description: The website redirects traffic to an HTTPS/SSL version by default.
Name: LetsEncrypt
Description: Let’s Encrypt is a free open Certificate Authority.

[Web Servers]
Name: Rack Cache
Description: Rack::Cache is a component to enable HTTP caching for Rack-based applications such as Rails.

[Content Delivery Network]
Name: Content Delivery Network
Description: This page contains links that give the impression that some of the site contents are stored on a content delivery network.

This solution also includes technology subnames. E.g.:

[Web Hosting Providers]
Name: Amazon
Description: This site is hosted on Amazon AWS EC2 Infrastructure.
Subname: Amazon Virginia Region
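
Because every line is either a [Category] header or a "Name:", "Description:" or "Subname:" field, the output is easy to read back into a data structure. A minimal sketch, assuming exactly the format shown above; parse_report and the dictionary keys are made up for illustration.

```python
def parse_report(text):
    """Parse '[Category]' / 'Name:' / 'Description:' / 'Subname:' lines."""
    report = {}
    category = None  # list of technologies for the current category
    tech = None      # technology currently being filled in
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            # setdefault merges categories that appear more than once.
            category = report.setdefault(line[1:-1], [])
            tech = None
        elif line.startswith("Name: ") and category is not None:
            tech = {"name": line[len("Name: "):],
                    "description": "", "subnames": []}
            category.append(tech)
        elif line.startswith("Description:") and tech is not None:
            tech["description"] = line[len("Description:"):].strip()
        elif line.startswith("Subname: ") and tech is not None:
            tech["subnames"].append(line[len("Subname: "):])
    return report


if __name__ == "__main__":
    sample = """[Web Hosting Providers]
Name: Amazon
Description: This site is hosted on Amazon AWS EC2 Infrastructure.
Subname: Amazon Virginia Region
"""
    print(parse_report(sample))
```

Repeated headers, such as [Content Delivery Network] in the example output, end up merged under a single key.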