Retrieve list of product URLs from BigCommerce site and document how you did it

Hi!

So, a while back, I somehow managed through some tinkering to automatically grab the URLs of all ~2K of the products listed on the e-commerce site https://www.eightcig.com.

This is something simple I'm missing, I'm sure, but I can't remember what I did to find that.

Could someone help me out with retrieving all product page URLs as a CSV and also then document the steps they took to find it?

Thanks!

I have played around with it. I was able to scrape the content of a single page (from the console) but didn't find a way to scrape all the content on the website; I couldn't force it to return all the items on a single webpage. In other words, I could write a PHP script that scrapes a webpage's content and saves it as a CSV or text file, or a JS script that scrapes a webpage's content and prints it to the console (maybe with the option to download it as a text file). If you know a way to retrieve all items on a single page, let me know.
Chlegou 5 months ago
I did something in the Chrome Web Dev Console while inspecting the JS; I can't remember how I got the list of product URLs, but I did: https://cvlassets.nyc3.digitaloceanspaces.com/input.csv. Essentially, all I'm looking for here is that same list again but with updated URLs; I'm really only looking for a sitemap of sorts. I don't need the scraping part; I've already written the Python script that does that.
paragon21 5 months ago
I have posted an answer below (not a solution, just to clarify things). Please check it.
Chlegou 5 months ago


3 Solutions


To start, I opened the Chrome Web Dev Console and typed $ to find out whether the site had jQuery. Since it does, I decided to use it to gather the product links.

Inspecting the HTML, I found that each listing sits in a list item with the tag <li class="product">. I selected all such tags with jQuery in the Dev Console with $('li.product').

Inside the li tag, the a tag containing the product link can be selected with $('li.product .card-title a'). Converting the result to an array with toArray lets me operate on each element: $('li.product .card-title a').toArray().

Next, I used map to replace each a element with its corresponding link, $('li.product .card-title a').toArray().map(a => a.href), and joined the list of URLs into a single comma-separated string with join: $('li.product .card-title a').toArray().map(a => a.href).join(",").

This one line of JavaScript in the Dev Console returns a string with each URL separated by commas.

This is my first solution here, let me know if this is what you were looking for or if there's anything I missed.
ProfessorG 5 months ago
Okay, so when I run your suggested solution in the Dev Console, I do see product URLs separated by commas, but I don't see all 1400 or so of the products on the site. Am I missing something? Again, this was something I stumbled upon the first time, and I cannot for the life of me remember where/how.
paragon21 5 months ago
Let me clarify here. Actually, I get exactly 20 URLs separated by commas, which appear to me to be the same URLs as the individual products listed on the homepage. I'm looking for the entire inventory of product URLs. It's there somewhere, because I found it just three weeks ago!
paragon21 5 months ago
Ah, I didn't realize that section wasn't the entire inventory of products. Since I didn't see any kind of "Next page" button, I assumed everything was already there. My bad; I suppose this isn't really a solution that could work even with modification.
ProfessorG 5 months ago

@paragon21 This answer follows on from our discussion in the comments; I'm just posting here to give a clearer account of what I did.

The code I wrote gathers the elements on a single webpage (as I said in my first comment):

$('.product .product-inner .card-title > a').each(function(i, e){ console.log($(e).attr('href'));});

But this code retrieves only 20 rows (the rows loaded by default on the page).

So I tried diving into /search.php to see whether I could scrape all the results on the site. The maximum number of results I managed to get was 228! See it here:

https://www.eightcig.com/search.php?search_query=apple%20grape%20berry%20melon%20kiwi%20fruit&section=product

But the number of rows on a single page was still only 16!

While writing this answer I noticed that the site's AJAX calls add extra headers to the request, so that the server returns only components. I started again from that point and was able to reproduce the results with this code:

$.ajax({
    url: "https://www.eightcig.com/search.php?search_query=apple%20grape%20berry%20melon%20kiwi%20fruit&section=product",
    type: "GET",
    beforeSend: function(xhr){
        // Stencil headers: ask the server to render only the
        // product-listing component and raise the per-page result limit
        xhr.setRequestHeader('stencil-options', '{"render_with":"search/product-listing"}');
        xhr.setRequestHeader('stencil-config', '{"product_results":{"limit":48}}');
    },
    success: function(result) {
        // count the product links to see how many rows came back per page
        var links = $(result).find('.product .product-inner .card-title > a').toArray().map(a => a.href);
        console.log(links.length);
        // print the links themselves if wanted
        // links.forEach(function(href){ console.log(href); });
    }
});

I noticed that the Stencil headers make a difference, so I googled them expecting some results. I found the BigCommerce docs, which explain the object structure, but I couldn't manipulate it! I don't know how; it's my first experience with it and it drove me crazy!

link: https://developer.bigcommerce.com/stencil-docs/stencil-object-model-reference/stencil-objects/global-objects/search

I spent 2 hours trying to change the results, but couldn't! :(

Anyway, I tried what I could. Hopefully you have experience with it, so you can understand it and get what you want.

N.B. You said you already have the script; if so, reusing just the selector inside your Python script should make it work again. (I don't know; maybe the selector was changed! I'm just guessing.) See the sketch below.
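
For instance, here is a minimal sketch of that idea. Only the CSS selector comes from the experiments above; the use of requests and BeautifulSoup (rather than whatever your existing script does) and the example URL are my assumptions:

# Hypothetical sketch: reuse the selector from the console experiments in
# Python. Assumes the requests and beautifulsoup4 packages are installed;
# the listing URL is just an example.
import requests
from bs4 import BeautifulSoup

def page_product_links(url):
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'html.parser')
    # same selector as in the jQuery snippets above
    return [a['href'] for a in soup.select('.product .product-inner .card-title > a')]

print(page_product_links('https://www.eightcig.com/'))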

N.B. I believe the Stencil library could be the key! It really deserves time investigating, though it's also mind-blowing to get no results after 2 hours, at least for me, since it's my first time with it.

Hope you get it right. I'll be glad if Stencil turns out to be the key! At least I learned something new :p

Thanks for all your time and effort here. I have a Python script that uses Selenium to scrape the site every 12 hours, but I tell the script which product pages to scrape based on the CSV of product URLs I originally found by playing around in the Dev Console. I just can't remember how I did it! Obviously, we need to update the CSV of input URLs to reflect new additions to the catalog. Here's that script so you can see how it works (with 'input.csv' serving as the CSV, stored in the same directory, that tells the script which URLs to visit): https://cvlassets.nyc3.digitaloceanspaces.com/eightcig-1.py
paragon21 5 months ago
And no, it wasn't Stencil; it was something painfully obvious that was in plain sight somewhere. Argh, why didn't I document it?
paragon21 5 months ago
I viewed the Python code (I'm a beginner, by the way). I still believe Stencil could be the solution, but who knows why it didn't work when I manipulated it; maybe because I don't know how to use it, or because some values are redefined on the server. Anyway, I will keep following the bounty; maybe someone will get it right.
Chlegou 5 months ago
Winning solution

I wrote a Python script that uses the website's search to find the products, but the search only returned between 900 and 1000 products.
The steps are as follows.
The script uses the site search to search for this string:
a b c d e f g h i j k l m n o p q r s t u v w x y z 0 1 2 3 4 5 6 7 8 9
The URL of the search is:
https://www.eightcig.com/search.php?search_query=a+b+c+d+e+f+g+h+i+j+k+l+m+n+o+p+q+r+s+t+u+v+w+x+y+z+0+1+2+3+4+5+6+7+8+9&section=product
It then retrieves the HTML response and searches it for the marker string that appears immediately before every product URL (see the comment in the code below).
After finding a product URL, it appends the URL to a list,
and does the same for all the product URLs on the page.
It then searches the page for the following string:
<link rel="next" href="
If it finds that string, it means there is a next page,
so it retrieves the next page's URL, opens it,
and repeats the process of searching for products.
Finally it saves the product list to an input.csv file.
I don't know whether your script will succeed in reading this output format or not.
Here's the code.
import csv
from urllib.request import urlopen
from html import unescape

# NB: the exact marker string was garbled in the original post. It is the
# bit of markup that sits immediately before every product URL on a results
# page; 'card-title"><a href="' is a plausible guess based on the page
# structure discussed above and may need adjusting.
PRODUCT_MARKER = 'card-title"><a href="'
NEXT_MARKER = '<link rel="next" href="'

def gather(url, products):
    response = urlopen(url).read()
    response = response.decode('utf-8')
    # every chunk after the first starts with a product URL
    chunks = response.split(PRODUCT_MARKER)
    chunks.pop(0)
    for i in chunks:
        chunk = i.split('"')         # cut the URL off at the closing quote
        chunk = chunk[0].split('?')  # drop any query string
        if chunk[0] not in products:
            products.append(chunk[0])
    # if the page declares a "next" link, recurse into the next results page
    next_page = response.split(NEXT_MARKER)
    if len(next_page) > 1:
        chunk = next_page[1].split('"')
        chunk = unescape(chunk[0])
        url = 'https://www.eightcig.com' + chunk
        gather(url, products)

def update():
    products = []
    try:
        # seed the list with any URLs already saved in input.csv
        with open('input.csv') as f:
            r = csv.reader(f, delimiter=',')
            for row in r:
                products.append(row)
        products = products[0]
    except IOError:
        pass
    url = ('https://www.eightcig.com/search.php?search_query='
           'a+b+c+d+e+f+g+h+i+j+k+l+m+n+o+p+q+r+s+t+u+v+w+x+y+z'
           '+0+1+2+3+4+5+6+7+8+9&section=product')
    gather(url, products)
    with open('input.csv', 'w') as f:
        f.write(','.join(products))

if __name__ == '__main__':
    update()

Whoa! This may just do the trick. Brilliant idea!!!
paragon21 5 months ago
There are actually about that many products; it's the variations that add up to 1400.
paragon21 5 months ago
You can run the script directly from the command line (python3 update_products.py, assuming you saved it under that name), or import it from inside the other script with import update_products and then call update_products.update() at the point where you want to run it, as in the sketch below.
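
For instance, a hypothetical wrapper (the file name update_products.py is my example name for the winning solution above):

# run this before the Selenium pass to refresh the URL list;
# assumes the winning solution is saved as update_products.py alongside it
import update_products

update_products.update()  # rewrites input.csv with the latest product URLs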
mohamed ibrahim 5 months ago