Scrape all product information using Python into CSV and JSON, and load it into a database

The challenge is to develop an elegant Python script, run periodically, that uses open source tools such as Selenium, BeautifulSoup, or whichever other open source packages you deem fit, to scrape the https://goo.gl/T117mc product pages for product information.
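For orientation, a minimal fetch-and-parse skeleton along these lines is what I have in mind; the CSS selectors (.product-tile, .product-name) are placeholders, since the real class names depend on the site's markup:

import requests
from bs4 import BeautifulSoup

def fetch_products(listing_url):
    # Placeholder selectors; adjust to the actual page markup.
    html = requests.get(listing_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    for tile in soup.select(".product-tile"):
        yield {
            "name": tile.select_one(".product-name").get_text(strip=True),
            "url": tile.select_one("a")["href"],
        }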

The script is meant to be run on an AWS virtual machine.
The output of the scraping is the data of the products on the website, including all their attributes and the URLs of the product photos on display. To allow for flexibility and some quick testing, the script should let the user supply filters, in line with the product pages of the https://goo.gl/T117mc website, e.g. “Price > $500” or “Product Category = Handbags”.

The output of the scraping is then to be (1) saved to both a CSV and a JSON file for private download, and (2) loaded into a database on AWS. For the latter, please propose which database to load it into and provide information or a script that lets users set up the database for storing the scraped results.
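To illustrate the expected export and load step, here is a rough sketch; SQLite is used only to keep it self-contained (the same table definition would carry over to, say, a PostgreSQL instance on Amazon RDS), and the field and file names are assumptions:

import csv
import json
import sqlite3

# Field names are assumptions, mirroring the "Data to be captured" list below.
FIELDS = ["name", "style_no", "price_sgd", "color", "design",
          "details", "photo_urls", "category"]

def export_and_load(rows, basename="products"):
    # 'rows' is a list of dicts from the scraper; "photo_urls" is assumed
    # to already be a single comma-separated string.
    # (1) Save to CSV and JSON for private download.
    with open(basename + ".csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    with open(basename + ".json", "w", encoding="utf-8") as f:
        json.dump({"products": rows}, f, indent=4)
    # (2) Load into a database. SQLite keeps the sketch self-contained;
    # the same table definition would work on an RDS PostgreSQL or MySQL instance.
    db = sqlite3.connect(basename + ".db")
    db.execute("""CREATE TABLE IF NOT EXISTS products (
        name TEXT, style_no TEXT PRIMARY KEY, price_sgd REAL,
        color TEXT, design TEXT, details TEXT,
        photo_urls TEXT, category TEXT)""")
    db.executemany(
        "INSERT OR REPLACE INTO products VALUES (?,?,?,?,?,?,?,?)",
        [tuple(r[k] for k in FIELDS) for r in rows])
    db.commit()
    db.close()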

Filters are: [Category (Clothing, Handbags, Shoes, Accessories, Watches, Jewelry, Gift Sets, Men’s Bags, Sunglasses, Wallets)]; [Price (Less than S$50, S$50-100, S$100-300, S$300-500, S$500-1000, S$1000-2000, S$2000-3000, More than S$3000)]
It should be possible to call the filters from another Python script, so that multiple categories can be searched at once easily (see the sketch below).
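For example, a filter interface along these lines would do; the scrape_category helper is a placeholder standing in for the real scraping routine, not an existing function:

PRICE_BANDS = {
    "Less than S$50": (0, 50),
    "S$50-100": (50, 100),
    "S$100-300": (100, 300),
    "S$300-500": (300, 500),
    "S$500-1000": (500, 1000),
    "S$1000-2000": (1000, 2000),
    "S$2000-3000": (2000, 3000),
    "More than S$3000": (3000, None),
}

def scrape_category(category):
    # Placeholder for the real scraping routine; returns dicts with at
    # least a "price_sgd" key.
    return []

def scrape(categories, price_band=None):
    low, high = PRICE_BANDS.get(price_band, (None, None))
    results = []
    for category in categories:            # e.g. ["Handbags", "Shoes"]
        for product in scrape_category(category):
            price = product["price_sgd"]
            if low is not None and price < low:
                continue
            if high is not None and price >= high:
                continue
            results.append(product)
    return results

# From another script:
#   from scraper import scrape
#   products = scrape(["Handbags", "Wallets"], price_band="S$500-1000")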

Data to be captured: [Name, Style No., Price (SGD), Color, Design, Details, Photos (URLs), Category]
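For illustration, each scraped product could be carried around as a record like the following (the field names are assumptions):

from dataclasses import dataclass, field
from typing import List

@dataclass
class Product:
    # One record per scraped product; fields mirror the list above.
    name: str
    style_no: str
    price_sgd: float
    color: str
    design: str
    details: str
    photo_urls: List[str] = field(default_factory=list)
    category: str = ""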

The script should have a mechanism for identifying changes in products since the previous update, such as product additions, price changes, and product deletions.
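For example, a simple diff of two snapshots could look like this; the file names are assumptions, and it is assumed that Style No. uniquely identifies a product:

import json

def diff_snapshots(old_file="products_prev.json", new_file="products.json"):
    # Each file is the JSON output of one scraping run.
    with open(old_file) as f:
        old = {p["style_no"]: p for p in json.load(f)["products"]}
    with open(new_file) as f:
        new = {p["style_no"]: p for p in json.load(f)["products"]}
    added   = [new[k] for k in new.keys() - old.keys()]
    deleted = [old[k] for k in old.keys() - new.keys()]
    price_changes = [
        (k, old[k]["price_sgd"], new[k]["price_sgd"])
        for k in old.keys() & new.keys()
        if old[k]["price_sgd"] != new[k]["price_sgd"]
    ]
    return added, deleted, price_changes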


1 Solution

Winning solution

Here's the script.
Note that when you want to use the filter, you should copy and paste the category name from the 'list of categories.txt' file.

https://github.com/mohamed-source/scrape-products/archive/master.zip

Hi Mohamed, really interesting solution for getting the product URLs. I have not tried the solution yet or gotten an output for myself, and I realize that I am running Python 2 rather than Python 3, so I am trying whether importing urllib and urllib2, and importing urlopen and Request from urllib2, will work. Nevertheless, I will award the bounty immediately. In the meantime, would you be able to run the script and send me the entire output? Also, does this only work for the AU site? I would probably need a more flexible solution that lets me scrape sites with no product sitemaps; e.g., the US version does have a product sitemap, but the SG site doesn't seem to. I will probably need more follow-up work, help, and advice on this, so shall we do it over email? What do you prefer?
Edwin824 12 days ago
Hi Mohamed, can you help me with this error?

OperationalError                          Traceback (most recent call last)
<ipython-input> in <module>()
    152     save()
    153
--> 154 if __name__ == '__main__': main()

<ipython-input> in main()
    150         '''error[product_url] = logging.error(traceback.format_exc())'''
    151         pass
--> 152     save()
    153
    154 if __name__ == '__main__': main()

<ipython-input> in save(data, filename)
     98     if data == None:
     99         db = sqlite3.connect('products.db')
--> 100         data = db.execute('select * from products').fetchall()
    101     with open(filename+'.json', 'w') as f:
    102         json.dump({'products': data}, f, indent=4)

OperationalError: no such table: products
Edwin824 12 days ago
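(For reference: an error like this usually means save() was reached before any product row was inserted, so the table was never created. A minimal guard, assuming the posted script's sqlite3 schema, would be to create the table up front:

import sqlite3

db = sqlite3.connect('products.db')
# Creating the table up front means a later SELECT cannot fail with
# "no such table", even when the scrape inserted nothing. The column
# list is an assumption about the posted script's schema.
db.execute("""CREATE TABLE IF NOT EXISTS products (
    name TEXT, style_no TEXT, price_sgd REAL, color TEXT,
    design TEXT, details TEXT, photo_urls TEXT, category TEXT)""")
db.commit()
)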
Hi Mohamed, can you send me the email on this again? I didn’t receive it. Thanks.
Edwin824 12 days ago
I have sent the files again. It seems that there are 37 versions of the website. I can build a script that deals with all the variations and scrapes all the products in all the versions for 50 dollars.
mohamed ibrahim 10 days ago