Advice required only, no need for code: how to structure the right challenge and bounty - “Scraping of all product information using Python into CSV and JSON, and load into database”

I posted this task a while back, but there were no takers for the bounty. To improve subsequent requests, I would like the community's advice on whether the bounty was too low for the challenge or the proposed approach was wrong. How can I improve the challenge or the bounty so that there will be takers for the task? Thanks all. The bounty will be awarded for advice provided, as a token of appreciation.

“The challenge is to develop an elegant script in Python that is run periodically, using available open source tools like Selenium, BeautifulSoup, or whichever open source packages you deem fit, to scrape the https://goo.gl/T117mc product pages for product information.

The script is meant to be run on an AWS virtual machine.

The output of the scraping is the data for every product on the website, including all their attributes and the URLs of the product photos on display. To allow for flexibility and quick testing, the script should let the user specify filters in the script, in line with the product pages of the https://goo.gl/T117mc website, e.g. “Price > $500”, “Product Category = Handbags”.

The output of the scraping is then to be (1) saved in both a CSV and a JSON file for private download, and (2) loaded into a database on AWS. For the latter, please propose which database to load it into, and provide information or a script that lets users set up the database for storing the scraped results.

To share more of the intention and give more context: not within the scope of this challenge (but planned as a follow-up challenge) is change tracking. When the script runs periodically on AWS and there are product additions or deletions, or product information changes such as inventory or price updates, the new product information will be added as a new row, the old data for that product will be superseded, and this is to be reflected in the database.”
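To make the expected deliverable concrete, here is a minimal sketch of the output side of the brief (a hedged illustration, not the actual script): the field names and the example `products` list are hypothetical placeholders, and loading into the AWS database would be a separate step.

```python
import csv
import json

# Hypothetical example records; the real ones would come from the scraper.
products = [
    {"name": "Example Tote", "style_no": "AB-1234", "price_sgd": 650.0,
     "color": "Black", "category": "Handbags",
     "photo_urls": ["https://example.com/img/ab-1234-1.jpg"]},
]

fieldnames = ["name", "style_no", "price_sgd", "color", "category", "photo_urls"]

# CSV for private download; photo URLs are joined into one cell.
with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    for p in products:
        writer.writerow(dict(p, photo_urls="|".join(p["photo_urls"])))

# JSON keeps the nested list of photo URLs as-is.
with open("products.json", "w", encoding="utf-8") as f:
    json.dump(products, f, indent=2, ensure_ascii=False)
```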

Could you provide a link to the previous bounty? It seems to have been posted from a different account than https://bountify.co/users/edwin824. You've already provided most of the info, although the bounty award itself is not mentioned.
Prometheus 7 months ago
This is the bounty itself; I just modified it to ask for advice rather than a solution, because no one posted a solution.
Edwin824 7 months ago
I read this bounty a few days ago, opened the website, and analyzed it. It's slow and has a poor structure, so it's hard to scrape. I played with it in my free time and I think I have figured out a way to scrape it. I can do the whole job in five days or less for fifty dollars. If I do it, I need to know the list of filters you want to use, and, if possible, the website's product update frequency, if you have that information.
mohamed ibrahim 7 months ago
Sounds good, Mohamed, I'll be happy to try it out with you. Please advise how we can proceed.
Edwin824 7 months ago
I'll start working on it right now, and when I finish I'll tell you to post another bounty.
mohamed ibrahim 7 months ago
After working on the website yesterday and today, it seems it has around two hundred categories and around one thousand seven hundred products, and the website is slow. I want to know the information you want to extract from every product, for example (name, price, colors, sizes, images, etc.), and the filters you want to be available in the script.
mohamed ibrahim 7 months ago
And what do you mean by “To share more of the intention and give more context: not within the scope of this challenge (but planned as a follow-up challenge) is change tracking. When the script runs periodically on AWS and there are product additions or deletions, or product information changes such as inventory or price updates, the new product information will be added as a new row, the old data for that product will be superseded, and this is to be reflected in the database.”
mohamed ibrahim 7 months ago
Filters are: [Category (Clothing, Handbags, Shoes, Accessories, Watches, Jewelry, Gift Sets, Men’s Bags, Sunglasses, Wallets)]; [Price (Less than S$50, S$50-100, S$100-300, S$300-500, S$500-1000, S$1000-2000, S$2000-3000, More than S$3000)] Data: [Name, Style No., Price (SGD), Color, Design, Details, Photos (urls), Category ]
Edwin824 7 months ago
What I meant, but wasn't clear in expressing, was simply to have a mechanism for identifying changes in products since the previous update, like product additions, price changes, and product deletions.
Edwin824 7 months ago
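As a rough illustration of the mechanism described above (a sketch only, using a hypothetical snapshot structure keyed by style number), a periodic run could diff the previous and current snapshots to flag additions, deletions, and changed fields:

```python
# Hypothetical snapshots keyed by style number: {style_no: {attribute: value, ...}}
previous = {"AB-1234": {"price_sgd": 650.0}, "CD-5678": {"price_sgd": 1200.0}}
current  = {"AB-1234": {"price_sgd": 690.0}, "EF-9012": {"price_sgd": 480.0}}

added   = set(current) - set(previous)            # new products
removed = set(previous) - set(current)            # deleted products
changed = {sku for sku in set(previous) & set(current)
           if previous[sku] != current[sku]}      # e.g. price or inventory updates

print("added:", added)      # {'EF-9012'}
print("removed:", removed)  # {'CD-5678'}
print("changed:", changed)  # {'AB-1234'}
```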
I was supposed to deliver the script by today, but unfortunately, after trying to scrape the website using Selenium, I found that it would take around twenty hours to scrape all the products on the website, because the "lazy load" on the website is very lazy. So I had to find another way to scrape it without dealing with the lazy load and without using Selenium. After a couple of days I found another way to scrape it efficiently, and then I had to start all over again.
mohamed ibrahim 7 months ago
I just finished the script; the only missing part is the filters. I need to know about the filters: will they be called from the Linux terminal and direct the output to the terminal, or will they be called from another Python script and direct the output to that script? And how will it be used?
mohamed ibrahim 7 months ago
My initial thought was to call it from another Python script. My initial thinking was also to output the result into a CSV or JSON file, or upload it into an AWS database, but I would like to hear what you think and get advice on how I should go about it.
Edwin824 7 months ago
Calling the filters from another Python script will be better than calling them from the terminal because it will make it easy to search for multiple categories at once.
mohamed ibrahim 7 months ago
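For what it's worth, the calling convention being discussed could look roughly like this; `run_scrape` and the filter keys are hypothetical names, not the actual script's interface:

```python
from typing import Dict, List

def run_scrape(filters: Dict[str, List[str]]) -> List[dict]:
    """Hypothetical entry point: return product dicts matching the
    given category and price-band filters."""
    raise NotImplementedError  # the real scraping logic would live here

# Calling it from another Python script makes multi-category queries easy:
filters = {
    "Category": ["Handbags", "Watches", "Wallets"],
    "Price": ["S$500-1000", "S$1000-2000"],
}
# products = run_scrape(filters)
```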
I've finished the script. Post another bounty for fifty dollars or more so that I can submit it there as the solution.
mohamed ibrahim 7 months ago
Great, thanks for the note Mohamed. I will post the bounty shortly. Could you send me a sample output from the script, perhaps in CSV, so that I can see it produces the required output? Perhaps using the sample filters of Price (S$500-1000, S$1000-2000, S$2000-3000, More than S$3000) and Category (Handbags, Accessories, Watches, Jewelry, Gift Sets, Men’s Bags, Sunglasses, Wallets).
Edwin824 7 months ago
Do you have a Gmail address I can send the sample files to?
mohamed ibrahim 7 months ago
Thanks Mohamed, yes you can email me at hej1sbebhxct@opayq.com please.
Edwin824 7 months ago
Thanks again Mohamed, the bounty has been posted. In the earlier comment I provided a disposable email address that forwards messages to my personal email, to avoid the latter being captured by spiders for spam, so please don't be alarmed by the strange-looking address. If you prefer, you can drop me a quick email saying that you are Mohamed, and I will respond with my real email so you can send the files to me then. We can discuss further over email.
Edwin824 7 months ago
awarded to Prometheus


1 Solution

Winning solution

I have experience in Python, scraping, and cloud services (AWS), so here is my attempt.

  1. There is no way to view all products in a collection at once (even per group; try opening "VIEW ALL MENS", it shows only one product, but there are more product sub-groups). Therefore, as the first step, the script needs to receive a list of non-overlapping collection URLs on the site. I'd recommend compiling this list manually, since there are only tens of them and it isn't straightforward to scrape them automatically.
  2. Collections use infinite scroll (i.e. lazy loading) in their lists, so it would be inefficient to go back and forth between a collection and an item. After receiving the collection URL list, the script (or a separate script) should build an item URL list from all of the collections.
  3. Now the program goes through each of the item URLs and saves the item data to the database.
  4. The requirements for the database are simple, so many options are reasonable. I'd recommend Amazon RDS (my personal preference is PostgreSQL, but any engine will do) with a single table, i.e. denormalized, since it's easier to develop than a normalized multi-table solution; the benefits of normalization don't apply here, and neither do the benefits of a NoSQL alternative like DynamoDB. A minimal schema sketch follows this list.
  5. You should save to CSV/JSON only via a script that reads the database, instead of the scraping script writing to CSV/JSON in addition to saving to the database (there are many, many reasons why; I could expand if you'd like).
  6. There are two main options for saving the images: either to file storage (Amazon S3) or to the database itself. There are pros and cons to each, but using file storage is more conventional (and most likely cheaper, too).
  7. The things that make this more costly to implement are (1) saving the images and, to some extent, (2) saving info for each color (all of the info, e.g. price, description, etc., except for the product images and sizes, is the same for each color), as this requires additional interaction.
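A minimal, denormalized table along those lines might look like the following. This is only a sketch: the columns mirror the fields Edwin listed, the connection parameters are placeholders, and a PostgreSQL driver such as psycopg2 would need to be installed.

```python
import psycopg2  # assumes the psycopg2 PostgreSQL driver is installed

DDL = """
CREATE TABLE IF NOT EXISTS products (
    id          SERIAL PRIMARY KEY,
    style_no    TEXT NOT NULL,
    name        TEXT,
    category    TEXT,
    price_sgd   NUMERIC,
    color       TEXT,
    design      TEXT,
    details     TEXT,
    photo_urls  TEXT[],              -- or a JSONB column
    scraped_at  TIMESTAMPTZ DEFAULT now()
);
"""

# Placeholder connection parameters for the Amazon RDS instance.
conn = psycopg2.connect(host="your-rds-endpoint", dbname="products",
                        user="scraper", password="change-me")
with conn, conn.cursor() as cur:
    cur.execute(DDL)
```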

Possible steps could be (the bounty award estimates are very approximate and subjective; keep in mind that a bounty award is not guaranteed for the developers, so this route may be more expensive than simply finding someone who could do it for you contractually):

  1. Given a list of collection URLs, scrape all of the item links in all of the collections ($25-50)
  2. Create an SQL table (Amazon RDS); then, given a list of item URLs, scrape the text data (price, description, etc.) into it ($50-100)
  3. Additionally, save images and the different color variants ($50-100)
  4. Keep the items in the database updated (add new / update existing / delete removed) ($25-50)

P.S. Also, the site explicitly prohibits scraping in its Terms of Service, so the owner of the AWS server might be held responsible. At the very least, I'd recommend changing the User-Agent header in the requests made and spreading the requests out over time.
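In practice that could be as simple as the following, assuming the `requests` library; the User-Agent string, URLs, and delay range are arbitrary examples:

```python
import random
import time

import requests

HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; product-research-bot)"}  # example UA
item_urls = ["https://example.com/product/1", "https://example.com/product/2"]  # placeholders

for url in item_urls:
    response = requests.get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()
    # ... parse response.text here ...
    time.sleep(random.uniform(2, 6))  # spread requests out over time
```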

Thanks Prometheus, the advice is detailed, insightful, and appreciated.
Edwin824 7 months ago