Small Python Script

I need a Python script which will:

  1. Loop through all HTML files in a target local folder recursively
  2. Fetch a specific HTML tag's value (e.g. an image URL)
  3. Download this image and then upload it to a specific Aliyun OSS bucket + path
  4. If an image download fails, retry several times until the upload succeeds
  5. The image file name should be saved to bucket using the value fetched from another tag in each HTML file
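
Steps 1, 2 and 5 can be sketched with the standard library alone. This is a minimal illustration, not the winning script: it assumes the two values are the text content of elements literally named productImage and random (the sample file is not reproduced here, so the real markup may differ), and the names TagTextParser, find_html_files and extract_tags are made up for this example.

```python
import os
from html.parser import HTMLParser


class TagTextParser(HTMLParser):
    """Collect the text of the first occurrence of each wanted tag."""

    def __init__(self, wanted):
        super().__init__()
        # html.parser reports tag names in lowercase, so compare in lowercase.
        self.wanted = {name.lower() for name in wanted}
        self.values = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted and tag not in self.values:
            self._current = tag
            self.values[tag] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.values[self._current] += data

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None


def find_html_files(root):
    """Yield every .html/.htm file under root, including subfolders."""
    for folder, _subdirs, files in os.walk(root):
        for name in sorted(files):
            if name.lower().endswith((".html", ".htm")):
                yield os.path.join(folder, name)


def extract_tags(html_text, wanted=("productImage", "random")):
    """Return {tag name: stripped text, or None if the tag is absent}."""
    parser = TagTextParser(wanted)
    parser.feed(html_text)
    parser.close()
    result = {}
    for name in wanted:
        value = parser.values.get(name.lower())
        result[name] = value.strip() if value is not None else None
    return result
```

A caller would loop over find_html_files(folder), read each file, and pass its text to extract_tags; a None value signals that the file lacks one of the tags.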

e.g:

original file name fetched from 'productImage' tag: some-image-name.jpg

saved file name using value fetched from 'random' tag: 28072278449636322137035947796046855471020161205142418.jpg

I have attached below a sample HTML file.

https://yolobit.com/v/f3166f

Thanks!

Can you mount an "Aliyun OSS bucket" via FUSE? This would mean you can treat it as a normal filesystem.
slang800 10 months ago
awarded to tomtoump
Tags
python


1 Solution

Winning solution

Check this out.

I ran the script and after some files it encountered an error and stopped running: 'NoneType' object has no attribute 'string'
user856 10 months ago
Can you have the script skip any problematic files and continue running until complete? Also is it possible to add a concurrency parameter to both this script and the other script? Because I have lots of files to loop through and it's really slow otherwise. Thanks!
user856 10 months ago
Also, I notice that the uploaded image file name doesn't include the file extension, such as .png or .jpg. I need the file name to keep the original extension, otherwise the images won't load properly.
user856 10 months ago
I've added some checks, and it should skip any errors that occur. Can you test it?
tomtoump 10 months ago
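
The "'NoneType' object has no attribute 'string'" error is typically what happens when a tag lookup finds nothing, returns None, and the code then reads an attribute of that result. The thread does not show which checks were actually added; one possible shape, with the hypothetical helpers require_values and process_all, is to validate the extracted values and wrap each file in its own try/except so a bad file is logged and skipped:

```python
def require_values(values):
    """Raise ValueError if any expected tag value is missing or empty."""
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise ValueError("missing tag value(s): " + ", ".join(missing))


def process_all(paths, process_one, log=print):
    """Call process_one(path) for each path; log and skip files that fail."""
    done, skipped = [], []
    for path in paths:
        try:
            process_one(path)
        except Exception as exc:  # broad on purpose: one bad file must not stop the run
            log(f"File '{path}' failed ({exc!r}) -> Skip")
            skipped.append(path)
        else:
            done.append(path)
    return done, skipped
```

Returning the skipped list makes it easy to print a summary at the end or to retry only those files.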
Also I get the extension from the url and append it to the random tag value.
tomtoump 10 months ago
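
Taking the extension from the URL and appending it to the random tag value could look roughly like the sketch below. The function name build_object_name and the prefix argument are assumptions for illustration; the extension is read from the URL path only, so a query string does not end up in the file name.

```python
import posixpath
from urllib.parse import urlsplit


def build_object_name(image_url, random_value, prefix=""):
    """Combine the 'random' tag value with the extension of the image URL."""
    url_path = urlsplit(image_url).path           # drops ?query and #fragment
    extension = posixpath.splitext(url_path)[1]   # e.g. ".jpg", or "" if none
    return f"{prefix}{random_value}{extension}"
```

For example, an image URL ending in some-image-name.jpg and the random value from the bounty description would give 28072278449636322137035947796046855471020161205142418.jpg.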
Are you able to add a concurrency parameter/function so that images can be uploaded faster?
user856 10 months ago
I'll look into it. Other than that, does it work as expected?
tomtoump 10 months ago
The script is currently running, but I saw several files being skipped with this message: "File './2.html' doesn't contain a url and name -> Skip". But when I opened the file, it does contain an image URL and a random tag. Can you check this attached file? https://yolobit.com/v/158514
user856 10 months ago
Does this file throw an error for you? Worked fine for me.
tomtoump 10 months ago
I checked the upload directory and an image with that file's 'random' tag value doesn't exist, so that means that file's image wasn't uploaded to the bucket for some reason
user856 10 months ago
Could you create a pastebin with the output of the script?
tomtoump 10 months ago
During the script run I saw several other files that were skipped even though they contain the proper tags. Is it something to do with the file parsing? Also, what do you mean by the output of the script? Are you referring to each row of output the script prints in the terminal window?
user856 10 months ago
I'm not sure what's wrong, because even this file that fails for you works fine on my computer. Yes, I mean that output, so I can check the error messages.
tomtoump 10 months ago
I'm not sure if this is what you need, but I pasted a portion of the output here: http://pastebin.com/bwCcp5Hf If the above issue is just some weird occurrence, then I can just run the script again. Now I just need two things for the script:
user856 10 months ago
  1. If an image was successfully uploaded, write back the new uploaded image file path + image name to the productImage tag and replace the old image link value. So that the old image link is replaced with the new one
  2. Concurrency parameter
user856 10 months ago
Can you upload Twinings-of-London-Th--Rooibos-Platinum---Golden-Caramel.html and Twinings-of-London-Infusions-Verveine---Orange.html and send me a link?
tomtoump 10 months ago
Please do not change the bounty requirements along the way, or offer a tip for any additional work.
tomtoump 10 months ago
Here is the link to the two files: https://yolobit.com/v/5150f8
user856 10 months ago
Check the latest version. The original HTML file is updated with the uploaded file's url. Also there is a check to skip the file if it is already processed.
tomtoump 10 months ago
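
The write-back and the "already processed" check might be sketched as follows. How the actual script detects a processed file is not shown in the thread; this example assumes a file counts as processed when its image URL already starts with the bucket's base URL. Both function names and the base URL are hypothetical, and the replacement is a plain string substitution, so every occurrence of the old URL in the file is changed, not only the one inside productImage.

```python
def already_processed(image_url, bucket_base_url):
    """True if the image URL already points into the bucket (assumed convention)."""
    return image_url.startswith(bucket_base_url)


def write_back_url(path, old_url, new_url, encoding="utf-8"):
    """Replace old_url with new_url in the file; return False if old_url is absent."""
    # newline="" keeps the file's original line endings untouched.
    with open(path, encoding=encoding, newline="") as fh:
        text = fh.read()
    if old_url not in text:
        return False
    with open(path, "w", encoding=encoding, newline="") as fh:
        fh.write(text.replace(old_url, new_url))
    return True
```

Writing back only after a confirmed upload keeps the HTML file and the bucket in step, which is what lets a second run resume where the first one stopped.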
Were you able to add the concurrency function as well?
user856 10 months ago
Also the current script doesn't replace the old image URL in tag 'productImage' with the new image URL
user856 10 months ago
Can you check again regarding url replacement?
tomtoump 10 months ago
The script now is able to write back the URL, but after a while the script encountered this message: 'File 'image-products/26031204430825382116736883682875243617320161205142305.jpg' doesn't exist. -> Uploading ... {'status': -2, 'details': "RequestError: HTTPConnectionPool(host='image-eumt.oss-cn-beijing.aliyuncs.com', port=80): Read timed out. (read timeout=60)"}'
user856 10 months ago
And then the script stopped. Is this because the script finished uploading all images or because the script encountered the problem and it stopped running?
user856 10 months ago
Also can you please add the concurrency feature? This way I can quickly test all files in a relatively short time and report any issues.
user856 10 months ago
It seems like it encountered an error during the upload. The thing is that if you run the script again it will continue where it left off. If it detects that the url is already changed in the html file, it skips it.
tomtoump 10 months ago
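
A transient failure such as the read timeout above is usually handled by retrying the download or upload a few times before giving up on that file. The helper below is a generic sketch, not the script's own code: with_retries and its parameters are made-up names, and it assumes the wrapped call raises an exception on failure (a call that returns a status dictionary, as in the message above, would first need to be turned into an exception).

```python
import time


def with_retries(action, attempts=3, delay=1.0, backoff=2.0, sleep=time.sleep):
    """Call action() until it succeeds or attempts run out; re-raise the last error."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: let the caller log and skip the file
            sleep(delay)      # wait before the next try
            delay *= backoff  # and wait longer each time
```

A call would wrap the upload in a lambda or functools.partial, for example with_retries(lambda: upload(path), attempts=5), where upload stands in for whatever function performs the transfer.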
The concurrency feature needs quite a lot of changes in the current script, and I think it is out of scope for the current bounty.
tomtoump 10 months ago
Is there any way you can add the concurrency feature if I increase the bounty level or send you a tip? :)
user856 10 months ago
Sure. Have you tried running the script again? Did it throw the same error?
tomtoump 10 months ago
Are you able to have the script just continue running until the end even if it encounters an error on a file? I will need the script to run as a regular cron job, and if it stops each time on a problematic file then it can never finish in the background.
user856 10 months ago
Please test the updated script.
tomtoump 10 months ago
When I run the script with the parameter '--threads 20', I get this error: 'range() integer end argument expected, got str.'
user856 10 months ago
Forgot to set the type of the threads parameter. Should work fine now.
tomtoump 10 months ago
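
Command-line arguments arrive as strings, so a thread count passed to range() fails unless it is converted first; with argparse that is done by declaring type=int. The sketch below shows that fix together with one way to run the per-file work concurrently. It uses concurrent.futures.ThreadPoolExecutor rather than whatever threading code the actual script contains, and parse_args, run_concurrently and the folder argument are names chosen for this example.

```python
import argparse
from concurrent.futures import ThreadPoolExecutor


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Upload images referenced in HTML files")
    parser.add_argument("folder", help="folder to scan for HTML files")
    # type=int converts the command-line string, so "20" becomes the integer 20.
    parser.add_argument("--threads", type=int, default=1, help="number of worker threads")
    return parser.parse_args(argv)


def run_concurrently(paths, process_one, threads):
    """Run process_one over paths with a pool of worker threads, keeping input order."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # list() forces all results; an exception in process_one is re-raised here,
        # so process_one should catch and log its own per-file errors.
        return list(pool.map(process_one, paths))
```

Threads suit this job because the work is dominated by network waits for downloads and uploads rather than by CPU time.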
Hey, thanks a lot! The script is working great!
user856 10 months ago