Look for URLs in a list of domains

Hey,

I want a Python (2.7) script that, given a wordlist of domains (python script.py domains.txt), fetches the HTML source of each domain's page and grabs all URLs found in that code, then finds all JS files on that same domain and grabs all URLs from those JS files as well. It should do this for every domain in the wordlist.
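To give an idea of what I mean, here is a rough sketch of the kind of script I have in mind (Python 2.7, standard library only; the regexes and helper names are just illustrative, not a spec):

# Rough sketch only: read domains.txt, grab URLs from each domain's HTML,
# then grab URLs from every JS file hosted on that same domain.
import re
import sys
import urllib2
from urlparse import urljoin, urlparse

URL_RE = re.compile(r'https?://[^\s\'"<>]+')                 # crude absolute-URL matcher
JS_RE = re.compile(r'src=["\']?([^"\'\s>]+\.js)', re.I)      # script src attributes

def fetch(url):
    try:
        return urllib2.urlopen(url, timeout=10).read()
    except Exception:
        return ''

def urls_for_domain(domain):
    base = 'http://%s' % domain
    html = fetch(base)
    found = set(URL_RE.findall(html))
    for js_path in JS_RE.findall(html):
        js_url = urljoin(base, js_path)
        if urlparse(js_url).netloc.endswith(domain):          # only JS files on the same domain
            found.update(URL_RE.findall(fetch(js_url)))
    return found

if __name__ == '__main__':
    with open(sys.argv[1]) as f:                              # python script.py domains.txt
        for line in f:
            domain = line.strip()
            if domain:
                for url in sorted(urls_for_domain(domain)):
                    print '%s\t%s' % (domain, url)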

Please share the results with me via a private URL on GitHub, then remove the URL after I download it. Without that condition, a bounty won't be tipped.

Thanks!

Hey, I have completed the bounty and written the script, but I don't have a paid GitHub account to host a private project. Please help me out here.
grajat90 4 months ago
I have sent the email to your address. Thanks.
grajat90 4 months ago
Hey privcrawler, I discovered some problems in my code and I have corrected them. I am sending you another mail with the subject "py 2" and the corrected code attached as script2.py. Thanks!
grajat90 4 months ago
Do you want links to webpages, or to images, videos, etc., or both?
gurditsbedi 4 months ago
awarded to Codeword


2 Solutions


Solution provided privately.

USAGE:
Simple Python script execution.

Thanks a lot.

Tested it, but unfortunately it's not the solution I'm looking for. I'm sorry.
privcrawler 4 months ago
Winning solution

Step 1. Open the main.py file and set the URL you want to crawl.

Step 2. The program will automatically create a folder for each base URL.

Step 3. Inside the folder there will be two text files, queue.txt and crawled.txt.
crawled.txt will hold a list of all crawled links, while queue.txt will hold the list of URLs still in the queue.

Alternatively, if you want specific URLs to be crawled, you can place them manually in queue.txt, one per line, and run the script; the program will then pick the URLs from queue.txt and start crawling from there.
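To make the file handling concrete, here is a simplified sketch of how the two files are read and written (not the actual code in main.py; the helper names are only for illustration):

# Simplified sketch of the queue.txt / crawled.txt handling (Python 2.7).
# Helper names are illustrative, not the actual functions in main.py.
import os

def read_urls(path):
    # return the URLs stored in a file, one per line, as a set
    if not os.path.isfile(path):
        return set()
    with open(path) as f:
        return set(line.strip() for line in f if line.strip())

def write_urls(path, urls):
    # overwrite a file with one URL per line
    with open(path, 'w') as f:
        for url in sorted(urls):
            f.write(url + '\n')

if __name__ == '__main__':
    # manual seeding: put URLs in queue.txt one per line before running the script
    queue = read_urls('queue.txt')
    crawled = read_urls('crawled.txt')
    print '%d queued, %d crawled' % (len(queue), len(crawled))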

Thank you, hope it helps. Test it for yourself.

I've mailed you with subject - python 2.7 web crawler
from query.techengine@gmail.com

Your solution is close to what I want, but it seems you didn't get exactly what I'm looking for. What do you mean by queue.txt holding URLs in queue? If that is the case, won't it keep crawling forever without a stopping condition?
privcrawler 4 months ago
That's right, the program will not crawl any website other than the one you provide. It works like this: suppose the website you want to crawl is http://example.com. You set the homepage variable in main.py as HOMEPAGE = 'http://www.example.com'. When the script starts for the first time, it creates a folder named codeword and the two files queue.txt and crawled.txt, and it places "http://www.example.com" in queue.txt, so http://www.example.com acts as the start point for crawling the site. The program first crawls http://www.example.com, gets all URLs from there and places them in queue.txt (to be crawled further), then deletes "http://www.example.com" from the queue
Codeword 4 months ago
so as to ensure the same URL is not crawled twice, and adds the crawled URL, in this case "http://www.example.com", to crawled.txt. This way, at the end you get a list of all crawled URLs in crawled.txt. I have also included a condition so that it won't crawl the whole internet, only the website URL you provide. The script also works in multi-threaded mode, so it is fast. Hope it helps, thank you. Any other query is welcome. Thanks once again.
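In condensed form, the cycle I am describing looks roughly like this (a single-threaded sketch only, not the actual main.py; the real script also persists the two sets to queue.txt and crawled.txt):

# Condensed, single-threaded sketch of the queue -> crawled cycle (Python 2.7).
import re
import urllib2

HOMEPAGE = 'http://www.example.com'            # start point, as set in main.py
URL_RE = re.compile(r'https?://[^\s\'"<>]+')

def crawl_site(homepage):
    queue = set([homepage])                    # waiting URLs (queue.txt)
    crawled = set()                            # finished URLs (crawled.txt)
    while queue:
        url = queue.pop()                      # take one URL out of the queue
        try:
            page = urllib2.urlopen(url, timeout=10).read()
        except Exception:
            page = ''
        for link in URL_RE.findall(page):
            # keep only links on the same site, never re-queue a crawled URL
            if link.startswith(homepage) and link not in crawled:
                queue.add(link)
        crawled.add(url)                       # the URL moves from queue to crawled
    return crawled

if __name__ == '__main__':
    for u in sorted(crawl_site(HOMEPAGE)):
        print u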
Codeword 4 months ago
queue.txt is only a list of waiting URLs that the program encounters while it crawls. Out of these, the program picks only the valid URLs related to the website and puts the crawled (validated) URLs in the crawled.txt file, which is the final output.
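The validation itself can be as simple as comparing the host of each encountered URL with the host of the start URL, something like this (illustrative only; the actual check in the script may differ):

# Illustrative same-site check (Python 2.7); the actual validation in the script may differ.
from urlparse import urlparse

def is_same_site(url, homepage):
    # True if url points at the same host as homepage
    return urlparse(url).netloc == urlparse(homepage).netloc

if __name__ == '__main__':
    print is_same_site('http://www.example.com/about', 'http://www.example.com')   # True
    print is_same_site('http://other.com/page', 'http://www.example.com')          # False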
Codeword 4 months ago
If all the encountered URLs belong to the site, then at the end queue.txt will be blank, while crawled.txt will have all the URLs you want.
Codeword 4 months ago