wget command/script to download URLs from an XML file

I need a wget command or script that will download, as static HTML files, all of the pages linked in an XML sitemap, saving each one with the final portion of its URL as the filename.

e.g.:

Download as static HTML 'http://mydomain.com/printers/some-post-title'

Save that as some-post-title.html

This command or script should be flexible enough so that only the final portion of the URL is used:

e.g.:

'http://mydomain.com/printers/some-printer-title' -> some-printer-title.html
'http://mydomain.com/desktops/some-desktop-title' -> some-desktop-title.html

This command or script will need to run from a cronjob, downloading a very large number of pages on a periodic basis, so it should be able to handle simultaneous downloads as fast as possible.

I'm willing to tip for the best and FASTEST download solution to this issue.
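In other words, the naming rule is just the last path segment plus .html, roughly what basename would give you (illustration only):

url='http://mydomain.com/printers/some-printer-title'
echo "$(basename "$url").html"   # -> some-printer-title.html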

Thanks

awarded to tomtoump
Tags
xml
wget


3 Solutions


Hi,

You can try the following command:

wget -q http://www.example.com/sitemap.xml -O - | egrep -o "http://www\.example\.com[^<]+" | wget -q -i - --wait 1

First it downloads your sitemap from the given URL, then greps all the links inside it, and finally downloads them using wget. You can remove the "--wait 1" part to improve speed, but that will generate a lot of requests to your site in a short time, so be careful with it.
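If you want to run it from cron, the same pipeline can be wrapped in a small script, for example (the sitemap URL and output directory here are placeholders you would need to change):

#!/bin/sh
# Fetch the sitemap, extract the page URLs and download them as static files.
# SITEMAP and OUTDIR are placeholders -- adjust them for your own site.
SITEMAP="http://www.example.com/sitemap.xml"
OUTDIR="/var/www/static"
mkdir -p "$OUTDIR"
wget -q "$SITEMAP" -O - \
  | egrep -o "http://www\.example\.com[^<]+" \
  | wget -q -i - -E -P "$OUTDIR" --wait 1

The -E option makes wget add a .html extension to pages served as text/html, and -P collects everything in one directory.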

But does this name the output HTML file after the actual page's title?
user0809 6 years ago
I'll have to test it tonight; as I'm at work, I can't wget files. Do you have a sitemap you can e-mail me? Then I will improve this script.
kerncy 6 years ago
I'm afraid I can't but it's a pretty regular XML sitemap, with just a whole list of links.
user0809 6 years ago
Winning solution

Here you are:

wget -q sitemap.xml -O - | \
egrep -o "<loc>[^<>]*</loc>" | sed -e 's:</*loc>::g' | \
parallel -j 100 wget -q {} -O {/}.html
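A note on the Parallel replacement strings: {} is the whole input line (the URL) and {/} is its basename, i.e. the last path segment, which is what produces the some-post-title.html naming. For example:

echo 'http://mydomain.com/printers/some-printer-title' | parallel echo {/}.html
# prints: some-printer-title.html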
What is the urls.txt? Is that something I need to manually create?
user0809 6 years ago
No, it's generated in the second line.
tomtoump 6 years ago
Is it possible to add a parallel option to this script to speed up the download?
user0809 6 years ago
Hey, I see you updated the answer but I don't see any major differences. Is it possible to add a parallel download option so it downloads multiple links at once? Thanks
user0809 6 years ago
Check again. I used GNU Parallel. You can change the number of parallel jobs by editing the number 100.
tomtoump 6 years ago
I get this error when trying:

root@cloud-server-07:/var/temp# wget http://mytestdomain.com/sitemap.xml -O - | \
egrep -o "[^<>]" | sed -e 's:</loc>::g' | \
parallel -j 100 wget -q {} -O {/}.html

--2015-02-16 13:12:36-- http://mytestdomain.com/sitemap.xml
Resolving mytestdomain.com (mytestdomain.com)... 119.9.71.10
Connecting to mytestdomain.com (mytestdomain.com)|119.9.71.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/xml]
Saving to: `STDOUT'
[ <=> ] 120,155 --.-K/s in 0.003s
Cannot write to `-' (Broken pipe).
user0809 6 years ago
wget -q sitemap.xml -O - | egrep -o "[^<>]" | sed -e 's:</loc>::g' | parallel -j 100 wget -q {} -O {/}.html
tomtoump 6 years ago
Copy it from here and only change sitemap.xml.
tomtoump 6 years ago
Same issue
user0809 6 years ago
Cannot replicate. Tried with sitemaps from known sites and had no issue. Can you try with http://www.bbc.co.uk/news/sitemap.xml?
tomtoump 6 years ago
You can try here with a demo XML sitemap I have: https://drive.google.com/file/d/0B6h9HPRdfghjajlPbmZBdm1Xem8/view?usp=sharing
user0809 6 years ago
wget -q sitemap.xml -O - | egrep -o "<loc>[^<>]*</loc>" | sed -e 's:</*loc>::g' | parallel -j 100 wget -q {} -O {/}.html
tomtoump 6 years ago
I didn't use backticks before and some code was missing.
tomtoump 6 years ago
Resolving mytestdomain.com (mytestdomain.com)... 119.9.71.11
Connecting to mytestdomain.com (mytestdomain.com)|119.9.71.11|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/xml]
Saving to: `STDOUT'
[ <=> ] 122,179 --.-K/s in 0.004s
Cannot write to `-' (Broken pipe).
user0809 6 years ago
I uploaded the sitemap you sent me to Dropbox. Try wget -q https://dl.dropboxusercontent.com/u/55365/sitemap.xml -O - | egrep -o "<loc>[^<>]*</loc>" | sed -e 's:</*loc>::g' | parallel -j 100 wget -q {} -O {/}.html
tomtoump 6 years ago
Have a look at the screenshot, this is what I see on my end: https://drive.google.com/file/d/0B6h9HPRdfghjVENrMDZyTXVoT2M/view?usp=sharing
user0809 6 years ago
It seems to be a server-related issue that I can't help you with without access to the server. If possible, check the script on another computer.
tomtoump 6 years ago
I can give you access to the server it's not an issue since it's a test server. How do you want me to give you access credentials securely?
user0809 6 years ago
Send them to the email in my profile.
tomtoump 6 years ago

You can try this:

wget --quiet http://www.bbc.co.uk/news/sitemap.xml --output-document - | grep -E -o "http://www\.bbc\.co\.uk/[^<]+" | wget -q -i - --random-wait -E -e robots=off -U mozilla

Some notes about the last portion, which actually downloads the HTML pages:

  • it runs quietly (-q)
  • it forces the saved pages to have a .html extension (-E)
  • it waits a random interval between requests (--random-wait)
  • it ignores robots.txt (-e robots=off)
  • it pretends to be a browser (-U mozilla)

For parallelism you can use GNU Parallel (https://www.gnu.org/software/parallel/) like this:

wget --quiet http://www.bbc.co.uk/news/sitemap.xml --output-document - | parallel --pipe -L10 grep -E -o "http://www\.bbc\.co\.uk/[^<]+" | wget -i - --random-wait -E -e robots=off -U mozilla

This will do processing in chunks of 10 lines.
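Since this needs to run from cron and re-download the pages periodically, a crontab entry wrapping a command like the above could look as follows (the schedule, sitemap URL and output directory are only examples):

# Re-fetch every page in the sitemap once an hour into /var/www/static
0 * * * * cd /var/www/static && wget -qO - http://www.example.com/sitemap.xml | grep -E -o "http://www\.example\.com/[^<]+" | wget -q -i - -E -e robots=off -U mozilla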

UPDATE:

I finally got to my machine and managed to test it. Here's the result:

wget -qO - http://www.bbc.co.uk/news/sitemap.xml | parallel --pipe -j4 --round-robin --ungroup 'grep -E -o "http://www\.bbc\.co\.uk/[^<]+" | wget -q -i - -E'

Note the single quotes.

-j4 stands for the number of concurrent jobs.

You can check that the jobs are indeed running with

ps ax | grep wget
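To get a feel for what --pipe and --round-robin do with the input, here is a standalone illustration (nothing to do with the sitemap itself, just seq piped through wc):

seq 100000 | parallel --pipe --block 10k wc -l
# one wc -l per 10k block: many separate counts
seq 100000 | parallel --pipe --block 10k --round-robin -j4 wc -l
# only 4 wc -l jobs; the blocks are spread between them, so 4 counts summing to 100000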
I get:

parallel: invalid option -- '-'
-: Invalid URL
parallel [OPTIONS] command -- arguments: Scheme missing
-: Invalid URL
for each argument, run command with argument, in parallel: Scheme missing
-: Invalid URL
parallel [OPTIONS] -- commands: Scheme missing
-: Invalid URL
run specified commands in parallel: Scheme missing
No URLs found in -.
user0809 6 years ago
What is the 'round robin' parameter for?
user0809 6 years ago
Without --round-robin GNU Parallel will start a command per block; with --round-robin only the requested number of jobs will be started (--jobs), and the records will then be distributed between the running jobs.
dekkard 6 years ago
Normally --pipe will give a single block to each instance of the command. With --round-robin all blocks will at random be written to commands already running. This is useful if the command takes a long time to initialize.
dekkard 6 years ago
Glad to help you and thanks for the tip. Although it's quite strange to see you pick another solution if mine actually solved your issue...
dekkard 6 years ago