I'm looking for a script for the following process:

  1. The script is pointed at a list of URLs
  2. For each URL, the site is visited and the web page is saved (HTML and associated files).

That's basically it. I've been trying to get a few different tools to work, but to no avail. The main challenge is that some of the sites I'm trying to save are built with React, and they treat the request as coming from a browser with JavaScript disabled.

Any pointers or help on this would be appreciated.

Hi, have you ever used the JSoup library for web scraping in Java? On its own it has the same issue you mentioned; I've used it in the past and ran into the same problem. Pairing JSoup with HtmlUnit's WebClient seems to be the solution.
SilverHood Apps 3 years ago
In the above example, two libraries are used:
1. HtmlUnit - This simulates a web browser using the WebClient class.
2. JSoup - This is used to extract the web page and parse it if required.
In your case, you might just extract and save the page for each corresponding URL. P.S. It needs the NetBeans IDE.
SilverHood Apps 3 years ago
Hi, I've successfully created a working prototype using the said libraries that extracts dynamic web pages.
Kindly get in touch via my page/account for its implementation, as it will require heavy interaction.
SilverHood Apps 3 years ago
Well, this would be easy for me to do with Python. Is this question still open? Is it OK to use Selenium? Basically, you have to simulate the web browser so that the full page gets processed, then extract all the HTML and the references in it to download all the content. On top of that, the program has to make those requests in parallel.
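A minimal sketch of that approach in Python, not a finished tool. It assumes Selenium 4 with a local Chrome install; the names `collect_assets`, `render_page`, `download`, and `save_site` are made up for illustration. Selenium is imported inside `render_page` so the parsing helper can be used without it.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class AssetCollector(HTMLParser):
    """Collects absolute URLs of scripts, images, and stylesheets."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "img"):
            ref = attrs.get("src")
        elif tag == "link" and "stylesheet" in (attrs.get("rel") or ""):
            ref = attrs.get("href")
        else:
            return
        if ref:
            # Resolve relative references against the page URL.
            self.assets.append(urljoin(self.base_url, ref))


def collect_assets(html, base_url):
    """Return the asset URLs referenced by the page, in document order."""
    parser = AssetCollector(base_url)
    parser.feed(html)
    return parser.assets


def render_page(url):
    """Load the URL in headless Chrome and return the rendered HTML."""
    from selenium import webdriver  # assumption: Selenium 4 is installed

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()


def download(url, dest_dir):
    """Save one asset under dest_dir, named after the last path segment."""
    name = Path(urlparse(url).path).name or "index"
    target = Path(dest_dir) / name
    with urlopen(url, timeout=30) as resp:
        target.write_bytes(resp.read())
    return target


def save_site(url, dest_dir):
    """Render the page, save its HTML, then fetch its assets in parallel."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    html = render_page(url)
    (dest / "index.html").write_text(html, encoding="utf-8")
    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(lambda u: download(u, dest), collect_assets(html, url)))
```

Calling `save_site(url, some_directory)` for each entry in the URL list would cover the two steps in the question. This sketch cuts several corners: `driver.get` returns once the page has loaded, so a React app that fetches data afterwards may need an explicit wait, assets with the same file name overwrite each other, and a single failed download aborts the run.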
romelgomez 3 years ago
Do you want the page links (scripts, styles, etc.) to point to their original URLs or to a relative path once saved? For example, if there's a <script src="">, do you want these links to become <script src="/js/file.js"> or leave them pointing to their original URLs (absolute paths)?
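To make the two options concrete, here is a rough Python sketch that supports both. The function names and the `keep_original` flag are hypothetical, and the regex-based rewriting is a simplification that only touches `<script>`, `<img>`, and `<link>` tags with quoted attributes; a real HTML parser would be more robust.

```python
import re
from urllib.parse import urljoin, urlparse

# Only asset-bearing tags are rewritten; <a href> links are left alone.
TAG_RE = re.compile(r"<(?:script|img|link)\b[^>]*>", re.IGNORECASE)
ATTR_RE = re.compile(r'(?<![\w-])(src|href)=(["\'])(.*?)\2', re.IGNORECASE)


def local_path_for(url):
    """Map an absolute URL to a relative path (query strings are ignored)."""
    parsed = urlparse(url)
    path = parsed.path.lstrip("/") or "index.html"
    return f"{parsed.netloc}/{path}"


def rewrite_links(html, base_url, keep_original=False):
    """Point asset links at absolute URLs or at local relative paths."""

    def fix_attr(match):
        attr, quote, ref = match.groups()
        if not ref:
            return match.group(0)  # leave empty attributes untouched
        absolute = urljoin(base_url, ref)
        if urlparse(absolute).scheme not in ("http", "https"):
            return match.group(0)  # e.g. data: URIs stay as they are
        new_ref = absolute if keep_original else local_path_for(absolute)
        return f"{attr}={quote}{new_ref}{quote}"

    return TAG_RE.sub(lambda tag: ATTR_RE.sub(fix_attr, tag.group(0)), html)
```

With `keep_original=True` the saved page still loads its assets from the live site; with the default, it expects them in a directory named after the host, next to the saved HTML file.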
kostasx 3 years ago
This seems to be a new trend here: a new account posts a bounty and then vanishes. I wonder what you achieve by that.
SilverHood Apps 3 years ago

