How to search tons of HTML

I am working on a project where we crawled a bunch of specific sites and saved their page HTML. We have several million pages stored in some bulky JSON Lines files (the output of the crawlers). There are maybe 2,000 of these .jl files, which is about 700 GB of data.

Now it's time to query the pages looking for specific keywords, phrases, and even markup snippets in some cases. My original thought was to load it all into Elasticsearch and then just start running queries. Not a bad idea, but I'm worried it won't be feasible without a small army of nodes. I'm looking to see if anyone here has a better idea or tech/tools to handle this.

From a machine perspective, I'm open to any single machine AWS offers. Loading the data into XYZ should take no more than 6-8 hours, and ideally query results can come back in less than 30 minutes per query (the point here is that instant Elasticsearch-style results would be nice but aren't needed).

Winning idea with specific details wins the bounty. Ask any question needed to come up with the idea!

Those markup snippets, how do you imagine they will be queried? Do they match the scenario I've described in the second part of my answer?
julianobs 5 years ago
awarded to Be The Match


5 Solutions


There are several options worth trying.


Hello Qdev,

The methods mentioned by the previous solution are all very performant when searching through text, but with the amount of data you have, I think you will never achieve the performance you want. The only way I see is using a NoSQL database system.

You need to import your data into the database, which will take a while. But after you have done that, searches will go really fast. I would advise MongoDB: it's free, it does amazingly fast selects on the data, and it can import JSON data right away with the command-line tool mongoimport.

http://stackoverflow.com/questions/10999023/proper-way-to-import-json-file-to-mongo
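A sketch of what that import could look like, assuming one JSON document per line (the default mode mongoimport expects for .jl files); the database, collection, field, and file names are placeholders for whatever your setup uses:

```shell
# Import every JSON Lines file; mongoimport reads one document per line by default.
for f in crawl-*.jl; do
    mongoimport --db crawl --collection pages --file "$f"
done

# Then, in the mongo shell, add a text index so keyword searches don't scan
# every document ("html" stands in for whatever field holds your page content):
#   db.pages.createIndex({ html: "text" })
#   db.pages.find({ $text: { $search: "\"lorem ipsum\"" } })
```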


I suggest using Solr; it's fairly quick to set up and supports regex queries as well as long-running queries and commits.

Solr requires spinning-disk instances to have as much RAM as the size of the database index. For 700 GB of data, that can be a lot for a single instance; in that case Solr offers sharding and distributed search. This way you can adjust how fast you need your results versus how much you want to spend on AWS.
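For the query side, a minimal sketch of building a phrase query against Solr's standard select endpoint using only Python's standard library; the core name (`pages`) and field name (`body`) are assumptions about your schema, not anything Solr prescribes:

```python
from urllib.parse import urlencode

def solr_query_url(base, core, field, phrase, rows=10):
    """Build a Solr select URL for an exact-phrase query."""
    params = urlencode({"q": f'{field}:"{phrase}"', "rows": rows, "wt": "json"})
    return f"{base}/solr/{core}/select?{params}"

url = solr_query_url("http://localhost:8983", "pages", "body", "lorem ipsum")
print(url)
# -> http://localhost:8983/solr/pages/select?q=body%3A%22lorem+ipsum%22&rows=10&wt=json
```

The returned URL can then be fetched with `urllib.request.urlopen`; the same endpoint also accepts Lucene regex queries of the form `field:/pattern/`.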

If you plan on searching documents for sentences or exact word sequences, you might want to parse the HTML and extract just the text. It helps when there are things like "lorem <b>ipsum</b> <small>dolor</small> sit amet". For parsing HTML and extracting text I generally use either jquery with Node.js or CsQuery with C#.
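A minimal sketch of that text extraction using only Python's standard-library `html.parser` module (the sample markup below is the "lorem ipsum" fragment from above; a production parser would need to handle more edge cases):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the visible text of a page, skipping script/style contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # >0 while inside a script/style element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)

parser = TextExtractor()
parser.feed('<p>lorem <b>ipsum</b> <small>dolor</small> sit amet</p>')
print(parser.text())  # -> lorem ipsum dolor sit amet
```

Once the markup is stripped this way, phrase searches match across the inline tags that would otherwise break them.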

A reference:

  • Crawl Anywhere: an open-source web crawler and webpage search engine built on Solr, which might be interesting to have a look at

--

On another note, if you plan on extracting values from the pages' HTML (field values, etc.), my own experience has shown me that regex can be very tedious and hard to maintain. What works best is really jquery/CsQuery, since you can use CSS selectors that understand the underlying document.

In this case, I'd put together a quick and dirty search and job manager (only if necessary). Database choice doesn't matter much here; you won't be using indexes, just pulling out documents and running CSS selector queries against them to extract values.
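Python's standard library has no CSS selector engine, but the same idea (match elements structurally by tag and class rather than by regex) can be sketched with `html.parser`; the markup, tag, and class name below are made up for illustration, roughly a poor man's `span.price` selector:

```python
from html.parser import HTMLParser

class FieldExtractor(HTMLParser):
    """Pulls the text of elements matching a (tag, class) pair."""

    def __init__(self, tag, cls):
        super().__init__()
        self.tag, self.cls = tag, cls
        self._inside = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        # Guard against class attributes with no value (None).
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.cls in classes:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == self.tag:
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.values.append(data.strip())

p = FieldExtractor("span", "price")
p.feed('<div><span class="price">9.99</span><span>other</span></div>')
print(p.values)  # -> ['9.99']
```

Unlike a regex, this keeps working when attribute order, whitespace, or nesting changes, which is the maintainability point above.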


I suggest you import, organize, and index the files into a NoSQL database to optimize search time and simplify querying.

There are several NoSQL databases; you can see some of them here, with a comparison.

However, I suggest you use a full enterprise solution optimized for big-data search, like MarkLogic.

You can see a comparison with Elasticsearch here.

lol, just giving my solution again... still, I agree it's the best way to do it :D
Ibenor 5 years ago

I suggest you take a look at Joyent's Manta service, which lets you bring as much computing power as you need to the files and just use standard tools such as Perl and/or awk for your specific purposes. You can find a link to Manta here and a quick primer video by one of its inventors here!
