Extract hrefs from an HTML file in Bash using built-in tools

Given an HTML file where all tags are on one line, extract the href values of the links whose parent tag is <div class="item">. For example:

<div><div class="item"><a href="...">blah</a></div><div class="item"><a href="...">blah</a></div><div class="item"><a href="...">blah</a></div><div class="item"><a href="...">blah</a></div><div class="item"><a href="...">blah</a></div><div class="item"><a href="...">blah</a></div><div class="item"><a href="...">blah</a></div><div class="item"><a href="...">blah</a></div></div>

Tags
bash


2 Solutions


 echo '<div><div class="item"><a href="linkA">blah</a></div><div class="item"><a href="linkB">blah</a></div><div class="item"><a href="linkC">blah</a></div><div class="item"><a href="linkD">blah</a></div><div class="item"><a href="linkE">blah</a></div><div class="item"><a href="linkF">blah</a></div><div class="item"><a href="linkG">blah</a></div><div class="item"><a href="linkH">blah</a></div></div>' | perl -pe 's/<div[^>]+?class="[^"]*?\bitem\b[^"]*"[^>]*>(?>[^<>]*<(?!\/?div)[^>]*>[^<>]*|<div[^>]*>(?>[^<>]*<(?!\/?div)[^>]*>[^<>]*)*<\/div>[^<>]*)*<a[^>]+href="([^"]*)">([^<]*(?:<(?!\/a>)[^>]*>[^<>]*)*)<\/a>|./\1/g'

Replaces everything that is not an href value with nothing, so the links come out concatenated on a single line. Output:

linkAlinkBlinkClinkDlinkElinkFlinkGlinkH
In which distributions is Perl installed by default?
Araunah 14 days ago

echo '<div class="someclass" name="somename"><a href="extracted-me" name="dlname"><a href="dont-extract-me">I am the inner a-href tag</a></a></div><div class="someotherclass" name="othername"><a href="extract-me-too">I am the regular stuff</a></div><b>I am an unexpected string to ignore</b></div><div class="lastclass" name="bla"><a href="extract-me-as-well"/>Here the a-href tag is closed on the left</div>'    | grep -oP '<div [^>]*class="[^"]+"[^>]*><a [^>]*href="\K[^"]+(?="[^>]*>)'

Extracts the href of every <a> tag that directly follows a <div> tag carrying a class attribute (any class, not only "item"), i.e. the output is:

extracted-me 
extract-me-too 
extract-me-as-well 

The "dont-extract-me" string is ignored because its parent is another <a> tag, not a div container.
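
To match only the requester's case, the class pattern can be narrowed from any class to "item". This is a sketch under the same assumptions as the command above (a grep that supports -P, and the <a> tag directly after the opening <div>). The item_hrefs name is hypothetical.

```shell
# Hypothetical helper: same idea as the grep above, but only for divs
# whose class attribute is exactly "item". Needs a grep built with PCRE (-P).
item_hrefs() {
  grep -oP '<div [^>]*class="item"[^>]*><a [^>]*href="\K[^"]+'
}

printf '%s\n' '<div class="item"><a href="keep-me">x</a></div><div class="other"><a href="skip-me">y</a></div>' | item_hrefs
```

On this sample only keep-me should be printed, since the second div has a different class.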

P. S.

Sorry for all the editing. This is my first post and I needed to figure out how the code formatting thing works.

In which distributions is grep -P supported by default?
Araunah 14 days ago
On all systems where grep is built with PCRE support, which is the case for all popular server and desktop distros (Debian, CentOS, RHEL, Ubuntu, and so on).
C+- 13 days ago
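
Whether a given grep accepts -P can also be tested at run time instead of relying on the distro. A minimal sketch, assuming that falling back to grep -o piped into sed is acceptable when -P is missing (grep -o is itself an extension, though both GNU and BSD grep provide it). The function name hrefs_portable is hypothetical.

```shell
# Hypothetical helper: use grep -P when this grep supports it,
# otherwise fall back to grep -o plus sed.
# Both branches assume the <a> tag directly follows <div class="item">.
hrefs_portable() {
  if printf 'x' | grep -qP 'x' 2>/dev/null; then
    # PCRE branch: \K drops everything matched so far from the output.
    grep -oP '<div class="item"><a [^>]*href="\K[^"]+'
  else
    # Fallback: print the whole match, then strip all but the href value.
    grep -o '<div class="item"><a [^>]*href="[^"][^"]*"' |
      sed 's/.*href="\([^"]*\)"$/\1/'
  fi
}

html='<div><div class="item"><a href="linkA">blah</a></div><div class="item"><a href="linkB">blah</a></div></div>'
printf '%s\n' "$html" | hrefs_portable
```

Either branch should give the same one-href-per-line output for markup shaped like the question's sample.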