Extract hrefs from an HTML file in Bash using built-in tools

Given an HTML file where all tags are on one line, extract the href value of every <a> tag whose parent tag is <div class="item">, for example:

<div><div class="item"><a href="...">blah</a></div><div class="item"><a href="...">blah</a></div><div class="item"><a href="...">blah</a></div><div class="item"><a href="...">blah</a></div><div class="item"><a href="...">blah</a></div><div class="item"><a href="...">blah</a></div><div class="item"><a href="...">blah</a></div><div class="item"><a href="...">blah</a></div></div>

Tags
bash


1 Solution


echo '<div class="someclass" name="somename"><a href="extracted-me" name="dlname"><a href="dont-extract-me">I am the inner a-href tag</a></a></div><div class="someotherclass" name="othername"><a href="extract-me-too">I am the regular stuff</a></div><b>I am an unexpected string to ignore</b></div><div class="lastclass" name="bla"><a href="extract-me-as-well"/>Here the a-href tag is closed on the left</div>' |
  grep -oP '<div [^>]*class="[^"]+"[^>]*><a [^>]*href="\K[^"]+(?="[^>]*>)'

This extracts the href of every <a> tag that immediately follows an opening <div> carrying a class attribute (any class value, not only "item"), i.e. the output is:

extracted-me
extract-me-too
extract-me-as-well

The "dont-extract-me" string is ignored because its parent is another <a> tag, not a div container.
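If only the divs with class "item" should match, as in the question, the class value can be pinned in the pattern. This is a minimal sketch, assuming GNU grep with -P support. The sample HTML and its file names are made up for illustration, and the pattern only matches when the class attribute is exactly "item" (not "item foo").

```shell
# Hypothetical one-line input modeled on the question; the hrefs are invented.
html='<div><div class="item"><a href="first.html">blah</a></div><div class="other"><a href="skip.html">blah</a></div><div class="item"><a href="second.html">blah</a></div></div>'

# Same idea as the pattern above: \K discards everything matched so far,
# so -o prints only the href value. Here the class value is fixed to "item".
hrefs=$(printf '%s\n' "$html" |
  grep -oP '<div [^>]*class="item"[^>]*><a [^>]*href="\K[^"]+(?="[^>]*>)')

printf '%s\n' "$hrefs"
# prints:
# first.html
# second.html
```

For a real file, the printf can be replaced by passing the file name to grep directly.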

P. S.

Sorry for all the editing. This is my first post and I needed to figure out how the code formatting thing works.

In which distributions is grep -P supported by default?
Araunah 1 month ago
On any system where grep is GNU grep built with PCRE support, which is the case for the popular server and desktop distros (Debian, CentOS, RHEL, Ubuntu, and so on).
C+- 1 month ago
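To find out whether a given machine supports grep -P, it can simply be probed at run time. The sketch below does that and falls back to grep -oE plus sed when -P is missing. The function name extract_fallback and the sample HTML are made up for illustration, and it assumes the installed grep supports -o (GNU, BSD, and BusyBox grep do, though -o is not required by POSIX).

```shell
# Hypothetical one-line input; the hrefs are invented.
html='<div><div class="item"><a href="first.html">blah</a></div><div class="item"><a href="second.html">blah</a></div></div>'

# Fallback without PCRE: match up to and including the href value,
# then strip everything through href=" with sed.
extract_fallback() {
    grep -oE '<div [^>]*class="item"[^>]*><a [^>]*href="[^"]+' |
      sed 's/.*href="//'
}

# Probe: grep -P exits non-zero if this build lacks PCRE support.
if printf 'a\n' | grep -qP 'a' 2>/dev/null; then
    hrefs=$(printf '%s\n' "$html" |
      grep -oP '<div [^>]*class="item"[^>]*><a [^>]*href="\K[^"]+')
else
    hrefs=$(printf '%s\n' "$html" | extract_fallback)
fi

printf '%s\n' "$hrefs"
```

Both branches should print the same two hrefs for this input.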