I have a process where I need to copy all the images from a web page. I used to run this process with xmllint, which will process an XML or HTML file and print out the entries you specify. But when my server hosting provider upgraded their systems, they didn't include xmllint. So I had to find another way to extract a list of images from an HTML page. It turns out you can do this in Bash.
The read Statement
You may not think Bash can parse data files, but it can with some clever thinking. Bash, like other UNIX shells before it, can parse lines one at a time from a file via the built-in read statement.
By default, the read statement scans a line of data and splits it into fields. Usually, read splits fields using spaces and tabs, with newlines ending each line, but you can change this behavior by setting the Internal Field Separator (IFS) value and the end-of-line delimiter (-d).
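As a small illustration of those two settings, here is a sketch that runs read once with its defaults and once with a custom IFS and -d. The sample strings and the variable names (first, rest, key, value) are made up for this demo, and the -r flag is my addition to keep backslashes literal; it is not part of the script in this article.

```shell
#!/usr/bin/env bash
# Sketch: how IFS and -d change what read returns.
# Sample strings and variable names are invented for illustration.

# Default behavior: fields split on whitespace, input ends at a newline.
read -r first rest <<< "alpha beta gamma"
echo "first=$first rest=$rest"

# Custom behavior: split fields on ':' and stop reading at the first ';'.
# Setting IFS on the same line as read limits the change to that one command.
IFS=':' read -r -d ';' key value <<< "color:blue;size:large;"
echo "key=$key value=$value"
```

The second call stops at the first semicolon, so only the color:blue pair is read; everything after it stays unread.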
To parse an HTML file using read, set the IFS to a greater-than symbol (>) and the delimiter to a less-than symbol (<). Every time Bash scans a line, it parses up to the next < (the start of an HTML tag) then splits that data at each > (the end of an HTML tag). This sample code takes a line of input and splits the data into the TAG and VALUE variables:
local IFS='>'
read -d '<' TAG VALUE
Let's explore how this works. Consider this simple HTML file:
<img src="https://www.cloudsavvyit.com/8315/parsing-html-in-bash/brand.png"
alt="My brand" />
<p>some textual content</p>
The first time read parses this file, it stops at the first < symbol. Since < is the first character of this sample input, that means Bash finds an empty string. The resulting TAG and VALUE strings are also empty. But that's fine for my use case.
The next time Bash reads the input, it gets img src="https://www.cloudsavvyit.com/8315/parsing-html-in-bash/brand.png"↲alt="My brand" />↲ with a newline right before the alt, and stops before the < symbol on the next line. Then read splits the line at the > symbol, which leaves TAG with img src="https://www.cloudsavvyit.com/8315/parsing-html-in-bash/brand.png"↲alt="My brand" / and VALUE with just a newline.
The third time read parses the HTML file, it gets p>some textual content. Bash splits the string at the >, resulting in TAG containing p and VALUE containing some textual content.
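To see those three passes for yourself, the following sketch feeds the sample HTML to read in a loop and prints what lands in TAG and VALUE each time. The html variable, the pass counter, and the tags array are names I introduced for the demo; printf with %q is used only to make embedded newlines visible.

```shell
#!/usr/bin/env bash
# Sketch: show what each read call returns for the sample HTML.
# html, pass, and tags are hypothetical names used only in this demo.

html='<img src="https://www.cloudsavvyit.com/8315/parsing-html-in-bash/brand.png"
alt="My brand" />
<p>some textual content</p>'

pass=0
tags=()
while IFS='>' read -d '<' TAG VALUE; do
    pass=$((pass + 1))
    tags+=("$TAG")
    # %q quotes the value so that newlines show up as \n
    printf 'pass %d: TAG=%q VALUE=%q\n' "$pass" "$TAG" "$VALUE"
done <<< "$html"

echo "passes: $pass"
```

The loop body runs three times, matching the walk-through above. A fourth read picks up the closing /p but reaches the end of the input before finding another <, so it returns a non-zero status and the loop ends without printing it.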
A Simple Parser
Now that you understand how to use read, it's easy to parse a longer HTML file with Bash. Start with a Bash function called xmlgetnext to parse the data using read, since you'll be doing this again and again in the script. I named my function xmlgetnext to remind me this is a replacement for the Linux xmllint program, but I could have just as easily named it htmlgetnext.
xmlgetnext () {
    local IFS='>'
    read -d '<' TAG VALUE
}
Now call that xmlgetnext function to parse the HTML file. This is my complete htmltags script:
#!/bin/bash
# print a list of all html tags

xmlgetnext () {
    local IFS='>'
    read -d '<' TAG VALUE
}

cat "$1" | while xmlgetnext ; do echo $TAG ; done
The last line is the key. It loops through the file using xmlgetnext to parse the HTML, and prints out only the TAG entries. And because the unquoted $TAG is split on the standard field separators before echo prints it, any lines like img src="https://www.cloudsavvyit.com/8315/parsing-html-in-bash/brand.png"↲alt="My brand" / that contain a newline get printed on a single line, as img src="https://www.cloudsavvyit.com/8315/parsing-html-in-bash/brand.png" alt="My brand" /.
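The newline-collapsing behavior is easy to check in isolation. In this sketch, the TAG value uses a shortened file name (brand.png) as a stand-in for the full URL, and the unquoted and quoted variables are names I made up to hold the two results.

```shell
#!/usr/bin/env bash
# Sketch: unquoted vs. quoted expansion of a value containing a newline.
# The shortened src value and the variable names are assumptions for this demo.

TAG='img src="brand.png"
alt="My brand" /'

# Unquoted: the shell splits $TAG on spaces, tabs, and newlines,
# and echo joins the resulting words with single spaces.
unquoted=$(echo $TAG)

# Quoted: the value is passed to echo as one word, newline included.
quoted=$(echo "$TAG")

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
```

One caveat with the unquoted form: the shell also applies filename expansion to the split words, so a tag containing characters such as * could be expanded unexpectedly.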

To fetch just the list of images, I run the output of this script through grep to print only the lines that have an img tag at the start of the line.
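The article does not show the exact grep command, so the following is only a sketch of how that final step might look. The sample page, its file names (one.png, two.png), the temporary file, and the '^img' pattern are all assumptions made for the demo.

```shell
#!/usr/bin/env bash
# Sketch: list only the img tags from a page.
# The sample page, temp file, and grep pattern are assumptions for this demo.

xmlgetnext () {
    local IFS='>'
    read -d '<' TAG VALUE
}

# Write a small sample page to a temporary file.
page=$(mktemp)
cat > "$page" <<'EOF'
<html>
<body>
<img src="one.png"
alt="First" />
<p>some text</p>
<img src="two.png" alt="Second" />
</body>
</html>
EOF

# Print every tag on its own line, then keep the lines that start with img.
# $TAG is left unquoted so a tag spanning two lines collapses onto one.
images=$(while xmlgetnext ; do echo $TAG ; done < "$page" | grep '^img')
echo "$images"

rm -f "$page"
```

With the htmltags script saved as an executable file, the equivalent one-liner would be along the lines of htmltags page.html | grep '^img', where page.html is whatever file you want to scan.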