Parsing HTML in Bash


I have a process where I need to copy all the images from a web page. I used to run this process with xmllint, which processes an XML or HTML file and prints out the entries you specify. But when my server hosting provider upgraded their systems, they didn’t include xmllint. So I had to find another way to extract a list of images from an HTML page. It turns out you can do this in Bash.

The read Statement

You may not think Bash can parse data files, but it can with some clever thinking. Bash, like other UNIX shells before it, can parse lines one at a time from a file via the built-in read statement.

By default, the read statement scans a line of data and splits it into fields. Usually, read splits fields using spaces and tabs, with newlines ending each line, but you can change this behavior by setting the Internal Field Separator (IFS) value and the end-of-line delimiter (-d).
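As a minimal sketch of those two settings, the example below changes the field separator for one read and the end-of-line delimiter for another. The sample strings and the variable names (name, shell, home, first) are made up for illustration, and the -r flag is added only so that read leaves backslashes alone.

```bash
#!/bin/bash
# Split one colon-separated record into three variables.
# Setting IFS in front of read changes it for this one command only.
IFS=':' read -r name shell home <<< "alice:/bin/bash:/home/alice"
echo "name=$name shell=$shell home=$home"

# Stop reading at the first semicolon instead of the first newline.
read -r -d ';' first <<< "one;two;three"
echo "first=$first"
```

The first read fills three variables from a single line, and the second stops at the first semicolon, so first holds only the text before it.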

To parse an HTML file using read, set the IFS to a greater-than symbol (>) and the delimiter to a less-than symbol (<). Each time Bash scans a line, it parses up to the next < (the start of an HTML tag) then splits that data at each > (the end of an HTML tag). This sample code takes a line of input and splits the data into the TAG and VALUE variables:

local IFS='>'
read -d '<' TAG VALUE

Let’s explore how this works. Consider this simple HTML file:

<img src="https://www.cloudsavvyit.com/8315/parsing-html-in-bash/brand.png"
alt="My brand" />
<p>some text</p>

The first time read parses this file, it stops at the first < symbol. Since < is the first character of this sample input, that means Bash finds an empty string. The resulting TAG and VALUE strings are also empty. But that’s fine for my use case.

The next time Bash reads the input, it gets img src="https://www.cloudsavvyit.com/8315/parsing-html-in-bash/brand.png"↲alt="My brand" />↲ with a newline right before the alt, and stops before the < symbol on the next line. Then read splits the line at the > symbol, which leaves TAG with img src="https://www.cloudsavvyit.com/8315/parsing-html-in-bash/brand.png"↲alt="My brand" / and VALUE with just a newline.

The third time read parses the HTML file, it gets p>some text. Bash splits the string at the >, resulting in TAG containing p and VALUE containing some text.
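The three reads described above can be traced with a short loop. This is only a sketch: the sample and count variables are hypothetical names, the HTML is stored in a variable rather than a file, and IFS is set directly on the read command so the example works outside a function.

```bash
#!/bin/bash
# The sample HTML from above, stored in a variable for the demonstration.
sample='<img src="https://www.cloudsavvyit.com/8315/parsing-html-in-bash/brand.png"
alt="My brand" />
<p>some text</p>'

# Feed the sample with a here-string so the loop runs in the current shell.
count=0
while IFS='>' read -d '<' TAG VALUE ; do
    count=$((count + 1))
    # Brackets make empty strings and embedded newlines visible.
    printf 'read %d: TAG=[%s] VALUE=[%s]\n' "$count" "$TAG" "$VALUE"
done <<< "$sample"
```

The loop body runs three times, once for each read in the walkthrough. The closing /p tag is followed by no further <, so read reaches the end of the input without finding its delimiter and the loop ends there.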

A Simple Parser

Now that you understand how to use read, it’s easy to parse a longer HTML file with Bash. Start with a Bash function called xmlgetnext to parse the data using read, since you’ll be doing this again and again in the script. I named my function xmlgetnext to remind me this is a replacement for the Linux xmllint program, but I could have just as easily named it htmlgetnext.

xmlgetnext () {
    local IFS='>'
    read -d '<' TAG VALUE
}

Now call that xmlgetnext function to parse the HTML file. This is my complete htmltags script:

#!/bin/bash
# print a list of all html tags

xmlgetnext () {
    local IFS='>'
    read -d '<' TAG VALUE
}

# $TAG is left unquoted so embedded newlines collapse to spaces
cat "$1" | while xmlgetnext ; do echo $TAG ; done

The last line is the key. It loops through the file using xmlgetnext to parse the HTML, and prints out only the TAG entries. And because $TAG is unquoted, the shell splits it on the standard field separators before echo prints it, so any lines like img src="https://www.cloudsavvyit.com/8315/parsing-html-in-bash/brand.png"↲alt="My brand" / that contain a newline get printed on a single line, as img src="https://www.cloudsavvyit.com/8315/parsing-html-in-bash/brand.png" alt="My brand" /.


To fetch just the list of images, I run the output of this script through grep to only print the lines that have an img tag at the start of the line.
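A sketch of that last step might look like the following. The htmltags function, the page variable, and the file names one.png and two.png are hypothetical stand-ins so the example runs on its own; the grep pattern '^img' is one plausible way to match an img tag at the start of a line, not necessarily the exact command used here.

```bash
#!/bin/bash
xmlgetnext () {
    local IFS='>'
    read -d '<' TAG VALUE
}

# Print each tag on its own line, reading HTML from standard input.
htmltags () {
    while xmlgetnext ; do echo $TAG ; done
}

# Hypothetical sample page, inlined so the example is self-contained.
page='<img src="one.png" alt="first" />
<p>some text</p>
<img src="two.png"
alt="second" />
<p>more text</p>'

# Keep only the lines that start with an img tag.
images=$(htmltags <<< "$page" | grep '^img')
echo "$images"
```

With a real page, the same filter could be applied to the script's output, for example by piping the htmltags script into grep '^img'.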


