Using AngleSharp in PowerShell 7 to Parse Webpages


AngleSharp is a .NET library that makes parsing and working with HTML content fast and easy. Because AngleSharp is written in .NET, you can use it and consume its output in PowerShell as well. Combining the two lets you quickly and easily script against HTML content. In this article, we’ll explore how to set up AngleSharp, consume a weather page, and convert the data into a PowerShell object.

Installing and Loading AngleSharp

Installing AngleSharp is easy using the Install-Package command. You can even install the package into the CurrentUser scope, which means you don’t need administrative rights to use this library. The package is available from the NuGet repository.

Install-Package 'AngleSharp' -Scope 'CurrentUser' -Source 'Nuget'

Next we’ll want to load AngleSharp for use in our PowerShell script. To do this we’ll use the Add-Type cmdlet to load the library’s DLL directly. Below is code that locates the latest .NET Standard build of the assembly and loads it from that path, if the library isn’t already loaded in your session.

# Only load the assembly if the parser type used later in this article is not already available.
If ( -Not ([System.Management.Automation.PSTypeName]'AngleSharp.Html.Parser.HtmlParser').Type ) {
    # Find every DLL under the installed package folder and keep the last .NET Standard build.
    $standardAssemblyFullPath = (Get-ChildItem -Filter '*.dll' -Recurse (Split-Path (Get-Package -Name 'AngleSharp').Source)).FullName | Where-Object {$_ -Like "*standard*"} | Select-Object -Last 1

    Add-Type -Path $standardAssemblyFullPath -ErrorAction 'SilentlyContinue'
} # End If - Not Loaded

Read on to discover how to parse the webpage content and create a useful PowerShell object!

Parsing a Webpage

Of course, the whole point of this is to actually parse a web page. In this example we’ll load the content with Invoke-WebRequest and then parse the result in AngleSharp. We’re going to use a local 7-day forecast from the National Weather Service to pull in weather data and convert it to an object. First, let’s retrieve the weather data.

$Request = Invoke-WebRequest -Uri "https://forecast.weather.gov/MapClick.php?lat=40.48675500000007&lon=-88.99177999999995"

The data that we’re interested in is in the Content property, but that is the full HTML source, which is a lot to process. Usually, it is best to use Chrome Developer Tools to locate the section of the HTML source that we want to use (press F12 in Chrome on the site you want to inspect).

 

The HTML structure of the NWS weather page.

Luckily, there’s a div container with an unordered list that we can parse. The next step is to actually load the retrieved content into AngleSharp.

$Parser = New-Object AngleSharp.Html.Parser.HtmlParser
$Parsed = $Parser.ParseDocument($Request.Content)
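
If you would like to experiment without hitting the NWS site on every run, the same parser accepts any HTML string. The following is a minimal sketch, assuming AngleSharp is already loaded in the session as described earlier. The markup is a made-up, simplified stand-in for the forecast list (not the real page source), and the $Sample* variable names are illustrative only.

```powershell
# Assumption: the AngleSharp assembly has already been loaded with Add-Type.
# The HTML below is a hypothetical, simplified stand-in for the NWS markup.
$SampleHtml = @'
<ul id="seven-day-forecast-list">
  <li class="forecast-tombstone">
    <div class="tombstone-container">
      <p class="period-name">Tonight</p>
      <p><img src="night.png" alt="Tonight: Clear, with a low around 40."></p>
      <p class="short-desc">Clear</p>
      <p class="temp temp-low">Low: 40</p>
    </div>
  </li>
</ul>
'@

# Parse the string exactly as we parsed $Request.Content.
$SampleParser = New-Object AngleSharp.Html.Parser.HtmlParser
$SampleParsed = $SampleParser.ParseDocument($SampleHtml)

# Look up the list by its ID, then read the text of one child element.
$SampleList = $SampleParsed.GetElementById('seven-day-forecast-list')
$SampleList.QuerySelector('.short-desc').TextContent
```

Working against a small local string like this makes it easier to try out selectors before pointing the script at the live page.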

Now that we have the parsed content available in our $Parsed variable, we can start to manipulate this data to get to just the section we want. Very conveniently, the NWS site provides an ID just for this unordered list, named seven-day-forecast-list. Since each ID is unique on an HTML page, this makes the list easy to target. Using the All property on our parsed content, we can retrieve just the object with the ID of seven-day-forecast-list.

$ForecastList = $Parsed.All | Where-Object ID -EQ 'seven-day-forecast-list'
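
Filtering the All collection is not the only way to reach an element by ID. As a hedged alternative, assuming $Parsed holds the document returned by ParseDocument above, the sketch below shows two direct lookups that avoid enumerating every element on the page. The warning text is illustrative.

```powershell
# Assumption: $Parsed is the document produced by ParseDocument earlier.

# Option 1: look the element up directly by its ID.
$ForecastList = $Parsed.GetElementById('seven-day-forecast-list')

# Option 2: use a CSS selector, where '#' targets an ID.
$ForecastList = $Parsed.QuerySelector('#seven-day-forecast-list')

# Both calls return $null when nothing matches, so check before continuing.
if ($null -eq $ForecastList) {
    Write-Warning 'Forecast list not found; the page layout may have changed.'
}
```

Either option yields the same element as the Where-Object filter, so the rest of the article works unchanged with whichever you choose.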

This will result in a lot of different properties, but we’re focused on the ChildNodes property, as it will contain each li holding the data we need. To get an idea of what we want to target in our object, let’s take a look at an individual li. There are a handful of elements with classes that we can target.

  • period-name – The relative time period.
  • short-desc – A condensed description of the weather.
  • temp temp-high – The high temperature.

 

HTML structure of a single tombstone-container element.

You may notice that the img tag contains an alt attribute with a lot of useful information. It’s pretty easy to find the class to target, as it is stored in the ClassName property of the child node. To target the alt attribute we have to rely on a slightly different method, QuerySelectorAll, which uses traditional CSS selectors to make complex targeting easy.

$ForecastList.ChildNodes | ForEach-Object {
  # Retrieve just the content of the tombstone-container div under the forecast-tombstone li element.
  $Node = $_.ChildNodes | Where-Object ClassName -EQ 'tombstone-container'

  [PSCustomObject]@{
    # Search the child nodes under the tombstone-container and find the element with the class period-name. Retrieve just the InnerHTML, which is the text value. This includes a break element, <br>, which we don't need, so replace it with a space instead.
    "Period" = $Node.ChildNodes.Where({ $_.ClassName -EQ 'period-name'}).InnerHTML -Replace "<br>"," "
    "Temp"   = $Node.ChildNodes.Where({ $_.ClassName -Match 'temp'}).InnerHTML
    "Short"  = $Node.ChildNodes.Where({ $_.ClassName -EQ 'short-desc'}).InnerHTML -Replace "<br>"," "
    # Since we don't have a class to target on the img itself, use the CSS selector p > img, which looks for an img element that is a direct child of a p element. Next, use the Attributes property to find the one named alt and return its value.
    "Alt"    = $Node.QuerySelectorAll("p > img").Attributes.Where({$_.Name -EQ 'alt'}).Value
  }
}

 

Output of the parsed web page from AngleSharp.

Although we have to iterate over a few elements to ultimately get to just the ones we want, we can walk through the HTML document structure and get to just what we need. It can be a bit tricky to understand the structures, but ultimately what AngleSharp is doing is creating objects for each DOM element. Once you figure out the best way to target the elements you need, extracting the content is not difficult.
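
As one possible variation on the walk above, CSS selectors can do most of the targeting. This is a sketch rather than the definitive approach, and it assumes $ForecastList holds the seven-day-forecast-list element and that the class names match those seen in the developer tools. Selecting li elements directly also skips any whitespace-only text nodes that ChildNodes may include.

```powershell
# Assumption: $ForecastList is the seven-day-forecast-list element found earlier.
$ForecastList.QuerySelectorAll('li') | ForEach-Object {
    # Locate the image once; it may be missing if the markup differs from what we expect.
    $Image = $_.QuerySelector('p > img')

    [PSCustomObject]@{
        # A leading '.' in a CSS selector targets a class name.
        'Period' = $_.QuerySelector('.period-name').InnerHtml -replace '<br>', ' '
        'Temp'   = $_.QuerySelector('.temp').InnerHtml
        'Short'  = $_.QuerySelector('.short-desc').InnerHtml -replace '<br>', ' '
        # GetAttribute reads a single attribute by name; skip it when no image was found.
        'Alt'    = if ($Image) { $Image.GetAttribute('alt') }
    }
}
```

The property names match the earlier example, so the resulting objects can be used interchangeably in the rest of a script.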

Conclusion

AngleSharp offers an excellent programmatic interface for parsing and interacting with HTML content on webpages. This can open the door to using PowerShell to retrieve content that would otherwise be inaccessible. Taking this content, storing it, and using it in scripts is extremely useful and can help support system integration strategies!


