HTML Agility Pack: select child nodes
For any project that pulls content from the web in C# and parses it into a usable format, you will most likely come across the HTML Agility Pack. The Agility Pack is a standard choice for parsing HTML content in C#, because it has several methods and properties that work conveniently with the DOM. Instead of writing your own parsing engine, you can use the HTML Agility Pack to find specific DOM elements, traverse child and parent nodes, and retrieve text and attributes (e.g., href links) within specified elements.

The first step is to install the HTML Agility Pack after you create your C# .NET project. In this example, we use a .NET Core MVC web app. To install the Agility Pack, you need to use NuGet. NuGet is available in the Visual Studio interface under Tools -> NuGet Package Manager -> Manage NuGet Packages for Solution. In this window, search for HTML Agility Pack and install it into your solution dependencies.

After you install it, you’ll see the dependency in your solution, and you may find it already referenced in your using statements. If you do not see the reference, you must add the line "using HtmlAgilityPack;" to every code file where you use the Agility Pack.

Pull HTML from a Web Page Using Native C# Libraries

With the Agility Pack dependency installed, you can now practice parsing HTML. For this tutorial, we’ll use Hacker News. It’s a good example because it is a dynamic page with a list of popular links. We’ll take the top 10 links on Hacker News, parse the HTML, and place the results into a JSON object.

Before you scrape a page, you should understand its structure and look at the underlying markup. This can be done in the browser using the “Inspect Element” option. We’re using Chrome, but this feature is also available in Firefox and Edge. Right-click the first link on Hacker News and inspect the element.
You’ll notice that links are contained within a table, and each title is listed in a table row with specific class names. These class names can be used to pull content from each DOM element when you scrape the page.

The circled elements in the image above show classes that can be used to separate these elements from the rest of the DOM. The title class contains the elements for the main title that displays on the page, and the rank class displays the title’s rank. The storylink and score classes also contain important information about the link that we could add to the JSON object. We also want to target specific DOM element properties that contain information that we need.

Now that we understand the page DOM structure, we can write code that pulls the homepage for Hacker News. Before starting, add the following using statements to your code:
With the using statements in place, you can write a small method that downloads any web page and returns its HTML, ready to be loaded into a variable.
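The original listing is not reproduced here, so the following is only a minimal sketch of such a method, assuming HttpClient is used for the download. The names PageLoader, DownloadHtml, and html are hypothetical and chosen for illustration.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class PageLoader
{
    // One shared HttpClient for the whole app; creating one per request wastes sockets.
    private static readonly HttpClient Client = new HttpClient();

    // Downloads the page at fullUrl and returns its HTML as a string.
    public static Task<string> DownloadHtml(string fullUrl)
    {
        return Client.GetStringAsync(fullUrl);
    }

    public static async Task Main()
    {
        // The variable name "html" is an assumption, not the article's own name.
        string html = await DownloadHtml("https://news.ycombinator.com");
        Console.WriteLine($"Downloaded {html.Length} characters.");
    }
}
```

GetStringAsync throws an HttpRequestException on a failed request, so production code would likely wrap the call in error handling.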
You can check that the web page content was pulled by setting a breakpoint and using the HTML Visualizer to view the content.

Parsing HTML Using Agility Pack

With the HTML loaded into a variable, you can now use Agility Pack to parse it. You have two main options: LINQ and XPath.
LINQ is useful when you want to search through nodes to find specific content. XPath is a standard query language that Agility Pack supports, and most developers use it to iterate through several elements. We’re going to use LINQ to get the top 10 stories, and then XPath to parse child elements, get specific properties for each one, and load them into a JSON object.

We use a JSON object because JSON is a widely supported data format that can be used across platforms, APIs, and programming languages. Most systems support JSON, so it’s an easy way to work with external applications if you need to.

We’ll create a new method that will parse the HTML. Here is the new method with a LINQ query to pull all items with the class name “athing” and their child elements:
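As a rough sketch of that LINQ query (not the article's original listing), the code below loads HTML into an Agility Pack HtmlDocument and keeps the first table rows whose class attribute contains “athing”. The class and method names (StoryRows, TopRows) and the sample markup are assumptions made so the example can run without a network call.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class StoryRows
{
    // Returns up to "count" <tr> elements whose class attribute contains "athing".
    public static List<HtmlNode> TopRows(string html, int count)
    {
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        return htmlDoc.DocumentNode
            .Descendants("tr")
            .Where(node => node.GetAttributeValue("class", "").Contains("athing"))
            .Take(count)
            .ToList();
    }

    public static void Main()
    {
        // Hypothetical stand-in for the downloaded page.
        const string sample =
            "<table>" +
            "<tr class='athing' id='1'><td>first</td></tr>" +
            "<tr class='spacer'><td></td></tr>" +
            "<tr class='athing' id='2'><td>second</td></tr>" +
            "</table>";

        foreach (HtmlNode row in TopRows(sample, 10))
        {
            // Each returned row still carries all of its child elements.
            Console.WriteLine(row.GetAttributeValue("id", ""));
        }
    }
}
```

Descendants is lazy, so Take stops the search as soon as enough rows have been found.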
This code loads the HTML into the Agility Pack document object. We don’t want all elements within each table row, so we need to iterate through each item and use Agility Pack to pull only story titles, URLs, rank, and score. We’ll add this functionality to the parsing method.
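The iteration described above can be sketched with relative XPath queries that select child elements of each row. This is an illustration under assumptions, not the original code: the sample markup is a simplified, hypothetical version of the structure described earlier, with the score placed in the row that follows each story row.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class StoryFields
{
    public static void Main()
    {
        // Hypothetical markup shaped like the rows described in this article.
        const string sample =
            "<table>" +
            "<tr class='athing'>" +
            "<td class='title'><span class='rank'>1.</span></td>" +
            "<td class='title'><a class='storylink' href='https://example.com/a'>First story</a></td>" +
            "</tr>" +
            "<tr><td class='subtext'><span class='score'>120 points</span></td></tr>" +
            "</table>";

        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(sample);

        var rows = htmlDoc.DocumentNode
            .Descendants("tr")
            .Where(node => node.GetAttributeValue("class", "").Contains("athing"))
            .Take(10);

        foreach (HtmlNode row in rows)
        {
            // A leading ".//" searches only below the current row, not the whole document.
            HtmlNode rankNode = row.SelectSingleNode(".//span[@class='rank']");
            HtmlNode linkNode = row.SelectSingleNode(".//a[@class='storylink']");
            // Assumption: the score sits in the first row after the story row.
            HtmlNode scoreNode = row.SelectSingleNode("following-sibling::tr[1]//span[@class='score']");

            // SelectSingleNode returns null when nothing matches, so guard each read.
            string rank = rankNode?.InnerText ?? "";
            string title = linkNode?.InnerText ?? "";
            string url = linkNode?.GetAttributeValue("href", "") ?? "";
            string score = scoreNode?.InnerText ?? "";

            Console.WriteLine($"{rank} {title} ({url}) {score}");
        }
    }
}
```

An XPath that starts with "//" would search the entire document even when called on a row, which is a common source of every row returning the same value.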
The above code iterates through the top 10 links on Hacker News and gets the information that we want, but it doesn’t do anything with that information. We now need to create a JSON object to contain it. Once we have a JSON object, we can pass it to anything we want: another method in our code, an API on an external platform, or another application that can ingest JSON.

The easiest way to create a JSON object is to serialize it from a class. We’ll create a class in the same namespace as the code from the previous examples.

Note: To use this tutorial example, you need to install a JSON serialization package from NuGet. With the class and the package in place, the parsing loop can fill one object per story and serialize the resulting list.
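A minimal sketch of the class-then-serialize approach follows. The class name StoryItem and its property names are hypothetical, and the sketch swaps in System.Text.Json, which ships with .NET Core 3.0 and later, rather than whichever NuGet package the original listing relied on.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical container for one story; the property names are illustrative.
public class StoryItem
{
    public string Rank { get; set; } = "";
    public string Title { get; set; } = "";
    public string Url { get; set; } = "";
    public string Score { get; set; } = "";
}

public static class JsonDemo
{
    public static void Main()
    {
        // In the real flow, the parsing loop would add one StoryItem per table row.
        var stories = new List<StoryItem>
        {
            new StoryItem
            {
                Rank = "1.",
                Title = "First story",
                Url = "https://example.com/a",
                Score = "120 points"
            }
        };

        // Serialize the whole list into one JSON array string.
        var options = new JsonSerializerOptions { WriteIndented = true };
        string json = JsonSerializer.Serialize(stories, options);
        Console.WriteLine(json);
    }
}
```

Serializing from a typed class keeps the JSON field names in one place, so a renamed property changes the output everywhere it is used.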
That’s it: you’ve pulled the top 10 news links from Hacker News and created a JSON object. Here is the full code from start to finish, ending with the final JSON object:
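The original full listing is not available in this copy, so the program below is a reconstruction sketch of the three steps (download, parse, serialize) rather than the article's own code. All type, method, and variable names are hypothetical, System.Text.Json stands in for the JSON package, and the class names in the XPath queries follow the page structure described earlier; if the live site's markup differs, the affected fields simply come back empty.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;
using HtmlAgilityPack;

// Hypothetical container for one story.
public class StoryItem
{
    public string Rank { get; set; } = "";
    public string Title { get; set; } = "";
    public string Url { get; set; } = "";
    public string Score { get; set; } = "";
}

public static class HackerNewsScraper
{
    private static readonly HttpClient Client = new HttpClient();

    // Step 1: download the raw HTML.
    public static Task<string> DownloadHtml(string fullUrl) => Client.GetStringAsync(fullUrl);

    // Step 2: pick the first ten story rows with LINQ, then read fields with XPath.
    public static List<StoryItem> ParseStories(string html)
    {
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        var rows = htmlDoc.DocumentNode
            .Descendants("tr")
            .Where(node => node.GetAttributeValue("class", "").Contains("athing"))
            .Take(10);

        var stories = new List<StoryItem>();
        foreach (HtmlNode row in rows)
        {
            HtmlNode link = row.SelectSingleNode(".//a[@class='storylink']");
            stories.Add(new StoryItem
            {
                Rank = Text(row.SelectSingleNode(".//span[@class='rank']")),
                Title = Text(link),
                Url = link?.GetAttributeValue("href", "") ?? "",
                // Assumption: the score lives in the row after the story row.
                Score = Text(row.SelectSingleNode("following-sibling::tr[1]//span[@class='score']"))
            });
        }
        return stories;
    }

    // Returns trimmed, entity-decoded text, or an empty string for a missing node.
    private static string Text(HtmlNode node) =>
        node == null ? "" : HtmlEntity.DeEntitize(node.InnerText).Trim();

    public static async Task Main()
    {
        string html = await DownloadHtml("https://news.ycombinator.com");
        List<StoryItem> stories = ParseStories(html);

        // Step 3: serialize the list; the final JSON ends up in linksJson.
        var options = new JsonSerializerOptions { WriteIndented = true };
        string linksJson = JsonSerializer.Serialize(stories, options);
        Console.WriteLine(linksJson);
    }
}
```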
Note that you can also select child nodes from parent nodes with the Agility Pack. Each node exposes members such as ChildNodes, Descendants, and SelectNodes with a relative XPath, so you can traverse down the DOM hierarchy if you want to pull table elements item by item.

Pull HTML Using Selenium and a Chrome Browser Instance

In some cases, you’ll need to use Selenium with a browser to pull HTML from a page. This is because some websites rely on client-side code to render results. Since client-side code executes after the browser loads the HTML and scripts, the previous example will not get the results that you need. To emulate code loading in a browser, you can use a library named Selenium. Selenium lets you pull HTML from a page using your browser executable, and then you can parse the HTML using the Agility Pack in the same way we did above. Before you can do this, you need to install Selenium and a matching Chrome driver from NuGet.
Note: You must keep the Selenium driver up to date as Chrome updates. If you receive the error “SessionNotCreatedException: Message: session not created: This version of ChromeDriver only supports Chrome version xx”, where “xx” is the version number for Chrome, you must update the ChromeDriver package in your project to match your installed browser.

We’ll use the same variables from the previous example, but change the main code to pull the HTML using Selenium and load it into an object:
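A minimal sketch of the Selenium download step is shown below, assuming the Selenium.WebDriver NuGet package is installed and a ChromeDriver compatible with the local Chrome is available. The names BrowserLoader and GetRenderedHtml are hypothetical.

```csharp
using System;
using OpenQA.Selenium.Chrome;

public static class BrowserLoader
{
    // Opens the page in headless Chrome and returns the page source once it has loaded.
    public static string GetRenderedHtml(string fullUrl)
    {
        var options = new ChromeOptions();
        options.AddArgument("--headless");

        // Disposing the driver also shuts down the browser process.
        using (var driver = new ChromeDriver(options))
        {
            driver.Navigate().GoToUrl(fullUrl);
            return driver.PageSource;
        }
    }

    public static void Main()
    {
        string html = GetRenderedHtml("https://news.ycombinator.com");
        Console.WriteLine($"Rendered page has {html.Length} characters.");
    }
}
```

The returned string can be passed to HtmlDocument.LoadHtml exactly like HTML downloaded with HttpClient. Pages that keep loading data after the initial load may need an explicit wait before PageSource is read.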
Notice in the code above that the same parsing method is reused. The full program pairs this Selenium request with that method.
The code still parses the HTML and converts it to a JSON object from the same class. For now, we used LINQ and XPath to select CSS classes, but the Agility Pack creators promise that CSS selectors are coming.

Final Thoughts

The HTML Agility Pack is a great tool for scraping websites, but it’s lacking some key features. For example, we needed an additional library (Selenium) to scrape single page applications built with frameworks such as React.js, Angular.js or Vue.js. XPath support is also important, because this query language is much more flexible than CSS selectors. Overall, it’s a great library for parsing HTML, but you still need additional libraries for more flexibility.

Our C# examples used the Agility Pack and Selenium for single page applications, but our next article will focus on ScrapySharp. You can also read our tutorial about web scraping with C#.