Introduction
In the last session, How to Get elements by class in Html Agility Pack C#, we saw how to use Html Agility Pack to obtain the elements that share the same CSS class. Beyond that, there may be situations where you need to extract data from the HTML content of a web page. If you are new to Html Agility Pack, start with Install HTML agility pack and Load an HTML Document.
Hard to do it on Regex!
When it comes to text extraction, regular expressions are often the first method that comes to mind. Although their very purpose is to extract content from text based on a pattern, they have several shortfalls for novices.
- Arriving at the right Regex pattern is tricky. Unless one is an expert, it is difficult to tell whether a pattern is correct and efficient.
- Adding to this complexity, Regex is a separate pattern language with its own syntax, and it has no notion of HTML structure. Hence, writing and maintaining patterns may slow down the work.
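To make the shortfalls above concrete, here is a minimal sketch of a naive pattern that looks reasonable but silently misses content. The class name RegexPitfall, the helper NaiveParagraphs, and the sample markup are all hypothetical and are not part of the original session.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public class RegexPitfall
{
    // Hypothetical helper: returns the text of every <p> the naive pattern matches.
    public static List<string> NaiveParagraphs(string html)
    {
        var results = new List<string>();
        // Matches only attribute-free <p> tags that contain no nested markup.
        foreach (Match m in Regex.Matches(html, "<p>([^<]*)</p>"))
        {
            results.Add(m.Groups[1].Value);
        }
        return results;
    }

    public static void Main()
    {
        // Assumed sample markup: the second paragraph has an attribute and a nested tag.
        string html = "<p>First</p><p class=\"note\">Second <b>bold</b> text</p>";

        foreach (string text in NaiveParagraphs(html))
        {
            Console.WriteLine(text); // prints "First" only
        }
    }
}
```

The second paragraph is skipped without any error, which is exactly the kind of mistake that is hard for a novice to spot.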
Alternate yet native and efficient method!
Html Agility Pack has most of the utilities needed to get the job done swiftly and without hassle. One can traverse the entire HTML content of a web page; see HTML Traversing using Agility Pack to get comfortable with the topic.
Reading InnerText on an HTML element is an easy way to extract specific text, so web scraping need not be a big ordeal.
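As a minimal sketch of reading InnerText, the example below parses an inline HTML string instead of a live page so that it is self-contained. The class name InnerTextSketch, the helper TextOf, and the sample markup are hypothetical; only LoadHtml, SelectSingleNode, and InnerText come from Html Agility Pack.

```csharp
using System;
using HtmlAgilityPack;

public class InnerTextSketch
{
    // Hypothetical helper: returns the InnerText of the first node matching the XPath,
    // or null when nothing matches.
    public static string TextOf(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node?.InnerText;
    }

    public static void Main()
    {
        // Assumed sample markup with a nested <b> element.
        string html = "<div id=\"intro\">Hello <b>Agility</b> Pack</div>";

        // InnerText concatenates the text of the element and its descendants.
        Console.WriteLine(TextOf(html, "//div[@id='intro']")); // prints "Hello Agility Pack"
    }
}
```

Note that InnerText drops the tags but keeps the text of nested elements, which is usually what is wanted when scraping.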
Step #1
Declare an HtmlWeb variable and an HtmlAgilityPack.HtmlDocument variable.
Step #2
Load the web page into the HtmlDocument variable.
Step #3
Filter the HTML elements by class name into an IEnumerable of type HtmlNode, using the technique shown below.
doc.DocumentNode.Descendants().Where(n => n.HasClass("mw-jump-link"))
Step #4
Iterate through the nodes with a foreach loop and read InnerText on each item.
Once you are done extracting the specific text, you may want to change the HTML content; to learn how to manipulate it, visit the session HTML Manipulation using html agility pack.
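The four steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the session's own listing: the class name ClassTextSketch, the helper ExtractTextByClass, and the inline sample markup are hypothetical, and a URL is only loaded if one is passed on the command line.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public class ClassTextSketch
{
    // Steps 3 and 4: filter nodes by class name, then read InnerText from each match.
    public static List<string> ExtractTextByClass(HtmlDocument doc, string className)
    {
        IEnumerable<HtmlNode> nodes = doc.DocumentNode
            .Descendants()
            .Where(n => n.HasClass(className));

        var texts = new List<string>();
        foreach (HtmlNode node in nodes)
        {
            // Decode entities such as &amp; and trim surrounding whitespace.
            texts.Add(HtmlEntity.DeEntitize(node.InnerText).Trim());
        }
        return texts;
    }

    public static void Main(string[] args)
    {
        // Step 1: declare the HtmlWeb and HtmlDocument variables.
        HtmlWeb web = new HtmlWeb();
        HtmlDocument doc = new HtmlDocument();

        // Step 2: load a page. With no URL argument, fall back to assumed sample markup.
        if (args.Length > 0)
        {
            doc = web.Load(args[0]);
        }
        else
        {
            doc.LoadHtml(
                "<a class=\"mw-jump-link\" href=\"#nav\">Jump to navigation</a>" +
                "<a class=\"mw-jump-link\" href=\"#search\">Jump to search</a>" +
                "<p class=\"other\">Ignored</p>");
        }

        foreach (string text in ExtractTextByClass(doc, "mw-jump-link"))
        {
            Console.WriteLine(text);
        }
    }
}
```

Keeping the filtering in its own method makes it easy to try against a small inline document before pointing it at a real page.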
using System;
using HtmlAgilityPack;

public class Program
{
    public static void Main()
    {
        // Define the HtmlDocument
        var doc = new HtmlAgilityPack.HtmlDocument();
        // Declare the HtmlWeb
        HtmlWeb web = new HtmlWeb();
        // Load the document from a specific URL
        doc = web.Load("https://www.technologycrowds.com/2019/06/sha-512-hash-using-c-sharp.html");
        // Select the first element that has a text node containing "Working"
        var node = doc.DocumentNode.SelectSingleNode("//*[text()[contains(., 'Working')]]");
        // SelectSingleNode returns null when nothing matches, so guard before reading InnerText
        var ress = node?.InnerText ?? "No match found";
        // Display the final output
        Console.WriteLine(ress);
    }
}
Output
Working Sample