Hi. My name is Farooq Kaiser and I'm a software developer from Toronto, Canada.



HTML Screen Scraping with the Html Agility Pack Library

by Farooq Kaiser 29. August 2010 14:15

What is Screen Scraping?

Screen scraping is the process of reading a web page and extracting data from its HTML tags.

In this article, I will examine how to scrape a given web page using the Html Agility Pack library. It is a .NET code library that allows you to parse "out of the web" HTML files, meaning real-world markup that is not necessarily well formed. It can be downloaded at http://htmlagilitypack.codeplex.com/

In this tutorial, I will read my own web site, http://savebigbucks.ca, which offers daily deals in Canada.

Here is a code snippet that downloads the web page into a string.

    using System.IO;
    using System.Net;
    using System.Text;

    string result = string.Empty;
    string url = "http://www.savebigbucks.ca/";

    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    request.Method = "GET";

    // Dispose the response, its stream, and the reader once the page is read.
    using (var response = request.GetResponse())
    using (var stream = response.GetResponseStream())
    using (var reader = new StreamReader(stream, Encoding.UTF8)) // assumes the page is UTF-8
    {
        result = reader.ReadToEnd();
    }

Now I will parse the HTML. The following code reads the anchor tags from the downloaded page.

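    using System.Collections.Generic;
    using System.IO;
    using HtmlAgilityPack;

    HtmlDocument doc = new HtmlDocument();
    doc.Load(new StringReader(result));
    HtmlNode root = doc.DocumentNode;

    List<string> anchorTags = new List<string>();

    // SelectNodes returns null (not an empty collection) when nothing matches.
    HtmlNodeCollection links = root.SelectNodes("//a");
    if (links != null)
    {
        foreach (HtmlNode link in links)
        {
            anchorTags.Add(link.OuterHtml);
        }
    }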

In the code above, we read all anchor tags from http://savebigbucks.ca using an XPath query, root.SelectNodes("//a"). In the same way, we can easily get to img/src or a/href values with a few XPath queries.
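
To make the img/src and a/href case concrete, here is a minimal sketch that returns attribute values instead of whole tags. The class name XPathScraper and the method SelectAttributeValues are illustrative names of my own, not part of Html Agility Pack, and the sketch parses an HTML string that has already been downloaded.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class XPathScraper
{
    // Hypothetical helper: returns the value of `attribute` for every
    // node matched by `xpath` in the given HTML string.
    public static List<string> SelectAttributeValues(string html, string xpath, string attribute)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var values = new List<string>();

        // SelectNodes returns null when the query matches nothing.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            return values;
        }

        foreach (HtmlNode node in nodes)
        {
            // Fall back to an empty string when the attribute is missing.
            string value = node.GetAttributeValue(attribute, string.Empty);
            if (value.Length > 0)
            {
                values.Add(value);
            }
        }

        return values;
    }
}
```

With this in place, XPathScraper.SelectAttributeValues(result, "//a[@href]", "href") would collect the link targets, and the query "//img[@src]" with the attribute "src" would collect the image sources. The [@href] predicate limits the match to anchors that actually carry the attribute.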

An elegant way of making the web call using LINQ

The code above can be rewritten more elegantly using HtmlWeb and LINQ. Here is a revised version.

    using System.Linq;
    using HtmlAgilityPack;

    // HtmlWeb downloads and parses the page in one step.
    HtmlWeb htmlWeb = new HtmlWeb();
    string url = @"http://www.savebigbucks.ca/";
    HtmlDocument doc = htmlWeb.Load(url);
    var links = doc.DocumentNode.Descendants("a").Select(x => x.OuterHtml).ToList();

Thanks to Jeff Klawiter for his feedback and for showing me the elegant way of using LINQ. Jeff Klawiter is the current maintainer of the Html Agility Pack project.
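
The same LINQ style extends naturally from whole tags to link targets. The following is only a sketch under my own assumptions: the names LinqScraper and GetAbsoluteLinks are hypothetical, the HTML is passed in as a string, and relative links are resolved against a base URL that the caller supplies.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class LinqScraper
{
    // Hypothetical helper: returns the distinct absolute http/https URLs
    // found in the href attributes of the anchors in `html`.
    public static List<string> GetAbsoluteLinks(string html, string baseUrl)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var baseUri = new Uri(baseUrl);

        return doc.DocumentNode
            .Descendants("a")
            // Anchors without an href yield an empty string and are dropped.
            .Select(a => a.GetAttributeValue("href", string.Empty))
            .Where(href => !string.IsNullOrWhiteSpace(href))
            // Resolve relative links against the base URL; skip unparsable ones.
            .Select(href => Uri.TryCreate(baseUri, href, out Uri absolute) ? absolute : null)
            .Where(uri => uri != null &&
                          (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps))
            .Select(uri => uri.AbsoluteUri)
            .Distinct()
            .ToList();
    }
}
```

A call such as LinqScraper.GetAbsoluteLinks(doc.DocumentNode.OuterHtml, "http://www.savebigbucks.ca/") would return each link once, with mailto: and other non-web schemes filtered out. Filtering by scheme is a design choice of this sketch, not something the library requires.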

Summary

In this article, we examined how to download a web page and extract its anchor tags using Html Agility Pack, first with XPath and then with LINQ.

 
