How to Use .NET C# for Web Scraping

Written by ahmedtarekhasan | Published 2023/02/07

TL;DR: Web Scraping is the process of extracting information from a web resource through its user interface rather than its official APIs. It is not like calling a website's REST API to get some data. Instead, you retrieve the page the way a browser does, parse the HTML, and then extract the data rendered into it.

A guide on how to do Web Scraping in .NET C#, with code samples.

What is Web Scraping

In English, the word "scraping" has several definitions, but they all share the same core meaning.

From Dictionary.com:

to remove (an outer layer, adhering matter, etc.) in this way: to scrape the paint and varnish from a table.

From Dictionary.Cambridge.org:

the act of removing the surface from something using a sharp edge or something rough.

However, what interests us here is what Web Scraping means in software.

In software, Web Scraping is the process of extracting information from a web resource through its user interface rather than its official APIs. Therefore, it is not like calling a website's REST API to get some data. Instead, you retrieve the page the way a browser does, parse the HTML, and then extract the data rendered into it.


Why Would We Need to Scrape a Website

Simply because we need the data presented on the website, and the website does not provide an official API for retrieving it.


Is Web Scraping Legal

It depends on the web resource itself. Some websites state explicitly whether scraping is allowed, while others do not mention it anywhere.

Another factor is what you are going to do with the data you scrape. Therefore, always be cautious and keep yourself safe. Do your research before jumping into the implementation.


How to do Web Scraping

There are different ways of doing it, but in most cases the same concept applies: you write some code to get the HTML using the website URL, you parse the HTML, and finally you extract the data you want.

However, if we stick only to this definition, we miss a lot of details.

In some cases, things are more complicated than that. It depends on the way the website is built.

For static websites, where the data is already rendered into the HTML returned by the first response, you can follow the steps described above.

However, for dynamic websites, where the data is not in the initial HTML and is loaded afterwards through JavaScript libraries and frameworks (such as Angular, React, or Vue), you need to follow another approach.

In this case, you mimic what a web browser (such as Chrome, Firefox, IE, or Edge) does, and then you get the final HTML from the virtual browser you used. Once you have the full HTML with the data rendered into it, the rest is the same.
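To make the virtual browser idea more concrete, here is a minimal sketch using Puppeteer Sharp, one of the libraries listed later in this article. It assumes the PuppeteerSharp NuGet package is installed and that a recent version with the parameterless DownloadAsync() is used. The class name RenderedHtmlSample and the method name GetRenderedHtmlAsync are made up for illustration.

```csharp
using System;
using System.Threading.Tasks;
using PuppeteerSharp;

namespace WebScraper
{
    public static class RenderedHtmlSample
    {
        // Loads the page in a headless browser and returns the HTML after scripts have run.
        public static async Task<string> GetRenderedHtmlAsync(string url)
        {
            // Downloads a compatible browser build on first use (this can take a while).
            await new BrowserFetcher().DownloadAsync();

            await using var browser = await Puppeteer.LaunchAsync(
                new LaunchOptions { Headless = true });
            await using var page = await browser.NewPageAsync();

            // Wait until network activity settles so script-loaded data can render.
            await page.GoToAsync(url, WaitUntilNavigation.Networkidle0);

            return await page.GetContentAsync();
        }

        public static async Task Main()
        {
            var html = await GetRenderedHtmlAsync("https://github.com/AhmedTarekHasan");
            Console.WriteLine(html.Length);
        }
    }
}
```

Waiting for the network to go idle is only one possible strategy. Depending on the website, waiting for a specific element may be more reliable.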


Should We do this Ourselves from Scratch

No, we already have some libraries which we can use to achieve the expected results.

For example, here is a list of some libraries which we can use.

Performing Calls:

  1. .NET HttpClient
  2. RestSharp

Parsing HTML:

  1. Html Agility Pack (HAP)
  2. CsQuery
  3. AngleSharp

Virtual Browser:

  1. Headless Chrome
  2. Selenium WebDriver
  3. Puppeteer Sharp

These are not the only libraries that can help with a Web Scraping project. If you search the internet, you will find a lot more.
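To give a feel for how the parsing libraries differ, here is a small sketch using AngleSharp, which lets you query with CSS selectors instead of XPath. It assumes the AngleSharp NuGet package is installed. The class AngleSharpSample, the method SelectTexts, and the sample markup are hypothetical and exist only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using AngleSharp.Html.Parser;

namespace WebScraper
{
    public static class AngleSharpSample
    {
        // Parses an HTML string and returns the trimmed text of every element
        // that matches the given CSS selector.
        public static List<string> SelectTexts(string html, string cssSelector)
        {
            var parser = new HtmlParser();
            using var document = parser.ParseDocument(html);

            return document
                .QuerySelectorAll(cssSelector)
                .Select(element => element.TextContent.Trim())
                .ToList();
        }

        public static void Main()
        {
            // Hypothetical markup, not taken from any real website.
            const string html =
                "<ul><li class='repo'> First </li><li class='repo'>Second</li></ul>";

            foreach (var text in SelectTexts(html, "li.repo"))
            {
                Console.WriteLine(text);
            }
        }
    }
}
```

Whether you prefer XPath or CSS selectors is mostly a matter of taste. The rest of this article uses Html Agility Pack with XPath.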


Scraping a Static Website

First, let’s try to scrape some data from a static website. In this example, we are going to scrape my own GitHub profile https://github.com/AhmedTarekHasan

We will try to get a list of the pinned repositories on my profile. Each entry is composed of the name of the repository and its description.

So, let’s start.


Observing the Data Structure in the HTML

The structure described here reflects how my GitHub profile looked at the time of writing this article.

When I inspected the HTML, I found the following:

  1. All my pinned repositories are inside the main container at this path: div[@class='js-pinned-items-reorder-container'] > ol > li
  2. Each pinned repository sits inside a container at this path relative to the parent path: div > div
  3. Each pinned repository has its name inside div > div > span > a, and its description inside p
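Before pointing any code at the live page, it can help to try these paths on a tiny piece of markup. The following is only a sketch: the SampleHtml string is a made-up fragment that mirrors the nesting described above, not the real GitHub page, and the names XPathSample and Parse are hypothetical. It assumes the HtmlAgilityPack NuGet package is installed.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

namespace WebScraper
{
    public static class XPathSample
    {
        // Hypothetical markup that mirrors the nesting described above.
        public const string SampleHtml = @"
<div class='js-pinned-items-reorder-container'>
  <ol>
    <li>
      <div>
        <div>
          <div><div><span><a href='#'>SampleRepo</a></span></div></div>
          <p>Sample description</p>
        </div>
      </div>
    </li>
  </ol>
</div>";

        public static List<(string RepositoryName, string Description)> Parse(string html)
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            var data = new List<(string RepositoryName, string Description)>();

            // Main container > ol > li, followed by div > div for each repository.
            var repositories = doc.DocumentNode.SelectNodes(
                "//div[@class='js-pinned-items-reorder-container']/ol/li/div/div");

            // SelectNodes returns null, not an empty collection, when nothing matches.
            if (repositories == null) return data;

            foreach (var repo in repositories)
            {
                // Relative paths are evaluated from the current repository node.
                var name = repo.SelectSingleNode("div/div/span/a")?.InnerText.Trim() ?? string.Empty;
                var description = repo.SelectSingleNode("p")?.InnerText.Trim() ?? string.Empty;
                data.Add((name, description));
            }

            return data;
        }

        public static void Main()
        {
            foreach (var (name, description) in Parse(SampleHtml))
            {
                Console.WriteLine($"{name}: {description}");
            }
        }
    }
}
```

Keep in mind that @class='...' is an exact match on the whole attribute value, so it stops matching if the website adds another class to the container.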

Writing Code

Here are the steps I followed:

  1. Created a Console Application
    Solution: WebScraping
    Project: WebScraper
  2. Installed the NuGet package HtmlAgilityPack.
  3. Added the using directive using HtmlAgilityPack;
  4. Defined the method private static Task<string> GetHtml() to get the HTML.
  5. Defined the method private static List<(string RepositoryName, string Description)> ParseHtmlUsingHtmlAgilityPack(string html) to parse the HTML.
  6. Finally, the code should be as follows:

using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

namespace WebScraper
{
    class Program
    {
        static async Task Main(string[] args)
        {
            var html = await GetHtml();
            var data = ParseHtmlUsingHtmlAgilityPack(html);
        }

        // Downloads the raw HTML of the profile page.
        private static Task<string> GetHtml()
        {
            var client = new HttpClient();
            return client.GetStringAsync("https://github.com/AhmedTarekHasan");
        }

        // Extracts the name and description of every pinned repository.
        private static List<(string RepositoryName, string Description)> ParseHtmlUsingHtmlAgilityPack(string html)
        {
            var htmlDoc = new HtmlDocument();
            htmlDoc.LoadHtml(html);

            var repositories =
                htmlDoc
                    .DocumentNode
                    .SelectNodes("//div[@class='js-pinned-items-reorder-container']/ol/li/div/div");

            List<(string RepositoryName, string Description)> data = new();

            // SelectNodes returns null when nothing matches, for example if the page layout changed.
            if (repositories == null)
            {
                return data;
            }

            foreach (var repo in repositories)
            {
                // SelectSingleNode also returns null when the node is missing.
                var name = repo.SelectSingleNode("div/div/span/a")?.InnerText ?? string.Empty;
                var description = repo.SelectSingleNode("p")?.InnerText ?? string.Empty;
                data.Add((name, description));
            }

            return data;
        }
    }
}

Running this code populates data with the name and description of each pinned repository.

The strings could certainly use some cleaning, since they still contain the surrounding whitespace from the HTML, but this is not a big deal.
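As an illustration of the kind of cleaning meant here, the following sketch decodes HTML entities and collapses runs of whitespace using only the base class library. The class TextCleaner and the method Clean are hypothetical names, and the input string is made up.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

namespace WebScraper
{
    public static class TextCleaner
    {
        // Decodes HTML entities, collapses whitespace runs into single spaces, and trims the ends.
        public static string Clean(string raw)
        {
            if (string.IsNullOrEmpty(raw))
            {
                return string.Empty;
            }

            var decoded = WebUtility.HtmlDecode(raw);
            return Regex.Replace(decoded, @"\s+", " ").Trim();
        }

        public static void Main()
        {
            // Prints "Tips & Tricks"
            Console.WriteLine(Clean("\n        Tips &amp; Tricks\n      "));
        }
    }
}
```

You could apply such a helper to both the name and the description before adding them to the list.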

As you can see, it is easy to use HttpClient and HtmlAgilityPack. All you need is to get used to their APIs, and then it is an easy job.

You also need to keep in mind that some websites require more work on your side. Sometimes a website needs login details, authentication tokens, specific headers, and so on.

You can still handle all of this with HttpClient or with any other library you use to perform the call.
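For instance, sending a custom User-Agent header and a bearer token with HttpClient could look like the sketch below. The header values, the class HeaderSample, and its methods are assumptions made for illustration. What a given website actually expects is something you have to find out per site.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

namespace WebScraper
{
    public static class HeaderSample
    {
        private static readonly HttpClient Client = new HttpClient();

        // Builds a GET request with a User-Agent, an Accept header, and an optional bearer token.
        public static HttpRequestMessage BuildRequest(string url, string bearerToken = null)
        {
            var request = new HttpRequestMessage(HttpMethod.Get, url);
            request.Headers.UserAgent.ParseAdd("WebScraperSample/1.0"); // hypothetical value
            request.Headers.Accept.ParseAdd("text/html");

            if (!string.IsNullOrEmpty(bearerToken))
            {
                request.Headers.Authorization =
                    new AuthenticationHeaderValue("Bearer", bearerToken);
            }

            return request;
        }

        // Sends the request and returns the response body, throwing on non-success status codes.
        public static async Task<string> GetHtmlAsync(string url, string bearerToken = null)
        {
            using var request = BuildRequest(url, bearerToken);
            using var response = await Client.SendAsync(request);
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }

        public static async Task Main()
        {
            var html = await GetHtmlAsync("https://github.com/AhmedTarekHasan");
            Console.WriteLine(html.Length);
        }
    }
}
```

Setting headers per request, rather than on the shared client, keeps one website's credentials from leaking into calls to another.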


Scraping a Dynamic Website

Now, let’s try to scrape some data from a dynamic website. However, since I should be cautious before scraping a website, I will use the same example as before, this time assuming that the website is dynamic.

Therefore, in this example, we are again going to scrape my own GitHub profile https://github.com/AhmedTarekHasan


Observing the Data Structure in the HTML

This would be the same as before.


Writing Code

Here are the steps I followed:

  1. Created a Console Application
    Solution: WebScraping
    Project: WebScraper
  2. Installed the NuGet package HtmlAgilityPack.
  3. Installed the NuGet package Selenium.WebDriver.
  4. Installed the NuGet package Selenium.WebDriver.ChromeDriver.
  5. Added the using directive using HtmlAgilityPack;
  6. Added the using directive using OpenQA.Selenium.Chrome;
  7. Defined the method private static string GetHtml() to get the HTML.
  8. Defined the method private static List<(string RepositoryName, string Description)> ParseHtmlUsingHtmlAgilityPack(string html) to parse the HTML.
  9. Finally, the code should be as follows:

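using System.Collections.Generic;
using HtmlAgilityPack;
using OpenQA.Selenium.Chrome;

namespace WebScraper
{
    class Program
    {
        static void Main(string[] args)
        {
            var html = GetHtml();
            var data = ParseHtmlUsingHtmlAgilityPack(html);
        }

        // Loads the page in headless Chrome and returns the HTML as the browser sees it.
        private static string GetHtml()
        {
            var options = new ChromeOptions
            {
                // Adjust this path to where Chrome is installed on your machine.
                BinaryLocation = @"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"
            };

            // Run Chrome without opening a visible window.
            options.AddArguments("headless");

            // Disposing the driver closes the browser and the driver process.
            using var chrome = new ChromeDriver(options);
            chrome.Navigate().GoToUrl("https://github.com/AhmedTarekHasan");

            return chrome.PageSource;
        }

        // Extracts the name and description of every pinned repository.
        private static List<(string RepositoryName, string Description)> ParseHtmlUsingHtmlAgilityPack(string html)
        {
            var htmlDoc = new HtmlDocument();
            htmlDoc.LoadHtml(html);

            var repositories =
                htmlDoc
                    .DocumentNode
                    .SelectNodes("//div[@class='js-pinned-items-reorder-container']/ol/li/div/div");

            List<(string RepositoryName, string Description)> data = new();

            // SelectNodes returns null when nothing matches, for example if the page layout changed.
            if (repositories == null)
            {
                return data;
            }

            foreach (var repo in repositories)
            {
                // SelectSingleNode also returns null when the node is missing.
                var name = repo.SelectSingleNode("div/div/span/a")?.InnerText ?? string.Empty;
                var description = repo.SelectSingleNode("p")?.InnerText ?? string.Empty;
                data.Add((name, description));
            }

            return data;
        }
    }
}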

Running this code populates data with the same names and descriptions as in the static example.

Again, the strings could use some cleaning, but this is not a big deal.

As you can see, using Selenium.WebDriver and Selenium.WebDriver.ChromeDriver is easy as well.
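On a page that really loads its data through JavaScript, reading PageSource right after navigation may return HTML that does not contain the data yet. A common remedy is to wait for an element to appear first. The sketch below assumes that WebDriverWait is available, which may require the Selenium.Support NuGet package depending on your Selenium version. The class WaitSample, the method GetRenderedHtml, and the 10-second timeout are choices made for illustration.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

namespace WebScraper
{
    public static class WaitSample
    {
        // Opens the URL in headless Chrome, waits until at least one element matches
        // the CSS selector, and then returns the rendered HTML.
        public static string GetRenderedHtml(string url, string cssSelector)
        {
            var options = new ChromeOptions();
            options.AddArgument("--headless");

            // Disposing the driver closes the browser and the driver process.
            using var chrome = new ChromeDriver(options);
            chrome.Navigate().GoToUrl(url);

            // Throws WebDriverTimeoutException if nothing matches within 10 seconds.
            var wait = new WebDriverWait(chrome, TimeSpan.FromSeconds(10));
            wait.Until(driver => driver.FindElements(By.CssSelector(cssSelector)).Count > 0);

            return chrome.PageSource;
        }

        public static void Main()
        {
            var html = GetRenderedHtml(
                "https://github.com/AhmedTarekHasan",
                "div.js-pinned-items-reorder-container");

            Console.WriteLine(html.Length);
        }
    }
}
```

The returned HTML can then be passed to ParseHtmlUsingHtmlAgilityPack exactly as before.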


Final Words

As you can see, Web Scraping is not that hard, but it really depends on the website you are trying to scrape. Sometimes you might come across a website that needs some tricks to make it work.

That’s it. I hope you found reading this article as interesting as I found writing it.




Written by ahmedtarekhasan | .NET (DotNet) Software Engineer & Blogger | https://linktr.ee/ahmedtarekhasan
Published by HackerNoon on 2023/02/07