Why You Should Identify a Standard Structure When Capturing Web Data from E-Commerce Websites

Written by pigivinci | Published 2023/03/31
Tech Story Tags: ecommerce | web-scraping | web-data | ecommerce-web-scraping-service | web-scraping-with-python | python | tutorial | coding

TL;DR: Web scraping e-commerce websites is very common, and it is extremely valuable to identify a standard structure. Data structures should be standardized within an industry but specific (different) across industries, and they can differ depending on the level at which the information is captured (product-list page PLP, product-detail page PDP, or cart page).

Capturing web data from e-commerce websites is very common. Although each website displays its own information in the different parts of the UI (product-list page, product-detail page, cart, etc.), it is extremely valuable to identify a standard structure.

Advantages of using a standardized approach

The advantages of adopting a standard structure are enormous when we need to coordinate multiple extractions from different websites or, generally speaking, to create a resilient data extraction pipeline that withstands all the variations that even a single website can undergo over time.

We want to be able to decouple the extraction of data from its acquisition into a database or a data warehouse.
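As a minimal sketch of this decoupling (all names here are hypothetical, not our actual pipeline): the scraper emits records in the standard structure, and a separate loader persists them, so either side can change without touching the other.

```python
import json
from typing import Dict, Iterable


def extract_plp_records(raw_items: Iterable[Dict]) -> Iterable[Dict]:
    """Website-specific step: map parsed items to the standard structure."""
    for item in raw_items:
        yield {
            "website": "example.com",          # illustrative source
            "product_code": item.get("sku"),
            "product_title": item.get("name"),
            "full_price": item.get("price"),
        }


def load_records(records: Iterable[Dict], path: str) -> None:
    """Storage-specific step: a JSON-lines file stands in for a warehouse."""
    with open(path, "a", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")


# The two steps only share the standard record format.
raw = [{"sku": "A1", "name": "Sneaker", "price": 99.0}]
load_records(extract_plp_records(raw), "plp_records.jsonl")
```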

At databoutique.com we have gone through several iterations of the ideal data structure for capturing information from websites, and we came to the following conclusions (feel free to comment or, even better, join our conversation on Discord about this):

  1. Data structures need to be standardized within a given industry (say fashion, pharmaceuticals, groceries, or electronics)

  2. Data structures must be specific (i.e., different) across different industries.

  3. There can be different structures depending on the level at which the information is captured (product-list page PLP, product-detail page PDP, or cart page). This point is related to the costs of accessing the different parts of the website.

The additional advantage of organizing data structures this way is that it helps address the final user's request: by being transparent about where the information is captured, we align the request with the cost of actually acquiring it.
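To make these three points concrete, here is a hedged sketch of how field lists could differ by industry and by capture level. The fashion PLP fields mirror the field list later in this article; the grocery and PDP extras are illustrative assumptions, not an actual schema.

```python
# Shared within an industry (point 1), different across industries (point 2),
# and different per capture level (point 3).
FASHION_PLP = ["brand", "product_code", "product_title",
               "full_price", "discounted_price", "image_url", "pdp_url"]

FASHION_PDP = FASHION_PLP + ["description", "available_sizes"]     # assumed extras

GROCERY_PLP = ["brand", "product_code", "product_title",
               "full_price", "price_per_unit", "unit_of_measure"]  # assumed fields
```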

Advantages of decoupling PLP and PDP processes (and field list)

Accessing a PDP requires first accessing the PLP. This is why having a PLP structure different from the PDP structure is important.

Let’s imagine we want to crawl the products of a fashion website like Zalando daily.

While some information might change daily (like product prices or product availability), other information is static, like the product description and product images (let’s leave the availability of product sizes out of this example; it will be treated separately).

We could scan all PLP pages daily and each product detail page (PDP) only once, when we need to capture the details of each product. Let’s do some math and see the difference.

Assumptions:

  • The cost in terms of bandwidth and proxies is similar for scraping a single PLP or a PDP (the actual cost here is irrelevant; we are interested in the comparison)
  • The number of products that appear on the Zalando website during a year is 1 million (even this figure is irrelevant, but it helps frame the case)
  • The average number of items contained on a single product-list page (PLP) is 80
  • 80% of the PLP pages are filled (meaning we have 80 items on almost every page; some pages list fewer because there are no more items to show for a particular category)

If we were to crawl the product detail pages (PDP) daily, we would have to access all product-list pages (PLP) daily (15k pages = 1M items / 80 items per page / 80% fill rate) and then visit each product detail page (1M), for a total of roughly 370M pages (1M × 365 + 15k × 365) scraped per year.

If we were to crawl PLPs daily and each PDP only once, we would have to crawl only about 6.7M pages a year (15k × 365 + 1M).

That’s 55 times (!!) cheaper. Or, if you prefer, with a cost per page of 0.001 USD, it would cost 6.7k USD to scrape Zalando for a year through the PLP approach vs. 370k USD/year with the daily PDP approach.
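The whole back-of-the-envelope comparison fits in a few lines of Python, so you can plug in your own assumptions:

```python
# Reproducing the math above; all inputs are the assumptions listed earlier.
products_per_year = 1_000_000   # items on the site over a year
items_per_plp = 80              # average items per product-list page
fill_rate = 0.80                # share of PLP pages that are full
cost_per_page = 0.001           # USD per page, illustrative

plp_pages = products_per_year / items_per_plp / fill_rate   # ~15,625 pages

daily_pdp = (products_per_year + plp_pages) * 365   # ~370M pages/year
daily_plp = plp_pages * 365 + products_per_year     # ~6.7M pages/year

print(f"daily PDP: {daily_pdp:,.0f} pages -> {daily_pdp * cost_per_page:,.0f} USD")
print(f"daily PLP: {daily_plp:,.0f} pages -> {daily_plp * cost_per_page:,.0f} USD")
print(f"ratio: {daily_pdp / daily_plp:.0f}x cheaper")
```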

And this is very important to transfer to the final user, so it is transparent that if she or he wants daily refreshes of information that is available only on the PDP, the cost for the data alone would be roughly 55 times higher.

Product-List Page categories

A product-list page, in the different versions we encountered, can be categorized as follows:

  1. Categorized lists: where products are presented within a category tree that can be either stable or dynamic (we’ll get into that later). A nice example is industry-specific websites, such as fashion stores, electronics stores, and so on, where the website somehow assists the user in finding what she or he is looking for without knowing the exact product yet.

Nike.com is a good example of this: the store guides us through our shopping experience via the categories, displaying products that are in our area of interest. We can, of course, use the search bar, but this is secondary in the UX, as we can see from the positioning of the search bar itself.

  2. Uncategorized lists: where products are not meant to be browsed by category but found via in-site search.

Amazon is a good example of this. Although there is a category tree, the main UX happens via the search bar.

We will consider categorized lists for now, keeping in mind that we want a structure to crawl the PLP without entering the Product Detail Page (PDP) for the moment.

This is particularly helpful on large websites, with near or above a million products, where the number of pages crawled differs by orders of magnitude between PLP and PDP.
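Before getting to the field list, here is a hedged sketch of what a categorized-PLP crawl can look like in Python. The URL pattern and CSS selectors are hypothetical placeholders, and the PDP URL is stored but never visited:

```python
import requests
from bs4 import BeautifulSoup


def crawl_plp(base_url: str, category_path: str):
    """Paginate one category's product-list pages, yielding one record per item."""
    page = 1
    while True:
        resp = requests.get(f"{base_url}/{category_path}", params={"page": page})
        soup = BeautifulSoup(resp.text, "html.parser")
        cards = soup.select("article.product-card")   # hypothetical selector
        if not cards:                                 # empty page: category done
            break
        for card in cards:
            yield {
                "product_title": card.select_one("h3").get_text(strip=True),
                "pdp_url": card.select_one("a")["href"],   # stored, not visited
            }
        page += 1


# Usage (hypothetical site):
# for record in crawl_plp("https://shop.example.com", "women/sneakers"):
#     print(record)
```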

PLP standardized field list

Now on to the data structure. There is no gold standard; our decisions came from benchmarking many e-commerce websites and extracting the best structure common to all of them.

We have different groups of information we need:

  1. technical info

    1. the field structure version

    2. the website we are capturing

    3. the timestamp or date the info was captured

  2. context info

    1. what country/ geography does the acquisition refer to

    2. what currency are the prices expressed in

  3. content

    1. category level 1

    2. category level 2

    3. category level 3

    4. brand

    5. product code

    6. product title

    7. full price

    8. discounted price

  4. reference fields

    1. image URL

    2. PDP URL

Note: not all websites will have three levels of category, and in some cases, not all branches of the category tree will reach full depth. In both cases, we accept “n.a.” (not available) in those fields.

For websites with branches deeper than three levels (but to a maximum of 4 or 5), we accept concatenating the last levels into category level 3, as in the sketch below.
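As one possible encoding (a transcription of the field list above into a Python dataclass, not an official schema), together with a small helper for the category-level rules just described:

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PLPRecord:
    # 1. technical info
    schema_version: str
    website: str
    captured_at: str          # timestamp or date of capture
    # 2. context info
    country: str
    currency: str
    # 3. content
    category_level_1: str
    category_level_2: str
    category_level_3: str
    brand: str
    product_code: str
    product_title: str
    full_price: float
    discounted_price: float
    # 4. reference fields
    image_url: str
    pdp_url: str


def category_levels(branch: List[str]) -> Tuple[str, str, str]:
    """Pad short branches with 'n.a.' and fold levels 3-5 into level 3."""
    padded = branch + ["n.a."] * max(0, 3 - len(branch))
    return padded[0], padded[1], " > ".join(padded[2:])


# category_levels(["women", "shoes", "sneakers", "running"])
# -> ("women", "shoes", "sneakers > running")
```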

In the Data Boutique version, we use additional fields related to our business model. Check the complete version of this field list for future updates.



Written by pigivinci | Web scraping expert
Published by HackerNoon on 2023/03/31