Utilizing Web Scraping and Alternative Data in Financial Markets

Written by pigivinci | Published 2023/04/26
Tech Story Tags: web-scraping | financial-markets | alternative-data | data | financial-technology | web-data | data-scraping | scraping

TL;DR: In the financial sector, data is key to investors' decisions about their investment strategies. In addition to web data, alternative data includes satellite images, credit card transactions, sentiment data, and more. In this post and in the next The Lab, we'll take a deep dive into the world of Alternative Data.

In many previous posts on The Web Scraping Club, we have discussed how to scrape websites under various circumstances, such as when a website is protected by Cloudflare or when using a mobile app.

However, we have not explored which sectors benefit the most from the scraped data.

Some sectors are quite obvious: e-commerce businesses seeking to understand their competitors' offerings, delivery apps, and other marketplaces.

However, there is one sector that always craves data, as a new and reliable dataset can lead to millions of dollars in benefits, and that is the financial sector. At Re Analytics, my current company, we have gained experience in this industry. In this post and the upcoming Lab (scheduled for April 27th), we will dive deep into the world of Alternative Data.

What do we mean by Alternative Data?

In the financial sector, data plays a crucial role in helping investors make decisions about their investment strategies. Michael Bloomberg built his empire by selling financial data and news to investors.

Jim Simons, an accomplished mathematician, founded Renaissance Technologies and was among the first to apply what we now call data science to finance, giving his funds a competitive edge over rivals. Medallion, the firm's flagship fund, which is closed to outside investors, has earned more than $100 billion in trading profits since its establishment in 1988. According to Wikipedia, that translates to a 66.1% average gross annual return, or a 39.1% average net annual return, between 1988 and 2018.

As the world's economy becomes more digitalized, new sources of information are becoming available. While traditional financial data sources such as financial statements, balances, and historical financial market data remain important, different sources from the digital world are becoming increasingly attractive to the financial industry. These sources are referred to as alternative data. According to the official definition, alternative data is information about a particular company that is published by sources outside the company, providing unique and timely insights into investment opportunities.

What types of alternative data exist?

Since alternative data is data that does not come from inside the company, this definition includes a wide range of possibilities.

As we can see from this slide from alternativedata.org, in addition to web data, we have satellite images, credit card transactions, sentiment data, and so on.

In this 2010 article, we can gain a better understanding of how satellite images can be used for financial purposes. A company called Remote Sensing Metrics LLC used satellite images to monitor 100 Walmart parking lots as a representative sample. They counted the number of cars parked outside on a month-by-month basis, using car counts as a proxy for customer traffic to estimate quarterly revenues.

As you may have guessed, this method does not yield an exact dollar figure, but with a suitable model it can estimate the revenues of publicly traded companies well before official figures are released, potentially months in advance.
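To make the car-count idea concrete, here is a minimal sketch of such a model: an ordinary least-squares fit of reported quarterly revenue against average car counts from past quarters, applied to the newest count. All figures are invented for illustration and are not real Walmart data.

```python
# Hypothetical sketch: estimate quarterly revenue from parking-lot car
# counts with a simple linear model fitted on past quarters.
# All numbers below are made up for illustration.

def fit_linear(xs, ys):
    """Ordinary least squares for y = a*x + b, stdlib only."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

# Past quarters: average cars counted per lot vs. reported revenue ($bn).
car_counts = [118, 121, 130, 127]
revenues = [113.0, 114.9, 120.1, 117.5]

a, b = fit_linear(car_counts, revenues)

# New quarter: car counts are available months before the earnings report.
latest_count = 133
estimate = a * latest_count + b
print(f"estimated revenue: ${estimate:.1f}bn")
```

A real version would of course control for lot coverage, seasonality, and store openings/closures; the point is only that the mapping from counts to dollars is a fitted model, not a direct measurement.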

Why is web scraping important in the alternative data landscape?

As the preceding slide shows, alternative data providers fall into two groups: those who own or transform existing data (as in the satellite example above, where the company bought imagery from providers and built a data product to estimate Walmart's revenue), and those who extract data from public sources and turn it into insights (including web data and sentiment analysis).

An example of the second category is a company that extracts customer sentiment from online reviews to determine whether a target company is losing its grip on its customers. By scraping e-commerce data, we can also detect whether a particular brand's sales are accelerating or slowing relative to its direct competitors, which may signal product or sales issues.
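As a toy illustration of this second category, the sketch below flags brands whose scraped weekly review volume (a rough proxy for sales) grows much faster than the category average. The brand names, counts, and the 10-point threshold are all invented assumptions, not real data.

```python
# Hypothetical sketch: flag brands whose scraped weekly review counts grow
# much faster than the category average (review volume as a sales proxy).
# Data and threshold are illustrative assumptions, not a real dataset.

weekly_reviews = {
    "brand_a": [210, 220, 240, 310],   # accelerating
    "brand_b": [180, 175, 185, 190],
    "brand_c": [95, 100, 98, 102],
}

def growth(series):
    """Relative growth of the latest observation vs. the prior average."""
    baseline = sum(series[:-1]) / (len(series) - 1)
    return (series[-1] - baseline) / baseline

category_avg = sum(growth(s) for s in weekly_reviews.values()) / len(weekly_reviews)

outliers = [
    brand for brand, series in weekly_reviews.items()
    if growth(series) > category_avg + 0.10  # 10-point spread, arbitrary
]
print(outliers)
```

The same structure works in reverse: a brand whose review volume stalls while competitors grow is the kind of early warning a fundamental analyst might investigate.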

What should you watch out for with web-scraped data in the financial industry?

Financial markets are subject to strict regulations designed to prevent fraud, insider trading (trading stocks based on non-public information about a company), and other legal issues. Using improperly collected data can have legal consequences for both the fund itself and its managers.

Therefore, if you plan to sell data to hedge funds and investors, you should be prepared for a lot of paperwork. A good article by Zyte provides detailed information on what you need to demonstrate to funds to show that you collected the data properly.

As stated in the article:

Generally speaking, the risks associated with alternative data can be broken into four categories:

  • Exclusivity & Insider Trading
  • Privacy Violations
  • Copyright Infringement
  • Data Acquisition

Let’s briefly summarize the risks related to these four points.

Exclusivity & Insider Trading:

As mentioned earlier, insider trading involves trading stocks on non-public information. This means that data behind a paywall is generally off-limits, since it is not publicly available to everyone but only to paying users of the target website. Additionally, if a scraper needs to log in to the target website to obtain data, that raises red flags, and you must ensure you are not violating the site's terms of service.

Privacy Violations:

This applies when scraping personal data from the web, as privacy regulations around the world, especially in Europe with GDPR, have become increasingly restrictive in recent years.

For this reason, scraping personal data is generally a no-go in any project unless you can anonymize it.
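One common approach, sketched below under the assumption that salted hashing is an acceptable anonymization technique for your use case (a legal question as much as a technical one), is to pseudonymize direct identifiers and drop free-text fields before records enter the dataset. The field names and record are hypothetical.

```python
# Hypothetical sketch: pseudonymize a scraped reviewer record before it
# enters a dataset, replacing the direct identifier with a salted hash
# and dropping free-text fields that may contain personal data.
import hashlib

SALT = b"rotate-me-per-project"  # assumption: salt kept outside the dataset

def pseudonymize(record):
    token = hashlib.sha256(SALT + record["username"].encode()).hexdigest()[:16]
    return {
        "reviewer_id": token,   # stable, but not reversible without the salt
        "rating": record["rating"],
        "date": record["date"],
        # note: the review text is dropped entirely, not just the username
    }

raw = {"username": "jane_doe_84", "rating": 4, "date": "2023-04-01",
       "text": "Great shoes, shipped to 12 Oak St..."}
clean = pseudonymize(raw)
print(clean)
```

Note that the token is deterministic, so the same reviewer maps to the same ID across scrapes, which preserves analytical value while removing the identifier itself.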

Copyright Infringement:

Scraping and reselling data that is protected by copyright, such as photos and articles, is prohibited.

Data Acquisition:

Funds will undoubtedly want to know if the entire data acquisition process was carried out in the fairest possible manner, or if it caused any harm to the target websites.

For this reason, the Investment Data Standards Organization released a checklist with best practices to follow for web scraping. You can find all the points in the linked file, but just to give you an idea, here are some points:

  • A data collector should assess a website according to the terms of its robots.txt.
  • A data collector should access websites in a way that the access does not interfere with or impose an undue burden on their operation.
  • A data collector should not access, download or transmit non-public website data.
  • A data collector should not circumvent logins or other access control restrictions such as CAPTCHAs.
  • A data collector should not utilize IP masking or rotation to avoid website restrictions.
  • A data collector should respect valid cease and desist notices and the website’s right to govern the terms of access to the website and data.
  • A data collector should respect all copyright and trademark ownership and not act so as to obscure or delete copyright management information.

As you can see, the guidelines are very strict to avoid any possible issues for the data provider and the fund itself.
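The first checklist point can be implemented directly with Python's standard library. The sketch below parses a robots.txt (inlined as a string here, rather than fetched, to keep the example self-contained) and checks whether a URL may be collected; the user-agent string and rules are placeholders.

```python
# Minimal sketch of the first checklist point: consult robots.txt before
# fetching any URL. The rules and user agent below are placeholders.
from urllib import robotparser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

allowed = rp.can_fetch("my-data-collector", "https://example.com/products/")
blocked = rp.can_fetch("my-data-collector", "https://example.com/private/x")
delay = rp.crawl_delay("my-data-collector")  # seconds between requests

print(allowed, blocked, delay)
```

In a real collector you would call `rp.set_url("https://<site>/robots.txt")` followed by `rp.read()` instead of inlining the rules, and honor the returned crawl delay between requests, which also addresses the second checklist point about not imposing an undue burden on the site.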

Key features for web-scraped data in the financial industry

Given all of this, what features must a dataset have for the financial industry to consider it interesting?

First and foremost, it's important to understand that funds have their own strategy for studying the markets, which impacts what they are looking for. In broad terms, we can divide funds into two categories: quantitative and fundamental. However, many funds fall somewhere between these two poles, combining both strategies.

Fundamental investors typically study a company's economics, analyzing its business model and risk factors with a bottom-up approach. Quants, on the other hand, feed complex machine learning models with large amounts of data, searching for correlations between the data they ingest and stock market movements using a top-down approach.

As you can imagine, these two approaches require different types of data. For fundamental investors, the dataset can be very specific to a stock ticker, but it should provide valuable insights into the business model of the target stock. For quants, the dataset should have some history (usually several years, depending on the market being covered and the model's needs) and cover many stocks. Otherwise, it may not be efficient to add the data to the model for just a few stocks.
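As a toy version of the quant workflow, the sketch below computes the Pearson correlation between a scraped weekly signal and the stock's next-week returns. Both series are invented for illustration; a real model would need years of history across many tickers, as noted above.

```python
# Hypothetical sketch: correlate a scraped signal (e.g. weekly change in a
# retailer's product-page count) with the stock's next-week returns.
# Both series below are invented toy data.
import math

signal  = [0.02, -0.01, 0.03, 0.00, 0.04, -0.02, 0.01]    # weekly signal change
returns = [0.015, -0.005, 0.02, 0.002, 0.03, -0.01, 0.0]  # next-week return

def pearson(xs, ys):
    """Pearson correlation coefficient, stdlib only."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = pearson(signal, returns)
print(f"signal/return correlation: {r:.2f}")
```

A high in-sample correlation on seven points means nothing on its own, which is exactly why quants demand long histories and broad stock coverage before ingesting a new dataset.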

Final remarks

With this article, I aimed to introduce the Alternative Data landscape, as it is one of the growing sectors in the data world where web scraping plays a significant role.

It is a complex environment, where there are strict rules on data sourcing, and providers must prove their accountability. Therefore, it is challenging for a freelancer to enter, whereas it could be relatively easier for established companies.

Depending on the investors, you may have different requirements in terms of the timeframe and specifications of the data, but if you can build a great data product, it could be a rewarding field to enter.

In the next episode of The Lab, we will attempt to build a dataset for financial investors as a fun experiment.




Written by pigivinci | Web scraping expert
Published by HackerNoon on 2023/04/26