13 Best Datasets for Power BI Practice

In 2022, Gartner named Microsoft, with Power BI, a Leader in its Magic Quadrant for Analytics and Business Intelligence Platforms.

With business intelligence tools like Microsoft Power BI, raw data can be extracted, cleaned, and analyzed to create insights that help organizations make data-driven decisions.

In this article, we will look at the 13 Best Datasets for Power BI Practice, which are essential in helping data professionals build their proficiency in Power BI.

List of the Best Datasets for Power BI Practice

1. Sample Superstore Sales

The Sample Superstore Sales dataset provides sales data for a fictional retail company, including information on products, orders and customers.

This dataset includes the following variables:

Order ID - A unique identifier for each order.
Customer ID - A unique identifier for each customer.
Order Date - The date of the order placement.
Ship Date - The date the order was shipped.
Ship Mode - The shipping mode for the order (e.g. standard, same-day).
Segment - The customer segment (e.g. Consumer, Corporate, Home Office).
Region - The region where the customer is located (e.g. West, Central, East).
Category - The category of the product purchased (e.g. Furniture, Technology, Office Supplies).
Sub-Category - The sub-category of the product purchased (e.g. Chairs, Desktops, Paper).
Product Name - The name of the product purchased.
Sales - The sales revenue for the product purchased.
Quantity - The number of units of the product purchased.
Discount - The discount applied to the product purchased.
Profit - The profit generated by the product purchased.
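
As one possible first exercise with these columns, the sketch below computes a profit margin (Profit divided by Sales) per Category in plain Python. The rows are made-up illustrative values, not records from the actual dataset, and the helper name profit_margin_by_category is hypothetical. In Power BI, the same aggregation would normally be written as a measure.

```python
from collections import defaultdict

# Made-up rows for illustration only; column names follow the list above.
rows = [
    {"Category": "Furniture", "Sales": 200.0, "Profit": 20.0},
    {"Category": "Furniture", "Sales": 300.0, "Profit": 30.0},
    {"Category": "Technology", "Sales": 500.0, "Profit": 125.0},
]

def profit_margin_by_category(rows):
    """Return {category: total profit / total sales} for the given rows."""
    sales = defaultdict(float)
    profit = defaultdict(float)
    for row in rows:
        sales[row["Category"]] += row["Sales"]
        profit[row["Category"]] += row["Profit"]
    # Skip categories with zero sales to avoid dividing by zero.
    return {cat: profit[cat] / sales[cat] for cat in sales if sales[cat] != 0}

margins = profit_margin_by_category(rows)
for category, margin in sorted(margins.items()):
    print(f"{category}: {margin:.1%}")
```

Summing Sales and Profit before dividing gives the margin for the whole category, which is generally not the same as averaging the per-row margins.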

2. Adventure Works DW

Adventure Works DW is a sample database for Microsoft SQL Server Analysis Services (SSAS). It offers a dimensional data model for a fictional bicycle manufacturer, Adventure Works Cycles, and comprises information on product catalogues, sales, customer demographics and time-based data for analysis and reporting.

This dataset includes the following variables:

Customer - This includes customer demographics, such as age, gender, education, and income.
Sales - This includes sales information, such as sales territory, salesperson, and order date.
Product - This includes product categories, subcategories, and product names.
Date - This includes the date and related attributes such as quarter, month, day, and day of the week.
Geography - This includes the state, city and postal code associated with customers and sales orders.

To download this dataset, you can click here.
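
A dimensional model like this one keeps fact rows (sales) separate from dimension rows (dates, products, customers) and links them by keys. The sketch below illustrates that idea in plain Python with made-up rows. The key and column names (DateKey, Quarter, SalesAmount) are assumptions for illustration and may not match the actual Adventure Works tables.

```python
from collections import defaultdict

# Hypothetical date dimension: one row per date key with descriptive attributes.
dim_date = {
    20230115: {"Quarter": "Q1", "Month": "January"},
    20230420: {"Quarter": "Q2", "Month": "April"},
}

# Hypothetical sales fact table: each row points at a dimension row by key.
fact_sales = [
    {"DateKey": 20230115, "SalesAmount": 100.0},
    {"DateKey": 20230115, "SalesAmount": 50.0},
    {"DateKey": 20230420, "SalesAmount": 75.0},
]

def sales_by_quarter(facts, dates):
    """Look up each fact's quarter in the date dimension and sum the sales."""
    totals = defaultdict(float)
    for fact in facts:
        quarter = dates[fact["DateKey"]]["Quarter"]
        totals[quarter] += fact["SalesAmount"]
    return dict(totals)

print(sales_by_quarter(fact_sales, dim_date))
```

In Power BI, this key lookup corresponds to a relationship between the fact table and the dimension table in the data model.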

3. Flight Delays and Cancellations

This real-world dataset comprises data on flight numbers, airlines, departure and arrival times, and the reasons for any delays or cancellations. With this dataset, Power BI users can perform data analysis and create interactive dashboards that identify the most common causes of flight disruptions by studying how often each airline's flights are delayed or cancelled.

It comprises the following variables:

Flight Duration - The length of time from departure to arrival for the flight.
Delay Reason - The reason for any delay in the flight. Examples may include weather, mechanical issues, or air traffic control.
Delay Time - The amount of time by which the flight was delayed.
Cancellation Reason - The reason for cancellation of the flight. Examples may include weather, mechanical issues, or insufficient passenger demand.
Date of Flight - The date on which the flight took place.
Flight Number - A unique identifier assigned to each flight by the airline.
Airline Name - The name of the airline operating the flight.
Departure Airport - The airport from which the flight is scheduled to depart.
Arrival Airport - The airport at which the flight is scheduled to arrive.
Scheduled Departure Time - The time at which the flight was scheduled to depart, as originally planned by the airline.
Actual Departure Time - The actual time at which the flight departed, if different from the scheduled departure time.
Scheduled Arrival Time - The time at which the flight was scheduled to arrive, as initially planned by the airline.
Actual Arrival Time - The actual time at which the flight arrived, if different from the scheduled arrival time.
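
If a copy of the data lacks a usable Delay Time value, it can be derived from the scheduled and actual times. The following is a minimal sketch, assuming timestamps are stored as text in a "YYYY-MM-DD HH:MM" layout. That format, the function name delay_minutes and the sample values are assumptions for illustration, not properties of the dataset.

```python
from datetime import datetime

def delay_minutes(scheduled, actual, fmt="%Y-%m-%d %H:%M"):
    """Return minutes between scheduled and actual time (negative if early)."""
    delta = datetime.strptime(actual, fmt) - datetime.strptime(scheduled, fmt)
    return delta.total_seconds() / 60

# Made-up timestamps for illustration only.
print(delay_minutes("2023-03-01 09:00", "2023-03-01 09:45"))
# A departure that slips past midnight is handled because full dates are compared.
print(delay_minutes("2023-03-01 23:50", "2023-03-02 00:20"))
```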

4. NYC Taxi Data

NYC Taxi Data is a large and complex dataset that contains information on taxi trips in New York City, including trip durations, fare amounts, and pickup and drop-off locations. It covers millions of trips and spans several years, providing a rich source of information about urban mobility and transportation patterns in the city.

By analyzing this data, you can gain insights into various areas of the taxi industry in NYC. For example, you can visualize the distribution of trips over time and space, and identify hot spots of taxi activity in the city.

The dataset includes the following variables:

Trip Duration - The duration of the trip, in seconds.
Trip Distance - The distance travelled by taxi, in miles.
Number of Passengers - Total number of passengers in the taxi.
Fare Amount - The fare charged to the passenger, in dollars.
Payment Method - The method of payment used by passengers (e.g. credit card or cash).
Pickup and Drop-off Location - The GPS coordinates of the pickup and drop-off locations.
Trip Type - This indicates whether the trip is a dispatched trip (green taxi or for-hire) or a street hail (yellow taxi).
Pickup and Drop-off Time - The time and date at which the pickup and drop-off took place.

To download this dataset, click here.
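
To give a feel for the "distribution of trips over time" mentioned above, the sketch below counts trips by pickup hour in plain Python. The timestamp format, the function name trips_per_hour and the three sample pickups are assumptions for illustration only. The real files may store times differently.

```python
from collections import Counter
from datetime import datetime

# Made-up pickup timestamps for illustration only.
pickups = ["2023-01-05 08:15:00", "2023-01-05 08:50:00", "2023-01-05 17:05:00"]

def trips_per_hour(timestamps, fmt="%Y-%m-%d %H:%M:%S"):
    """Count trips by the hour of day (0-23) in which the pickup occurred."""
    return Counter(datetime.strptime(ts, fmt).hour for ts in timestamps)

counts = trips_per_hour(pickups)
for hour in sorted(counts):
    print(f"{hour:02d}:00 - {counts[hour]} trip(s)")
```

The resulting hour-by-count table is the kind of shape a Power BI column or line chart expects.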

5. Global Superstore

The Global Superstore dataset simulates the retail sales operations of a company with stores in multiple countries. It includes information about customers, orders and products. It is particularly useful for exploring retail sales data because it offers a large and diverse set of records for analyzing customer behaviour, product performance and sales patterns.

It comprises the following variables:

Order ID - A unique identifier for each order.
Order Date - The date and time the order was placed.
Ship Date - The date and time the order was shipped.
Ship Mode - The method used to ship the order (e.g. standard, express).
Customer ID - A unique identifier for each customer.
Customer Name - The full name of the customer.
Segment - The customer segment such as Home Office or Corporate.
Country - The country where the customer resides.
City - The city where the customer resides.
State - The state where the customer resides.
Postal Code - The postal code of the customer's residence.
Region - The geographic region where the customer resides.
Product ID - A unique identifier for each product.
Category - The broad product category, such as Furniture, Office Supplies, or Technology.
Sub-Category - The specific product sub-category, such as Chairs, Paper, or Phones.
Product Name - The name of the product.
Sales - The total sales revenue for the product.
Quantity - The number of units of the product sold.
Discount - The discount applied to the product.
Profit - The total profit earned from the product.

To download this dataset, click here.

6. Seattle Weather Data

This comprehensive dataset provides historical weather information for the Seattle, Washington area. It can be used to study climate and weather patterns, as well as the impact of weather on various industries and activities, such as tourism, agriculture and transportation.

Some of the critical variables in the Seattle Weather Data include:

Date - The date of the observation.
Prcp - The amount of precipitation, in inches.
Tmax - The maximum temperature for that day, in degrees Fahrenheit.
Tmin - The minimum temperature for that day, in degrees Fahrenheit.
Rain - This shows TRUE if rain was observed on that day and FALSE if it was not.
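
The columns above lend themselves to simple derived fields. The sketch below converts Tmax from Fahrenheit to Celsius and recomputes a rain flag from Prcp. The rule that any precipitation above zero counts as rain is an assumption, as are the function names and the two sample observations. It is not necessarily how the dataset's own Rain column was produced.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32) * 5 / 9

def rain_observed(prcp_inches):
    """Assumed rule: any precipitation above zero counts as a rainy day."""
    return prcp_inches > 0

# Made-up observations for illustration only.
observations = [
    {"Date": "2023-01-01", "Prcp": 0.25, "Tmax": 50.0, "Tmin": 41.0},
    {"Date": "2023-01-02", "Prcp": 0.0, "Tmax": 59.0, "Tmin": 32.0},
]

for obs in observations:
    obs["Rain"] = rain_observed(obs["Prcp"])
    obs["TmaxC"] = fahrenheit_to_celsius(obs["Tmax"])
    print(obs["Date"], obs["Rain"], round(obs["TmaxC"], 1))
```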

7. World Bank Development Indicators

This dataset contains information on GDP, life expectancy, and literacy rates for various nations throughout the world. It also includes many economic and social variables.

Some of the variables included in this dataset are:

Gross Domestic Product (GDP)
Inflation
Unemployment rate
Government debt
Trade balance
Life expectancy
Infant mortality rate
Access to electricity
Literacy rate
Mobile cellular subscriptions

Note: The variables included in the dataset depend on the year and the country being analyzed.

You can download the dataset directly from the World Bank website or from Kaggle.

8. US Health Data

The US Health Dataset provides comprehensive information on health behaviour and health status, including data on healthcare utilization, physical activity and chronic diseases. It can be used to study trends in public health and to investigate the impact of lifestyle and health behaviour on health outcomes.

The US Health Data is sourced from the Centers for Disease Control and Prevention (CDC), the National Center for Health Statistics (NCHS), and the Agency for Healthcare Research and Quality (AHRQ).

The common variables in this dataset include:

Demographic information - Age, gender, race, and ethnicity
Health status indicators - Self-reported health, chronic conditions, and disability
Healthcare utilization measures - Hospitalizations, emergency room visits, and primary care visits
Health behaviours - Smoking, exercise, and diet
Health outcomes - Life expectancy, mortality rates, and incidence of specific diseases
Healthcare costs - Total medical expenditures, out-of-pocket costs, and insurance coverage
Access to healthcare - Insurance coverage, availability of healthcare providers, and proximity to healthcare facilities

Note: Variables included in the US Health Dataset can vary depending on the data source.

9. Stack Overflow Survey Results

The Stack Overflow Survey Results dataset contains responses from the annual Stack Overflow developer survey. It covers various aspects of the developer experience, such as salary and compensation, preferred technologies and work satisfaction. It can be used to explore and gain insights into the state of the developer community.

This dataset contains a large number of variables, including but not limited to the following:

Personal Information - Age, gender, country, and education level.
Employment - Type of employment, company size, and job title.
Development Experience - Years of experience, primary programming language, and development environment.
Salary and Compensation - Salary, currency, and benefits.
Work Satisfaction - Job satisfaction, career satisfaction, and job search.
Technology Usage - Preferred operating system, programming language, development environment, and tooling.
Community Involvement - Contributions to open-source projects, Stack Overflow reputation, and participation in developer communities.

The dataset can be downloaded directly from the Stack Overflow website.

10. Titanic: Machine Learning from Disaster

This popular, openly available dataset offers information on the passengers aboard the Titanic when it sank on April 15, 1912.

Some of the variables included in the dataset:

PassengerId - A unique identifier for each passenger.
Survived - This shows whether the passenger survived or not (0 = No, 1 = Yes).
Pclass - A passenger's class (1 = 1st, 2 = 2nd, 3 = 3rd).
Name - A passenger's name.
Sex - A passenger's gender.
Age - A passenger's age.
SibSp - The number of siblings/spouses aboard.
Parch - The number of parents/children aboard.
Ticket - The ticket number.
Fare - The fare paid for the ticket.
Cabin - The cabin number.
Embarked - The port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).

You can download the dataset on Kaggle.
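
A common first analysis on this data is the survival rate per passenger class. Because Survived is coded as 0 or 1, the rate is simply the mean of that column within each Pclass. The sketch below shows the calculation in plain Python. The four rows are made-up values rather than real passenger records, and the function name survival_rate_by_class is hypothetical.

```python
from collections import defaultdict

# Made-up rows for illustration only; not real passenger records.
passengers = [
    {"Pclass": 1, "Survived": 1},
    {"Pclass": 1, "Survived": 0},
    {"Pclass": 3, "Survived": 0},
    {"Pclass": 3, "Survived": 0},
]

def survival_rate_by_class(rows):
    """Return {Pclass: share of passengers with Survived == 1}."""
    survived = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        total[row["Pclass"]] += 1
        survived[row["Pclass"]] += row["Survived"]
    return {pclass: survived[pclass] / total[pclass] for pclass in total}

rates = survival_rate_by_class(passengers)
for pclass in sorted(rates):
    print(f"Class {pclass}: {rates[pclass]:.0%} survived")
```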

11. Wine Quality

The Wine Quality dataset contains information on red and white wine samples. It is typically used to predict or classify the quality of a wine from chemical properties such as pH, density, alcohol content and citric acid content.

The common variables included in this dataset:

Fixed Acidity - The amount of fixed acids in the wine, expressed in g/dm^3.
Volatile Acidity - The amount of volatile acids in the wine, expressed in g/dm^3.
Citric Acid - The amount of citric acid in the wine, expressed in g/dm^3.
Residual Sugar - The amount of residual sugar in the wine, expressed in g/dm^3.
Chlorides - The amount of chloride in the wine, expressed in g/dm^3.
Free Sulfur Dioxide - The amount of free sulfur dioxide in the wine, expressed in mg/dm^3.
Total Sulfur Dioxide - The amount of total sulfur dioxide in the wine, expressed in mg/dm^3.
Density - The density of the wine, expressed in g/cm^3.
pH - The pH level of the wine.
Sulphates - The amount of sulphates in the wine, expressed in g/dm^3.
Alcohol - The alcohol content of the wine, expressed in % vol.
Quality - The quality rating of the wine, on a scale of 0 to 10.

You can download the dataset from UCI Machine Learning Repository by clicking here.

12. US Crime Rates

The US Crime Rates dataset provides information on crime rates in the United States. It is organized by geographical region, period or other relevant factors and is mostly used to analyze crime trends and patterns, as well as to support criminal justice decision-making and law enforcement. It is also commonly used for exploratory data analysis and visualization, and can be used to create interactive dashboards and reports in Power BI.

Some of the variables included in the dataset:

M - The percentage of males aged 14–24.
Po1 - The per capita expenditure on police protection in 1960.
Po2 - The per capita expenditure on police protection in 1959.
M.F - The number of males per 100 females.

You can download the dataset from Kaggle.

13. Airbnb Listings

This dataset is a collection of data on Airbnb listings, including price, amenities, type of property, number of bedrooms and location in New York City. It is commonly used for exploratory data analysis and visualization, with a focus on the distribution of listings and prices across different locations and neighbourhoods.

Some of the variables included in the dataset:

Id - Airbnb's unique identifier for the listing.
Host Id - Airbnb's unique identifier for the host.
Host name - The name of the host.
Neighbourhood Group - The neighbourhood group (e.g. Manhattan, Brooklyn).
Host identity verification - This shows whether the host's identity is verified or unconfirmed.

The dataset can be accessed on Kaggle by clicking here.

Common Project Use Cases for the Power BI Datasets

Retail Analytics
Datasets: Sample Superstore Sales, Global Superstore, Adventure Works DW
Use cases: Retail sales analysis, customer segmentation, product performance analysis, market segmentation and sales territory analysis.

Transportation Analytics
Datasets: NYC Taxi Data, Flight Delays and Cancellations
Use cases: Taxi demand and supply analysis, trip analysis, driver performance analysis, fare comparison, flight performance analysis and airport comparison.

Weather Analytics
Datasets: Seattle Weather Data
Use cases: Weather trend analysis, climate change analysis, prediction of weather patterns and impact on various industries.

Economic Analytics
Datasets: World Bank Development Indicators
Use cases: Global development analysis, comparison of countries, prediction of economic trends and economic performance analysis.

Healthcare Analytics
Datasets: US Health Data
Use cases: Healthcare analytics, comparison of states, analysis of healthcare spending and outcomes.

Workforce Analytics
Datasets: Stack Overflow Survey Results
Use cases: Workforce analysis, technology trend analysis, comparison of salary and job satisfaction.

Machine Learning/Survival Prediction
Datasets: Titanic: Machine Learning from Disaster
Use cases: Survival prediction, data analysis and visualizations of the Titanic disaster.

Quality Analysis
Datasets: Wine Quality
Use cases: Quality analysis, prediction of wine quality based on its chemical properties, wine preference analysis and recommendations.

Crime Analytics
Datasets: US Crime Rates
Use cases: Crime analysis, comparison of crime rates by city, state and region, and analysis of crime patterns and trends.

Travel Analytics
Datasets: Airbnb Listings
Use cases: Travel analysis, housing demand analysis, rental pricing analysis and popular tourist destination analysis.

Final Thoughts

These datasets and common use cases will help you better understand the role of Power BI in helping organizations make smarter, real-time decisions.

They are also available for anyone to download and use freely.