From Data Chaos to Clarity: The Data Marketplace vs. Data Mesh

Written by liorb | Published 2024/03/06

TL;DR: Data Mesh promising, but struggling? Explore the Data Marketplace, a collaborative approach promoting data quality, focus, and cost-effectiveness for informed decisions.

For too long, the siren song of data mesh has lured companies into a false sense of security, promising a utopian future of seamless data management. But the harsh reality is sinking in: data mesh isn't the panacea it's cracked up to be.

As costs soar, trust in data erodes, and decision-making stalls, it's time to shatter the illusion and explore a new path forward.

For many organizations, the allure of data mesh, a decentralized approach to data management, has turned into a frustrating chase. While it promised data democratization and agility, the reality often involves siloed data, soaring costs, and eroded trust in data-driven decision-making.

Surveys paint a concerning picture: A recent survey by Unisphere Research, a division of Information Today, Inc., in partnership with ChaosSearch, found that 68% of IT managers implementing data mesh experienced significant data quality issues.

Another survey conducted by PwC in 2023 revealed that 72% of businesses struggled with increased complexity and lack of collaboration after adopting data mesh.

Experts are also raising concerns: "Data mesh, in theory, sounds great," says Zhamak Dehghani, the creator of the data mesh concept and author of the book 'Data Mesh: Delivering Data-Driven Value at Scale.'

"However, the reality is that many organizations lack the cultural and organizational maturity to successfully implement it, leading to fragmented data landscapes and inconsistent data practices."

The Data Marketplace

Data Marketplace vs. Data Mesh: A Comparative Look

| Feature | Data Mesh | Data Marketplace |
| --- | --- | --- |
| Data Ownership | Decentralized (domain teams) | Decentralized (domain teams own data, but data products are managed by data experts) |
| Data Storage | Distributed across domain-specific data products | Centralized data lake with isolated access zones |
| Data Governance | Loosely coupled, relies on domain teams | Robust and centralized, with standardized processes |
| Data Access | Open access to all data for domain teams | Controlled access based on skills and expertise |
| Data Products | Developed and maintained by domain teams | Created and managed by data experts based on defined needs |
| Focus | Agility and data ownership | Focus, collaboration, and data quality |

Here's how the data marketplace overcomes the challenges of data mesh:

  • Reduced Costs: By focusing on relevant data and eliminating duplication, the data marketplace minimizes storage and processing expenses.

  • Improved Data Quality: Robust governance and centralized data management ensure consistent data standards and quality across the organization.

  • Enhanced Collaboration: The marketplace fosters communication and collaboration between data producers and consumers, leading to better insights and decision-making.

  • Clearer Ownership: While data ownership remains with domain teams, the data marketplace ensures accountability for data products through designated data experts.

Data Marketplace Principles

Here are my 8 principles for using the data marketplace concept that will ensure your organization leverages the right data at the right time to make the right decisions:

Decentralized Data Ownership - In the marketplace, data producers remain the owners of the data they produce, but they do not own the data products built on top of it, which I think is a fair arrangement.

One of my biggest criticisms of data mesh is its decentralized ownership, which now requires a front-end engineer and a product manager to own a data product built on their data. They are expected to design and maintain it, yet they have zero clue why they are doing it or how it is being used, and, to be honest, none of the tools involved is native to their work.

Instead, the marketplace simply lets them set up the events or API calls, gives them the tools to document what they send, ingests the data without them touching anything, and from there lets the experts work with it.

This system makes more sense as they only need to worry that the data is correct and reliable and are not required to fix data products gone wrong.
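To make this concrete, here is a minimal sketch, in Python, of what the producer's side could look like: they describe and emit an event, and everything downstream is someone else's job. The event name, fields, and transport below are hypothetical assumptions, not a prescribed implementation.

```python
# Hypothetical sketch: a producer emits an event with self-describing metadata,
# leaving the downstream data-product work to the data experts.
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


@dataclass
class ProducedEvent:
    """Event envelope the producing team owns: payload plus documentation."""
    name: str
    owner_team: str
    description: str          # what the event means, in the producer's own words
    payload: dict
    produced_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def publish(event: ProducedEvent) -> None:
    # Stand-in for the real transport (Kafka topic, HTTP collector, etc.).
    print(json.dumps(asdict(event), indent=2))


publish(ProducedEvent(
    name="checkout_completed",
    owner_team="payments",
    description="Fired once per successful checkout, after payment capture.",
    payload={"order_id": "o-123", "amount_eur": 49.90},
))
```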


Transaction-based data - Yes, please stop storing everything in the data warehouse! Even if it’s S3 and “cheap,” there is no reason to store and process so much data; the reality is that most of it is just sitting there gathering dust.

By decentralizing the data, we allow the owners to decide what they store.

With the marketplace, I want to take a different approach: the analysts, the middle people between data consumers and producers, can see all the data produced in the company, but they decide what data to “buy” and sign a “contract” with the producer that basically describes what they will use the data for, what their expectations are, and what happens in case of a data incident.

By doing so, we ensure that producers understand why they produce the data and what impact it has on the organization, and we can move unused data either into deep storage or into the abyss by deleting it; on the other side, analysts need to make a value case for why this data is important for them to use.

This creates a win-win situation that also addresses the missing link in the data lake structure between data producers and analysts over who is using the data; these transactions are logged and stored.
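As a rough illustration, here is what such a "purchase" contract and its transaction log could look like; the dataset, teams, and expectations below are invented for the example.

```python
# Hypothetical sketch of a "data purchase" contract between an analyst and a producer.
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DataContract:
    dataset: str
    producer_team: str
    consumer_team: str
    intended_use: str            # why the analyst is "buying" this data
    expectations: List[str]      # freshness, completeness, schema guarantees
    incident_procedure: str      # what happens when the data breaks


CONTRACT_LOG: List[DataContract] = []   # every transaction is logged and stored


def sign_contract(contract: DataContract) -> None:
    CONTRACT_LOG.append(contract)
    print(f"{contract.consumer_team} bought '{contract.dataset}' "
          f"from {contract.producer_team} for: {contract.intended_use}")


sign_contract(DataContract(
    dataset="checkout_events",
    producer_team="payments",
    consumer_team="growth-analytics",
    intended_use="Weekly conversion-funnel dashboard for the CMO.",
    expectations=["delivered daily by 06:00 UTC", "no null order_id values"],
    incident_procedure="Producer notified within 1h; dashboard flagged as stale.",
))
```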


Robust Governance & Standards - Basic rules must be part of the game: what metadata is required; how and where to document the data ingested, produced, and visualized; what monitors should be in place to ensure quality; how data contracts are created; and how data anomalies, incidents, and conflicts are handled. By doing so, we standardize the process.

I see a big need for 2-3 types of data contracts, but every dataset created must go through a schema validation process to ensure quality; in case of an issue, the data is rejected and notifications are sent to its owner so they know that “Houston, we have a problem.”

Governance is important, and so is enforcement of the rules: it keeps the whole thing running, with strong frameworks that address the use cases of the data and serve as a system for future-proofing the data products.
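A minimal sketch of that validation gate, assuming a simple required-fields schema and a print statement in place of a real notification channel:

```python
# Sketch of a schema-validation gate at the single ingestion entry point.
# Records that fail are rejected and the owning team is notified (a print here).
REQUIRED_FIELDS = {"order_id": str, "amount_eur": (int, float), "owner_team": str}


def validate_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for field_name, expected_type in REQUIRED_FIELDS.items():
        if field_name not in record:
            problems.append(f"missing field '{field_name}'")
        elif not isinstance(record[field_name], expected_type):
            problems.append(f"field '{field_name}' has the wrong type")
    return problems


def ingest(record: dict) -> bool:
    problems = validate_record(record)
    if problems:
        owner = record.get("owner_team", "unknown")
        print(f"REJECTED, notifying {owner}: {problems}")   # "Houston, we have a problem"
        return False
    print("Accepted into the raw data zone.")
    return True


ingest({"order_id": "o-123", "amount_eur": 49.90, "owner_team": "payments"})
ingest({"order_id": "o-124", "owner_team": "payments"})     # missing amount -> rejected
```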


Isolated Architecture Environments - Let's talk architecture. To make the data marketplace successful, we create three "isolated" levels of data in the data warehouse system; each level is accessible and usable based on individuals' skills and expertise and is meant to act as a safety net for data consumption:

  1. The Raw Data Zone serves as a secure storage for all data, regardless of quality, to ensure sensitive information protection. Think of it as a vault for safeguarding all types of data. Data then undergoes preparation steps like defining sessions, identifying bots, and ensuring compliance before being transformed into a usable format for analysis. This process can be compared to transforming raw ingredients into a ready-to-use meal for a chef.

  2. The Marketplace Data Zone empowers analysts by providing access to the prepared data through aggregated tables. Analysts can utilize exploration tools to query, analyze, and gain insights from the data. This zone, along with the tools it provides, plays a crucial role in enabling analysts to create valuable decision-making dashboards.

  3. The Decision-making Zone serves as a single source of truth for the entire company. This zone contains rigorously tested, secured, and verified data presented through user-friendly visualizations. Anyone in the company can access and explore this data in a controlled environment, either directly via the provided tools or by extracting it for use in tools like Excel.

    The data here is deemed the "official" version - any conflicting information from other sources should be considered inaccurate.

Data entering the data lake goes through a single access point where it is validated (is this what needs to be here?) and checked for anomalies, and only if it passes inspection does it arrive in the raw data landing zone. From there, small components run over the data to clean out bot traffic, add attribution to sessions, or even enrich missing metadata to ensure the data can be used.

When it reaches the data consumers, they only see data that has been certified, pre-calculated, and built to give the look and feel of a self-exploration environment without the fear of losing control over the numbers, because "self-serve analytics" is a nice dream that will never become reality given the level of data literacy in most organizations.
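Here is a deliberately simplified sketch of the three zones modeled as separate stores with a one-way flow between them; in a real warehouse these would be separate schemas or buckets with their own access controls, and the cleaning steps would be far richer than shown.

```python
# Illustrative sketch of the three isolated zones; names and steps are assumptions.
raw_zone, marketplace_zone, decision_zone = [], [], []


def land_raw(record: dict) -> None:
    """Single entry point: validated, anomaly-checked data lands in the raw zone."""
    raw_zone.append(record)


def prepare_for_marketplace(record: dict) -> None:
    """Clean bots, attribute sessions, enrich metadata, then expose to analysts."""
    if record.get("is_bot"):
        return                       # bot traffic never reaches analysts
    prepared = {**record, "session_attributed": True}
    marketplace_zone.append(prepared)


def certify_for_decisions(record: dict) -> None:
    """Only tested, verified data becomes the company-wide single source of truth."""
    decision_zone.append({**record, "certified": True})


land_raw({"order_id": "o-123", "is_bot": False})
for r in raw_zone:
    prepare_for_marketplace(r)
for r in marketplace_zone:
    certify_for_decisions(r)

print(len(raw_zone), len(marketplace_zone), len(decision_zone))   # 1 1 1
```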


Pull System - Based on the above and the transaction-based approach, why in hell should we have a push system where anyone interested can just push data into the lake or a dashboard?

We need to bring back control over the data and have a purpose and a clear why; otherwise, we’re just wasting money on moving data around. I am sure your cloud provider is happy, and is also convincing you this is cheap if you buy the right savings plans, but in reality you do nothing with the data sitting there collecting dust.

Some industry estimates suggest that around 87% of the data ingested into the data lake has no clear purpose or use case; think about the amount of money you are spending right now to keep it stored.

The pull system ensures that only data with a purpose arrives in the common areas and gets used. Everything else simply becomes history in a short period, saving you a lot of money on wasted, accumulated data.
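A tiny sketch of the pull idea: ingestion only proceeds for datasets that already have a registered purpose, and anything pushed without one is turned away. The dataset names and purposes are made up for illustration.

```python
# Hedged sketch of a pull system: data only moves into the shared zones when a
# consumer has registered a purpose for it; everything else stays out (or ages out).
requested_datasets = {"checkout_events": "growth-analytics funnel dashboard"}


def try_ingest(dataset: str, rows: list) -> bool:
    purpose = requested_datasets.get(dataset)
    if purpose is None:
        print(f"'{dataset}' has no registered purpose; not ingesting.")
        return False
    print(f"Ingesting {len(rows)} rows of '{dataset}' for: {purpose}")
    return True


try_ingest("checkout_events", [{"order_id": "o-123"}])   # pulled: has a purpose
try_ingest("debug_click_trace", [{"x": 1}])              # pushed: rejected
```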


Data Parliaments - The data domain subject matter experts are the ones who can build data products from the raw data and move them into the marketplace zone, or approve a new data product for promotion from the marketplace zone into the decision-making zone.

In data mesh, one of the biggest issues is that anyone can pull data and build a data product with it, which sometimes leads to numbers not matching or, worse, decisions being made on wrong data.

The parliament members can be, for example, responsible only for the marketplace revenue tables; they are the only ones who know how to define the true source of the revenue, how to present it, and also how to handle issues with its quality when things go wrong.

Another parliament can handle the company KPIs table and will know which data in the marketplace is the one true source. Contrary to data mesh, we create a function of data controllers who validate that the marketplace has one true source of data, so each analyst using the marketplace zone knows they are using the right data; these controllers ensure the correct data is carried across the company.

This also answers the problem of governance being overwhelmed with too many topics to manage: governance can focus on the frameworks, and the parliaments on the data products.
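One way to picture the parliament's role is as an approval gate on promotion into the decision-making zone; the domains and team names below are purely illustrative.

```python
# Hypothetical sketch: a "parliament" per domain must approve promoting a data
# product from the marketplace zone into the decision-making zone.
PARLIAMENTS = {
    "revenue": {"finance-data-team"},
    "company_kpis": {"bi-core-team"},
}


def promote_to_decision_zone(product: str, domain: str, approver: str) -> bool:
    members = PARLIAMENTS.get(domain, set())
    if approver not in members:
        print(f"Rejected: {approver} is not in the '{domain}' parliament.")
        return False
    print(f"'{product}' promoted to the decision-making zone by {approver}.")
    return True


promote_to_decision_zone("monthly_revenue", "revenue", "finance-data-team")   # approved
promote_to_decision_zone("monthly_revenue", "revenue", "random-analyst")      # rejected
```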


Data Certification - This is a vital part if we wish to ensure people in the organization consume the right data and make decisions based on the best data available to them.

We need to certify the data; the certification process allows us, on the one hand, to ensure that we have one set of revenue numbers used in the decision-making zone, and on the other, to ensure it has the right process behind it, from data monitoring to error handling and up to how it should be documented.

Certifying the data ensures that everyone knows what the correct number is and trusts that the data has gone through all the required steps for reliability and trustworthiness. One of the things I have encountered most often is leaders' lack of trust in the data they consume to make decisions; this mistrust causes them to doubt the data and ask analysts to re-check it constantly whenever it doesn't fit their expectations, wasting a lot of analyst time on debugging.

But by creating a data certification, we can ensure that, in case of a data incident, the user knows about it before the data is used, and the analysts know where they need to look to solve it.

Basically, it creates better trust between data consumers and analysts, and if a data incident is not fixed, it shows where the problem is and who is blocking it from being resolved, creating full transparency and accountability over the data.
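As a sketch, a certification record could carry the monitors behind a data product and any open incidents, so consumers can check its status before trusting the number; the product and monitor names below are assumptions.

```python
# Illustrative sketch of a certification record attached to a data product, so a
# consumer can see an open incident before relying on the numbers.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Certification:
    product: str
    certified: bool = False
    monitors: List[str] = field(default_factory=list)
    open_incidents: List[str] = field(default_factory=list)

    def safe_to_use(self) -> bool:
        return self.certified and not self.open_incidents


revenue = Certification(
    product="decision_zone.revenue_daily",
    certified=True,
    monitors=["row-count vs. yesterday", "schema check", "null order_id"],
)

print(revenue.safe_to_use())            # True: trust the number
revenue.open_incidents.append("2024-03-05 late delivery from payments")
print(revenue.safe_to_use())            # False: consumers see the incident first
```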


ROI - Return on investment! We spoke earlier about transactions and explained the use case. With that, we need some system that weighs, on one side, the effort invested in creating a data product, the storage, the processing, the analyst time, and the cost of maintaining it, and on the other, the possible profits from having the data.

I agree that not every data source will have a clear ROI, but I will argue it's possible :-). For example, by having the campaign performance data from Facebook, we can optimize the campaigns, invest more money in the ones that drive more profit, and cut the ones that waste money.

By doing so, we save money and increase revenue, and that can go into the calculation. Another example: if finance has the AWS costs available, they can find areas to improve savings, either by buying savings plans or by investigating high costs and possibly reducing them.

Another method to evaluate the revenue of the data is incremental testing: having this data means you can save or generate X amount of extra money; not having it means that is the amount of money lost.

The ROI lets everyone in the data chain know what the value of their data is, and also makes it possible to prioritize a data incident with a clear or estimated monetary value next to it.
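A back-of-the-envelope version of that calculation might look like this; every figure in the example is invented purely to show the shape of the math.

```python
# Back-of-the-envelope ROI sketch; all figures are made up for illustration.
def data_product_roi(storage: float, processing: float, analyst_hours: float,
                     hourly_rate: float, maintenance: float,
                     estimated_value: float) -> float:
    """Return monthly ROI = (value - cost) / cost for a single data product."""
    cost = storage + processing + analyst_hours * hourly_rate + maintenance
    return (estimated_value - cost) / cost


# Example: Facebook campaign-performance data used to cut wasteful campaigns.
roi = data_product_roi(
    storage=120, processing=380, analyst_hours=20, hourly_rate=60,
    maintenance=200, estimated_value=6000,
)
print(f"ROI: {roi:.1f}x")   # value created per unit of cost; also helps rank incidents
```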


More data does not equal better insights; it just means more time lost making decisions, more time wasted on bad data, and, worst of all, loss of focus. Here is what I am trying to correct with my data marketplace concept:

  • Focus - Focus on the data that makes sense for the organization's decision-making.

    Not all events and all data sources are relevant, and most definitely, having more data does not mean you will be able to make better decisions.

    Focusing on a small number of highly relevant KPIs will help the business orient itself around what is working and what is not, ensure a higher focus on data quality, and make sure everyone knows how their work did or did not impact the organization's success.

  • WHY - If you produce data, you need to know WHY you produce it and what the added value of having it is. Without that in place, you are wasting money.

    I don't know how many times in my life as an analyst I was asked to run some super expensive and complicated query to produce some number or share, only to discover, after asking what it was used for or what decisions it drove, that my three days of work was "just good to know" and had no impact whatsoever on the decisions made or actions taken.

    I think a lot about the time analysts waste answering questions that serve no real purpose, and about the time that goes into checking data quality and compatibility, understanding the data, and then trying to run the most efficient query, only for it all to end with no action.

  • Data usability - People should be able to self-serve, and no, when I talk about self-serve, I am not talking about the drag-and-drop we all fell in love with in Google Analytics.

    What I mean is that if you are an analyst or data consumer, you should have the right tools and access to data based on your skills and expertise. I do not want to see data consumers playing with raw data and trying to derive learnings without knowing what the consequences of using the numbers they encounter might be; at the same time, I do want to see them exploring data in an environment that matches their skills.

Even if you create more frameworks around data mesh, the reality is that when you ask the wrong people to do the right things without the expertise, you raise their anxiety by putting them in situations they are not ready for and, worse, you cause them to lose focus on their unit's purpose, wasting more growth opportunities along the way.

Hiring more data engineers and data product managers won’t solve your problem; training unrelated people to own data products will fail.

I call on you to try something new: reduce the noise from your data and increase your focus, and if you want, after you master the data marketplace and the delicate balance between centralized and decentralized, you can move in the direction of data mesh.

But as long as you don't, you risk accumulating more costs in lost revenue and overspending on data. The decision is yours!

Embracing the data marketplace isn't just about efficiency; it's about empowering all stakeholders in the data ecosystem to harness the true potential of data.

By focusing on collaboration, clarity, and purpose, we can unlock the power of data and navigate a path toward sustainable success.

Join the conversation! Share your thoughts on data management and explore whether the data marketplace could be a viable solution for your organization.

