SSI's Problem with Data Ownership

Facebook, Google, Amazon. Everybody seems to be out to get your data. No wonder: monetising data is a massive business that has propelled centralised data controllers into the market cap stratosphere. But what do you get for giving away your data? A “free” service, if you are lucky, and a breakdown of democracy if you are fresh out of luck. Consequently, a number of people are pushing the idea that users should control their data using self-sovereign identity (SSI) management systems (like Sovrin) and interoperable data pods (such as Inrupt). As it turns out, though, this approach has a number of fundamental issues as well.

In this article, I’ll explain some of those issues through the lens of an economist, based on the experience we gained at Registree, a company we started to help universities monetise sensitive student data while giving students a notion of ownership of their data.

So, who owns your data?

The Facebook / Cambridge Analytica scandal will hopefully be remembered as a turning point in the fight for privacy. Yes, privacy, defined by the political scientist Alan Westin as:

[…] the claim of individuals, groups, or institutions to determine for themselves when, how, and to what extent information about them is communicated to others.

is a human right, but that means little in a world where information itself is (almost) impossible to trace as it is passed along to others. At the core of the quest for privacy in our data-driven world is a quest for control over our data. The current model of data storage has what Jaron Lanier calls siren servers at its centre: centralised data controllers collecting vast amounts of personal and user data. These central controllers, from Facebook, Alphabet, and Amazon in the private sector to the NSA on the government side, reap massive private benefits from controlling access to and use of our data.

But other models of data storage are emerging, where data is stored in decentralized systems without a central point that can control access and use. What seems like a technicality (what kind of technology we use to store our data) has far-reaching implications for the question of who owns the data.

Economists like to define ownership through the concept of residual control rights. The idea, first put forward by Sanford Grossman and economics Nobel laureate Oliver Hart in their 1986 article, is simple: whenever it is costly to list all possible contingencies in a contract, for example in the sale of an asset, it is optimal to let one party purchase the right to determine what should happen in all instances that are not specified in the contract. Control over these residual, i.e. non-contracted (and in many instances non-contractible), rights determines ownership.

Take the case of land. As the owner of a property, you can write a contract with a tenant to use your land for specific purposes (e.g. to build a house, harvest crops, or build a skate park), but the more specific you want to be (e.g. how large the house, which kinds of crop, how many half pipes) in the contract, the longer, more cumbersome, and generally expensive the contract becomes. At some point it simply makes no sense to add yet another clause to the contract and you’ll just say “if this issue arises, give me a call and I decide what happens”. The very fact that you get the right to decide what happens “in all other cases” is what makes you the owner of the property.

The case of data is much more difficult, though, and our traditional notions of ownership reach their limits. A central data controller like Facebook is the owner of your personal data because they store the information about all your likes, your comments, what advertisements you clicked on and when you checked the Facebook page of your ex. And not only do they store the data, they decide how it is used to finance their operations. You know, “Senator, we run ads”.

The obvious alternative is to save data not in central data stores, but in decentralized data pods, such as those of World Wide Web inventor Tim Berners-Lee’s new venture Inrupt. Without a doubt, there is a lot of potential in this approach, but it also makes a few things much harder.

The conundrum: Self-Sovereign Data Storage and Data Monetization

Let’s start with the simple problem: If everybody stores their data in data pods, we replace the thousands of centralised data controllers with billions of individual data pods. But this in itself does not make it easier for the users to create value from their data. Individual data are much less valuable in isolation than when they are combined. Some novel protocols exist, such as secure multiparty computation (MPC), where data analytics can be performed over data pods without loss of privacy. But these methods are limited and usually break down for larger user numbers because of the amount of communication the protocol requires (although progress has been made on this front).
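To make the MPC idea concrete, here is a minimal sketch of one of its simplest building blocks, additive secret sharing, used to compute a sum across pods. It is an illustration under assumptions, not the protocol any particular product uses: the modulus PRIME, the function names and the sample marks are all made up for this example, and a real deployment would also need secure channels and protection against misbehaving parties.

```python
import secrets

# Illustrative field modulus (an assumption for this sketch, not a standard).
PRIME = 2**61 - 1


def share(value, n_parties):
    """Split `value` into n additive shares that sum to `value` mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % PRIME
    return shares + [last]


def secure_sum(private_values):
    """Sum the pods' private values without any pod revealing its own value."""
    n = len(private_values)
    # Each pod i splits its value and hands share j to pod j.
    all_shares = [share(v, n) for v in private_values]
    # Pod j adds up the shares it received; each one looks like random noise.
    partial_sums = [
        sum(all_shares[i][j] for i in range(n)) % PRIME for j in range(n)
    ]
    # Only the partial sums are published; together they reveal just the total.
    return sum(partial_sums) % PRIME


# Hypothetical marks held in four separate student pods.
marks = [72, 65, 88, 54]
total = secure_sum(marks)
print(total)
```

Note that in this sketch each of the n pods sends a share to every other pod, so the number of messages grows with n(n-1). That quadratic growth is one concrete reason why such schemes get harder as the number of users increases.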

And then it’s back to square one: users need to centralise their data to allow for more value-generating analytics to be performed (for example, computing the median of a distribution is still a tough problem for MPC). But data is what economists call non-rivalrous. This means that my consumption of information does not preclude others from consuming the same information. Think about the light from a lighthouse as an example of a non-rivalrous good. One ship using the light of a lighthouse as a guide does not preclude other ships from using the same light as a guide. In fact, the light from a lighthouse is a public good since it is also non-excludable, i.e. it is practically impossible to exclude some ships from the light while providing it to others.

Information generally, and data as a physical store of information specifically, are non-rivalrous, but the right to privacy requires them to be excludable: you should be able to determine for yourself what information about you is communicated to others.

This, precisely, is where the self-sovereign model of data storage gets into trouble. We saw this when we were building Registree, a platform that allows universities to monetise the student data they hold (e.g. transcripts) by selling data access to employers. This would help employers to identify the ideal graduate and help students to find the ideal job. In a country like South Africa, where youth unemployment is at a mind-boggling 60%, the search frictions in the labour market are huge and matching employers and graduates creates massive value for both sides.

So, like Learning Machine and other startups in this space, we considered using a self-sovereign model of data storage where we convince the university to issue credentials into a pod (NB: some self-sovereign data storage applications prefer the term SSI wallet, but I prefer pods because the term is more general) fully controlled by the student. But now imagine what the students would do with their sensitive data. Exactly. They will give it away the first chance they get. This is what’s called the digital privacy paradox: people claim they care about their privacy online until you offer them a shiny trinket of questionable value.

Since data is non-rivalrous, giving away your data is not the same as giving a friend the keys to your car. Once your friend returns the keys, you can check the car to see that it has no scratches or dents, and you can be sure that while you hold the keys nobody else can use your car to drive around and possibly get into an accident. Using your car is both rivalrous and excludable: either you or your friend can drive it, but not both at the same time. And you can revoke your friend’s future access to the car once he returns it with a long scratch on the passenger door.

This ability to revoke the right to use the car is part of what makes you the owner of the car. But once you share data, it is, for all intents and purposes, gone. Yes, many countries, including South Africa, have privacy regulations, but enforcing them is close to impossible precisely because data can be copied freely and without a trace.

For us as a company, the self-sovereign model of data storage was essentially a non-starter because a) we would not be able to give students actual ownership of their data as we set out to do; and b) as it turns out, the transcript data is not actually owned by the students, but by the universities who act as custodians of the data and are not allowed or not willing to simply share it with third parties.

Shouldn’t users be allowed to share their data freely?

The question of revoking data access has caused us many sleepless nights as a company, but eventually we found a compromise that is workable. Instead of issuing transcripts to students’ data pods, we help universities to identify use cases for their data and design processes to implement those without the need for sharing data. We basically follow the open source business model: Instead of trying to collect data ourselves (the equivalent of writing proprietary closed source code), effectively becoming a centralised data controller, we specialise in giving advice to, and building services for, data custodians.

Think about it this way: What self-sovereign data pods create is the ability for data creators to sell their data once, to whatever central data controller sits one step downstream in the data value chain. And no matter how you spin it, the non-rivalrous nature of data will always get you back to exactly this point: For data to be most valuable, it needs to be combined with other data, but once that happens, the central controller of this combined data set becomes the new data owner. In some models of self-sovereign data storage a data creator is paid whenever a third party needs to verify the veracity of the data. But this still only covers a tiny subset of the total value chain and does not prevent third parties from using the data in non-contractible ways once they have verified its veracity.

This is clearly not an optimal way of writing contracts about data usage, and it’s not the only problem with the self-sovereign data storage model.

Let me add another layer of complexity and throw in a third important concept economists care about: externalities. An externality is a cost that affects a third party who did not choose to incur this cost. The generation of data often comes with an externality. For example, when I record a video chat with a friend and then sell it to a company that trains a novel speech-recognition algorithm, I am not only selling my own data (i.e. what I say in the chat), but also what my friend says. By selling the data of our call, I directly create an externality for him. Unfortunately, these externalities are everywhere. In the case of Registree, we figured that if we allow students to sell their verified transcript information to a possible employer, the employer will build up a database of marks and at some point judge applicants based on where in the distribution of marks they are. What happens when there is grade inflation or deflation? Wouldn’t we disadvantage one group of students without their knowledge or consent?
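A small worked example shows how the grade-inflation externality plays out. The numbers below are invented purely for illustration (they are not Registree data), and the percentile_rank helper is a hypothetical stand-in for whatever ranking an employer might actually use.

```python
def percentile_rank(mark, reference_marks):
    """Percentage of reference marks that lie strictly below `mark`."""
    below = sum(1 for m in reference_marks if m < mark)
    return 100 * below / len(reference_marks)


# Hypothetical marks that an earlier cohort sold to the employer.
earlier_cohort = [55, 60, 62, 65, 68, 70, 72, 75]
# A later cohort, graded five points more generously (grade inflation).
inflated_cohort = [m + 5 for m in earlier_cohort]
# The employer pools everything into one database of marks.
employer_database = earlier_cohort + inflated_cohort

# A student from the earlier cohort with a mark of 70:
rank_in_own_cohort = percentile_rank(70, earlier_cohort)
rank_in_database = percentile_rank(70, employer_database)

print(rank_in_own_cohort)  # 62.5
print(rank_in_database)    # 50.0
```

The student’s own mark never changed, yet their standing drops from the 62.5th to the 50th percentile simply because other students shared data that was graded on a different scale.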

Allowing users to share data at will, even the data they generate themselves, is a slippery slope. Economists have started formalising this problem and studying its implications for market prices. MIT’s Daron Acemoglu and his co-authors, for example, write in a recent paper:

We demonstrate that the data externalities depress the price of data because once a user’s information is leaked by others, she has less reason to protect her data and privacy. These depressed prices lead to excessive data sharing.

UPenn’s Annie Liang and her co-author Erik Madsen from NYU study externalities exerted by customers’ decisions to repay a loan (or not) on other customers that “look” similar from a bank’s perspective. The authors ask a pointed question, highly relevant in a world where banks become “behavioural”:

If Alice knows that the bank evaluates her not only based on her own repayment history, but also on the repayment histories of other individuals in her category, how does that change her incentives to exercise financial prudence?

The answer depends on several factors and highlights the subtle consequences of externalities in data generation.

What we can learn from the economics literature (from the classic texts of Grossman and Hart on ownership, via the digital privacy paradox of Athey and Catalini, to the new work on externalities in data generation by Acemoglu et al. and Liang and Madsen) is that the self-sovereign model of data storage has its very own set of problems that we first need to understand before we deploy sensitive data into pods or SSI wallets.

The issues we encountered when trying to use SSI in a setting with sensitive data do not mean there is no room for SSI solutions. At Registree, we use uPort as an SSI wallet to give students some form of control over their data. But we use it specifically to implement access control to students’ personal data rather than as a means of storing data. By delegating access control to an SSI wallet, Registree can guarantee its users that we do not have the ability to override their decisions, so that residual control rights are with the users, not with us. This, we believe, is a first glimpse of a new business model, and SSI used right helps us to get there.