Zhamak Dehghani published a first post about data mesh in 2019. A core concept of data mesh is decentralized data ownership: place ownership with those who know the data. Data privacy rules that are too strict will quickly turn this decentralized approach into data silos, whereas rules that are too loose will turn it into a data mess. Many companies, not only those pursuing data mesh, struggle with data ownership, and the question keeps getting more complex in the modern data stack.
Data privacy rules that are too strict turn your data mesh into data silos; rules that are too loose turn it into a data mess
Recently, at a conference, I introduced the concept of data mesh as:
“It’s a problem set, it’s a set of principles, it’s a solution”
You cannot expect a central data team to understand your entire business or know all of your operational applications, let alone own data they did not produce. To tackle these problems, among many others, Zhamak Dehghani introduced the concept of data mesh and its DATSIS principles: data products should be discoverable, addressable, trustworthy, self-describing, interoperable and secure.
To fulfill these principles, she argues that you need a self-service data platform, decentralized domain-oriented data ownership and federated governance. This solution set requires data product thinking: data should no longer be considered a byproduct of an application. It is a product in itself, which should be owned by the team producing (or using) the data. Rather than building a central data team that has to understand your entire business and application landscape, you bring the data knowledge to the teams with business expertise.
There are more implications when you bring product thinking to data. In general, when you cannot explain why you built a product, you should not have built it in the first place. The same holds for data: every data product requires a purpose, the why of your product. Look around you: for every object you see, you can tell why you bought it. For some you might even have multiple reasons.
Every data product needs to have a purpose
Such multi-purpose usage will happen with data products as well. The reason a team built a data product might differ from the reason someone else uses it. A first data product on top of data coming from operational applications exists just to describe the truth as is. Its users have different purposes: users of a data product describing customers might want to use customer data for personalization, marketing campaigns, and so on.
Some of these purposes might be legally compliant, while others might not. It can become even more complex: some of your customers might have consented to receive marketing emails, whereas others have not. This means your data products become subset-multi-purpose, if that is even a word. But it shows how complex it is to handle data privacy with care and avoid fines.
The end goal of creating data products in your data mesh is to make a company more data driven, by allowing employees to use high quality data. By placing ownership and transformation responsibility with teams that truly understand both the data and your business, you should end up with high quality data.
Where the ambition of data lakes was to make all data available to everyone, this is no longer the case. How you handle data privacy will shape what a customer thinks of you as a company. However, the opposite behavior of completely restricting data access and usage, which is a possible pitfall of decentralized ownership, does not make your company data driven; it just creates data silos.
Your ambition should be to land in the middle:
I want every employee to have access to all the data he is allowed to use and needs to excel in his job
From a data mesh perspective, this means that every employee should have access to the data products he needs to excel in his job, and that data products contain only the data that enables their purpose. Or at least, when an employee accesses a data product, he should only get the data for the purpose for which he is accessing it.
To obtain access, or first of all to request it, he should know which data products exist. Remember the DATSIS principles introduced by Zhamak? The D stands for Discoverability, which means that employees should be able to discover which data exists. Typically such an overview is available in a data catalog. More and more vendors are entering this market.
Data products are listed with metadata in the data catalog. This metadata includes quality or trustworthiness measures such as the latency of the data, but also the owner and purpose of the product. When an employee is convinced that a data product could serve him, his request for data access should end up with the data owner. Again, the owner knows his data product and its allowed purposes, whereas a central data governance team or service desk cannot be expected to know all the data products and their purposes.
A request for data access should end up with the data owner
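As a minimal sketch of the catalog idea described above, the snippet below models a catalog entry with an owner, allowed purposes and a latency measure, and routes an access request to the owner. The class, field and team names are hypothetical and do not come from any specific data catalog product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """Hypothetical catalog record for one data product."""
    name: str
    owner: str               # who receives access requests
    purposes: frozenset      # purposes the owner allows
    latency_minutes: int     # one example of a trustworthiness measure


# Illustrative catalog content, invented for this sketch.
CATALOG = [
    CatalogEntry("customers", "crm-team", frozenset({"billing", "marketing"}), 15),
    CatalogEntry("orders", "sales-team", frozenset({"billing"}), 5),
]


def discover(catalog, purpose):
    """Return the data products whose owner allows the given purpose."""
    return [entry for entry in catalog if purpose in entry.purposes]


def route_request(catalog, product_name):
    """Return the owner who should handle an access request for a product."""
    for entry in catalog:
        if entry.name == product_name:
            return entry.owner
    raise KeyError(product_name)
```

Here `discover(CATALOG, "marketing")` returns only the `customers` entry, and `route_request(CATALOG, "orders")` sends the request to `sales-team` rather than to a central service desk.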
When the data product owner can easily grant compliant access, you have reached a compliant data driven mesh. You will also have obtained an incredible productivity boost, as your approval and access provisioning process becomes a lot faster. However, just as a central team does not know all data products, a data product owner does not know the entire organization. So how can you speed up the process and ensure that you provide compliant access to data products?
This is where Purpose Based Access Control steps in, which can be considered a special case of Attribute Based Access Control or ABAC. Briefly summarized: you obtain access to a data product when you are allowed to access data for a purpose that is assigned to the data product. This allows you to split the responsibility of access approval: a data product owner decides for which purposes his data product can be used, whereas a line manager approves which purposes his team members may carry out. This means that no one needs to know all data products or the entire organizational structure any longer.
Purpose based access control allows you to split access approval responsibilities: someone owns the purposes of data products, someone else approves the right to carry out a purpose.
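The split of responsibilities can be sketched in a few lines. In this illustrative example, one mapping is maintained by data product owners and the other by line managers; access follows from the intersection of the two. All names and data are hypothetical.

```python
# Maintained by data product owners: which purposes may each product serve?
PRODUCT_PURPOSES = {
    "customers": {"billing", "marketing"},
    "orders": {"billing"},
}

# Maintained by line managers: which purposes may each employee carry out?
USER_PURPOSES = {
    "alice": {"marketing"},
    "bob": {"billing"},
}


def granted_purposes(user, product):
    """Purposes the user holds that the product is allowed to serve."""
    return USER_PURPOSES.get(user, set()) & PRODUCT_PURPOSES.get(product, set())


def can_access(user, product):
    """A user may access a product if at least one purpose matches."""
    return bool(granted_purposes(user, product))
```

With this data, `can_access("alice", "customers")` is true and `can_access("alice", "orders")` is false, without the product owner knowing Alice or her manager knowing the `orders` product.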
Extra complexity arises, of course, with multi-purpose data products. Anyone accessing the data product for a specific purpose should only see the data belonging to that purpose, which might be a subset. This too differs from the data lake approach, where the responsibility for filtering out purpose-specific data lies with the end user, so the filtering is repeated many times within a company.
To resolve this issue, companies sometimes offer subset copies of data (within a filesystem or database). For streaming data this would become too costly, so no such workaround exists there.
Compliant subset copies of data would be too costly for streaming data; no easy workaround exists for streaming data.
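The requirement that someone accessing a multi-purpose data product sees only the data belonging to his purpose can be sketched as a filter applied at read time rather than a stored subset copy. The example below is an assumption-laden illustration: the records, the `allowed_purposes` field and the column mapping are invented, and a generator is used so the same function could in principle wrap an iterator of streaming records.

```python
# Hypothetical customer records; each row lists the purposes it may be used for.
ROWS = [
    {"id": 1, "email": "a@example.com", "address": "Street 1",
     "allowed_purposes": {"marketing", "billing"}},
    {"id": 2, "email": "b@example.com", "address": "Street 2",
     "allowed_purposes": {"billing"}},
]

# Hypothetical column-level policy: which columns each purpose may see.
COLUMNS_BY_PURPOSE = {
    "marketing": ["id", "email"],
    "billing": ["id", "address"],
}


def view_for_purpose(rows, purpose):
    """Lazily yield only the rows and columns permitted for a purpose."""
    columns = COLUMNS_BY_PURPOSE[purpose]
    for row in rows:
        if purpose in row["allowed_purposes"]:      # row-level filter
            yield {col: row[col] for col in columns}  # column-level filter
```

Calling `list(view_for_purpose(ROWS, "marketing"))` yields a single record containing only `id` and `email`, because only the first customer allows marketing use.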
Purpose Based Access Control brings even more benefits to the table. Whereas previously you might have requested access to a specific data product, you will now also get the correct access to all up- and downstream data products that fulfill the same purpose. Even more, when a purpose only allows you to access a subset of a data product (at row level as well as column level) or a pseudonymized version of the data product, the same privacy level will normally be applied to up- and downstream data products.
Data privacy is very important for the image you present to the outside world, and becoming data driven is shifting from a competitive advantage to a requirement for staying on par with your competitors. Having a solution that guarantees your data privacy is therefore crucial. For most companies, neither building a data platform nor ensuring data privacy is a core capability or a competitive advantage. As it is a complex issue, I would always advise looking around on the market, even though you might be able to build it yourself.
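A rough sketch of this propagation, under stated assumptions: the lineage graph, product names and key below are hypothetical, only the downstream direction is walked, and keyed hashing with HMAC-SHA256 is one common way to pseudonymize a value deterministically so that it stays joinable across products. It is not presented as the method any particular tool uses.

```python
import hashlib
import hmac

# Hypothetical lineage: each product lists its direct downstream products.
LINEAGE = {
    "customers": ["segments"],
    "segments": ["campaign_report"],
    "campaign_report": [],
}

# Hypothetical purposes assigned by each product's owner.
PURPOSES = {
    "customers": {"marketing", "billing"},
    "segments": {"marketing"},
    "campaign_report": {"marketing"},
}

SECRET_KEY = b"demo-key"  # assumption: a real key would come from a secret store


def products_for_purpose(start, purpose):
    """Walk downstream from `start`, collecting products that allow `purpose`."""
    seen, stack, result = set(), [start], []
    while stack:
        product = stack.pop()
        if product in seen:
            continue
        seen.add(product)
        if purpose in PURPOSES.get(product, set()):
            result.append(product)
        stack.extend(LINEAGE.get(product, []))
    return result


def pseudonymize(value):
    """Deterministic keyed hash, so the same input maps to the same token."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]
```

Because `pseudonymize` is deterministic for a given key, the same customer receives the same token in `customers`, `segments` and `campaign_report`, which keeps the privacy level consistent along the lineage.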
Becoming data driven is growing from a competitive advantage towards being on par with your competitors
Just like data catalogs, solutions for privacy-compliant or purpose-based access control and purpose-based encryption are popping up, which might relieve you of this complex problem. They let your teams use privacy-compliant data, consistently across tooling and data products, and allow you to focus on what really matters:
Bringing value based on data for you, your company and your customer