In a recent episode of the Data Mesh podcast about data products, Scott Hirleman, the data mesh community host, interviews Zhamak Dehghani, founder of the data mesh concept, about data products. If you have never heard of the podcast, do take a look, and I advise you to start with this episode, as it focuses on one of the core concepts of data mesh: the data product.
Data products are the quanta of a data mesh: the smallest particles of data people interact with. By definition, this means that every piece of data in your analytical landscape should be part of a data product. Or, to quote the episode, “When data is being copied, it should be owned by a data product”.
A key outcome of data mesh is increased trust in data, which depends on the successful implementation of several best practices. One of them is federating the responsibility for creating and maintaining data products to data workers who truly understand the business domain where the data resides. As such, data products are the best representation of how business domain experts see their respective domains, much better than a central data or data governance team could ever produce. As a consumer of a data product, you are guaranteed to interact with data that reflects the truth of the business domain, untouched by anyone outside it.
Previously, users could stumble upon data and make faulty assumptions about its origin. Data lineage tools were introduced in an attempt to solve this problem, but they only partly succeeded. They did provide the source of the data, but missed important context about the transformations and the people involved in constructing it. This means you had no idea why a particular dataset had been created in the first place, nor which transformations had been applied. You couldn’t even pick up the phone and call somebody to get that information. Scott described it best: “It’s like a Google Sheet without revision history”. You interact with a copy of data, but you don’t know when and how it has been altered, nor by whom.
The concept of a data product is an attempt to fill this gap. A data product consists of data assets accompanied by the relevant metadata, such as the freshness and the purpose of the data. This guarantees that a data product is what a consumer expects it to be, which adds a lot of resilience to the system. It not only increases the consumer’s trust; the producer, too, can rest assured that their data is used as intended, at least as long as access is managed properly.
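To make this concrete, here is a minimal sketch of what a data product descriptor could look like: the data assets plus the metadata a consumer needs to trust them. All field names are illustrative assumptions on my part, not a standard data mesh schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class DataProduct:
    """Illustrative sketch: data assets bundled with trust-building metadata."""
    name: str
    owner: str              # the domain team accountable for the data
    purpose: str            # why this data product exists
    assets: list[str]       # e.g. table or file locations
    refreshed_at: datetime  # freshness: when the data was last updated

    def is_fresh(self, max_age_hours: float) -> bool:
        """Let a consumer check freshness against their own requirement."""
        age = datetime.now(timezone.utc) - self.refreshed_at
        return age.total_seconds() / 3600 <= max_age_hours

orders = DataProduct(
    name="orders",
    owner="sales-domain-team",
    purpose="Confirmed customer orders for revenue reporting",
    assets=["warehouse.sales.orders_v2"],
    refreshed_at=datetime.now(timezone.utc),
)
print(orders.is_fresh(max_age_hours=24))  # True: just refreshed
```

The point is that purpose, ownership, and freshness travel with the data, so a consumer never has to guess what they are looking at.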
Something I’m personally advocating for is that data development is not, or should not be, too different from software engineering. Handling data (and I deliberately do not restrict myself to data engineering, as the same holds for data science and related fields) should learn from software engineering best practices.
Application development has evolved over time. People have learned that integrating applications via direct database access is a terrible idea. Instead, we have seen the rise of APIs and BFFs (back-ends for front-ends). The same holds true for your data landscape. Tapping directly into databases creates tightly coupled systems and true spaghetti architectures. It also offers no protection whatsoever: you are opening the gates and even dismissing the gatekeeper.
In data development, even in recent years, we have not yet adopted these software development best practices. The introduction of APIs and BFFs does not yet have a counterpart in the data space. As Zhamak states it: the data itself, its schema, or the way it is stored has acted as the interface. She points out that we still tap into the data itself, rather than interacting with an abstraction layer. And without such an abstraction layer, the data has no way to protect itself.
This gap can now be filled with the concept of a data product. It brings an abstraction layer to the data, which can act as an interface to it. You no longer interact with the data as such; you interact with the data product, which has boundaries you can protect. Even more, this product has an owner, whose responsibility is to protect the data and prevent malicious use.
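A hypothetical sketch of that abstraction layer: consumers call a stable method on the data product instead of querying the underlying storage directly, so the owner remains free to change the storage or schema behind the interface without breaking anyone. The class and data below are invented for illustration.

```python
class OrdersDataProduct:
    """Illustrative data product exposing an interface instead of raw storage."""

    def __init__(self) -> None:
        # Internal storage detail, free to change behind the interface.
        self._rows = [
            {"order_id": 1, "amount_eur": 120.0, "status": "confirmed"},
            {"order_id": 2, "amount_eur": 75.5, "status": "cancelled"},
        ]

    def confirmed_orders(self) -> list[dict]:
        """The public interface: a curated, stable view of the domain."""
        return [r for r in self._rows if r["status"] == "confirmed"]

product = OrdersDataProduct()
print(product.confirmed_orders())  # only the confirmed order is exposed
```

This mirrors what APIs did for application integration: the contract is the method, not the table.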
Another valuable lesson from software development is the introduction of cross-functional teams that deliver end-to-end value. Applying this concept in the data space means adding data-literate people to software engineering teams, business teams, or a combination of both. This is exactly what data mesh advocates.
Previously, due to fragmented functional roles, no one had end-to-end ownership from data to value. This introduced a lot of friction and eroded trust. Data scientists are never happy with the data provided by data engineers, because of poor data quality. Data engineers are not happy with application developers, as the data is never modeled as desired. There is friction at every hand-over of data. End-to-end ownership removes this friction and brings quality and trust by design to your data.
I have mentioned before that within data mesh it becomes the responsibility of the data owner to protect their data and prevent malicious use. Data can’t protect itself, but data products have boundaries an owner can protect. Unfortunately, this is a cumbersome and thankless task. At Raito, we have the ambition to simplify and scale data access management, regardless of where your data lives. In a data mesh setting, this becomes “regardless of where your data product lives”.
This brings much-needed resilience to the system. Previously, people would stumble upon data and quickly make a copy for their own purposes, because they were never certain their access was guaranteed. Even with the best intentions, this creates several significant risks. When a data owner can truly protect the boundaries of a data product, access is no longer do or die: when it is provided in a structural and transparent manner, you have a sufficient guarantee that you will keep it, even across future versions of the data product.
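As an illustration (a toy sketch of my own, not any vendor's API), structural and transparent access could look like this: the owner records grants explicitly, and every read is checked and logged at the product boundary, so consumers know where they stand and copies become unnecessary.

```python
class GuardedDataProduct:
    """Illustrative access control enforced at the data product boundary."""

    def __init__(self, owner: str) -> None:
        self.owner = owner
        self._grants: set[str] = set()   # consumers explicitly granted access
        self.audit_log: list[str] = []   # transparency: every request is logged

    def grant(self, consumer: str) -> None:
        """The owner grants access in a structural, recorded way."""
        self._grants.add(consumer)

    def read(self, consumer: str) -> str:
        self.audit_log.append(f"{consumer} requested access")
        if consumer not in self._grants:
            raise PermissionError(f"{consumer} has no grant from {self.owner}")
        return "data payload"

guarded = GuardedDataProduct(owner="sales-domain-team")
guarded.grant("finance-analyst")
print(guarded.read("finance-analyst"))  # allowed, and recorded in the audit log
```

With the check living at the boundary rather than in each consumer's head, a revoked or missing grant fails loudly instead of silently encouraging rogue copies.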
Let’s end with a final quote from Scott in the episode. With increased trust and guaranteed access to the right data, you can now enter “keep calm and analyze the data” mode.