Data Mesh Is a Dream of Making Data Agile and Valuable, and LOC Is Designed to Help

Bringing Decentralized, Domain-Driven Data Paradigm to Benefit Your Organization

Alan Wang
FST Network

--

Photo by Clarisse Croset on Unsplash

Author’s note: the technological details described in this article are based on an older version of FST Network’s LOC. Since LOC is still actively evolving to meet our customers’ needs and to stay competitive, please refer to our official blog and documentation for the latest updates.

In 2019, Zhamak Dehghani at Thoughtworks published a ground-breaking article, How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh. She introduced an exciting data paradigm trend called data mesh.

A data mesh is a decentralized data architecture that organizes data by a specific business domain — for example, marketing, sales, customer service, and more — providing more ownership to the producers of a given dataset. The producers’ understanding of the domain data positions them to set data governance policies focused on documentation, quality, and access. This, in turn, enables self-service use across an organization.

While this federated approach eliminates many operational bottlenecks associated with centralized, monolithic systems, it doesn’t necessarily mean that you can’t use traditional storage systems, like data lakes or data warehouses. It just means that their use has shifted from a single, centralized data platform to multiple decentralized data repositories.

— IBM: What is a data mesh?

What Is Data Mesh and Why Is It a Trend?

Long story short — since the posts are all pretty long — data mesh aims to solve the long-standing problems of monolithic, centralized data strategies that have plagued companies for many years.

In the past, enterprises had dedicated data teams along with dedicated systems/infrastructure to extract, transform and load (ETL) data from all over the company, and to provide data to whoever needed it. The data would be stored in warehouses (structured data) or lakes (unstructured data).

As the scale of data grows, issues start to surface:

  • The data storage becomes a bottleneck (longer processing time, lower user satisfaction), and it becomes harder to trace data back to its source. Even if you simply dump data into a data lake, it turns into a swamp that traps everything in a muddle.
  • Data pipelines are often highly coupled between source and user; any breaking change can be a disaster for users. The pipelines are not easy to decompose into smaller services either. Unscalable services mean an unscalable market.
  • Data teams (which some companies can’t even afford) are too specialized compared to other teams. They have very little understanding of the domain and hence little incentive to improve data quality.
The mess and bottleneck with traditional data strategies. Source: Data Mesh Principles and Logical Architecture, Zhamak Dehghani (2020)
Photo by Levi Meir Clancy on Unsplash

Data Mesh Comes to the Rescue

My ask before reading on is to momentarily suspend the deep assumptions and biases that the current paradigm of traditional data platform architecture has established; Be open to the possibility of moving beyond the monolithic and centralized data lakes to an intentionally distributed data mesh architecture; Embrace the reality of ever present, ubiquitous and distributed nature of data.

— Zhamak Dehghani

The key concept of data mesh is decentralization — granting ownership and autonomy to the teams that create the data. It’s a mindset shift that lets people make more conscious decisions with their domain knowledge, making data as useful and valuable as it should have been all along.

Source: Data Mesh (Zhamak Dehghani, 2022), O’Reilly Media, Inc.

Dehghani proposed four principles of data mesh:

  • Domain Ownership
  • Data as a Product
  • Self-Serve Data Platform
  • Federated Computational Governance
The four data mesh principles. Source: Data Mesh (Zhamak Dehghani, 2022), O’Reilly Media, Inc.

These are not small topics; in fact, Dehghani wrote a full book about them. But to explain them as simply as possible:

  • Instead of processing and storing data in a central way, each domain team now owns its data and is responsible for the output pipelines.
  • Data are no longer by-products of processes but are treated as data products.
  • Data are actively pushed to the user on a cross-domain serving platform instead of waiting for central processing.
  • Since data may be used across teams or even above teams, it needs to be governed by some or all domain teams depending on the circumstances, much like U.S. state governments and the federal government.
Data product in data mesh. Source: Data Mesh (Zhamak Dehghani, 2022), O’Reilly Media, Inc.

This distributed design makes data feasible, usable and valuable, thus bringing more profit to the organization. The data becomes agile and scalable, and provides business insights faster than ever.

Kubernetes: the Answer to Building a Distributed Data-Serving Platform

Dehghani did not discuss implementation in much depth, since she referred to data mesh as a logical architecture. However, an answer for data mesh already exists — Kubernetes.

Kubernetes is a container orchestration system that can be deployed on the cloud — or more specifically, on multiple physical or virtual servers that may or may not be in the same geographical location. It can use namespaces to separate teams within the same environment. It can be managed by a small team. And a Kubernetes cluster is a distributed system in its own right and is extremely scalable.
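To illustrate the namespace idea, here is a minimal sketch (the team names and labels are hypothetical examples, not tied to LOC) that generates one Kubernetes Namespace manifest per domain team:

```python
# Sketch: one Kubernetes Namespace per domain team, generated programmatically.
# Team names and labels here are hypothetical examples, not LOC-specific.
import json

DOMAIN_TEAMS = ["marketing", "sales", "customer-service"]

def namespace_manifest(team: str) -> dict:
    """Build a minimal Kubernetes Namespace object for one domain team."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": f"domain-{team}",
            "labels": {"data-mesh/domain": team},
        },
    }

manifests = [namespace_manifest(t) for t in DOMAIN_TEAMS]
print(json.dumps(manifests[0], indent=2))
```

Applying such manifests gives each domain team its own isolated workspace inside a single shared cluster.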

To build a self-serve data platform, we can simply deploy the related resources into a Kubernetes cluster — data pipelines, databases, observability tools, etc. This is actually a lot easier than trying to maintain dozens of different systems together.

The self-serve data platform can be easily built on Kubernetes. Source: Data Mesh (Zhamak Dehghani, 2022), O’Reilly Media, Inc.

So Domain Ownership and Self-Serve Data Platform are already satisfied. But how can data products be pushed to the platform? Well, push is a metaphor: it does not necessarily mean uploading the data somewhere, but rather that the data can be discovered and accessed at any time.

The solution? Event-driven architecture (EDA) and microservices. And this has already been done on Kubernetes too. It’s called FaaS (Function-as-a-Service). The link below has an introduction to this cloud pattern.
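As a minimal sketch of the event-driven pattern (the event and function names here are made up for illustration), functions subscribe to event names and get invoked whenever a matching event is published:

```python
# Sketch of event-driven FaaS: functions register for event names,
# and emitting an event invokes every registered handler.
from collections import defaultdict

handlers = defaultdict(list)   # event name -> list of functions
results = []                   # collected outputs, standing in for a data store

def on(event_name):
    """Decorator: subscribe a function to an event name."""
    def register(fn):
        handlers[event_name].append(fn)
        return fn
    return register

def emit(event_name, payload):
    """Publish an event, triggering all subscribed functions."""
    for fn in handlers[event_name]:
        fn(payload)

@on("order.created")
def build_sales_record(payload):
    results.append({"product": "sales-record", "order": payload["id"]})

emit("order.created", {"id": 42})
# results now holds [{"product": "sales-record", "order": 42}]
```

The producer never calls the consumer directly — both sides only agree on the event name, which keeps them loosely coupled.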

LOC’s Data Product Design — Data Processes and Data Events

As I’ve discussed in other articles, FST Network’s Logic Operating Centre (LOC) is a FaaS platform. Users can deploy data pipelines (data processes, very similar to microservices) on the cloud, which can be triggered in several ways and can emit data events themselves. The data processes can also be grouped by units and scenarios, along with user access controls.

LOC defined data products in a two-fold way:

  • A data process, along with its trigger(s), output and data policy (for example, access control), is referred to as a data product. The output is either returned to the user immediately, or stored in LOC if the data process is invoked asynchronously.
This illustration is actually an overview of a LOC data product too

In Dehghani’s book, she did mention the idea of using “internal pipelines” in a data product to reduce coordination, which can be implemented with logic functions in LOC data processes. As we’ve discussed in other articles, the multiple-logic design allows you to model more complex business processes than regular FaaS functions.

The user gets the result from the aggregator function without having to go through the complex steps.
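A rough sketch of this multiple-logic idea, with made-up function names (this is not LOC’s actual API): each logic function passes a shared context to the next, and only the aggregator’s result reaches the caller:

```python
# Sketch: a chain of logic functions ending in an aggregator,
# mimicking the multiple-logic design described above. Names are illustrative.
def fetch_logic(ctx):
    ctx["raw"] = [3, 1, 2]             # e.g. pull rows from a source system
    return ctx

def transform_logic(ctx):
    ctx["clean"] = sorted(ctx["raw"])  # e.g. normalize/clean the data
    return ctx

def aggregator_logic(ctx):
    # Only the aggregator's output is returned to the caller.
    return {"count": len(ctx["clean"]), "data": ctx["clean"]}

def run_data_process(logics, aggregator):
    """Run each logic in order, then hand the context to the aggregator."""
    ctx = {}
    for logic in logics:
        ctx = logic(ctx)
    return aggregator(ctx)

result = run_data_process([fetch_logic, transform_logic], aggregator_logic)
# result == {"count": 3, "data": [1, 2, 3]}
```

The intermediate steps stay internal to the pipeline, which is exactly what hides the complexity from the consumer.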

  • An event (not to be confused with event-driven architecture events; these are special, user-defined data events) is itself another type of data product, since it can carry messages — source, target, event name and additional payloads. Events are stored in a separate database in LOC and can be queried in several ways, including via a panel called “Data Discovery” in LOC Studio.
The data lineage (event graph) emitted from LOC data processes, shown in Data Discovery
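A hedged sketch of what such an event record might look like (the field and function names are assumptions for illustration), along with a query similar in spirit to what a discovery panel would run:

```python
# Sketch: a data event carrying source, target, event name and payload,
# plus a simple query over a stored list of events. Names are illustrative.
from dataclasses import dataclass, field

@dataclass
class DataEvent:
    source: str
    target: str
    name: str
    payload: dict = field(default_factory=dict)

event_store = [
    DataEvent("order-service", "billing", "invoice.requested", {"order": 7}),
    DataEvent("billing", "ledger", "invoice.posted", {"order": 7}),
]

def query_events(store, name):
    """Return every stored event matching an event name."""
    return [e for e in store if e.name == name]
```

Because each event names both its source and target, a collection of them doubles as a record of who handed data to whom.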

So a data pipeline is part of a data product, which is managed by the domain team and is discoverable across domains. A data process can also generate events as data products. Once a data process is deployed, it is ready to be consumed.

Data lineage, or the relationship between a series of events, allows us to trace how data is passed between several users. This serves the purpose of checking whether data is used according to company policy, and pinpointing where exactly things went wrong.

Here’s a bonus: any developer operation on the data processes — deploying, updating, deleting — leaves audit trails too, in the same form of events. Nothing will escape your watchful eyes.

Audit logs in LOC Studio

LOC Does Not Create Data Mesh — But It Can Help Make the Paradigm Shift

As we’ve said before, data mesh is a high-level concept rather than a specific technical architecture. An organization has to be willing to embrace the new idea to make it through the transition. LOC itself does not magically make the pain go away or enforce brave new data policies.

On the other hand, the FaaS nature of LOC makes it very friendly for quickly creating data pipelines. Its non-intrusive nature means you can still build pipelines on top of existing heterogeneous systems. The events are designed for tracking data usage and supporting data governance. Also, the fact that LOC deploys on Kubernetes lowers the difficulty of managing company-level data infrastructure.

FST works with 3rd party cloud providers so you don’t have to worry about maintaining servers yourself. And LOC will grant you more control than other FaaS platforms.

You can swap data products in this illustration to LOC data processes — and it would become a working reality, connecting your data producers and consumers. Source: Data Mesh (Zhamak Dehghani, 2022), O’Reilly Media, Inc.

From the beginning, LOC has been designed with data mesh compatibility in mind. We can now see that it realizes almost all of the four data mesh principles:

  • Domain Ownership: teams can design and manage their own data processes, grouping by units, scenarios and user accounts.
  • Data as a Product: data pipelines are modeled and executed as data processes. They can also generate events as another form of data products.
  • Self-Serve Data Platform: teams can quickly build and deploy data processes in the FaaS way. Kubernetes takes care most of the operation aspects for you.
  • Federated Computational Governance: satisfied to some extent with data policies and user access control. Exactly how these are applied has to be overseen and enforced by the organization.
Source: Data Mesh (Zhamak Dehghani, 2022), O’Reilly Media, Inc.

In the article introducing LOC, I mentioned the following:

The ultimate vision of LOC is, of course, to become our client’s virtual assistive data officer (VADO) — a 24/7 data engine that eliminates the need for hiring human data officers and provides a unified, virtualized layer for accessing data.

The virtual data officer is not a central role either; every team gets to have their own virtual data officer, working within their domains.

And the idea of VADO is actually pretty close to other popular terms: data virtualization and data fabric. The ultimate purpose of LOC is, of course, to provide data to users via a unified intelligence layer, saving the precious time and effort of finding it.

Data mesh, overall, is an idea bigger than any specific information system, as Dehghani described it:

I have decided to classify data mesh as a sociotechnical paradigm: an approach that recognizes the interactions between people and the technical architecture and solutions in complex organizations. This is an approach to data management that not only optimizes for the technical excellence of analytical data sharing solutions but also improves the experience of all people involved: data providers, users, and owners.

Data mesh can be utilized as an element of an enterprise data strategy, articulating the target state of both the enterprise architecture and an organizational operating model with an iterative execution model.

And there is an old Chinese saying: one must have good tools in order to do a good job. LOC is perhaps not just a good tool but the right tool you are looking for to make data mesh a working reality.

Did I mention that we offer business consulting as well?

Photo by Charles Forerunner on Unsplash

For more info about Logic Operating Centre (LOC) or demonstration request, please contact support@fstk.io.

--


Alan Wang
FST Network

Technical writer, former translator and IT editor.