Why use Data Mesh?
What is it and how do you implement it?
This article introduces the Data Mesh architecture in data engineering and its similarities with the microservices architecture in software engineering. Last month we published a guide explaining ‘What is data mesh?’; if you want that explanation first, check that post out. In this article, we attempt to answer some of the common questions in a little more depth, starting with basic ones such as the reasons for and the purpose of a data mesh architecture. In addition, there are some pointers on implementing a data mesh.
What’s the purpose of Data Mesh?
Why do I need one more data architecture?
When most companies saw the rising trend of “Data”, executives started racing to make their organisations “Data-Driven”. After some examples of outstanding success from nimble companies like Netflix and Airbnb, big companies wanted to join the race and embraced the trend by investing in new data projects. Being data-driven was the talk of the town at most data meetups and conferences, and it was the short-term goal for most companies, irrespective of their size and complexity.
Sadly, most of these data projects failed during the first few years, and the Return on Investment (ROI) of most data projects diminished. For most of these failures, the “Data Team” took the blame for being a bottleneck and for not being nimble enough to adapt to industry trends and organisational changes.
To solve this scalability problem of the centralised “data team”, there was a need to decentralise both ownership and architecture. Folks at Thoughtworks had faced similar problems in software engineering years earlier and addressed them with a microservices architecture. Zhamak Dehghani from Thoughtworks first proposed the “Data Mesh” architecture, which is heavily inspired by Domain-Driven Design and the microservices architecture from software engineering. Indeed, both the microservices and Data Mesh architectures have roots in Thoughtworks.
The Data Mesh is a decentralised socio-technical paradigm, which decentralises the ownership of data and its transformation (to information) as well as its serving. It turns the spotlight on the producers of the data and gives them the responsibility to handle their data just as they would handle their software.
The Data Mesh solution is NOT strictly technical, but more related to the alignment of People, Processes & Organisation.
Data architecture evolution and limitations
The figures below show the evolution of data architectures from the early 1990s to date, along with the problems each one inherited. For brevity, I have skipped the explanation of each architecture and its drawbacks.
Data Warehouses & Data Marts
Below are a few typical Data Warehouse (DW) architectures with data marts, which are still suitable for many organisations.
A few of the conditions suitable for choosing the Data Warehouse & Marts architecture are listed below:
- Data is structured.
- The use cases which use data are well-defined and do not have scope for expansion or evolution in the future.
- Executives extract value from data using only BI, reporting & data analysis tools.
- The source data systems are small, stable and expected to stay consistent in the future.
- Source DBs are owned by a small team.
- The Data team has complete ownership of the data and is expert in data technologies.
- The organisation does not expect to handle big data in the future.
A Data Lake (DL) is essentially similar to a data warehouse; both rely on a centralised data repository and team. Apart from raw storage, a DL adds support for big data, ingestion of batch and streaming data, data catalogues, security, metadata, processing engines, data access mechanisms, etc.
The main difference compared to the DW architecture is that a DL stores the raw data (without changing its format) instead of the structured and transformed data held in a Data Warehouse.
In a Data Lake architecture, Data Engineers in the central data team perform ETL, versioning and data lineage, using the ingested raw data as a single source. A Data Lake is often described as “schema-on-read”, in contrast to a Data Warehouse, which is “schema-on-write”: in a DL we do not force incoming data to follow any specific schema or format constraints, and instead apply the schema and constraints when reading the data. In a DW, all input data must conform to a schema before it is written to the database. A DW focuses on storing structured, comparatively small data, whereas a DL stores and processes big data from a wide variety of sources (structured, semi-structured, unstructured, batch, streaming).
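To make the schema-on-write versus schema-on-read contrast concrete, here is a minimal Python sketch. The `ORDER_SCHEMA` fields and the `Warehouse` and `Lake` classes are hypothetical names invented for illustration; real warehouses and lakes enforce schemas through their own storage and query engines.

```python
import json

# Hypothetical schema for illustration: field name -> required Python type.
ORDER_SCHEMA = {"order_id": int, "amount": float}


def validate(record, schema):
    """Return the record restricted to the schema's fields, or raise ValueError."""
    if not isinstance(record, dict):
        raise ValueError("record is not an object")
    for name, expected_type in schema.items():
        if not isinstance(record.get(name), expected_type):
            raise ValueError(f"missing or mistyped field: {name}")
    return {name: record[name] for name in schema}


class Warehouse:
    """Schema-on-write: every record is validated before it is stored."""

    def __init__(self, schema):
        self.schema = schema
        self.rows = []

    def write(self, record):
        # Bad input is rejected at ingestion time.
        self.rows.append(validate(record, self.schema))


class Lake:
    """Schema-on-read: raw payloads are stored as-is and interpreted when read."""

    def __init__(self):
        self.raw = []

    def write(self, payload):
        # No checks at ingestion time; the payload is kept in its original form.
        self.raw.append(payload)

    def read(self, schema):
        rows = []
        for payload in self.raw:
            try:
                rows.append(validate(json.loads(payload), schema))
            except ValueError:
                # Payloads that do not fit this reader's schema are skipped.
                continue
        return rows
```

In this sketch the warehouse fails fast on a bad record, while the lake accepts everything and lets each reader decide which schema to apply, so two consumers could read the same raw data with different schemas.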
A few of the conditions for choosing DL architecture are as follows:
- Organisations have to worry about big data with “the three v’s” (volume, variety & velocity).
- Data can be structured, semi-structured, or unstructured, and comes in the form of streams and batch sources.
- Future use cases are unpredictable.
- Data sources are owned by small teams or departments and come in different formats, e.g. the sales team (CSV), the finance department (spreadsheets), HR (DB tables), etc.
- Data consumption patterns are not fully established. It can be a dashboard for executives, tables for Data scientists, or some JSON dump for analytics software.
A Data Lakehouse is a hybrid architecture that combines the benefits of both a Data Warehouse and a Data Lake. The Lakehouse architecture is designed to provide the features below:
- ACID (Atomicity, Consistency, Isolation & Durability) transaction support.
- Schema enforcement & governance.
- BI support on top of the data source.
- Decoupled storage and compute.
- End-to-end streaming.
- Support for both structured & unstructured data.
- Support for multiple workloads, e.g. ML, SQL, Analytics, BI & Data science.
A few considerations for choosing a Lakehouse architecture are listed below:
- Most of the data use cases depend on structured data.
- Hard to foresee future data use cases.
- Value from data is obtained from BI, Reporting, and data analysis.
- The organisation has big data.
- The data is both unstructured and structured.
- There are many small data sources.
- Data consumers & consumption patterns are not explicitly defined.
Is Lakehouse an enabler of Data mesh?
A lakehouse architecture can potentially be an enabler of data mesh. A data lakehouse combines the features of a data lake (a central repository for storing large amounts of raw data in its original format) with those of a data warehouse (a structured repository for storing and querying data for business intelligence and analytics purposes).
A lakehouse architecture can provide the necessary infrastructure for implementing a data mesh architecture, by providing a central repository for storing and organizing raw data and enabling flexible data processing and analysis.
However, it’s important to note that a lakehouse architecture is just one component of a data mesh architecture, and there are other considerations and challenges involved in implementing a data mesh, such as establishing data ownership and governance practices, building data literacy and self-service capabilities within teams, and ensuring interoperability between different data domains.
Do I need to have Lakehouse as the base architecture for Data Mesh?
Lakehouse is a data management concept that is often discussed in the context of data mesh, but it is NOT necessarily required for a data mesh implementation. A data mesh typically involves creating a decentralized data architecture and governance model that allows for more flexible and scalable data management across an organization. This can be accomplished using a variety of tools and technologies, including but not limited to Lakehouse. It ultimately depends on the specific requirements and goals of the organization.
What are the current data problems from existing architectures?
Below are some of the technical & business problems that led to a new architecture, the “Data Mesh”.
- Centralised & monolithic architecture
- Problem example: waiting for the data team/department to implement new features.
- Problem: technical backlog for the data team is growing beyond its capacity.
- Problem: the data team is stressed with bugs & issues.
- Slow to adapt to technological changes.
- The data team is a bottleneck for innovation, experimentation and changes.
- Frequent changes to ETL/ELT.
- Existing ETL needs regular/constant maintenance or supervision efforts.
- Needs to update ETL / data pipeline changes for each feature.
- The bottleneck for all new innovations.
- Silos: Data team is filled with silos based on technology.
- Data providers & Data users are siloed
- No accountability for end results.
- Lack of domain knowledge, data team lacks domain expertise.
- Responsibility for data is under data teams.
- End-to-end deployments and capturing results/feedback take more time.
- More friction between the data team and other teams.
- The ROI (Return on Investment) on data teams has reduced.
- The current Data team is not scalable.
These problems are similar to the software engineering/development problems that were resolved by using domain-driven design & microservices.
What is Data Mesh?
Data Mesh is a data management and governance model that aims to create a self-serve, decentralized data ecosystem within an organization. In a data mesh architecture, data is treated as a product, and teams are empowered to take ownership of the data they produce and consume.
- It’s an architectural design pattern for data architecture, similar to software architecture patterns in software engineering.
- Data mesh is a decentralised socio-technical approach to share, access, and manage analytical data in complex and large-scale environments—within or across organisations.
- Data mesh is a new approach to sourcing, managing, and accessing data for analytics use cases at scale.
- It also tackles the problem of diverse data sources. It’s more about addressing organisation scalability than technical scalability.
What is NOT Data Mesh?
- It’s not a vendor product. There is no single data mesh product.
- Not a data lake or data lakehouse. These are complementary and can be part of a data mesh implementation.
- Not a data catalogue/graph. Data mesh needs physical implementation.
- Data mesh is not a one-time consulting project or investment; it is a journey.
- Not a self-service analytics product, but data mesh can have self-service analytics.
- It’s not a silver bullet that solves all centralised & monolithic data architectural problems.
It does, however, address some of the most pressing and so far unaddressed modernization challenges for data-driven business initiatives, such as:
- Most digital transformations are failures.
- Costs of operational data outages are rising.
- Cloud lock-in is real and can become more costly
- Data lakes rarely succeed and are only focused on analytics
- Rise of distributed data is forcing a more effective, efficient, and economic architecture
- Organisational silos worsen data-sharing issues
- Data is the catalyst for the competitive edge, and managing it well is crucial
Data Fabric is often considered an alternative to data mesh because the two are conceptually similar. The data fabric concept is more broadly inclusive of a variety of data integration and data management styles, whereas data mesh is more associated with decentralisation and domain-driven design patterns.
What are the similarities between Data Mesh & Microservices?
- Data mesh is a decentralised strategy that treats data as a product and provides the infrastructure to make it more accessible to business users.
- Outcome-focused data product thinking.
- Having data consumers’ point of view.
- Data domain owners decide KPIs, SLA, etc.
- Alignment with operations & analytics
- The same data sources, tech stack, etc. are available to all teams.
- Data is dynamic, streaming by default & batch by exception, and obtained by self-service pipelines.
- Self-service & governed platform, with support for all event-driven types, complexity, formats, etc.
Planning, Designing & Deploying a Data Mesh
How to design a Data Mesh?
Data mesh at a very high level looks as shown in the figure below.
Data mesh is still in its infancy and can be implemented on any cloud infrastructure. It can be a hybrid deployment or a mix of cloud providers, e.g. Domain X can be in AWS, Domain Y in GCP, Domain Z on an on-prem HA server, etc.
Since data mesh is a socio-technical architecture, architects should first identify all the stakeholders and analyse their interest in, power over, and impact on adopting data mesh.
We can do the stakeholder analysis using the Power-Interest matrix found in Agile, PMP & other methodologies.
Using this matrix we can identify the team for “Federated Data Governance”, which is responsible for deriving policies & metadata. The matrix is also used to identify domain & data ownership.
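As a rough illustration of the stakeholder analysis, the sketch below places stakeholders into the four conventional Power-Interest quadrants. The 1 to 10 scoring scale, the threshold of 5, and the idea of shortlisting the "manage closely" quadrant as governance candidates are assumptions made for this example, not a prescribed method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stakeholder:
    name: str
    power: int     # assumed score from 1 (low) to 10 (high)
    interest: int  # assumed score from 1 (low) to 10 (high)


def quadrant(stakeholder, threshold=5):
    """Map a stakeholder to one of the four Power-Interest quadrants."""
    high_power = stakeholder.power > threshold
    high_interest = stakeholder.interest > threshold
    if high_power and high_interest:
        return "manage closely"
    if high_power:
        return "keep satisfied"
    if high_interest:
        return "keep informed"
    return "monitor"


def governance_candidates(stakeholders, threshold=5):
    """Shortlist high-power, high-interest stakeholders (an illustrative rule)."""
    return [s.name for s in stakeholders if quadrant(s, threshold) == "manage closely"]


# Hypothetical stakeholders and scores.
people = [
    Stakeholder("Chief Data Officer", power=9, interest=9),
    Stakeholder("Finance Director", power=8, interest=3),
    Stakeholder("Sales Analyst", power=2, interest=8),
    Stakeholder("Facilities Lead", power=1, interest=1),
]
for person in people:
    print(f"{person.name}: {quadrant(person)}")
```

The scores themselves come from workshops and interviews; the code only makes the resulting classification repeatable.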
What are the principles to be considered while designing Data Mesh?
Domain ownership is the first step after identifying needs & getting the blessing of all stakeholders. This principle is based on “Domain-Driven Design”.
The principle’s primary aim is to decentralise the responsibility for data and shift it into the business domain. To understand the business domain we can use many tools, such as Event Storming, domain storytelling, data use cases, mind maps, or rich pictures explaining the domain and its decomposition.
To implement the Data Mesh effectively, we need to design the “Data products” and the organisational structure (ownership) at the same time and in sync.
Here are a few tips & tricks for designing effective data products:
- Push data ownership upstream.
- Define multiple connected models using data products.
- Avoid insisting on a single source of truth; consider the most relevant domain data instead.
- Data pipelines are internal implementations of domains; don’t expose them.
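The tip about keeping pipelines internal can be sketched in a few lines of Python. `OrdersDomain`, its event fields, and the `daily_revenue` output are all hypothetical; the point is only that consumers depend on the published output, not on how the domain produces it.

```python
class OrdersDomain:
    """Hypothetical domain that owns its pipeline and publishes one data product."""

    def __init__(self, raw_events):
        # Raw operational events stay inside the domain.
        self._raw_events = raw_events

    def _pipeline(self):
        # Internal transformation: the domain can rewrite this freely
        # as long as the published output keeps its shape.
        totals = {}
        for event in self._raw_events:
            if event.get("status") != "completed":
                continue
            totals[event["day"]] = totals.get(event["day"], 0.0) + event["amount"]
        return totals

    def daily_revenue(self):
        """Public output: the only thing consumers are meant to depend on."""
        return [
            {"day": day, "revenue": total}
            for day, total in sorted(self._pipeline().items())
        ]


events = [
    {"day": "2024-01-02", "amount": 5.0, "status": "completed"},
    {"day": "2024-01-01", "amount": 2.0, "status": "completed"},
    {"day": "2024-01-01", "amount": 3.0, "status": "completed"},
    {"day": "2024-01-01", "amount": 100.0, "status": "cancelled"},
]
print(OrdersDomain(events).daily_revenue())
```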
Data as Product
The second logical step toward data mesh, after identifying domains, is to derive a “Data Product”.
A Data Product is an autonomous, read-optimized, standardized data unit containing at least one dataset (Domain Dataset), created for satisfying user needs. The data products will act as nodes in the mesh, similar to nodes in the graphs/network diagrams with their ingress & egress ports.
Previously we used to treat data as a “by-product” of a software product and define the “data” in terms of its schema & its entity relationships with other data. Because of this, other important data attributes like schema validation, domain description, documentation, and quality metrics became the sole responsibility of the centralized data team.
To overcome this problem, Data Mesh has the concept of a Data Product, which treats data as part of a software product and gives it the same level of attention as the software itself.
In simple terms, a data product is defined as a dataset with the mandatory characteristics below:
- Business value.
- Has interfaces/interactions with other data products.
- It can be used in different contexts.
Like the Business Model Canvas, there is a Data Product Canvas tool to structure our analysis for data product design.
One of the core aspects of a data product is the design of its interfaces. The figure below shows the generalized external interfaces of a data product.
Communication interfaces are the input/output ports, which define the formats/protocols used to read and write data, e.g. files, APIs, streams, and databases.
Information interfaces are the group of interfaces used to collect additional information about a data product, e.g. metadata, lineage, metrics, etc.
Configuration interfaces carry the platform-specific settings of the data mesh ecosystem, which are used for communicating with other data products in the mesh, e.g. security options, policies, data sharing protocols, etc.
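One way to picture these three interface groups is as fields on a small descriptor object. The sketch below is an assumption-laden illustration: the `Port` and `DataProduct` classes, their field names, and the example values are invented here and do not come from any data mesh platform.

```python
from dataclasses import dataclass, field


@dataclass
class Port:
    name: str
    kind: str       # e.g. "file", "api", "stream" or "database"
    direction: str  # "input" or "output"


@dataclass
class DataProduct:
    name: str
    domain: str
    ports: list = field(default_factory=list)          # communication interfaces
    information: dict = field(default_factory=dict)    # metadata, lineage, metrics
    configuration: dict = field(default_factory=dict)  # security, policies, sharing

    def inputs(self):
        return [p.name for p in self.ports if p.direction == "input"]

    def outputs(self):
        return [p.name for p in self.ports if p.direction == "output"]


# Hypothetical data product owned by a sales domain.
orders = DataProduct(
    name="orders_daily",
    domain="sales",
    ports=[
        Port("orders_events", kind="stream", direction="input"),
        Port("orders_daily_api", kind="api", direction="output"),
    ],
    information={"owner": "sales-data-team", "lineage": ["orders_events"]},
    configuration={"access": "internal", "sharing_protocol": "rest"},
)
print(orders.inputs(), orders.outputs())
```

Keeping the three groups separate in the descriptor makes it easier for a platform to read configuration and information without touching the data itself.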
Below is the typical internal view of a data product.
Federated Computational Governance
Federated governance is one of the core principles of data mesh, which differentiates it from other architectures. It consists of both centralized & decentralized decision bodies. To define it in simple terms, it’s a data governance model, where enterprise-level decisions are made by the central governing body and decentralized units are autonomous decision-makers within their domains for all other affairs. In addition, most of the repeated operations are translated to algorithms & automated.
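The split between central and domain-level decisions can be sketched as a simple merge rule. In this hypothetical example, `GLOBAL_POLICIES` stands for decisions made by the central governing body, and a domain may add its own rules but may not redefine a global one; the policy names and the strict "no override" rule are assumptions for illustration.

```python
# Hypothetical enterprise-level decisions owned by the central governing body.
GLOBAL_POLICIES = {"encryption_at_rest": True, "max_retention_days": 365}


def effective_policies(global_policies, domain_policies):
    """Merge policies: domains add local rules but cannot redefine global ones."""
    conflicts = sorted(set(global_policies) & set(domain_policies))
    if conflicts:
        raise ValueError(f"domain cannot override global policies: {conflicts}")
    return {**global_policies, **domain_policies}


# A domain decides its own refresh cadence autonomously.
sales_policies = effective_policies(GLOBAL_POLICIES, {"refresh": "hourly"})
print(sales_policies)
```

Because the rule is code, the same check can run automatically every time a domain changes its policies.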
Some of the significant objectives of federated governance are listed below:
- Increase the business value of data, or prevent losses from it, by enforcing transparency.
- Maximize the usability of data assets by developing clearly defined roles, responsibilities & accountability for data, and by providing proper access & policies to share data across domains.
- Ensure data assets adhere to internal policies & the external compliance requirements of regulators.
Having clear data policies is critical for effective data governance. Below are some of the aspects which need attention while designing the policies.
- Data collection: how to ethically collect data.
- Data inventory: specify the timelines and lifecycle of data, its security & safety technologies & rules for destroying data.
- Data Access: access policies, ACL, etc.
- Data Quality: ensure data has correct quality metrics.
- Data usage: specify data usage limitations as per law, ethics & standards.
- Data integrity: to make data trustworthy.
- Security: to prevent data loss/theft or misuse.
- Employee requirements policy: how employees deal with sensitive data.
- Data leakage prevention policy
- Encryption policy: to have encryption/decryption as per the latest industry standards.
- Adherence to Documentation standards for each domain, data product & policy.
- Metadata management
- Master data management.
Everyone working with data, from the C-level to an intern developer, needs to adhere to these aspects. Data governance aims to clarify processes, roles & responsibilities, policies, standards, and metrics to effectively derive business value from the data.
How to set up a data governance team?
The first step to achieving this principle is to set up a strong data governance team. Below are some of the steps to begin with.
- Define goals & their benefits.
- Analyse the current status quo & compare it to the target state which the organization wants to achieve.
- Derive a roadmap to achieve this goal.
- Get approval from all the stakeholders.
- Develop and plan a data governance program.
- Implement the program.
- Monitor & control the program.
To make sure the governance team is not caught up in bureaucracy and does not become a bottleneck as the mesh scales, we have to plan and design for all “policies” to be automated in the CI/CD pipeline. This turns policies into computations, which are easy to automate.
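A minimal sketch of what "policies as computations" could look like is shown below. The policy names, the descriptor fields (`owner`, `docs`, `encrypted`, `completeness`) and the 0.95 quality threshold are all hypothetical; a real setup would load descriptors from the repository and call the gate from a CI/CD step.

```python
# Each policy is a named check over a data product descriptor (a plain dict).
POLICIES = {
    "has_owner": lambda d: bool(d.get("owner")),
    "has_documentation": lambda d: bool(d.get("docs")),
    "is_encrypted": lambda d: d.get("encrypted") is True,
    "meets_quality": lambda d: d.get("completeness", 0.0) >= 0.95,  # assumed threshold
}


def violations(descriptor):
    """Return the names of all policies the descriptor fails."""
    return [name for name, check in POLICIES.items() if not check(descriptor)]


def ci_gate(descriptors):
    """Return an exit code for a CI step: 0 if everything passes, 1 otherwise."""
    failed = False
    for descriptor in descriptors:
        broken = violations(descriptor)
        if broken:
            failed = True
            name = descriptor.get("name", "<unnamed>")
            print(f"{name}: violates {', '.join(broken)}")
    return 1 if failed else 0


# Hypothetical descriptors for two data products.
good = {"name": "orders_daily", "owner": "sales", "docs": "README",
        "encrypted": True, "completeness": 0.99}
bad = {"name": "tmp_export", "encrypted": False, "completeness": 0.5}
print(ci_gate([good, bad]))
```

With a gate like this, the governance team maintains the checks rather than reviewing every change by hand, which is what keeps it from becoming a bottleneck.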
Data Mesh using GCP
Data Mesh using AWS
How to Implement & deploy Data Mesh
Based on current industry deployment reviews, the table below summarises a few suggestions for choosing cloud providers for Data Mesh deployments.