Data Governance on AWS using DataZone

15 minute read Published: 2024-08-09

In this blog, we will provide a brief introduction to data governance and show how to implement it on AWS using DataZone. We will walk through a practical example involving a multi-account setup to manage and share data stored in S3 and Redshift, highlighting key steps and best practices along the way.

What is Data Governance?

Data governance is everything you do to ensure data is secure, private, accurate, available, and usable, which in turn helps organizations accelerate data-driven decisions. Key steps in implementing data governance typically include:

Data mapping and classification.

Data mapping involves documenting data assets and understanding how data flows through an organization’s systems. This process enables the identification of different data sets, which can then be classified based on various factors, such as whether they contain personal information or other sensitive data. These classifications directly influence the application of data governance policies to each data set, ensuring appropriate levels of security and compliance.
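To make the classification step concrete, here is a minimal sketch that labels columns by matching their names against a few patterns. The patterns, the labels, and the column names are all hypothetical; a real policy would come from your governance team and would usually inspect the data itself, not only column names.

```python
import re

# Hypothetical name patterns, for illustration only.
SENSITIVE_PATTERNS = {
    "PII": re.compile(r"(name|email|phone|ssn|address|birth)", re.IGNORECASE),
    "FINANCIAL": re.compile(r"(amount|salary|iban|card)", re.IGNORECASE),
}


def classify_columns(columns):
    """Return {column: label}; columns that match no pattern are 'GENERAL'."""
    result = {}
    for column in columns:
        label = "GENERAL"
        for candidate, pattern in SENSITIVE_PATTERNS.items():
            if pattern.search(column):
                label = candidate
                break  # first matching label wins
        result[column] = label
    return result


labels = classify_columns(["customer_name", "email", "claim_amount", "claim_id"])
print(labels)
```

The resulting labels can then drive which governance policy applies to each data set, for example stricter access approval for anything tagged PII.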

Business glossary.

A business glossary provides standardized definitions for business terms and concepts used within an organization. For instance, it might define what constitutes an active customer. By establishing a common vocabulary for business data, a business glossary supports consistent understanding and interpretation, which is crucial for effective governance across the organization.

Data catalog.

A data catalog is a comprehensive, indexed inventory of an organization’s data assets, created by collecting metadata from various systems. It typically includes details on data lineage, search functionalities, and collaboration tools. Data catalogs often integrate information about data governance policies and offer automated mechanisms for policy enforcement, ensuring that data is used in accordance with established governance standards.

What is AWS DataZone?

DataZone is a data management service that makes it faster and easier for customers to catalog, discover, share, and govern data stored across AWS accounts. With DataZone, administrators and data stewards who oversee an organization’s data assets can manage and govern access to data using fine-grained controls. These controls are designed to ensure access with the right level of privileges and context. DataZone makes it easier for engineers, data scientists, product managers, analysts, and business users to access data throughout an organization so that they can discover, use, and collaborate to derive data-driven insights.

Key Concepts and Capabilities

Data Portal

This is a web application where different users can go to catalog, discover, govern, share, and analyze data in a self-service fashion.

Business Data Catalog

In your catalog, you can define the taxonomy or the business glossary. You can use this component to catalog data across your organization with business context and thus enable everyone in your organization to find and understand data quickly.
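The glossary can also be created programmatically. The sketch below assumes a boto3 DataZone client (`boto3.client("datazone")`) is passed in and uses its `create_glossary` and `create_glossary_term` operations; the domain ID, project ID, glossary name, and term text are placeholders, not values from this walkthrough.

```python
def create_glossary_with_terms(client, domain_id, project_id, glossary_name, terms):
    """Create a glossary owned by a project, then add one term per entry.

    `terms` maps a term name to its short definition.
    """
    glossary = client.create_glossary(
        domainIdentifier=domain_id,
        owningProjectIdentifier=project_id,
        name=glossary_name,
        status="ENABLED",
    )
    term_ids = []
    for name, definition in terms.items():
        term = client.create_glossary_term(
            domainIdentifier=domain_id,
            glossaryIdentifier=glossary["id"],
            name=name,
            shortDescription=definition,
            status="ENABLED",
        )
        term_ids.append(term["id"])
    return glossary["id"], term_ids


# Example call (placeholder identifiers):
# import boto3
# create_glossary_with_terms(
#     boto3.client("datazone"), "dzd_example", "project-id", "Insurance",
#     {"Active customer": "A customer with at least one policy in force."},
# )
```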

Data Projects and Environments

You can use projects to simplify access to AWS analytics services by creating business use case–based groupings of people, data assets, and analytics tools. DataZone projects provide a space where project members can collaborate, exchange data, and share data assets. Within projects, you can create environments that provide project members with the necessary infrastructure, such as analytics tools and storage, so that they can easily produce new data or consume data they have access to.

Governance and Access Control

You can use built-in workflows that allow users across the organization to request access to data in the catalog, and that allow the owners of the data to review and approve those subscription requests. Once a subscription request is approved, DataZone can automatically grant access by managing permissions in the underlying data stores, such as Lake Formation and Redshift.

Setting up DataZone

Now that we’ve explored the fundamentals of data governance and the key concepts of DataZone, let’s move forward with the setup process.

Prerequisites

Set up Redshift Serverless (a namespace and a workgroup) in both the Producer and Consumer accounts. This is essential for enabling database sharing across accounts.

Architecture


DataZone Account

In an organization, a central Data Team is typically responsible for setting up and managing the data marketplace using DataZone. In this walkthrough, that means creating the domain, associating the Producer and Consumer accounts, and creating the projects and environments.

To begin setting up a data marketplace, the first step is to create a domain.

Create DataZone Domain
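The domain can be created from the console, or with the boto3 DataZone client's `create_domain` operation. A minimal sketch, assuming a domain execution role already exists; the domain name and role ARN below are placeholders:

```python
def create_datazone_domain(client, name, execution_role_arn, description=None):
    """Create a DataZone domain and return (domain_id, data_portal_url)."""
    kwargs = {"name": name, "domainExecutionRole": execution_role_arn}
    if description:
        kwargs["description"] = description
    response = client.create_domain(**kwargs)
    return response["id"], response.get("portalUrl")


# Example call (placeholder values):
# import boto3
# domain_id, portal_url = create_datazone_domain(
#     boto3.client("datazone"),
#     "data-marketplace",
#     "arn:aws:iam::111111111111:role/ExampleDomainExecutionRole",
# )
```

Domain creation is asynchronous, so the returned domain may not be usable until its status becomes available.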




Associate Producer and Consumer accounts

Associating your AWS accounts with DataZone domains enables you to publish data from these AWS accounts into the DataZone catalog and create DataZone projects to work with your data across multiple AWS accounts. Now that the Data Portal is ready, let’s associate the Producer and Consumer accounts.


Create Projects

Projects enable a group of users to collaborate on various business use cases that involve publishing, discovering, subscribing to, and consuming data assets in the Amazon DataZone catalog. We will create two projects: one for the Producer and one for the Consumer.
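If you prefer scripting over the Data Portal, the two projects can be sketched with the boto3 DataZone client's `create_project` operation. The project names and domain ID here are placeholders.

```python
def create_projects(client, domain_id, names):
    """Create one DataZone project per name and return {name: project_id}."""
    project_ids = {}
    for name in names:
        response = client.create_project(domainIdentifier=domain_id, name=name)
        project_ids[name] = response["id"]
    return project_ids


# Example call (placeholder values):
# import boto3
# create_projects(boto3.client("datazone"), "dzd_example", ["Producer", "Consumer"])
```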


With the projects set up and the Producer and Consumer accounts linked, the next step is to enable blueprints in both the Producer and Consumer accounts. Once this is done, we’ll return to the DataZone account to create environments, establish a business catalog, and then publish and subscribe to data assets.

Create Environments

In DataZone projects, environments are defined as collections of configured resources — such as S3 buckets, Glue databases, or Athena workgroups — each associated with a specific set of IAM principals (user roles) who are granted owner or contributor permissions to manage those resources.

In our setup, we will create two environments: one for Athena (linked with S3) and another for Redshift.
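As a rough sketch, both environments can be created with the boto3 DataZone client's `create_environment` operation. This assumes an environment profile already exists for each blueprint (one data lake profile, one data warehouse profile); the environment names and profile IDs are placeholders.

```python
def create_environments(client, domain_id, project_id, profile_ids):
    """Create one environment per entry in `profile_ids`.

    `profile_ids` maps an environment name to an existing environment
    profile ID. Returns {environment_name: environment_id}.
    """
    environment_ids = {}
    for name, profile_id in profile_ids.items():
        response = client.create_environment(
            domainIdentifier=domain_id,
            projectIdentifier=project_id,
            environmentProfileIdentifier=profile_id,
            name=name,
        )
        environment_ids[name] = response["id"]
    return environment_ids


# Example call (placeholder values):
# create_environments(client, "dzd_example", "producer-project-id",
#                     {"athena-env": "lake-profile-id",
#                      "redshift-env": "warehouse-profile-id"})
```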

Athena Environment



Redshift Environment




Producer and Consumer Account

Enable Blueprints

The blueprint used to create an environment defines which AWS tools and services (e.g., Glue or Redshift) the members of the environment's project can use as they work with assets in the DataZone catalog.

In the DataZone console, click View Associated Domain and, under the Blueprints tab, enable the Default Data Lake and Default Data Warehouse blueprints.
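The same step can be sketched with the boto3 DataZone client, run with credentials for the associated (Producer or Consumer) account. The sketch assumes the managed blueprints are named `DefaultDataLake` and `DefaultDataWarehouse`, and that the manage-access and provisioning roles already exist; the role ARNs and region are placeholders. The data lake blueprint may also need regional parameters (such as an S3 location), which are omitted here.

```python
# Assumed names of the managed blueprints; verify them in your account.
BLUEPRINT_NAMES = ("DefaultDataLake", "DefaultDataWarehouse")


def enable_default_blueprints(client, domain_id, region,
                              manage_access_role_arn, provisioning_role_arn):
    """Enable the default blueprints in one region; return the enabled names."""
    enabled = []
    for name in BLUEPRINT_NAMES:
        found = client.list_environment_blueprints(
            domainIdentifier=domain_id, managed=True, name=name
        )
        for blueprint in found["items"]:
            client.put_environment_blueprint_configuration(
                domainIdentifier=domain_id,
                environmentBlueprintIdentifier=blueprint["id"],
                enabledRegions=[region],
                manageAccessRoleArn=manage_access_role_arn,
                provisioningRoleArn=provisioning_role_arn,
            )
            enabled.append(blueprint["name"])
    return enabled
```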

Create Parameter Set for Redshift

A parameter set is a group of keys and values that DataZone requires to establish a connection to your Redshift cluster or Serverless workgroup; it is used to create data warehouse environments. These parameters include the name of your Redshift cluster or workgroup, the database, and the Secrets Manager secret that holds the credentials.


Note

Make sure the Redshift Manage Access Role has permissions to read the secret.
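One way to grant that permission is an IAM policy on the role that allows reading the specific secret. The sketch below only builds the policy document; the secret ARN is a placeholder, and your setup may require additional statements.

```python
import json


def secret_read_policy(secret_arn):
    """Build an IAM policy document that allows reading a single secret."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "secretsmanager:GetSecretValue",
                    "secretsmanager:DescribeSecret",
                ],
                "Resource": secret_arn,
            }
        ],
    }


# Placeholder ARN, for illustration only.
policy = secret_read_policy(
    "arn:aws:secretsmanager:us-east-1:111111111111:secret:example-redshift-creds"
)
print(json.dumps(policy, indent=2))
```

The printed JSON can be attached to the role as an inline policy, for example through the IAM console.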


Repeat the above steps in the Consumer account to mirror the setup.

Publishing and Cataloging Data Products

Prerequisites

For this tutorial, create two datasets in the Producer account:

Claims: Create a dataset in S3 that contains information on insurance claims filed. Additionally, use the Glue Data Catalog to catalog this data.

Customer: Create a dataset in Redshift that includes personal information and relevant details about customers.

With the projects and environments now created, we can import the existing data, catalog it, and publish it. Along the way, we will make the data easier to understand by using the business glossary and business name generation.
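For the Glue-cataloged data, the import step corresponds to a DataZone data source. Here is a hedged sketch of the request that the boto3 DataZone client's `create_data_source` operation accepts; the identifiers, the database name, and the generated source name are placeholders, and the filter simply includes every table in the database.

```python
def build_glue_data_source_request(domain_id, project_id, environment_id, database_name):
    """Build keyword arguments for datazone.create_data_source (Glue source)."""
    return {
        "domainIdentifier": domain_id,
        "projectIdentifier": project_id,
        "environmentIdentifier": environment_id,
        "name": f"{database_name}-glue-source",
        "type": "GLUE",
        "configuration": {
            "glueRunConfiguration": {
                "relationalFilterConfigurations": [
                    {
                        "databaseName": database_name,
                        # Include all tables in the database.
                        "filterExpressions": [{"type": "INCLUDE", "expression": "*"}],
                    }
                ]
            }
        },
        # Import into the project inventory first; publish after curation.
        "publishOnImport": False,
        "recommendation": {"enableBusinessNameGeneration": True},
    }


# Example usage (placeholder values):
# request = build_glue_data_source_request("dzd_example", "project-id", "env-id", "claims_db")
# source = client.create_data_source(**request)
# client.start_data_source_run(domainIdentifier="dzd_example",
#                              dataSourceIdentifier=source["id"])
```

Keeping `publishOnImport` off leaves room to attach glossary terms and review generated business names before the asset becomes discoverable.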

Publish Claims data





Publish Customer data



Having published both datasets, the next step is to subscribe from the Consumer account. Once subscribed, you can analyze the data using Athena and Redshift.

Discovering and Subscribing to Data Products

The Data Consumer searches the catalog, discovers the data needed for the business use case, and requests access to it through a data subscription. Once the Data Product Owner approves the subscription, the data asset is available for use by the Data Analyst.
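The request and approval flow can also be driven through the boto3 DataZone client. A minimal sketch, assuming you already know the listing ID of the published asset and the ID of the consumer project (both placeholders here):

```python
def request_subscription(client, domain_id, listing_id, consumer_project_id, reason):
    """Consumer side: request access to a published listing; return the request ID."""
    response = client.create_subscription_request(
        domainIdentifier=domain_id,
        requestReason=reason,
        subscribedListings=[{"identifier": listing_id}],
        subscribedPrincipals=[{"project": {"identifier": consumer_project_id}}],
    )
    return response["id"]


def approve_subscription(client, domain_id, request_id, comment):
    """Owner side: approve a pending subscription request; return its status."""
    response = client.accept_subscription_request(
        domainIdentifier=domain_id,
        identifier=request_id,
        decisionComment=comment,
    )
    return response["status"]
```

In practice the two calls are made by different people: the consumer submits the request, and an owner of the publishing project accepts or rejects it.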

Claims data

Create subscription




Analyze and Visualize data in Athena
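Besides the Athena query editor, the subscribed table can be queried from code. The sketch below assumes a boto3 Athena client (`boto3.client("athena")`) and uses hypothetical database, table, and workgroup names; the workgroup is expected to have a query result location configured.

```python
import time


def run_athena_query(client, sql, database, workgroup, poll_seconds=1.0):
    """Run a query and return the first page of results as lists of strings.

    For SELECT statements the first returned row holds the column names.
    """
    execution_id = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        WorkGroup=workgroup,
    )["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    while True:
        execution = client.get_query_execution(QueryExecutionId=execution_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")

    rows = client.get_query_results(QueryExecutionId=execution_id)["ResultSet"]["Rows"]
    return [[cell.get("VarCharValue") for cell in row["Data"]] for row in rows]


# Example call (hypothetical names):
# run_athena_query(athena, "SELECT * FROM claims LIMIT 10",
#                  "consumer_sub_db", "consumer-workgroup")
```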




Customer data

Create subscription


Analyze and Visualize data in Redshift
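The subscribed data can also be queried from code through the Redshift Data API. The sketch below assumes a boto3 `redshift-data` client and a Redshift Serverless workgroup; the workgroup, database, and table names are hypothetical.

```python
import time


def run_redshift_query(client, sql, workgroup, database, poll_seconds=1.0):
    """Run a statement on Redshift Serverless; return (column_names, rows)."""
    statement_id = client.execute_statement(
        WorkgroupName=workgroup, Database=database, Sql=sql
    )["Id"]

    # Poll until the statement reaches a terminal state.
    while True:
        status = client.describe_statement(Id=statement_id)["Status"]
        if status in ("FINISHED", "FAILED", "ABORTED"):
            break
        time.sleep(poll_seconds)

    if status != "FINISHED":
        raise RuntimeError(f"Redshift statement ended with status {status}")

    result = client.get_statement_result(Id=statement_id)
    columns = [column["name"] for column in result["ColumnMetadata"]]
    # Each field is a one-key dict such as {"stringValue": "x"} or {"isNull": True}.
    rows = [
        [None if field.get("isNull") else next(iter(field.values())) for field in record]
        for record in result["Records"]
    ]
    return columns, rows


# Example call (hypothetical names):
# run_redshift_query(redshift_data, "SELECT * FROM customer LIMIT 10",
#                    "consumer-workgroup", "dev")
```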



Conclusion

This blog covered setting up data governance with AWS DataZone, including creating datasets, configuring environments, and managing data access. With these steps, you can now efficiently manage and analyze data across your organization, enhancing data-driven decision-making.