What is AWS Lake Formation?

Welcome to the AWS Lake Formation Developer Guide.

AWS Lake Formation helps you centrally govern, secure, and globally share data for analytics and machine learning. With Lake Formation, you can manage fine-grained access control for your data lake data on Amazon Simple Storage Service (Amazon S3) and its metadata in AWS Glue Data Catalog.

Lake Formation provides its own permissions model that augments the IAM permissions model. The Lake Formation permissions model enables fine-grained access to data stored in data lakes through a simple grant or revoke mechanism, much like a relational database management system (RDBMS). Lake Formation permissions are enforced using granular controls at the column, row, and cell levels across AWS analytics and machine learning services, including Amazon Athena, Amazon QuickSight, Amazon Redshift Spectrum, Amazon EMR, and AWS Glue.
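The grant mechanism described above can be sketched with the AWS SDK for Python (boto3). This is a minimal illustration, not a prescribed workflow: the role ARN, database, table, and column names are hypothetical placeholders, and the helper names build_column_grant and send_grant are introduced only for this example. The request is built as a plain dictionary so that it can be inspected before anything is sent.

```python
def build_column_grant(principal_arn, database, table, columns):
    """Build keyword arguments for a column-level SELECT grant."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


def send_grant(request):
    """Send the grant. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder runs without boto3

    return boto3.client("lakeformation").grant_permissions(**request)


# Hypothetical principal and table, for illustration only.
request = build_column_grant(
    "arn:aws:iam::111122223333:role/AnalystRole",
    "sales_db",
    "orders",
    ["order_id", "order_date"],
)
```

The boto3 revoke_permissions call accepts the same Principal, Resource, and Permissions arguments, so the same dictionary can be reused to undo the grant.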

The Lake Formation hybrid access mode for AWS Glue Data Catalog lets you secure and access the cataloged data using both Lake Formation permissions and IAM permissions policies for Amazon S3 and AWS Glue actions. With hybrid access mode, data administrators can onboard Lake Formation permissions selectively and incrementally, focusing on one data lake use case at a time.

Lake Formation also allows you to share data internally and externally across multiple AWS accounts, AWS organizations, or directly with IAM principals in another account, providing fine-grained access to the AWS Glue Data Catalog metadata and underlying data.

Lake Formation features

Lake Formation helps you break down data silos and combine different types of structured and unstructured data into a centralized repository. First, identify existing data stores in Amazon S3 or relational and NoSQL databases, and move the data into your data lake. Then crawl, catalog, and prepare the data for analytics. Next, provide your users with secure self-service access to the data through their choice of analytics services.

Data ingestion and management

Import data from databases already in AWS

Once you specify where your existing databases are and provide your access credentials, Lake Formation reads the data and its metadata (schema) to understand the contents of the data source. It then imports the data to your new data lake and records the metadata in a central catalog. With Lake Formation, you can import data from MySQL, PostgreSQL, SQL Server, MariaDB, and Oracle databases running in Amazon RDS or hosted in Amazon EC2. Both bulk and incremental data loading are supported.

Import data from other external sources

You can use Lake Formation to move data from on-premises databases by connecting with Java Database Connectivity (JDBC). Identify your target sources and provide access credentials in the console, and Lake Formation reads and loads your data into the data lake. To import data from databases other than the ones listed above, you can create custom ETL jobs with AWS Glue.

Catalog and label your data

You can use AWS Glue crawlers to read your data in Amazon S3, extract database and table schemas, and store that metadata in a searchable AWS Glue Data Catalog. Then, use Lake Formation tag-based access control (TBAC) to manage permissions on databases, tables, and columns. For more information about adding tables to the Data Catalog, see Creating Data Catalog tables and databases.
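As a rough sketch of the crawling step, the following builds a request for the AWS Glue create_crawler call in boto3. The crawler name, role ARN, database name, and S3 path are hypothetical, and the helper names are introduced only for this example.

```python
def build_crawler_request(name, role_arn, database, s3_path):
    """Build keyword arguments for glue.create_crawler with one S3 target."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start(request):
    """Create the crawler and start it. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder runs without boto3

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# Hypothetical names, for illustration only.
crawler_request = build_crawler_request(
    "orders-crawler",
    "arn:aws:iam::111122223333:role/GlueCrawlerRole",
    "sales_db",
    "s3://amzn-s3-demo-bucket/orders/",
)
```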

Security management

Define and manage access controls

Lake Formation provides a single place to manage access controls for data in your data lake. You can define security policies that restrict access to data at the database, table, column, row, and cell levels. These policies apply to IAM users and roles, and to users and groups when federating through an external identity provider. You can use fine-grained controls to access data secured by Lake Formation within Amazon Redshift Spectrum, Athena, AWS Glue ETL, and Amazon EMR for Apache Spark. Whenever you create IAM identities, make sure to follow IAM best practices. For more information, see Security best practices in the IAM User Guide.

Hybrid access mode

Lake Formation hybrid access mode provides the flexibility to selectively enable Lake Formation permissions for databases and tables in your AWS Glue Data Catalog. With hybrid access mode, you now have an incremental path that allows you to set Lake Formation permissions for a specific set of users without interrupting the permission policies of other existing users or workloads. For more information, see Hybrid access mode.
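One possible way to set up hybrid access mode programmatically is sketched below. It assumes the boto3 register_resource call with its HybridAccessEnabled flag and the create_lake_formation_opt_in call; the ARNs, database, and table are hypothetical, and the helper names are introduced only for this example. The opted-in principal would typically also need Lake Formation permissions granted on the resource.

```python
def build_hybrid_requests(s3_arn, principal_arn, database, table):
    """Build requests to register a location in hybrid mode and opt in one principal."""
    register = {
        "ResourceArn": s3_arn,
        "UseServiceLinkedRole": True,
        "HybridAccessEnabled": True,
    }
    opt_in = {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
    }
    return register, opt_in


def apply_hybrid(register, opt_in):
    """Send both requests. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder runs without boto3

    lf = boto3.client("lakeformation")
    lf.register_resource(**register)
    lf.create_lake_formation_opt_in(**opt_in)


# Hypothetical location, principal, and table, for illustration only.
register_req, opt_in_req = build_hybrid_requests(
    "arn:aws:s3:::amzn-s3-demo-bucket/orders",
    "arn:aws:iam::111122223333:role/AnalystRole",
    "sales_db",
    "orders",
)
```

Principals that are not opted in keep using their existing IAM policies for Amazon S3 and AWS Glue, which is what makes the incremental rollout possible.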

Implement audit logging

Lake Formation provides comprehensive audit logs with CloudTrail to monitor access and show compliance with centrally defined policies. You can audit data access history across analytics and machine learning services that read the data in your data lake through Lake Formation. This lets you see which users or roles have attempted to access what data, with which services, and when. You can access audit logs in the same way you access any other CloudTrail logs, using the CloudTrail APIs and console. For more information about CloudTrail logs, see Logging AWS Lake Formation API Calls Using AWS CloudTrail.
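As an illustration of reading these logs through the CloudTrail APIs, the sketch below filters events by the Lake Formation event source using the boto3 lookup_events call and counts them per user and event name. The helper names are introduced only for this example, and which events appear depends on how CloudTrail is configured in your account.

```python
from collections import Counter

# Filter for events emitted by the Lake Formation service.
LOOKUP_REQUEST = {
    "LookupAttributes": [
        {"AttributeKey": "EventSource", "AttributeValue": "lakeformation.amazonaws.com"}
    ],
    "MaxResults": 50,
}


def summarize(events):
    """Count events per (username, event name) pair."""
    return Counter((e.get("Username", "unknown"), e["EventName"]) for e in events)


def fetch_events():
    """Fetch one page of events. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so summarize() runs without boto3

    return boto3.client("cloudtrail").lookup_events(**LOOKUP_REQUEST)["Events"]
```

For example, summarize(fetch_events()) gives a quick view of who called which Lake Formation operations in the returned page of results.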

Row and cell-level security

Lake Formation provides data filters that allow you to restrict access to a combination of columns and rows. Use row-level and cell-level security to protect sensitive data such as personally identifiable information (PII). For more information about row-level security, see Data filtering and cell-level security in Lake Formation.
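A data filter that combines a row condition with a column list could be defined roughly as follows, using the boto3 create_data_cells_filter call. The account ID, table, filter name, row expression, and column names are hypothetical, and the helper names are introduced only for this example.

```python
def build_data_cells_filter(account_id, database, table, name, row_expression, columns):
    """Build keyword arguments for a filter restricting both rows and columns."""
    return {
        "TableData": {
            "TableCatalogId": account_id,
            "DatabaseName": database,
            "TableName": table,
            "Name": name,
            "RowFilter": {"FilterExpression": row_expression},
            "ColumnNames": list(columns),
        }
    }


def create_filter(request):
    """Create the filter. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder runs without boto3

    return boto3.client("lakeformation").create_data_cells_filter(**request)


# Hypothetical filter: only US rows, and only non-sensitive columns.
filter_request = build_data_cells_filter(
    "111122223333",
    "sales_db",
    "customers",
    "us_customers_no_pii",
    "country = 'US'",
    ["customer_id", "country", "signup_date"],
)
```

Listing only the permitted columns leaves sensitive columns out of the filter, so a principal granted SELECT on the filter sees neither the excluded rows nor the excluded columns.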

Tag-based access control

Use Lake Formation tag-based access control to manage hundreds or even thousands of data permissions by creating custom labels called LF-Tags. You can define LF-Tags and attach them to databases, tables, or columns. Then, share controlled access across analytics, machine learning (ML), and extract, transform, and load (ETL) services for consumption. LF-Tags let you scale data governance by replacing the policy definitions of thousands of resources with a few logical tags. Lake Formation provides a text-based search over this metadata, so your users can quickly find the data they need to analyze.
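The LF-Tag workflow described above (define a tag, attach it to a resource, grant on a tag expression) can be sketched with three boto3 Lake Formation requests. The tag key, tag values, table, and role ARN are hypothetical, and the helper names are introduced only for this example.

```python
def build_lf_tag_requests(tag_key, tag_values, database, table, principal_arn, granted_value):
    """Build create, attach, and grant requests for a single LF-Tag."""
    create = {"TagKey": tag_key, "TagValues": list(tag_values)}
    attach = {
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "LFTags": [{"TagKey": tag_key, "TagValues": [granted_value]}],
    }
    grant = {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "LFTagPolicy": {
                "ResourceType": "TABLE",
                "Expression": [{"TagKey": tag_key, "TagValues": [granted_value]}],
            }
        },
        "Permissions": ["SELECT"],
    }
    return create, attach, grant


def apply_lf_tags(create, attach, grant):
    """Send the three requests. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder runs without boto3

    lf = boto3.client("lakeformation")
    lf.create_lf_tag(**create)
    lf.add_lf_tags_to_resource(**attach)
    lf.grant_permissions(**grant)


# Hypothetical tag and principal, for illustration only.
create_req, attach_req, grant_req = build_lf_tag_requests(
    "sensitivity", ["public", "confidential"],
    "sales_db", "orders",
    "arn:aws:iam::111122223333:role/AnalystRole",
    "public",
)
```

Because the grant targets a tag expression rather than a named table, any table later tagged sensitivity=public becomes readable by the same principal without a new grant.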

Cross-account access

Lake Formation permission management capabilities simplify securing and managing distributed data lakes across multiple AWS accounts through a centralized approach, providing fine-grained access control to the Data Catalog and Amazon S3 locations. For more information, see Cross-account data sharing in Lake Formation.
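A cross-account grant might look like the following sketch, in which the principal is a 12-digit AWS account ID and the permission is granted with the grant option so that an administrator in the recipient account can pass it on. The account IDs and table are hypothetical, and the helper name is introduced only for this example; the resulting dictionary is intended for the boto3 Lake Formation grant_permissions call.

```python
def build_cross_account_grant(owner_account_id, recipient_account_id, database, table):
    """Build keyword arguments that share one table with another AWS account."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": recipient_account_id},
        "Resource": {
            "Table": {
                "CatalogId": owner_account_id,
                "DatabaseName": database,
                "Name": table,
            }
        },
        "Permissions": ["SELECT"],
        # Lets the recipient account grant SELECT to its own principals.
        "PermissionsWithGrantOption": ["SELECT"],
    }


# Hypothetical owner and recipient accounts, for illustration only.
share_request = build_cross_account_grant(
    "111122223333", "444455556666", "sales_db", "orders"
)
```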

Data sharing

The data sharing capability allows you to set up permissions on datasets stored in different data sources like Amazon Redshift without migrating data or metadata into Amazon S3 or AWS Glue Data Catalog. For more information, see Data sharing in Lake Formation.

You can use the following methods to share data in Lake Formation:

  • Integrating Lake Formation with Amazon Redshift data sharing – Use Lake Formation to centrally manage database, table, column, and row-level access permissions of Amazon Redshift datashares and restrict user access to objects within a datashare.

  • Connecting AWS Glue Data Catalog to external metastores – Connect AWS Glue Data Catalog to external metastores to manage access permissions on data sets in Amazon S3 using Lake Formation. No migration of metadata into the AWS Glue Data Catalog is necessary.

    For more information, see Managing permissions on datasets that use external metastores.

  • Integrating Lake Formation with AWS Data Exchange – Lake Formation supports licensing access to your data through AWS Data Exchange. If you're interested in licensing your Lake Formation data, see What is AWS Data Exchange in the AWS Data Exchange User Guide.

Getting started with Lake Formation

We recommend that you start with the following sections:
