Google Cloud Data Catalog — Integrate Your On-Prem RDBMS Metadata
Code samples with a practical approach on how to ingest metadata from on-premise Relational Databases into Google Cloud Data Catalog
Disclaimer: all opinions expressed here are my own and represent no one but myself. They come from the experience of participating in the development of fully operational sample connectors, available on GitHub.
The Google Cloud Data Catalog team has recently announced the product is GA, with the ability to accept custom (aka user-defined) types: data-catalog-metadata-management-now-generally-available! This brand-new feature opens up scope for integrations: users can now leverage Data Catalog’s well-known potential to manage metadata from almost any kind of data asset.
The Challenge
Large organizations rarely store all their data in the same place, whether for compliance or strategic reasons. As a result, many customers have their data assets spread across multiple silos, dealing with data that resides in hybrid-cloud and/or on-premise environments.
It’s also widely known that the relational database ecosystem has many different vendors, each with its own distinct features, which makes integrating with all of them hard.
So the main idea here is to provide a reusable component architecture, making it easier to integrate different relational databases with Google Cloud Data Catalog with minimal code setup. To demonstrate how it works, I’ll share design thoughts and sample code, using the SQLServer connector as the implementation example.
Furthermore, this blog post will be updated with links to other RDBMS connector implementations.
RDBMS Structure
To create a common abstraction, we need to take a step back and look at how assets are structured in an RDBMS.
At the top level, we have what we will call the Table Container, a logical segmentation that groups Tables together; each Table, in turn, has its Columns.
You might be wondering: why not just call the Table Container a Database or Schema?
In a traditional RDBMS, a privileged user connects to the database and may create schemas; inside those schemas, they create Tables with column definitions.
Some RDBMS differ on that. In MySQL, the CREATE SCHEMA statement is just a synonym for CREATE DATABASE. In Teradata, the Database plays the Schema role. And in Oracle, to create a Schema we actually have to create a User and grant it CREATE TABLE permission.
Looking at those differences, first of all we want to provide a flexible architecture, where each connector declares what its Table Container is.
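As a minimal sketch of that idea (class and attribute names here are illustrative, not the actual connector API): each connector subclass declares which level plays the Table Container role, so the common logic stays vendor-agnostic.

```python
# Hypothetical sketch: each connector declares its own Table Container
# level; the shared code never hardcodes "schema" or "database".

class MetadataScraper:
    """Generic scraper; subclasses say what groups their tables."""
    table_container_type = None  # e.g. 'schemas' or 'databases'

    def describe_hierarchy(self):
        # Common code can describe the 3-level hierarchy generically.
        return f'{self.table_container_type} > tables > columns'

class SQLServerScraper(MetadataScraper):
    table_container_type = 'schemas'    # SQL Server groups tables by schema

class MySQLScraper(MetadataScraper):
    table_container_type = 'databases'  # MySQL: CREATE SCHEMA == CREATE DATABASE
```

This keeps the vendor-specific decision in a single declaration per connector.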
Metadata Query
A second concern is how each RDBMS stores its metadata: each vendor has its own implementation of where it is stored and what the internal system tables are.
RDBMS Connection
Finally, the last concern is how to connect to the RDBMS: each vendor might use a different ODBC/JDBC driver, library, and connection args. So we must keep the piece of code responsible for the connection open for extension.
Connector Overview
After that introduction, let’s understand how the connector works at a higher level.
It’s split into three major components: datacatalog-connector-commons, datacatalog-connector-rdbms, and datacatalog-connector-sqlserver.
- datacatalog-connector-commons
The code common to all connectors lives here, mostly the ingestion-related classes. By design, only this component should communicate with the Data Catalog custom types API.
- datacatalog-connector-rdbms
This is the code common to all RDBMS connectors; classes here are open for extension, allowing each connector to customize pieces of the process.
For example, if an RDBMS connector needs to add new Tags to the ingestion process, it just extends the datacatalog_tag_factory class, which has access to all the metadata from the source system.
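A sketch of that extension point could look like the following; the class and method names are assumptions for illustration, and the real datacatalog_tag_factory API may differ.

```python
# Illustrative sketch of the tag-factory extension point; names are
# assumptions, not the actual connector classes.

class DataCatalogTagFactory:
    """Base factory: turns scraped metadata into tag field dicts."""
    def make_tag_fields(self, table_metadata):
        return {'num_rows': table_metadata.get('num_rows', 0)}

class OracleTagFactory(DataCatalogTagFactory):
    """Connector-specific subclass adding a vendor-specific field."""
    def make_tag_fields(self, table_metadata):
        fields = super().make_tag_fields(table_metadata)
        # The subclass sees all scraped metadata and can enrich the Tag.
        fields['tablespace'] = table_metadata.get('tablespace_name', 'UNKNOWN')
        return fields
```

Each connector only overrides what it needs, while the base class keeps the fields shared by every RDBMS.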
- datacatalog-connector-sqlserver
Code for the SQLServer connector, which addresses the three concerns covered in the introduction; we will go deeper into the details in the next section.
Connector implementation
Let’s take a look at those files:
- metadata_query.sql
This is the SQL query responsible for retrieving the metadata.
For SQLServer and many other RDBMS, the metadata is stored in the information_schema views. What often changes are the internal system tables; in this case, we ignore the DBO schema when we scrape the metadata.
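As an illustration only, a metadata query in that spirit could look like the sketch below; the exact columns and filters in the real metadata_query.sql differ per connector.

```sql
-- Illustrative shape of a metadata query over information_schema;
-- column names and filters are assumptions, not the connector's query.
SELECT t.table_schema,
       t.table_name,
       c.column_name,
       c.data_type
  FROM information_schema.tables t
  JOIN information_schema.columns c
    ON c.table_schema = t.table_schema
   AND c.table_name = t.table_name
 WHERE t.table_schema NOT IN ('dbo');  -- skip the DBO system schema
```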
- metadata_definition.json
The metadata definition JSON represents the three-level hierarchy we have in a particular RDBMS; in this case, the Table Container is represented by schemas, since we can have many schemas inside the same database.
The source → target fields extract the metadata from the SQL query results and map it onto a metadata object used by the later connector processes. Some RDBMS are richer in metadata, so a flexible representation is important, allowing users to add new fields to it.
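For illustration, a definition in that spirit might look like the fragment below; the actual file’s keys and structure may differ from this sketch, which only conveys the source → target mapping idea.

```json
{
  "table_container_def": {
    "key": "schemas",
    "fields": [
      {"source": "schema_name", "target": "name"}
    ]
  },
  "table_def": {
    "key": "tables",
    "fields": [
      {"source": "table_name", "target": "name"},
      {"source": "table_rows", "target": "num_rows"}
    ]
  }
}
```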
In Oracle, for example, we have a special metadata field that stores how many rows each Table has, without us having to query each Table for it.
- MetadataScraper
Here we use the pyodbc library, which loads the SQL Server ODBC driver to create the connection. Some users favored the pymssql library in the past, but it was discontinued in favor of pyodbc.
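A hedged sketch of how such a connection is opened: the DSN keys below follow the Microsoft ODBC driver’s documented format, but the helper function is illustrative and the actual connector may assemble the connection differently.

```python
# Illustrative helper: build a pyodbc DSN for SQL Server. The driver
# name assumes the Microsoft "ODBC Driver 17 for SQL Server" package.

def build_sqlserver_dsn(host, database, user, password,
                        driver='ODBC Driver 17 for SQL Server'):
    return (f'DRIVER={{{driver}}};SERVER={host};'
            f'DATABASE={database};UID={user};PWD={password}')

# With a reachable server, the scraper would then do something like:
#   import pyodbc
#   connection = pyodbc.connect(build_sqlserver_dsn(
#       'localhost', 'master', 'sqlserver', 'sqlserver_pwd'))
#   cursor = connection.cursor()
```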
For each RDBMS connector, we used the vendor-recommended library, making this extension point in the code extremely important. For example:
- PostgreSQL → psycopg2
- Oracle → cx_Oracle
- MySQL → mysql.connector
- Teradata → teradatasql
What all those libraries have in common is that they follow the DB API 2.0 specification, making it easier to create a generic abstraction on top of them, which is a huge part of the connector-rdbms library.
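To make the DB API 2.0 point concrete, here is a minimal sketch (not the connector’s actual code) of a scraping routine written against the spec alone. Since the stdlib’s sqlite3 module also implements DB API 2.0, we can demonstrate the same routine with it, even though sqlite has no information_schema.

```python
# A generic scrape routine that only relies on PEP 249 (DB API 2.0):
# cursor(), execute(), fetchall(), close(). Any compliant driver works.
import sqlite3

def scrape_metadata(connection, query):
    """Run a metadata query on any DB API 2.0 connection."""
    cursor = connection.cursor()
    cursor.execute(query)
    rows = cursor.fetchall()
    cursor.close()
    return rows

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE invoices (id INTEGER, total REAL)')
# sqlite_master plays the information_schema role in this demo.
tables = scrape_metadata(
    conn, "SELECT name FROM sqlite_master WHERE type = 'table'")
```

Swapping sqlite3 for psycopg2, cx_Oracle, or pyodbc changes only how the connection is created, not the scraping routine itself.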
For more details on each connector and their python connection library, please take a look at the github repo.
Connector execution
To wrap it up, this is the overall architecture of the SQLServer connector.
On each execution, it’s going to:
- Scrape: connect to the database and retrieve all the available metadata.
- Prepare: transform it in Data Catalog entities and create Tags with additional metadata.
- Ingest: send the Data Catalog entities to the Google Cloud project.
It’s important to remember that most of the code lives in the connector-rdbms and connector-commons libraries.
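The scrape → prepare → ingest flow above can be sketched as three small functions; this is a toy illustration of the pipeline shape, not the real classes, whose names and signatures live in connector-commons and connector-rdbms.

```python
# Toy sketch of the three-step pipeline; names are illustrative.

def scrape(rows):
    """Stand-in for the database scrape step."""
    return [{'schema': s, 'table': t} for s, t in rows]

def prepare(metadata):
    """Map scraped metadata to Data Catalog-style entry ids."""
    return [f"{m['schema']}__{m['table']}" for m in metadata]

def ingest(entries, sink):
    """Stand-in for the API calls that persist the entries."""
    sink.extend(entries)
    return len(entries)

catalog = []
count = ingest(prepare(scrape([('sales', 'orders')])), catalog)
```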
Running it
After setting up the connector environment by following the instructions in the GitHub repo, let’s execute it using its command-line args:
# Environment variables
export DATACATALOG_PROJECT_ID=sqlserver2dc-gcp-project
export DATACATALOG_LOCATION_ID=us-central1
export SQLSERVER_HOST=localhost
export SQLSERVER_USERNAME=sqlserver
export SQLSERVER_PASSWORD=sqlserver_pwd
export SQLSERVER_DATABASE=master
export GOOGLE_APPLICATION_CREDENTIALS=<CREDENTIALS_FILES_FOLDER>/sqlserver2dc-datacatalog-credentials.json

# Command line execution
google-datacatalog-sqlserver-connector \
--datacatalog-project-id=$DATACATALOG_PROJECT_ID \
--datacatalog-location-id=$DATACATALOG_LOCATION_ID \
--sqlserver_host=$SQLSERVER_HOST \
--sqlserver_user=$SQLSERVER_USERNAME \
--sqlserver_pass=$SQLSERVER_PASSWORD \
--sqlserver_database=$SQLSERVER_DATABASE
Results
Once the execution finishes, we can go into the Data Catalog search UI and look for the ingested assets.
Running with CSV
There’s also an option to execute the connector by reading the metadata from a CSV file:
# Environment variables
export DATACATALOG_PROJECT_ID=sqlserver2dc-gcp-project
export DATACATALOG_LOCATION_ID=us-central1
export SQLSERVER_HOST=localhost
export RAW_METADATA_CSV=resource/sqlserver_full_dump.csv
export GOOGLE_APPLICATION_CREDENTIALS=<CREDENTIALS_FILES_FOLDER>/sqlserver2dc-datacatalog-credentials.json

# Command line execution
google-datacatalog-sqlserver-connector \
--datacatalog-project-id=$DATACATALOG_PROJECT_ID \
--datacatalog-location-id=$DATACATALOG_LOCATION_ID \
--sqlserver_host=$SQLSERVER_HOST \
--raw-metadata-csv=$RAW_METADATA_CSV
This is extremely important, since we might be dealing with an on-premise RDBMS in a network unable to reach the GCP project. The ingestion process is then split in two: 1 — generating the CSV file; 2 — running the connector from it.
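The two-step flow can be sketched with the stdlib csv module: step 1 dumps metadata to a CSV file inside the isolated network, step 2 reads it back from a machine that can reach GCP. The column names below are illustrative; the real dump format is defined by the connector.

```python
# Toy sketch of the CSV hand-off between the two steps; column names
# are assumptions, not the connector's actual dump layout.
import csv
import io

def dump_metadata_csv(rows, fileobj):
    """Step 1: write scraped metadata rows to CSV (isolated network)."""
    writer = csv.writer(fileobj)
    writer.writerow(['schema_name', 'table_name', 'column_name'])
    writer.writerows(rows)

def read_metadata_csv(fileobj):
    """Step 2: the connector re-reads the dump instead of the database."""
    return list(csv.DictReader(fileobj))

buffer = io.StringIO()  # stands in for the CSV file on disk
dump_metadata_csv([('dbo', 'orders', 'id')], buffer)
buffer.seek(0)
records = read_metadata_csv(buffer)
```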
Reading the CSV file from a Cloud Storage bucket is also supported, if the raw-metadata-csv arg is supplied with a GCS path.
Execution Metrics
Finally, let’s look at some metrics generated from the execution. Metrics were collected by running the connector on a SQL Server 2017 Standard instance, populated with 1008 tables distributed across 4 schemas.
The following metrics are not a guarantee; they are approximations that may change depending on the environment, network, and execution. Since most of the code is in the common libraries, users can expect similar values from other RDBMS connectors.
The sample connector
All topics discussed in this article are covered in a sample connector, available on GitHub: rdbms-connectors. Feel free to grab it and run it according to the instructions. Contributions are welcome, by the way!
It’s licensed under the Apache License Version 2.0, distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
Closing thoughts
In this article, we have covered how to ingest metadata from an RDBMS into Google Cloud Data Catalog, enabling users to centralize their metadata management, even when it resides in an on-premise environment. Having a common RDBMS connector library makes it easier to implement other database integrations, so stay tuned for new posts showing how to do the same with other source systems! Cheers!
References
- Data Catalog GA blog post: https://cloud.google.com/blog/products/data-analytics/data-catalog-metadata-management-now-generally-available
- Connector Github Repo: https://github.com/GoogleCloudPlatform/datacatalog-connectors
- Connector RDBMS Github Repo: https://github.com/GoogleCloudPlatform/datacatalog-connectors-rdbms
- Data Catalog official docs: https://cloud.google.com/data-catalog/
- SQL Server ODBC driver: https://docs.microsoft.com/en-us/sql/connect/odbc/linux-mac/installing-the-microsoft-odbc-driver-for-sql-server?view=sql-server-2017