Google Cloud Data Catalog — Integrate Your On-Prem RDBMS Metadata

Code samples with a practical approach on how to ingest metadata from on-premise Relational Databases into Google Cloud Data Catalog

Marcelo Costa
Google Cloud - Community
8 min readMay 12, 2020

--

Background by JJ Ying on Unsplash

Disclaimer: All opinions expressed are my own, and represent no one but myself…. They come from the experience of participating in the development of fully operational sample connectors, available at: github.

Google Cloud Data Catalog Team has recently announced the product is GA, with the feature to accept custom (aka user-defined) types: data-catalog-metadata-management-now-generally-available! This brand new feature opens up scope for integrations and now users can leverage Data Catalog’s well-known potential to manage metadata from almost any kind of data asset.

The Challenge

We hardly find large organizations storing all their data at the same place, sometimes it’s because of compliance or even a strategic reason. That leads to many customers having their data assets spread across multiple silos, dealing with data that resides on hybrid clouds and/or on-premise environments.
It’s also widely known that the Relational Databases ecosystem, has many different vendors, making it hard to work with each one since we end up dealing with so many distinct features.

So the main idea here is to provide a reusable components architecture, making it easier to integrate different Relational Databases to Google Cloud Data Catalog with minimum code setup.

To demonstrate how it works, I’ll share design thoughts and sample code, and for that the SQLServer connector will be used as the implementation example.

Furthermore, this blog post will be updated with links for other RDBMS connectors implementations:

RDBMS connectors options

RDBMS Structure

To create a commons abstraction we need to take a step back and look at how assets are structured in a RDBMS.

RDBMS data structure

At the top level, we have what we will name as theTable Container, a logical segmentation that groups Tables together, then we have each Table Columns.

You might be wondering, why not just call the Table Container a Database or Schema?

On the traditional RDBMS, a privileged user connects to the database, then he might create schemas, and inside those schemas, create Tables with column definitions.

Some RDBMS differ on that, like MySQL, where the create Schema statement is just a link to create Database.

Teradata where the Database plays the Schema role.

And Oracle whereas to create a Schema, we actually have to create a User and grant create Table permissions to it.

Looking at those differences, firstly we want to provide a flexible architecture, where each connector declares what their Table Container is.

Metadata Query

A second concern is how each RDBMS stores its metadata, each vendor might have their own implementation on how it is stored, and what their internal system tables are.

RDBMS Connection

Finally, the last concern is how to connect to the RDBMS, each vendor might use a different ODBC/JDBC driver, library, and connection args. So we must keep the piece of code responsible for the connection open for extension.

Connector Overview

After that introduction, let’s understand how the connector works at a higher level.

SQLServer Data Catalog Connector overview

It’s split into three major components: datacatalog-connector-commons, datacatalog-connector-rdbms, and datacatalog-connector-sqlserver.

  • datacatalog-connector-commons

The commons code to all connectors lives here, mostly the ingestion-related classes. By design, only this component should communicate with Data Catalog custom types API.

Commons connector structure
  • datacatalog-connector-rdbms

This is the common code for all RDBMS connectors, classes here are open for extension, allowing each connector to customize pieces of the process.

RDBMS connector structure

For example, if a RDBMS connector needs to add new Tags to the ingestion process, just extend the datacatalog_tag_factory class, there you will have access to all the metadata from the source system.

  • datacatalog-connector-sqlserver
SQLServer connector structure

Code for the SQLServer connector, which addresses the 3 concerns covered at the introduction, we will go deeper into details on the next section.

Connector implementation

SQLServer implementation expanded view

Let’s take a look at those files:

  • metadata_query.sql

This the SQL query responsible for retrieving the metadata

Metadata SQL Query

For SQLServer and many other RDBMS, the metadata information is stored at the information_schema database. What often changes, is the internal system tables, in this case, we will ignore the DBO schema when we scrape the metadata.

  • metadata_definition.json
Metadata Definition JSON

The Metadata Definition JSON, represents the three-level hierarchy we have in a particular RDBMS, in this case, the Table Container is represented as schemas, since we can have many schemas inside the same database.

The sourcetarget fields will extract the metadata from the SQL query and send it to a metadata object used by the latter connector processes. Some RDBMS are richer in terms of metadata information, so a flexible representation is important, allowing users to add new fields to it.

In Oracle for example we have a special metadata field that stores how many rows each Table has, without having to query each Table for it.

  • MetadataScraper
Metadata Scraper extension point

Here we use the pyodbc library, which loads the SQL Server ODBC driver to create the connection. Some users favored the Pymssql library at some point in the past, but it was discontinued in favor of pyodbc.

For each RDBMS connector, we used their vendor recommend library, making this extension point in the code extremely important. For example:

The common on all those libraries is that they follow the DB API 2.0 specification, making it easier to create a generic abstraction on top of it, which is a huge part of the connector-rdbms library.

For more details on each connector and their python connection library, please take a look at the github repo.

Connector execution

To wrap it up, this is the overall architecture of the SQLServer connector

SQLServer Connector Architecture

On each execution, it’s going to:

  • Scrape: connect to the database and retrieve all the available metadata.
  • Prepare: transform it in Data Catalog entities and create Tags with additional metadata.
  • Ingest: send the Data Catalog entities to the Google Cloud project.

It’s important to remember, that most of the code lives in the connector-rdbms and connector-commons libraries.

Running it

After setting up the connector environment, by following the instructions at the Github repo, let’s execute it using its command line args:

Results

Once the execution finishes we can go into Data Catalog search UI and look for the ingested assets

Ingested SQLServer Schema
Ingested SQLServer Table from the same Schema

Running with CSV

There’s also an option to execute the connector by reading the metadata from a CSV file:

This is extremely important since we might be dealing with an on-premise RDBMS in a network unable to reach the GCP project. The ingestion process then would be split into two; 1 — generating the CSV file, 2 — running the connector from it.

It’s also supported reading the CSV file from a Cloud Storage bucket, if the raw-metadata-csv arg is supplied with a GCS path.

Execution Metrics

Finally, let’s look at some metrics generated from the execution. Metrics were collected by running the connector on a SQL Server 2017 Standard instance, populated with 1008 tables distributed into 4 schemas.

The following metrics are not a guarantee, they are approximations that may change depending on the environment, network and execution. Since mostly of the code are in common libraries, users can expect similar values from other RDBMS connectors.

Metrics summary
Data Catalog API calls Drill down

The sample connector

All topics discussed in this article are covered in a sample connector, available on GitHub: rdbms-connectors. Feel free to get it and run according to the instructions. Contributions are welcome, by the way!

It’s licensed under the Apache License Version 2.0, distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

Closing thoughts

In this article, we have covered how to ingest metadata from a RDBMS into Google Cloud Data Catalog, enabling users to centralize their Metadata management, even when it resides within an on-premise environment. Having a RDBMS commons connector makes it easier to implement other database integrations, so stay tuned for new posts showing how to do the same with other source systems! Cheers!

References

--

--

Marcelo Costa
Google Cloud - Community

software engineer & google cloud certified architect and data engineer | love to code, working with open source and writing @ alvin.ai