Data Governance in the Age of Big Data

Organizations are becoming more data intelligent by trying to make as many decisions as possible, at as many levels as possible, based on insights from the data gathered by different systems.

They have a lot of historical data which they want to digitize and utilize. That increases the variety of data that must be processed. With more systems being built and connected to fetch more information, the volume of data has significantly gone up. These systems also generate the data at a much faster rate. Individuals in the organization want the changes in data to be reported faster than ever before to demonstrate quick responsiveness and agility.

The age of self-service BI, which dawned no more than 10 years ago, is already behind us. Self-service BI still requires a very capable IT team to build the warehouses and marts that host the required data models. Those data models are designed and implemented around the set of metrics that users want to visualize as stories and dashboards.

But what if end users don’t have those questions ready to guide the IT team in building relevant models? What if they want to observe patterns in the data as it flows in, and then decide which questions they want answered?

Within an organization, for the same product, the questions that an engineer wants to be answered could differ from those relevant for the corresponding project manager. Similarly, the questions could be different for a program manager, director or the CIO.

We are now in an age when there are hundreds of analytics tools, platforms, and products that allow users to peek into data at various stages in the pipeline. What they don’t do well enough is data governance and metadata management.

What is data governance and how could it lead to organizational success?

An organization will thrive if its data-users can find, understand and trust their data to make better decisions and deliver better results.

Data governance is a comprehensive framework for managing the availability, usability, integrity, and security of an organization’s data.

Addressing the most crucial aspect first: well-defined data governance policies ensure better security. According to a Gemalto survey, “nearly two-thirds (64%) of consumers surveyed worldwide say they are unlikely to shop or do business again with a company that had experienced a breach where financial information was stolen, and almost half (49%) had the same opinion when it came to data breaches where personal information was stolen.” In the event of a data breach, the damage to consumers’ perception of the organization can therefore outweigh most other losses it might bear.

When it comes to data security, there are concerns around security breaches, storage and archival of legacy data, customer consent, and third-party liabilities. Some of these concerns are industry-specific. Additionally, there are regulatory requirements and industry standards that must be met, such as BCBS 239, CCAR, Solvency II, HIPAA, IDMP, and GDPR.

A data governance solution addresses these concerns. Coupled with effective data governance practices, it enables an organization to develop greater confidence in its data, which is a prerequisite for making data-driven business decisions.

For each governance activity, data governance tools execute policies such as:

Extract, transform, and load
Data quality maintenance
Master Data Management (MDM)
Life-cycle management

These tools also monitor security and metadata repositories.

A data value chain consists of the following entities: producer, publisher, consumer, and decision-maker. Governance puts constraints on the publisher so that each dataset is delivered in a format that is acceptable to all consumers. It then defines the responsibilities of the publisher and the consumer to establish the protocol for each such dataset. That protocol is decided by questions such as: when is the data available, what information does it represent, is it derived or cleansed, what is its request latency, and what vocabulary is associated with it? When zoomed out, these responsibilities and protocols translate to the following action items:

Manage data policies and rules
Discover and document data sources
Manage compliance with evolving regulations
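
The publisher-side protocol described above can be pictured as a small contract record that travels with each dataset. The following Python sketch is illustrative only: the DatasetContract class, its field names, the validate_delivery helper, and the sample dataset are all hypothetical and are not taken from any particular governance tool.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DatasetContract:
    """Hypothetical record of what a publisher promises about one dataset."""
    name: str
    description: str            # what information the data represents
    is_derived: bool            # True if derived or cleansed rather than raw
    max_latency_seconds: int    # agreed request latency
    vocabulary: frozenset = field(default_factory=frozenset)  # agreed field names


def validate_delivery(contract, record, latency_seconds):
    """Return a list of contract violations for one delivered record."""
    problems = []
    unknown = set(record) - contract.vocabulary
    if unknown:
        problems.append(f"fields outside agreed vocabulary: {sorted(unknown)}")
    if latency_seconds > contract.max_latency_seconds:
        problems.append(
            f"latency {latency_seconds}s exceeds agreed {contract.max_latency_seconds}s"
        )
    return problems


# Hypothetical contract for an illustrative dataset.
contract = DatasetContract(
    name="orders_daily",
    description="One row per customer order, cleansed of test accounts",
    is_derived=True,
    max_latency_seconds=5,
    vocabulary=frozenset({"order_id", "customer_id", "amount"}),
)

ok = validate_delivery(contract, {"order_id": 1, "amount": 9.5}, latency_seconds=2)
bad = validate_delivery(contract, {"order_id": 1, "coupon": "X"}, latency_seconds=9)
```

Here ok is an empty list, while bad reports both the unexpected coupon field and the missed latency target. Keeping the contract as data, rather than burying it in pipeline code, is one way to let policies be managed and documented centrally.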

Data governance through metadata management

Per Gartner, by 2020, 50% of information governance initiatives will be enacted with policies based on metadata alone. Metadata provides the information needed to make sense of data (e.g. datasets and images), concepts (e.g. classification methods), and real-world entities (e.g. people, products and places).

There are three types of metadata: descriptive metadata, which describes a source for discovery and identification; structural metadata, which describes data models and reference data; and administrative metadata, which provides information that helps manage and monitor that source. A variety of specifications and frameworks define how metadata should be managed for optimum data governance.
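
To make the three types concrete, here is a minimal Python sketch of how a single dataset might carry all of them. The class names, fields, and sample values are assumptions chosen for illustration; real metadata specifications define far richer structures.

```python
from dataclasses import dataclass, asdict


@dataclass
class DescriptiveMetadata:
    """Supports discovery and identification."""
    title: str
    keywords: list


@dataclass
class StructuralMetadata:
    """Describes the data model and reference data."""
    columns: dict          # column name -> data type
    reference_data: list   # lookup tables the dataset depends on


@dataclass
class AdministrativeMetadata:
    """Helps manage and monitor the source."""
    owner: str
    retention_days: int
    access_level: str


@dataclass
class DatasetMetadata:
    descriptive: DescriptiveMetadata
    structural: StructuralMetadata
    administrative: AdministrativeMetadata


def matches(meta, term):
    """Discovery looks only at descriptive metadata: title or keywords."""
    term = term.lower()
    return term in meta.descriptive.title.lower() or any(
        term == keyword.lower() for keyword in meta.descriptive.keywords
    )


# Hypothetical dataset used only to exercise the sketch.
customers = DatasetMetadata(
    descriptive=DescriptiveMetadata("Customer master", ["CRM", "customers"]),
    structural=StructuralMetadata({"id": "int", "ssn": "string"}, ["country_codes"]),
    administrative=AdministrativeMetadata("data-office", 2555, "restricted"),
)

as_document = asdict(customers)  # nested dict, ready to store or exchange
```

A catalog search for "crm" would find this dataset through its keywords, while the structural and administrative parts stay available for modelling and monitoring.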

Metadata Management on Hadoop

Hadoop, because of its distributed computing prowess, is now the centerpiece of many organizations’ data strategies. It allows data users to do descriptive, predictive and prescriptive analytics in real time.

Many have implemented or are implementing the quintessential data lake, which can ingest, store, and transform data and help analyze it. It establishes circular connections between data sources, publishers and consumers. With many such data streams flowing into and within the system, it becomes urgent to take care of the attributes called out above, such as security, availability, and integrity.

Hortonworks Data Platform (HDP) has a combination of tools that, configured together, can provide the complete data governance picture through a metadata-based approach.

Apache Atlas, which is part of HDP, is an application that allows the exchange of metadata with other tools and processes, whether within or outside the Hadoop stack. It thus provides platform-independent governance controls that address compliance requirements. It works in tandem with other tools in HDP such as Ranger, Falcon, and Kafka to complete the data governance package.

Its services include the ability to capture data lineage, perform agile data modelling through a type system that allows custom metadata structures in a hierarchical taxonomy, use a REST API for flexible HTTP access to various services on HDP, and import metadata from current tools and export it to downstream systems.
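
As a rough illustration of the type system and REST access, the Python sketch below prepares a request that would register a two-level classification hierarchy. It reflects our understanding of the Atlas v2 REST API (a typedefs resource under /api/atlas/v2/types/typedefs that accepts classificationDefs with superTypes); treat the endpoint path, payload shape, host name, and port as assumptions to verify against the documentation for your Atlas version. The request is only built, never sent, and authentication is left out.

```python
import json
import urllib.request

# Assumption: a hypothetical Atlas host and port; replace with your own.
ATLAS_BASE_URL = "http://atlas.example.com:21000"


def classification_def(name, description, super_types=()):
    """Build one classification definition; super_types expresses the hierarchy."""
    return {
        "name": name,
        "description": description,
        "superTypes": list(super_types),
        "attributeDefs": [],
    }


def build_typedefs_request(classifications):
    """Prepare (but do not send) a POST that registers the classifications."""
    payload = {"classificationDefs": classifications}
    return urllib.request.Request(
        url=f"{ATLAS_BASE_URL}/api/atlas/v2/types/typedefs",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Illustrative taxonomy: PII is a child of a broader SENSITIVE classification.
taxonomy = [
    classification_def("SENSITIVE", "Data that needs restricted handling"),
    classification_def("PII", "Personally identifiable information", ["SENSITIVE"]),
]
request = build_typedefs_request(taxonomy)
# Passing `request` to urllib.request.urlopen would submit it once credentials
# for the cluster have been added.
```

Separating payload construction from sending keeps the taxonomy reviewable as plain data before anything reaches the cluster.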

Metadata management of the entire data infrastructure

Innova Solutions has designed and implemented Hadoop clusters for several customers and integrated those with their existing Business Intelligence infrastructure. We also help maintain and evolve those clusters to meet changing data flow requirements.

An implementation generally starts with a few data sources and specific ingestion, transformation, and presentation requirements. The scope of these requirements grows as the organization onboards newer data sources, more users, and more consuming applications. Any of these entities could be outside the Hadoop infrastructure.

To ensure consistency, integrity, and availability of data to authenticated and authorized users as data moves between systems, Innova Solutions leverages a suite of applications available out of the box on the Hortonworks Data Platform. We have built a communication layer that allows disparate systems such as Oracle, Cassandra, Kafka, and Hive to exchange metadata that can be governed centrally using Apache Atlas.

The image above gives a view of metadata management in the age of data lakes, which act as both the source and the destination of data.

Atlas is an open-source framework and repository for metadata management developed for the Hadoop stack. Its flexibility allows the exchange of metadata with other tools and processes within and outside the Hadoop stack, such as SQL Server or Oracle.

Ranger, which takes care of authorization and auditing while authentication is handled through cluster Kerberization and Apache Knox, also extends the data governance features provided by Atlas using tags. A tag is a label, for example PII, that can be put on any field, such as SSN. It is at the granularity of these tags that we can use the entire governance infrastructure to establish data movement and monitoring controls.
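
The idea of governing at the granularity of tags can be sketched in a few lines of Python. This is a conceptual model only, not Ranger's API or policy format: the column names, group names, and the can_read function are hypothetical.

```python
# Tags attached to columns, as a metadata catalog might report them (hypothetical).
COLUMN_TAGS = {
    "customers.ssn": {"PII"},
    "customers.email": {"PII"},
    "customers.city": set(),
}

# Tag-level policy: groups allowed to read columns carrying each tag (hypothetical).
TAG_POLICIES = {
    "PII": {"compliance", "fraud-analytics"},
}


def can_read(user_groups, column):
    """Allow access only if every tag on the column admits one of the user's groups.

    Simplification: columns with no tags are readable by anyone in this sketch,
    and a tag with no policy denies access.
    """
    for tag in COLUMN_TAGS.get(column, set()):
        allowed_groups = TAG_POLICIES.get(tag, set())
        if not allowed_groups & set(user_groups):
            return False
    return True


compliance_sees_ssn = can_read({"compliance"}, "customers.ssn")
marketing_sees_ssn = can_read({"marketing"}, "customers.ssn")
marketing_sees_city = can_read({"marketing"}, "customers.city")
```

Because the policy is written against the PII tag rather than against individual columns, tagging a new field as PII brings it under the same control without any policy change.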

Falcon provides features to define data lifecycle management, compliance (lineage and audit), replication and archival. The fourth pillar of data governance is a dataflow integration and workflow suite, which may or may not be part of the Hadoop infrastructure.

Put together, these tools provide a complete system that helps create a self-service data marketplace within the organization, irrespective of the number of publishers and consumers. It allows the data-driven organization to take on the ever-changing environment of vocabularies, taxonomies, and coding schemes.