Now’s the time to tackle Data Ownership

Ideas for how to approach Data Ownership within your organization

Maggie Hays
DataHub

--

The rapid adoption of DataOps practices and data observability tooling is an exciting shift in the data industry, enabling Data Practitioners to proactively detect data quality issues and minimize downtime. But tooling can only take us so far; we also need to deliberately set expectations about who is responsible for addressing issues in order to mitigate confusion, frustration, and finger-pointing between teams.

Last month, I surveyed members of the DataHub Community to understand how data teams are navigating Data Ownership within their organizations: why assigning Data Ownership is a worthwhile initiative, what it means to be a “Data Owner,” and how they hold Data Owners accountable.

I heard from 44 Data Engineers and Data Architects across the globe, representing companies of all industries, sizes, and growth stages. The following summarizes their responses to provide ideas & suggestions for how to approach Data Ownership within your own organization.

Why assigning Data Ownership matters

As it becomes cheaper and easier to create, store, transform, and leverage data, Data Practitioners find themselves scrambling to keep track of who created what data, for which purposes, and what to do when it suddenly looks “wrong.” It’s even more complex to escalate and triage data-related issues when navigating monolithic applications, centralized data lakes, or systems with ever-expanding integrations.

Assigning Data Ownership can help with the following use cases:

  • regulatory/data quality management: delegating responsibility to Data Owners to adhere to regulatory workflows and to monitor data quality throughout the data lifecycle
  • incident management: enabling Data Owners to proactively alert all owners of downstream entities when issues arise
  • routing questions/escalations: providing clarity to a broad spectrum of stakeholders & subject matter experts when troubleshooting data-related issues

By prioritizing initiatives to reliably surface Data Ownership in a central location, Data Practitioners can begin to untangle these complex pipelines and ecosystems, supercharging the impact of DataOps & observability initiatives.

Tips for getting started with Data Ownership

Start early and start small

Don’t wait for your data stack to become so utterly complex that productivity grinds to a halt. Start with a small team or domain to test out workflows; iterate until you find a Data Ownership framework that works for your specific organization’s structure & tech stack.

“I strongly recommend starting with a specific area instead of trying to create a generalized solution. If someone is the owner of everything, they are not an owner of any one thing.” — Milan Sahu, Data Platform Architect at Kavak

Set clear expectations for owners

What does it mean for someone to own data? What concrete actions are you asking them to take? What is the consequence of inaction? Building this narrative makes it much easier to delegate ownership across teams effectively and set folks up for success. The most common responsibilities we heard from the DataHub Community include:

  • Monitor SLAs for data syncs into a central data lake (see the sketch after this list)
  • Define data quality rules and address failures
  • Maintain dataset- and column-level documentation
  • Provision access to other teams/teammates
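
To make the first responsibility concrete, here is a minimal, tool-agnostic sketch of what a sync-SLA check could look like; the SLA threshold, timestamps, and alerting step are hypothetical placeholders for whatever your warehouse and observability tooling actually provide.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical SLA: this dataset must have synced within the last 6 hours.
SYNC_SLA = timedelta(hours=6)

def sync_sla_met(last_sync_at: datetime, now: Optional[datetime] = None) -> bool:
    """Return True if the most recent sync falls within the SLA window."""
    now = now or datetime.now(timezone.utc)
    return (now - last_sync_at) <= SYNC_SLA

# In practice, last_sync_at would come from warehouse metadata or your
# ingestion tool; the Data Owner is the person notified when this fails.
last_sync_at = datetime(2022, 9, 1, 3, 0, tzinfo=timezone.utc)  # hypothetical value
if not sync_sla_met(last_sync_at):
    print("SLA breached: notify the dataset owner")
```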

Extract from the source & automate as much as possible

If your goal is to surface Data Ownership details in a centralized data catalog, invest the time in extracting ownership from source systems to minimize the need for ongoing, manual maintenance. It’s worth taking the time to understand how teams have already defined ownership within their development lifecycle instead of introducing a new workflow for them to adopt or a new document for them to maintain.

Source: https://xkcd.com/927

I say this from personal experience — resist the urge to create yet one more Google Sheet as a long-term solution for tracking data ownership. It will quickly become stale and inevitably turn into another lost tab in everyone’s browser.

We’re seeing some inspiring approaches to this within the DataHub Community, where teams are automatically extracting ownership from:

  • workflow orchestration logs (e.g., Airflow) to identify which user scheduled the task that generates a dataset (see the sketch below)
  • dbt meta configs that specify the model owner
  • data profiling tools (e.g., Great Expectations) to determine who is monitoring the quality of a given dataset
  • GitHub repositories that reliably map code maintainers to data generation

Image by Author. Example of how a dbt meta config maps to a DataHub Owner.
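
The orchestration side works similarly: an Airflow DAG already declares an owner in its default_args, so a catalog ingestion can read ownership straight from the orchestrator instead of from a separate spreadsheet. A minimal sketch, assuming Airflow 2.x; the DAG, owner handle, and task below are hypothetical:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Ownership is declared once, where the pipeline itself is defined;
# a metadata ingestion job can pick it up from here automatically.
default_args = {
    "owner": "analytics-engineering",  # hypothetical team handle
    "retries": 1,
}

with DAG(
    dag_id="daily_orders_snapshot",  # hypothetical pipeline
    default_args=default_args,
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    build_snapshot = BashOperator(
        task_id="build_snapshot",
        bash_command="dbt run --select orders_snapshot",
    )
```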

Focus on high-impact data

Once Data Owners know what you expect of them, help them understand where to focus their time & energy while also holding them accountable. Our Community members leverage DataHub’s metadata model to help prioritize based on:

  • degree centrality within the lineage graph: identifying which entities have the most upstream and downstream dependencies (a simple sketch follows this list)
  • usage statistics: measuring the query frequency of a given entity
  • search and view activity within DataHub: how frequently users are seeking additional detail
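
As a rough illustration of the first signal, degree centrality can be approximated from a lineage edge list by counting each dataset's direct upstream and downstream dependencies. The dataset names and edges below are hypothetical; in practice the edge list would come from your catalog's lineage API rather than a hard-coded list.

```python
from collections import Counter

# Hypothetical lineage edges: (upstream_dataset, downstream_dataset)
lineage_edges = [
    ("raw.orders", "staging.orders"),
    ("raw.customers", "staging.customers"),
    ("staging.orders", "marts.revenue"),
    ("staging.customers", "marts.revenue"),
    ("marts.revenue", "dashboards.exec_kpis"),
    ("marts.revenue", "ml.churn_features"),
]

# Degree = number of direct upstream + downstream dependencies per dataset.
degree = Counter()
for upstream, downstream in lineage_edges:
    degree[upstream] += 1
    degree[downstream] += 1

# Datasets with the highest degree are strong candidates for assigning
# (and auditing) explicit owners first.
for dataset, score in degree.most_common(3):
    print(f"{dataset}: {score} direct dependencies")
```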

Connect with DataHub

  • Join us on Slack
  • Sign up for our Newsletter
  • Subscribe to our Calendar
