From Silos to Standardization: Leveraging DBT for a Democratized Data Framework

By — Abhishek Pathania (Engineer, Platform)

UC Blogger
Urban Company – Engineering

--

Urban Company’s growth over the last few years has resulted in a sharp increase in both the volume of data and our reliance on it to fulfill product and business use cases. Data-driven decisions across teams have played a pivotal role in this journey, but this growth came with its own set of challenges around scalability, ownership, discoverability, and cost.

The main goal of this blog is to dive deep into the issues faced by different teams and our journey towards solving them with a centralized solution known as the Common Computation Framework (CCF).

Base Data Pipeline

Let’s first understand our base data pipeline, which ingests raw data from various sources including the Customer App, Partner App, third-party events (AppsFlyer), transactional DBs (MySQL, MongoDB), and microservices. Once the base data is stored in the data warehouse, data analysts can perform the necessary aggregations and transformations to create derived tables and business metrics.

Base Data Pipeline

Data Access Patterns & Challenges

The following table shows data access patterns across different teams at Urban Company:

Fundamental problems to be solved

  1. Duplicate Metric Definitions:
    Our system lacked a single source of truth for metrics, which led to duplicated effort and resource consumption in building the same metrics multiple times.
  2. Data Lineage Challenges:
    It was hard to figure out which raw or derived tables were used to create the final metrics. When a source table changed, it was difficult to identify the downstream tables that needed to be backfilled.
  3. Searchability:
    There was no catalog for searching existing tables, metrics, and their definitions. As a result, users often reached out to the Analytics and Data Platform teams for clarification.
  4. Ownership:
    It was difficult to associate a metric with its owning team, purpose, and relevance, which made cleanups and optimizations challenging.
  5. Unoptimized Query Performance:
    Different stacks (Node.js, Python, standalone servers) and people with varied expertise led to non-standardized and suboptimal query patterns.
  6. Monitoring & Alerting:
    Our existing data systems lacked adequate monitoring and alerting capabilities, so breakages were handled reactively rather than proactively.

Core principles for the desired solution

Given that we already had a standard pipeline for our base tables, the next step was to standardize the computation and storage of derived fact/metric tables. The following is the set of features needed to streamline this process:

  1. Software Engineering Practices:
    Use Git to track changes, enforce peer review, and encourage reusable components to reduce redundancy and enhance maintainability.
  2. Code Standardization:
    Implement a standardized naming convention and integrate linting tools to promote consistency and clarity in data models.
  3. Data Cataloging:
    Create a centralized catalog to store and organize all data models. Build a data lineage system to visualize data dependencies and the evolution journey.
  4. Testability:
    Integrate automated testing for the data pipeline. Test cases should cover various scenarios, data formats, and edge cases to validate data accuracy and reliability.
  5. Continuous Integration / Continuous Deployment (CI/CD):
    Set up automatic builds on code changes and refresh data models at their configured frequencies.
  6. Monitoring & Alerting:
    Add a centralized monitoring system to track the performance and health of data systems with proactive alerting to avoid potential data discrepancies or system failures.
  7. Scalability:
    Implement distributed processing frameworks to handle large volumes of data.
  8. Accessibility:
    Provide a user-friendly dashboard for easy onboarding of new metrics, and a data access layer for microservices to seamlessly query data from a central store.

Common Computation Framework (CCF)

To fulfill the above requirements, we came up with a platformized solution referred to as the Common Computation Framework (CCF). CCF provides a standardized, scalable, and user-friendly platform to enable seamless metric adoption and encourage data utilization across teams. The CCF ecosystem has two main components: the Data Modeling Layer and the Online Metric Store.

Common Computation Framework

Data Modeling Layer

We needed a data modeling layer that transforms raw data into a structured format that can be easily queried and is cost-effective. The following are a few platforms we evaluated with respect to their feature sets, pros, and cons.

Data Modeling Solutions

Finally, we picked DBT (Data Build Tool) because it supports SQL, which is widely used across teams at Urban Company. It also offers important features like automatic documentation, data lineage, and metric searchability. DBT is platform-agnostic, making it independent of the compute layer (e.g., Snowflake, Spark, Trino).

We used the open-source DBT project with customizations for our organizational needs, which we will discuss later.

Workflow for creating and updating data models

Data Modeling Pipeline
  1. Using a Git repo, users write the SQL queries that generate fact/metric tables.
  2. They raise merge requests on GitLab, where the linting and validation pipeline runs.
  3. Once the MR is merged, Jenkins automatically picks up the new changes, creates a Docker image, and publishes it to the Docker registry (ECR).
  4. Airflow picks up the latest image and executes pipelines at the desired frequency (hourly/daily/weekly); a minimal sketch of such a DAG is shown after this list.
  5. During the Airflow DAG run, fact/metric tables are computed and stored in Snowflake.
  6. Metrics then become accessible through our BI tools, Redash and Superset.
  7. New and existing tables and their definitions can be visualized in the DBT UI, which is auto-generated from the documentation specified in the code.
Data Catalog Dashboard
Data Lineage
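
Coming back to steps 4 and 5, below is a minimal sketch of what such an Airflow DAG could look like. It assumes dbt is invoked through its CLI with path-based selection over the frequency folders described in the customizations below; in our pipeline the commands run from the Docker image built by Jenkins, and the project path, schedule, and task names here are purely illustrative.

```python
from datetime import datetime

from airflow import DAG  # Airflow 2.x assumed
from airflow.operators.bash import BashOperator

# Illustrative path; not the actual production configuration.
DBT_PROJECT_DIR = "/opt/dbt/ccf"

with DAG(
    dag_id="ccf_hourly_models",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",  # daily/weekly DAGs differ only in schedule and selector
    catchup=False,
) as dag:

    # Build all models kept under the hourly folder (path-based selection).
    run_hourly_models = BashOperator(
        task_id="dbt_run_hourly",
        bash_command=f"cd {DBT_PROJECT_DIR} && dbt run --select path:models/hourly",
    )

    # Run the tests defined for the same models to catch data issues early.
    test_hourly_models = BashOperator(
        task_id="dbt_test_hourly",
        bash_command=f"cd {DBT_PROJECT_DIR} && dbt test --select path:models/hourly",
    )

    # Regenerate the documentation/catalog that backs the DBT UI.
    generate_docs = BashOperator(
        task_id="dbt_docs_generate",
        bash_command=f"cd {DBT_PROJECT_DIR} && dbt docs generate",
    )

    run_hourly_models >> test_hourly_models >> generate_docs
```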

Customizations over DBT

  • Linting: Used SQLFluff to ensure consistent and readable SQL code. This open-source project automatically checks for adherence to coding standards and identifies errors. It ensures that our models are ANSI SQL compliant, giving us the flexibility to choose any compute layer (a small example follows this list).
  • Validations: To ensure table and column descriptions are consistent, names adhere to a standardized convention, and essential metadata (such as primary key columns, owner team, etc.) is included, we used an open-source project called DBT-Checkpoint.
  • Frequency-based tables: We separated tables into folders based on their update frequencies (hourly, daily, weekly, etc.), which allowed us to process them in separate jobs and meet the required SLAs.
  • Prod-like environment for dev: Made it easier for users to access data in the dev environment and test their models locally before publishing new changes.
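
As a small illustration of the linting customization, SQLFluff also exposes a simple Python API alongside its CLI and pre-commit hooks. The sketch below lints a made-up metric query against the ANSI dialect; it only demonstrates the kind of rules enforced in our merge-request pipeline and is not the actual CI script.

```python
import sqlfluff  # pip install sqlfluff

# A deliberately messy query with hypothetical table/column names.
RAW_SQL = """
select customer_id, COUNT(*) as Total_Orders
from fact_orders where order_status = 'completed'
GROUP BY customer_id
"""

# Report rule violations (keyword casing, aliasing, layout, etc.) against ANSI SQL.
for violation in sqlfluff.lint(RAW_SQL, dialect="ansi"):
    print(violation)

# SQLFluff can also auto-fix many of the issues it flags.
print(sqlfluff.fix(RAW_SQL, dialect="ansi"))
```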

Online Metric Store

To support the use cases of our transactional services, we needed an online database that could store and serve metrics within milliseconds. We have two types of query patterns, with ~60k rpm of peak traffic:

  • Key-Value Lookup: Querying the desired metrics for a given entity, e.g., a customer, partner, or city.
  • Reverse Lookup: Filtering on a combination of metric values and returning the list of entities that meet the criteria.

We have two sources writing data to the online metrics store:

  • OLAP to OLTP sync — On successful completion of hourly, daily, and weekly workflows at the data modeling layer, data is synced at a throughput of ~75k rpm.
  • Real-time metrics — Microservices publish near-real-time metrics that they calculate on their end.

We evaluated the following databases against various parameters to meet the requirements of our Online Metric Store:

Finally, we aligned on MongoDB for efficient key-value lookups and Elasticsearch for reverse lookups. These two databases are well integrated into Urban Company’s microservice ecosystem and can easily handle the query patterns and desired throughput at our scale. Additionally, a common stack compounds benefits through deeper expertise, better performance, and cost optimizations.
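
To make the two access patterns concrete, here is a rough sketch of a key-value lookup against MongoDB and a reverse lookup against Elasticsearch. The metric, collection, and index names are hypothetical, and production services go through the metric-store library described below rather than raw client calls.

```python
from pymongo import MongoClient          # pip install pymongo
from elasticsearch import Elasticsearch  # pip install elasticsearch (8.x client assumed)

# Hypothetical connection strings, collection, and index names.
mongo = MongoClient("mongodb://localhost:27017")
customer_metrics = mongo["metric_store"]["customer_metrics"]

# Key-value lookup: fetch selected metrics for a single entity (e.g., a customer).
doc = customer_metrics.find_one(
    {"entity_type": "customer", "entity_id": "cust_123"},
    {"_id": 0, "orders_last_30d": 1, "lifetime_value": 1},
)
print(doc)

es = Elasticsearch("http://localhost:9200")

# Reverse lookup: entities matching a combination of metric thresholds.
resp = es.search(
    index="customer_metrics",
    query={
        "bool": {
            "filter": [
                {"term": {"city": "bangalore"}},
                {"range": {"orders_last_30d": {"gte": 3}}},
            ]
        }
    },
    size=100,
)
entity_ids = [hit["_source"]["entity_id"] for hit in resp["hits"]["hits"]]
print(entity_ids)
```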

Workflow to sync metrics from warehouse to metrics store

Online Metric Pipeline
  1. The online metric service periodically polls the metadata table in Snowflake for updates and syncs them to the online data store. To keep this performant, we only update rows that have been modified since the last sync (see the sketch after this list).
  2. The online data store is divided into separate data marts, each holding the metrics owned by a specific team.
  3. Business services interact with the online metric service via a metric-store library with well-defined functions, viz. getMetrics and publishMetrics.
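
A minimal sketch of the incremental sync in step 1 could look like the following, assuming an updated_at watermark column on the Snowflake side and an upsert into the owning team's data mart; the table, column, and collection names are illustrative, not our actual schema.

```python
from datetime import datetime, timezone

import snowflake.connector  # pip install snowflake-connector-python
from pymongo import MongoClient, UpdateOne

# Watermark from the previous successful sync (illustrative; persisted elsewhere in practice).
last_sync = datetime(2024, 1, 1, tzinfo=timezone.utc)

conn = snowflake.connector.connect(
    account="xxx", user="xxx", password="xxx",
    warehouse="SYNC_WH", database="ANALYTICS", schema="METRICS",
)
cur = conn.cursor(snowflake.connector.DictCursor)
# Pull only the rows modified since the last sync.
cur.execute(
    "SELECT entity_type, entity_id, metric_name, metric_value, updated_at "
    "FROM customer_metrics WHERE updated_at > %s",
    (last_sync,),
)

mongo = MongoClient("mongodb://localhost:27017")
data_mart = mongo["metric_store"]["customer_metrics"]  # per-team data mart

# Upsert each changed metric onto the entity's document in the online store.
ops = [
    UpdateOne(
        {"entity_type": row["ENTITY_TYPE"], "entity_id": row["ENTITY_ID"]},
        {"$set": {row["METRIC_NAME"]: row["METRIC_VALUE"],
                  "updated_at": row["UPDATED_AT"]}},
        upsert=True,
    )
    for row in cur
]
if ops:
    data_mart.bulk_write(ops, ordered=False)
```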

Impact

  • Democratization: Now, anyone with a basic knowledge of data and SQL can use CCF to create standardized metrics.
  • Centralized Metric Repository: All of our metric definitions now live in one place, making them easy to track and keep consistent.
  • Data Catalog: DBT automatically generates documentation, making it easier for users to find information about existing facts/metrics along with their lineage.
  • Standardization: DBT enables the application of software engineering practices such as version control, testing, and linting, leading to improved code quality.
  • Cost Reduction: Duplicate processing is minimized by ensuring that only updated rows are synced to production databases, reducing unnecessary data transfer and associated costs.
  • Platform Independence: DBT’s compatibility with various query engines allows it to be platform-independent, enabling the use of different compute layers (e.g., Snowflake, Spark, Trino) based on organizational needs.

Future Scope

  • Dashboard Improvements: Create a unified dashboard that allows quick creation and updating of all DBT models, opening up data modeling to a broader audience.
  • Data Quality: Incorporate data quality checks, such as min-max limits, standard deviation, and null percentage, to improve data integrity and reliability.
  • Data drift & anomaly detection: Build a system that can identify unusual patterns and deviations from the expected data distribution.

Team:

Abhishek Pathania is a highly skilled technology professional, with expertise in building scalable solutions.

Shashank Chaudhary is a skilled tech engineer who specializes in crafting and implementing intricate infrastructure solutions.

Nitesh Jain is a passionate tech enthusiast skilled in leading engineering projects to deliver scalable solutions that drive business success.

Kushal Singh has a proven track record of delivering top-quality software solutions that exceed expectations and is always eager to take on the next big thing in tech.

Sounds like fun?
If you enjoyed this blog post, please clap 👏(as many times as you like) and follow us (@UC Blogger). Help us build a community by sharing on your favourite social networks (Twitter, LinkedIn, Facebook, etc).

If you are interested in finding out about opportunities, visit us at http://careers.urbancompany.com
