Business Problem:
The trade-off between development speed and pipeline maintainability is a constant for data engineers, especially those in a rapidly evolving organization. New data-source ingestions are frequently added on an as-needed basis, making it difficult to leverage shared functionality between pipelines. Identifying when technical debt has become prohibitive for an organization can be difficult, but remedying it can be even more so.
As our customer's data engineering team grappled with their own technical debt, they identified the need for higher data quality enforcement, the consolidation of shared pipeline functionality, and a scalable way to implement complex business logic for their downstream data scientists and machine learning engineers.
In this case study, we will explain how Quadratic designed our client's new end-to-end pipeline architecture to make the creation of additional pipelines robust, maintainable, and scalable, all while writing fewer lines of code with Apache Spark.
Original Architecture:
The original architecture was designed to support our client's Annual Enrollment and started with the collection of data. Support for both internal and external sources was needed. For the internal sources, product teams landed data in a staging area; for external sources, API calls were made to land the data in the same way. The data landed as JSON, and the first step was to convert it to Parquet. Depending on the requirements of the pipeline, a small amount of custom logic might then be implemented. All of the pipelines were run using Oozie as the orchestrator, and all logic was executed as a combination of Pig and Hive jobs. Each pipeline was architected independently.
In the last stage of processing, data sources with similar structure and business rules were combined into a single table, reducing the number of tables the team had to maintain. Finally, the data was exposed via Hive tables.
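The original jobs were written in Pig and Hive rather than Spark, so purely to make the flow concrete, the sketch below approximates the landing steps (converting the landed JSON to Parquet, then combining similarly structured sources into a single table) in Spark/Scala, the stack adopted later in the refactor. The paths and table names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object LegacyFlowSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("legacy-flow-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Staged JSON from internal product teams and from external API pulls
    // (hypothetical paths).
    val internal = spark.read.json("hdfs:///staging/internal/enrollment/*.json")
    val external = spark.read.json("hdfs:///staging/external/enrollment/*.json")

    // The first step in the original pipelines: convert the landed JSON to Parquet.
    internal.write.mode("overwrite").parquet("hdfs:///landed/internal/enrollment")
    external.write.mode("overwrite").parquet("hdfs:///landed/external/enrollment")

    // Sources with similar structure and business rules were combined into a
    // single table to reduce the number of tables to maintain.
    val combined = internal.unionByName(external, allowMissingColumns = true)
    combined.write.mode("overwrite").format("parquet").saveAsTable("legacy.enrollment_combined")
  }
}
```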
Data Engineering Refactor Scope:
The original architecture evolved over time but continued to implement data pipelines using the older Hadoop technologies. The problem with this was twofold: (1) it was very difficult to onboard new data sources, and (2) the pipelines were non-performant and buggy, so the customer had to invest in 24x7 monitoring of the jobs, most of it done manually because of the short remediation windows required by their SLAs. There was a huge opportunity to refactor these pipelines.
New Architecture:
In the new architecture, Spark/Scala was used to rewrite the pipelines, with Airflow as the orchestration tool. The data was organized into three distinct processing layers that now allow us to view data at various stages of the pipeline. The first we call the valid data set layer. This is a faithful representation of the source data with basic schema and data validations applied, and it’s primarily used for retaining historical data. Each Hive table in this layer contains only one event type, where previously events were left embedded in JSON columns; this makes it easier to enforce a schema now that we’re handling only one event at a time. The second is the serve data set layer. The exact transformations here vary by source, but the focus is on preparing the valid data set for easy consumption. This is the first layer that we expose to our downstream teams. The third and final layer is our data marts. We apply specific business logic to our serve data sets to present a denormalized view of the data. This layer is used to build objects that support aggregated data sets and reporting, and it can combine one or more serve data sets, depending on the context of the events. This structure greatly improved data quality because schema validation and data quality checks can now be applied broadly across all the pipelines.
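As a rough illustration of how the three layers relate, the sketch below follows a single hypothetical event type from the valid layer through a serve table and into a data mart. The schema, columns, validation rules, and table names are assumptions for illustration; the actual transforms vary by source.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object EnrollmentLayersSketch {
  val spark: SparkSession = SparkSession.builder()
    .appName("enrollment-layers-sketch")
    .enableHiveSupport()
    .getOrCreate()

  // Hypothetical schema for a single event type; in practice this comes from
  // the schema files described in the next section.
  val enrollmentEventSchema: StructType = StructType(Seq(
    StructField("event_id",   StringType,    nullable = false),
    StructField("member_id",  StringType,    nullable = false),
    StructField("plan_code",  StringType,    nullable = true),
    StructField("event_time", TimestampType, nullable = false)
  ))

  // Valid layer: a faithful representation of the source with the schema
  // enforced and basic data validations applied, one event type per table.
  def buildValidLayer(rawJsonPath: String): DataFrame =
    spark.read.schema(enrollmentEventSchema).json(rawJsonPath)
      .filter(col("event_id").isNotNull && col("member_id").isNotNull)

  // Serve layer: prepare the valid data set for easy consumption
  // (deduplication, light derivations; the details vary by source).
  def buildServeLayer(valid: DataFrame): DataFrame =
    valid
      .dropDuplicates("event_id")
      .withColumn("event_date", to_date(col("event_time")))

  // Data mart layer: apply business logic and denormalize, potentially
  // combining one or more serve data sets.
  def buildDataMart(serve: DataFrame): DataFrame =
    serve.groupBy("plan_code", "event_date")
      .agg(countDistinct("member_id").as("enrolled_members"))

  def main(args: Array[String]): Unit = {
    val valid = buildValidLayer("hdfs:///landing/enrollment/*.json")
    valid.write.mode("overwrite").format("parquet").saveAsTable("valid.enrollment_event")

    val serve = buildServeLayer(valid)
    serve.write.mode("overwrite").format("parquet").saveAsTable("serve.enrollment_event")

    val mart = buildDataMart(serve)
    mart.write.mode("overwrite").format("parquet").saveAsTable("mart.enrollment_summary")
  }
}
```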
Config-Driven Ingestion:
In the new architecture, we drive the generation of each pipeline through three components. The first component is a YAML configuration file structured as a list of transforms, the configurations for those transforms, and the information needed to construct the actual pipeline; each transform configuration defines everything required to configure that step. The second component is a set of schema files that define how the data is structured in the three Hive layers throughout the pipeline. We use these schema files to create the actual Hive tables and to drive the schema validation step at the beginning of the pipeline. The third component is an orchestration library that parses the configuration files, constructs the ingest calls and other operational blocks, relates those steps together, and then constructs the Airflow DAG and tasks.
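A minimal sketch of what the configuration model might look like on the Scala side is shown below. The YAML keys, case classes, and snakeyaml-based parsing are assumptions for illustration, not the client's actual library; the Airflow DAG construction performed by the orchestration library is omitted.

```scala
import org.yaml.snakeyaml.Yaml
import java.io.FileInputStream
import scala.jdk.CollectionConverters._

// Example of the hypothetical YAML shape this loader expects:
//
//   pipeline_name: enrollment_events
//   schedule: "0 2 * * *"
//   transforms:
//     - name: schema_validation
//       options: { schema_file: "schemas/enrollment_event.yaml" }
//     - name: deduplicate
//       options: { keys: "event_id" }

final case class TransformConfig(name: String, options: Map[String, String])
final case class PipelineConfig(
  pipelineName: String,
  schedule: String,
  transforms: Seq[TransformConfig]
)

object PipelineConfigLoader {

  // Parse the YAML configuration file into the model above using snakeyaml.
  def load(path: String): PipelineConfig = {
    val root = new Yaml()
      .load[java.util.Map[String, AnyRef]](new FileInputStream(path))
      .asScala

    val transforms = root("transforms")
      .asInstanceOf[java.util.List[java.util.Map[String, AnyRef]]]
      .asScala
      .map { raw =>
        val t = raw.asScala
        TransformConfig(
          name = t("name").toString,
          options = t.getOrElse("options", new java.util.HashMap[String, AnyRef]())
            .asInstanceOf[java.util.Map[String, AnyRef]]
            .asScala
            .map { case (k, v) => k -> v.toString }
            .toMap
        )
      }
      .toSeq

    PipelineConfig(
      pipelineName = root("pipeline_name").toString,
      schedule = root("schedule").toString,
      transforms = transforms
    )
  }
}
```

The orchestration library would then walk the parsed transform list, map each transform name to an implementation, and emit one Airflow task per step, wiring the dependencies between them.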
Copyright © 2022 Quadratic Systems, Inc. - All Rights Reserved.