Consumable Enterprise Data Lake

CASE STUDY

Case Study Overview:

A Fortune 100 firm set out to migrate its enterprise data lake from a Cloudera platform to a new data platform, primarily to (a) reduce cost, (b) adopt a modern data architecture, and (c) importantly, make the data lake easy to govern and consume.

Business Challenges:

The customer's data lake had been in development for many years. Through iterative development and on-the-fly data governance, the lake had become very difficult to consume because it lacked proper metadata management, access provisioning, and cataloging. The lake was built on a Hadoop cluster, and with costs continuing to grow, the customer needed to re-evaluate its technology and platform options.

Solutions Delivered:

Quadratic Systems was the primary partner for designing and implementing the data lake solution on a new platform comprising:

(1) On-prem S3 object store (Scality) for data storage (replacing HDFS)

(2) Spark/Scala on Kubernetes containers (CaaS platform)

(3) Dremio as a query tool (replacing Hive/Impala)

Scality was chosen for data storage and a Kubernetes-based CaaS platform for compute, together replacing the existing Hadoop environment.
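
As a rough illustration of how Spark on Kubernetes can be pointed at an S3-compatible object store instead of HDFS, the sketch below assembles the relevant Spark properties as a plain Python dictionary. The property keys are standard Spark and Hadoop S3A settings; the helper names, API server URL, image name, and endpoint are hypothetical placeholders, not values from this engagement.

```python
from typing import Dict, List


def build_spark_conf(k8s_api_url: str, image: str, s3_endpoint: str) -> Dict[str, str]:
    """Assemble Spark properties for Kubernetes compute over an S3-compatible store.

    All argument values are illustrative. Credentials are deliberately left out;
    they would normally come from a secret store rather than code.
    """
    return {
        # Run the driver and executors as pods on the Kubernetes cluster
        "spark.master": f"k8s://{k8s_api_url}",
        "spark.kubernetes.container.image": image,
        # Point the Hadoop S3A connector at the on-prem endpoint
        "spark.hadoop.fs.s3a.endpoint": s3_endpoint,
        # Assumption: the on-prem store is addressed path-style, not by virtual host
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render the properties as spark-submit --conf arguments, sorted by key."""
    args: List[str] = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


conf = build_spark_conf(
    k8s_api_url="https://k8s.example.internal:6443",  # hypothetical
    image="registry.example.internal/spark:latest",   # hypothetical
    s3_endpoint="https://s3.example.internal",        # hypothetical
)
submit_args = to_submit_args(conf)
```

With settings along these lines, jobs read and write `s3a://` paths where they previously used `hdfs://` paths.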

Data Governance:

Data quality was addressed with a Balance and Controls and Reconciliation framework
Data access and security were provisioned through ServiceNow integration
Metadata management was handled through integration with Informatica EDC
Incremental feeds to the lake were recorded as metadata, so consumers could easily find new data as it became available
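
To make the balance-and-control idea concrete, here is a minimal sketch that compares control totals declared by a source system against totals computed from the data that landed in the lake. It is an assumption about how such a check can work, not the framework built for this customer; the class, function, and field names are hypothetical.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Iterable, List, Mapping


@dataclass(frozen=True)
class ControlTotals:
    """Row count and amount total for one feed delivery."""
    row_count: int
    amount_sum: Decimal


def compute_totals(rows: Iterable[Mapping[str, str]], amount_field: str) -> ControlTotals:
    """Compute control totals from landed rows, using Decimal to avoid float drift."""
    count = 0
    total = Decimal("0")
    for row in rows:
        count += 1
        total += Decimal(row[amount_field])
    return ControlTotals(count, total)


def reconcile(source: ControlTotals, landed: ControlTotals) -> List[str]:
    """Return human-readable breaks; an empty list means the feed balanced."""
    breaks = []
    if source.row_count != landed.row_count:
        breaks.append(f"row count: source={source.row_count} landed={landed.row_count}")
    if source.amount_sum != landed.amount_sum:
        breaks.append(f"amount sum: source={source.amount_sum} landed={landed.amount_sum}")
    return breaks


# Totals as declared by the (hypothetical) system of record
source_totals = ControlTotals(row_count=3, amount_sum=Decimal("60.50"))
# Rows as they arrived in the lake
landed_rows = [{"amount": "10.00"}, {"amount": "20.50"}, {"amount": "30.00"}]
landed_totals = compute_totals(landed_rows, "amount")
breaks = reconcile(source_totals, landed_totals)
```

A feed with an empty `breaks` list can be published to consumers; a non-empty list would typically hold the feed back for investigation.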

Data Ingestion Framework:

A repeatable, config-driven framework written in Scala enabled systems of record to onboard to the lake quickly
Kafka integration and Spark Streaming provided near-real-time ingestion into the lake
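
The config-driven approach can be sketched as follows. The original framework was written in Scala; Python is used here only to keep the illustration compact. Every field name, format key, and path is hypothetical, and the planners return a description of the work rather than running Spark jobs.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class FeedConfig:
    """One onboarded feed, described entirely by configuration."""
    name: str
    source_format: str
    source_path: str
    target_path: str


def load_feed_configs(raw_json: str) -> List[FeedConfig]:
    """Parse the feed list from a JSON document with a top-level "feeds" array."""
    return [FeedConfig(**entry) for entry in json.loads(raw_json)["feeds"]]


Planner = Callable[[FeedConfig], str]

# Supporting a new source type means registering one planner here;
# onboarding a new feed of an existing type needs only a config entry.
PLANNERS: Dict[str, Planner] = {
    "csv": lambda f: f"read csv {f.source_path} -> write {f.target_path}",
    "kafka": lambda f: f"stream topic {f.source_path} -> append {f.target_path}",
}


def plan_ingestion(feeds: List[FeedConfig]) -> List[str]:
    """Turn each feed config into an ingestion plan, rejecting unknown formats."""
    plans = []
    for feed in feeds:
        planner = PLANNERS.get(feed.source_format)
        if planner is None:
            raise ValueError(
                f"unsupported source_format {feed.source_format!r} for feed {feed.name!r}"
            )
        plans.append(planner(feed))
    return plans


CONFIG = """
{
  "feeds": [
    {"name": "orders", "source_format": "csv",
     "source_path": "/landing/orders", "target_path": "s3a://lake/raw/orders"},
    {"name": "clicks", "source_format": "kafka",
     "source_path": "clicks-topic", "target_path": "s3a://lake/raw/clicks"}
  ]
}
"""

feeds = load_feed_configs(CONFIG)
plans = plan_ingestion(feeds)
```

Keeping batch and streaming sources behind the same config shape is one way a single framework can serve both file-based feeds and Kafka topics.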

Data Consumption:

Dremio was made available so users could easily consume data on Scality for exploration and reporting
Advanced users employed PySpark for analytical consumption

Enabling Technology:

On-Prem S3 (Scality)
Spark on Kubernetes
Dremio
Scala / PySpark