Building production-scale data pipelines usually involves wrangling outputs from multiple legacy systems. Whether you’re supporting business intelligence use cases, handling a system migration, or laying the foundations for a new data warehouse, chances are high that you’ll have to normalize and integrate the outputs of systems that were never designed to talk to one another.
Recently, we built a production-scale data pipeline that converts data from one enterprise system (Health Information Exchanges) into input for another (a claims-powered risk stratification algorithm). Although the two formats represent the same underlying event (clinical encounters), the two systems spoke completely different “languages”: different coding standards, different field definitions, and different expectations about which fields were required. The goal was not a one-off ETL script, but a reusable, production-ready pipeline that downstream applications could rely on.