Algo

Data pipeline processing hundreds of millions of donation records into unified donor profiles

Work · Jan 2021 to Present · Architect & Lead

Problem

The existing analytics process was a chain of Stata scripts running on a single Mac desktop tower in the office. A full run took 2-3 weeks.
There was no change management. The data science team kept modifying the analytics scripts while a run was in progress, causing frequent breakages and restarts.
Observability was almost nonexistent. Reviewing intermediate data outputs required manual inspection, and failures were discovered late.
Donation data was scattered across incompatible sources: state databases, ActBlue, email platforms, SMS systems, and manual CSV imports, with no automated ingestion.
The same donor could appear under different names, addresses, and formats across sources with no shared identifier, leading to unreliable totals and duplicate profiles.
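To make the duplication problem concrete, here is a minimal sketch of why exact string comparison fails across sources while a normalized, fuzzy comparison does not. The sample names, the `normalize` and `similarity` helpers, and the use of the standard library's `difflib` are all illustrative assumptions, not the matching logic used in the actual pipeline.

```python
import re
from difflib import SequenceMatcher


def normalize(value: str) -> str:
    """Lowercase, drop punctuation, and sort tokens so word order is ignored."""
    cleaned = re.sub(r"[^\w\s]", "", value.lower())
    return " ".join(sorted(cleaned.split()))


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


# Hypothetical records for one donor as two sources might store them.
name_a = "SMITH, JONATHAN"   # e.g. a state database export
name_b = "Jonathon Smith"    # e.g. an email platform, with a typo

print(name_a == name_b)              # exact comparison treats them as different
print(similarity(name_a, name_b))    # fuzzy comparison scores them as close
```

A score alone does not decide a match; any threshold would need tuning against known duplicates, and names are only one of several fields worth comparing.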

Identified the scope of the migration early as a board advisor, then transitioned to CTO to lead the effort full-time with a dedicated engineering team.
Performed open-heart surgery on a moving target: moved data to the cloud first to support the Moosehead project, then worked inward from both ends, automating data ingestion upstream and replacing processing branches with Airflow DAGs and BigQuery SQL downstream.
Maintained business continuity throughout. The legacy Stata process ran in parallel for ~3 months after the new system reached feature parity, validating data outputs before a hard cutover.
Built a custom TaskGraph abstraction that decouples business logic from Airflow infrastructure, making SQL and Python tasks independently testable.
Automated connections to database replication instances and APIs to eliminate manual exports, reducing the number of manual data sinks required.
After the migration, invested heavily in identity resolution: better heuristics, fuzzy matching, and manual curation layers to significantly reduce profile duplication and improve data quality.
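The idea behind the TaskGraph abstraction mentioned above can be sketched as follows. The real TaskGraph API is not shown in this write-up, so every name here (`Task`, `TaskGraph`, `add`, `order`, `run_local`) is a hypothetical stand-in. The point is only that tasks and their dependencies are declared as plain Python objects, so they can be ordered and executed in a unit test without any Airflow installation.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Any, Callable, Dict, Iterable, List, Tuple


@dataclass
class Task:
    name: str
    run: Callable[[Dict[str, Any]], Any]  # receives results of earlier tasks
    upstream: Tuple[str, ...] = ()


@dataclass
class TaskGraph:
    tasks: Dict[str, Task] = field(default_factory=dict)

    def add(self, name: str, run: Callable[[Dict[str, Any]], Any],
            upstream: Iterable[str] = ()) -> None:
        if name in self.tasks:
            raise ValueError(f"duplicate task: {name}")
        self.tasks[name] = Task(name, run, tuple(upstream))

    def order(self) -> List[str]:
        """Return task names so that every task follows its upstream tasks."""
        for task in self.tasks.values():
            for dep in task.upstream:
                if dep not in self.tasks:
                    raise KeyError(f"{task.name} depends on unknown task {dep}")
        sorter = TopologicalSorter({t.name: t.upstream for t in self.tasks.values()})
        return list(sorter.static_order())

    def run_local(self) -> Dict[str, Any]:
        """Execute every task in dependency order, in-process."""
        results: Dict[str, Any] = {}
        for name in self.order():
            results[name] = self.tasks[name].run(results)
        return results


# Hypothetical two-step pipeline: load rows, then total the amounts.
graph = TaskGraph()
graph.add("load", lambda r: [{"donor": "a", "amount": 25}, {"donor": "a", "amount": 10}])
graph.add("total", lambda r: sum(row["amount"] for row in r["load"]), upstream=["load"])
results = graph.run_local()
print(graph.order(), results["total"])
```

In a design like this, a separate adapter (not shown) would translate the same graph into scheduler-specific operators, which keeps the business logic free of orchestration imports.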

Backend

Python, Apache Airflow

Data & Storage

BigQuery, DuckDB, Cloud Storage, Pub/Sub

Infrastructure

GCP, Cloud Composer, Cloud Functions, Terraform

Testing

pytest, syrupy, sqlglot

Other

Docker, GitHub Actions

Replaced a 2-3 week batch process with incremental daily ingestion and nightly analytics runs.
Initial build took ~15 months with a team that grew from 1 to 4 engineers plus 4 data scientists.
Processes 40M+ donor profiles into a unified warehouse via 200+ orchestrated pipeline tasks.
Powers all downstream analytics, reporting, and client deliverables at Grassroots.
Transformed data quality from opaque and error-prone to auditable: automated schema validation, snapshot testing, and Google Chat alerting replaced manual data inspection that previously required pausing the entire pipeline.
Enabled the company to scale from serving a handful of clients to a full enterprise sales motion.
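As an illustration of the automated schema validation mentioned above, a check of this kind can be as small as comparing an expected column-to-type mapping against what a table actually contains. This is a sketch under assumptions: the `validate_schema` helper, the column names, and the type strings are invented for the example and do not reflect the pipeline's real schemas or tooling.

```python
from typing import Dict, List


def validate_schema(expected: Dict[str, str], actual: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the schema matches."""
    problems: List[str] = []
    for column, expected_type in expected.items():
        if column not in actual:
            problems.append(f"missing column: {column}")
        elif actual[column] != expected_type:
            problems.append(
                f"type mismatch for {column}: expected {expected_type}, got {actual[column]}"
            )
    # Sort so the report is deterministic regardless of dict ordering.
    for column in sorted(set(actual) - set(expected)):
        problems.append(f"unexpected column: {column}")
    return problems


# Hypothetical expected schema and a drifted table.
expected = {"donor_id": "STRING", "amount": "NUMERIC", "donated_at": "TIMESTAMP"}
actual = {"donor_id": "STRING", "amount": "FLOAT64", "source": "STRING"}

for problem in validate_schema(expected, actual):
    print(problem)
```

A non-empty result is the kind of signal that could be routed to a chat alert so that drift is caught at the task that produced it, not during later manual inspection.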

Process time

2+ weeks → < 10 min

Donor profiles

25M → 40M+

Pipeline tasks

200+

Attributes per profile

600+