Data Integration Handbook¶
Upstream-to-two-tables mapping and validation for the Data Integration Team. Currently rendered against Your Institution (spec_example).
This handbook backs the feeds behind every dashboard — the
upstream ETL that populates transactions and daily_balances at
Your Institution. Each entry here is a task the Data
Integration Team actually does: map a source table, prove the feed
is sound, tag a forced Fed posting, extend the metadata contract,
or debug a load that made it to the tables but not to the
dashboards.
The team's job¶
The Data Integration Team owns the projection from the institution's upstream systems (core banking, Fed statements, processor reports, sweep engines) into the two base tables every shipped dashboard reads. Their attitude, in their own words:
> What do I have a database server that can do fancy queries for unless I use it?
That attitude drives the whole schema. The contract is
deliberately small — two tables, ~11 mandatory columns,
JSON-string metadata for per-transfer_type extras — so the
team can spend their effort on projection correctness instead of
wrangling a sprawling normalized schema. The fancy queries
(drift, rollups, transfer net-zero) live in computed views the
database runs on demand.
The contract¶
Two tables feed everything:
- transactions — one row per money-movement leg. 11 mandatory columns + conditional extras + a metadata JSON column.
- daily_balances — one row per (account_id, balance_date). Stored EOD balance + a metadata JSON column for per-day configuration (limit-schedule payloads live here).
Every shipped dashboard (L1, L2 Flow Tracing, Investigation,
Executives) reads from these two tables. account_type and
transfer_type discriminate which slice each app cares about;
the schema itself is shared. Full column contract, per-column
failure modes, metadata catalog, and ETL examples:
- Schema v6 — Data Feed Contract — the source-of-truth document. Read the Getting Started for Data Teams preamble first.
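For orientation, a minimal DDL sketch of that shape. Only the columns this handbook discusses appear under their real names; the rest are illustrative stand-ins, and Schema v6, not this sketch, is authoritative:

```sql
-- Minimal sketch of the two-table contract, PostgreSQL flavor.
-- Rendered instances prefix the names (e.g. spec_example_transactions).
-- transaction_id, amount, direction, closing_balance are hypothetical.
CREATE TABLE transactions (
    transaction_id       BIGINT PRIMARY KEY,       -- hypothetical surrogate key
    account_id           TEXT      NOT NULL,
    account_type         TEXT      NOT NULL,       -- discriminates dashboard slices
    transfer_type        TEXT      NOT NULL,
    parent_transfer_id   TEXT,                     -- chains multi-leg transfers
    origin               TEXT,                     -- e.g. force-posted Fed-statement ingest
    amount               NUMERIC(18,2) NOT NULL,   -- illustrative
    direction            TEXT      NOT NULL,       -- 'debit' / 'credit', illustrative
    posted_at            TIMESTAMP NOT NULL,
    expected_complete_at TIMESTAMP,                -- optional lateness deadline
    metadata             TEXT                      -- JSON string: per-transfer_type extras
);

CREATE TABLE daily_balances (
    account_id      TEXT NOT NULL,
    balance_date    DATE NOT NULL,
    closing_balance NUMERIC(18,2) NOT NULL,        -- stored EOD balance, illustrative name
    metadata        TEXT,                          -- JSON string: per-day configuration
    PRIMARY KEY (account_id, balance_date)
);
```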
Optional: expected_complete_at (lateness)¶
transactions carries an optional expected_complete_at TIMESTAMP
column. Populate it when your ETL knows the rail's settlement
window — instant rails (Fed wire, on-us internal) same-day, ACH
T+2, cards T+3. When NULL, downstream views fall back to
posted_at + INTERVAL '1 day' via COALESCE, so omitting the
column is safe.
Why bother populating it? The is_late predicate projected by
the L1 Exceptions sheets is built on the same COALESCE
expression. A populated expected_complete_at gives the analyst a
per-rail-accurate deadline; an unpopulated one falls back to the
conservative one-day default (which over-fires, surfacing things
that aren't really late yet, rather than hiding overdue rows).
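A sketch of that fallback-plus-predicate shape (the exact SQL, including whatever settlement-status filter the real predicate carries, lives in Lateness as data):

```sql
-- Deadline: a populated expected_complete_at wins; otherwise fall
-- back to the conservative one-day default over posted_at.
-- Illustrative shape only; the real predicate presumably also
-- excludes rows that have already settled.
SELECT
    t.*,
    COALESCE(t.expected_complete_at,
             t.posted_at + INTERVAL '1 day') AS deadline,
    CURRENT_TIMESTAMP
        > COALESCE(t.expected_complete_at,
                   t.posted_at + INTERVAL '1 day') AS is_late
FROM transactions t;
```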
Adopt incrementally: pick the rail your team gets the most
"is this really late or just slow?" questions about, populate that
one first, leave the rest NULL.
For multi-leg transfers, downstream views collapse to the
earliest debit leg's expected_complete_at as the
transfer-level deadline. You don't need to denormalize this across
all legs of a transfer — just populate the leg(s) you have rail
data for; the views handle the join.
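One plausible reading of that collapse, sketched under assumed column encodings (the real tie-breaker query is in Lateness as data):

```sql
-- Per transfer, take the earliest debit leg and use its
-- expected_complete_at as the transfer-level deadline.
-- DISTINCT ON is PostgreSQL-specific; direction = 'debit'
-- is an assumed encoding.
SELECT DISTINCT ON (parent_transfer_id)
       parent_transfer_id,
       expected_complete_at AS transfer_deadline
FROM transactions
WHERE direction = 'debit'
  AND parent_transfer_id IS NOT NULL
ORDER BY parent_transfer_id, posted_at;   -- earliest debit leg wins
```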
See Lateness as data for the
default formula, the is_late predicate SQL, and the
multi-leg tie-breaker query.
Several materialized views sit on top of these tables — the L1
invariant matviews (spec_example_drift, spec_example_overdraft,
spec_example_limit_breach, spec_example_stuck_pending,
spec_example_stuck_unbundled, spec_example_todays_exceptions) plus
the Investigation cluster (spec_example_inv_pair_rolling_anomalies
feeds Volume Anomalies; spec_example_inv_money_trail_edges feeds
Money Trail and Account Network — recursive walk over
parent_transfer_id). None are auto-refreshed: every ETL load
must run REFRESH MATERIALIZED VIEW on each, or the operator-
facing aging / anomaly / chain columns will lag. The dependency-
ordered statements come from
common/l2/schema.refresh_matviews_sql(l2_instance). See
Materialized views for the
full refresh contract.
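Concretely, the emitted statements are plain refreshes, one per matview, in dependency order. A sketch, assuming the names above; always take the real list and ordering from the helper rather than hand-maintaining it:

```sql
-- Run after every load that touches the base tables. Generate
-- the real, dependency-ordered list with
-- common/l2/schema.refresh_matviews_sql(l2_instance); the order
-- shown here is illustrative.
REFRESH MATERIALIZED VIEW spec_example_drift;
REFRESH MATERIALIZED VIEW spec_example_overdraft;
REFRESH MATERIALIZED VIEW spec_example_limit_breach;
REFRESH MATERIALIZED VIEW spec_example_stuck_pending;
REFRESH MATERIALIZED VIEW spec_example_stuck_unbundled;
REFRESH MATERIALIZED VIEW spec_example_todays_exceptions;
REFRESH MATERIALIZED VIEW spec_example_inv_pair_rolling_anomalies;
REFRESH MATERIALIZED VIEW spec_example_inv_money_trail_edges;
```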
Foundational walkthroughs¶
Start here — populate and validate the feed
How do I populate transactions from my core banking system?
Canonical projection from a hypothetical `gl_postings` source table into the two-table target. The first walkthrough a new team member reads.
How do I prove my ETL is working before going live?
Three pre-flight invariants (net-to-zero, balance recompute, orphan chains) with copy-paste SQL. Run these before the dashboard sees the data. The net-to-zero check is sketched after this list.
How do I validate a single account-day after a load?
Open the Daily Statement sheet on a specific `(account_id, balance_date)` to confirm opening, debits, credits, closing, and zero drift — the per-row companion to the universal pre-flight invariants.
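For flavor, the first of those pre-flight invariants (net-to-zero across the legs of each transfer) might look like the sketch below; the sign convention is assumed, and the walkthrough's copy-paste SQL is authoritative:

```sql
-- Net-to-zero: the legs of every transfer must sum to zero.
-- Assumes debits carry negative sign via a direction column
-- (illustrative); any row this returns is a broken transfer.
SELECT parent_transfer_id,
       SUM(CASE WHEN direction = 'debit' THEN -amount ELSE amount END) AS net
FROM transactions
WHERE parent_transfer_id IS NOT NULL
GROUP BY parent_transfer_id
HAVING SUM(CASE WHEN direction = 'debit' THEN -amount ELSE amount END) <> 0;
```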
Extension walkthroughs¶
When the feed shape changes — new source, new key
How do I tag a force-posted external transfer correctly?
The `origin` column + `parent_transfer_id` chain mechanics for Fed-statement ingest. Why force-posted matters for L1 exception classification.
How do I add a metadata key without breaking the dashboards?
Extension contract: portable scalar types, `JSON_OBJECT` writes / `JSON_VALUE` reads, Schema_v6 catalog update. Walks an `originating_branch` addition end-to-end.
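Sketched, the write/read pair for that `originating_branch` example (SQL-standard JSON_OBJECT / JSON_VALUE syntax; the key column and branch code are hypothetical, and flavors differ on return types, so cast where needed):

```sql
-- Write side: metadata is a JSON string built with JSON_OBJECT.
-- 'BR-0421' and transaction_id are illustrative; a real ETL would
-- merge into existing metadata keys rather than overwrite.
UPDATE transactions
SET metadata = JSON_OBJECT('originating_branch' VALUE 'BR-0421')
WHERE transaction_id = 42;

-- Read side: dashboards pull the scalar back out with JSON_VALUE.
SELECT transaction_id,
       JSON_VALUE(metadata, '$.originating_branch') AS originating_branch
FROM transactions;
```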
Debug walkthroughs¶
When the feed looks right but the dashboard doesn't
The exemplary helper¶
quicksight-gen data etl-example emits canonical INSERT patterns
the team can copy from when building a new ETL job:
quicksight-gen data etl-example -o etl-examples.sql
The current pattern set covers every base-table shape Investigation walks; the L1 + L2 Flow Tracing apps read the same base tables and need no app-specific patterns at this writing.
Every block carries a -- WHY: header that names the business
invariant the pattern protects, and a -- Consumed by: header that
names the dashboard view that reads the resulting rows. Strip the
-EXAMPLE sentinel suffix and wire the column projections to your
upstream feed's source fields.
Troubleshooting¶
Symptom: dashboards show data older than expected¶
A sheet is rendering, but the most recent rows look stale — yesterday's postings are missing, the Drift KPI hasn't budged since the last load, the Daily Statement walks stop a day or two behind the date you just ETL'd. Your first instinct is to suspect an ETL bug or a query bug. Before you go there, check whether the matviews are simply behind.
First diagnostic step: open the App Info sheet¶
Every shipped dashboard's last tab is Info (the App Info canary). Open it. The right-hand visual is the Matview Status table — one row per base table and per matview the dashboard reads, with three columns:
- view_name — the table or matview name.
- row_count — total rows in that table / matview.
- latest_date — MAX(<date_col>) for that row, i.e. the most recent business day the table / matview knows about.
Compare the latest_date values:
- The two base tables — spec_example_transactions and spec_example_daily_balances — reflect how fresh the source data is. Their latest_date is the high-water mark your ETL has loaded.
- Every matview row below them should carry the same latest_date (or within one ETL cycle of it).
- If a matview's latest_date is significantly older than the base tables, that matview is stale relative to the loaded data. The dashboard sheets reading that matview will look like the load never happened.
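To reproduce one of those rows straight from SQL, a sketch (the shape is assumed, and the date column differs per view):

```sql
-- One Matview Status row computed by hand. posted_at is the date
-- column for transactions; each matview has its own.
SELECT 'spec_example_transactions'  AS view_name,
       COUNT(*)                     AS row_count,
       CAST(MAX(posted_at) AS DATE) AS latest_date
FROM spec_example_transactions;
```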
Root cause¶
PostgreSQL and Oracle materialized views do not auto-refresh.
They only update when explicitly told to via
REFRESH MATERIALIZED VIEW. An ETL load that writes new rows to
spec_example_transactions and spec_example_daily_balances but doesn't
follow up with the refresh statements leaves every matview pointed
at the previous load — the dashboards visualize a stale snapshot
and look unchanged.
Fix¶
Run the refresh, then reload the dashboard:
- Demo flow: quicksight-gen data refresh --execute.
- Production ETL: include the output of refresh_matviews_sql(l2_instance) after every transactions / daily_balances write. The helper emits the dependency-ordered statements for the L2 instance you pass it; call it from your load orchestrator.
A browser reload is enough — every dataset is Direct Query, no SPICE invalidation step.
Related: the Liveness KPI is also blank¶
While you're on the Info sheet, glance at the Liveness KPI
on the left. It runs a real query against the database catalog. If
it shows a number, QuickSight's rendering pipeline is healthy and
any blank visual elsewhere on the dashboard is a data or SQL issue
(stale matviews, an empty filter, an unexpected WHERE narrowing).
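The genre of query meant here, as a sketch (the actual Liveness SQL belongs to the dashboard definition, not this handbook):

```sql
-- A trivial catalog probe: if this renders a number, the datasource,
-- the connection, and the rendering path in front of it are alive.
SELECT COUNT(*) AS live_tables
FROM information_schema.tables
WHERE table_schema = 'public';   -- schema name assumed
```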
If the Liveness KPI is also blank, QuickSight itself has
hung — every visual on every sheet stuck on the spinner with no
error banner. That's a separate failure mode; the fix is to wait
it out, open in a fresh incognito window, or force a full
delete-then-create of the QuickSight resource graph (theme,
datasource, datasets, analysis, dashboard) plus a clean re-seed
and matview refresh.
Reference¶
- Schema v6 — Data Feed Contract — column specs, metadata keys, ETL examples. The source of truth this handbook points at.
- Account Structure — the bank, customers, accounts, and money flows the populated data represents.
- L1 Reconciliation Dashboard — the operator-facing dashboard your feeds serve.
- Investigation — the AML/compliance dashboard your feeds serve.