Why this exists
Our pipelines use targets for reproducibility, caching, and correct invalidation. That works well for files and in-memory R objects. It does not work automatically for database tables.
A dplyr::tbl() is just a pointer to “whatever is currently in the database.” If the underlying table changes, targets has no way to know. Without additional structure, downstream targets can appear up to date even when the data they depend on has been rebuilt, corrected, or backfilled.
This is a major point of failure when using an RDB with targets.
The make_table_hash_target() factory in the NSWERSutils package exists to make database state visible to targets so that pipelines rerun when and only when the underlying tables actually change.
If you are using RDB tables in a targets pipeline, you are expected to use this pattern.
What the factory does conceptually
When you call make_table_hash_target(), you are not loading data. You are defining a contract between targets and the database.
For each table name you provide, the factory creates:
- A table content signature target
  - This target runs a lightweight SQL checksum over the table.
  - The result is a stable, compact hash that represents the table's contents at that point in time.
  - This is the mechanism by which database mutations become visible to targets.
- A corresponding _tbl target
  - This target returns a dplyr::tbl() wrapper for use in downstream analysis.
  - It is the object you actually use in downstream code.
  - Its validity is tied to the table's hash, not just to the R expression that created it.
- A tracked connection function
  - The database connection function itself is a target.
  - If the connection logic changes, downstream targets are invalidated appropriately.
The key idea is simple: every live database table gets a small, cheap fingerprint that targets can track.
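To make the fingerprint idea concrete, here is a minimal, self-contained sketch in base R. A data frame stands in for a database table, and table_fingerprint is a hypothetical helper for illustration only; the real factory computes its checksum in SQL on the server rather than pulling rows into R.

```r
# Hypothetical illustration of a content fingerprint. A data frame stands in
# for a database table; the real factory runs a checksum in SQL instead.
table_fingerprint <- function(df) {
  tmp <- tempfile(fileext = ".csv")
  on.exit(unlink(tmp))
  # Serialize the contents deterministically, then hash the bytes.
  write.csv(df, tmp, row.names = FALSE)
  unname(tools::md5sum(tmp))
}

students <- data.frame(id = 1:3, grade = c(9L, 10L, 11L))
backfilled <- students
backfilled$grade[2] <- 12L  # simulate a correction applied in the database

table_fingerprint(students) == table_fingerprint(students)   # stable across calls
table_fingerprint(students) == table_fingerprint(backfilled) # change is detected
```

The same data always yields the same hash, and any mutation yields a different one, which is exactly the property targets needs to decide whether downstream work is stale.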
Why this is mandatory in our pipelines
Without this pattern:
- targets cannot detect changes inside the database.
- Cached results can silently become wrong.
- Rebuilds depend on luck and manual intervention.
With this pattern:
- Table rebuilds, backfills, and corrections invalidate downstream work automatically.
- Pipelines are idempotent and reproducible.
- We can safely mix live RDB data with cached analytical results.
This is the only supported way to use database tables in targets for our work.
How we use it in practice
At the top of a pipeline, we define all required RDB tables in one place:
load_rdb_tables <- make_table_hash_target(
  c(
    "k12_students",
    "k12_enrollment",
    "k12_attendance",
    "k12_courses",
    "wh_student",
    "ps4_terms",
    "nsc_postsecondary"
  ),
  db_con_fun = env_con,
  .cue_mode.table_targets = "always",
  .cue_mode.tbl_targets = "thorough"
)
This does three important things:
- Declares the full set of database tables the pipeline depends on.
- Forces table hashes to be checked every run.
- Produces one _tbl target per table for downstream use.
Downstream targets must use the _tbl objects:
list(
  load_rdb_tables,
  tar_target(
    blah,
    do_thing(k12_students_tbl)
  )
)
You should never call dplyr::tbl() or open ad hoc database connections inside analytic targets. If a table is not declared here, it is not a supported dependency.
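For intuition, the factory call behaves roughly as if you had written one hash target and one _tbl target per table by hand. The sketch below is a hypothetical expansion, not the NSWERSutils implementation: compute_table_hash is an assumed helper, and env_con is the connection function from the pipeline setup.

```r
# Hypothetical hand-written equivalent of what the factory generates for one
# table. compute_table_hash() is an assumed helper; the real internals of
# make_table_hash_target() may differ.
library(targets)

rdb_targets <- list(
  # Content signature: rechecked on every run so database mutations surface.
  tar_target(
    k12_students,
    compute_table_hash(env_con(), "k12_students"),
    cue = tar_cue(mode = "always")
  ),
  # Lazy tbl() handle for analysis. Referencing the hash target in the
  # command ties this target's validity to the table's contents.
  tar_target(
    k12_students_tbl,
    {
      k12_students  # dependency on the content hash
      dplyr::tbl(env_con(), "k12_students")
    }
  )
)
```

The factory removes the need to repeat this boilerplate and guarantees the hash-to-tbl dependency is wired the same way for every table.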
Naming conventions are enforced by design
This factory always obeys our naming conventions. There is no flexibility here, and that is intentional.
For a table named:
k12_students
the factory will always create:
- k12_students: the table content hash target.
- k12_students_tbl: the dplyr::tbl() target used in analysis.
This consistency allows us to:
- Reason about dependency graphs without inspecting code.
- Predict target names without searching.
- Enforce review standards across pipelines.
- Avoid one-off naming patterns that break tooling and conventions.
If you follow this pattern, naming is automatic and correct.
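The naming rule is mechanical enough to state in code. This hypothetical helper (not part of NSWERSutils) just makes the convention explicit:

```r
# Hypothetical helper illustrating the deterministic naming rule: the hash
# target reuses the table name, and the analysis target appends "_tbl".
rdb_target_names <- function(tables) {
  data.frame(
    table       = tables,
    hash_target = tables,
    tbl_target  = paste0(tables, "_tbl")
  )
}

rdb_target_names(c("k12_students", "wh_student"))
```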
Cueing behavior and expectations
- Table hash targets are typically run with mode = "always".
  - This ensures we always detect changes in live tables.
  - You can switch the mode to "thorough" to save time during development, but PR reviewers should confirm that the final code always pulls the most up-to-date information from the database.
  - The SQL involved is cheap relative to downstream computation.
- _tbl targets use normal or thorough cueing.
  - They rebuild only when their dependencies indicate change.
  - They are safe to cache because their validity is anchored to table state.
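The two cue modes described above map directly onto targets::tar_cue(); a minimal sketch, assuming the targets package is available:

```r
library(targets)

# Hash targets: recheck on every run, so live-table changes always surface.
hash_cue <- tar_cue(mode = "always")

# _tbl targets: rebuild only when upstream dependencies (including the
# table's hash) indicate a change.
tbl_cue <- tar_cue(mode = "thorough")
```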