--- title: "Advanced Workflows" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Advanced Workflows} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = '#>' ) ``` This vignette covers advanced `irelink` workflows for readers who have already worked through the Getting Started or Deduplication vignette. All examples use `fake_1000` and a shared in-memory DuckDB connection. ## Setup Train a complete model on `fake_1000` that will be reused across sections. ```{r setup} library(irelink) library(ggplot2) df <- fake_1000 con <- DBI::dbConnect(duckdb::duckdb()) spec <- il_spec() |> il_compare(first_name, cl_name()) |> il_compare(surname, cl_name()) |> il_compare(dob, cl_dob()) |> il_compare(city, cl_exact(term_frequency = TRUE)) |> il_compare(email, cl_email()) |> il_block_on(first_name) |> il_block_on(surname) |> il_block_on(city) model <- il_model(df, spec = spec, con = con) model <- il_estimate_prior( model, block_on(first_name, surname), block_on(email), recall = 0.6 ) model <- il_estimate_u(model, max_pairs = 1e5) model <- il_estimate_em(model, block_on(first_name)) model <- il_estimate_em(model, block_on(dob)) pairs <- predict(model, threshold = 0.5) clusters <- il_cluster(pairs, threshold = 0.85) ``` ## Training diagnostics `il_training_history()` returns the m and u estimates from each EM iteration across all training sessions. Plot it to check convergence: ```{r history, fig.width = 7, fig.height = 5} hist <- il_training_history(model) autoplot(hist) ``` A well-converged model has stable values in the final iterations. If the estimates still drift, run more EM passes with different blocking rules or expand the candidate pairs. ## Pair inspection `il_compare_records()` scores a single pair of records against the spec. Use it to see why a known match scores too low or a non-match scores too high. ```{r compare-records} rec_a <- fake_1000[1, ] rec_b <- fake_1000[5, ] il_compare_records(rec_a, rec_b, spec = model$spec, con = con) ``` The gamma columns show the comparison level reached on each field. Use `il_weights(model)` to see the match-weight contribution of each level. Use a waterfall chart to see how field-level scores combine into one match decision: ```{r waterfall, fig.width = 6, fig.height = 3} autoplot(pairs, which = 1) ``` ## Lazy prediction for large data `predict(collect = FALSE)` keeps scored pairs in the database instead of collecting them into R. This path requires DuckDB or PostgreSQL and is especially useful when materializing millions of rows would exhaust memory. The examples here use DuckDB, and `il_cluster()` detects the lazy reference so it can run connected-components analysis in SQL. ```{r lazy} pairs_lazy <- predict(model, threshold = 0.5, collect = FALSE) pairs_lazy ``` Pass the lazy reference directly to `il_cluster()`: ```{r lazy-cluster} clusters_lazy <- il_cluster(pairs_lazy, threshold = 0.85) nrow(clusters_lazy) ``` `autoplot()` and `il_waterfall()` collect automatically when needed, so downstream analysis code stays the same. Use the lazy path on DuckDB or PostgreSQL when candidate-pair counts exceed available memory. The lazy prediction table is model-scoped, and `il_cleanup(model)` removes it along with the model's source and term-frequency tables. ## Chunked u estimation and SQL profiling For larger datasets, `il_estimate_u()` can accumulate random-pair gamma counts in chunks and stop once every comparison level has enough support: ```{r chunked-u, eval = FALSE} model <- il_estimate_u( model, max_pairs = 5e6, chunk_size = 250000, min_count_per_level = 100 ) model$params$u_estimation ``` When you need to inspect database performance, set `profile_sql = TRUE` on `il_estimate_u()`, `il_estimate_prior()`, or `predict()` to collect lightweight SQL timing metadata: ```{r sql-profile, eval = FALSE} pairs <- predict(model, threshold = 0.5, profile_sql = TRUE) attr(pairs, 'sql_profile') ``` ## Cluster diagnostics `il_graph_metrics()` computes node-level, edge-level, and cluster-level summaries from the linkage graph. Use it to spot thresholds that are too loose or clusters with unexpectedly low density. ```{r graph-metrics} metrics <- il_graph_metrics(pairs, clusters) ``` The cluster table reports size and internal edge density, which is the number of edges divided by the maximum possible number for a cluster of that size: ```{r graph-clusters} metrics$clusters ``` A large maximum cluster size with low density often means a transitive link is pulling unrelated entities together. Consider raising the threshold in `predict()` or `il_cluster()` and then checking the metrics again. The node table shows how many links each record participates in: ```{r graph-nodes} head(metrics$nodes) ``` Records with unusually high degree relative to their cluster size may be acting as hubs that inflate the cluster beyond its true membership. ## Phonetic blocking Standard equality blocking misses pairs where names sound alike but are spelled differently. Examples include "Smith" and "Smyth" or "Jon" and "John". Pass `.transform = il_soundex` to `il_block_on()` or `block_on()` to group names by phonetic code instead of exact spelling. ```{r phonetic-spec} spec_phon <- il_spec() |> il_compare(first_name, cl_name()) |> il_compare(surname, cl_name()) |> il_compare(dob, cl_dob()) |> il_block_on(first_name, .transform = il_soundex) |> il_block_on(surname, .transform = il_soundex) ``` Use the same `.transform` argument when you specify a blocking rule for an EM training pass: ```{r phonetic-train} model_phon <- il_model(df, spec = spec_phon, con = con) model_phon <- il_estimate_u(model_phon, max_pairs = 1e5) model_phon <- il_estimate_em( model_phon, block_on(first_name, .transform = il_soundex) ) ``` Phonetic blocking usually improves recall, but it also increases the number of candidate pairs. Use `il_count_pairs()` to check that trade-off before you commit to a spec. ## Column transforms The `transform` argument in `il_compare()` applies a function to both values before scoring. Use it to normalize case or remove whitespace before a similarity comparison: ```{r transform-spec} spec_tr <- il_spec() |> il_compare(first_name, cl_jaro_winkler(0.9, 0.7), transform = tolower) |> il_compare(surname, cl_jaro_winkler(0.9, 0.7), transform = tolower) |> il_compare(dob, cl_exact()) |> il_block_on(first_name) |> il_block_on(surname) model_tr <- il_model(df, spec = spec_tr, con = con) model_tr <- il_estimate_u(model_tr, max_pairs = 1e5) model_tr <- il_estimate_em(model_tr, block_on(surname)) ``` `tolower`, `toupper`, and `trimws` are translated to SQL on DuckDB and PostgreSQL, so they run in the database. Custom R functions only work on the R-side path. When you save a model with `il_save()`, `.rds` stores the R object as is, while `.json` writes Splink settings SQL so loaded comparisons come back as SQL-backed levels. Anonymous functions still produce a warning on save. ## Incremental matching `il_find_matches()` scores new records against data that is already loaded into a trained model, using the same blocking rules and comparison spec. ```{r find-matches} new_df <- data.frame( first_name = c('Jhon', 'Alice'), surname = c('Smith', 'Jones'), dob = c('1990-01-15', '1985-06-20'), city = c('London', 'Manchester'), email = c(NA, 'ajones@example.com') ) matches <- il_find_matches(model, new_df, threshold = 0.5) matches ``` Each row is a (new record, existing record) pair. `unique_id_l` identifies the new record (auto-assigned starting from 1) and `unique_id_r` identifies the matched record in the original dataset. This works well with `il_load()` and `il_attach()`. Load a saved model, attach it to the current database, and call `il_find_matches()` for each incoming batch of new records. ## Cleanup `il_cleanup(model)` only removes tables owned by that model, so it is safe when several models share the same connection. `il_cleanup_all(con)` is broader and is best reserved for failed runs or exploratory sessions where you want to clear every `irelink` table before disconnecting. ```{r cleanup} il_cleanup(model) il_cleanup(model_phon) il_cleanup(model_tr) DBI::dbDisconnect(con, shutdown = TRUE) ```