---
title: "Translating from fastLink"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Translating from fastLink}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = '#>'
)
```

`irelink` implements the same Fellegi-Sunter probabilistic record linkage framework as [fastLink](https://github.com/kosukeimai/fastLink), but it uses a different API and a SQL backend. This vignette maps common fastLink patterns to `irelink` so you can get started quickly.

## Design differences

fastLink bundles data preparation, EM estimation, and matching into a single `fastLink()` call. `irelink` breaks that work into a pipeline of composable functions where you define a spec, build a model, estimate parameters, and then predict.

fastLink's Jaro-Winkler comparisons produce three agreement levels, and `cut.a` and `cut.p` control the thresholds. `irelink` uses `cl_jaro_winkler(high, low)` to express the same thresholds.

fastLink's `getMatches()` assigns a `dedupe.ids` column to flag duplicates. `irelink` uses `il_cluster()` instead, which assigns a `cluster_id` to each record.

## Core workflow

| Step | fastLink | irelink |
|------|-------------|-------------|
| Define comparisons | `varnames`, `stringdist.match`, `partial.match` in `fastLink()` | `il_spec() |>`<br>`il_compare(...) |>`<br>`il_block_on(...)` |
| Estimate and match | `fastLink(dfA, dfB, ...)` | `il_model(df, spec, con) |>`<br>`il_estimate_u() |>`<br>`il_estimate_em(block_on(...))` |
| Set match threshold | `threshold.match` in `fastLink()` | `threshold` in `predict()` |
| Get matched records | `getMatches(dfA, dfB, fl.out)` | `il_cluster(pairs)` |
| Evaluate | manual confusion table | `il_cluster_confusion_matrix(model, labels_col, threshold)` |
| Block by variable | `blockData(dfA, dfB, varnames)` + loop | `il_block_on(var)` in spec |
| Numeric comparison | `numeric.match`, `cut.a.num` | `cl_numeric_diff(threshold)` |
| Inspect parameters | `out$EM$patterns.w` | `autoplot(model, type = 'parameters')` |
| Set prevalence prior | Not built in | `il_prior_prevalence(model, probability)` |
| Save / load model | Not built in | `il_save(model, path)` / `il_load(path)` |

## Comparison functions

fastLink compares string fields with Jaro-Winkler by default, or with Levenshtein or Jaro. These produce up to three agreement levels. Each one maps to a `cl_*()` function in `irelink`.

| fastLink | irelink |
|----------|---------|
| JW (default), `cut.a`, `cut.p` | `cl_jaro_winkler(high, low)` |
| `stringdist.method = "jaro"`, `cut.a`, `cut.p` | `cl_jaro(high, low)` |
| `stringdist.method = "lv"`, `cut.a`, `cut.p` | `cl_levenshtein(low, high)` |
| `numeric.match`, `cut.a.num` | `cl_numeric_diff(threshold)` |
| exact agreement on non-string fields | `cl_exact()` |

Levenshtein thresholds in `irelink` are raw edit distances. They are not renormalized similarity scores as in fastLink. `cl_levenshtein(1, 2)` means "distance <= 1 is full agreement, and distance <= 2 is partial agreement."

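
To make the difference between raw distances and normalized scores concrete, here is a minimal base R sketch. It does not call `irelink` or fastLink. `agreement_level()` and `lv_similarity()` are hypothetical helpers written for this illustration, and the normalization `1 - distance / max(nchar)` is an assumption about what a length-normalized Levenshtein similarity looks like, so check it against fastLink before relying on it to translate thresholds.

```r
# Hypothetical helpers for illustration; neither is part of irelink or fastLink.

# Map a raw edit distance to an agreement level, mirroring the thresholds
# described for cl_levenshtein(1, 2).
agreement_level <- function(a, b, full = 1, partial = 2) {
  d <- drop(utils::adist(a, b))  # base R generalized Levenshtein distance
  if (d <= full) {
    "full"
  } else if (d <= partial) {
    "partial"
  } else {
    "disagree"
  }
}

# ASSUMPTION: a length-normalized similarity of the form 1 - d / max(nchar).
lv_similarity <- function(a, b) {
  d <- drop(utils::adist(a, b))
  1 - d / max(nchar(a), nchar(b))
}

agreement_level("smith", "smyth")   # one substitution
agreement_level("smith", "smythe")  # one substitution plus one insertion
lv_similarity("smith", "smyth")     # 1 - 1/5
```

Because a raw distance ignores string length, one edit in a five-character surname and one edit in a fifteen-character surname land in the same agreement level, whereas a normalized score would treat them differently.
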
## Key parameters

| fastLink parameter | irelink equivalent |
|-------------------|--------------------|
| `cut.a` | first argument to `cl_jaro_winkler()` |
| `cut.p` | second argument to `cl_jaro_winkler()` |
| `cut.a.num` | argument to `cl_numeric_diff()` |
| `threshold.match` | `threshold` in `predict()` |
| `dedupe = FALSE` | default, `irelink` never enforces 1-to-1 matching |
| `n.cores` | irelink uses DuckDB parallelism automatically |

## Example: side-by-side deduplication

**fastLink:**

```r
library(fastLink)

out <- fastLink(
  dfA = records, dfB = records,
  varnames = c('first_name', 'surname', 'dob'),
  stringdist.match = c('first_name', 'surname'),
  partial.match = c('first_name', 'surname'),
  cut.a = 0.94, cut.p = 0.84,
  threshold.match = 0.90,
  dedupe = FALSE
)

recordsfL <- getMatches(dfA = records, dfB = records, fl.out = out)
length(unique(recordsfL$dedupe.ids))
```

**irelink:**

```r
library(irelink)

con <- DBI::dbConnect(duckdb::duckdb())

spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.94, 0.84)) |>
  il_compare(surname, cl_jaro_winkler(0.94, 0.84)) |>
  il_compare(dob, cl_exact()) |>
  il_block_on(surname) |>
  il_block_on(first_name)

model <- il_model(fake_1000, spec = spec, con = con) |>
  il_estimate_u() |>
  il_estimate_em(block_on(surname)) |>
  il_estimate_em(block_on(first_name)) |>
  il_prior_prevalence(1e-3)

pairs <- predict(model, threshold = 0.90)
clusters <- il_cluster(pairs)
length(unique(clusters$cluster_id))

il_cleanup(model)
DBI::dbDisconnect(con, shutdown = TRUE)
```

`il_prior_prevalence()` replaces the training-driven prior with a population-level baseline. This is similar to resetting the prior after EM in fastLink workflows that use a heavily blocked training sample, and you can skip it if your training data is large and representative.

## Blocking

fastLink's `blockData()` partitions records into groups and requires running `fastLink()` separately within each block before combining the results.
In `irelink`, you declare blocking in the spec with `il_block_on()`, and the package applies those rules automatically so you do not need a manual loop.

**fastLink:**

```r
blocks <- blockData(records, records, varnames = 'surname')

results <- list()
for (j in seq_along(blocks)) {
  sub <- records[blocks[[j]]$dfA.inds, ]
  out_b <- fastLink(dfA = sub, dfB = sub, ...)
  sub <- getMatches(dfA = sub, dfB = sub, fl.out = out_b)
  sub$dedupe.ids <- paste0('B', j, '_', sub$dedupe.ids)
  results[[j]] <- sub
}
combined <- do.call('rbind', results)
```

**irelink:**

```r
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.94, 0.84)) |>
  il_compare(surname, cl_jaro_winkler(0.94, 0.84)) |>
  il_compare(dob, cl_exact()) |>
  il_block_on(surname)
```

fastLink also offers k-means blocking through `blockData(..., kmeans.block = ..., nclusters = ...)`. `irelink` does not include a built-in k-means blocking step because the data lives in a SQL backend. For numeric fields, the closest equivalent is `il_block_on()` with pre-bucketed values.

## Model inspection

fastLink exposes learned parameters through `out$EM$patterns.w`, a table of agreement patterns and Fellegi-Sunter weights. `irelink` provides the same information visually.

```r
autoplot(model)
autoplot(model, type = 'parameters')
```

## Evaluation

fastLink requires you to build a confusion table by hand from `dedupe.ids` and a ground-truth column. `irelink` provides `il_cluster_confusion_matrix()`, which does this directly from the model.

**fastLink:**

```r
recordsfL$dupTrue <- ifelse(duplicated(recordsfL$cluster), 'Duplicated', 'Not duplicated')
recordsfL$dupfL <- ifelse(duplicated(recordsfL$dedupe.ids), 'Duplicated', 'Not duplicated')
confusion <- table('fastLink' = recordsfL$dupfL, 'True' = recordsfL$dupTrue)
```

**irelink:**

```r
acc <- il_cluster_confusion_matrix(model, labels_col = 'cluster', threshold = 0.90)
```

For a full accuracy or precision-recall curve across all thresholds, use `il_accuracy()` and `il_precision_recall()`.
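
If you want to cross-check a single threshold by hand, precision and recall can be read off a confusion table laid out like the hand-built fastLink one above. The sketch below uses only base R. The label vectors are toy values invented for illustration, not output from either package.

```r
# Toy labels, invented for illustration only.
truth <- c('Duplicated', 'Duplicated', 'Not duplicated',
           'Not duplicated', 'Duplicated')
predicted <- c('Duplicated', 'Not duplicated', 'Not duplicated',
               'Duplicated', 'Duplicated')

# Fix the factor levels so every cell exists even when a count is zero.
lv <- c('Duplicated', 'Not duplicated')
confusion <- table(
  'fastLink' = factor(predicted, levels = lv),
  'True' = factor(truth, levels = lv)
)

# Rows are predictions, columns are ground truth.
tp <- confusion['Duplicated', 'Duplicated']
fp <- confusion['Duplicated', 'Not duplicated']
fn <- confusion['Not duplicated', 'Duplicated']

precision <- as.numeric(tp / (tp + fp))
recall <- as.numeric(tp / (tp + fn))
```

Fixing the factor levels before calling `table()` keeps the row and column lookups valid when one class never appears in the predictions.
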