---
title: "Translating from fastLink"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Translating from fastLink}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = '#>'
)
```
`irelink` implements the same Fellegi-Sunter probabilistic record linkage framework as [fastLink](https://github.com/kosukeimai/fastLink), but it uses a different API and a SQL backend.
This vignette maps common fastLink patterns to `irelink` so you can get started quickly.
## Design differences
fastLink bundles data preparation, EM estimation, and matching into a single `fastLink()` call.
`irelink` splits that work into a pipeline of composable functions: you define a spec, build a model, estimate parameters, and then predict.
fastLink's Jaro-Winkler comparisons produce up to three agreement levels (full, partial, and none), with `cut.a` and `cut.p` setting the lower bounds for full and partial agreement.
`irelink` uses `cl_jaro_winkler(high, low)` to express the same thresholds.
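
As a rough sketch of how two cutoffs turn a similarity score into three agreement levels, the base R helper below bins precomputed similarity scores. The name `agreement_level()` and the inclusive `>=` boundaries are assumptions made for illustration; neither package is guaranteed to treat ties at a cutoff in exactly this way.

```r
# Hypothetical helper (not part of fastLink or irelink):
# bin similarity scores into agreement levels.
# 2 = full agreement, 1 = partial agreement, 0 = disagreement.
agreement_level <- function(sim, high = 0.94, low = 0.84) {
  stopifnot(high >= low)
  ifelse(sim >= high, 2L, ifelse(sim >= low, 1L, 0L))
}

# Made-up similarity scores for illustration
sim <- c(1.00, 0.95, 0.90, 0.84, 0.60)
agreement_level(sim)
```

The `high` and `low` arguments play the roles of `cut.a` and `cut.p` respectively.
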
fastLink's `getMatches()` assigns a `dedupe.ids` column to flag duplicates.
`irelink` uses `il_cluster()` instead, which assigns a `cluster_id` to each record.
## Core workflow
| Step | fastLink | irelink |
|------|-------------|-------------|
| Define comparisons | `varnames`, `stringdist.match`, `partial.match` in `fastLink()` | `il_spec() |>`<br>`il_compare(...) |>`<br>`il_block_on(...)` |
| Estimate and match | `fastLink(dfA, dfB, ...)` | `il_model(df, spec, con) |>`<br>`il_estimate_u() |>`<br>`il_estimate_em(block_on(...))` |
| Set match threshold | `threshold.match` in `fastLink()` | `threshold` in `predict()` |
| Get matched records | `getMatches(dfA, dfB, fl.out)` | `il_cluster(pairs)` |
| Evaluate | manual confusion table | `il_cluster_confusion_matrix(model, labels_col, threshold)` |
| Block by variable | `blockData(dfA, dfB, varnames)` + loop | `il_block_on(var)` in spec |
| Numeric comparison | `numeric.match`, `cut.a.num` | `cl_numeric_diff(threshold)` |
| Inspect parameters | `out$EM$patterns.w` | `autoplot(model, type = 'parameters')` |
| Set prevalence prior | Not built in | `il_prior_prevalence(model, probability)` |
| Save / load model | Not built in | `il_save(model, path)` / `il_load(path)` |
## Comparison functions
fastLink compares string fields with Jaro-Winkler by default, or with Levenshtein or Jaro.
These produce up to three agreement levels.
Each one maps to a `cl_*()` function in `irelink`.
| fastLink | irelink |
|----------|---------|
| JW (default), `cut.a`, `cut.p` | `cl_jaro_winkler(high, low)` |
| `stringdist.method = "jaro"`, `cut.a`, `cut.p` | `cl_jaro(high, low)` |
| `stringdist.method = "lv"`, `cut.a`, `cut.p` | `cl_levenshtein(low, high)` |
| `numeric.match`, `cut.a.num` | `cl_numeric_diff(threshold)` |
| exact agreement on non-string fields | `cl_exact()` |
Levenshtein thresholds in `irelink` are raw edit distances.
They are not renormalized similarity scores as in fastLink.
`cl_levenshtein(1, 2)` means "distance <= 1 is full agreement, and distance <= 2 is partial agreement."
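
To make the difference between a raw distance and a normalized similarity concrete, here is a small base R sketch using `adist()`. The normalization shown (one minus the distance divided by the longer string's length) is an assumption about how a similarity score is typically derived, not a statement of fastLink's exact formula.

```r
# Raw Levenshtein distance between two strings (base R)
a <- 'catherine'
b <- 'katherine'
d <- adist(a, b)[1, 1]

# A length-normalized similarity, assumed here to be 1 - d / (longer length)
sim <- 1 - d / max(nchar(a), nchar(b))

# Agreement level under cl_levenshtein(1, 2)-style thresholds:
# distance <= 1 is full (2), distance <= 2 is partial (1), otherwise 0
level <- if (d <= 1) 2L else if (d <= 2) 1L else 0L
```

Because the raw distance ignores string length, one edit counts the same in a three-letter name as in a twelve-letter name, so cutoffs tuned on similarity scores do not carry over directly.
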
## Key parameters
| fastLink parameter | irelink equivalent |
|-------------------|--------------------|
| `cut.a` | first argument to `cl_jaro_winkler()` |
| `cut.p` | second argument to `cl_jaro_winkler()` |
| `cut.a.num` | argument to `cl_numeric_diff()` |
| `threshold.match` | `threshold` in `predict()` |
| `dedupe = FALSE` | default; `irelink` never enforces 1-to-1 matching |
| `n.cores` | not needed; `irelink` uses DuckDB parallelism automatically |
## Example: side-by-side deduplication
**fastLink:**
```r
library(fastLink)
# Link the data set to itself to find duplicates
out <- fastLink(
  dfA = records,
  dfB = records,
  varnames = c('first_name', 'surname', 'dob'),
  stringdist.match = c('first_name', 'surname'),
  partial.match = c('first_name', 'surname'),
  cut.a = 0.94,  # lower bound for full agreement
  cut.p = 0.84,  # lower bound for partial agreement
  threshold.match = 0.90,
  dedupe = FALSE
)

# Attach duplicate IDs and count the distinct entities
recordsfL <- getMatches(dfA = records, dfB = records, fl.out = out)
length(unique(recordsfL$dedupe.ids))
```
**irelink:**
```r
library(irelink)
# DuckDB connection used as the SQL backend
con <- DBI::dbConnect(duckdb::duckdb())

# Declare the comparisons and the blocking rules
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.94, 0.84)) |>
  il_compare(surname, cl_jaro_winkler(0.94, 0.84)) |>
  il_compare(dob, cl_exact()) |>
  il_block_on(surname) |>
  il_block_on(first_name)

# Build the model and estimate its parameters
model <- il_model(fake_1000, spec = spec, con = con) |>
  il_estimate_u() |>
  il_estimate_em(block_on(surname)) |>
  il_estimate_em(block_on(first_name)) |>
  il_prior_prevalence(1e-3)

# Score pairs at the 0.90 threshold, then cluster them
pairs <- predict(model, threshold = 0.90)
clusters <- il_cluster(pairs)
length(unique(clusters$cluster_id))

# Clean up and close the connection
il_cleanup(model)
DBI::dbDisconnect(con, shutdown = TRUE)
```
`il_prior_prevalence()` replaces the training-driven prior with a population-level baseline.
This is similar to resetting the prior after EM in fastLink workflows that use a heavily blocked training sample.
You can skip it if your training data is large and representative.
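
To see why the prior matters, recall the standard Fellegi-Sunter arithmetic: the posterior odds of a match are the prior odds multiplied by the Bayes factor, which under conditional independence is the product of the m/u ratios across comparisons. The sketch below is textbook arithmetic in base R with a hypothetical helper name; it is not taken from `irelink`'s internals.

```r
# Hypothetical helper: posterior match probability from a prior
# match probability and a combined Bayes factor.
match_probability <- function(prior, bayes_factor) {
  prior_odds <- prior / (1 - prior)
  posterior_odds <- prior_odds * bayes_factor
  posterior_odds / (1 + posterior_odds)
}

# Same evidence (Bayes factor of 1000) under two different priors
match_probability(prior = 1e-3, bayes_factor = 1000)
match_probability(prior = 0.2, bayes_factor = 1000)
```

With the small prior, the same evidence lands well below a 0.90 threshold; with the inflated prior, it clears the threshold easily. A prior learned from a heavily blocked sample tends to be inflated in this way.
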
## Blocking
fastLink's `blockData()` partitions records into groups and requires running `fastLink()` separately within each block before combining the results.
In `irelink`, you declare blocking in the spec with `il_block_on()`, and the package applies those rules automatically so you do not need a manual loop.
**fastLink:**
```r
blocks <- blockData(records, records, varnames = 'surname')
results <- list()
for (j in seq_along(blocks)) {
  # Rows of `records` that fall in block j
  sub <- records[blocks[[j]]$dfA.inds, ]
  out_b <- fastLink(dfA = sub, dfB = sub, ...)
  sub <- getMatches(dfA = sub, dfB = sub, fl.out = out_b)
  # Prefix IDs with the block index so they stay unique across blocks
  sub$dedupe.ids <- paste0('B', j, '_', sub$dedupe.ids)
  results[[j]] <- sub
}
combined <- do.call('rbind', results)
```
**irelink:**
```r
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.94, 0.84)) |>
  il_compare(surname, cl_jaro_winkler(0.94, 0.84)) |>
  il_compare(dob, cl_exact()) |>
  il_block_on(surname)
```
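
For intuition about what a blocking rule buys you, the base R sketch below counts candidate pairs with and without exact blocking on a toy surname vector. It only illustrates the arithmetic; how `irelink` generates pairs inside the SQL backend is not shown here, and the data are made up.

```r
# Toy surname column (made-up values)
surname <- c('smith', 'smith', 'jones', 'jones', 'jones', 'lee')

# All unordered pairs without blocking
n_all <- choose(length(surname), 2)

# Pairs that share a surname: sum of choose(block size, 2) over blocks
block_sizes <- as.vector(table(surname))
n_blocked <- sum(choose(block_sizes, 2))
```

Here blocking reduces the comparison space from `n_all` pairs to `n_blocked` pairs, at the cost of never comparing records whose surnames differ.
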
fastLink also offers k-means blocking through `blockData(..., kmeans.block = ..., nclusters = ...)`.
`irelink` does not include a built-in k-means blocking step because the data lives in a SQL backend.
For numeric fields, the closest equivalent is `il_block_on()` with pre-bucketed values.
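
A minimal sketch of pre-bucketing, assuming a hypothetical numeric `age` column and a five-year bucket width (both chosen only for illustration):

```r
# Toy stand-in for your data, with a hypothetical numeric `age` column
records <- data.frame(
  id  = 1:6,
  age = c(23, 24, 27, 41, 44, 45)
)

# Assign each record to a five-year bucket (lower edge of the bucket)
records$age_bucket <- floor(records$age / 5) * 5

# The bucket column could then serve as a blocking key in the spec, e.g.
# il_spec() |> il_block_on(age_bucket)
```

Fixed-width buckets have hard edges: ages 44 and 45 land in different buckets and would never be compared under this rule alone, so it is worth pairing it with a second blocking rule.
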
## Model inspection
fastLink exposes learned parameters through `out$EM$patterns.w`, a table of agreement patterns and Fellegi-Sunter weights.
`irelink` provides the same information visually.
```r
autoplot(model)
autoplot(model, type = 'parameters')
```
## Evaluation
fastLink requires you to build a confusion table by hand from `dedupe.ids` and a ground-truth column.
`irelink` provides `il_cluster_confusion_matrix()`, which does this directly from the model.
**fastLink:**
```r
recordsfL$dupTrue <- ifelse(duplicated(recordsfL$cluster), 'Duplicated', 'Not duplicated')
recordsfL$dupfL <- ifelse(duplicated(recordsfL$dedupe.ids), 'Duplicated', 'Not duplicated')
confusion <- table('fastLink' = recordsfL$dupfL, 'True' = recordsfL$dupTrue)
```
**irelink:**
```r
acc <- il_cluster_confusion_matrix(model, labels_col = 'cluster', threshold = 0.90)
```
For a full accuracy or precision-recall curve across all thresholds, use `il_accuracy()` and `il_precision_recall()`.