---
title: "Getting Started"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = '#>'
)
```

## What is record linkage?

Record linkage, also called entity resolution or deduplication, identifies records in one or more datasets that refer to the same real-world entity. When datasets do not share a unique identifier, you must rely on imperfect fields such as names, dates of birth, and addresses. Probabilistic record linkage estimates the chance that two records are a match based on how similar they are across several fields.

`irelink` implements the Fellegi-Sunter model of probabilistic record linkage. It estimates parameters with unsupervised expectation maximization, so you can get started without labeled training data.

## A typical workflow

Every linkage task follows the same general pattern:

1. **Define a specification.** Choose which columns to compare and how.
2. **Build a model.** Load data into a SQL backend and attach the specification.
3. **Train parameters.** Estimate u-probabilities, then run EM to learn m-probabilities.
4. **Predict.** Score candidate pairs and keep the likely matches.
5. **Cluster.** Resolve pairwise links into groups that represent the same entity.

The example below walks through each step using a small built-in dataset.

## Step 1: Define a specification

A specification defines the comparisons and blocking rules that drive the model. Comparisons tell `irelink` how to score similarity on each field, and blocking rules limit which record pairs are compared so linkage stays tractable on large datasets.

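
To give a rough sense of why blocking matters, the sketch below counts candidate pairs for a small, made-up data frame with and without a blocking rule. It uses only base R; the `toy` data and the pair-counting approach are illustrative assumptions, not part of `irelink`.

```{r blocking-sketch}
# Hypothetical toy data, not part of irelink
toy <- data.frame(
  id = 1:6,
  surname = c("smith", "smith", "jones", "jones", "jones", "patel"),
  stringsAsFactors = FALSE
)

# Without blocking, every unordered pair of records is a candidate
n_all <- choose(nrow(toy), 2)

# Blocking on surname: only pairs that share a surname are compared
block_sizes <- as.vector(table(toy$surname))
n_blocked <- sum(choose(block_sizes, 2))

c(all_pairs = n_all, blocked_pairs = n_blocked)
```

In this toy case, blocking on surname reduces 15 possible pairs to 4. The unblocked count grows quadratically with the number of records, which is why some form of blocking is usually needed on large datasets.
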
```{r spec}
library(irelink)

spec <- il_spec() |>
  # Fuzzy comparisons on the name fields, with thresholds at 0.9 and 0.7
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(surname, cl_jaro_winkler(0.9, 0.7)) |>
  # Exact comparison on date of birth
  il_compare(dob, cl_exact()) |>
  # Blocking rules: a pair is compared if it satisfies either one
  il_block_on(surname) |>
  il_block_on(first_name)

spec
```

Each call to `il_compare()` adds one comparison dimension. Here, `cl_jaro_winkler(0.9, 0.7)` creates three levels: similarity of at least 0.9 is level 2, similarity of at least 0.7 but below 0.9 is level 1, and anything lower is level 0. `cl_exact()` is a simple binary match.

Blocking rules defined with `il_block_on()` restrict candidate pairs to records that share the same value in the blocking column. Multiple blocking rules use OR logic, so a pair is compared if it satisfies any one of them.

## Step 2: Build a model

`il_model()` uploads the data to a SQL backend and attaches the specification. Any DBI-compatible connection works. Here we use an in-memory DuckDB database:

```{r model}
df <- fake_20

# In-memory DuckDB connection
con <- DBI::dbConnect(duckdb::duckdb())

model <- il_model(df, spec = spec, con = con)
model
```

## Step 3: Train parameters

Training has two main steps. First, estimate u-probabilities, which are the chances that two random non-matching records agree at each comparison level:

```{r train-u}
model <- il_estimate_u(model)
```

Next, run expectation maximization to learn m-probabilities, which are the chances that true matches agree at each level. You provide a blocking rule to generate the training pairs:

```{r train-em}
model <- il_estimate_em(model, block_on(surname))
```

You can inspect the learned parameters at any time:

```{r weights}
il_weights(model)
```

## Step 4: Predict

`predict()` scores candidate pairs and returns those above a match-probability threshold:

```{r predict}
pairs <- predict(model, threshold = 0.5)
head(pairs)
```

Each row is a candidate pair. The output includes the left and right record identifiers, the per-comparison gamma values, the evidence-only `match_weight`, the prior-inclusive `total_match_weight`, and the posterior `match_probability`.

## Step 5: Cluster

`il_cluster()` resolves pairwise predictions into entity clusters with connected-components analysis:

```{r cluster}
clusters <- il_cluster(pairs)
head(clusters)
```

Each record is assigned a `cluster_id`. Records in the same cluster are treated as the same entity.

## Comparison levels

`irelink` includes a large set of comparison levels for common field types:

| Level | Use case |
|-------|----------|
| `cl_exact()` | Binary exact match |
| `cl_jaro_winkler()` | Names, short strings |
| `cl_levenshtein()` | General fuzzy strings |
| `cl_damerau_levenshtein()` | Strings with transpositions |
| `cl_jaro()` | Lightweight string similarity |
| `cl_jaccard()` | Token-set overlap |
| `cl_cosine()` | Embedding similarity |
| `cl_numeric_diff()` | Numeric fields (e.g., age) |
| `cl_pct_diff()` | Percentage difference |
| `cl_date_diff()` | Date fields |
| `cl_time_diff()` | Time fields |
| `cl_geo_distance()` | Geographic coordinates |
| `cl_array_intersect()` | Array or set overlap |

Domain-specific helpers combine multiple levels into a single call:

| Helper | Fields |
|--------|--------|
| `cl_name()` | Generic name field |
| `cl_first_last_name()` | First name and last name as separate fields |
| `cl_forename_surname()` | Forename and surname with transposition |
| `cl_dob()` | Date of birth |
| `cl_email()` | Email addresses |
| `cl_postcode()` | UK postal codes |
| `cl_zip_code()` | US ZIP codes |

## Evaluation

If you have labeled data, meaning pairs that are known matches or non-matches, `irelink` provides tools to assess model quality:

- `il_accuracy()`: overall accuracy at a threshold
- `il_precision_recall()`: precision and recall across thresholds
- `il_roc()`: ROC curve data
- `il_errors()`: inspect false positives and false negatives

## Cleaning up

When you are done, release the database resources owned by the model. In an interactive session that has left abandoned models behind, call `il_cleanup_all(con)` before disconnecting to drop every `irelink` table on the connection.

```{r cleanup}
# Drop the tables owned by this model
il_cleanup(model)

# Close the connection and shut down the DuckDB instance
DBI::dbDisconnect(con, shutdown = TRUE)
```
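
As a footnote to the scores described in Step 4, the sketch below shows how an evidence-only weight, a prior-inclusive weight, and a posterior probability relate in the textbook Fellegi-Sunter formulation. The m- and u-probabilities and the prior are made-up numbers, the base-2 logarithm is an assumption, and the variable names only mirror the output columns for readability. It uses base R and does not call `irelink`, whose internal calculation may differ.

```{r fs-weight-sketch}
# Hypothetical m- and u-probabilities for the observed level of each comparison
m <- c(first_name = 0.90, surname = 0.95, dob = 0.98)
u <- c(first_name = 0.01, surname = 0.005, dob = 0.001)

# Assumed prior probability that a random pair is a match
prior <- 0.001

# Evidence-only weight: sum of log2 Bayes factors (m / u) across comparisons
match_weight <- sum(log2(m / u))

# Prior-inclusive weight: add the log2 prior odds
total_match_weight <- match_weight + log2(prior / (1 - prior))

# Posterior probability from the total weight, since odds = 2^weight
match_probability <- 1 / (1 + 2^(-total_match_weight))

c(
  match_weight = match_weight,
  total_match_weight = total_match_weight,
  match_probability = match_probability
)
```

Because the assumed prior is small, the prior-inclusive weight is lower than the evidence-only weight. Strong agreement on several fields is what pushes the posterior probability towards 1.
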