--- title: "Record Linkage Across Datasets" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Record Linkage Across Datasets} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = '#>' ) ``` This vignette shows how to link records across two datasets using the FEBRL (Freely Extensible Biomedical Record Linkage) benchmark data. Dataset 4a contains 5,000 original records, and dataset 4b contains 5,000 duplicates, one for each original. The duplicate records include typographical errors, missing values, and transpositions. Ground truth is stored in the `rec_id` column, so records that share the same base number, such as `rec-1070-org` and `rec-1070-dup-0`, refer to the same entity. ## Setup ```{r setup} library(irelink) library(ggplot2) ``` ## Load the data ```{r head} head(febrl4a) head(febrl4b) ``` ## Explore data quality Check completeness across both tables. Table B has more missing values because of the corruption process: ```{r completeness} con <- DBI::dbConnect(duckdb::duckdb()) comp <- il_completeness(febrl4a, febrl4b, con = con) comp ``` ```{r completeness-plot, fig.width = 6, fig.height = 3.5} autoplot(comp) ``` ## Define the specification For linkage across two tables, set `link_type = "link"` when you create the model. This spec uses name similarity, date-of-birth matching, and exact postcode matching: ```{r spec} spec <- il_spec() |> il_compare(given_name, cl_name()) |> il_compare(surname, cl_name()) |> il_compare(date_of_birth, cl_exact()) |> il_compare(postcode, cl_exact()) |> il_block_on(surname) |> il_block_on(given_name) spec ``` ## Train the model Create a link-type model with both tables: ```{r model} model <- il_model( febrl4a, febrl4b, spec = spec, con = con, link_type = 'link' ) model ``` Next, estimate the prior match probability and the u-probabilities, then run EM: ```{r train} model <- il_estimate_prior( model, block_on(given_name, surname), block_on(surname, suburb), recall = 0.6 ) model <- il_estimate_u(model, max_pairs = 1e5) model <- il_estimate_em(model, block_on(surname)) model <- il_estimate_em(model, block_on(suburb)) ``` ## Inspect the model ```{r weights, fig.width = 6, fig.height = 3.5} autoplot(model) ``` ```{r params, fig.width = 7, fig.height = 4} autoplot(model, type = 'parameters') ``` ```{r weights-table} il_weights(model) ``` ## Predict and cluster ```{r predict} predictions <- predict(model, threshold = 0.5) nrow(predictions) ``` ```{r hist, fig.width = 6, fig.height = 3} autoplot(predictions) ``` Cluster the pairs to resolve entities: ```{r cluster} clusters <- il_cluster(predictions, threshold = 0.85) head(clusters) ``` ## Evaluate against ground truth The `rec_id` column stores the ground truth. Extract entity IDs and build pairwise labels: ```{r labels} # Extract entity number from rec_id (e.g., "rec-1070-org" -> "1070") entity_a <- sub('^rec-(\\d+)-org$', '\\1', febrl4a$rec_id) entity_b <- sub('^rec-(\\d+)-dup-\\d+$', '\\1', febrl4b$rec_id) # Build id-entity lookup (unique_id auto-generated by il_model) ids_a <- data.frame(unique_id = seq_len(nrow(febrl4a)), entity = entity_a) ids_b <- data.frame(unique_id = seq_len(nrow(febrl4b)), entity = entity_b) # True matches: same entity across tables positives <- merge(ids_a, ids_b, by = 'entity') names(positives) <- c('entity', 'unique_id_l', 'unique_id_r') positives$is_match <- 1L positives <- positives[, c('unique_id_l', 'unique_id_r', 'is_match')] # Sample non-matching pairs set.seed(42) n_neg <- min(nrow(positives), 2000L) neg_l <- sample(ids_a$unique_id, n_neg, replace = TRUE) neg_r <- sample(ids_b$unique_id, n_neg, replace = TRUE) ent_l <- ids_a$entity[match(neg_l, ids_a$unique_id)] ent_r <- ids_b$entity[match(neg_r, ids_b$unique_id)] negatives <- data.frame( unique_id_l = neg_l, unique_id_r = neg_r, is_match = ifelse(ent_l == ent_r, 1L, 0L) ) labels <- rbind(positives, negatives) nrow(labels) sum(labels$is_match) ``` `labels` includes all true cross-table matches, so any true match missed by the blocking rules is counted as a false negative in the evaluation curves below. ### Accuracy metrics ```{r accuracy, fig.width = 6, fig.height = 3.5} acc <- il_accuracy(model, labels = labels) autoplot(acc) ``` ### ROC and Precision-recall ```{r roc, fig.width = 5, fig.height = 4} roc <- il_roc(model, labels = labels) autoplot(roc) ``` ```{r pr, fig.width = 5, fig.height = 4} pr <- il_precision_recall(model, labels = labels) autoplot(pr) ``` ## Cleanup ```{r cleanup} il_cleanup(model) DBI::dbDisconnect(con, shutdown = TRUE) ``` `il_cleanup(model)` only removes tables owned by that model. If an interactive run fails before you keep the model object, call `il_cleanup_all(con)` to remove all `irelink` tables from the connection before disconnecting.