---
title: "Deduplicating 50k Synthetic Records"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Deduplicating 50k Synthetic Records}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
can_run <- requireNamespace('nanoparquet', quietly = TRUE)

if (can_run) {
  url <- paste0(
    'https://raw.githubusercontent.com/',
    'moj-analytical-services/splink_datasets/',
    'master/data/historical_figures_with_errors_50k.parquet'
  )
  tmp <- paste0(tempdir(), '/historical_figures_with_errors_50k.parquet')
  df <- try(
    {
      utils::download.file(url, tmp, mode = 'wb', quiet = TRUE)
      nanoparquet::read_parquet(tmp)
    },
    silent = TRUE
  )
  can_run <- !inherits(df, 'try-error')
}

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = '#>',
  eval = can_run
)
```

This vignette reproduces the [Splink "Deduplicate 50k synthetic" demo](https://moj-analytical-services.github.io/splink/demos/examples/duckdb/deduplicate_50k_synthetic.html) in `irelink`. The data is based on historical people scraped from Wikidata and includes duplicate records with realistic errors such as typos, missing values, and swapped fields. The `cluster` column provides the ground-truth entity labels used in evaluation.

This vignette requires [nanoparquet](https://cran.r-project.org/package=nanoparquet) to read the remote Parquet file and only compiles when the package and the data URL are both available.

## Load the data

```{r load-data}
library(irelink)
library(ggplot2)

df
```

## Profile the data

Use completeness and value distributions to choose blocking rules and comparisons:

```{r setup-con}
con <- DBI::dbConnect(duckdb::duckdb())
```

```{r completeness, fig.width = 7, fig.height = 3}
df |>
  il_completeness(con = con) |>
  autoplot()
```

```{r profile}
il_profile(df, first_name, surname, dob, birth_place, con = con, top_n = 8)
```

## Choose blocking rules

```{r suggest-blocking}
il_suggest_blocking(df, con = con)
```

The `cumulative_pairs` column shows the total number of unique pairs produced so far:

```{r count-pairs}
il_count_pairs(
  df,
  block_on(surname, dob),
  block_on(first_name, dob),
  block_on(first_name, surname),
  block_on(dob, birth_place),
  con = con
)
```

## Define the specification

Apply term-frequency adjustment to `birth_place` and `occupation` so common values such as "London" receive less weight than rare ones:

```{r spec}
spec <- il_spec() |>
  il_compare(first_name, cl_name()) |>
  il_compare(surname, cl_name()) |>
  il_compare(dob, cl_dob()) |>
  il_compare(postcode_fake, cl_postcode()) |>
  il_compare(birth_place, cl_exact(term_frequency = TRUE)) |>
  il_compare(occupation, cl_exact(term_frequency = TRUE)) |>
  il_block_on(first_name ~ il_substr(1, 3), surname ~ il_substr(1, 4)) |>
  il_block_on(surname, dob) |>
  il_block_on(first_name, dob) |>
  il_block_on(postcode_fake, first_name) |>
  il_block_on(postcode_fake, surname) |>
  il_block_on(dob, birth_place) |>
  il_block_on(postcode_fake ~ il_substr(1, 3), dob) |>
  il_block_on(postcode_fake ~ il_substr(1, 3), first_name) |>
  il_block_on(postcode_fake ~ il_substr(1, 3), surname) |>
  il_block_on(
    first_name ~ il_substr(1, 2),
    surname ~ il_substr(1, 2),
    dob ~ il_substr(1, 4)
  )

spec
```

## Train the model

```{r model}
model <- df |>
  il_model(spec = spec, con = con) |>
  il_estimate_prior(
    block_on(first_name, surname, dob),
    block_on(dob, postcode_fake),
    recall = 0.6
  ) |>
  il_estimate_u(max_pairs = 5e6) |>
  il_estimate_em(block_on(first_name, surname)) |>
  il_estimate_em(block_on(dob))
```

## Inspect the trained model

```{r summary}
summary(model)
```

```{r weights-plot, fig.width = 7, fig.height = 4}
autoplot(model)
```

```{r params-plot, fig.width = 7, fig.height = 5}
autoplot(model, type = 'parameters')
```

```{r unlinkables, fig.width = 6, fig.height = 3.5}
autoplot(il_unlinkables(model))
```

## Predict

```{r predict}
predictions <- predict(model, threshold = 0.5)
predictions
```

```{r histogram, fig.width = 7, fig.height = 3.5}
autoplot(predictions)
```

```{r waterfall, fig.width = 7, fig.height = 4}
autoplot(predictions, which = 1)
```

## Cluster

```{r cluster}
clusters <- il_cluster(predictions, threshold = 0.95)
clusters
```

## Evaluate against ground truth

```{r accuracy}
acc <- il_accuracy(model, labels_col = 'cluster')
acc
```

When you use `labels_col`, the evaluation derives all true duplicate pairs from the ground-truth cluster column. Some true pairs may never be generated by the blocking rules. Those pairs count as false negatives at every threshold. As a result, the maximum recall in the accuracy, ROC, and precision-recall plots is the blocking recall:

```{r blocking-recall}
acc0 <- acc[acc$threshold == min(acc$threshold), ]
acc0$tp / (acc0$tp + acc0$fn)
```

```{r accuracy-plot, fig.width = 7, fig.height = 4}
autoplot(acc)
```

```{r roc, fig.width = 5, fig.height = 4.5}
autoplot(il_roc(model, labels_col = 'cluster'))
```

```{r pr, fig.width = 5, fig.height = 4.5}
autoplot(il_precision_recall(model, labels_col = 'cluster'))
```

### Error inspection

```{r errors-fp}
errors <- il_errors(model, labels_col = 'cluster', threshold = 0.999)
errors[errors$error_type == 'false_positive', ]
```

Some false negatives occur because the true pair was never generated by any blocking rule:

```{r errors-fn}
errors <- il_errors(model, labels_col = 'cluster', threshold = 0.5)
errors[errors$error_type == 'false_negative', ]
```

## Cleanup

```{r cleanup}
il_cleanup(model)
DBI::dbDisconnect(con, shutdown = TRUE)
```

`il_cleanup(model)` is model-scoped. If an interactive run failed before you kept the model object, call `il_cleanup_all(con)` to remove all `irelink` tables from the connection before disconnecting.
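
As a footnote to the evaluation section, the blocking-recall ceiling can be reproduced on a toy example in base R. This is only a sketch: the `toy` data frame, its values, and the two blocking rules (same `surname`, or same `dob`) are hypothetical and are not taken from the vignette's data, and no `irelink` function is used.

```r
# Hypothetical toy records; `cluster` plays the role of the ground-truth label.
toy <- data.frame(
  id      = 1:5,
  surname = c("smith", "smith", "smyth", "jones", "jonas"),
  dob     = c("1850", "1850", "1850", "1901", "1902"),
  cluster = c("a", "a", "a", "b", "b"),
  stringsAsFactors = FALSE
)

# Enumerate every unordered pair of records (10 pairs for 5 records).
idx   <- t(combn(nrow(toy), 2))
left  <- toy[idx[, 1], ]
right <- toy[idx[, 2], ]

# A pair is a true duplicate when both records share a ground-truth cluster.
is_match <- left$cluster == right$cluster

# A pair becomes a candidate when at least one blocking rule generates it.
# Assumed rules for this sketch: same surname, or same dob.
is_blocked <- left$surname == right$surname | left$dob == right$dob

# Blocking recall: the share of true pairs that survive blocking.
# Pair (4, 5) differs on both surname and dob, so no rule generates it.
blocking_recall <- sum(is_match & is_blocked) / sum(is_match)
blocking_recall
```

Three of the four true pairs are generated by the rules, so no scoring threshold can push recall above 0.75 on this toy data. The missed pair is a false negative regardless of how well the model scores candidate pairs.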