---
title: "Deduplicating 50k Synthetic Records"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Deduplicating 50k Synthetic Records}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
can_run <- requireNamespace('nanoparquet', quietly = TRUE)

if (can_run) {
  url <- paste0(
    'https://raw.githubusercontent.com/',
    'moj-analytical-services/splink_datasets/',
    'master/data/historical_figures_with_errors_50k.parquet'
  )
  tmp <- file.path(tempdir(), 'historical_figures_with_errors_50k.parquet')
  df <- try(
    {
      utils::download.file(url, tmp, mode = 'wb', quiet = TRUE)
      nanoparquet::read_parquet(tmp)
    },
    silent = TRUE
  )
  can_run <- !inherits(df, 'try-error')
}

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = '#>',
  eval = can_run
)
```

This vignette reproduces the [Splink "Deduplicate 50k synthetic" demo](https://moj-analytical-services.github.io/splink/demos/examples/duckdb/deduplicate_50k_synthetic.html) in `irelink`.
The data is based on historical people scraped from Wikidata and includes duplicate records with realistic errors such as typos, missing values, and swapped fields.
The `cluster` column provides the ground-truth entity labels used in evaluation.

This vignette requires [nanoparquet](https://cran.r-project.org/package=nanoparquet) to read the Parquet file downloaded from the Splink datasets repository.
Its code chunks are evaluated only when the package is installed and the data URL is reachable.

## Load the data

```{r load-data}
library(irelink)
library(ggplot2)

df
```

## Profile the data

Use completeness and value distributions to choose blocking rules and comparisons.
The profiling functions take a DBI connection, here an in-memory DuckDB database:

```{r setup-con}
con <- DBI::dbConnect(duckdb::duckdb())
```

```{r completeness, fig.width = 7, fig.height = 3}
df |>
  il_completeness(con = con) |>
  autoplot()
```

```{r profile}
il_profile(df, first_name, surname, dob, birth_place, con = con, top_n = 8)
```

## Choose blocking rules

```{r suggest-blocking}
il_suggest_blocking(df, con = con)
```

The `cumulative_pairs` column shows the running total of unique pairs generated by each rule together with all rules listed before it:

```{r count-pairs}
il_count_pairs(
  df,
  block_on(surname, dob),
  block_on(first_name, dob),
  block_on(first_name, surname),
  block_on(dob, birth_place),
  con = con
)
```
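
To make the idea of a cumulative unique pair count concrete, here is a minimal base R sketch on toy data. It is not how `irelink` computes the counts; the data frame, the `unique_id` column, and the helper `pairs_for_rule()` are hypothetical and exist only for illustration.

```r
# Toy records; `unique_id` is a hypothetical identifier column.
toy_people <- data.frame(
  unique_id = 1:6,
  surname = c("smith", "smith", "smyth", "jones", "jones", "jones"),
  dob = c("1850-01-01", "1850-01-01", "1850-01-01",
          "1900-05-05", "1900-05-05", "1901-05-05"),
  stringsAsFactors = FALSE
)

# Pairs of records (as "a-b" strings) that agree on every column in `cols`.
pairs_for_rule <- function(data, cols) {
  key <- do.call(paste, c(unname(as.list(data[cols])), sep = "\r"))
  groups <- split(data$unique_id, key)
  groups <- groups[lengths(groups) > 1]
  unlist(
    lapply(groups, function(ids) combn(ids, 2, FUN = paste, collapse = "-")),
    use.names = FALSE
  )
}

# Rules are applied in order; a pair already produced by an earlier rule
# does not add to the cumulative count.
rules <- list(c("surname", "dob"), "dob", "surname")
seen <- character(0)
n_pairs <- integer(length(rules))
cumulative_pairs <- integer(length(rules))

for (i in seq_along(rules)) {
  p <- pairs_for_rule(toy_people, rules[[i]])
  seen <- union(seen, p)
  n_pairs[i] <- length(p)
  cumulative_pairs[i] <- length(seen)
}

pair_counts <- data.frame(
  rule = vapply(rules, paste, character(1), collapse = " + "),
  pairs = n_pairs,
  cumulative_pairs = cumulative_pairs
)
pair_counts
```

In this toy example the second and third rules each generate four pairs, but only two of them are new each time, so the cumulative count grows more slowly than the per-rule counts.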

## Define the specification

Apply term-frequency adjustment to `birth_place` and `occupation` so common values such as "London" receive less weight than rare ones:

```{r spec}
spec <- il_spec() |>
  il_compare(first_name, cl_name()) |>
  il_compare(surname, cl_name()) |>
  il_compare(dob, cl_dob()) |>
  il_compare(postcode_fake, cl_postcode()) |>
  il_compare(birth_place, cl_exact(term_frequency = TRUE)) |>
  il_compare(occupation, cl_exact(term_frequency = TRUE)) |>
  il_block_on(first_name ~ il_substr(1, 3), surname ~ il_substr(1, 4)) |>
  il_block_on(surname, dob) |>
  il_block_on(first_name, dob) |>
  il_block_on(postcode_fake, first_name) |>
  il_block_on(postcode_fake, surname) |>
  il_block_on(dob, birth_place) |>
  il_block_on(postcode_fake ~ il_substr(1, 3), dob) |>
  il_block_on(postcode_fake ~ il_substr(1, 3), first_name) |>
  il_block_on(postcode_fake ~ il_substr(1, 3), surname) |>
  il_block_on(
    first_name ~ il_substr(1, 2),
    surname ~ il_substr(1, 2),
    dob ~ il_substr(1, 4)
  )

spec
```
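
The effect of term-frequency adjustment can be sketched with a few lines of base R. This assumes `irelink` follows the Splink convention of replacing the u-probability of an exact match with the relative frequency of the matched value; the toy values and the m-probability of 0.9 are made up for illustration.

```r
# Hypothetical birth_place values: "london" is common, "bath" is rare.
birth_place <- c(rep("london", 6), rep("paris", 3), "bath")

# Relative frequency of each value, used here in place of a single
# u-probability for the exact-match level (assumed convention).
term_frequency <- lengths(split(birth_place, birth_place)) / length(birth_place)

# Assumed m-probability for the exact-match level.
m_exact <- 0.9

# Match weight in bits: log2 of the Bayes factor m / u.
match_weight <- log2(m_exact / term_frequency)
round(match_weight, 2)
```

Under these assumptions, agreeing on "bath" contributes more evidence for a match than agreeing on "london", because two unrelated records are less likely to share a rare value by chance.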

## Train the model

```{r model}
model <- df |>
  il_model(spec = spec, con = con) |>
  il_estimate_prior(
    block_on(first_name, surname, dob),
    block_on(dob, postcode_fake),
    recall = 0.6
  ) |>
  il_estimate_u(max_pairs = 5e6) |>
  il_estimate_em(block_on(first_name, surname)) |>
  il_estimate_em(block_on(dob))
```
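
The role of `recall = 0.6` in `il_estimate_prior()` can be illustrated with simple arithmetic. The sketch below assumes `irelink` mirrors Splink's approach: count the pairs matched by the deterministic rules, scale up by the assumed recall of those rules, and divide by the number of possible pairs. The record count and the number of deterministic pairs are hypothetical, not results from this dataset.

```r
# Illustrative inputs (hypothetical, not taken from the real data).
n_records <- 50000
n_deterministic_pairs <- 30000
recall <- 0.6

# Every unordered pair of distinct records.
total_pairs <- choose(n_records, 2)

# If the deterministic rules find 60% of true matches, the estimated
# total number of true matching pairs is the count divided by recall.
estimated_matches <- n_deterministic_pairs / recall

# Prior probability that two randomly chosen records match.
prior <- estimated_matches / total_pairs
prior
```

A lower `recall` value implies more true matches than the deterministic rules found, and therefore a higher prior.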

## Inspect the trained model

```{r summary}
summary(model)
```

```{r weights-plot, fig.width = 7, fig.height = 4}
autoplot(model)
```

```{r params-plot, fig.width = 7, fig.height = 5}
autoplot(model, type = 'parameters')
```

```{r unlinkables, fig.width = 6, fig.height = 3.5}
autoplot(il_unlinkables(model))
```

## Predict

```{r predict}
predictions <- predict(model, threshold = 0.5)
predictions
```

```{r histogram, fig.width = 7, fig.height = 3.5}
autoplot(predictions)
```

```{r waterfall, fig.width = 7, fig.height = 4}
autoplot(predictions, which = 1)
```

## Cluster

```{r cluster}
clusters <- il_cluster(predictions, threshold = 0.95)
clusters
```
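
One common way to turn scored pairs into clusters is to keep the pairs at or above the threshold and take connected components of the resulting graph. Whether `il_cluster()` does exactly this is an assumption; the sketch below only shows the general technique in base R, with hypothetical column names and scores.

```r
# Hypothetical scored pairs for six records with ids 1..6.
scored <- data.frame(
  id_l = c(1L, 2L, 4L, 5L, 3L),
  id_r = c(2L, 3L, 5L, 6L, 4L),
  match_probability = c(0.99, 0.97, 0.98, 0.60, 0.40)
)

# Label each of n nodes with the smallest id in its connected component,
# using a simple union-find over the edges from[k] -- to[k].
connected_components <- function(n, from, to) {
  parent <- seq_len(n)
  find_root <- function(i) {
    while (parent[i] != i) i <- parent[i]
    i
  }
  for (k in seq_along(from)) {
    a <- find_root(from[k])
    b <- find_root(to[k])
    if (a != b) parent[max(a, b)] <- min(a, b)
  }
  vapply(seq_len(n), find_root, integer(1))
}

threshold <- 0.95
keep <- scored$match_probability >= threshold
cluster_id <- connected_components(6L, scored$id_l[keep], scored$id_r[keep])
cluster_id
```

With a threshold of 0.95, records 1 to 3 form one cluster, records 4 and 5 form another, and record 6 stays on its own. Lowering the threshold to 0.5 would also keep the pair 5-6 and pull record 6 into the second cluster, which is why the clustering threshold has a direct effect on cluster sizes.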

## Evaluate against ground truth

```{r accuracy}
acc <- il_accuracy(model, labels_col = 'cluster')
acc
```

When you use `labels_col`, the evaluation derives all true duplicate pairs from the ground-truth cluster column.
Some true pairs may never be generated by the blocking rules.
Those pairs count as false negatives at every threshold.
As a result, recall in the accuracy, ROC, and precision-recall plots can never exceed the blocking recall, which is the recall observed at the lowest threshold:

```{r blocking-recall}
acc0 <- acc[acc$threshold == min(acc$threshold), ]
acc0$tp / (acc0$tp + acc0$fn)
```
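
The recall ceiling can be reproduced on a toy example in base R. The data frame and the helper `pairs_within()` are hypothetical; the sketch derives the true pairs from a ground-truth `cluster` column, generates candidate pairs from a single blocking key, and measures how many true pairs the blocking step covers.

```r
# Toy labelled records: three records for entity "a", two for entity "b".
toy_labelled <- data.frame(
  unique_id = 1:5,
  cluster = c("a", "a", "a", "b", "b"),
  surname = c("smith", "smith", "smyth", "jones", "jones"),
  stringsAsFactors = FALSE
)

# All pairs of ids (as "a-b" strings) that share the same value of `key`.
pairs_within <- function(ids, key) {
  groups <- split(ids, key)
  groups <- groups[lengths(groups) > 1]
  unlist(
    lapply(groups, function(g) combn(g, 2, FUN = paste, collapse = "-")),
    use.names = FALSE
  )
}

# True duplicate pairs come from the ground-truth labels.
true_pairs <- pairs_within(toy_labelled$unique_id, toy_labelled$cluster)

# Candidate pairs come from blocking on surname only.
candidate_pairs <- pairs_within(toy_labelled$unique_id, toy_labelled$surname)

# Share of true pairs that blocking generated at all.
blocking_recall <- length(intersect(true_pairs, candidate_pairs)) /
  length(true_pairs)
blocking_recall
```

Here the misspelled surname "smyth" keeps record 3 out of its block, so two of the four true pairs are never scored and no threshold can recover them.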

```{r accuracy-plot, fig.width = 7, fig.height = 4}
autoplot(acc)
```

```{r roc, fig.width = 5, fig.height = 4.5}
autoplot(il_roc(model, labels_col = 'cluster'))
```

```{r pr, fig.width = 5, fig.height = 4.5}
autoplot(il_precision_recall(model, labels_col = 'cluster'))
```

### Error inspection

```{r errors-fp}
errors <- il_errors(model, labels_col = 'cluster', threshold = 0.999)
errors[errors$error_type == 'false_positive', ]
```

Some false negatives occur because the true pair was never generated by any blocking rule:

```{r errors-fn}
errors <- il_errors(model, labels_col = 'cluster', threshold = 0.5)
errors[errors$error_type == 'false_negative', ]
```

## Cleanup

```{r cleanup}
il_cleanup(model)
DBI::dbDisconnect(con, shutdown = TRUE)
```

`il_cleanup(model)` is model-scoped.
If an interactive run failed before the model was assigned to an object, call `il_cleanup_all(con)` to remove all `irelink` tables from the connection before disconnecting.