---
title: "Getting Started"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = '#>'
)
```

## What is record linkage?

Record linkage, also called entity resolution or deduplication, identifies records in one or more datasets that refer to the same real-world entity.
When datasets do not share a unique identifier, you must rely on imperfect fields such as names, dates of birth, and addresses.
Probabilistic record linkage estimates the chance that two records are a match based on how similar they are across several fields.

`irelink` implements the Fellegi-Sunter model of probabilistic record linkage.
It estimates parameters with unsupervised expectation maximization, so you can get started without labeled training data.

## A typical workflow

Every linkage task follows the same general pattern:

1. **Define a specification.** Choose which columns to compare and how.
2. **Build a model.** Load data into a SQL backend and attach the specification.
3. **Train parameters.** Estimate u-probabilities, then run EM to learn m-probabilities.
4. **Predict.** Score candidate pairs and keep the likely matches.
5. **Cluster.** Resolve pairwise links into groups that represent the same entity.

The example below walks through each step using a small built-in dataset.

## Step 1: Define a specification

A specification defines the comparisons and blocking rules that drive the model.
Comparisons tell `irelink` how to score similarity on each field, and blocking rules limit which record pairs are compared so linkage stays tractable on large datasets.

```{r spec}
library(irelink)

spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(surname, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(dob, cl_exact()) |>
  il_block_on(surname) |>
  il_block_on(first_name)

spec
```

Each call to `il_compare()` adds one comparison dimension.
Here, `cl_jaro_winkler(0.9, 0.7)` creates three levels: similarity of at least 0.9 is level 2, similarity of at least 0.7 is level 1, and anything lower is level 0.
`cl_exact()` is a simple binary match.
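
To make the threshold-to-level mapping concrete, here is a minimal base R sketch. `score_to_level()` is a hypothetical helper written for this illustration, not an `irelink` function, and the similarity scores are made up.

```{r sketch-levels}
# Hypothetical helper (not part of irelink): map similarity scores to
# ordinal levels given thresholds such as c(0.9, 0.7).
score_to_level <- function(similarity, thresholds = c(0.9, 0.7)) {
  # findInterval() needs ascending breaks and counts how many breaks are <= x,
  # so a score >= 0.9 gives 2, a score in [0.7, 0.9) gives 1, and the rest 0.
  findInterval(similarity, sort(thresholds))
}

score_to_level(c(0.95, 0.90, 0.82, 0.70, 0.41))
```

Scores that sit exactly on a threshold fall into the higher level, which matches the "at least" wording above.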

Blocking rules defined with `il_block_on()` restrict candidate pairs to records that share the same value in the blocking column.
Multiple blocking rules use OR logic, so a pair is compared if it satisfies any one of them.
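
The OR logic can be sketched in base R on a toy table. This only illustrates the idea of unioning the pairs from each rule; the data and the `pairs_sharing()` helper are hypothetical, and `irelink` generates its candidate pairs in the SQL backend rather than like this.

```{r sketch-blocking}
# Toy records; illustrative values only, not taken from fake_20.
people <- data.frame(
  id = 1:4,
  first_name = c("ann", "anne", "ann", "bob"),
  surname = c("lee", "lee", "li", "ray"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: all unordered pairs that share a value in `column`.
pairs_sharing <- function(data, column) {
  joined <- merge(data, data, by = column, suffixes = c("_l", "_r"))
  # Keep each pair once and drop self-pairs
  joined[joined$id_l < joined$id_r, c("id_l", "id_r")]
}

# OR logic: union the pairs produced by each rule, then deduplicate.
candidates <- unique(rbind(
  pairs_sharing(people, "surname"),
  pairs_sharing(people, "first_name")
))
candidates <- candidates[order(candidates$id_l, candidates$id_r), ]
rownames(candidates) <- NULL
candidates
```

Only pairs that share a surname or a first name survive, so most of the possible pairs among these four records are never scored.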

## Step 2: Build a model

`il_model()` uploads the data to a SQL backend and attaches the specification.
Any DBI-compatible connection works.
Here we use an in-memory DuckDB database:

```{r model}
df <- fake_20
con <- DBI::dbConnect(duckdb::duckdb())

model <- il_model(df, spec = spec, con = con)
model
```

## Step 3: Train parameters

Training has two main steps.
First, estimate u-probabilities: for each comparison level, the probability that a pair of non-matching records falls into that level:

```{r train-u}
model <- il_estimate_u(model)
```
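
One common way to approximate u-probabilities is to compare randomly drawn pairs, because almost all random pairs are non-matches. The base R sketch below illustrates the quantity being estimated for a single exact-match comparison. It uses made-up data, and it is an assumption that this resembles what `il_estimate_u()` does internally.

```{r sketch-u}
set.seed(1)

# Toy column; illustrative values only.
surnames <- sample(c("lee", "li", "ray", "kim", "cho"), 200, replace = TRUE)

# Draw random pairs of distinct records. Almost all random pairs are
# non-matches, so agreement rates among them approximate u.
n_pairs <- 5000
left <- sample(length(surnames), n_pairs, replace = TRUE)
right <- sample(length(surnames), n_pairs, replace = TRUE)
keep <- left != right

agree <- surnames[left[keep]] == surnames[right[keep]]

# u for the "exact match" level and for the "no match" level
u_hat <- c(agree = mean(agree), disagree = mean(!agree))
u_hat
```

The u-probabilities for one comparison sum to 1 across its levels, because every pair falls into exactly one level.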

Next, run expectation maximization to learn m-probabilities: for each comparison level, the probability that a truly matching pair falls into that level.
You provide a blocking rule to generate the training pairs:

```{r train-em}
model <- il_estimate_em(model, block_on(surname))
```
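
As a rough illustration of what EM does in this setting, the base R sketch below fits a simplified Fellegi-Sunter model. It is not `irelink`'s implementation: it assumes binary agree/disagree comparisons, simulated agreement patterns, and u-probabilities held fixed, and every name in it is made up for the example.

```{r sketch-em}
set.seed(42)

# Simulate binary agreement patterns for 3 fields (illustrative values).
n <- 5000
true_lambda <- 0.1
true_m <- c(0.95, 0.90, 0.85)
u <- c(0.05, 0.10, 0.02)   # treated as known, as if estimated beforehand

is_match <- runif(n) < true_lambda
gamma <- sapply(seq_along(u), function(k) {
  p <- ifelse(is_match, true_m[k], u[k])
  as.integer(runif(n) < p)
})

# Likelihood of each row's pattern given per-field agreement probabilities
pattern_lik <- function(gamma, p) {
  exp(gamma %*% log(p) + (1 - gamma) %*% log(1 - p))[, 1]
}

# EM: start from rough guesses, hold u fixed, update m and lambda
m <- rep(0.8, 3)
lambda <- 0.5
for (iter in 1:50) {
  # E-step: posterior probability that each pair is a match
  num <- lambda * pattern_lik(gamma, m)
  den <- num + (1 - lambda) * pattern_lik(gamma, u)
  w <- num / den
  # M-step: weighted agreement rates among the (soft) matches
  m <- colSums(w * gamma) / sum(w)
  lambda <- mean(w)
}

round(rbind(estimated = m, true = true_m), 2)
```

No labels are used at any point: the E-step guesses which pairs are matches from the current parameters, and the M-step re-estimates the parameters from those guesses.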

You can inspect the learned parameters at any time:

```{r weights}
il_weights(model)
```

## Step 4: Predict

`predict()` scores candidate pairs and returns those above a match-probability threshold:

```{r predict}
pairs <- predict(model, threshold = 0.5)
head(pairs)
```

Each row is a candidate pair.
The output includes the left and right record identifiers, the per-comparison gamma values, the evidence-only `match_weight`, the prior-inclusive `total_match_weight`, and the posterior `match_probability`.
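
The relationship between the three scores can be sketched with the standard Fellegi-Sunter arithmetic. The m, u, and prior values below are made up, and the use of base-2 logarithms is an assumption about the scale of `match_weight`, not something confirmed here.

```{r sketch-weights}
# Illustrative parameters for the observed level of each comparison.
m <- c(first_name = 0.90, surname = 0.85, dob = 0.95)
u <- c(first_name = 0.02, surname = 0.01, dob = 0.001)
prior <- 0.001   # assumed probability that a random pair is a match

# Evidence-only weight: sum of log2 Bayes factors (base 2 is an assumption)
match_weight <- sum(log2(m / u))

# Prior-inclusive weight: add the prior log-odds
total_match_weight <- match_weight + log2(prior / (1 - prior))

# Posterior probability recovered from the log2 odds
match_probability <- 2^total_match_weight / (1 + 2^total_match_weight)

c(match_weight = match_weight,
  total_match_weight = total_match_weight,
  match_probability = match_probability)
```

Strong evidence on every field can still be outweighed by a very small prior, which is why the evidence-only and prior-inclusive weights are reported separately.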

## Step 5: Cluster

`il_cluster()` resolves pairwise predictions into entity clusters with connected-components analysis:

```{r cluster}
clusters <- il_cluster(pairs)
head(clusters)
```

Each record is assigned a `cluster_id`.
Records in the same cluster are treated as the same entity.
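
Connected components can be illustrated with a small union-find in base R. This is a sketch on made-up links, not the routine `il_cluster()` uses, and labeling each group by its smallest record id is an arbitrary choice for the example.

```{r sketch-components}
# Toy links between record ids; illustrative values only.
links <- data.frame(left = c(1, 2, 5), right = c(2, 3, 6))
ids <- 1:7

# Union-find: each record starts as its own root.
parent <- ids
find_root <- function(i) {
  while (parent[i] != i) i <- parent[i]
  i
}
for (k in seq_len(nrow(links))) {
  a <- find_root(links$left[k])
  b <- find_root(links$right[k])
  # Attach the larger root to the smaller so the smallest id labels the group
  if (a != b) parent[max(a, b)] <- min(a, b)
}

toy_clusters <- data.frame(
  id = ids,
  cluster_id = vapply(ids, find_root, numeric(1))
)
toy_clusters
```

Records 1 and 3 end up in the same cluster even though they were never linked directly, because both are linked to record 2.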

## Comparison levels

`irelink` provides comparison-level functions for a range of field types:

| Level | Use case |
|-------|----------|
| `cl_exact()` | Binary exact match |
| `cl_jaro_winkler()` | Names, short strings |
| `cl_levenshtein()` | General fuzzy strings |
| `cl_damerau_levenshtein()` | Strings with transpositions |
| `cl_jaro()` | Lightweight string similarity |
| `cl_jaccard()` | Token-set overlap |
| `cl_cosine()` | Embedding similarity |
| `cl_numeric_diff()` | Numeric fields (e.g., age) |
| `cl_pct_diff()` | Percentage difference |
| `cl_date_diff()` | Date fields |
| `cl_time_diff()` | Time fields |
| `cl_geo_distance()` | Geographic coordinates |
| `cl_array_intersect()` | Array or set overlap |

Domain-specific helpers bundle several levels into a single call:

| Helper | Fields |
|--------|--------|
| `cl_name()` | Generic name field |
| `cl_first_last_name()` | First name and last name as separate fields |
| `cl_forename_surname()` | Forename and surname with transposition |
| `cl_dob()` | Date of birth |
| `cl_email()` | Email addresses |
| `cl_postcode()` | UK postal codes |
| `cl_zip_code()` | US ZIP codes |

## Evaluation

If you have labeled data, meaning pairs that are known matches or non-matches, `irelink` provides tools to assess model quality:

- `il_accuracy()`: overall accuracy at a threshold
- `il_precision_recall()`: precision and recall across thresholds
- `il_roc()`: ROC curve data
- `il_errors()`: inspect false positives and false negatives
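
To show what these metrics measure, the base R sketch below computes precision, recall, and accuracy from a toy table of scored, labeled pairs. The data and the `metrics_at()` helper are hypothetical, and the sketch makes no claim about the arguments or return values of the `irelink` functions listed above.

```{r sketch-metrics}
# Toy scored pairs with known labels; illustrative values only.
scored <- data.frame(
  match_probability = c(0.98, 0.91, 0.75, 0.62, 0.40, 0.15, 0.05),
  is_match          = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE)
)

# Hypothetical helper: confusion counts and metrics at one threshold.
metrics_at <- function(scored, threshold) {
  predicted <- scored$match_probability >= threshold
  tp <- sum(predicted & scored$is_match)
  fp <- sum(predicted & !scored$is_match)
  fn <- sum(!predicted & scored$is_match)
  tn <- sum(!predicted & !scored$is_match)
  c(threshold = threshold,
    precision = tp / (tp + fp),
    recall = tp / (tp + fn),
    accuracy = (tp + tn) / nrow(scored))
}

# Sweep thresholds to see the precision/recall trade-off
t(sapply(c(0.3, 0.5, 0.9), metrics_at, scored = scored))
```

Raising the threshold trades recall for precision, so the right cut-off depends on whether missed matches or false links are more costly for your task.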

## Cleaning up

When you are done, release the database resources owned by the model.
If an interactive session has left behind models that you no longer hold a reference to, call `il_cleanup_all(con)` before disconnecting to drop every `irelink` table on the connection.

```{r cleanup}
il_cleanup(model)
DBI::dbDisconnect(con, shutdown = TRUE)
```