---
title: "Translating from Splink"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Translating from Splink}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
`irelink` translates the Python [Splink](https://github.com/moj-analytical-services/splink) library into idiomatic R.
This vignette maps common Splink patterns to `irelink` so you can get started quickly.
## Design differences
Splink uses an object-oriented design centered on a `Linker` class.
`irelink` uses a functional pipeline that fits naturally in R.
The `Linker` object's namespaced methods such as `linker.training.*` and `linker.inference.*` become standalone functions that accept and return an `il_model` object.
Splink bundles comparison levels into high-level comparison classes such as `JaroWinklerAtThresholds`.
In `irelink`, the `cl_*()` functions fill the same role and can be passed directly to `il_compare()`.
## Core workflow
| Step | splink (Python) | irelink (R) |
|------|----------------|-------------|
| Load data | `splink_datasets.fake_1000` | `fake_1000` |
| Choose backend | `DuckDBAPI()` | `DBI::dbConnect(duckdb::duckdb())` |
| Define settings | `SettingsCreator(...)` | `il_spec() |> il_compare(...) |> il_block_on(...)` |
| Create model | `Linker(df, settings, db_api)` | `il_model(df, spec = spec, con = con)` |
| Estimate prior | `linker.training.estimate_probability_two_random_records_match(...)` | `il_estimate_prior(model, ...)` |
| Estimate u | `linker.training.estimate_u_using_random_sampling(...)` | `il_estimate_u(model)` |
| Estimate m (EM) | `linker.training.estimate_parameters_using_expectation_maximisation(...)` | `il_estimate_em(model, ...)` |
| Estimate m (labels) | `linker.training.estimate_m_from_pairwise_labels(...)` | `il_estimate_m_from_labels(model, ...)` |
| Predict | `linker.inference.predict(...)` | `predict(model, ...)` |
| Cluster | `linker.clustering.cluster_pairwise_predictions_at_threshold(...)` | `il_cluster(pairs)` |
| Deterministic link | `linker.inference.deterministic_link()` | `il_deterministic_link(df, ...)` |
| Find matches | `linker.inference.find_matches_to_new_records(...)` | `il_find_matches(model, new_records, ...)` |
`irelink` also supports `link_type = "link_and_dedupe"` for two-table jobs where duplicates may exist within each input table and across the two tables.
## Comparison levels
Comparison levels are the building blocks used to score how similar two records are on a field.
Each `cl_*()` function corresponds to a Splink comparison level class.
| splink (Python) | irelink (R) |
|-----------------|-------------|
| `ExactMatchLevel` | `cl_exact()` |
| `LevenshteinLevel` | `cl_levenshtein()` |
| `DamerauLevenshteinLevel` | `cl_damerau_levenshtein()` |
| `JaroLevel` | `cl_jaro()` |
| `JaroWinklerLevel` | `cl_jaro_winkler()` |
| `JaccardLevel` | `cl_jaccard()` |
| `CosineSimilarityLevel` | `cl_cosine()` |
| `AbsoluteDifferenceLevel` | `cl_numeric_diff()` |
| `PercentageDifferenceLevel` | `cl_pct_diff()` |
| `AbsoluteTimeDifferenceLevel` | `cl_date_diff()` |
| `DistanceInKMLevel` | `cl_geo_distance()` |
| `ArrayIntersectLevel` | `cl_array_intersect()` |
| `CustomLevel` | `cl_custom()` |
| `NullLevel` | `cl_null()` |
| `ElseLevel` | `cl_else()` |
| `And` | `cl_and()` |
| `Or` | `cl_or()` |
| `Not` | `cl_not()` |
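
To make the idea of a comparison level concrete, here is a minimal base R sketch of how a pair of values might be assigned to exactly one level, checking from most to least specific. The helper `assign_level()` and its `max_edits` cutoff are hypothetical and exist only for illustration; this is not how `irelink` or Splink evaluate levels internally.

```r
# Hypothetical helper (not part of irelink or Splink): assign one comparison
# level to a pair of values, checking levels from most to least specific.
assign_level <- function(a, b, max_edits = 2) {
  if (is.na(a) || is.na(b)) {
    return("null")          # plays the role of NullLevel / cl_null()
  }
  if (a == b) {
    return("exact")         # plays the role of ExactMatchLevel / cl_exact()
  }
  # utils::adist() computes the generalized Levenshtein edit distance
  if (utils::adist(a, b)[1, 1] <= max_edits) {
    return("levenshtein")   # plays the role of LevenshteinLevel / cl_levenshtein()
  }
  "else"                    # plays the role of ElseLevel / cl_else()
}

assign_level("John", "John")  # "exact"
assign_level("John", "Jhon")  # "levenshtein" (two substitutions)
assign_level("John", NA)      # "null"
assign_level("John", "Mary")  # "else"
```

The order of the checks matters: each pair falls into the first level whose condition it satisfies, which is why the null and exact checks come before the fuzzy one.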
## Domain-specific comparisons
Splink provides high-level comparison classes for common field types.
In `irelink`, these are helper functions that return preconfigured sets of levels.
| splink (Python) | irelink (R) |
|-----------------|-------------|
| `NameComparison` | `cl_name()` |
| `ForenameSurnameComparison` | `cl_forename_surname()` |
| `DateOfBirthComparison` | `cl_dob()` |
| `EmailComparison` | `cl_email()` |
| `PostcodeComparison` | `cl_postcode()` |
## Model inspection
| splink (Python) | irelink (R) |
|-----------------|-------------|
| `linker.visualisations.match_weights_chart()` | `il_weights(model)` |
| `linker.visualisations.parameter_estimate_comparisons_chart()` | `il_parameters(model)` |
| `linker.visualisations.waterfall_chart(...)` | `il_waterfall(pairs, ...)` |
| `linker.inference.compare_two_records(...)` | `il_compare_records(record_a, record_b, ...)` |
| `linker.evaluation.prediction_errors_from_labels_column(...)` | `il_errors(model, ...)` |
| `linker.evaluation.unlinkables_chart()` | `il_unlinkables(model)` |
## Evaluation
| splink (Python) | irelink (R) |
|-----------------|-------------|
| `linker.evaluation.accuracy_analysis_from_labels_column(..., output_type="accuracy")` | `il_accuracy(model, ...)` |
| `linker.evaluation.accuracy_analysis_from_labels_column(..., output_type="precision_recall")` | `il_precision_recall(model, ...)` |
| `linker.evaluation.accuracy_analysis_from_labels_column(..., output_type="roc")` | `il_roc(model, ...)` |
## Data profiling
| splink (Python) | irelink (R) |
|-----------------|-------------|
| `splink.exploratory.profile_columns(...)` | `il_profile(df, ...)` |
| `splink.blocking_analysis.count_comparisons_from_blocking_rule(...)` | `il_count_pairs(df, ...)` |
| *completeness profiling* | `il_completeness(df, ...)` |
## Persistence
| splink (Python) | irelink (R) |
|-----------------|-------------|
| `linker.misc.save_model_to_json(...)` | `il_save(model, path)` |
| `Linker(df, "model.json", db_api)` | `il_load(path)` |
| `delete_tables_created_by_splink_from_db(...)` | `il_cleanup_all(con)` |
| *model-scoped cleanup* | `il_cleanup(model)` |
## Blocking rules
In Splink, you create blocking rules with `block_on()`, and `irelink` uses the same function name.
The main difference is where the rules are used: Splink passes them into `SettingsCreator`, while `irelink` adds them to a spec with `il_block_on()` or passes them directly to training functions.
```r
# blocking in the spec
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_block_on(surname)

# blocking in EM training
model <- il_estimate_em(model, block_on(surname))
```
## Example: side-by-side deduplication
Below is a minimal deduplication example in both Splink and `irelink`.
**splink (Python):**
```python
from splink import Linker, SettingsCreator, DuckDBAPI, block_on, splink_datasets
import splink.comparison_library as cl

df = splink_datasets.fake_1000
db_api = DuckDBAPI()

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.JaroWinklerAtThresholds("first_name", [0.9, 0.7]),
        cl.JaroWinklerAtThresholds("surname", [0.9, 0.7]),
        cl.ExactMatch("dob"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name"),
        block_on("surname"),
    ],
)

linker = Linker(df, settings, db_api)

# Train: u from random sampling, m via EM blocked on surname
linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("surname")
)

# Score candidate pairs, then cluster them
pairwise = linker.inference.predict(threshold_match_probability=0.5)
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    pairwise, 0.95
)
```
**irelink (R):**
```r
library(irelink)

df <- fake_1000
con <- DBI::dbConnect(duckdb::duckdb())

spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(surname, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(dob, cl_exact()) |>
  il_block_on(first_name) |>
  il_block_on(surname)

model <- il_model(df, spec = spec, con = con)

# Train: u from random sampling, m via EM blocked on surname
model <- il_estimate_u(model)
model <- il_estimate_em(model, block_on(surname))

# Score candidate pairs, then cluster them
pairs <- predict(model, threshold = 0.5)
clusters <- il_cluster(pairs)

# Drop the model's tables and close the connection
il_cleanup(model)
DBI::dbDisconnect(con, shutdown = TRUE)
```
The examples above use probability thresholds because those transfer cleanly between Splink and `irelink`.
In Splink, prediction `match_weight` includes the prior odds.
In `irelink`, `match_weight` is evidence only, and `total_match_weight` is the prior-inclusive log2 odds.
Keep that difference in mind if you translate match-weight thresholds between the two packages.
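
The relationship between the two kinds of weight can be sketched in base R. This assumes, as described above, that the prior-inclusive weight is the evidence-only weight plus the log2 prior odds. The helper names and the prior of 0.001 are illustrative and not part of either package.

```r
# Hypothetical helpers (not part of irelink or Splink).

# log2 odds implied by the prior probability that two random records match
prior_to_weight <- function(prior) log2(prior / (1 - prior))

# evidence-only weight -> prior-inclusive weight
total_from_evidence <- function(evidence_weight, prior) {
  evidence_weight + prior_to_weight(prior)
}

# prior-inclusive weight -> evidence-only weight
evidence_from_total <- function(total_weight, prior) {
  total_weight - prior_to_weight(prior)
}

prior <- 0.001                 # illustrative value only
prior_to_weight(prior)         # about -9.96

# A prior-inclusive threshold of 5 corresponds to an
# evidence-only threshold of roughly 14.96 under this prior.
evidence_from_total(5, prior)
```

Because the offset depends on the estimated prior, the same evidence-only threshold corresponds to different prior-inclusive thresholds in different models.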
## Example: finding matches against new records
**splink (Python):**
```python
import pandas as pd

new_records = pd.DataFrame([{
    "first_name": "Jhon", "surname": "Smith", "dob": "1990-01-15"
}])
results = linker.inference.find_matches_to_new_records(
    new_records, blocking_rules=[], match_weight_threshold=-10
)
```
**irelink (R):**
```r
new_df <- data.frame(
  first_name = "Jhon",
  surname = "Smith",
  dob = "1990-01-15"
)
results <- il_find_matches(model, new_df, threshold = 0.5)
```
Splink uses a match-weight threshold for this workflow.
`il_find_matches()` filters on posterior match probability.
Translate those thresholds with the same caution as in the prediction example above.
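
As a rough guide, a prior-inclusive match weight can be converted to a posterior probability, and back, with the standard log2-odds formulas. The base R sketch below assumes the weight being converted already includes the prior, as Splink's does. The helper names are hypothetical.

```r
# Hypothetical helpers (not part of irelink or Splink).

# prior-inclusive log2 odds -> posterior match probability
weight_to_probability <- function(w) 2^w / (1 + 2^w)

# posterior match probability -> prior-inclusive log2 odds
probability_to_weight <- function(p) log2(p / (1 - p))

# The Splink threshold of -10 above is a very permissive filter:
weight_to_probability(-10)   # 1/1025, about 0.00098

# whereas a probability threshold of 0.5 is a match weight of 0:
probability_to_weight(0.5)   # 0
```

So the two calls above are not equivalent filters: the R call with `threshold = 0.5` keeps far fewer candidates than the Python call with `match_weight_threshold=-10`.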