Translating from Splink

irelink brings the Python library Splink to R with an idiomatic R interface. This vignette maps common Splink patterns to their irelink equivalents so you can get started quickly.

Design differences

Splink uses an object-oriented design centered on a Linker class. irelink uses a functional pipeline that fits naturally in R. The Linker object’s namespaced methods, such as linker.training.* and linker.inference.*, become standalone functions. Most take an il_model object as their first argument. The training functions return the updated model, as in model <- il_estimate_u(model), while predict() returns the scored pairs.

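Splink bundles comparison levels into high-level comparison classes such as JaroWinklerAtThresholds. In irelink, the cl_*() functions fill the same role and can be passed directly to il_compare(). For example, cl_jaro_winkler(0.9, 0.7) corresponds to JaroWinklerAtThresholds with thresholds [0.9, 0.7].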

Core workflow

Step	splink (Python)	irelink (R)
Load data	`splink_datasets.fake_1000`	`fake_1000`
Choose backend	`DuckDBAPI()`	`DBI::dbConnect(duckdb::duckdb())`
Define settings	`SettingsCreator(...)`	`il_spec() |> il_compare(...) |> il_block_on(...)`
Create model	`Linker(df, settings, db_api)`	`il_model(df, spec = spec, con = con)`
Estimate prior	`linker.training.estimate_probability_two_random_records_match(...)`	`il_estimate_prior(model, ...)`
Estimate u	`linker.training.estimate_u_using_random_sampling(...)`	`il_estimate_u(model)`
Estimate m (EM)	`linker.training.estimate_parameters_using_expectation_maximisation(...)`	`il_estimate_em(model, ...)`
Estimate m (labels)	`linker.training.estimate_m_from_pairwise_labels(...)`	`il_estimate_m_from_labels(model, ...)`
Predict	`linker.inference.predict(...)`	`predict(model, ...)`
Cluster	`linker.clustering.cluster_pairwise_predictions_at_threshold(...)`	`il_cluster(pairs)`
Deterministic link	`linker.inference.deterministic_link()`	`il_deterministic_link(df, ...)`
Find matches	`linker.inference.find_matches_to_new_records(...)`	`il_find_matches(model, new_records, ...)`

irelink also supports link_type = "link_and_dedupe" for two-table jobs where duplicates may exist within each input table and across the two tables.
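To make the link types concrete, here is a minimal sketch in plain Python, independent of both packages, of which record pairs each link type considers before any blocking is applied. The names "dedupe_only", "link_only" and "link_and_dedupe" are Splink's link types. The function `candidate_pairs` is hypothetical, written only for illustration, and the sketch does not assume that irelink accepts "link_only".

```python
from itertools import combinations, product


def candidate_pairs(link_type, left, right=None):
    """Enumerate the record pairs a link type considers (before blocking)."""
    if link_type == "dedupe_only":
        # One table: compare every record with every other record once.
        return list(combinations(left, 2))
    if link_type == "link_only":
        # Two tables: compare across the tables only.
        return list(product(left, right))
    if link_type == "link_and_dedupe":
        # Two tables: compare within each table and across them.
        return list(combinations(list(left) + list(right), 2))
    raise ValueError(f"unknown link_type: {link_type!r}")


left = ["a1", "a2", "a3"]
right = ["b1", "b2"]

print(len(candidate_pairs("dedupe_only", left)))             # pairs within one table
print(len(candidate_pairs("link_only", left, right)))        # pairs across tables
print(len(candidate_pairs("link_and_dedupe", left, right)))  # both kinds of pairs
```

With three records on the left and two on the right, "link_and_dedupe" considers every pair that "link_only" does, plus the within-table pairs.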

Comparison levels

Comparison levels are the building blocks used to score how similar two records are on a field. Each cl_*() function corresponds to a Splink comparison level class.

splink (Python)	irelink (R)
`ExactMatchLevel`	`cl_exact()`
`LevenshteinLevel`	`cl_levenshtein()`
`DamerauLevenshteinLevel`	`cl_damerau_levenshtein()`
`JaroLevel`	`cl_jaro()`
`JaroWinklerLevel`	`cl_jaro_winkler()`
`JaccardLevel`	`cl_jaccard()`
`CosineSimilarityLevel`	`cl_cosine()`
`AbsoluteDifferenceLevel`	`cl_numeric_diff()`
`PercentageDifferenceLevel`	`cl_pct_diff()`
`AbsoluteTimeDifferenceLevel`	`cl_date_diff()`
`DistanceInKMLevel`	`cl_geo_distance()`
`ArrayIntersectLevel`	`cl_array_intersect()`
`CustomLevel`	`cl_custom()`
`NullLevel`	`cl_null()`
`ElseLevel`	`cl_else()`
`And`	`cl_and()`
`Or`	`cl_or()`
`Not`	`cl_not()`
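As a rough illustration of how a set of levels scores one field, the sketch below assigns a pair of values to the first level whose condition holds, checking levels from strictest to loosest. It is plain Python and does not call either package. `assign_level` is a hypothetical helper, and `difflib.SequenceMatcher` is used as a stand-in similarity score, not as Jaro-Winkler.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in similarity in [0, 1]. This is NOT Jaro-Winkler.
    return SequenceMatcher(None, a, b).ratio()


def assign_level(a, b, thresholds=(0.9, 0.7)):
    """Return the label of the first level whose condition holds."""
    if a is None or b is None:
        return "null"            # analogous to a null level
    if a == b:
        return "exact"           # analogous to an exact-match level
    score = similarity(a, b)
    for t in thresholds:         # ordered from strictest to loosest
        if score >= t:
            return f"similarity >= {t}"
    return "else"                # catch-all level


print(assign_level("Smith", "Smith"))
print(assign_level("Jhon", "John"))
print(assign_level("Smith", None))
```

Because the first matching level wins, the order of the levels matters: a pair that satisfies the 0.9 threshold is never counted in the 0.7 level.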

Domain-specific comparisons

Splink provides high-level comparison classes for common field types. In irelink, these are helper functions that return preconfigured sets of levels.

splink (Python)	irelink (R)
`NameComparison`	`cl_name()`
`ForenameSurnameComparison`	`cl_forename_surname()`
`DateOfBirthComparison`	`cl_dob()`
`EmailComparison`	`cl_email()`
`PostcodeComparison`	`cl_postcode()`

Model inspection

splink (Python)	irelink (R)
`linker.visualisations.match_weights_chart()`	`il_weights(model)`
`linker.visualisations.parameter_estimate_comparisons_chart()`	`il_parameters(model)`
`linker.visualisations.waterfall_chart(...)`	`il_waterfall(pairs, ...)`
`linker.inference.compare_two_records(...)`	`il_compare_records(record_a, record_b, ...)`
`linker.evaluation.prediction_errors_from_labels_column(...)`	`il_errors(model, ...)`
`linker.evaluation.unlinkables_chart()`	`il_unlinkables(model)`

Evaluation

splink (Python)	irelink (R)
`linker.evaluation.accuracy_analysis_from_labels_column(..., output_type="accuracy")`	`il_accuracy(model, ...)`
`linker.evaluation.accuracy_analysis_from_labels_column(..., output_type="precision_recall")`	`il_precision_recall(model, ...)`
`linker.evaluation.accuracy_analysis_from_labels_column(..., output_type="roc")`	`il_roc(model, ...)`

Data profiling

splink (Python)	irelink (R)
`profile_columns(...)` (from `splink.exploratory`)	`il_profile(df, ...)`
`count_comparisons_from_blocking_rule(...)` (from `splink.blocking_analysis`)	`il_count_pairs(df, ...)`
`completeness_chart(...)` (from `splink.exploratory`)	`il_completeness(df, ...)`

Persistence

splink (Python)	irelink (R)
`linker.misc.save_model_to_json(...)`	`il_save(model, path)`
`Linker(df, "path/to/model.json", db_api)`	`il_load(path)`
`delete_tables_created_by_splink_from_db(...)`	`il_cleanup_all(con)`
model-scoped cleanup	`il_cleanup(model)`

Blocking rules

In Splink, you create blocking rules with block_on(), and irelink uses the same function name. The main difference is where the rules are used: Splink passes them into SettingsCreator, while irelink adds them to a spec with il_block_on() or passes them directly to training functions.

# blocking in the spec
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_block_on(surname)

# blocking in EM training
model <- il_estimate_em(model, block_on(surname))
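The effect of a blocking rule can be sketched without either package: only records that share the blocking key become candidate pairs, and several rules together contribute the union of their pairs. The helper `blocked_pairs` and the sample records are hypothetical. Treating multiple rules as a union follows how Splink handles its prediction blocking rules and is assumed here to carry over to il_block_on().

```python
from collections import defaultdict
from itertools import combinations


def blocked_pairs(records, key):
    """Yield pairs of record ids that share a non-null blocking key value."""
    blocks = defaultdict(list)
    for rec in records:
        value = rec.get(key)
        if value is not None:      # nulls never satisfy an equality join
            blocks[value].append(rec["id"])
    for ids in blocks.values():
        yield from combinations(ids, 2)


records = [
    {"id": 1, "first_name": "John", "surname": "Smith"},
    {"id": 2, "first_name": "Jhon", "surname": "Smith"},
    {"id": 3, "first_name": "Mary", "surname": "Jones"},
    {"id": 4, "first_name": "Mary", "surname": None},
]

all_pairs = list(combinations([r["id"] for r in records], 2))
surname_pairs = set(blocked_pairs(records, "surname"))
first_name_pairs = set(blocked_pairs(records, "first_name"))

print(len(all_pairs))                    # every possible pair
print(surname_pairs | first_name_pairs)  # pairs kept by either rule
```

Blocking on surname alone misses the pair (3, 4) because one surname is missing, which is why prediction usually combines more than one rule.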

Example: side-by-side deduplication

Below is a minimal deduplication example in both Splink and irelink.

splink (Python):

from splink import Linker, SettingsCreator, DuckDBAPI, block_on, splink_datasets
import splink.comparison_library as cl

df = splink_datasets.fake_1000
db_api = DuckDBAPI()

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.JaroWinklerAtThresholds("first_name", [0.9, 0.7]),
        cl.JaroWinklerAtThresholds("surname", [0.9, 0.7]),
        cl.ExactMatch("dob"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name"),
        block_on("surname"),
    ],
)

linker = Linker(df, settings, db_api)
linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("surname")
)

pairwise = linker.inference.predict(threshold_match_probability=0.5)
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    pairwise, 0.95
)

irelink (R):

library(irelink)

df <- fake_1000
con <- DBI::dbConnect(duckdb::duckdb())

spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(surname, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(dob, cl_exact()) |>
  il_block_on(first_name) |>
  il_block_on(surname)

model <- il_model(df, spec = spec, con = con)
model <- il_estimate_u(model)
model <- il_estimate_em(model, block_on(surname))

pairs <- predict(model, threshold = 0.5)
clusters <- il_cluster(pairs)

il_cleanup(model)
DBI::dbDisconnect(con, shutdown = TRUE)
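The final clustering step in both examples groups records that are connected, directly or through a chain of other records, by pairs scoring at or above a threshold. The sketch below shows that idea as connected components with a small union-find in plain Python. `cluster_pairs` and the scored pairs are hypothetical, and the sketch assumes that il_cluster() follows the same connected-components approach as Splink.

```python
def cluster_pairs(record_ids, scored_pairs, threshold):
    """Map each record id to a cluster id using connected components."""
    parent = {rid: rid for rid in record_ids}

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for left, right, prob in scored_pairs:
        if prob >= threshold:
            root_l, root_r = find(left), find(right)
            if root_l != root_r:
                # Keep the smaller id as the cluster representative.
                parent[max(root_l, root_r)] = min(root_l, root_r)

    return {rid: find(rid) for rid in record_ids}


# (left id, right id, match probability), hypothetical values
scored = [(1, 2, 0.99), (2, 3, 0.97), (3, 4, 0.60), (5, 6, 0.96)]
clusters = cluster_pairs(range(1, 7), scored, threshold=0.95)
print(clusters)
```

Records 1 and 3 end up in the same cluster even though they were never scored as a pair, because both link to record 2. Record 4 stays on its own because its only pair falls below the threshold.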

The examples above use probability thresholds because those transfer cleanly between Splink and irelink. In Splink, prediction match_weight includes the prior odds. In irelink, match_weight is evidence only, and total_match_weight is the prior-inclusive log2 odds. Keep that difference in mind if you translate match-weight thresholds between the two packages.
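The relationship between the two kinds of weight can be checked with a few lines of arithmetic. This is a sketch in plain Python under the definitions in the paragraph above: the prior-inclusive weight is the log2 prior odds plus the evidence-only weight, and a probability is recovered from log2 odds. The prior of 0.001 and the evidence weight of 12 are hypothetical values chosen for illustration.

```python
import math


def prob_to_weight(p):
    """Convert a probability to log2 odds."""
    return math.log2(p / (1 - p))


def weight_to_prob(w):
    """Convert log2 odds back to a probability."""
    odds = 2.0 ** w
    return odds / (1.0 + odds)


prior = 0.001            # hypothetical probability that two random records match
evidence_weight = 12.0   # hypothetical sum of log2 Bayes factors (evidence only)

prior_weight = prob_to_weight(prior)
total_weight = prior_weight + evidence_weight   # prior-inclusive log2 odds
posterior = weight_to_prob(total_weight)

print(f"prior weight:    {prior_weight:.3f}")
print(f"total weight:    {total_weight:.3f}")
print(f"posterior prob.: {posterior:.4f}")
```

With a small prior, an evidence-only weight of 12 shrinks to a total weight of about 2, so applying a threshold chosen on one scale to the other scale would select a very different set of pairs.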

Example: finding matches against new records

splink (Python):

import pandas as pd

new_records = pd.DataFrame([{
    "first_name": "Jhon", "surname": "Smith", "dob": "1990-01-15"
}])
results = linker.inference.find_matches_to_new_records(
    new_records, blocking_rules=[], match_weight_threshold=-10
)

irelink (R):

new_df <- data.frame(
  first_name = "Jhon",
  surname = "Smith",
  dob = "1990-01-15"
)
results <- il_find_matches(model, new_df, threshold = 0.5)

Splink uses a match-weight threshold for this workflow. il_find_matches() filters on posterior match probability. Translate those thresholds with the same caution as in the prediction example above.
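To see how far apart the two thresholds in this example are, a prior-inclusive match weight can be converted to a probability. The sketch is plain Python and assumes only that the weight is log2 posterior odds, as described for Splink above. `match_weight_to_probability` is a hypothetical helper, not part of either package.

```python
def match_weight_to_probability(weight):
    """Convert a prior-inclusive match weight (log2 odds) to a probability."""
    odds = 2.0 ** weight
    return odds / (1.0 + odds)


for weight in (-10, 0, 5):
    prob = match_weight_to_probability(weight)
    print(f"match weight {weight:>4} -> probability {prob:.6f}")
```

A match weight of -10 corresponds to a probability of roughly 0.001, while a probability of 0.5 corresponds to a match weight of 0. The Splink call above therefore keeps far more candidates than the irelink call with threshold = 0.5.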