irelink translates the Python splink library into idiomatic R. This vignette maps common Splink patterns to
irelink so you can get started quickly.

Splink uses an object-oriented design centered on a
Linker class. irelink uses a functional
pipeline that fits naturally in R. The Linker object’s
namespaced methods such as linker.training.* and
linker.inference.* become standalone functions that accept
and return an il_model object.

Splink bundles comparison levels into high-level comparison classes
such as JaroWinklerAtThresholds. In irelink,
the cl_*() functions fill the same role and can be passed
directly to il_compare().

| Step | splink (Python) | irelink (R) |
|---|---|---|
| Load data | splink_datasets.fake_1000 | fake_1000 |
| Choose backend | DuckDBAPI() | DBI::dbConnect(duckdb::duckdb()) |
| Define settings | SettingsCreator(...) | il_spec() \|> il_compare(...) \|> il_block_on(...) |
| Create model | Linker(df, settings, db_api) | il_model(df, spec = spec, con = con) |
| Estimate prior | linker.training.estimate_probability_two_random_records_match(...) | il_estimate_prior(model, ...) |
| Estimate u | linker.training.estimate_u_using_random_sampling(...) | il_estimate_u(model) |
| Estimate m (EM) | linker.training.estimate_parameters_using_expectation_maximisation(...) | il_estimate_em(model, ...) |
| Estimate m (labels) | linker.training.estimate_m_from_pairwise_labels(...) | il_estimate_m_from_labels(model, ...) |
| Predict | linker.inference.predict(...) | predict(model, ...) |
| Cluster | linker.clustering.cluster_pairwise_predictions_at_threshold(...) | il_cluster(pairs) |
| Deterministic link | linker.deterministic_link() | il_deterministic_link(df, ...) |
| Find matches | linker.inference.find_matches_to_new_records(...) | il_find_matches(model, new_records, ...) |

irelink also supports
link_type = "link_and_dedupe" for two-table jobs where
duplicates may exist within each input table and across the two
tables.
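
To make the link types concrete, the sketch below counts which record pairs each one would consider for two tiny tables. It is plain Python and uses neither package: the function name candidate_pairs and the table layout are illustrative assumptions, and the dedupe_only and link_only behaviour shown is the usual Splink meaning of those link types rather than anything confirmed for irelink.

```python
from itertools import combinations

def candidate_pairs(tables, link_type):
    """Enumerate unordered record pairs considered under a link type.

    `tables` maps a table name to a list of record ids (hypothetical layout).
    """
    if link_type not in ("dedupe_only", "link_only", "link_and_dedupe"):
        raise ValueError(f"unknown link_type: {link_type}")
    records = [(name, rid) for name, ids in tables.items() for rid in ids]
    pairs = []
    for left, right in combinations(records, 2):
        same_table = left[0] == right[0]
        if link_type == "dedupe_only" and not same_table:
            continue  # within-table pairs only
        if link_type == "link_only" and same_table:
            continue  # cross-table pairs only
        pairs.append((left, right))  # link_and_dedupe keeps both kinds
    return pairs

tables = {"a": [1, 2], "b": [1, 2, 3]}
for link_type in ("dedupe_only", "link_only", "link_and_dedupe"):
    print(link_type, len(candidate_pairs(tables, link_type)))
```

With two records in one table and three in the other, the within-table and cross-table counts add up to the link_and_dedupe count, which is why that link type is the most expensive of the three.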
Comparison levels are the building blocks used to score how similar
two records are on a field. Each cl_*() function
corresponds to a Splink comparison level class.

| splink (Python) | irelink (R) |
|---|---|
| ExactMatchLevel | cl_exact() |
| LevenshteinLevel | cl_levenshtein() |
| DamerauLevenshteinLevel | cl_damerau_levenshtein() |
| JaroLevel | cl_jaro() |
| JaroWinklerLevel | cl_jaro_winkler() |
| JaccardLevel | cl_jaccard() |
| CosineSimilarityLevel | cl_cosine() |
| AbsoluteDifferenceLevel | cl_numeric_diff() |
| PercentageDifferenceLevel | cl_pct_diff() |
| AbsoluteTimeDifferenceAtThresholds | cl_date_diff() |
| DistanceInKMLevel | cl_geo_distance() |
| ArrayIntersectLevel | cl_array_intersect() |
| CustomLevel | cl_custom() |
| NullLevel | cl_null() |
| ElseLevel | cl_else() |
| And | cl_and() |
| Or | cl_or() |
| Not | cl_not() |

Splink provides high-level comparison classes for common field types.
In irelink, these are helper functions that return
preconfigured sets of levels.

| splink (Python) | irelink (R) |
|---|---|
| NameComparison | cl_name() |
| ForenameSurnameComparison | cl_forename_surname() |
| DateOfBirthComparison | cl_dob() |
| EmailComparison | cl_email() |
| PostcodeComparison | cl_postcode() |


| splink (Python) | irelink (R) |
|---|---|
| linker.visualisations.match_weights_chart() | il_weights(model) |
| linker.visualisations.parameter_estimate_comparisons_chart() | il_parameters(model) |
| linker.visualisations.waterfall_chart(...) | il_waterfall(pairs, ...) |
| linker.misc.query_comparison_details(...) | il_compare_records(record_a, record_b, ...) |
| linker.training.prediction_errors_from_labels_column(...) | il_errors(model, ...) |
| linker.evaluation.unlinkables_chart() | il_unlinkables(model) |


| splink (Python) | irelink (R) |
|---|---|
| linker.evaluation.accuracy_chart_from_labels_column(...) | il_accuracy(model, ...) |
| linker.evaluation.precision_recall_chart_from_labels_column(...) | il_precision_recall(model, ...) |
| linker.evaluation.roc_chart_from_labels_column(...) | il_roc(model, ...) |


| splink (Python) | irelink (R) |
|---|---|
| linker.profile_columns(...) | il_profile(df, ...) |
| linker.count_num_comparisons_from_blocking_rule(...) | il_count_pairs(df, ...) |
| completeness profiling | il_completeness(df, ...) |


| splink (Python) | irelink (R) |
|---|---|
| linker.misc.save_model_to_json(...) | il_save(model, path) |
| load_model_from_json(...) | il_load(path) |
| delete_tables_created_by_splink_from_db(...) | il_cleanup_all(con) |
| model-scoped cleanup | il_cleanup(model) |

In Splink, you create blocking rules with block_on(),
and irelink uses the same function name. The main
difference is where the rules are used: Splink passes them into
SettingsCreator, while irelink adds them to a
spec with il_block_on() or passes them directly to training
functions.
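
As a rough illustration of what a rule such as block_on("surname") does, the sketch below groups records by the blocking key and pairs only records that fall in the same group. It is plain Python and independent of both packages: the helper blocked_pairs and the sample records are made up for illustration, dropping records with a missing key is an assumption that mirrors SQL equality semantics, and treating several rules as a union follows Splink's behaviour for prediction blocking rules.

```python
from collections import defaultdict
from itertools import combinations

def blocked_pairs(records, key):
    """Return id pairs of records that share the same non-missing `key` value."""
    groups = defaultdict(list)
    for rec in records:
        value = rec.get(key)
        if value is not None:  # assumption: missing keys never match, as in SQL
            groups[value].append(rec["id"])
    pairs = []
    for ids in groups.values():
        pairs.extend(combinations(ids, 2))
    return pairs

# Hypothetical sample data, not taken from fake_1000.
records = [
    {"id": 1, "first_name": "John", "surname": "Smith"},
    {"id": 2, "first_name": "Jhon", "surname": "Smith"},
    {"id": 3, "first_name": "John", "surname": "Smyth"},
    {"id": 4, "first_name": "Mary", "surname": None},
]

by_surname = set(blocked_pairs(records, "surname"))
by_first_name = set(blocked_pairs(records, "first_name"))
all_pairs = len(records) * (len(records) - 1) // 2

# Several rules act as a union: a pair is compared if any rule admits it.
candidates = by_surname | by_first_name
print(sorted(candidates), "of", all_pairs, "possible pairs")
```

Each rule on its own misses pairs that the other one catches, which is why the examples in this vignette block on both first_name and surname.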

Below is a minimal deduplication example in both Splink and
irelink.

splink (Python):

```python
from splink import Linker, SettingsCreator, DuckDBAPI, block_on, splink_datasets
import splink.comparison_library as cl

df = splink_datasets.fake_1000
db_api = DuckDBAPI()

# Comparisons and blocking rules are bundled into one settings object.
settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.JaroWinklerAtThresholds("first_name", [0.9, 0.7]),
        cl.JaroWinklerAtThresholds("surname", [0.9, 0.7]),
        cl.ExactMatch("dob"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name"),
        block_on("surname"),
    ],
)

# Train: u from random sampling, then m via EM within a surname block.
linker = Linker(df, settings, db_api)
linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("surname")
)

# Score candidate pairs, then cluster them at a probability threshold.
pairwise = linker.inference.predict(threshold_match_probability=0.5)
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    pairwise, 0.95
)
```

irelink (R):

```r
library(irelink)

df <- fake_1000
con <- DBI::dbConnect(duckdb::duckdb())

# Build the spec step by step: comparisons first, then blocking rules.
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(surname, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(dob, cl_exact()) |>
  il_block_on(first_name) |>
  il_block_on(surname)

# Train: each estimation function takes a model and returns the updated model.
model <- il_model(df, spec = spec, con = con)
model <- il_estimate_u(model)
model <- il_estimate_em(model, block_on(surname))

# Score candidate pairs, then cluster them.
pairs <- predict(model, threshold = 0.5)
clusters <- il_cluster(pairs)

# Drop the model's tables and close the connection.
il_cleanup(model)
DBI::dbDisconnect(con, shutdown = TRUE)
```

The examples above use probability thresholds because those transfer
cleanly between Splink and irelink. In Splink, prediction
match_weight includes the prior odds. In
irelink, match_weight is evidence only, and
total_match_weight is the prior-inclusive log2 odds. Keep
that difference in mind if you translate match-weight thresholds between
the two packages.
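
The relationship between the two weights can be written out directly. The sketch below is plain Python rather than either package's API. It assumes the standard log-odds form of Bayes' rule, in which the prior-inclusive weight is the evidence-only weight plus log2 of the prior odds; the function names and the example numbers are illustrative.

```python
import math

def prior_weight(prior_probability):
    """log2 of the prior odds that two random records match."""
    return math.log2(prior_probability / (1.0 - prior_probability))

def total_weight(evidence_weight, prior_probability):
    """Prior-inclusive log2 odds: evidence-only weight plus the prior weight."""
    return evidence_weight + prior_weight(prior_probability)

def weight_to_probability(total):
    """Convert prior-inclusive log2 odds into a match probability."""
    odds = 2.0 ** total
    return odds / (1.0 + odds)

# Illustrative numbers: a prior of 1 in 1025 gives prior odds of 1/1024,
# which is a prior weight of about -10.
prior = 1 / 1025
evidence = 12.0  # evidence-only weight, made up for this example
total = total_weight(evidence, prior)  # roughly 12 + (-10) = 2

# Odds of about 4 to 1 correspond to a probability of about 0.8.
print(total, weight_to_probability(total))
```

Under this reading, a Splink match_weight threshold lines up with an irelink total_match_weight threshold, not with an irelink match_weight threshold.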

splink (Python):

```python
import pandas as pd

# A single incoming record to match against the data already in the linker.
new_records = pd.DataFrame([{
    "first_name": "Jhon", "surname": "Smith", "dob": "1990-01-15"
}])
results = linker.inference.find_matches_to_new_records(
    new_records, blocking_rules=[], match_weight_threshold=-10
)
```

irelink (R):

```r
new_df <- data.frame(
  first_name = "Jhon",
  surname = "Smith",
  dob = "1990-01-15"
)
results <- il_find_matches(model, new_df, threshold = 0.5)
```

Splink uses a match-weight threshold for this workflow.
il_find_matches() filters on posterior match probability.
Translate those thresholds with the same caution as in the prediction
example above.
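
For reference, a prior-inclusive match weight w and a posterior match probability p are related by p = 2^w / (1 + 2^w). The plain-Python sketch below converts in both directions. The helper names are illustrative and belong to neither package, and the conversion assumes the weight being thresholded already includes the prior odds, as described above for Splink.

```python
import math

def weight_to_probability(weight):
    """Posterior probability implied by a prior-inclusive log2-odds weight."""
    return 1.0 / (1.0 + 2.0 ** (-weight))

def probability_to_weight(probability):
    """Prior-inclusive log2-odds weight implied by a posterior probability."""
    return math.log2(probability / (1.0 - probability))

# The Splink call above uses match_weight_threshold=-10,
# which is a probability of 1/1025, a little under 0.001.
print(weight_to_probability(-10))

# The irelink call above uses threshold = 0.5, which is even odds, a weight of 0.
print(probability_to_weight(0.5))
```

The two example calls are therefore not equivalent filters: a probability threshold of 0.5 is far stricter than a match-weight threshold of -10.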