---
title: "Translating from Splink"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Translating from Splink}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = '#>'
)
```

`irelink` translates the Python [splink](https://github.com/moj-analytical-services/splink) library into idiomatic R. This vignette maps common Splink patterns to `irelink` so you can get started quickly.

## Design differences

Splink uses an object-oriented design centered on a `Linker` class. `irelink` uses a functional pipeline that fits naturally in R. The `Linker` object's namespaced methods such as `linker.training.*` and `linker.inference.*` become standalone functions that accept and return an `il_model` object.

Splink bundles comparison levels into high-level comparison classes such as `JaroWinklerAtThresholds`. In `irelink`, the `cl_*()` functions fill the same role and can be passed directly to `il_compare()`.

## Core workflow

| Step | splink (Python) | irelink (R) |
|------|----------------|-------------|
| Load data | `splink_datasets.fake_1000` | `fake_1000` |
| Choose backend | `DuckDBAPI()` | `DBI::dbConnect(duckdb::duckdb())` |
| Define settings | `SettingsCreator(...)` | `il_spec() |> il_compare(...) |> il_block_on(...)` |
| Create model | `Linker(df, settings, db_api)` | `il_model(df, spec = spec, con = con)` |
| Estimate prior | `linker.training.estimate_probability_two_random_records_match(...)` | `il_estimate_prior(model, ...)` |
| Estimate u | `linker.training.estimate_u_using_random_sampling(...)` | `il_estimate_u(model)` |
| Estimate m (EM) | `linker.training.estimate_parameters_using_expectation_maximisation(...)` | `il_estimate_em(model, ...)` |
| Estimate m (labels) | `linker.training.estimate_m_from_pairwise_labels(...)` | `il_estimate_m_from_labels(model, ...)` |
| Predict | `linker.inference.predict(...)` | `predict(model, ...)` |
| Cluster | `linker.clustering.cluster_pairwise_predictions_at_threshold(...)` | `il_cluster(pairs)` |
| Deterministic link | `linker.inference.deterministic_link()` | `il_deterministic_link(df, ...)` |
| Find matches | `linker.inference.find_matches_to_new_records(...)` | `il_find_matches(model, new_records, ...)` |

`irelink` also supports `link_type = "link_and_dedupe"` for two-table jobs where duplicates may exist within each input table and across the two tables.

## Comparison levels

Comparison levels are the building blocks used to score how similar two records are on a field. Each `cl_*()` function corresponds to a Splink comparison level class.

| splink (Python) | irelink (R) |
|-----------------|-------------|
| `ExactMatchLevel` | `cl_exact()` |
| `LevenshteinLevel` | `cl_levenshtein()` |
| `DamerauLevenshteinLevel` | `cl_damerau_levenshtein()` |
| `JaroLevel` | `cl_jaro()` |
| `JaroWinklerLevel` | `cl_jaro_winkler()` |
| `JaccardLevel` | `cl_jaccard()` |
| `CosineSimilarityLevel` | `cl_cosine()` |
| `AbsoluteDifferenceLevel` | `cl_numeric_diff()` |
| `PercentageDifferenceLevel` | `cl_pct_diff()` |
| `AbsoluteTimeDifferenceAtThresholds` | `cl_date_diff()` |
| `DistanceInKMLevel` | `cl_geo_distance()` |
| `ArrayIntersectLevel` | `cl_array_intersect()` |
| `CustomLevel` | `cl_custom()` |
| `NullLevel` | `cl_null()` |
| `ElseLevel` | `cl_else()` |
| `And` | `cl_and()` |
| `Or` | `cl_or()` |
| `Not` | `cl_not()` |

## Domain-specific comparisons

Splink provides high-level comparison classes for common field types. In `irelink`, these are helper functions that return preconfigured sets of levels.

| splink (Python) | irelink (R) |
|-----------------|-------------|
| `NameComparison` | `cl_name()` |
| `ForenameSurnameComparison` | `cl_forename_surname()` |
| `DateOfBirthComparison` | `cl_dob()` |
| `EmailComparison` | `cl_email()` |
| `PostcodeComparison` | `cl_postcode()` |

## Model inspection

| splink (Python) | irelink (R) |
|-----------------|-------------|
| `linker.visualisations.match_weights_chart()` | `il_weights(model)` |
| `linker.visualisations.parameter_estimate_comparisons_chart()` | `il_parameters(model)` |
| `linker.visualisations.waterfall_chart(...)` | `il_waterfall(pairs, ...)` |
| `linker.misc.query_comparison_details(...)` | `il_compare_records(record_a, record_b, ...)` |
| `linker.evaluation.prediction_errors_from_labels_column(...)` | `il_errors(model, ...)` |
| `linker.evaluation.unlinkables_chart()` | `il_unlinkables(model)` |

## Evaluation

| splink (Python) | irelink (R) |
|-----------------|-------------|
| `linker.evaluation.accuracy_chart_from_labels_column(...)` | `il_accuracy(model, ...)` |
| `linker.evaluation.precision_recall_chart_from_labels_column(...)` | `il_precision_recall(model, ...)` |
| `linker.evaluation.roc_chart_from_labels_column(...)` | `il_roc(model, ...)` |

## Data profiling

| splink (Python) | irelink (R) |
|-----------------|-------------|
| `linker.profile_columns(...)` | `il_profile(df, ...)` |
| `linker.count_num_comparisons_from_blocking_rule(...)` | `il_count_pairs(df, ...)` |
| *completeness profiling* | `il_completeness(df, ...)` |

## Persistence

| splink (Python) | irelink (R) |
|-----------------|-------------|
| `linker.misc.save_model_to_json(...)` | `il_save(model, path)` |
| `load_model_from_json(...)` | `il_load(path)` |
| `delete_tables_created_by_splink_from_db(...)` | `il_cleanup_all(con)` |
| *model-scoped cleanup* | `il_cleanup(model)` |

## Blocking rules

In Splink, you create blocking rules with `block_on()`, and `irelink` uses the same function name. The main difference is where the rules are used: Splink passes them into `SettingsCreator`, while `irelink` adds them to a spec with `il_block_on()` or passes them directly to training functions.

```r
# blocking in the spec
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_block_on(surname)

# blocking in EM training
model <- il_estimate_em(model, block_on(surname))
```

## Example: side-by-side deduplication

Below is a minimal deduplication example in both Splink and `irelink`.


**splink (Python):**

```python
from splink import Linker, SettingsCreator, DuckDBAPI, block_on, splink_datasets
import splink.comparison_library as cl

df = splink_datasets.fake_1000
db_api = DuckDBAPI()

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.JaroWinklerAtThresholds("first_name", [0.9, 0.7]),
        cl.JaroWinklerAtThresholds("surname", [0.9, 0.7]),
        cl.ExactMatch("dob"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name"),
        block_on("surname"),
    ],
)

linker = Linker(df, settings, db_api)

linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("surname")
)

pairwise = linker.inference.predict(threshold_match_probability=0.5)
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    pairwise, 0.95
)
```

**irelink (R):**

```r
library(irelink)

df <- fake_1000
con <- DBI::dbConnect(duckdb::duckdb())

spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(surname, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(dob, cl_exact()) |>
  il_block_on(first_name) |>
  il_block_on(surname)

model <- il_model(df, spec = spec, con = con)

model <- il_estimate_u(model)
model <- il_estimate_em(model, block_on(surname))

pairs <- predict(model, threshold = 0.5)
clusters <- il_cluster(pairs)

il_cleanup(model)
DBI::dbDisconnect(con, shutdown = TRUE)
```

The examples above use probability thresholds because those transfer cleanly between Splink and `irelink`. In Splink, prediction `match_weight` includes the prior odds. In `irelink`, `match_weight` is evidence only, and `total_match_weight` is the prior-inclusive log2 odds. Keep that difference in mind if you translate match-weight thresholds between the two packages.

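To make the distinction between evidence-only and prior-inclusive weights concrete, the sketch below adds a prior to an evidence-only weight and converts the result to a probability. It assumes the usual Fellegi-Sunter relationship (posterior odds = prior odds × Bayes factor, with weights on a log2 scale). `prob_to_weight()` and `weight_to_prob()` are illustrative helpers defined here, not `irelink` functions, and the input values are made up.

```r
# Illustrative helpers (not part of irelink): convert between a
# probability and its log2 odds.
prob_to_weight <- function(p) log2(p / (1 - p))
weight_to_prob <- function(w) 1 / (1 + 2^(-w))

# Hypothetical inputs, chosen only for illustration.
prior <- 0.001          # probability that two random records match
evidence_weight <- 12   # evidence-only weight (log2 Bayes factor)

# Assumption: the prior-inclusive weight is the evidence-only weight
# plus the log2 prior odds.
total_weight <- evidence_weight + prob_to_weight(prior)

# Posterior match probability implied by the prior-inclusive weight.
posterior <- weight_to_prob(total_weight)
posterior
```

Under these assumptions, a threshold expressed as a prior-inclusive weight has to be shifted by the log2 prior odds before it can be compared with an evidence-only weight.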
## Example: finding matches against new records

**splink (Python):**

```python
import pandas as pd

# `linker` is the trained Linker from the deduplication example above
new_records = pd.DataFrame([{
    "first_name": "Jhon",
    "surname": "Smith",
    "dob": "1990-01-15"
}])

results = linker.inference.find_matches_to_new_records(
    new_records,
    blocking_rules=[],
    match_weight_threshold=-10
)
```

**irelink (R):**

```r
new_df <- data.frame(
  first_name = "Jhon",
  surname = "Smith",
  dob = "1990-01-15"
)

results <- il_find_matches(model, new_df, threshold = 0.5)
```

Splink uses a match-weight threshold for this workflow. `il_find_matches()` filters on posterior match probability. Translate those thresholds with the same caution as in the prediction example above.
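
As a rough guide to how far apart the two thresholds in this example are, the sketch below converts Splink's match-weight threshold into a probability. It assumes that Splink's threshold is a prior-inclusive log2 odds, as described for prediction earlier in this vignette. `weight_to_prob()` is an illustrative helper, not part of `irelink`.

```r
# Illustrative helper (not part of irelink): log2 odds -> probability.
weight_to_prob <- function(w) 1 / (1 + 2^(-w))

# Match-weight threshold used in the Splink call above.
splink_threshold <- -10

# Equivalent probability threshold, assuming the weight is a
# prior-inclusive log2 odds.
prob_threshold <- weight_to_prob(splink_threshold)
prob_threshold
```

Under that assumption, a match weight of -10 corresponds to a probability of roughly 0.001, whereas a probability of 0.5 corresponds to a match weight of 0, so the two calls above do not filter at the same point.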