| Title: | Fast Probabilistic Record Linkage |
|---|---|
| Description: | Performs fast, scalable probabilistic record linkage and deduplication using the Fellegi-Sunter model. Records lacking a shared unique identifier are compared across configurable dimensions using exact, fuzzy, and distance-based comparisons, with model parameters estimated via unsupervised Expectation-Maximization. Multiple SQL backends are supported through 'DBI', enabling execution from laptop-scale ('DuckDB') through to distributed engines. This package is a translation of the Python 'splink' library by Linacre et al. into idiomatic R. |
| Authors: | Christopher T. Kenny [aut, cre, cph] (ORCID: <https://orcid.org/0000-0002-9386-6860>), Robin Linacre [cph] (Lead author of splink, the Python package this is derived from), Sam Lindsay [cph] (Author of splink), Theodore Manassis [cph] (Author of splink), Tom Hepworth [cph] (Author of splink), Andy Bond [cph] (Author of splink), Ross Kennedy [cph] (Author of splink), UK Ministry of Justice [cph] (Copyright holder of splink) |
| Maintainer: | Christopher T. Kenny <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.0.1 |
| Built: | 2026-05-20 19:32:02 UTC |
| Source: | https://github.com/christopherkenny/irelink |
Draws precision, recall, and F1 against the match-probability
threshold. The data is produced by il_accuracy().
## S3 method for class 'il_accuracy'
autoplot(object, ...)
object |
An |
... |
Additional arguments (currently unused). |
A ggplot2::ggplot() object.
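The three metrics plotted here can be computed by hand for a given threshold. The following is a minimal base R sketch, not the package's implementation; the `scores` and `truth` vectors and the `pr_at()` helper are hypothetical and exist only for illustration.

```r
# Hypothetical scored pairs: a match probability plus a 0/1 ground-truth label.
scores <- c(0.95, 0.80, 0.60, 0.40, 0.10)
truth  <- c(1, 1, 0, 1, 0)

# Precision, recall, and F1 when pairs at or above `threshold` are
# predicted to be matches (hypothetical helper, not part of the package).
pr_at <- function(threshold) {
  predicted <- scores >= threshold
  tp <- sum(predicted & truth == 1)
  fp <- sum(predicted & truth == 0)
  fn <- sum(!predicted & truth == 1)
  precision <- if (tp + fp == 0) NA_real_ else tp / (tp + fp)
  recall <- tp / (tp + fn)
  f1 <- if (is.na(precision) || precision + recall == 0) {
    NA_real_
  } else {
    2 * precision * recall / (precision + recall)
  }
  c(threshold = threshold, precision = precision, recall = recall, f1 = f1)
}

# One row per threshold, ready to plot against the threshold axis.
curve <- as.data.frame(t(sapply(c(0.3, 0.5, 0.9), pr_at)))
```

Raising the threshold typically trades recall for precision, which is the pattern the chart is meant to make visible.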
Plot Batch Comparator Scores
## S3 method for class 'il_comparator_score'
autoplot(object, ...)
object |
An |
... |
Additional arguments (currently unused). |
A ggplot2::ggplot() object.
Produces a match-weight histogram from scored pairs, or a waterfall
chart for a single pair when which is provided. This is a
convenience wrapper around ggplot2::ggplot(). For full control,
build a plot directly from the prediction result or from
il_waterfall().
## S3 method for class 'il_compared'
autoplot(object, which = NULL, ...)
object |
An |
which |
An optional integer index. If provided, produces a
waterfall chart for that pair. If |
... |
Additional arguments (currently unused). |
A ggplot2::ggplot() object.
Plot Comparison Vector Distribution
## S3 method for class 'il_comparison_vectors'
autoplot(object, ...)
object |
An |
... |
Additional arguments (currently unused). |
A ggplot2::ggplot() object showing the top comparison patterns by
frequency.
Draws a grouped bar chart of non-null percentages per column, from
data produced by il_completeness().
## S3 method for class 'il_completeness'
autoplot(object, ...)
object |
An |
... |
Additional arguments (currently unused). |
A ggplot2::ggplot() object.
Draws a horizontal bar chart of candidate pairs generated by each
blocking rule, from data produced by il_count_pairs().
## S3 method for class 'il_count_pairs'
autoplot(object, type = c("additional", "raw"), ...)
object |
An |
type |
One of |
... |
Additional arguments (currently unused). |
A ggplot2::ggplot() object.
Produces a ready-made chart from a trained model. By default draws
the match-weights bar chart. Set type = "parameters" for an m / u
probability comparison. For full control, extract data with
il_weights() or il_parameters() and build a custom
ggplot2::ggplot().
## S3 method for class 'il_model'
autoplot(object, type = c("weights", "parameters"), ...)
object |
A trained |
type |
One of |
... |
Additional arguments (currently unused). |
A ggplot2::ggplot() object.
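For orientation, the relationship between m / u probabilities and match weights can be sketched in base R. This assumes the usual Fellegi-Sunter convention (also used by splink) that a level's match weight is the base-2 logarithm of m / u; the probabilities below are made up, and the snippet does not call the package.

```r
# Hypothetical m and u probabilities for three levels of one comparison.
# m: P(level | records match); u: P(level | records do not match).
m <- c(exact = 0.90, fuzzy = 0.08, else_level = 0.02)
u <- c(exact = 0.01, fuzzy = 0.04, else_level = 0.95)

# Assumed convention: match weight = log2(m / u).
# Positive weights are evidence for a match, negative weights against.
match_weight <- log2(m / u)
```

Under this convention a level that is twice as likely among matches as among non-matches contributes a weight of 1.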
Draws a precision–recall curve from the data produced by
il_precision_recall().
## S3 method for class 'il_precision_recall'
autoplot(object, ...)
object |
An |
... |
Additional arguments (currently unused). |
A ggplot2::ggplot() object.
Draws a faceted bar chart of value frequencies per column, from data
produced by il_profile().
## S3 method for class 'il_profile'
autoplot(object, ...)
object |
An |
... |
Additional arguments (currently unused). |
A ggplot2::ggplot() object.
Draws a receiver operating characteristic curve from the data
produced by il_roc().
## S3 method for class 'il_roc'
autoplot(object, ...)
object |
An |
... |
Additional arguments (currently unused). |
A ggplot2::ggplot() object.
Visualizes the output of il_string_similarity() as a horizontal bar
chart, making it easy to compare multiple string-distance metrics at a
glance.
## S3 method for class 'il_string_similarity'
autoplot(object, ...)
object |
An |
... |
Additional arguments (currently unused). |
A ggplot2::ggplot() object.
ggplot2::autoplot(il_string_similarity('John', 'Jon'))
Draws parameter estimates across EM iterations, faceted by comparison,
from data produced by il_training_history().
## S3 method for class 'il_training_history'
autoplot(object, ...)
object |
An |
... |
Additional arguments (currently unused). |
A ggplot2::ggplot() object.
Draws the proportion of records that cannot be linked at each
match-probability threshold, from data produced by
il_unlinkables().
## S3 method for class 'il_unlinkables'
autoplot(object, ...)
object |
An |
... |
Additional arguments (currently unused). |
A ggplot2::ggplot() object.
For each column, computes the fraction of true-match pairs that share the same value (recall). Helps identify which columns make effective blocking keys.
block_from_labels(.data, labels, columns = NULL, con = NULL)
.data |
A data frame or character table name. |
labels |
A data frame with |
columns |
Character vector of column names to evaluate. |
con |
A DBI connection from |
A tibble::tibble() with columns column, recall (fraction of true
matches caught), and n_matches_caught.
con <- DBI::dbConnect(duckdb::duckdb())
labels <- data.frame(
  unique_id_l = fake_1000_labels$unique_id_l,
  unique_id_r = fake_1000_labels$unique_id_r,
  is_match = as.integer(fake_1000_labels$clerical_match_score >= 0.5)
)
block_from_labels(fake_1000, labels, con = con)
DBI::dbDisconnect(con, shutdown = TRUE)
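The per-column recall described above can be illustrated without a database. This base R sketch uses a tiny made-up table and is only an approximation of the idea; the `records` and `true_matches` objects are hypothetical and the package itself may compute the result differently (for example, in SQL).

```r
# Hypothetical records and a list of known true-match pairs.
records <- data.frame(
  unique_id = 0:3,
  first_name = c("ann", "ann", "bob", "rob"),
  surname = c("lee", "lee", "ray", "ray"),
  stringsAsFactors = FALSE
)
true_matches <- data.frame(unique_id_l = c(0, 2), unique_id_r = c(1, 3))

# Look up the left and right record of every true-match pair.
left <- records[match(true_matches$unique_id_l, records$unique_id), ]
right <- records[match(true_matches$unique_id_r, records$unique_id), ]

# For each candidate blocking column, count the true matches whose two
# records share the same value, then divide by the number of true matches.
cols <- c("first_name", "surname")
caught <- vapply(
  cols,
  function(col) sum(left[[col]] == right[[col]], na.rm = TRUE),
  numeric(1)
)
recall <- caught / nrow(true_matches)
```

Here `surname` would keep every true match in the same block, while `first_name` would lose the pair affected by a typo.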
Creates a blocking rule for use inside training verbs such as
il_estimate_em() and il_estimate_prior(). This is distinct from
il_block_on(), which adds prediction-time blocking to a specification.
The returned object describes how to partition pairs during training.
block_on(..., .where = NULL, .transform = NULL, .explode = NULL)
... |
Column names (bare or |
.where |
An optional raw SQL string for non-equality blocking
conditions (e.g., |
.transform |
An optional transform applied to every column that does
not already have a formula transform. See |
.explode |
An optional character vector of array column names to
unnest before blocking. See |
A blocking-rule object for use in training verbs.
block_on(first_name, surname)

# Fuzzy SQL conditions
block_on(first_name, .where = 'levenshtein(l.dob, r.dob) <= 1')

# Phonetic blocking
block_on(first_name, .transform = il_soundex)

# Per-column substring blocking
block_on(first_name ~ il_substr(1, 3), surname ~ il_substr(1, 4))
Creates a compound level that fires only when all supplied conditions are satisfied.
cl_and(...)
... |
Comparison-level objects to AND together. |
A comparison-level object.
cl_and(cl_exact(), cl_jaro_winkler(0.9))
Creates comparison levels based on the number of shared elements between two array or list columns. Thresholds are integer counts, ordered from strictest (most shared elements required) to most lenient.
cl_array_intersect(...)
... |
Integer count thresholds, ordered from strictest to most
lenient (e.g., |
A comparison-level object for use in il_compare().
il_spec() |>
  il_compare(tags, cl_array_intersect(2, 1))
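To make the threshold ordering concrete, here is a base R sketch of how a pair of arrays could be assigned to a level. The `shared_count()` and `level_for()` helpers and the level labels are hypothetical; the sketch assumes a level fires when the shared-element count is at least the threshold, and it does not reproduce the SQL the package generates.

```r
# Number of distinct elements two arrays have in common.
shared_count <- function(a, b) length(intersect(a, b))

# Walk the thresholds from strictest to most lenient and return the first
# level whose minimum shared count is met (hypothetical labels).
level_for <- function(a, b, thresholds = c(2, 1)) {
  n <- shared_count(a, b)
  hit <- which(n >= thresholds)
  if (length(hit) == 0) "else" else paste0(">= ", thresholds[hit[1]])
}

level_for(c("a", "b", "c"), c("b", "c", "d"))  # two shared elements
level_for(c("a"), c("a", "z"))                 # one shared element
level_for(c("a"), c("b"))                      # nothing shared
```

Because the strictest threshold is tested first, a pair sharing two elements is never assigned to the more lenient one-element level.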
Creates comparison levels based on the best-matching pair of values
across two array columns. For each record pair, every element of the
left array is compared against every element of the right array. The
best score (maximum similarity for 'jaro_winkler', minimum distance
for 'levenshtein') is then tested against each threshold.
cl_array_min_distance(fn = c("jaro_winkler", "levenshtein"), ...)
fn |
Distance function: |
... |
Numeric thresholds, from strictest to most lenient. |
On DuckDB the pairwise comparison runs in SQL via an UNNEST cross-join scalar subquery. On SQLite it falls back to an R-side nested apply.
A comparison-level object for use in il_compare().
# Jaro-Winkler: best pairwise similarity >= 0.9 or >= 0.7
il_spec() |>
  il_compare(aliases, cl_array_min_distance('jaro_winkler', 0.9, 0.7))

# Levenshtein: best pairwise edit distance <= 1 or <= 2
il_spec() |>
  il_compare(aliases, cl_array_min_distance('levenshtein', 1, 2))
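The "best pair across two arrays" idea can be sketched for the Levenshtein case with base R's `utils::adist()`, which returns the full matrix of edit distances between two character vectors. The alias vectors are made up, and this is an illustration of the concept rather than the package's SQL or R fallback.

```r
# Hypothetical alias arrays for the left and right record of one pair.
left <- c("jon smith", "johnny")
right <- c("john smith", "j. smith")

# Every left element against every right element: a 2 x 2 distance matrix.
pairwise <- utils::adist(left, right)

# For a distance metric the best score is the minimum over all pairs.
best <- min(pairwise)

# Test the best score against thresholds ordered strictest first.
thresholds <- c(1, 2)
first_level_met <- which(best <= thresholds)[1]
```

For a similarity metric such as Jaro-Winkler the same pattern would apply with `max()` and `>=` in place of `min()` and `<=`.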
Creates a comparison level that matches when the smaller of two array columns is a complete subset of the larger. In other words, every element of the smaller array appears in the larger one.
cl_array_subset()
On DuckDB and PostgreSQL this is computed in SQL using
ARRAY_LENGTH(ARRAY_INTERSECT(...)) = LEAST(ARRAY_LENGTH(...)).
On SQLite it falls back to an R-side set check.
A comparison-level object for use in il_compare().
il_spec() |>
  il_compare(qualifications, cl_array_subset())
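A set check of this kind might look like the following base R sketch. The `is_subset_of_larger()` helper is hypothetical and deduplicates each array first, which is an assumption; the package's SQL and R fallback may treat repeated elements differently.

```r
# TRUE when every element of the smaller array also appears in the larger.
is_subset_of_larger <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  if (length(a) <= length(b)) {
    smaller <- a
    larger <- b
  } else {
    smaller <- b
    larger <- a
  }
  all(smaller %in% larger)
}

is_subset_of_larger(c("BSc"), c("BSc", "MSc"))
is_subset_of_larger(c("BSc", "MSc"), c("BSc"))
is_subset_of_larger(c("BSc", "PhD"), c("BSc", "MSc", "MA"))
```

The check is symmetric in its arguments, matching the description that it is the smaller array, whichever side it is on, that must be contained in the larger.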
Creates a comparison level that fires when two column values are transposed between the left and right records (e.g., first name and surname accidentally swapped).
cl_columns_reversed(col_name_2, symmetrical = FALSE)
col_name_2 |
Name of the second column (character). The first
column is the one passed to |
symmetrical |
Logical. If |
Use inside cl_levels() to add a swap-detection level to a custom
comparison. For a ready-made name comparison that already includes
swap detection, see cl_forename_surname().
A comparison-level object for use in il_compare() or
cl_levels().
# Detect swapped first/last names inside a custom comparison
il_spec() |>
  il_compare(
    first_name,
    cl_levels(
      cl_null(),
      cl_exact(),
      cl_columns_reversed('surname', symmetrical = TRUE),
      cl_else()
    )
  )
Creates comparison levels based on cosine similarity. Suitable for numeric or vectorised columns. Thresholds are between 0 and 1, ordered from strictest to most lenient.
cl_cosine(...)
... |
Numeric thresholds between 0 and 1, ordered from strictest to most lenient. |
A comparison-level object for use in il_compare().
il_spec() |>
  il_compare(embedding, cl_cosine(0.8))
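For reference, cosine similarity between two numeric vectors is their dot product divided by the product of their Euclidean norms. A minimal base R sketch follows; the `cosine_similarity()` helper is hypothetical and is not how the package evaluates the level on a database backend.

```r
# Cosine similarity of two equal-length numeric vectors.
cosine_similarity <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

same <- cosine_similarity(c(1, 0), c(1, 0))        # identical direction
orthogonal <- cosine_similarity(c(1, 0), c(0, 1))  # no overlap
partial <- cosine_similarity(c(1, 1), c(1, 0))     # 45 degrees apart

# A pair clears a threshold of 0.8 only if its similarity is at least 0.8.
partial >= 0.8
```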
Creates a comparison level from a raw SQL expression. Use this when
none of the built-in cl_*() helpers fit. The SQL should reference
l. and r. prefixed column names for the left and right records.
This is a tagged-string helper with processing semantics.
cl_custom(sql_expr, ...)
sql_expr |
A character string containing a valid SQL expression. |
... |
Reserved for future use. |
A comparison-level object for use in il_compare().
il_spec() |>
  il_compare(score, cl_custom('l.score + r.score > 10'))
Creates comparison levels based on the Damerau-Levenshtein distance,
which extends cl_levenshtein() by also counting transpositions of
two adjacent characters as a single edit.
cl_damerau_levenshtein(..., term_frequency = FALSE)
... |
Integer distance thresholds, ordered from strictest to most lenient. |
term_frequency |
Logical. If |
A comparison-level object for use in il_compare().
il_spec() |>
  il_compare(name, cl_damerau_levenshtein(1))
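The difference from plain Levenshtein distance is easiest to see on a transposed pair of characters. The base R sketch below implements the optimal string alignment variant of Damerau-Levenshtein; the `osa_distance()` helper is hypothetical, and which variant a given SQL backend implements is an assumption not confirmed here.

```r
# Optimal string alignment distance: Levenshtein edits plus transposition
# of two adjacent characters, each counted as a single edit.
osa_distance <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)
  # d[i + 1, j + 1] holds the distance between the first i characters of
  # `a` and the first j characters of `b`.
  d <- matrix(0L, n + 1, m + 1)
  d[, 1] <- 0:n
  d[1, ] <- 0:m
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- as.integer(x[i] != y[j])
      d[i + 1, j + 1] <- min(
        d[i, j + 1] + 1L,  # deletion
        d[i + 1, j] + 1L,  # insertion
        d[i, j] + cost     # substitution or match
      )
      if (i > 1 && j > 1 && x[i] == y[j - 1] && x[i - 1] == y[j]) {
        # adjacent transposition
        d[i + 1, j + 1] <- min(d[i + 1, j + 1], d[i - 1, j - 1] + 1L)
      }
    }
  }
  d[n + 1, m + 1]
}

transposed <- osa_distance("1990-01-21", "1990-01-12")
plain <- drop(utils::adist("1990-01-21", "1990-01-12"))
```

The swapped final digits cost one edit under this metric but two under plain Levenshtein distance, which is why the metric suits keyed-in dates and identifiers.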
Creates comparison levels based on the absolute difference between two
dates. Thresholds should use the unit helpers days(), months(), or
years() for self-documenting, unit-safe specifications. Bare numerics
are interpreted as days.
cl_date_diff(...)
... |
Duration thresholds created by |
A comparison-level object for use in il_compare().
il_spec() |>
  il_compare(dob, cl_date_diff(days(30), days(365)))

# Mix units freely
il_spec() |>
  il_compare(dob, cl_date_diff(months(1), years(1)))
A pre-built domain comparison for dates of birth. Combines exact matching, a Damerau-Levenshtein string check for transposed digits, and configurable date-difference levels to handle common errors.
cl_dob(
  thresholds = list(months(1), years(1), years(10)),
  term_frequency = FALSE
)
thresholds |
A list of unit-tagged threshold values created by
|
term_frequency |
Logical. If |
A comparison-level object for use in il_compare().
il_spec() |>
  il_compare(dob, cl_dob())

# Custom thresholds: within 7 days, 6 months, 2 years
il_spec() |>
  il_compare(dob, cl_dob(thresholds = list(days(7), months(6), years(2))))
Creates a residual level that matches any pair not captured by previous
levels. Typically used as the last level inside cl_levels().
cl_else()
A comparison-level object.
cl_levels(cl_null(), cl_exact(), cl_else())
A pre-built domain comparison for email addresses. Provides levels for exact match, username-only match, and domain-only match.
cl_email(term_frequency = FALSE)
term_frequency |
Logical. If |
A comparison-level object for use in il_compare().
il_spec() |>
  il_compare(email, cl_email())
Creates a comparison level that scores an exact match on a column. Optionally applies term-frequency adjustments so that rare values (e.g., an uncommon surname) receive higher match weights than common ones.
cl_exact(term_frequency = FALSE)
term_frequency |
Logical. If |
A comparison-level object for use in il_compare().
il_spec() |>
  il_compare(city, cl_exact()) |>
  il_compare(county, cl_exact(term_frequency = TRUE))
An American-English alias for cl_forename_surname(). Compares first
and last name columns, including a swap-detection level for accidentally
transposed names. Pass this to il_compare() on the first-name column
and supply the companion last-name column via last_name.
cl_first_last_name(last_name = "last_name", term_frequency = FALSE)
last_name |
Name of the last name column in the data. Defaults to
|
term_frequency |
Logical. If |
A comparison-level object for use in il_compare().
il_spec() |>
  il_compare(first_name, cl_first_last_name())
A pre-built domain comparison that compares forename and surname
columns, including a cross-field swap-detection level (where first name
and surname are accidentally transposed). Pass this to
il_compare() on the forename/first-name column and supply the
companion surname/last-name column via surname.
cl_forename_surname(surname = "surname", term_frequency = FALSE)
surname |
Name of the surname column in the data. Defaults to
|
term_frequency |
Logical. If |
See also cl_first_last_name() for an American-English alias.
A comparison-level object for use in il_compare().
il_spec() |>
  il_compare(first_name, cl_forename_surname(surname = 'last_name'))
Creates comparison levels based on the great-circle distance between
two latitude/longitude pairs. Thresholds should use the unit helpers
km() or mi() for clarity.
cl_geo_distance(...)
... |
Distance thresholds created by |
A comparison-level object for use in il_compare().
il_spec() |>
  il_compare(c(lat, lon), cl_geo_distance(km(5), km(50)))

# Use miles instead
il_spec() |>
  il_compare(c(lat, lon), cl_geo_distance(mi(3), mi(30)))
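Great-circle distance is commonly computed with the haversine formula. The base R sketch below is one such formulation, assuming a spherical Earth with a mean radius of 6371 km; the `haversine_km()` helper is hypothetical, and the exact formula and radius used by the package are not stated in this reference.

```r
# Haversine great-circle distance in kilometres between two points given
# in decimal degrees (assumed mean Earth radius: 6371 km).
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against rounding pushing `a` slightly above 1.
  2 * radius_km * asin(sqrt(pmin(1, a)))
}

# One degree of latitude along a meridian.
one_degree <- haversine_km(0, 0, 1, 0)

# Converting a threshold given in miles to kilometres.
three_miles_km <- 3 * 1.609344
```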
Creates comparison levels based on the Jaccard index, the ratio of the intersection to the union of character n-gram sets. Thresholds are between 0 and 1, ordered from strictest to most lenient.
cl_jaccard(...)
... |
Numeric thresholds between 0 and 1, ordered from strictest to most lenient. |
A comparison-level object for use in il_compare().
il_spec() |>
  il_compare(name, cl_jaccard(0.9))
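The Jaccard index over character n-grams can be worked through in base R. The sketch below assumes bigrams (n = 2); the n-gram size the package or its backends use is not stated here, and the `ngrams()` and `jaccard()` helpers are hypothetical.

```r
# Distinct character n-grams of a single string.
ngrams <- function(s, n = 2) {
  len <- nchar(s)
  if (len < n) return(character(0))
  unique(substring(s, 1:(len - n + 1), n:len))
}

# Jaccard index: size of the intersection over size of the union.
jaccard <- function(a, b, n = 2) {
  x <- ngrams(a, n)
  y <- ngrams(b, n)
  u <- union(x, y)
  if (length(u) == 0) return(NA_real_)
  length(intersect(x, y)) / length(u)
}

# "john" -> jo, oh, hn; "jon" -> jo, on; one shared bigram out of four.
jaccard("john", "jon")
```

Short strings have few n-grams, so a single-character difference can lower the index sharply; thresholds such as 0.9 are therefore strict for short names.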
Creates comparison levels based on the Jaro similarity score (0 to 1).
A simpler variant of cl_jaro_winkler() without the prefix bonus.
cl_jaro(..., term_frequency = FALSE)
... |
Numeric thresholds between 0 and 1, ordered from strictest to most lenient. |
term_frequency |
Logical. If |
A comparison-level object for use in il_compare().
il_spec() |>
  il_compare(name, cl_jaro(0.9))
Creates comparison levels based on the Jaro-Winkler similarity score (0 to 1). Thresholds are passed as unnamed arguments ordered from strictest to most lenient.
cl_jaro_winkler(..., term_frequency = FALSE)
... |
Numeric thresholds between 0 and 1, ordered from strictest
to most lenient (e.g., |
term_frequency |
Logical. If |
A comparison-level object for use in il_compare().
il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(surname, cl_jaro_winkler(0.9, term_frequency = TRUE))
Assembles an ordered list of comparison levels from individual level
constructors. Use this when the built-in cl_*() helpers do not fit
and you need full control over the level hierarchy.
cl_levels(..., term_frequency = FALSE)
... |
Level objects created by |
term_frequency |
Logical. If |
A comparison-level object for use in il_compare().
il_spec() |>
  il_compare(
    name,
    cl_levels(
      cl_null(),
      cl_exact(),
      cl_jaro_winkler(0.95),
      cl_jaro_winkler(0.88),
      cl_else(),
      term_frequency = TRUE
    )
  )
Creates comparison levels based on the Levenshtein edit distance (minimum number of single-character insertions, deletions, or substitutions). Thresholds are integer counts, ordered from strictest (smallest distance) to most lenient.
cl_levenshtein(..., term_frequency = FALSE)
... |
Integer distance thresholds, ordered from strictest to most
lenient (e.g., |
term_frequency |
Logical. If |
A comparison-level object for use in il_compare().
il_spec() |>
  il_compare(name, cl_levenshtein(1, 2))
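Base R's `utils::adist()` computes the same edit distance, which makes it easy to preview how thresholds such as 1 and 2 would bucket a pair. The `levenshtein_level()` helper and its labels are hypothetical, and the sketch assumes a level fires when the distance is at most the threshold.

```r
# Assign a pair of strings to the first (strictest) threshold it satisfies.
levenshtein_level <- function(l, r, thresholds = c(1, 2)) {
  d <- drop(utils::adist(l, r))
  hit <- which(d <= thresholds)
  if (length(hit) == 0) "else" else paste0("<= ", thresholds[hit[1]])
}

levenshtein_level("smith", "smyth")     # one substitution
levenshtein_level("smith", "smythe")    # one substitution, one insertion
levenshtein_level("kitten", "sitting")  # three edits
```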
Creates a comparison level that checks whether a column equals a
fixed literal value on the left record, right record, or both. This is
useful as a gate inside cl_levels() to restrict a comparison to
records with a known value (e.g., only compare names when
country = 'US').
cl_literal(value, side = c("both", "left", "right"))
value |
A scalar value to compare against. Character values are quoted in the generated SQL. Numerics are not. |
side |
Which record to check: |
A comparison-level object for use in il_compare() or
cl_levels().
il_spec() |>
  il_compare(
    country,
    cl_levels(
      cl_null(),
      cl_literal('US', side = 'both'),
      cl_exact(),
      cl_else()
    )
  )
A pre-built domain comparison for personal names. Combines exact matching and Jaro-Winkler levels with thresholds tuned for typical name variation. Optionally adds a Soundex phonetic level as a final fallback before the else level, which helps catch names that sound similar but are spelled differently (e.g., Smith/Smyth).
cl_name(term_frequency = FALSE, phonetic = FALSE)
term_frequency |
Logical. If |
phonetic |
Logical. If |
A comparison-level object for use in il_compare().
il_spec() |>
  il_compare(first_name, cl_name()) |>
  il_compare(surname, cl_name(term_frequency = TRUE))

il_spec() |>
  il_compare(first_name, cl_name(phonetic = TRUE))
Creates a level that fires when the supplied condition does not hold.
cl_not(x)
x |
A comparison-level object to negate. |
A comparison-level object.
cl_not(cl_exact())
Creates a level that fires when either or both record values are NULL
or NA. Typically used as the first level inside cl_levels().
cl_null()
A comparison-level object.
cl_levels(cl_null(), cl_exact(), cl_else())
Creates comparison levels based on the absolute difference between two numeric values. Thresholds are ordered from strictest (smallest permitted difference) to most lenient.
cl_numeric_diff(...)
... |
Numeric difference thresholds, ordered from strictest to
most lenient (e.g., |
A comparison-level object for use in il_compare().
il_spec() |>
  il_compare(age, cl_numeric_diff(1, 5))
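As a rough illustration of how ordered thresholds partition pairs, the base R sketch below assigns an age pair to a level. The `numeric_diff_level()` helper and its labels are hypothetical; whether the package compares with `<=` or `<` is an assumption.

```r
# Assign a pair of numeric values to the first threshold whose permitted
# absolute difference is not exceeded; missing values get their own level.
numeric_diff_level <- function(l, r, thresholds = c(1, 5)) {
  if (is.na(l) || is.na(r)) return("null")
  hit <- which(abs(l - r) <= thresholds)
  if (length(hit) == 0) "else" else paste0("<= ", thresholds[hit[1]])
}

numeric_diff_level(34, 35)  # difference of 1
numeric_diff_level(34, 38)  # difference of 4
numeric_diff_level(34, 60)  # difference of 26
numeric_diff_level(34, NA)  # missing value
```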
Creates a compound level that fires when any of the supplied conditions are satisfied.
cl_or(...)
... |
Comparison-level objects to OR together. |
A comparison-level object.
cl_or(cl_jaro_winkler(0.9), cl_levenshtein(1))
Creates comparison levels based on the relative percentage difference
between two numeric values. Thresholds are fractions (e.g., 0.05
for 5%), ordered from strictest to most lenient.
cl_pct_diff(...)
... |
Numeric percentage thresholds, ordered from strictest to
most lenient (e.g., |
A comparison-level object for use in il_compare().
il_spec() |>
  il_compare(income, cl_pct_diff(0.05, 0.2))
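One common way to define a relative difference is the absolute difference divided by the larger of the two magnitudes. The base R sketch below uses that definition as an assumption; this reference does not state which denominator the package uses, and the `pct_diff()` helper is hypothetical.

```r
# Relative difference as a fraction of the larger magnitude
# (assumed definition; returns 0 when both values are 0).
pct_diff <- function(l, r) {
  denom <- pmax(abs(l), abs(r))
  ifelse(denom == 0, 0, abs(l - r) / denom)
}

small_gap <- pct_diff(100, 96)       # 4 / 100
large_gap <- pct_diff(50000, 60000)  # 10000 / 60000

# With thresholds 0.05 and 0.2, the first pair meets the strict level and
# the second only the lenient one.
c(small_gap <= 0.05, large_gap <= 0.05, large_gap <= 0.2)
```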
A pre-built domain comparison for postcodes. Supports exact matching and prefix-based partial matching. Optionally appends geographic distance fallback levels when latitude and longitude columns are available.
cl_postcode(
  term_frequency = FALSE,
  lat_col = NULL,
  long_col = NULL,
  km_thresholds = c(1, 10, 100)
)
term_frequency |
Logical. If |
lat_col, long_col
|
Character. Names of latitude and longitude
columns. Both must be supplied together. When provided, geographic
distance levels are appended before |
km_thresholds |
Numeric vector of distance thresholds in
kilometres, ordered from strictest to most lenient. Only used when
|
A comparison-level object for use in il_compare().
il_spec() |>
  il_compare(postcode, cl_postcode())

# With geographic fallback (requires lat/lon columns in the data)
il_spec() |>
  il_compare(postcode, cl_postcode(lat_col = 'lat', long_col = 'lon'))
Creates a comparison level based on the Soundex phonetic algorithm.
Two strings match if their Soundex codes are identical. This can be
used as a fallback level within cl_levels() or cl_name() to catch
names that sound similar but are spelled differently (e.g.,
Smith/Smyth, Robert/Rupert).
cl_soundex()
On DuckDB, Soundex runs via a registered SQL MACRO.
On PostgreSQL, it uses the native soundex() function.
On SQLite, it falls back to an R-side implementation.
A comparison-level object for use in il_compare() or
cl_levels().
il_spec() |>
  il_compare(first_name, cl_soundex())

il_spec() |>
  il_compare(
    first_name,
    cl_levels(
      cl_null(),
      cl_exact(),
      cl_jaro_winkler(0.9),
      cl_soundex(),
      cl_else()
    )
  )
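To show why the example names collide, here is a base R sketch of the standard American Soundex rules. The `soundex()` helper is hypothetical and written for illustration; it is not the package's R-side fallback, and database implementations may differ in edge cases.

```r
# American Soundex: first letter plus three digits.
soundex <- function(s) {
  s <- toupper(gsub("[^A-Za-z]", "", s))
  if (nchar(s) == 0) return(NA_character_)
  ch <- strsplit(s, "")[[1]]
  codes <- c(
    B = 1, F = 1, P = 1, V = 1,
    C = 2, G = 2, J = 2, K = 2, Q = 2, S = 2, X = 2, Z = 2,
    D = 3, T = 3, L = 4, M = 5, N = 5, R = 6
  )
  code <- unname(codes[ch])  # NA for vowels, H, W, and Y
  out <- character(0)
  prev <- code[1]
  for (i in seq_along(ch)[-1]) {
    # H and W are skipped and do not separate equal codes.
    if (ch[i] %in% c("H", "W")) next
    # Emit a digit only when it differs from the previous letter's code.
    if (!is.na(code[i]) && !identical(code[i], prev)) out <- c(out, code[i])
    # Vowels reset `prev` to NA, so a repeated code after a vowel is kept.
    prev <- code[i]
  }
  # Pad with zeros and truncate to four characters.
  substr(paste0(ch[1], paste(out, collapse = ""), "000"), 1, 4)
}

soundex("Smith") == soundex("Smyth")
soundex("Robert") == soundex("Rupert")
```

Because so much detail is discarded, Soundex is best placed after stricter levels, as in the `cl_levels()` example above.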
Creates comparison levels based on the absolute difference between two
datetime (timestamp) values. Thresholds should use the unit helpers
seconds(), minutes(), hours(), days(), months(), or years()
for self-documenting, unit-safe specifications. Bare numerics are
interpreted as seconds.
cl_time_diff(...)
... |
Duration thresholds created by |
This extends cl_date_diff() to support sub-day precision for timestamp
columns. Use cl_date_diff() for date-only columns.
A comparison-level object for use in il_compare().
il_spec() |>
  il_compare(timestamp, cl_time_diff(minutes(5), hours(1)))

# Mix units freely
il_spec() |>
  il_compare(timestamp, cl_time_diff(seconds(30), minutes(10), hours(2)))
A pre-built domain comparison for US ZIP codes. Provides levels for
exact match, 5-digit prefix match (normalizes ZIP+4 against plain
5-digit codes), and 3-digit Sectional Center Facility (SCF) prefix
match. Accepts both plain 5-digit ('90210') and ZIP+4
('90210-3456') formats. Optionally appends geographic distance
fallback levels when latitude and longitude columns are available.
cl_zip_code(
  term_frequency = FALSE,
  lat_col = NULL,
  long_col = NULL,
  km_thresholds = c(1, 10, 100)
)
term_frequency |
Logical. If |
lat_col, long_col
|
Character. Names of latitude and longitude
columns. Both must be supplied together. When provided, geographic
distance levels are appended before |
km_thresholds |
Numeric vector of distance thresholds in
kilometres, ordered from strictest to most lenient. Only used when
|
A comparison-level object for use in il_compare().
il_spec() |>
  il_compare(zip, cl_zip_code())

# With geographic fallback (requires lat/lon columns in the data)
il_spec() |>
  il_compare(zip, cl_zip_code(lat_col = 'lat', long_col = 'lon'))
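The prefix levels described above can be previewed in base R. In this sketch the `zip5()` and `zip3()` helpers are hypothetical; they strip non-digits and take leading characters, which is one plausible way to normalize ZIP+4 values and not necessarily how the package does it.

```r
# First five digits, so ZIP+4 values compare equal to plain 5-digit codes.
zip5 <- function(z) substr(gsub("[^0-9]", "", z), 1, 5)

# First three digits: the Sectional Center Facility (SCF) prefix.
zip3 <- function(z) substr(zip5(z), 1, 3)

zip5("90210-3456") == zip5("90210")  # 5-digit prefix match
zip3("90210") == zip3("90245")       # SCF prefix match only
zip3("90210") == zip3("10001")       # no prefix in common
```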
A tagged-value constructor that marks a numeric threshold as a number
of days. Use inside
cl_date_diff() for self-documenting, unit-safe thresholds.
days(n)
n |
A non-negative numeric value. |
A tagged numeric with class il_days.
il_spec() |>
  il_compare(dob, cl_date_diff(days(30), days(365)))
A dataset of 1,000 synthetic records representing 181 unique people, each with varying numbers of duplicate entries. Duplicates have been corrupted with typographical errors, missing values, and other realistic data-quality issues. This is the primary demo dataset from the Python splink library.
fake_1000
A tibble with 1,000 rows and 7 columns:
unique_id: Integer. Row identifier (0-indexed).
first_name: Character. Given name, sometimes corrupted or missing.
surname: Character. Family name, sometimes corrupted or missing.
dob: Character. Date of birth in YYYY-MM-DD format.
city: Character. City of residence, sometimes missing.
email: Character. Email address, sometimes corrupted or missing.
cluster: Integer. Ground-truth entity label (0-indexed).
The cluster column provides ground-truth entity labels: records sharing
the same cluster value refer to the same person.
The unique_id column provides a unique identifier for each row,
starting at 0 (matching splink convention).
From the splink datasets repository maintained by the UK Ministry of Justice Analytical Services: https://github.com/moj-analytical-services/splink_datasets. Original data generated by the splink team (Linacre et al.) under the MIT license.
fake_1000_labels for pairwise clerical labels.
Pairwise clerical labels for the fake_1000 dataset.
Each row records whether a pair of records from fake_1000 is a true
match (clerical_match_score = 1) or a non-match
(clerical_match_score = 0).
These labels enable evaluation of model accuracy, ROC curves, and
precision-recall metrics.
fake_1000_labels
A tibble with 3,176 rows and 5 columns:
unique_id_l: Integer. unique_id of the left record.
source_dataset_l: Character. Source dataset name ("fake_1000").
unique_id_r: Integer. unique_id of the right record.
source_dataset_r: Character. Source dataset name ("fake_1000").
clerical_match_score: Numeric. 1 for a match, 0 for a non-match.
From the splink datasets repository maintained by the UK Ministry of Justice Analytical Services: https://github.com/moj-analytical-services/splink_datasets. Original data generated by the splink team (Linacre et al.) under the MIT license.
A small, hand-crafted dataset of 20 records representing 5 unique people. Each person has four records with varying levels of corruption: exact matches, minor typos, and slightly shifted dates of birth. Designed for quick examples and unit tests.
fake_20
A tibble with 20 rows and 6 columns:
Character. Given name, sometimes corrupted.
Character. Family name, sometimes corrupted.
Character. Date of birth in YYYY-MM-DD format,
sometimes shifted by one day.
Character. City of residence.
Character. Email address, sometimes corrupted.
Integer. Ground-truth entity label (1 to 5).
fake_1000 for a larger benchmark dataset.
The FEBRL (Freely Extensible Biomedical Record Linkage) dataset 4a
contains 5,000 original records.
It is designed to be linked against febrl4b, which contains one
duplicate record per original.
Ground truth is encoded in rec_id: records sharing the same base ID
(e.g., rec-1070-org and rec-1070-dup-0) refer to the same entity.
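A minimal base R sketch of recovering the shared base ID from rec_id values follows. The suffix pattern is inferred from the two examples above (`-org` and `-dup-0`), so treat the regular expression as an assumption rather than a documented format.

```r
rec_id <- c("rec-1070-org", "rec-1070-dup-0", "rec-22-org", "rec-22-dup-0")

# Strip the "-org" or "-dup-<n>" suffix to recover the shared base ID.
base_id <- sub("-(org|dup-[0-9]+)$", "", rec_id)

# Keep only the numeric part as an integer entity label.
entity <- as.integer(sub("^rec-", "", base_id))
```

Records with equal `entity` values refer to the same person, which gives a ground-truth column suitable for evaluation.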
febrl4a
A tibble with 5,000 rows and 11 columns:
Character. Record identifier encoding entity and
origin (-org suffix).
Character. Given name, sometimes missing.
Character. Family name, sometimes missing.
Integer. Street number, sometimes missing.
Character. Primary address line, sometimes missing.
Character. Secondary address line, often missing.
Character. Suburb or neighborhood.
Integer. Postal code.
Character. Australian state abbreviation.
Integer. Date of birth as YYYYMMDD integer,
sometimes missing.
Integer. Social security identifier.
Distributed via the splink datasets repository (https://github.com/moj-analytical-services/splink_datasets) under the MIT license. The FEBRL datasets originate from Christen and Churches (2004) and are widely used as record-linkage benchmarks.
Christen, P. and Churches, T. (2004). Febrl – Freely Extensible Biomedical Record Linkage. Australian National University.
febrl4b for the corresponding duplicate records.
The FEBRL (Freely Extensible Biomedical Record Linkage) dataset 4b
contains 5,000 duplicate records, one for each original in febrl4a.
Duplicates have been corrupted with typographical errors, missing
values, and transpositions.
Ground truth is encoded in rec_id: the base number matches the
corresponding original (e.g., rec-1070-dup-0 matches
rec-1070-org).
febrl4b
A tibble with 5,000 rows and 11 columns:
Character. Record identifier encoding entity and
duplicate status (-dup-0 suffix).
Character. Given name, sometimes corrupted or missing.
Character. Family name, sometimes corrupted or missing.
Integer. Street number, sometimes missing.
Character. Primary address line, sometimes corrupted.
Character. Secondary address line, often missing.
Character. Suburb or neighborhood.
Integer. Postal code.
Character. Australian state abbreviation.
Integer. Date of birth as YYYYMMDD integer,
sometimes missing.
Integer. Social security identifier.
Distributed via the splink datasets repository (https://github.com/moj-analytical-services/splink_datasets) under the MIT license. The FEBRL datasets originate from Christen and Churches (2004) and are widely used as record-linkage benchmarks.
Christen, P. and Churches, T. (2004). Febrl – Freely Extensible Biomedical Record Linkage. Australian National University.
febrl4a for the corresponding original records.
A tagged-value constructor that marks a numeric threshold as a number
of hours. Use inside cl_time_diff() for self-documenting,
unit-safe thresholds.
hours(n)
n |
A non-negative numeric value. |
A tagged numeric with class il_hours.
il_spec() |>
  il_compare(timestamp, cl_time_diff(hours(2), hours(24)))
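To show the general idea of a tagged-value constructor, here is a small base R sketch. The names `tag_hours()`, `as_seconds()`, and the class `demo_hours` are hypothetical stand-ins and are not the package's own implementation.

```r
# Hypothetical stand-in for a unit-tagged threshold; not irelink's own code.
tag_hours <- function(n) {
  stopifnot(is.numeric(n), length(n) == 1L, !is.na(n), n >= 0)
  structure(n, class = c("demo_hours", "numeric"))
}

# Convert the tagged value to seconds so thresholds in different units compare.
as_seconds <- function(x) UseMethod("as_seconds")
as_seconds.demo_hours <- function(x) unclass(x) * 3600

two_hours <- tag_hours(2)
as_seconds(two_hours)
```

Carrying the unit in the class lets a consumer dispatch on it instead of guessing whether a bare number means hours, days, or seconds.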
Computes a full suite of classification metrics at a range of match-probability thresholds. Requires labeled pairs.
il_accuracy(model, labels = NULL, labels_col = NULL)
model |
A trained |
labels |
A data frame of labeled pairs with a logical or integer
match indicator. Required unless |
labels_col |
Optional string naming a column in the original data
containing ground-truth cluster/entity IDs. When provided, pairwise
labels are derived automatically via |
A tibble::tibble() with one row per threshold, containing columns
threshold, tp, fp, fn, tn, fn_blocking_miss,
precision, recall, f1, f2, f0_5, specificity, npv,
accuracy, p4, and phi.
df <- data.frame(
  unique_id = 1:20,
  first_name = c(
    'John', 'Jon', 'Jane', 'Jane', 'Bob', 'Bobby', 'Alice', 'Alicia', 'Tom', 'Thomas',
    'John', 'Jon', 'Jane', 'Janet', 'Bob', 'Robert', 'Alice', 'Alison', 'Tom', 'Tomas'
  ),
  surname = c(
    'Smith', 'Smith', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Brown', 'White', 'White',
    'Smith', 'Smyth', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Browne', 'White', 'White'
  ),
  dob = c(
    '1990-01-01', '1990-01-01', '1985-06-15', '1985-06-15', '2000-12-01',
    '2000-12-01', '1975-03-22', '1975-03-22', '1988-07-04', '1988-07-04',
    '1990-01-01', '1990-01-02', '1985-06-15', '1985-06-16', '2000-12-01',
    '2000-12-02', '1975-03-22', '1975-03-23', '1988-07-04', '1988-07-05'
  ),
  city = c(
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid',
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid'
  ),
  email = c(
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]'
  )
)
con <- DBI::dbConnect(duckdb::duckdb())
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(surname, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(dob, cl_exact()) |>
  il_block_on(surname) |>
  il_block_on(first_name)
model <- il_model(df, spec = spec, con = con)
model <- il_estimate_u(model)
model <- il_estimate_em(model, block_on(surname))
labels <- data.frame(
  unique_id_l = c(1L, 1L),
  unique_id_r = c(11L, 2L),
  is_match = c(1L, 0L)
)
il_accuracy(model, labels = labels)
DBI::dbDisconnect(con, shutdown = TRUE)
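For readers who want to see how the returned metric columns relate to the confusion counts, the base R sketch below computes them at a single threshold using the standard textbook definitions. The counts are made up, the fn_blocking_miss column is not modelled, and whether the package uses exactly these formulas is an assumption.

```r
# Confusion counts at one threshold (made-up numbers for illustration).
tp <- 40; fp <- 10; fn <- 5; tn <- 945

precision   <- tp / (tp + fp)
recall      <- tp / (tp + fn)
specificity <- tn / (tn + fp)
npv         <- tn / (tn + fn)
accuracy    <- (tp + tn) / (tp + fp + fn + tn)

# F-beta weights recall beta times as heavily as precision.
f_beta <- function(p, r, beta) (1 + beta^2) * p * r / (beta^2 * p + r)
f1   <- f_beta(precision, recall, 1)
f2   <- f_beta(precision, recall, 2)
f0_5 <- f_beta(precision, recall, 0.5)

# P4 is the harmonic mean of precision, recall, specificity, and NPV.
p4 <- 4 / (1 / precision + 1 / recall + 1 / specificity + 1 / npv)

# Phi is the Matthews correlation coefficient.
phi <- (tp * tn - fp * fn) /
  sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
```

In record linkage non-matches vastly outnumber matches, so accuracy and specificity tend to sit near 1; precision, recall, and the F-scores are usually more informative.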
Returns a transform that extracts the first or last element of an
array-valued column. The result can be passed as the transform
argument to il_compare() or il_block_on(), and composed with
other transforms via il_transform(). On DuckDB and PostgreSQL,
maps to SQL array indexing (col[1] or col[-1]).
il_array_element(position = c("first", "last"))
position |
Either |
An il_column_transform closure.
tf <- il_array_element('first')
tf(list(c('Alice', 'A'), c('Bob'), character(0)))
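The following base R sketch mimics what a first/last array transform could do on an in-memory list column. The function `array_element()` is hypothetical, and returning NA for empty arrays is an assumption about the intended behavior rather than documented package semantics.

```r
# Hypothetical in-memory counterpart of a "first"/"last" array transform.
array_element <- function(position = c("first", "last")) {
  position <- match.arg(position)
  function(x) {
    vapply(x, function(el) {
      # Assumption: an empty array yields a missing value.
      if (length(el) == 0L) return(NA_character_)
      if (position == "first") el[[1L]] else el[[length(el)]]
    }, character(1))
  }
}

first_of <- array_element("first")
last_of  <- array_element("last")
first_of(list(c("Alice", "A"), c("Bob"), character(0)))
```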
Takes a loaded (or existing) il_model and binds it to new data and a
fresh database connection, producing a model ready for predict() or
further training. Accepts in-memory data frames, dbplyr::tbl_lazy table
references, or character table names.
il_attach(model, .data, ..., con = NULL, link_type = NULL)
model |
An |
.data |
A data frame, |
... |
Additional datasets for multi-table linkage. |
con |
A DBI connection object from |
link_type |
Optionally override the model's link type. If |
This is the key function for the production workflow:
train once with il_model() -> save with il_save() -> later, load
with il_load() and attach to new data with il_attach().
The loaded model's trained parameters (m, u, prior) are preserved.
You can immediately call predict() on the attached model, or
continue training with il_estimate_em() using the existing
parameters as a warm start.
The model, now connected to con with data uploaded, ready
for predict(), il_find_matches(), or further training.
con <- DBI::dbConnect(duckdb::duckdb())
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_block_on(surname)
model <- il_model(fake_1000, spec = spec, con = con)
model <- il_estimate_u(model)
model <- il_estimate_em(model, block_on(surname))
path <- tempfile(fileext = '.rds')
il_save(model, path)
DBI::dbDisconnect(con, shutdown = TRUE)

con2 <- DBI::dbConnect(duckdb::duckdb())
loaded <- il_load(path)
model2 <- il_attach(loaded, fake_1000, con = con2)
DBI::dbDisconnect(con2, shutdown = TRUE)
Adds an equality-based blocking rule to a specification. During prediction, only record pairs that agree on the blocking columns are scored. Multiple calls are OR-ed together. Within a single call, columns are AND-ed.
il_block_on(spec, ..., .where = NULL, .transform = NULL, .explode = NULL)
spec |
An |
... |
Columns for equality blocking (AND-ed within one call). Each entry is either:
|
.where |
An optional raw SQL string for non-equality blocking
conditions. Defaults to |
.transform |
An optional transform applied to every column that does
not already have a formula transform. Can be a single function (e.g.
il_soundex) or a named list of functions for per-column transforms.
Formula transforms in |
.explode |
An optional character vector of column names containing
arrays (list columns) to unnest before blocking. Each array element
becomes a separate row for the blocking join. Requires a DuckDB or
PostgreSQL backend. Defaults to |
An updated copy of spec.
# Block on state OR first name (two calls = OR)
spec <- il_spec() |>
  il_block_on(state) |>
  il_block_on(first_name)

# Block where state AND year both match (one call = AND)
spec <- il_spec() |>
  il_block_on(state, year)

# Per-column substring blocking with formula syntax
spec <- il_spec() |>
  il_block_on(first_name ~ il_substr(1, 3), surname ~ il_substr(1, 4))

# Mix: substr on one column, plain match on another
spec <- il_spec() |>
  il_block_on(postcode_fake ~ il_substr(1, 3), dob)

# Same transform on all columns
spec <- il_spec() |>
  il_block_on(first_name, .transform = il_soundex)

# Explode array columns before blocking
spec <- il_spec() |>
  il_block_on(email, .explode = 'email')
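The AND/OR semantics of equality blocking can be sketched in base R with a self-join. The helper `block_pairs()` and the toy data are hypothetical; the package itself generates SQL, so this only illustrates which candidate pairs each rule admits.

```r
records <- data.frame(
  unique_id  = 1:4,
  state      = c("NY", "NY", "CA", "CA"),
  first_name = c("Ann", "Bob", "Ann", "Cat"),
  stringsAsFactors = FALSE
)

# Candidate pairs for one rule: a self-join on the blocking columns (AND),
# keeping each unordered pair once.
block_pairs <- function(df, cols) {
  joined <- merge(df, df, by = cols, suffixes = c("_l", "_r"))
  joined <- joined[joined$unique_id_l < joined$unique_id_r, ]
  joined[, c("unique_id_l", "unique_id_r")]
}

# Two rules are OR-ed: take the union of their candidate pairs.
candidates <- unique(rbind(
  block_pairs(records, "state"),
  block_pairs(records, "first_name")
))

# One rule with two columns is AND-ed: both must agree.
strict <- block_pairs(records, c("state", "first_name"))
```

Here the OR-ed rules admit three candidate pairs, while the AND-ed rule admits none, which is why several loose rules are typically combined rather than one strict rule.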
Returns a transform that casts any column to a character/VARCHAR type.
Useful when a numeric or date column needs to be compared as text. The
result can be passed as the transform argument to il_compare() or
il_block_on(), and composed with other transforms via il_transform().
On DuckDB and PostgreSQL, maps to SQL CAST(col AS VARCHAR).
il_cast_to_string()
An il_column_transform closure.
tf <- il_cast_to_string()
tf(c(12345L, 67890L))
Cleans up the temporary tables owned by a single il_model. This is safe
to call on a shared DBI connection from DBI::dbConnect() containing other live irelink models.
Use il_cleanup_all() only when you explicitly want to remove every
irelink table from the connection.
il_cleanup(model)
model |
An |
model, invisibly.
df <- data.frame(
  unique_id = 1:20,
  first_name = c(
    'John', 'Jon', 'Jane', 'Jane', 'Bob', 'Bobby', 'Alice', 'Alicia', 'Tom', 'Thomas',
    'John', 'Jon', 'Jane', 'Janet', 'Bob', 'Robert', 'Alice', 'Alison', 'Tom', 'Tomas'
  ),
  surname = c(
    'Smith', 'Smith', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Brown', 'White', 'White',
    'Smith', 'Smyth', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Browne', 'White', 'White'
  ),
  dob = c(
    '1990-01-01', '1990-01-01', '1985-06-15', '1985-06-15', '2000-12-01',
    '2000-12-01', '1975-03-22', '1975-03-22', '1988-07-04', '1988-07-04',
    '1990-01-01', '1990-01-02', '1985-06-15', '1985-06-16', '2000-12-01',
    '2000-12-02', '1975-03-22', '1975-03-23', '1988-07-04', '1988-07-05'
  ),
  city = c(
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid',
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid'
  ),
  email = c(
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]'
  )
)
con <- DBI::dbConnect(duckdb::duckdb())
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(surname, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(dob, cl_exact()) |>
  il_block_on(surname) |>
  il_block_on(first_name)
model <- il_model(df, spec = spec, con = con)
model <- il_estimate_u(model)
model <- il_estimate_em(model, block_on(surname))
il_cleanup(model)
DBI::dbDisconnect(con, shutdown = TRUE)
Drops every table or view whose name starts with __il_ from a DBI
connection. This is intended as an explicit interactive escape hatch after
failed runs or exploratory sessions. Prefer il_cleanup() when cleaning up a
specific live model on a shared connection.
il_cleanup_all(con)
con |
A DBI connection from |
con, invisibly.
df <- data.frame(
  unique_id = 1:4,
  name = c('Ann', 'Anne', 'Bob', 'Rob')
)
con <- DBI::dbConnect(duckdb::duckdb())
spec <- il_spec() |>
  il_compare(name, cl_jaro_winkler(0.9)) |>
  il_block_on(name)
model <- il_model(df, spec = spec, con = con)
il_cleanup_all(con)
DBI::dbDisconnect(con, shutdown = TRUE)
Groups scored record pairs into entity clusters using graph-based methods. The result assigns cluster IDs to records that represent the same real-world entity.
il_cluster(
  pairs,
  threshold = NULL,
  method = c("connected", "best_link"),
  ties_method = c("lowest_id", "drop"),
  source_dataset = NULL
)
pairs |
An |
threshold |
An optional secondary match-probability threshold.
If |
method |
One of |
ties_method |
How to handle tied best-link probabilities when
|
source_dataset |
An optional named character vector or data frame
mapping |
A tibble::tibble() with one row per input record, including a
cluster_id column.
df <- data.frame(
  unique_id = 1:20,
  first_name = c(
    'John', 'Jon', 'Jane', 'Jane', 'Bob', 'Bobby', 'Alice', 'Alicia', 'Tom', 'Thomas',
    'John', 'Jon', 'Jane', 'Janet', 'Bob', 'Robert', 'Alice', 'Alison', 'Tom', 'Tomas'
  ),
  surname = c(
    'Smith', 'Smith', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Brown', 'White', 'White',
    'Smith', 'Smyth', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Browne', 'White', 'White'
  ),
  dob = c(
    '1990-01-01', '1990-01-01', '1985-06-15', '1985-06-15', '2000-12-01',
    '2000-12-01', '1975-03-22', '1975-03-22', '1988-07-04', '1988-07-04',
    '1990-01-01', '1990-01-02', '1985-06-15', '1985-06-16', '2000-12-01',
    '2000-12-02', '1975-03-22', '1975-03-23', '1988-07-04', '1988-07-05'
  ),
  city = c(
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid',
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid'
  ),
  email = c(
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]'
  )
)
con <- DBI::dbConnect(duckdb::duckdb())
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(surname, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(dob, cl_exact()) |>
  il_block_on(surname) |>
  il_block_on(first_name)
model <- il_model(df, spec = spec, con = con)
model <- il_estimate_u(model)
model <- il_estimate_em(model, block_on(surname))
pairs <- predict(model, threshold = 0.5)
clusters <- il_cluster(pairs)
DBI::dbDisconnect(con, shutdown = TRUE)
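Connected-components clustering can be illustrated with a small union-find loop in base R. The pair table, its `match_probability` column name, and the 0.9 cut-off are assumptions for illustration; this is a sketch of the general technique, not the package's implementation.

```r
pairs <- data.frame(
  unique_id_l       = c(1L, 2L, 4L, 5L),
  unique_id_r       = c(2L, 3L, 5L, 6L),
  match_probability = c(0.99, 0.95, 0.97, 0.40)
)
ids <- 1:6

# Union-find: parent[i] points toward the representative of i's cluster.
parent <- ids
find_root <- function(i) {
  while (parent[i] != i) i <- parent[i]
  i
}

# Link the endpoints of every pair that clears the threshold.
keep <- pairs[pairs$match_probability >= 0.9, ]
for (k in seq_len(nrow(keep))) {
  a <- find_root(keep$unique_id_l[k])
  b <- find_root(keep$unique_id_r[k])
  # Use the lower id as the representative so labels are deterministic.
  if (a != b) parent[max(a, b)] <- min(a, b)
}

clusters <- data.frame(
  unique_id  = ids,
  cluster_id = vapply(ids, find_root, integer(1))
)
```

Records 1 to 3 end up in one cluster through the chain 1-2-3 even though 1 and 3 were never compared directly; that transitivity is what distinguishes connected components from a best-link approach.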
Computes a record-level confusion matrix after clustering predicted
matches into entities. A record is treated as "duplicated" if it is not
the first record in its predicted cluster, and likewise for the
ground-truth labels_col.
il_cluster_confusion_matrix(
  model,
  labels_col,
  threshold = 0.85,
  method = c("connected", "best_link"),
  ties_method = c("lowest_id", "drop"),
  source_dataset = NULL
)
model |
A trained |
labels_col |
String naming the ground-truth cluster/entity column in the model's source data. |
threshold |
Match-probability threshold passed to |
method |
Clustering method passed to |
ties_method |
Tie handling for |
source_dataset |
Optional source-dataset mapping passed to
|
For DuckDB and PostgreSQL backends, pair scoring and clustering are pushed into SQL where possible. The final summary still returns a one-row tibble in R.
A one-row tibble with columns threshold, tp, fp, fn,
tn, precision, recall, and f1.
df <- data.frame(
  unique_id = 1:5,
  first_name = c('John', 'John', 'Mary', 'Bob', 'Bob'),
  surname = c('Smith', 'Smith', 'Jones', 'Brown', 'Brown'),
  cluster = c(1, 1, 2, 3, 4)
)
con <- DBI::dbConnect(duckdb::duckdb())
spec <- il_spec() |>
  il_compare(first_name, cl_exact()) |>
  il_compare(surname, cl_exact()) |>
  il_block_on(surname)
model <- il_model(df, spec = spec, con = con)
model <- il_estimate_u(model)
model <- il_estimate_em(model, block_on(surname))
il_cluster_confusion_matrix(model, labels_col = 'cluster', threshold = 0.85)
DBI::dbDisconnect(con, shutdown = TRUE)
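The "not the first record in its cluster" rule can be sketched in base R with duplicated(). The predicted cluster IDs below are invented, and the sketch assumes records are already ordered by unique_id so that "first" means lowest ID; the package's own ordering rule may differ.

```r
# Records are assumed to be ordered by unique_id, so "first" means lowest id.
predicted <- c(1, 1, 2, 3, 3)   # cluster IDs from a clustering step
truth     <- c(1, 1, 2, 3, 4)   # ground-truth entity labels

# A record counts as duplicated when an earlier record shares its cluster.
pred_dup <- duplicated(predicted)
true_dup <- duplicated(truth)

tp <- sum(pred_dup & true_dup)
fp <- sum(pred_dup & !true_dup)
fn <- sum(!pred_dup & true_dup)
tn <- sum(!pred_dup & !true_dup)

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)
```

In this toy case the two "Bob Brown" records are merged by the prediction but are distinct entities in the truth, which produces the single false positive.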
Computes string-similarity metrics between two columns of a data frame or database table. Useful for profiling data quality and choosing comparison thresholds. Rows where either column is missing are omitted.
il_comparator_score(.data, col_1, col_2, con = NULL)
.data |
A data frame or character table name. Table names require
|
col_1, col_2
|
Column names (unquoted or character). |
con |
A DBI connection from |
With con = NULL, all metrics are computed in R with
stringdist::stringdist(). With a duckdb::duckdb() or PostgreSQL connection,
computation is pushed to SQL. SQL backends return the same column schema
but may leave unsupported metrics as NA: DuckDB currently computes
jaro_winkler, jaro, levenshtein, and jaccard; PostgreSQL computes
levenshtein and a jaro_winkler compatibility column backed by trigram
similarity().
A tibble::tibble() with the two input columns and metric columns
jaro_winkler, jaro, levenshtein, jaccard, and cosine.
Unsupported SQL-backend metrics are present as NA. The result has S3
class il_comparator_score.
df <- data.frame(
  name_l = c('John', 'Jane', 'Bob'),
  name_r = c('Jon', 'Janet', 'Bobby')
)
il_comparator_score(df, name_l, name_r)
Shows the distribution of pairwise similarity scores and highlights which pairs exceed a given threshold.
il_comparator_threshold_chart(
  .data,
  col_1,
  col_2,
  similarity_threshold = NULL,
  distance_threshold = NULL,
  con = NULL
)
.data |
A data frame or table name. |
col_1, col_2
|
Column names (unquoted or character). |
similarity_threshold |
Numeric threshold for similarity metrics (>= to include). Applies to jaro_winkler, jaro, jaccard, cosine. |
distance_threshold |
Integer threshold for distance metrics (<= to include). Applies to levenshtein. |
con |
A DBI connection from |
A ggplot2::ggplot() object.
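The difference between the two threshold directions can be sketched in base R. This example uses utils::adist() for Levenshtein distance as a dependency-free stand-in, and the normalised similarity is an illustrative construction rather than one of the package's metrics.

```r
name_l <- c("John", "Jane", "Bob")
name_r <- c("Jon", "Janet", "Bobby")

# Row-wise Levenshtein distance; utils::adist() returns a full matrix,
# so take its diagonal to pair name_l[i] with name_r[i].
lev <- diag(utils::adist(name_l, name_r))

# Normalise to a 0-1 similarity (an illustrative choice, not a package metric).
sim <- 1 - lev / pmax(nchar(name_l), nchar(name_r))

# Distance thresholds keep pairs at or below the cut-off;
# similarity thresholds keep pairs at or above it.
passes_distance   <- lev <= 1
passes_similarity <- sim >= 0.75
```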
Declares how one or more columns should be compared when scoring record pairs. Each call adds one comparison to the specification.
il_compare(
  spec,
  col,
  method,
  ...,
  transform = NULL,
  tf_adjustment_weight = 1,
  tf_minimum_u_value = 0
)
spec |
An |
col |
< |
method |
A comparison helper object created by a |
... |
Reserved for future use. |
transform |
An optional transformation function applied to both
left and right column values before comparison. Common choices
include |
tf_adjustment_weight |
Numeric power to raise the term-frequency
Bayes factor to. A value of |
tf_minimum_u_value |
Numeric floor for the term-frequency
denominator. When both TF values are below this threshold, it is
used instead, preventing unrealistically large match weights for
very rare terms. Defaults to |
col accepts tidyselect expressions: a bare column name, c(col_a, col_b), or helpers such as tidyselect::starts_with(). When multiple
columns are targeted, each receives its own comparison layer with the
same method.
An updated copy of spec.
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(dob, cl_date_diff(days(30), days(365)))

# Apply a transform before comparing
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7), transform = tolower)

# Scale TF adjustment weight
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, term_frequency = TRUE),
    tf_adjustment_weight = 0.5,
    tf_minimum_u_value = 0.001
  )
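To give a feel for how the two term-frequency arguments interact, here is a hedged base R sketch. It assumes a splink-style adjustment in which the exact-match u probability is divided by the larger of the two term frequencies (floored at the minimum u value) and the ratio is raised to the adjustment weight. The function `tf_bayes_factor()` and its argument names are hypothetical, and the package's exact formula may differ.

```r
# Hypothetical sketch of a term-frequency Bayes-factor adjustment.
# u_exact: average u probability for the exact-match level (assumed input).
# tf_l, tf_r: relative frequencies of the left and right terms.
tf_bayes_factor <- function(u_exact, tf_l, tf_r, weight = 1, min_u = 0) {
  # Floor the denominator so very rare terms cannot inflate the factor.
  denom <- pmax(tf_l, tf_r, min_u)
  (u_exact / denom)^weight
}

rare_full    <- tf_bayes_factor(0.01, 1e-4, 1e-4)
rare_floored <- tf_bayes_factor(0.01, 1e-4, 1e-4, min_u = 0.001)
rare_half    <- tf_bayes_factor(0.01, 1e-4, 1e-4, weight = 0.5, min_u = 0.001)
no_adjust    <- tf_bayes_factor(0.01, 1e-4, 1e-4, weight = 0)
```

Under these assumptions, a weight of 0 disables the adjustment entirely, and raising the floor shrinks the boost given to agreement on very rare values.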
Scores a single pair of records against a specification without requiring a full training pipeline. Useful for quick one-off comparisons or debugging.
il_compare_records(record_a, record_b, spec, con = NULL)
record_a |
A named list or single-row data frame representing the first record. |
record_b |
A named list or single-row data frame representing the second record. |
spec |
An |
con |
A DBI connection object from |
A single-row tibble of per-comparison gamma values.
df <- data.frame(
  unique_id = 1:20,
  first_name = c(
    'John', 'Jon', 'Jane', 'Jane', 'Bob', 'Bobby', 'Alice', 'Alicia', 'Tom', 'Thomas',
    'John', 'Jon', 'Jane', 'Janet', 'Bob', 'Robert', 'Alice', 'Alison', 'Tom', 'Tomas'
  ),
  surname = c(
    'Smith', 'Smith', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Brown', 'White', 'White',
    'Smith', 'Smyth', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Browne', 'White', 'White'
  ),
  dob = c(
    '1990-01-01', '1990-01-01', '1985-06-15', '1985-06-15', '2000-12-01',
    '2000-12-01', '1975-03-22', '1975-03-22', '1988-07-04', '1988-07-04',
    '1990-01-01', '1990-01-02', '1985-06-15', '1985-06-16', '2000-12-01',
    '2000-12-02', '1975-03-22', '1975-03-23', '1988-07-04', '1988-07-05'
  ),
  city = c(
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid',
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid'
  ),
  email = c(
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]'
  )
)
con <- DBI::dbConnect(duckdb::duckdb())
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(surname, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(dob, cl_exact()) |>
  il_block_on(surname) |>
  il_block_on(first_name)
record_a <- df[1, ]
record_b <- df[2, ]
il_compare_records(record_a, record_b, spec = spec, con = con)
DBI::dbDisconnect(con, shutdown = TRUE)
Computes the distribution of gamma patterns (agreement vectors) across record pairs. Each unique combination of gamma values across comparisons is a "comparison vector". This function counts how often each pattern occurs.
il_comparison_vectors(model, blocking = NULL, limit = NULL)
model |
A trained |
blocking |
A blocking rule created by |
limit |
Maximum number of pairs to sample. Defaults to NULL. |
On DuckDB/PostgreSQL, the computation runs entirely in SQL.
A tibble::tibble() with one row per unique comparison vector. It has a
gamma_<col> column for each comparison, plus count (the number of pairs
with that pattern) and proportion. The result has class
il_comparison_vectors.
con <- DBI::dbConnect(duckdb::duckdb())
spec <- il_spec() |>
  il_compare(first_name, cl_exact()) |>
  il_compare(surname, cl_exact())
model <- il_model(fake_20, spec = spec, con = con)
vectors <- il_comparison_vectors(model)
ggplot2::autoplot(vectors)
il_cleanup(model)
DBI::dbDisconnect(con, shutdown = TRUE)
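To make the idea of counting gamma patterns concrete, the following base R sketch tallies comparison vectors for a handful of pairs. It is only an illustration: the `gammas` data frame is hypothetical toy data, and the code does not reflect how the package computes the result in SQL.

```r
# Toy agreement levels for five record pairs (hypothetical data).
gammas <- data.frame(
  gamma_first_name = c(1, 1, 0, 1, 0),
  gamma_surname    = c(1, 1, 0, 0, 0)
)

# Count how often each distinct pattern (comparison vector) occurs.
patterns <- aggregate(
  count ~ gamma_first_name + gamma_surname,
  data = cbind(gammas, count = 1L),
  FUN = sum
)

# Share of all pairs that fall into each pattern.
patterns$proportion <- patterns$count / sum(patterns$count)
patterns
```

Each row of `patterns` corresponds to one comparison vector, mirroring the count and proportion columns described above.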
Computes the percentage of non-null values for each column across one or more datasets.
il_completeness(..., con = NULL)
... |
One or more data frames, dbplyr::tbl_lazy references, or character table names to profile. |
con |
A DBI connection object from |
A tibble::tibble() with columns table, column, n_total,
n_non_null, and pct_non_null.
df <- data.frame(
  unique_id = 1:20,
  first_name = c(
    'John', 'Jon', 'Jane', 'Jane', 'Bob', 'Bobby', 'Alice', 'Alicia', 'Tom', 'Thomas',
    'John', 'Jon', 'Jane', 'Janet', 'Bob', 'Robert', 'Alice', 'Alison', 'Tom', 'Tomas'
  ),
  surname = c(
    'Smith', 'Smith', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Brown', 'White', 'White',
    'Smith', 'Smyth', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Browne', 'White', 'White'
  ),
  dob = c(
    '1990-01-01', '1990-01-01', '1985-06-15', '1985-06-15', '2000-12-01',
    '2000-12-01', '1975-03-22', '1975-03-22', '1988-07-04', '1988-07-04',
    '1990-01-01', '1990-01-02', '1985-06-15', '1985-06-16', '2000-12-01',
    '2000-12-02', '1975-03-22', '1975-03-23', '1988-07-04', '1988-07-05'
  ),
  city = c(
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid',
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid'
  ),
  email = c(
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]'
  )
)
con <- DBI::dbConnect(duckdb::duckdb())
il_completeness(df, con = con)
DBI::dbDisconnect(con, shutdown = TRUE)
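As a rough picture of what a completeness profile contains, the base R sketch below computes the same kind of per-column summary for a small in-memory table. The `people` data are hypothetical, and expressing `pct_non_null` on a 0 to 100 scale is an assumption rather than something the documentation above confirms.

```r
# Toy table with some missing values (hypothetical data).
people <- data.frame(
  first_name = c("John", NA, "Jane", "Bob"),
  surname    = c("Smith", "Smith", NA, NA),
  stringsAsFactors = FALSE
)

# One row per column: total rows and rows with a non-missing value.
completeness <- data.frame(
  column     = names(people),
  n_total    = nrow(people),
  n_non_null = colSums(!is.na(people)),
  row.names  = NULL
)

# Percentage scale (0-100) is assumed here for illustration.
completeness$pct_non_null <- 100 * completeness$n_non_null / completeness$n_total
completeness
```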
Computes the thresholded confusion-matrix counts for labeled pairs.
Uses the same SQL-first scoring path as il_accuracy() on supported
backends, so labeled pairs do not need to be predicted and collected
in full before evaluation.
il_confusion_matrix(model, labels = NULL, threshold = 0.85, labels_col = NULL)
model |
A trained |
labels |
A data frame of labeled pairs with a logical or integer
match indicator. Required unless labels_col is supplied. |
threshold |
A numeric value between 0 and 1 for classifying pairs
as matches. Defaults to 0.85. |
labels_col |
Optional string naming a column in the original data
containing ground-truth cluster/entity IDs. When provided, pairwise
labels are derived automatically via |
A one-row tibble containing threshold, tp, fp, fn,
tn, fn_blocking_miss, precision, recall, and f1.
df <- data.frame(
  unique_id = 1:20,
  first_name = c(
    'John', 'Jon', 'Jane', 'Jane', 'Bob', 'Bobby', 'Alice', 'Alicia', 'Tom', 'Thomas',
    'John', 'Jon', 'Jane', 'Janet', 'Bob', 'Robert', 'Alice', 'Alison', 'Tom', 'Tomas'
  ),
  surname = c(
    'Smith', 'Smith', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Brown', 'White', 'White',
    'Smith', 'Smyth', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Browne', 'White', 'White'
  ),
  dob = c(
    '1990-01-01', '1990-01-01', '1985-06-15', '1985-06-15', '2000-12-01',
    '2000-12-01', '1975-03-22', '1975-03-22', '1988-07-04', '1988-07-04',
    '1990-01-01', '1990-01-02', '1985-06-15', '1985-06-16', '2000-12-01',
    '2000-12-02', '1975-03-22', '1975-03-23', '1988-07-04', '1988-07-05'
  )
)
con <- DBI::dbConnect(duckdb::duckdb())
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(surname, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(dob, cl_exact()) |>
  il_block_on(surname) |>
  il_block_on(first_name)
model <- il_model(df, spec = spec, con = con)
model <- il_estimate_u(model)
model <- il_estimate_em(model, block_on(surname))
labels <- data.frame(
  unique_id_l = c(1L, 1L),
  unique_id_r = c(11L, 2L),
  is_match = c(1L, 0L)
)
il_confusion_matrix(model, labels = labels, threshold = 0.85)
DBI::dbDisconnect(con, shutdown = TRUE)
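The thresholded counts can be illustrated with a few lines of base R. This is a sketch on hypothetical scored pairs, not the package's SQL path. Treating a pair as a predicted match when its probability is greater than or equal to the threshold is an assumption, and the fn_blocking_miss column (labeled matches that blocking never generated) is not modeled here.

```r
# Hypothetical scored pairs: model probability plus a ground-truth label.
scored <- data.frame(
  match_probability = c(0.99, 0.90, 0.40, 0.10, 0.95),
  is_match          = c(TRUE, TRUE, TRUE, FALSE, FALSE)
)
threshold <- 0.85

# Assumed rule: predicted match when probability >= threshold.
predicted <- scored$match_probability >= threshold

tp <- sum(predicted & scored$is_match)    # predicted match, true match
fp <- sum(predicted & !scored$is_match)   # predicted match, true non-match
fn <- sum(!predicted & scored$is_match)   # predicted non-match, true match
tn <- sum(!predicted & !scored$is_match)  # predicted non-match, true non-match

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)
```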
Fixes one comparison's matched-class m probabilities during EM. This is a
hard constraint, stored separately from regularizing priors.
il_constrain_m(model, col, exact = NULL, levels = NULL)
model |
An |
col |
Comparison column, supplied as a bare name or string. |
exact |
Probability for the strongest gamma level. |
levels |
Complete named probability vector, with names such as |
The model with constraint metadata in model$params$constraints.
Inspect Model Constraints
il_constraints(model)
model |
An |
A tibble::tibble() of stored fixed-constraint metadata.
Estimates how many record pairs each blocking rule generates without performing full comparisons. Useful for tuning blocking strategies before training: generating too many pairs makes comparison slow, while generating too few risks missing true matches.
il_count_pairs(.data, ..., con = NULL, link_type = c("dedupe", "link"))
.data |
A data frame, dbplyr::tbl_lazy, or character table name (first or only dataset). |
... |
Blocking rules created by |
con |
A DBI connection object from |
link_type |
One of "dedupe" or "link". |
A tibble::tibble() with columns rule and n_pairs. When blocking rules
are supplied, it also includes cumulative_pairs and
pct_of_cartesian.
df <- data.frame(
  unique_id = 1:20,
  first_name = c(
    'John', 'Jon', 'Jane', 'Jane', 'Bob', 'Bobby', 'Alice', 'Alicia', 'Tom', 'Thomas',
    'John', 'Jon', 'Jane', 'Janet', 'Bob', 'Robert', 'Alice', 'Alison', 'Tom', 'Tomas'
  ),
  surname = c(
    'Smith', 'Smith', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Brown', 'White', 'White',
    'Smith', 'Smyth', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Browne', 'White', 'White'
  ),
  dob = c(
    '1990-01-01', '1990-01-01', '1985-06-15', '1985-06-15', '2000-12-01',
    '2000-12-01', '1975-03-22', '1975-03-22', '1988-07-04', '1988-07-04',
    '1990-01-01', '1990-01-02', '1985-06-15', '1985-06-16', '2000-12-01',
    '2000-12-02', '1975-03-22', '1975-03-23', '1988-07-04', '1988-07-05'
  ),
  city = c(
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid',
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid'
  ),
  email = c(
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]'
  )
)
con <- DBI::dbConnect(duckdb::duckdb())
il_count_pairs(
  df,
  block_on(surname),
  block_on(first_name),
  con = con
)
DBI::dbDisconnect(con, shutdown = TRUE)
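The arithmetic behind a pair count for a single equality blocking rule can be sketched in base R. The example assumes deduplication, where each unordered pair is counted once, and uses a hypothetical `surname` vector. It illustrates the counting idea only and is not the package's implementation.

```r
# Hypothetical blocking key values for six records.
surname <- c("Smith", "Smith", "Doe", "Doe", "Doe", "Jones")

# Records sharing a surname form a block; a block of size n yields choose(n, 2) pairs.
block_sizes <- as.vector(table(surname))
n_pairs <- sum(choose(block_sizes, 2))

# Baseline: every unordered pair of records, with no blocking at all.
n_cartesian <- choose(length(surname), 2)
pct_of_cartesian <- 100 * n_pairs / n_cartesian
```

Here the blocks of size 2, 3, and 1 contribute 1, 3, and 0 pairs, so blocking on surname generates 4 of the 15 possible pairs.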
Finds exact-match record pairs using the blocking rules in the specification, without requiring probabilistic training. This is a common first step before probabilistic linkage. Pairs that match on all blocking columns are returned directly.
il_deterministic_link(
  .data,
  ...,
  spec,
  con = NULL,
  link_type = c("dedupe", "link", "link_and_dedupe")
)
.data |
A data frame, dbplyr::tbl_lazy, or character table name (first or only dataset). |
... |
Additional datasets for multi-table linkage. |
spec |
An |
con |
A DBI connection object from |
link_type |
One of "dedupe", "link", or "link_and_dedupe". |
A tibble::tibble() of exact-match record pairs.
df <- data.frame(
  unique_id = 1:20,
  first_name = c(
    'John', 'Jon', 'Jane', 'Jane', 'Bob', 'Bobby', 'Alice', 'Alicia', 'Tom', 'Thomas',
    'John', 'Jon', 'Jane', 'Janet', 'Bob', 'Robert', 'Alice', 'Alison', 'Tom', 'Tomas'
  ),
  surname = c(
    'Smith', 'Smith', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Brown', 'White', 'White',
    'Smith', 'Smyth', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Browne', 'White', 'White'
  ),
  dob = c(
    '1990-01-01', '1990-01-01', '1985-06-15', '1985-06-15', '2000-12-01',
    '2000-12-01', '1975-03-22', '1975-03-22', '1988-07-04', '1988-07-04',
    '1990-01-01', '1990-01-02', '1985-06-15', '1985-06-16', '2000-12-01',
    '2000-12-02', '1975-03-22', '1975-03-23', '1988-07-04', '1988-07-05'
  ),
  city = c(
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid',
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid'
  ),
  email = c(
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]'
  )
)
con <- DBI::dbConnect(duckdb::duckdb())
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(surname, cl_jaro_winkler(0.9, 0.7)) |>
  il_block_on(first_name, surname, dob)
exact_matches <- il_deterministic_link(df, spec = spec, con = con)
DBI::dbDisconnect(con, shutdown = TRUE)
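Conceptually, deterministic linkage on a set of blocking columns resembles a self-join on those columns. The base R sketch below shows that idea with `merge()` on a hypothetical four-row table. The `_l` and `_r` suffixes follow the naming used for pair columns elsewhere in this manual, but the exact output columns of il_deterministic_link() are not assumed.

```r
# Hypothetical input with a unique identifier.
people <- data.frame(
  unique_id  = 1:4,
  first_name = c("Jane", "Jane", "Bob", "Bobby"),
  surname    = c("Doe", "Doe", "Jones", "Jones"),
  stringsAsFactors = FALSE
)

# Self-join on the blocking columns: rows agree exactly on every key.
pairs <- merge(
  people, people,
  by = c("first_name", "surname"),
  suffixes = c("_l", "_r")
)

# Drop self-pairs and keep each unordered pair once (dedupe setting).
pairs <- pairs[pairs$unique_id_l < pairs$unique_id_r, c("unique_id_l", "unique_id_r")]
pairs
```

Only records 1 and 2 agree on both columns, so a single pair survives. "Bob" and "Bobby" are not linked because the rule requires exact equality.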
Compares model predictions against labeled pairs and returns all false-positive and false-negative errors at a given threshold. Useful for understanding which record pairs the model gets wrong.
il_errors(model, labels = NULL, threshold = 0.85, labels_col = NULL)
model |
A trained |
labels |
A data frame of labeled pairs with a logical or integer
match indicator. Required unless labels_col is supplied. |
threshold |
A numeric value between 0 and 1 for classifying pairs
as matches. Defaults to 0.85. |
labels_col |
Optional string naming a column in the original data containing ground-truth cluster/entity IDs. |
A tibble::tibble() of misclassified pairs with columns unique_id_l,
unique_id_r, match_weight, match_probability, true_label,
and error_type.
df <- data.frame(
  unique_id = 1:20,
  first_name = c(
    'John', 'Jon', 'Jane', 'Jane', 'Bob', 'Bobby', 'Alice', 'Alicia', 'Tom', 'Thomas',
    'John', 'Jon', 'Jane', 'Janet', 'Bob', 'Robert', 'Alice', 'Alison', 'Tom', 'Tomas'
  ),
  surname = c(
    'Smith', 'Smith', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Brown', 'White', 'White',
    'Smith', 'Smyth', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Browne', 'White', 'White'
  ),
  dob = c(
    '1990-01-01', '1990-01-01', '1985-06-15', '1985-06-15', '2000-12-01',
    '2000-12-01', '1975-03-22', '1975-03-22', '1988-07-04', '1988-07-04',
    '1990-01-01', '1990-01-02', '1985-06-15', '1985-06-16', '2000-12-01',
    '2000-12-02', '1975-03-22', '1975-03-23', '1988-07-04', '1988-07-05'
  ),
  city = c(
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid',
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid'
  ),
  email = c(
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]'
  )
)
con <- DBI::dbConnect(duckdb::duckdb())
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(surname, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(dob, cl_exact()) |>
  il_block_on(surname) |>
  il_block_on(first_name)
model <- il_model(df, spec = spec, con = con)
model <- il_estimate_u(model)
model <- il_estimate_em(model, block_on(surname))
labels <- data.frame(
  unique_id_l = c(1L, 1L),
  unique_id_r = c(11L, 2L),
  is_match = c(1L, 0L)
)
il_errors(model, labels = labels, threshold = 0.85)
DBI::dbDisconnect(con, shutdown = TRUE)
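The selection of misclassified pairs can be pictured with a short base R sketch on hypothetical scored pairs. The "fp" and "fn" values for error_type, and the use of greater-than-or-equal at the threshold, are assumptions made for illustration; the package may encode these differently.

```r
# Hypothetical scored pairs with ground-truth labels.
scored <- data.frame(
  unique_id_l       = c(1, 1, 3),
  unique_id_r       = c(2, 11, 4),
  match_probability = c(0.95, 0.30, 0.99),
  true_label        = c(FALSE, TRUE, TRUE)
)
threshold <- 0.85
predicted <- scored$match_probability >= threshold

# Tag each pair: false positive, false negative, or NA when classified correctly.
scored$error_type <- ifelse(
  predicted & !scored$true_label, "fp",
  ifelse(!predicted & scored$true_label, "fn", NA_character_)
)

# Keep only the pairs the model gets wrong at this threshold.
errors <- scored[!is.na(scored$error_type), ]
errors
```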
Runs the EM algorithm under a blocking rule to learn m and u parameters from unlabeled data. Multiple calls with different blocking rules can be chained to train on complementary subsets of record pairs. Each call updates the model cumulatively.
il_estimate_em(
  model,
  blocking,
  convergence = 1e-05,
  fix_u = TRUE,
  fix_m = FALSE,
  max_iterations = 100L,
  fix_prior = FALSE,
  estimate_without_tf = TRUE,
  derive_prior = FALSE,
  estimator_mode = c("independent", "dependency-aware"),
  ...
)
model |
An |
blocking |
A blocking rule created by |
convergence |
A numeric convergence tolerance. The EM loop stops
when the largest change in any updated parameter is below this value.
Defaults to 1e-05. |
fix_u |
Logical. If |
fix_m |
Logical. If |
max_iterations |
Maximum number of EM iterations. Defaults to 100L. |
fix_prior |
Logical. If |
estimate_without_tf |
Logical. If |
derive_prior |
Logical. If |
estimator_mode |
Estimator to use. |
... |
Reserved for future options. |
An updated il_model with trained m and u parameters.
df <- data.frame(
  unique_id = 1:20,
  first_name = c(
    'John', 'Jon', 'Jane', 'Jane', 'Bob', 'Bobby', 'Alice', 'Alicia', 'Tom', 'Thomas',
    'John', 'Jon', 'Jane', 'Janet', 'Bob', 'Robert', 'Alice', 'Alison', 'Tom', 'Tomas'
  ),
  surname = c(
    'Smith', 'Smith', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Brown', 'White', 'White',
    'Smith', 'Smyth', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Browne', 'White', 'White'
  ),
  dob = c(
    '1990-01-01', '1990-01-01', '1985-06-15', '1985-06-15', '2000-12-01',
    '2000-12-01', '1975-03-22', '1975-03-22', '1988-07-04', '1988-07-04',
    '1990-01-01', '1990-01-02', '1985-06-15', '1985-06-16', '2000-12-01',
    '2000-12-02', '1975-03-22', '1975-03-23', '1988-07-04', '1988-07-05'
  ),
  city = c(
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid',
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid'
  ),
  email = c(
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]'
  )
)
con <- DBI::dbConnect(duckdb::duckdb())
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(surname, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(dob, cl_exact()) |>
  il_block_on(surname) |>
  il_block_on(first_name)
model <- il_model(df, spec = spec, con = con)
model <- il_estimate_u(model)
model <- il_estimate_em(model, block_on(surname))
DBI::dbDisconnect(con, shutdown = TRUE)
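For readers unfamiliar with EM in the Fellegi-Sunter setting, the following base R sketch runs the algorithm on a tiny hypothetical agreement matrix. It assumes binary agreement levels and conditional independence between comparisons, and it holds u fixed to mirror the fix_u = TRUE default. All starting values are made up, and the package's own estimator (which runs in SQL and supports more levels) may differ in its details.

```r
# Hypothetical binary agreement matrix: rows are pairs, columns are comparisons.
gamma <- rbind(
  c(1, 1), c(1, 1), c(1, 0), c(0, 1),
  c(0, 0), c(0, 0), c(0, 0), c(1, 0)
)
u <- c(0.2, 0.1)   # P(agree | non-match), held fixed (mirrors fix_u = TRUE)
m <- c(0.8, 0.8)   # P(agree | match), starting values
lambda <- 0.3      # prior probability that a pair is a match

for (iteration in seq_len(100)) {
  # E-step: posterior match probability per pair under conditional independence.
  lik_m <- apply(gamma, 1, function(g) prod(m^g * (1 - m)^(1 - g)))
  lik_u <- apply(gamma, 1, function(g) prod(u^g * (1 - u)^(1 - g)))
  post <- lambda * lik_m / (lambda * lik_m + (1 - lambda) * lik_u)

  # M-step: re-estimate m and the prior from the posterior weights.
  m_new <- colSums(post * gamma) / sum(post)
  lambda_new <- mean(post)

  # Stop when the largest parameter change falls below the tolerance.
  delta <- max(abs(c(m_new - m, lambda_new - lambda)))
  m <- m_new
  lambda <- lambda_new
  if (delta < 1e-5) break
}
```

The stopping rule follows the convergence argument described above: iteration ends once no updated parameter moves by more than the tolerance, or after the maximum number of iterations.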
Learns the m probabilities from a ground-truth identifier column
(e.g., Social Security Number) present in the input data. Records
sharing the same label value are treated as true matches. This is an
alternative to il_estimate_m_from_labels(), which requires a
separate table of pairwise labels.
il_estimate_m_from_column(model, label_col)
model |
An |
label_col |
The unquoted name of a column in the input data containing ground-truth entity identifiers. |
An updated il_model with estimated m parameters.
con <- DBI::dbConnect(duckdb::duckdb())
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(surname, cl_exact()) |>
  il_compare(dob, cl_exact()) |>
  il_block_on(surname)
model <- il_model(fake_20, spec = spec, con = con)
model <- il_estimate_u(model)
model <- il_estimate_m_from_column(model, city)
DBI::dbDisconnect(con, shutdown = TRUE)
Learns the m probabilities (the probability of observing each
comparison level given that the records do match) from a set of
pre-labeled record pairs. Use this instead of il_estimate_em() when
ground-truth labels are available.
il_estimate_m_from_labels(model, labels)
model |
An |
labels |
A data frame of labeled pairs with columns identifying the left record, right record, and a logical or integer match indicator. |
An updated il_model with estimated m parameters.
df <- data.frame(
  unique_id = 1:20,
  first_name = c(
    'John', 'Jon', 'Jane', 'Jane', 'Bob', 'Bobby', 'Alice', 'Alicia', 'Tom', 'Thomas',
    'John', 'Jon', 'Jane', 'Janet', 'Bob', 'Robert', 'Alice', 'Alison', 'Tom', 'Tomas'
  ),
  surname = c(
    'Smith', 'Smith', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Brown', 'White', 'White',
    'Smith', 'Smyth', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Browne', 'White', 'White'
  ),
  dob = c(
    '1990-01-01', '1990-01-01', '1985-06-15', '1985-06-15', '2000-12-01',
    '2000-12-01', '1975-03-22', '1975-03-22', '1988-07-04', '1988-07-04',
    '1990-01-01', '1990-01-02', '1985-06-15', '1985-06-16', '2000-12-01',
    '2000-12-02', '1975-03-22', '1975-03-23', '1988-07-04', '1988-07-05'
  ),
  city = c(
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid',
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid'
  ),
  email = c(
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]'
  )
)
con <- DBI::dbConnect(duckdb::duckdb())
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(surname, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(dob, cl_exact()) |>
  il_block_on(surname) |>
  il_block_on(first_name)
model <- il_model(df, spec = spec, con = con)
model <- il_estimate_u(model)
labels <- data.frame(
  unique_id_l = c(1L, 1L),
  unique_id_r = c(11L, 2L),
  is_match = c(1L, 0L)
)
model <- il_estimate_m_from_labels(model, labels)
DBI::dbDisconnect(con, shutdown = TRUE)
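Estimating m from labels amounts to tabulating comparison levels among the pairs labeled as matches. The base R sketch below shows this for one comparison on hypothetical labeled pairs. The level coding (0 for no agreement, 1 for fuzzy, 2 for exact) is an assumption made for the example.

```r
# Hypothetical labeled pairs with one agreement level per pair
# (assumed coding: 0 = no agreement, 1 = fuzzy, 2 = exact).
labeled <- data.frame(
  gamma_surname = c(2, 2, 1, 0, 2, 0),
  is_match      = c(1L, 1L, 1L, 1L, 0L, 0L)
)

# m is the distribution of levels among the true matches only.
matches <- labeled[labeled$is_match == 1L, ]
levels_seen <- factor(matches$gamma_surname, levels = 0:2)
m_surname <- prop.table(table(levels_seen))
m_surname
```

Of the four labeled matches, two agree exactly, one agrees fuzzily, and one disagrees, giving m values of 0.5, 0.25, and 0.25 for levels 2, 1, and 0.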
Estimates the probability that two randomly selected records from the dataset are a match, using deterministic rules and a recall assumption. This prior anchors the Fellegi-Sunter model before more detailed parameter estimation.
il_estimate_prior(model, ..., recall = 0.7, profile_sql = FALSE)
model |
An |
... |
Blocking rules created by |
recall |
A numeric value between 0 and 1 representing the assumed
recall of the deterministic rules. Defaults to 0.7. |
profile_sql |
Logical. If |
An updated il_model with the estimated prior.
df <- data.frame(
  unique_id = 1:20,
  first_name = c(
    'John', 'Jon', 'Jane', 'Jane', 'Bob', 'Bobby', 'Alice', 'Alicia', 'Tom', 'Thomas',
    'John', 'Jon', 'Jane', 'Janet', 'Bob', 'Robert', 'Alice', 'Alison', 'Tom', 'Tomas'
  ),
  surname = c(
    'Smith', 'Smith', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Brown', 'White', 'White',
    'Smith', 'Smyth', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Browne', 'White', 'White'
  ),
  dob = c(
    '1990-01-01', '1990-01-01', '1985-06-15', '1985-06-15', '2000-12-01',
    '2000-12-01', '1975-03-22', '1975-03-22', '1988-07-04', '1988-07-04',
    '1990-01-01', '1990-01-02', '1985-06-15', '1985-06-16', '2000-12-01',
    '2000-12-02', '1975-03-22', '1975-03-23', '1988-07-04', '1988-07-05'
  ),
  city = c(
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid',
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid'
  ),
  email = c(
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]'
  )
)
con <- DBI::dbConnect(duckdb::duckdb())
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(surname, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(dob, cl_exact()) |>
  il_block_on(surname) |>
  il_block_on(first_name)
model <- il_model(df, spec = spec, con = con)
model <- il_estimate_prior(model, block_on(first_name, surname, dob))
DBI::dbDisconnect(con, shutdown = TRUE)
Estimates the u probabilities (the probability of observing each comparison level given that the records do not match) by randomly sampling record pairs. Most random pairs are non-matches, so the observed level frequencies approximate the u distribution.
il_estimate_u(model, max_pairs = 1e+06, min_count_per_level = NULL, chunk_size = NULL, profile_sql = FALSE)
model |
An |
max_pairs |
Maximum number of random pairs to sample. Defaults to 1e+06. |
min_count_per_level |
Optional integer. When set, chunked estimation
stops once every comparison level has been observed at least this many
times, or once |
chunk_size |
Optional integer number of pairs to score per chunk. When set, u estimation accumulates gamma counts across chunks instead of using one aggregate query. |
profile_sql |
Logical. If |
An updated il_model with estimated u parameters.
df <- data.frame(
  unique_id = 1:20,
  first_name = c(
    'John', 'Jon', 'Jane', 'Jane', 'Bob', 'Bobby', 'Alice', 'Alicia', 'Tom', 'Thomas',
    'John', 'Jon', 'Jane', 'Janet', 'Bob', 'Robert', 'Alice', 'Alison', 'Tom', 'Tomas'
  ),
  surname = c(
    'Smith', 'Smith', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Brown', 'White', 'White',
    'Smith', 'Smyth', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Browne', 'White', 'White'
  ),
  dob = c(
    '1990-01-01', '1990-01-01', '1985-06-15', '1985-06-15', '2000-12-01',
    '2000-12-01', '1975-03-22', '1975-03-22', '1988-07-04', '1988-07-04',
    '1990-01-01', '1990-01-02', '1985-06-15', '1985-06-16', '2000-12-01',
    '2000-12-02', '1975-03-22', '1975-03-23', '1988-07-04', '1988-07-05'
  ),
  city = c(
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid',
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid'
  ),
  email = c(
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]'
  )
)
con <- DBI::dbConnect(duckdb::duckdb())
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(surname, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(dob, cl_exact()) |>
  il_block_on(surname) |>
  il_block_on(first_name)
model <- il_model(df, spec = spec, con = con)
model <- il_estimate_u(model)
DBI::dbDisconnect(con, shutdown = TRUE)
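The sampling idea behind il_estimate_u() can be illustrated without the package. The sketch below uses base R only and is not irelink's implementation: it draws random record pairs from a toy surname vector, assigns each pair a two-level exact-match comparison, and treats the observed level frequencies as a rough u estimate. The data, the two-level comparison, and every variable name are assumptions made for illustration.

```r
set.seed(1)

# Toy data: 20 records spread evenly over 5 surnames (hypothetical values)
surname <- rep(c("Smith", "Doe", "Jones", "Brown", "White"), times = 4)

# Draw random record pairs and discard self-pairs
n_draws <- 5000
left  <- sample(seq_along(surname), n_draws, replace = TRUE)
right <- sample(seq_along(surname), n_draws, replace = TRUE)
keep  <- left != right

# Comparison level: 1 = exact match, 0 = anything else
gamma <- as.integer(surname[left[keep]] == surname[right[keep]])

# Observed level frequencies approximate the u probabilities
u_hat <- prop.table(table(factor(gamma, levels = c(0, 1))))
u_hat
```

In this toy setup each record shares its surname with 3 of the 19 other records, so the frequency of level 1 should land near 3/19. Random pairs still include a few true matches, which is why the result is an approximation of u and not an exact value.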
Searches for single-column (and optionally two-column) blocking rules that keep the total number of candidate pairs below a given ceiling.
il_find_blocking_below(.data, max_pairs, columns = NULL, con = NULL, link_type = c("dedupe", "link"), max_depth = 2L)
.data |
A data frame, dbplyr::tbl_lazy, or character table name. |
max_pairs |
Maximum number of pairs allowed. |
columns |
Character vector of column names. |
con |
A DBI connection object from |
link_type |
One of "dedupe" or "link". |
max_depth |
Maximum depth of column combinations (default 2L). |
A tibble::tibble() of qualifying blocking rules, sorted by n_pairs
ascending. Empty tibble if no rules qualify.
con <- DBI::dbConnect(duckdb::duckdb())
il_find_blocking_below(fake_1000, max_pairs = 100000, con = con)
DBI::dbDisconnect(con, shutdown = TRUE)
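To make the pair ceiling concrete, the following base R sketch counts the deduplication pairs each single-column rule would generate and keeps the rules under a limit. It assumes the usual counting rule for deduplication (a key shared by k records yields k(k - 1)/2 pairs) and is not how the package runs the search, which happens in SQL. The toy data and the helper count_pairs() are hypothetical.

```r
# Toy data (hypothetical values)
df <- data.frame(
  surname = c("Smith", "Smith", "Doe", "Doe", "Doe", "Jones"),
  city    = c("London", "London", "London", "Paris", "Paris", "Paris")
)

# Pairs produced by blocking on one column:
# each key shared by k records yields k * (k - 1) / 2 pairs
count_pairs <- function(x) {
  k <- as.numeric(table(x))
  sum(k * (k - 1) / 2)
}

n_pairs <- vapply(df, count_pairs, numeric(1))

# Keep rules under the ceiling, smallest first
max_pairs <- 5
sort(n_pairs[n_pairs <= max_pairs])
```

Here blocking on surname gives 4 pairs and blocking on city gives 6, so only the surname rule qualifies under a ceiling of 5.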
Scores new records against the data already loaded into a trained model. Useful for real-time or incremental matching where new records arrive after the model has been trained.
il_find_matches(model, new_records, threshold = 0.85)
model |
A trained |
new_records |
A data frame, dbplyr::tbl_lazy, or character table name of new records to match against the model's existing data. |
threshold |
A numeric value between 0 and 1. Only matches at or above this probability are returned. Defaults to 0.85. |
An il_compared tibble of scored pairs between new records
and existing data.
df <- data.frame(
  unique_id = 1:20,
  first_name = c(
    'John', 'Jon', 'Jane', 'Jane', 'Bob', 'Bobby', 'Alice', 'Alicia', 'Tom', 'Thomas',
    'John', 'Jon', 'Jane', 'Janet', 'Bob', 'Robert', 'Alice', 'Alison', 'Tom', 'Tomas'
  ),
  surname = c(
    'Smith', 'Smith', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Brown', 'White', 'White',
    'Smith', 'Smyth', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Browne', 'White', 'White'
  ),
  dob = c(
    '1990-01-01', '1990-01-01', '1985-06-15', '1985-06-15', '2000-12-01',
    '2000-12-01', '1975-03-22', '1975-03-22', '1988-07-04', '1988-07-04',
    '1990-01-01', '1990-01-02', '1985-06-15', '1985-06-16', '2000-12-01',
    '2000-12-02', '1975-03-22', '1975-03-23', '1988-07-04', '1988-07-05'
  ),
  city = c(
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid',
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid'
  ),
  email = c(
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]'
  )
)
con <- DBI::dbConnect(duckdb::duckdb())
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(surname, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(dob, cl_exact()) |>
  il_block_on(surname) |>
  il_block_on(first_name)
model <- il_model(df, spec = spec, con = con)
model <- il_estimate_u(model)
model <- il_estimate_em(model, block_on(surname))
new_df <- data.frame(
  first_name = 'Jhon',
  surname = 'Smith',
  dob = '1990-01-15',
  city = 'London'
)
il_find_matches(model, new_df, threshold = 0.5)
DBI::dbDisconnect(con, shutdown = TRUE)
Returns node-, edge-, and cluster-level metrics from the linkage graph. Useful for diagnosing cluster quality and identifying bridge edges or weakly connected components.
il_graph_metrics(pairs, clusters)
pairs |
An |
clusters |
A tibble from |
A named list of three tibbles:
nodes: Record-level metrics (degree, centrality).
edges: Edge-level metrics (match probability, bridge flag).
clusters: Cluster-level metrics (size, density).
df <- data.frame(
  unique_id = 1:20,
  first_name = c(
    'John', 'Jon', 'Jane', 'Jane', 'Bob', 'Bobby', 'Alice', 'Alicia', 'Tom', 'Thomas',
    'John', 'Jon', 'Jane', 'Janet', 'Bob', 'Robert', 'Alice', 'Alison', 'Tom', 'Tomas'
  ),
  surname = c(
    'Smith', 'Smith', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Brown', 'White', 'White',
    'Smith', 'Smyth', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Browne', 'White', 'White'
  ),
  dob = c(
    '1990-01-01', '1990-01-01', '1985-06-15', '1985-06-15', '2000-12-01',
    '2000-12-01', '1975-03-22', '1975-03-22', '1988-07-04', '1988-07-04',
    '1990-01-01', '1990-01-02', '1985-06-15', '1985-06-16', '2000-12-01',
    '2000-12-02', '1975-03-22', '1975-03-23', '1988-07-04', '1988-07-05'
  ),
  city = c(
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid',
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid'
  ),
  email = c(
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]'
  )
)
con <- DBI::dbConnect(duckdb::duckdb())
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(surname, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(dob, cl_exact()) |>
  il_block_on(surname) |>
  il_block_on(first_name)
model <- il_model(df, spec = spec, con = con)
model <- il_estimate_u(model)
model <- il_estimate_em(model, block_on(surname))
pairs <- predict(model, threshold = 0.5)
clusters <- il_cluster(pairs)
metrics <- il_graph_metrics(pairs, clusters)
metrics$clusters
DBI::dbDisconnect(con, shutdown = TRUE)
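For intuition about two of the metrics named above, node degree and cluster density, here is a base R sketch on a hand-built edge list. It assumes density means observed edges divided by possible edges within a cluster, which is the usual graph definition but is not confirmed by this page. Column names and data are hypothetical, and bridge detection and centrality are left out.

```r
# Hand-built linkage graph (hypothetical): 3 edges over 6 records
edges <- data.frame(from = c(1, 1, 4), to = c(2, 3, 5))
clusters <- data.frame(
  unique_id  = 1:6,
  cluster_id = c("a", "a", "a", "b", "b", "c")
)

# Node degree: number of edges touching each record
degree <- table(factor(c(edges$from, edges$to), levels = clusters$unique_id))

# Cluster size and number of within-cluster edges
edge_cluster <- clusters$cluster_id[match(edges$from, clusters$unique_id)]
size    <- table(clusters$cluster_id)
n_edges <- table(factor(edge_cluster, levels = names(size)))

# Density: observed edges / possible edges (undefined for singletons)
possible <- as.numeric(size) * (as.numeric(size) - 1) / 2
density  <- ifelse(possible > 0, as.numeric(n_edges) / possible, NA_real_)

data.frame(cluster_id = names(size), size = as.integer(size), density = density)
```

Cluster "a" has 2 of its 3 possible edges, so its density is 2/3; a low density in a large cluster is one sign that it may be held together by a few weak links.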
For a given blocking rule, returns the n blocking-key combinations
that produce the most record pairs. This helps diagnose skew, where a
single dominant key can create a quadratic explosion of pairs.
il_largest_blocks(.data, rule, n = 5L, con = NULL, link_type = c("dedupe", "link"))
.data |
A data frame, dbplyr::tbl_lazy, or character table name (first or only dataset). |
rule |
A blocking rule created by |
n |
Integer. Number of largest bins to return. Defaults to 5L. |
con |
A DBI connection object from |
link_type |
One of "dedupe" or "link". |
A tibble::tibble() with one row per blocking-key combination, sorted by
descending pair count. Columns are the blocking-key values plus
n_records and n_pairs.
df <- data.frame(
  unique_id = 1:20,
  first_name = c(
    'John', 'Jon', 'Jane', 'Jane', 'Bob', 'Bobby', 'Alice', 'Alicia', 'Tom', 'Thomas',
    'John', 'Jon', 'Jane', 'Janet', 'Bob', 'Robert', 'Alice', 'Alison', 'Tom', 'Tomas'
  ),
  surname = c(
    'Smith', 'Smith', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Brown', 'White', 'White',
    'Smith', 'Smyth', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Browne', 'White', 'White'
  ),
  city = c(
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid',
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid'
  )
)
con <- DBI::dbConnect(duckdb::duckdb())
il_largest_blocks(df, block_on(city), n = 3, con = con)
DBI::dbDisconnect(con, shutdown = TRUE)
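The quadratic growth mentioned above is easy to see in a small base R sketch. It builds a table with the same n_records and n_pairs columns for a toy blocking key, assuming the deduplication pair count k(k - 1)/2 per key. It does not call the package, and the data are made up.

```r
# Toy blocking key with one dominant value (hypothetical)
city <- c(rep("London", 6), rep("Paris", 3), "Rome")

counts <- table(city)
blocks <- data.frame(
  city      = names(counts),
  n_records = as.integer(counts),
  n_pairs   = as.integer(counts) * (as.integer(counts) - 1) / 2
)

# Largest blocks first
blocks <- blocks[order(-blocks$n_pairs), ]
head(blocks, 2)
```

London holds 6 of the 10 records but accounts for 15 of the 18 pairs, which is the kind of skew this function is meant to surface.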
Reads a saved il_model object from .json or .rds.
il_load(path)
path |
A file path (character string) to a saved model. |
Settings JSON is loaded into an il_model that can be used with
il_attach() and predict(). The database connection and any in-database
tables are not loaded.
JSON imports reconstruct equivalent SQL-backed comparison and blocking
behavior for use after il_attach(). They do not necessarily recreate the
original irelink helper objects or transform functions stored in the
original spec.
An il_model object.
df <- data.frame(
  unique_id = 1:20,
  first_name = c(
    'John', 'Jon', 'Jane', 'Jane', 'Bob', 'Bobby', 'Alice', 'Alicia', 'Tom', 'Thomas',
    'John', 'Jon', 'Jane', 'Janet', 'Bob', 'Robert', 'Alice', 'Alison', 'Tom', 'Tomas'
  ),
  surname = c(
    'Smith', 'Smith', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Brown', 'White', 'White',
    'Smith', 'Smyth', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Browne', 'White', 'White'
  ),
  dob = c(
    '1990-01-01', '1990-01-01', '1985-06-15', '1985-06-15', '2000-12-01',
    '2000-12-01', '1975-03-22', '1975-03-22', '1988-07-04', '1988-07-04',
    '1990-01-01', '1990-01-02', '1985-06-15', '1985-06-16', '2000-12-01',
    '2000-12-02', '1975-03-22', '1975-03-23', '1988-07-04', '1988-07-05'
  )
)
con <- DBI::dbConnect(duckdb::duckdb())
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(surname, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(dob, cl_exact()) |>
  il_block_on(surname) |>
  il_block_on(first_name)
model <- il_model(df, spec = spec, con = con)
model <- il_estimate_u(model)
model <- il_estimate_em(model, block_on(surname))
tmp <- tempfile(fileext = '.rds')
il_save(model, tmp)
loaded <- il_load(tmp)
DBI::dbDisconnect(con, shutdown = TRUE)
Binds one or more datasets to a specification and a database connection, producing an untrained model. Accepts in-memory data frames, dbplyr::tbl_lazy table references, or character table names for data that already lives in a database.
il_model(.data, ..., spec, con = NULL, link_type = c("dedupe", "link", "link_and_dedupe"))
.data |
A data frame, |
... |
Additional datasets for multi-table linkage (same types
as |
spec |
An |
con |
A DBI connection object from |
link_type |
One of "dedupe", "link", or "link_and_dedupe". |
When .data is a dbplyr::tbl_lazy (from dplyr::tbl()), the connection
is extracted automatically and data stays in-database with zero
copying. A unique_id column is injected automatically if not
already present.
An untrained il_model object, ready for training verbs.
con <- DBI::dbConnect(duckdb::duckdb())
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_block_on(surname)
model <- il_model(fake_20, spec = spec, con = con)
# Database-backed: pass a dbplyr reference directly
DBI::dbWriteTable(con, 'my_data', fake_20, overwrite = TRUE)
tbl_ref <- dplyr::tbl(con, 'my_data')
model2 <- il_model(tbl_ref, spec = spec)
DBI::dbDisconnect(con, shutdown = TRUE)
Returns a transform that replaces a specific value with NA. Commonly
used to convert empty strings to NA before comparison so that
missing-data levels are triggered correctly. The result can be passed
as the transform argument to il_compare() or il_block_on(), and
composed with other transforms via il_transform(). On DuckDB and
PostgreSQL, maps to SQL NULLIF.
il_nullif(value)
value |
The value to treat as missing. |
An il_column_transform closure.
tf <- il_nullif('')
tf(c('New York', '', 'Chicago'))
# Use before comparison to treat blank city as missing
spec <- il_spec() |>
  il_compare(city, cl_exact(), transform = il_nullif(''))
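For readers unfamiliar with SQL NULLIF, the semantics can be mimicked in a few lines of base R. The closure below is a hypothetical stand-in written for illustration; it is not the package's il_nullif() and does not carry the il_column_transform class.

```r
# Hypothetical stand-in: returns a function that maps `value` to NA
nullif <- function(value) {
  function(x) {
    # Leave existing NAs alone; blank out only exact matches of `value`
    x[!is.na(x) & x == value] <- NA
    x
  }
}

blank_to_na <- nullif("")
blank_to_na(c("New York", "", "Chicago", NA))
```

The empty string becomes NA while the other values, including the existing NA, pass through unchanged.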
Returns a tidy tibble of m and u probabilities for every comparison
level in the model. Designed for use with ggplot2::geom_point().
il_parameters(model)
model |
A trained |
For independent models, a tibble with columns comparison,
gamma_level, m, and u. For dependency-aware models, the fitted
training-pattern table used for scoring.
df <- data.frame(
  unique_id = 1:20,
  first_name = c(
    'John', 'Jon', 'Jane', 'Jane', 'Bob', 'Bobby', 'Alice', 'Alicia', 'Tom', 'Thomas',
    'John', 'Jon', 'Jane', 'Janet', 'Bob', 'Robert', 'Alice', 'Alison', 'Tom', 'Tomas'
  ),
  surname = c(
    'Smith', 'Smith', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Brown', 'White', 'White',
    'Smith', 'Smyth', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Browne', 'White', 'White'
  ),
  dob = c(
    '1990-01-01', '1990-01-01', '1985-06-15', '1985-06-15', '2000-12-01',
    '2000-12-01', '1975-03-22', '1975-03-22', '1988-07-04', '1988-07-04',
    '1990-01-01', '1990-01-02', '1985-06-15', '1985-06-16', '2000-12-01',
    '2000-12-02', '1975-03-22', '1975-03-23', '1988-07-04', '1988-07-05'
  ),
  city = c(
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid',
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid'
  ),
  email = c(
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]'
  )
)
con <- DBI::dbConnect(duckdb::duckdb())
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(surname, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(dob, cl_exact()) |>
  il_block_on(surname) |>
  il_block_on(first_name)
model <- il_model(df, spec = spec, con = con)
model <- il_estimate_u(model)
model <- il_estimate_em(model, block_on(surname))
il_parameters(model)
DBI::dbDisconnect(con, shutdown = TRUE)
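A table of m and u values is often easier to read once converted to Fellegi-Sunter match weights, log2(m / u). The base R sketch below does this for a made-up table with the columns described above. The numbers are invented, and the added column names bayes_factor and match_weight are assumptions for illustration, not columns the function is documented to return.

```r
# Made-up parameters in the shape described above
params <- data.frame(
  comparison  = c("first_name", "first_name", "dob", "dob"),
  gamma_level = c(1L, 0L, 1L, 0L),
  m           = c(0.8, 0.2, 0.9, 0.1),
  u           = c(0.1, 0.9, 0.01, 0.99)
)

# Bayes factor and log2 match weight for each comparison level
params$bayes_factor <- params$m / params$u
params$match_weight <- log2(params$bayes_factor)
params
```

Positive weights mark levels that are more common among matches than non-matches; negative weights mark the opposite.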
Visualizes phonetic coding agreement between two columns. Shows how Soundex groupings match across pairs.
il_phonetic_chart(.data, col_1, col_2, con = NULL)
.data |
A data frame or character table name. |
col_1, col_2 |
Column names (unquoted or character). |
con |
A DBI connection from |
A ggplot2::ggplot() object.
Returns a tidy tibble of precision and recall values at each
match-probability threshold. Requires labeled pairs. Designed for use
with ggplot2::geom_line().
il_precision_recall(model, labels = NULL, labels_col = NULL)
model |
A trained |
labels |
A data frame of labeled pairs with a logical or integer
match indicator. Required unless |
labels_col |
Optional string naming a column in the original data containing ground-truth cluster/entity IDs. |
A tibble::tibble() with columns threshold, precision, and recall.
df <- data.frame(
  unique_id = 1:20,
  first_name = c(
    'John', 'Jon', 'Jane', 'Jane', 'Bob', 'Bobby', 'Alice', 'Alicia', 'Tom', 'Thomas',
    'John', 'Jon', 'Jane', 'Janet', 'Bob', 'Robert', 'Alice', 'Alison', 'Tom', 'Tomas'
  ),
  surname = c(
    'Smith', 'Smith', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Brown', 'White', 'White',
    'Smith', 'Smyth', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Browne', 'White', 'White'
  ),
  dob = c(
    '1990-01-01', '1990-01-01', '1985-06-15', '1985-06-15', '2000-12-01',
    '2000-12-01', '1975-03-22', '1975-03-22', '1988-07-04', '1988-07-04',
    '1990-01-01', '1990-01-02', '1985-06-15', '1985-06-16', '2000-12-01',
    '2000-12-02', '1975-03-22', '1975-03-23', '1988-07-04', '1988-07-05'
  ),
  city = c(
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid',
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid'
  ),
  email = c(
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]'
  )
)
con <- DBI::dbConnect(duckdb::duckdb())
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(surname, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(dob, cl_exact()) |>
  il_block_on(surname) |>
  il_block_on(first_name)
model <- il_model(df, spec = spec, con = con)
model <- il_estimate_u(model)
model <- il_estimate_em(model, block_on(surname))
labels <- data.frame(
  unique_id_l = c(1L, 1L),
  unique_id_r = c(11L, 2L),
  is_match = c(1L, 0L)
)
il_precision_recall(model, labels = labels)
DBI::dbDisconnect(con, shutdown = TRUE)
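The threshold sweep behind a precision-recall table can be sketched in base R as follows. The scored pairs, the column name match_probability, and the helper pr_at() are all hypothetical; the package may compute these quantities differently, for example in SQL.

```r
# Made-up scored pairs with ground-truth labels
scored <- data.frame(
  match_probability = c(0.95, 0.90, 0.70, 0.40, 0.20, 0.05),
  is_match          = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
)

# Precision and recall when pairs at or above `threshold` are called matches
pr_at <- function(threshold, prob, truth) {
  predicted <- prob >= threshold
  tp <- sum(predicted & truth)
  data.frame(
    threshold = threshold,
    precision = if (any(predicted)) tp / sum(predicted) else NA_real_,
    recall    = tp / sum(truth)
  )
}

thresholds <- c(0.1, 0.5, 0.9)
pr <- do.call(rbind, lapply(
  thresholds, pr_at,
  prob = scored$match_probability, truth = scored$is_match
))
pr
```

Raising the threshold from 0.1 to 0.9 lifts precision from 0.6 to 1 while recall falls from 1 to 2/3, the trade-off the resulting curve is meant to show.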
Adds a Dirichlet regularizing prior for one comparison's matched-class m
probabilities. Use exact for the strongest agreement level, or levels
for a complete named gamma-level distribution.
il_prior_m(model, col, exact = NULL, levels = NULL, strength, remainder = c("current"))
model |
An |
col |
Comparison column, supplied as a bare name or string. |
exact |
Probability for the strongest gamma level. |
levels |
Complete named probability vector, with names such as |
strength |
Non-negative effective sample size. |
remainder |
How to distribute non-exact probability. Currently
|
The model with prior metadata stored in model$params$priors.
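The effect of a Dirichlet prior with an effective sample size can be sketched in base R. This is a minimal illustration, not the package's internal code: it assumes the prior enters the M-step as pseudo-counts added to the expected matched-pair counts per agreement level. The helper `regularize_m()` and the level names are hypothetical.

```r
# Hypothetical helper (not part of irelink): blend expected level counts
# with a Dirichlet prior expressed as pseudo-counts.
#   counts   expected matched-pair counts per agreement level (from an E-step)
#   prior    target probability for each level, summing to 1
#   strength effective sample size of the prior (0 = no regularization)
regularize_m <- function(counts, prior, strength) {
  stopifnot(length(counts) == length(prior), strength >= 0,
            abs(sum(prior) - 1) < 1e-8)
  (counts + strength * prior) / (sum(counts) + strength)
}

# Illustrative level names; the package's own naming may differ.
counts <- c(none = 2, partial = 3, exact = 15)
prior  <- c(none = 0.05, partial = 0.15, exact = 0.80)

m_weak   <- regularize_m(counts, prior, strength = 0)     # plain proportions
m_strong <- regularize_m(counts, prior, strength = 1000)  # pulled toward prior
```

With `strength = 0` the estimate is the observed proportion. As `strength` grows, the estimate approaches the prior regardless of the data.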
Sets a target for the global match prevalence. With strength = NULL, the
target is used only as the model's starting prior. With a finite strength,
EM also uses the target as Beta pseudo-counts.
il_prior_prevalence(model, probability, strength = NULL)
model |
An |
probability |
Target match probability, strictly between 0 and 1. |
strength |
Optional non-negative effective sample size. |
The model with prior metadata stored in model$params$priors.
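The difference between a starting value and Beta pseudo-counts can be sketched as follows. This is an illustration under stated assumptions rather than the package's implementation: it assumes the prevalence update is the mean posterior match probability, with `strength * probability` pseudo-matches and `strength * (1 - probability)` pseudo-non-matches added when a strength is supplied. The helper `update_prevalence()` is hypothetical.

```r
# Hypothetical helper (not part of irelink): one prevalence update in EM.
#   match_prob posterior match probabilities of the compared pairs
#   target     target match prevalence
#   strength   NULL (target only seeds EM) or an effective sample size
update_prevalence <- function(match_prob, target, strength = NULL) {
  if (is.null(strength)) {
    return(mean(match_prob))           # data alone drive the update
  }
  a <- strength * target               # pseudo-matches
  b <- strength * (1 - target)         # pseudo-non-matches
  (sum(match_prob) + a) / (length(match_prob) + a + b)
}

post <- c(0.9, 0.8, 0.1, 0.05, 0.02)
free   <- update_prevalence(post, target = 0.01)
shrunk <- update_prevalence(post, target = 0.01, strength = 100)
```

Here `free` follows the data, while `shrunk` is pulled most of the way toward the 0.01 target because the pseudo-counts outweigh the five observed pairs.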
Inspect Model Priors
il_priors(model)
model |
An |
A tibble::tibble() of stored prior metadata.
Computes summary statistics and value-frequency distributions for selected columns of a dataset. Useful for understanding data quality before defining comparison rules. Accepts data frames, dbplyr::tbl_lazy table references, or character table names.
il_profile(.data, ..., con = NULL, top_n = NULL, bottom_n = NULL)
.data |
A data frame, dbplyr::tbl_lazy, or character table name. |
... |
Columns to profile, specified as unquoted names or as
character strings containing raw SQL expressions (e.g.,
|
con |
A DBI connection object from |
top_n |
Integer. Number of most-frequent values to return per
column. Defaults to |
bottom_n |
Integer. Number of least-frequent values to return per
column. Defaults to |
A tibble::tibble() of per-column summary statistics.
df <- data.frame(
  unique_id = 1:20,
  first_name = c(
    'John', 'Jon', 'Jane', 'Jane', 'Bob', 'Bobby', 'Alice', 'Alicia', 'Tom', 'Thomas',
    'John', 'Jon', 'Jane', 'Janet', 'Bob', 'Robert', 'Alice', 'Alison', 'Tom', 'Tomas'
  ),
  surname = c(
    'Smith', 'Smith', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Brown', 'White', 'White',
    'Smith', 'Smyth', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Browne', 'White', 'White'
  ),
  dob = c(
    '1990-01-01', '1990-01-01', '1985-06-15', '1985-06-15', '2000-12-01',
    '2000-12-01', '1975-03-22', '1975-03-22', '1988-07-04', '1988-07-04',
    '1990-01-01', '1990-01-02', '1985-06-15', '1985-06-16', '2000-12-01',
    '2000-12-02', '1975-03-22', '1975-03-23', '1988-07-04', '1988-07-05'
  ),
  city = c(
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid',
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid'
  ),
  email = c(
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]'
  )
)
con <- DBI::dbConnect(duckdb::duckdb())
# Profile two columns, keeping the five most frequent values of each
il_profile(df, first_name, surname, con = con, top_n = 5)
DBI::dbDisconnect(con, shutdown = TRUE)
Returns a transform that extracts a regex match from a string column.
Returns NA when no match is found. The result can be passed as the
transform argument to il_compare() or il_block_on(), and
composed with other transforms via il_transform(). On DuckDB and
PostgreSQL, the computation is pushed into SQL.
il_regex_extract(pattern, group = 0L)
pattern |
A regular expression. |
group |
Integer capture group to extract. Use |
An il_column_transform closure.
# Extract a 5-digit ZIP code from a freeform address string
tf <- il_regex_extract('\\d{5}')
tf(c('Apt 4, 90210', '10001-1234', 'no zip'))
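The in-R behavior described above (first match, `NA` when nothing matches) can be approximated in base R. This is a rough sketch only: the helper `regex_extract_sketch()` is hypothetical, and it assumes that `group = 0L` means the whole match and that positive values select capture groups.

```r
# Hypothetical base-R stand-in for a regex-extract transform.
# Assumption: group 0 is the full match, group k the k-th capture group.
regex_extract_sketch <- function(pattern, group = 0L) {
  function(x) {
    hits <- regmatches(x, regexec(pattern, x, perl = TRUE))
    vapply(hits, function(hit) {
      # hit[1] is the full match, hit[k + 1] is capture group k
      if (length(hit) > group) hit[[group + 1L]] else NA_character_
    }, character(1))
  }
}

zip <- regex_extract_sketch("\\d{5}")
zip_out <- zip(c("Apt 4, 90210", "10001-1234", "no zip"))

plus4 <- regex_extract_sketch("(\\d{5})-(\\d{4})", group = 2L)
plus4_out <- plus4(c("10001-1234", "90210"))
```

Returning `NA` rather than an empty string matters for linkage, because missing values are normally treated as "no information" instead of as a disagreement.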
Allows you to supply pre-computed term frequency lookup tables instead of having them computed automatically from the data. This is useful when you have production TF tables from a larger dataset or want to reuse TF values across multiple linkage runs.
il_register_tf(model, col, tf_data, overwrite = FALSE)
model |
An |
col |
Character name of the comparison column. |
tf_data |
A data frame with columns |
overwrite |
Logical. If |
The supplied data must have exactly two columns: the value column
(named the same as the comparison column) and the frequency column
(named tf_<col>).
The updated model, with the TF table registered in the database.
con <- DBI::dbConnect(duckdb::duckdb())
spec <- il_spec() |>
  il_compare(first_name, cl_exact()) |>
  il_block_on(surname)
model <- il_model(fake_20, spec = spec, con = con)
# Pre-computed term frequencies: the value column plus tf_<col>
tf <- data.frame(
  first_name = c('John', 'Jane', 'Bob', 'Alice', 'Tom'),
  tf_first_name = rep(0.2, 5)
)
model <- il_register_tf(model, 'first_name', tf)
il_cleanup(model)
DBI::dbDisconnect(con, shutdown = TRUE)
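A table in the required two-column shape can be derived from a larger reference vector with base R. This is a sketch under assumptions: `build_tf()` is a hypothetical helper, and it assumes the frequency column holds relative frequencies (proportions of non-missing values), which matches the `rep(0.2, 5)` values in the example above but is not confirmed by the documentation.

```r
# Hypothetical helper (not part of irelink): build a term-frequency table
# with the value column named after the comparison column and the
# frequency column named tf_<col>.
build_tf <- function(x, col) {
  x <- x[!is.na(x)]                       # missing values carry no frequency
  freq <- table(x)
  out <- data.frame(
    value = names(freq),
    tf = as.numeric(freq) / length(x),    # assumed: relative frequency
    stringsAsFactors = FALSE
  )
  names(out) <- c(col, paste0("tf_", col))
  out
}

reference_names <- c("John", "John", "Jane", "Bob", NA)
tf_tbl <- build_tf(reference_names, "first_name")
```

The resulting `tf_tbl` has exactly the columns `first_name` and `tf_first_name`, so it could be passed as `tf_data` for the `first_name` comparison.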
Returns a tidy tibble of false-positive rates and true-positive rates
at each match-probability threshold, for plotting an ROC curve.
Requires labeled pairs. Designed for use with ggplot2::geom_line().
il_roc(model, labels = NULL, labels_col = NULL)
model |
A trained |
labels |
A data frame of labeled pairs with a logical or integer
match indicator. Required unless |
labels_col |
Optional string naming a column in the original data containing ground-truth cluster/entity IDs. |
A tibble::tibble() with columns threshold, fpr, and tpr.
df <- data.frame(
  unique_id = 1:20,
  first_name = c(
    'John', 'Jon', 'Jane', 'Jane', 'Bob', 'Bobby', 'Alice', 'Alicia', 'Tom', 'Thomas',
    'John', 'Jon', 'Jane', 'Janet', 'Bob', 'Robert', 'Alice', 'Alison', 'Tom', 'Tomas'
  ),
  surname = c(
    'Smith', 'Smith', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Brown', 'White', 'White',
    'Smith', 'Smyth', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Browne', 'White', 'White'
  ),
  dob = c(
    '1990-01-01', '1990-01-01', '1985-06-15', '1985-06-15', '2000-12-01',
    '2000-12-01', '1975-03-22', '1975-03-22', '1988-07-04', '1988-07-04',
    '1990-01-01', '1990-01-02', '1985-06-15', '1985-06-16', '2000-12-01',
    '2000-12-02', '1975-03-22', '1975-03-23', '1988-07-04', '1988-07-05'
  ),
  city = c(
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid',
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid'
  ),
  email = c(
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]'
  )
)
con <- DBI::dbConnect(duckdb::duckdb())
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(surname, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(dob, cl_exact()) |>
  il_block_on(surname) |>
  il_block_on(first_name)
model <- il_model(df, spec = spec, con = con)
model <- il_estimate_u(model)
model <- il_estimate_em(model, block_on(surname))
# Labeled pairs: one known match and one known non-match
labels <- data.frame(
  unique_id_l = c(1L, 1L),
  unique_id_r = c(11L, 2L),
  is_match = c(1L, 0L)
)
il_roc(model, labels = labels)
DBI::dbDisconnect(con, shutdown = TRUE)
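The quantities in the returned table follow the usual ROC definitions, which the base-R sketch below computes from scratch. `roc_sketch()` is a hypothetical helper. The threshold grid and the convention that a pair is predicted a match when its probability is at or above the threshold are assumptions, so the package may differ in both.

```r
# Hypothetical helper (not part of irelink): false- and true-positive rates
# at a grid of match-probability thresholds.
roc_sketch <- function(prob, is_match, thresholds = seq(0, 1, by = 0.25)) {
  is_match <- as.logical(is_match)
  rows <- lapply(thresholds, function(t) {
    predicted <- prob >= t                  # assumed decision rule
    data.frame(
      threshold = t,
      fpr = sum(predicted & !is_match) / sum(!is_match),
      tpr = sum(predicted & is_match) / sum(is_match)
    )
  })
  do.call(rbind, rows)
}

roc <- roc_sketch(
  prob = c(0.95, 0.80, 0.60, 0.30, 0.10),
  is_match = c(1, 1, 0, 1, 0)
)
```

Each row of `roc` corresponds to one point on the curve, with `fpr` on the horizontal axis and `tpr` on the vertical axis.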
Serializes a trained il_model object to .json or .rds, with the format
chosen from the file extension of path.
il_save(model, path, overwrite = FALSE)
model |
A trained |
path |
A file path (character string) where the model will be saved. |
overwrite |
If |
.json writes Splink-style settings JSON. Other extensions write RDS. The
database connection and any in-database tables are not stored. Supply a
fresh connection with il_attach() after loading.
JSON export preserves scoring and prediction behavior by lowering comparisons and blocking rules to SQL. It does not guarantee exact round-tripping of irelink helper structure such as transform functions or structured blocking-rule fields.
model, invisibly.
df <- data.frame(
  unique_id = 1:20,
  first_name = c(
    'John', 'Jon', 'Jane', 'Jane', 'Bob', 'Bobby', 'Alice', 'Alicia', 'Tom', 'Thomas',
    'John', 'Jon', 'Jane', 'Janet', 'Bob', 'Robert', 'Alice', 'Alison', 'Tom', 'Tomas'
  ),
  surname = c(
    'Smith', 'Smith', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Brown', 'White', 'White',
    'Smith', 'Smyth', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Browne', 'White', 'White'
  ),
  dob = c(
    '1990-01-01', '1990-01-01', '1985-06-15', '1985-06-15', '2000-12-01',
    '2000-12-01', '1975-03-22', '1975-03-22', '1988-07-04', '1988-07-04',
    '1990-01-01', '1990-01-02', '1985-06-15', '1985-06-16', '2000-12-01',
    '2000-12-02', '1975-03-22', '1975-03-23', '1988-07-04', '1988-07-05'
  )
)
con <- DBI::dbConnect(duckdb::duckdb())
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(surname, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(dob, cl_exact()) |>
  il_block_on(surname) |>
  il_block_on(first_name)
model <- il_model(df, spec = spec, con = con)
model <- il_estimate_u(model)
model <- il_estimate_em(model, block_on(surname))
# The .rds extension selects RDS output
tmp <- tempfile(fileext = '.rds')
il_save(model, tmp)
DBI::dbDisconnect(con, shutdown = TRUE)
Identifies pairs of records within the same cluster that were not already scored during prediction (e.g. because they were in different blocking groups), and scores them using the model. This can reveal low-confidence links that bridge otherwise separate sub-clusters.
il_score_missing_edges(model, pairs, clusters, threshold = 0)
model |
A trained |
pairs |
An |
clusters |
A tibble from |
threshold |
Numeric match-probability threshold for returned
pairs. Defaults to |
An il_compared tibble of newly scored pairs (those not
already in pairs).
df <- data.frame(
  unique_id = c(1, 2, 3),
  first_name = c('John', 'John', 'Jon'),
  surname = c('Smith', 'Smyth', 'Smith')
)
con <- DBI::dbConnect(duckdb::duckdb())
spec <- il_spec() |>
  il_compare(first_name, cl_exact()) |>
  il_compare(surname, cl_exact()) |>
  il_block_on(surname)
model <- il_model(df, spec = spec, con = con)
model <- il_estimate_u(model)
model <- il_estimate_em(model, block_on(surname))
pairs <- predict(model, threshold = 0.01)
# Place all three records in one cluster
clusters <- tibble::tibble(
  unique_id = c('1', '2', '3'),
  cluster_id = 'cluster_1'
)
missing <- il_score_missing_edges(model, pairs, clusters)
il_cleanup(model)
DBI::dbDisconnect(con, shutdown = TRUE)
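Finding the "missing" edges amounts to enumerating every within-cluster pair and dropping those already scored. The base-R sketch below shows only that enumeration step and does not score anything. `missing_pairs()` is a hypothetical helper, and the `unique_id_l` / `unique_id_r` column names are assumed to match the pair tables used elsewhere in this manual.

```r
# Hypothetical helper (not part of irelink): within-cluster pairs that do
# not appear in the scored-pairs table.
missing_pairs <- function(clusters, scored) {
  # Order-independent key so (a, b) and (b, a) compare equal
  key <- function(l, r) paste(pmin(l, r), pmax(l, r), sep = "|")
  by_cluster <- split(clusters$unique_id, clusters$cluster_id)
  candidates <- do.call(rbind, lapply(by_cluster, function(ids) {
    if (length(ids) < 2) return(NULL)       # singletons have no pairs
    p <- t(combn(ids, 2))
    data.frame(unique_id_l = p[, 1], unique_id_r = p[, 2],
               stringsAsFactors = FALSE)
  }))
  already <- key(candidates$unique_id_l, candidates$unique_id_r) %in%
    key(scored$unique_id_l, scored$unique_id_r)
  candidates[!already, , drop = FALSE]
}

clusters_df <- data.frame(unique_id = c("1", "2", "3"),
                          cluster_id = "cluster_1",
                          stringsAsFactors = FALSE)
# Blocking on surname would only have compared records 1 and 3
scored_df <- data.frame(unique_id_l = "1", unique_id_r = "3",
                        stringsAsFactors = FALSE)
to_score <- missing_pairs(clusters_df, scored_df)
```

In this toy case the pairs (1, 2) and (2, 3) were never compared, so they are the candidates a subsequent scoring pass would evaluate.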
Scores an aggregated or row-level comparison-pattern table using a trained model. Dependency-aware models use their fitted log-linear pattern state; independent models use the fieldwise m/u parameters.
il_score_patterns(model, patterns)
model |
A trained |
patterns |
A data frame containing either comparison columns named like
the model comparisons or gamma columns named |
A tibble::tibble() containing the input columns plus match_weight,
total_match_weight, and match_probability.
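For the independent case, scoring a pattern follows the standard Fellegi-Sunter calculation: the log2 prior odds plus one log2 Bayes factor per field. The sketch below is illustrative only. The m/u values, the prevalence, the 0-based level coding, and the helper `score_pattern()` are all assumptions, and it makes no claim about how the package distinguishes its two weight columns.

```r
# Illustrative parameters: index 1 = disagree (level 0), index 2 = agree (level 1)
m <- list(first_name = c(0.05, 0.95), surname = c(0.10, 0.90))
u <- list(first_name = c(0.98, 0.02), surname = c(0.95, 0.05))
prevalence <- 0.01

# Hypothetical helper (not part of irelink): score one agreement pattern
# under conditional independence of the fields.
score_pattern <- function(gamma, m, u, prevalence) {
  log_bf <- vapply(names(gamma), function(f) {
    level <- gamma[[f]] + 1L                 # 0-based level -> R index
    log2(m[[f]][level] / u[[f]][level])
  }, numeric(1))
  weight <- log2(prevalence / (1 - prevalence)) + sum(log_bf)
  list(weight = weight, probability = 2^weight / (1 + 2^weight))
}

both_agree <- score_pattern(c(first_name = 1L, surname = 1L), m, u, prevalence)
name_only  <- score_pattern(c(first_name = 1L, surname = 0L), m, u, prevalence)
```

Because the weights add on the log2 scale, each field's contribution can be read off independently, which is what makes waterfall-style decompositions possible.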
Initializes a blank il_spec object onto which comparison layers and
blocking rules are added with il_compare() and il_block_on().
il_spec()
An il_spec object with no comparisons or blocking rules.
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_block_on(surname)
Computes multiple string-similarity metrics between two strings in a single call. No database connection is required. Useful for quick exploration of how names or other text fields compare.
il_string_similarity(a, b)
a |
A character string. |
b |
A character string. |
A single-row tibble with columns jaro_winkler, jaro,
levenshtein, jaccard, and cosine.
il_string_similarity('John', 'Jon')
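Two of these metrics are easy to reproduce by hand, which can help when sanity-checking thresholds. The sketch below uses `utils::adist()` for Levenshtein distance and a Jaccard index over character bigrams. The bigram tokenization is an assumption (the package may tokenize differently), and `bigrams()` and `jaccard_bigram()` are hypothetical helpers.

```r
# Levenshtein edit distance via base R (utils::adist)
lev <- as.integer(adist("John", "Jon"))

# Hypothetical helpers: Jaccard similarity over character bigrams.
bigrams <- function(s) {
  n <- nchar(s)
  if (n < 2) return(s)                      # too short to form a bigram
  unique(substring(s, 1:(n - 1), 2:n))
}
jaccard_bigram <- function(a, b) {
  x <- bigrams(a)
  y <- bigrams(b)
  length(intersect(x, y)) / length(union(x, y))
}

jac <- jaccard_bigram("John", "Jon")
```

"John" and "Jon" differ by a single deletion, yet share only one of four distinct bigrams, which illustrates why different metrics can rank the same pair quite differently.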
Returns a transform that extracts a fixed-width substring from a string
column. The result can be passed as the transform argument to
il_compare() or il_block_on(), and composed with other transforms
via il_transform(). On DuckDB and PostgreSQL, the computation is
pushed into SQL.
il_substr(start, length)
start |
Integer start position (1-indexed). |
length |
Integer number of characters to extract. |
An il_column_transform closure.
tf <- il_substr(1, 3)
tf(c('Johnson', 'Smith', 'Lee'))

# Use for blocking on the first 3 characters of a name
spec <- il_spec() |>
  il_block_on(last_name, .transform = il_substr(1, 3))
Enumerates single-column blocking rules and ranks them by a heuristic that balances pair reduction against field coverage. Useful for choosing initial blocking rules before training.
il_suggest_blocking( .data, columns = NULL, con = NULL, link_type = c("dedupe", "link"), max_depth = 1L )
.data |
A data frame, dbplyr::tbl_lazy, or character table name. |
columns |
Character vector of column names to evaluate. When
|
con |
A DBI connection object from |
link_type |
One of |
max_depth |
Maximum number of columns to combine in a single
blocking rule. Defaults to |
A tibble::tibble() with columns rule, n_distinct, coverage,
n_pairs, pct_of_cartesian, and score, sorted by score
descending. Higher scores indicate better blocking rules.
df <- data.frame(
  unique_id = 1:20,
  first_name = c(
    'John', 'Jon', 'Jane', 'Jane', 'Bob', 'Bobby', 'Alice', 'Alicia', 'Tom', 'Thomas',
    'John', 'Jon', 'Jane', 'Janet', 'Bob', 'Robert', 'Alice', 'Alison', 'Tom', 'Tomas'
  ),
  surname = c(
    'Smith', 'Smith', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Brown', 'White', 'White',
    'Smith', 'Smyth', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Browne', 'White', 'White'
  )
)
con <- DBI::dbConnect(duckdb::duckdb())
il_suggest_blocking(df, con = con)
DBI::dbDisconnect(con, shutdown = TRUE)
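The pair-reduction side of the ranking can be reproduced by hand for a deduplication task: a block of size k generates k(k-1)/2 candidate pairs, compared with n(n-1)/2 for the full Cartesian comparison. The sketch below computes only those counts. `blocking_stats()` is a hypothetical helper, the share is reported as a fraction, and the package's `score` heuristic is not reproduced because its formula is not documented here.

```r
# Hypothetical helper (not part of irelink): candidate-pair counts for
# single-column blocking rules in a dedupe setting.
blocking_stats <- function(df, columns) {
  n <- nrow(df)
  cartesian <- n * (n - 1) / 2
  rows <- lapply(columns, function(col) {
    sizes <- as.numeric(table(df[[col]]))   # table() drops NA values
    n_pairs <- sum(sizes * (sizes - 1) / 2)
    data.frame(rule = col, n_distinct = length(sizes), n_pairs = n_pairs,
               share_of_cartesian = n_pairs / cartesian,
               stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, rows)
  out[order(out$n_pairs), ]                 # fewest candidate pairs first
}

toy <- data.frame(
  surname = c("Smith", "Smith", "Smyth", "Doe", "Doe", "Doe"),
  city = c("London", "London", "London", "London", "Paris", "Paris")
)
stats <- blocking_stats(toy, c("surname", "city"))
```

A rule with few pairs is cheap but may miss true matches (here "Smyth" never meets "Smith"), which is why a ranking heuristic has to trade pair reduction against coverage.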
Visualizes the distribution of term frequencies for a column in the model. Shows how individual values shift the match weight via the TF adjustment. Rare values boost the weight, while common values penalize it.
il_tf_chart(model, col, n_most_freq = 10L, n_least_freq = 5L)
model |
An |
col |
A character string naming the column to plot. |
n_most_freq |
Number of most-frequent values to label. Default 10. |
n_least_freq |
Number of least-frequent values to label. Default 5. |
A ggplot2::ggplot() object.
con <- DBI::dbConnect(duckdb::duckdb())
spec <- il_spec() |>
  il_compare(first_name, cl_exact(term_frequency = TRUE))
model <- il_model(fake_20, spec = spec, con = con)
il_tf_chart(model, 'first_name')
il_cleanup(model)
DBI::dbDisconnect(con, shutdown = TRUE)
Returns a tidy tibble of parameter estimates at each EM iteration,
useful for diagnosing convergence. Designed for use with
ggplot2::geom_line() and ggplot2::facet_wrap().
il_training_history(model)
model |
A trained |
A tibble::tibble() with columns session, iteration, comparison,
gamma_level, and value.
df <- data.frame(
  unique_id = 1:20,
  first_name = c(
    'John', 'Jon', 'Jane', 'Jane', 'Bob', 'Bobby', 'Alice', 'Alicia', 'Tom', 'Thomas',
    'John', 'Jon', 'Jane', 'Janet', 'Bob', 'Robert', 'Alice', 'Alison', 'Tom', 'Tomas'
  ),
  surname = c(
    'Smith', 'Smith', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Brown', 'White', 'White',
    'Smith', 'Smyth', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Browne', 'White', 'White'
  ),
  dob = c(
    '1990-01-01', '1990-01-01', '1985-06-15', '1985-06-15', '2000-12-01',
    '2000-12-01', '1975-03-22', '1975-03-22', '1988-07-04', '1988-07-04',
    '1990-01-01', '1990-01-02', '1985-06-15', '1985-06-16', '2000-12-01',
    '2000-12-02', '1975-03-22', '1975-03-23', '1988-07-04', '1988-07-05'
  ),
  city = c(
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid',
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid'
  ),
  email = c(
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]'
  )
)
con <- DBI::dbConnect(duckdb::duckdb())
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(surname, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(dob, cl_exact()) |>
  il_block_on(surname) |>
  il_block_on(first_name)
model <- il_model(df, spec = spec, con = con)
model <- il_estimate_u(model)
model <- il_estimate_em(model, block_on(surname))
il_training_history(model)
DBI::dbDisconnect(con, shutdown = TRUE)
Creates a single transform function that applies multiple transformations
in sequence. The first function is applied first and the last function
is applied last.
The result is itself a function that can be passed as the transform
argument to il_compare() or il_block_on().
il_transform(...)
... |
Two or more functions to compose, in application order.
Each must be a recognized transform (e.g. |
On the SQL side, the transforms are nested inside-out:
il_transform(tolower, trimws) becomes TRIM(LOWER(col)).
A function of class il_transform_chain that applies all
transforms in order. The individual steps are stored in the
"transforms" attribute.
# Lower-case then trim whitespace
tf <- il_transform(tolower, trimws)
tf(' Hello ')

# Use in a specification
spec <- il_spec() |>
  il_compare(name, cl_exact(), transform = il_transform(tolower, trimws))
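The application order can be made concrete with a plain left-to-right function composition in base R. This is a sketch of the R-side behavior only. `compose_in_order()` is a hypothetical helper that carries no class or "transforms" attribute and does no SQL translation.

```r
# Hypothetical helper (not part of irelink): apply functions left to right.
compose_in_order <- function(...) {
  steps <- list(...)
  function(x) {
    # Reduce feeds the running value through each step in turn
    Reduce(function(acc, f) f(acc), steps, x)
  }
}

lower_then_trim <- compose_in_order(tolower, trimws)
result <- lower_then_trim("  Hello ")
```

Applying `tolower` first and `trimws` second is the same as evaluating `trimws(tolower(x))`, which mirrors the inside-out nesting `TRIM(LOWER(col))` on the SQL side.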
Returns a transform that attempts to parse a string column as a date.
Unlike as.Date(), failures return NA/NULL rather than raising an
error. On DuckDB this uses try_strptime(), and on PostgreSQL it uses
TO_DATE().
The result can be passed as the transform argument to il_compare()
or il_block_on(), and composed with other transforms via
il_transform().
il_try_parse_date(format = "%Y-%m-%d")
format |
A |
An il_column_transform closure.
tf <- il_try_parse_date()
tf(c('2020-01-15', 'not-a-date', '1985-06-30'))

# Non-ISO format
tf2 <- il_try_parse_date('%m/%d/%Y')
tf2(c('01/15/2020', 'bad'))
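The "fail to NA" behavior can be approximated in base R by parsing with an explicit format through `strptime()`, which yields `NA` for strings that do not match instead of raising an error. `try_parse_date_sketch()` is a hypothetical stand-in and makes no claim about the package's in-R implementation.

```r
# Hypothetical base-R stand-in for a forgiving date parser.
try_parse_date_sketch <- function(format = "%Y-%m-%d") {
  function(x) {
    # strptime() returns NA for non-matching input; as.Date() keeps the NA
    as.Date(strptime(x, format = format, tz = "UTC"))
  }
}

iso <- try_parse_date_sketch()
iso_out <- iso(c("2020-01-15", "not-a-date", "1985-06-30"))

us <- try_parse_date_sketch("%m/%d/%Y")
us_out <- us(c("01/15/2020", "bad"))
```

By contrast, calling `as.Date("not-a-date")` with no format stops with an error, which would abort a whole transform step over a single malformed value.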
Returns a transform that attempts to parse a string column as a
timestamp. Unlike as.POSIXct(), failures return NA/NULL rather
than raising an error. On DuckDB this uses try_strptime(), and on
PostgreSQL it uses TO_TIMESTAMP(). The result can be passed as the
transform argument to il_compare() or il_block_on(), and
composed with other transforms via il_transform().
il_try_parse_timestamp(format = "%Y-%m-%d %H:%M:%S")
format |
A |
An il_column_transform closure.
tf <- il_try_parse_timestamp()
tf(c('2020-01-15 08:30:00', 'not-a-timestamp', '1985-06-30 12:00:00'))

# Custom format
tf2 <- il_try_parse_timestamp('%m/%d/%Y %I:%M %p')
tf2(c('01/15/2020 08:30 AM', 'bad'))
Calculates the proportion of records that cannot be linked at each match-probability threshold. Returns a tidy tibble for plotting the "unlinkables curve". This helps show how restrictive each threshold is.
il_unlinkables(model)
model |
A trained |
A tibble::tibble() with columns threshold and pct_unlinkable.
df <- data.frame(
  unique_id = 1:20,
  first_name = c(
    'John', 'Jon', 'Jane', 'Jane', 'Bob', 'Bobby', 'Alice', 'Alicia', 'Tom', 'Thomas',
    'John', 'Jon', 'Jane', 'Janet', 'Bob', 'Robert', 'Alice', 'Alison', 'Tom', 'Tomas'
  ),
  surname = c(
    'Smith', 'Smith', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Brown', 'White', 'White',
    'Smith', 'Smyth', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Browne', 'White', 'White'
  ),
  dob = c(
    '1990-01-01', '1990-01-01', '1985-06-15', '1985-06-15', '2000-12-01',
    '2000-12-01', '1975-03-22', '1975-03-22', '1988-07-04', '1988-07-04',
    '1990-01-01', '1990-01-02', '1985-06-15', '1985-06-16', '2000-12-01',
    '2000-12-02', '1975-03-22', '1975-03-23', '1988-07-04', '1988-07-05'
  ),
  city = c(
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid',
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid'
  ),
  email = c(
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]'
  )
)
con <- DBI::dbConnect(duckdb::duckdb())
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(surname, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(dob, cl_exact()) |>
  il_block_on(surname) |>
  il_block_on(first_name)
model <- il_model(df, spec = spec, con = con)
model <- il_estimate_u(model)
model <- il_estimate_em(model, block_on(surname))
il_unlinkables(model)
DBI::dbDisconnect(con, shutdown = TRUE)
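One way to read the curve, assuming the package follows the approach of Splink's unlinkables chart: each record is compared with itself, and a record counts as unlinkable at a threshold if even that self-comparison scores below it. The sketch below starts from a vector of such self-match probabilities. `unlinkables_sketch()`, the input values, and the reporting of the result as a fraction are all assumptions.

```r
# Hypothetical helper (not part of irelink): share of records whose
# self-match probability falls below each threshold.
unlinkables_sketch <- function(self_prob, thresholds = seq(0, 1, by = 0.25)) {
  data.frame(
    threshold = thresholds,
    share_unlinkable = vapply(
      thresholds,
      function(t) mean(self_prob < t),
      numeric(1)
    )
  )
}

# Illustrative values: sparse records score low even against themselves
self_prob <- c(0.99, 0.97, 0.80, 0.40)
curve <- unlinkables_sketch(self_prob)
```

Records with many missing fields tend to have low self-match scores, so a steep curve suggests that a high threshold will discard a large share of the data.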
Returns a tidy tibble showing how each comparison contributed to the
total match weight for a specific record pair. Designed for use with
ggplot2::geom_col() and ggplot2::coord_flip().
il_waterfall(pairs, which = 1L)
pairs |
An |
which |
An integer index identifying which row (pair) to
decompose. Defaults to 1L. |
A tibble::tibble() with columns step, order, contribution,
direction, start, and end. The rows include the prior odds,
one row per comparison contribution, and a final total.
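The layout described above can be sketched in base R. This is a minimal illustration, not the package's implementation: the function name `waterfall_sketch`, the input values, the `direction` labels, and the choice to start the total bar at zero are all assumptions made for the example. Only the column names come from the description above.

```r
# Hypothetical inputs: a prior match probability and per-comparison match weights.
prior <- 0.01
contributions <- c(first_name = 4.2, surname = 3.1, dob = -1.5)

# waterfall_sketch() is an illustrative helper, not part of irelink.
waterfall_sketch <- function(prior, contributions) {
  prior_weight <- log2(prior / (1 - prior))  # prior odds on the log2 scale
  steps <- c(prior = prior_weight, contributions)
  end <- cumsum(steps)                       # running total after each step
  start <- c(0, head(end, -1))               # each bar starts where the previous one ended
  out <- data.frame(
    step = c(names(steps), "total"),
    order = seq_len(length(steps) + 1),
    contribution = c(unname(steps), sum(steps)),
    direction = NA_character_,
    start = c(unname(start), 0),             # assumed: the total bar starts at zero
    end = c(unname(end), sum(steps)),
    stringsAsFactors = FALSE
  )
  # Assumed labels; the real values of `direction` are not documented here.
  out$direction <- ifelse(out$contribution >= 0, "positive", "negative")
  out$direction[nrow(out)] <- "total"
  out
}

wf <- waterfall_sketch(prior, contributions)
print(wf)
```

A table of this shape can be drawn with one bar per row running from `start` to `end`, which is why the running totals are stored alongside each contribution.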
df <- data.frame(
  unique_id = 1:20,
  first_name = c(
    'John', 'Jon', 'Jane', 'Jane', 'Bob', 'Bobby', 'Alice', 'Alicia', 'Tom', 'Thomas',
    'John', 'Jon', 'Jane', 'Janet', 'Bob', 'Robert', 'Alice', 'Alison', 'Tom', 'Tomas'
  ),
  surname = c(
    'Smith', 'Smith', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Brown', 'White', 'White',
    'Smith', 'Smyth', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Browne', 'White', 'White'
  ),
  dob = c(
    '1990-01-01', '1990-01-01', '1985-06-15', '1985-06-15', '2000-12-01',
    '2000-12-01', '1975-03-22', '1975-03-22', '1988-07-04', '1988-07-04',
    '1990-01-01', '1990-01-02', '1985-06-15', '1985-06-16', '2000-12-01',
    '2000-12-02', '1975-03-22', '1975-03-23', '1988-07-04', '1988-07-05'
  ),
  city = c(
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid',
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid'
  ),
  email = c(
    '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]'
  )
)
con <- DBI::dbConnect(duckdb::duckdb())
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(surname, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(dob, cl_exact()) |>
  il_block_on(surname) |>
  il_block_on(first_name)
model <- il_model(df, spec = spec, con = con)
model <- il_estimate_u(model)
model <- il_estimate_em(model, block_on(surname))
pairs <- predict(model, threshold = 0.5)
il_waterfall(pairs, which = 1)
DBI::dbDisconnect(con, shutdown = TRUE)
Returns a tidy tibble of comparison levels with their m probabilities,
u probabilities, and log2 Bayes factors (match weights). Designed for
use with ggplot2::geom_col() and ggplot2::facet_wrap().
il_weights(model)
model |
A trained |
A tibble::tibble() with columns comparison, gamma_level, m_prob,
u_prob, and weight.
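As a rough illustration of how the `weight` column relates to the other two, a match weight is the log2 Bayes factor of a comparison level, that is `log2(m_prob / u_prob)`. The sketch below uses made-up probabilities and an assumed level numbering (higher `gamma_level` meaning closer agreement); it does not call the package.

```r
# Hypothetical m and u probabilities for three levels of one comparison.
m_prob <- c(0.90, 0.08, 0.02)  # P(level | records match)
u_prob <- c(0.01, 0.04, 0.95)  # P(level | records do not match)

weights_sketch <- data.frame(
  comparison = "first_name",
  gamma_level = c(2L, 1L, 0L),       # assumed ordering: 2 = closest agreement
  m_prob = m_prob,
  u_prob = u_prob,
  weight = log2(m_prob / u_prob)     # log2 Bayes factor (match weight)
)
print(weights_sketch)
```

Levels that are more common among matches than non-matches get a positive weight, and levels that are more common among non-matches get a negative one.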
df <- data.frame(
  unique_id = 1:20,
  first_name = c(
    'John', 'Jon', 'Jane', 'Jane', 'Bob', 'Bobby', 'Alice', 'Alicia', 'Tom', 'Thomas',
    'John', 'Jon', 'Jane', 'Janet', 'Bob', 'Robert', 'Alice', 'Alison', 'Tom', 'Tomas'
  ),
  surname = c(
    'Smith', 'Smith', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Brown', 'White', 'White',
    'Smith', 'Smyth', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Browne', 'White', 'White'
  ),
  dob = c(
    '1990-01-01', '1990-01-01', '1985-06-15', '1985-06-15', '2000-12-01',
    '2000-12-01', '1975-03-22', '1975-03-22', '1988-07-04', '1988-07-04',
    '1990-01-01', '1990-01-02', '1985-06-15', '1985-06-16', '2000-12-01',
    '2000-12-02', '1975-03-22', '1975-03-23', '1988-07-04', '1988-07-05'
  ),
  city = c(
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid',
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid'
  ),
  email = c(
    '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]'
  )
)
con <- DBI::dbConnect(duckdb::duckdb())
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(surname, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(dob, cl_exact()) |>
  il_block_on(surname) |>
  il_block_on(first_name)
model <- il_model(df, spec = spec, con = con)
model <- il_estimate_u(model)
model <- il_estimate_em(model, block_on(surname))
il_weights(model)
DBI::dbDisconnect(con, shutdown = TRUE)
Returns TRUE if x inherits from class il_model.
is_il_model(x)
x |
An object to test. |
A single logical value.
is_il_model(il_spec())
Returns TRUE if x inherits from class il_spec.
is_il_spec(x)
x |
An object to test. |
A single logical value.
is_il_spec(il_spec())
A tagged-value constructor that marks a numeric threshold as a distance
in kilometres. Use inside cl_geo_distance() for self-documenting
thresholds.
km(n)
n |
A non-negative numeric value. |
A tagged numeric with class il_km.
il_spec() |>
  il_compare(c(lat, lon), cl_geo_distance(km(5), km(50)))
Given a model and the name of a column containing cluster or entity IDs,
generates pairwise labels for all predicted pairs. Two records that
share the same value in labels_col are labeled as matches.
labels_from_column(model, labels_col, threshold = 0)
model |
A trained |
labels_col |
A string naming the column in the original data that contains the ground-truth cluster or entity identifier. |
threshold |
Match-probability threshold for selecting predicted
pairs. Defaults to 0. |
This is a convenience wrapper: instead of manually building a labels
data frame with unique_id_l, unique_id_r, and is_match, you
supply the column name and let irelink derive everything.
A data frame with columns unique_id_l, unique_id_r, and
is_match (integer 0/1).
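The labeling rule can be sketched in base R without the package. The helper `labels_sketch` and the toy `records` and `pairs` data frames below are hypothetical; the sketch only shows the idea of looking up each side's cluster value and comparing them, and makes no claim about how irelink performs the lookup.

```r
# Hypothetical source records with a ground-truth cluster column.
records <- data.frame(
  unique_id = 1:4,
  cluster = c("a", "a", "b", "c"),
  stringsAsFactors = FALSE
)

# Hypothetical candidate pairs, e.g. as returned by a scoring step.
pairs <- data.frame(
  unique_id_l = c(1L, 1L, 3L),
  unique_id_r = c(2L, 3L, 4L)
)

# labels_sketch() is an illustrative helper, not part of irelink.
labels_sketch <- function(pairs, records, labels_col) {
  # Look up the cluster value for the left and right record of each pair.
  left <- records[[labels_col]][match(pairs$unique_id_l, records$unique_id)]
  right <- records[[labels_col]][match(pairs$unique_id_r, records$unique_id)]
  data.frame(
    unique_id_l = pairs$unique_id_l,
    unique_id_r = pairs$unique_id_r,
    # A pair is a match only when both cluster values are known and equal.
    is_match = as.integer(!is.na(left) & !is.na(right) & left == right)
  )
}

labels <- labels_sketch(pairs, records, "cluster")
print(labels)
```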
con <- DBI::dbConnect(duckdb::duckdb())
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_block_on(surname)
model <- il_model(fake_1000, spec = spec, con = con)
model <- il_estimate_u(model)
model <- il_estimate_em(model, block_on(surname))
labels_from_column(model, 'cluster')
DBI::dbDisconnect(con, shutdown = TRUE)
A tagged-value constructor that marks a numeric threshold as a distance
in miles. Converted to kilometres internally by cl_geo_distance().
mi(n)
n |
A non-negative numeric value. |
A tagged numeric with class il_mi.
il_spec() |>
  il_compare(c(lat, lon), cl_geo_distance(mi(3), mi(30)))
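The tagged-value pattern behind km() and mi() can be sketched with plain S3 classes. Everything in this block is hypothetical: the names `tag_km`, `tag_mi`, and `as_km`, and the classes `sketch_km` and `sketch_mi`, are stand-ins chosen so the example does not depend on the package. The only fixed fact is the conversion factor, 1 mile = 1.609344 km.

```r
# Illustrative constructors: attach a unit class to a non-negative number.
tag_km <- function(n) {
  stopifnot(is.numeric(n), all(n >= 0))
  structure(n, class = "sketch_km")
}
tag_mi <- function(n) {
  stopifnot(is.numeric(n), all(n >= 0))
  structure(n, class = "sketch_mi")
}

# Illustrative generic that normalizes any tagged distance to kilometres.
as_km <- function(x) UseMethod("as_km")
as_km.sketch_km <- function(x) unclass(x)
as_km.sketch_mi <- function(x) unclass(x) * 1.609344  # exact definition of the mile
as_km.default <- function(x) stop("threshold must be tagged with a distance unit")

as_km(tag_km(5))
as_km(tag_mi(3))
```

Dispatching on the tag lets a consumer accept either unit and reject bare numbers, so the unit is stated at the call site rather than implied.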
A tagged-value constructor that marks a numeric threshold as a number
of minutes. Use inside cl_time_diff() for self-documenting
thresholds.
minutes(n)
n |
A non-negative numeric value. |
A tagged numeric with class il_minutes.
il_spec() |>
  il_compare(timestamp, cl_time_diff(minutes(5), minutes(60)))
A tagged-value constructor that marks a numeric threshold as a number
of months. Use inside cl_date_diff() for self-documenting
thresholds.
months(n)
n |
A non-negative numeric value. |
A tagged numeric with class il_months.
il_spec() |>
  il_compare(dob, cl_date_diff(months(1), months(12)))
Phonetic algorithms for blocking and comparison transforms. These functions compute phonetic encodings that group similar-sounding names together. They are useful for blocking rules that tolerate spelling variation.
il_soundex(x)
il_metaphone(x)
il_dmetaphone(x)
x |
A character vector to encode. |
When used as a transform argument in il_block_on(), block_on(),
or il_compare(), the computation is pushed into SQL so data is
never materialized into R.
| Function | DuckDB | PostgreSQL | SQLite |
|---|---|---|---|
| il_soundex | ✓ (macro) | ✓ (native) | comparisons only (R-side) |
| il_metaphone | ✗ | ✓ (native) | ✗ |
| il_dmetaphone | ✗ | ✓ (native) | ✗ |
SQLite does not expose a way to register scalar R functions as SQL UDFs, so phonetic transforms cannot be used in blocking rules on SQLite. They continue to work in comparisons on SQLite via the R-side gamma computation path.
A character vector of phonetic codes (same length as x).
il_soundex(c('Smith', 'Smyth'))
il_soundex(c('Robert', 'Rupert'))
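To show why these name pairs block together, here is a minimal base R sketch of the classic American Soundex rules. The function `soundex_sketch` is hypothetical and is not the package's il_soundex(); in particular, it is an assumption that the DuckDB macro and the PostgreSQL native function agree with it on every edge case.

```r
# soundex_sketch() is an illustrative helper, not part of irelink.
soundex_sketch <- function(x) {
  # Consonant groups of American Soundex; vowels, y, h, and w have no code.
  codes <- c(
    b = "1", f = "1", p = "1", v = "1",
    c = "2", g = "2", j = "2", k = "2", q = "2", s = "2", x = "2", z = "2",
    d = "3", t = "3",
    l = "4",
    m = "5", n = "5",
    r = "6"
  )
  vapply(x, function(name) {
    chars <- strsplit(gsub("[^a-z]", "", tolower(name)), "")[[1]]
    if (length(chars) == 0) return(NA_character_)
    out <- toupper(chars[1])               # the first letter is kept as-is
    prev <- unname(codes[chars[1]])        # code of the previous coded letter
    if (is.na(prev)) prev <- ""
    for (ch in chars[-1]) {
      if (ch %in% c("h", "w")) next        # h and w do not separate equal codes
      code <- unname(codes[ch])
      if (is.na(code)) {                   # vowels and y reset the previous code
        prev <- ""
        next
      }
      if (code != prev) out <- paste0(out, code)
      prev <- code
    }
    substr(paste0(out, "000"), 1, 4)       # pad with zeros, keep four characters
  }, character(1), USE.NAMES = FALSE)
}

soundex_sketch(c("Smith", "Smyth"))
soundex_sketch(c("Robert", "Rupert"))
```

Both spellings in each pair reduce to the same four-character code, which is what lets a blocking rule on the encoded value tolerate the spelling difference.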
Generates and scores all candidate record pairs that pass the blocking
rules, returning those above the match-probability threshold. This is
an S3 method for stats::predict().
## S3 method for class 'il_model'
predict(
  object,
  threshold = 0.85,
  threshold_match_weight = NULL,
  type = c("pairs", "weights"),
  collect = TRUE,
  include_fields = FALSE,
  greedy = FALSE,
  profile_sql = FALSE,
  ...
)
object |
A trained |
threshold |
A numeric value between 0 and 1. Only pairs with a
match probability at or above this threshold are returned. Defaults
to 0.85. |
threshold_match_weight |
Optional numeric value. When set, pairs
are filtered on evidence-only match weight (log2 Bayes factor) instead
of probability. Typical values range from about -5 to +30. Overrides
|
type |
One of |
collect |
If |
include_fields |
If |
greedy |
If |
profile_sql |
Logical. If |
... |
Additional arguments passed to the generic. |
When collect = TRUE: an il_compared tibble with one row
per candidate pair, including columns for record IDs, match weight,
total match weight, match probability, and per-comparison gamma values.
match_weight is the evidence-only log2 Bayes factor. The additive
prior term is exposed separately through total_match_weight, whose
value is match_weight + log2(prior / (1 - prior)).
When collect = FALSE: an il_compared_lazy object referencing the
scored pairs table in the database.
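The relationship between the two weight columns and the match probability can be worked through numerically. The sketch below uses made-up weights and a made-up prior. The formula for `total_match_weight` is the one stated above; the conversion to a probability is the standard odds-to-probability step of the Fellegi-Sunter model, and it is an assumption that irelink computes it in exactly this form. The helper `to_probability` is hypothetical.

```r
# Hypothetical evidence-only match weights for three pairs, and a hypothetical prior.
match_weight <- c(14.0, 3.0, -4.0)
prior <- 0.001

# Additive prior term, as described above.
total_match_weight <- match_weight + log2(prior / (1 - prior))

# Illustrative helper: convert a log2 posterior odds value to a probability.
to_probability <- function(w) 2^w / (1 + 2^w)
match_probability <- to_probability(total_match_weight)

# Filtering on probability versus filtering on evidence-only weight.
keep_by_probability <- match_probability >= 0.85
keep_by_weight <- match_weight >= 10

data.frame(match_weight, total_match_weight, match_probability,
           keep_by_probability, keep_by_weight)
```

Because the prior term is the same for every pair, a cut-off on `match_weight` corresponds to a cut-off on probability for a fixed prior, while staying independent of how that prior was estimated.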
df <- data.frame(
  unique_id = 1:20,
  first_name = c(
    'John', 'Jon', 'Jane', 'Jane', 'Bob', 'Bobby', 'Alice', 'Alicia', 'Tom', 'Thomas',
    'John', 'Jon', 'Jane', 'Janet', 'Bob', 'Robert', 'Alice', 'Alison', 'Tom', 'Tomas'
  ),
  surname = c(
    'Smith', 'Smith', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Brown', 'White', 'White',
    'Smith', 'Smyth', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Browne', 'White', 'White'
  ),
  dob = c(
    '1990-01-01', '1990-01-01', '1985-06-15', '1985-06-15', '2000-12-01',
    '2000-12-01', '1975-03-22', '1975-03-22', '1988-07-04', '1988-07-04',
    '1990-01-01', '1990-01-02', '1985-06-15', '1985-06-16', '2000-12-01',
    '2000-12-02', '1975-03-22', '1975-03-23', '1988-07-04', '1988-07-05'
  ),
  city = c(
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid',
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid'
  ),
  email = c(
    '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]'
  )
)
con <- DBI::dbConnect(duckdb::duckdb())
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(surname, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(dob, cl_exact()) |>
  il_block_on(surname) |>
  il_block_on(first_name)
model <- il_model(df, spec = spec, con = con)
model <- il_estimate_u(model)
model <- il_estimate_em(model, block_on(surname))
pairs <- predict(model, threshold = 0.5)
DBI::dbDisconnect(con, shutdown = TRUE)
Displays a human-readable summary of the model's type, data, training status, comparisons, and blocking rules.
## S3 method for class 'il_model'
print(x, ...)
x |
An |
... |
Additional arguments passed to |
x, invisibly.
df <- data.frame(
  unique_id = 1:20,
  first_name = c(
    'John', 'Jon', 'Jane', 'Jane', 'Bob', 'Bobby', 'Alice', 'Alicia', 'Tom', 'Thomas',
    'John', 'Jon', 'Jane', 'Janet', 'Bob', 'Robert', 'Alice', 'Alison', 'Tom', 'Tomas'
  ),
  surname = c(
    'Smith', 'Smith', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Brown', 'White', 'White',
    'Smith', 'Smyth', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Browne', 'White', 'White'
  ),
  dob = c(
    '1990-01-01', '1990-01-01', '1985-06-15', '1985-06-15', '2000-12-01',
    '2000-12-01', '1975-03-22', '1975-03-22', '1988-07-04', '1988-07-04',
    '1990-01-01', '1990-01-02', '1985-06-15', '1985-06-16', '2000-12-01',
    '2000-12-02', '1975-03-22', '1975-03-23', '1988-07-04', '1988-07-05'
  ),
  city = c(
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid',
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid'
  ),
  email = c(
    '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]'
  )
)
con <- DBI::dbConnect(duckdb::duckdb())
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_block_on(surname)
model <- il_model(df, spec = spec, con = con)
print(model)
DBI::dbDisconnect(con, shutdown = TRUE)
Displays a human-readable summary of the comparisons and blocking rules
stored in an il_spec object.
## S3 method for class 'il_spec'
print(x, ...)
x |
An |
... |
Additional arguments passed to |
x, invisibly.
spec <- il_spec() |>
  il_compare(first_name, cl_exact())
print(spec)
A tagged-value constructor that marks a numeric threshold as a number
of seconds. Use inside cl_time_diff() for self-documenting
thresholds.
seconds(n)
n |
A non-negative numeric value. |
A tagged numeric with class il_seconds.
il_spec() |>
  il_compare(timestamp, cl_time_diff(seconds(30), seconds(300)))
Prints a detailed table of trained parameters: the m and u probabilities and match weight for each comparison level, together with the prior match probability.
## S3 method for class 'il_model'
summary(object, ...)
object |
An |
... |
Additional arguments passed to |
A summary object, invisibly.
df <- data.frame(
  unique_id = 1:20,
  first_name = c(
    'John', 'Jon', 'Jane', 'Jane', 'Bob', 'Bobby', 'Alice', 'Alicia', 'Tom', 'Thomas',
    'John', 'Jon', 'Jane', 'Janet', 'Bob', 'Robert', 'Alice', 'Alison', 'Tom', 'Tomas'
  ),
  surname = c(
    'Smith', 'Smith', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Brown', 'White', 'White',
    'Smith', 'Smyth', 'Doe', 'Doe', 'Jones', 'Jones', 'Brown', 'Browne', 'White', 'White'
  ),
  dob = c(
    '1990-01-01', '1990-01-01', '1985-06-15', '1985-06-15', '2000-12-01',
    '2000-12-01', '1975-03-22', '1975-03-22', '1988-07-04', '1988-07-04',
    '1990-01-01', '1990-01-02', '1985-06-15', '1985-06-16', '2000-12-01',
    '2000-12-02', '1975-03-22', '1975-03-23', '1988-07-04', '1988-07-05'
  ),
  city = c(
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid',
    'London', 'London', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid'
  ),
  email = c(
    '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]',
    '[email protected]', '[email protected]', '[email protected]', '[email protected]'
  )
)
con <- DBI::dbConnect(duckdb::duckdb())
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_block_on(surname)
model <- il_model(df, spec = spec, con = con)
summary(model)
DBI::dbDisconnect(con, shutdown = TRUE)
A tagged-value constructor that marks a numeric threshold as a number
of years. Use inside cl_date_diff() for self-documenting thresholds.
years(n)
n |
A non-negative numeric value. |
A tagged numeric with class il_years.
il_spec() |>
  il_compare(dob, cl_date_diff(years(1)))