Changes in version 0.0.1

Initial development release, translating Python's Splink probabilistic record linkage engine into idiomatic R.

Core pipeline

- il_spec(), il_compare(), and il_block_on() define the linkage model declaratively: which fields to compare, how to compare them, and which blocking rules to apply.
- il_model() binds a spec to one or two datasets and a DBI connection for dedupe, link, or link_and_dedupe, and accepts in-memory data frames, dbplyr::tbl_lazy references, or existing table-name strings.
- predict() scores candidate pairs and returns those above a match-probability threshold (or an evidence-only match-weight threshold via threshold_match_weight).
- il_cluster() resolves scored pairs into entity clusters via connected components (using igraph) or single-best-link (with source_dataset for cross-source filtering).

Comparison library

- String similarity: cl_exact(), cl_levenshtein(), cl_damerau_levenshtein(), cl_jaro(), cl_jaro_winkler(), cl_jaccard(), cl_cosine().
- Numeric and distance: cl_numeric_diff(), cl_pct_diff(), cl_geo_distance().
- Temporal: cl_date_diff() for date proximity with days(), months(), and years() helpers, and cl_time_diff() for sub-day precision with seconds(), minutes(), and hours() helpers.
- Collections: cl_array_intersect(), cl_array_subset(), cl_array_min_distance().
- Domain-specific: cl_name(), cl_first_last_name() / cl_forename_surname() (both accept a companion column for the surname and handle first/last swap detection), cl_dob(), cl_email(), cl_domain(), cl_soundex(), cl_zip_code() (exact, 5-digit ZIP+4, and 3-digit Sectional Center Facility prefix levels), cl_postcode().
- Composition: cl_levels(), cl_and(), cl_or(), cl_not(), cl_null(), cl_else(), cl_literal(), cl_custom(), cl_columns_reversed().
- All applicable comparators accept term_frequency = TRUE for Fellegi-Sunter term-frequency adjustments.

Column transforms

- il_transform() composes multiple R functions into a chainable transform with SQL-side nesting (e.g. TRIM(LOWER(col))).
- Column transform factories for SQL-side column expressions: il_substr(), il_regex_extract(), il_nullif(), il_cast_to_string(), il_try_parse_date(), il_array_element().
- Built-in transforms auto-translated to SQL: tolower, toupper, trimws.
- Phonetic transforms: il_soundex(), il_metaphone(), il_dmetaphone() (usable as R functions and SQL macros).

Blocking

- il_block_on() and block_on() for equality-based and custom SQL blocking rules, with per-column transform support via formula syntax (col ~ transform, e.g. first_name ~ il_substr(1, 3)) or a named-list .transform for programmatic construction.
- .explode parameter for array-valued blocking columns (generates UNNEST subqueries for DuckDB/PostgreSQL).
- il_count_pairs() estimates candidate-pair counts, including cumulative totals and percent-of-cartesian summaries across rule combinations.
- il_suggest_blocking() ranks candidate blocking rules by pair-reduction, coverage, and balanced score.
- il_find_blocking_below() finds blocking rule combinations below a pair count ceiling.
- block_from_labels() measures per-column recall from labeled pairs.
- il_largest_blocks() identifies the blocking keys that generate the most records and pairs, respecting blocking transforms.

Training

- il_estimate_u() estimates non-match probabilities by sampling random pairs, with optional chunked estimation through chunk_size and early stopping through min_count_per_level.
- il_estimate_em() runs the Fellegi-Sunter EM algorithm with configurable max_iterations, convergence, fix_u, fix_m, fix_prior, derive_prior, estimate_without_tf, and estimator_mode parameters.
- estimator_mode = "dependency-aware" fits log-linear matched and unmatched comparison-pattern distributions over aggregated gamma counts, preserving missing comparison states as explicit pattern levels.
- il_estimate_prior() sets the prior match probability from deterministic matching rules, counting unique blocked pairs across overlapping rules.
- il_prior_prevalence() and il_prior_m() add regularizing custom priors for EM, il_constrain_m() adds explicit fixed matched-class constraints, and il_priors() / il_constraints() expose the stored metadata.
- il_estimate_m_from_labels() and il_estimate_m_from_column() initialize parameters from ground-truth labels.

Prediction

- predict() supports both threshold (match probability) and threshold_match_weight (evidence-only log2 Bayes factor) filtering.
- Prediction output includes evidence-only match_weight, prior-inclusive total_match_weight, and posterior match_probability.
- predict(type = "weights") returns match weights on the log2 Bayes-factor scale, and greedy = TRUE adds deterministic one-to-one post-processing for link models.
- include_fields = TRUE joins all source columns into the scored output.
- collect = FALSE returns an il_compared_lazy object backed by a model-scoped in-database table.
- il_score_missing_edges() enumerates and scores unscored within-cluster pairs.
- il_score_patterns() scores compatible comparison-pattern tables, including dependency-aware pattern tables larger than the table used for fitting.
- il_deterministic_link() performs single-table exact-match deduplication without training.
- il_find_matches() scores a set of probe records against existing data.
- profile_sql = TRUE on predict() attaches lightweight SQL timing metadata to collected predictions or lazy prediction objects.

Diagnostics and evaluation

- il_parameters() and il_weights() expose the learned m/u parameters.
- il_waterfall() decomposes a pair's match weight into per-comparison contributions.
- il_training_history() tracks parameter convergence across EM iterations.
- il_completeness() and il_profile() summarize data quality, and il_profile() accepts raw SQL expressions as column definitions (e.g., "city || left(first_name, 1)").
- il_unlinkables() identifies records that cannot be linked under any blocking rule.
- il_accuracy(), il_precision_recall(), and il_roc() evaluate performance against labeled data.
- il_errors() surfaces false positives and false negatives.
- il_graph_metrics() computes node degree, node centrality, cluster density, cluster centralization, and bridge detection.
- il_comparison_vectors() returns the gamma pattern distribution from a trained model.

Data exploration

- il_compare_records() scores one explicit record pair against a spec without fitting a full model, and il_string_similarity() computes five string similarity metrics for a single pair.
- il_comparator_score() computes batch string similarity across a data frame with SQL-side scoring on DuckDB/PostgreSQL.
- il_comparator_threshold_chart() visualizes match rates at multiple similarity thresholds.
- il_phonetic_chart() produces a Soundex agreement heatmap.
- il_tf_chart() visualizes model-specific term frequency distributions with labeled most/least common values.
- il_register_tf() registers pre-computed term frequency tables in the database and returns the updated model.

Visualization

- autoplot() methods for il_model, il_compared, il_training_history, il_accuracy, il_roc, il_precision_recall, il_unlinkables, il_completeness, il_count_pairs, il_profile, il_string_similarity, il_comparator_score, and il_comparison_vectors.
- All chart types are composable with standard ggplot2 layers.

Datasets

- fake_1000: 1,000 records (250 entities) for deduplication.
- fake_1000_labels: 3,176 pairwise labels for evaluation.
- fake_20: minimal 20-record example.
- febrl4a / febrl4b: 5,000-record cross-table linkage benchmark from FEBRL.

SQL backends and persistence

- All computation runs inside a DBI-compatible database: DuckDB (recommended), SQLite, or PostgreSQL.
- Database-backed workflows support zero-copy registration from dbplyr::tbl_lazy references and existing table names, in addition to in-memory data frames.
- il_save() and il_load() support both RDS files and Splink settings JSON.
- il_attach() reattaches a saved model to different data or connections.
- il_cleanup() removes temporary tables owned by a single model, making it safe for shared DBI connections with multiple live models.
- il_cleanup_all() removes all package-owned temporary tables from a connection for exploratory sessions and failed runs.

Performance

- Gamma computation is pushed into DuckDB using native C++ string similarity functions.
- SQLite is retained as a fallback with R-side gamma computation via stringdist.
- DuckDB and PostgreSQL use SQL-native connected components, with an igraph fallback for SQLite.
- Term-frequency, lazy prediction, and scratch tables use generated model-scoped names to avoid collisions on shared connections.
- profile_sql = TRUE on il_estimate_u(), il_estimate_prior(), and predict() records lightweight SQL timing metadata for performance investigation.
- End-to-end benchmarks against an R-side SQLite baseline: 1,000 records in 1.4 s (2.1× faster), 5,000 records in 19.5 s (1.6×), 10,000 records in 61.4 s (2.6×). The speedup varies with dataset size and is largest at 10,000 records.
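
Worked example: prediction columns

The relationship between the three prediction columns listed under Prediction (match_weight, total_match_weight, match_probability) can be sketched in base R. This is a minimal illustration of the standard Fellegi-Sunter arithmetic under conditional independence, not the package's implementation. The m, u, and prior values are made up for illustration, and the variable names only mirror the output column names.

```r
# Illustrative m and u probabilities for the agreement levels observed on
# one candidate pair (made-up values, not package output).
m <- c(first_name = 0.90, surname = 0.85, dob = 0.95)
u <- c(first_name = 0.01, surname = 0.02, dob = 0.001)

# Assumed prior probability that two randomly chosen records match.
prior <- 0.001

# Per-comparison evidence on the log2 Bayes-factor scale.
partial_weights <- log2(m / u)

# Evidence-only weight: the sum of the per-comparison contributions.
match_weight <- sum(partial_weights)

# Prior-inclusive weight: add the prior odds on the same log2 scale.
total_match_weight <- log2(prior / (1 - prior)) + match_weight

# Posterior match probability recovered from the posterior odds.
posterior_odds <- 2^total_match_weight
match_probability <- posterior_odds / (1 + posterior_odds)

print(round(partial_weights, 2))
print(c(
  match_weight = match_weight,
  total_match_weight = total_match_weight,
  match_probability = match_probability
))
```

Because the weights are additive on the log2 scale, a per-comparison breakdown of the kind il_waterfall() reports sums to the evidence-only weight, and filtering on threshold_match_weight ignores the prior while filtering on threshold does not.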