---
title: "Getting Started with BCP47"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{BCP47}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(BCP47)
```

## What is BCP 47?

BCP 47 (Best Current Practice 47) is the IETF standard that defines how human languages are identified in internet protocols. It is specified by two RFCs:

- **RFC 5646**: the syntax for language tags (e.g., `en-US`, `zh-Hans-CN`)
- **RFC 4647**: the rules for matching language tags to available resources

A BCP 47 tag is a sequence of subtags separated by hyphens:

```
language [-script] [-region] [-variant]* [-extension]* [-privateuse]
```

For example:

| Tag | Meaning |
|---|---|
| `en` | English |
| `en-US` | English as used in the United States |
| `zh-Hans-CN` | Chinese, Simplified script, as used in China |
| `sr-Latn` | Serbian written in the Latin script |
| `de-1901` | German, traditional orthography (1901 variant) |
| `x-myapp` | Entirely private-use tag |

The canonical source of valid subtags is the [IANA Language Subtag Registry](https://www.iana.org/assignments/language-subtag-registry/).

## Parsing

`bcp_parse()` decomposes a tag into its named components. All subtags are returned in lower-case. Both hyphens (`-`) and underscores (`_`) are accepted as separators.
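
To make the separator and lower-casing rules concrete, here is a minimal base R sketch. It is not the package's parser: `split_subtags()` and `subtag_shape()` are hypothetical helpers, and the shape test looks only at length and character class, so it ignores position-dependent cases such as the subtags that follow an extension singleton.

```{r subtag-split-sketch}
# Hypothetical helpers for illustration; not part of the BCP47 package.

# Lower-case the tag and split it on either accepted separator.
split_subtags <- function(tag) {
  strsplit(tolower(tag), "[-_]")[[1]]
}

# Guess the kind of a non-initial subtag from its shape alone.
subtag_shape <- function(subtag) {
  if (grepl("^[a-z]{4}$", subtag)) {
    "script"   # four letters
  } else if (grepl("^([a-z]{2}|[0-9]{3})$", subtag)) {
    "region"   # two letters or three digits
  } else {
    "other"    # variants, extensions, private use, ...
  }
}

parts <- split_subtags("zh_Hans_CN")
parts
vapply(parts[-1], subtag_shape, character(1))
```
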
```{r parse-basic}
bcp_parse("en-US")
```

```{r parse-complex}
bcp_parse("zh-Hans-CN")
```

The returned list always has the same structure:

```{r parse-fields}
tag <- bcp_parse("sr-Latn-RS-rozaj-x-custom")
names(tag)
```

- **`language`**: primary language subtag (e.g., `"sr"`)
- **`extlang`**: extended language subtags (`NULL` if absent)
- **`script`**: four-letter script subtag (`NA` if absent)
- **`region`**: two-letter or three-digit region subtag (`NA` if absent)
- **`variants`**: variant subtags (`NULL` if absent)
- **`extensions`**: named list of extension sequences
- **`private`**: private-use subtags (`NULL` if absent)

```{r parse-private}
# Pure private-use tag
bcp_parse("x-myapp-v2")
```

```{r parse-variant}
# Extension subtags (the 'u' extension; this tag has no variant)
bcp_parse("en-US-u-ca-gregory")
```

## Language Matching

`bcp_match_language()` implements the RFC 4647 "Lookup" scheme. Given an ordered list of language preferences and a set of available tags, it returns the best match.

Matching works by progressively stripping the rightmost subtag from each preference until a match is found:

```
en-US       → en              (strip region)
zh-Hans-CN  → zh-Hans → zh    (strip region, then script)
```

```{r match-basic}
# User prefers en-US; only 'en' is available, so it falls back to the base language
bcp_match_language("en-US", c("en", "fr", "de"))
```

```{r match-fallback}
# Multiple preferences: neither de-AT nor its fallback de is available,
# so en-GB is tried next and falls back to en
bcp_match_language(c("de-AT", "en-GB"), c("en", "fr"))
```

```{r match-chinese}
# Prefer Traditional Chinese, fall back to Simplified, then English
bcp_match_language(
  c("zh-Hant-TW", "zh-Hans", "en"),
  c("zh-Hans", "en", "fr")
)
```

```{r match-default}
# No match: return the default value
bcp_match_language("pt-BR", c("fr", "de"), default = "en")
```

Matching is **case-insensitive**.
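
The truncation walk described above can be sketched in a few lines of base R. This only illustrates the RFC 4647 Lookup idea and is not the package's implementation: `lookup_sketch()` is a hypothetical helper, it handles hyphen separators only, and it compares lower-cased tags while returning the matching entry of `available` unchanged.

```{r lookup-sketch}
# Hypothetical helper for illustration; not part of the BCP47 package.
lookup_sketch <- function(preferences, available, default = NA_character_) {
  avail_lower <- tolower(available)
  for (pref in preferences) {
    subtags <- strsplit(tolower(pref), "-", fixed = TRUE)[[1]]
    while (length(subtags) > 0) {
      candidate <- paste(subtags, collapse = "-")
      hit <- match(candidate, avail_lower)
      if (!is.na(hit)) {
        return(available[hit])  # return the tag as spelled in `available`
      }
      # Strip the rightmost subtag and try again.
      subtags <- subtags[-length(subtags)]
      # RFC 4647 Lookup also drops a trailing single-character subtag
      # (an extension or private-use singleton left dangling).
      if (length(subtags) > 0 && nchar(subtags[length(subtags)]) == 1) {
        subtags <- subtags[-length(subtags)]
      }
    }
  }
  default  # nothing matched for any preference
}

lookup_sketch(c("de-AT", "en-GB"), c("en", "fr"))
```
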
The original casing from `available` is preserved in the return value:

```{r match-case}
bcp_match_language("EN-US", c("en-US", "fr"))
```

## Validation

`bcp_validate()` checks whether the language, script, and region subtags in a tag appear in the IANA Language Subtag Registry. It downloads and caches the registry on first use.

```{r validate, eval=FALSE}
bcp_validate("en-US")       # TRUE: both 'en' and 'US' are registered
bcp_validate("zh-Hans-CN")  # TRUE
bcp_validate("xx-ZZ")       # FALSE: 'xx' is not a registered language
bcp_validate("en-Xxxx")     # FALSE: 'Xxxx' is not a registered script
```

Note that validation only checks that each subtag is individually present in the registry. It does not check whether a *combination* of subtags is meaningful (e.g., `en-Hans` would pass validation even though English is not normally written in the Simplified Han script).

## Normalization

`bcp_normalize()` applies the canonicalization rules from RFC 5646:

1. **Deprecated languages** are replaced with their preferred values (e.g., `iw` → `he` for Hebrew)
2. **Default scripts** are suppressed (e.g., `en-Latn-US` → `en-US`, since Latin is the default script for English)
3. **Canonical casing** is applied: language lower-case, script title-case, region upper-case

```{r normalize, eval=FALSE}
bcp_normalize("en-us")       # "en-US"      (region uppercased)
bcp_normalize("ZH-HANS-CN")  # "zh-Hans-CN" (language lowercased, script title-cased)
bcp_normalize("en-Latn-US")  # "en-US"      (Latn is the default/suppress script for 'en')
bcp_normalize("sr-latn")     # "sr-Latn"    (Latn is NOT the default for Serbian)
```

## The IANA Registry

Both `bcp_validate()` and `bcp_normalize()` rely on the IANA Language Subtag Registry.
You can access it directly with `bcp_process_registry()`, which returns a tidy tibble:

```{r registry, eval=FALSE}
reg <- bcp_process_registry()
nrow(reg)

# Registry metadata
attr(reg, "last_update")

# Browse languages
reg[reg$type == "language", c("subtag", "description", "suppress_script")]

# Find deprecated languages and their preferred replacements
deprecated <- reg[reg$type == "language" & !is.na(reg$preferred_value), ]
head(deprecated[, c("subtag", "description", "preferred_value")])
```

### Caching

To avoid downloading the registry on every call, use `bcp_cache_update()` to save it locally:

```{r cache, eval=FALSE}
# Download and cache the registry
bcp_cache_update()

# Inspect the cache
bcp_cache_path()  # file path
bcp_cache_size()  # size on disk

# Refresh to get the latest registry
bcp_cache_update(overwrite = TRUE)

# Remove the cache
bcp_cache_clear()
```

After caching, `bcp_validate()` and `bcp_normalize()` will load from the local file automatically.