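BCP 47 (Best Current Practice 47) is the IETF standard that defines how human languages are identified in internet protocols. It is specified by two RFCs:

- RFC 5646, which defines the syntax of language tags (e.g., en-US, zh-Hans-CN)
- RFC 4647, which defines how language tags are matched against user preferences

A BCP 47 tag is a sequence of subtags separated by hyphens: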
language [-script] [-region] [-variant]* [-extension]* [-privateuse]
For example:
| Tag | Meaning |
|---|---|
| en | English |
| en-US | English as used in the United States |
| zh-Hans-CN | Chinese, Simplified script, as used in China |
| sr-Latn | Serbian written in the Latin script |
| de-1901 | German, traditional orthography (1901 variant) |
| x-myapp | Entirely private-use tag |
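To make the positional grammar more concrete, here is a minimal base R sketch that guesses the role of each subtag from its shape alone. The helper classify_subtags() is hypothetical and not part of the package; it ignores extlang, variant, extension, and private-use rules and labels anything it does not recognize as "other".

```r
# Hypothetical helper (not part of the package): classify subtags by shape.
classify_subtags <- function(tag) {
  parts <- strsplit(tag, "[-_]")[[1]]
  role <- vapply(seq_along(parts), function(i) {
    p <- parts[i]
    if (i == 1 && grepl("^[A-Za-z]{2,3}$", p)) {
      "language"                      # 2-3 letters in first position
    } else if (grepl("^[A-Za-z]{4}$", p)) {
      "script"                        # exactly four letters
    } else if (grepl("^([A-Za-z]{2}|[0-9]{3})$", p)) {
      "region"                        # two letters or three digits
    } else {
      "other"                         # variants, private use, etc.
    }
  }, character(1))
  setNames(role, parts)
}

classify_subtags("zh-Hans-CN")  # zh: language, Hans: script, CN: region
classify_subtags("de-1901")     # de: language, 1901: other
```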
The canonical source of valid subtags is the IANA Language Subtag Registry.
bcp_parse() decomposes a tag into its named components.
All subtags are returned in lower-case. Both hyphens (-)
and underscores (_) are accepted as separators.
bcp_parse("en-US")
#> $language
#> [1] "en"
#>
#> $extlang
#> NULL
#>
#> $script
#> [1] NA
#>
#> $region
#> [1] "us"
#>
#> $variants
#> NULL
#>
#> $extensions
#> list()
#>
#> $private
#> NULL

bcp_parse("zh-Hans-CN")
#> $language
#> [1] "zh"
#>
#> $extlang
#> NULL
#>
#> $script
#> [1] "hans"
#>
#> $region
#> [1] "cn"
#>
#> $variants
#> NULL
#>
#> $extensions
#> list()
#>
#> $private
#> NULL

The returned list always has the same structure:
tag <- bcp_parse("sr-Latn-RS-rozaj-x-custom")
names(tag)
#> [1] "language" "extlang" "script" "region" "variants"
#> [6] "extensions" "private"

- language: primary language subtag (e.g., "sr")
- extlang: extended language subtags (NULL if absent)
- script: four-letter script subtag (NA if absent)
- region: two-letter or three-digit region subtag (NA if absent)
- variants: variant subtags (NULL if absent)
- extensions: named list of extension sequences
- private: private-use subtags (NULL if absent)

bcp_match_language() implements the RFC 4647 “Lookup” scheme. Given an ordered list of language preferences and a set of available tags, it returns the best match.
Matching works by progressively stripping the rightmost subtag from each preference until a match is found:
en-US → en (strip region)
zh-Hans-CN → zh-Hans → zh (strip region, then script)
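The truncation steps above can be sketched in base R. This is an illustration of the RFC 4647 Lookup idea, not the package's implementation: fallback_chain() and lookup_sketch() are hypothetical names, and the comparison is exact and case-sensitive for brevity.

```r
# Hypothetical helper: list fallback candidates, most specific first.
fallback_chain <- function(tag) {
  parts <- strsplit(tag, "-", fixed = TRUE)[[1]]
  chain <- character(0)
  while (length(parts) > 0) {
    chain <- c(chain, paste(parts, collapse = "-"))
    parts <- parts[-length(parts)]            # strip the rightmost subtag
    # RFC 4647 also removes a trailing single-character subtag (e.g. "x")
    while (length(parts) > 0 && nchar(parts[length(parts)]) == 1) {
      parts <- parts[-length(parts)]
    }
  }
  chain
}

# Hypothetical helper: try each preference in order, walking its chain.
lookup_sketch <- function(preferences, available, default = NA_character_) {
  for (pref in preferences) {
    for (candidate in fallback_chain(pref)) {
      if (candidate %in% available) return(candidate)
    }
  }
  default
}

fallback_chain("zh-Hans-CN")                       # "zh-Hans-CN" "zh-Hans" "zh"
lookup_sketch(c("de-AT", "en-GB"), c("en", "fr"))  # "en"
```

Each preference is exhausted before the next one is tried, so a loose match on an earlier preference wins over an exact match on a later one.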
# User prefers en-US; only 'en' is available, so it falls back to the base language
bcp_match_language("en-US", c("en", "fr", "de"))
#> [1] "en"

# Multiple preferences: de-AT falls back through de, then en-GB falls back to en
bcp_match_language(c("de-AT", "en-GB"), c("en", "fr"))
#> [1] "en"

# Prefer Traditional Chinese, fall back to Simplified, then English
bcp_match_language(
  c("zh-Hant-TW", "zh-Hans", "en"),
  c("zh-Hans", "en", "fr")
)
#> [1] "zh-Hans"

# No match: return a default value
bcp_match_language("pt-BR", c("fr", "de"), default = "en")
#> [1] "en"

Matching is case-insensitive. The original casing from available is preserved in the return value.
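One way to get this behaviour, shown here as a base R sketch rather than the package's actual code, is to compare lower-cased copies of both sides and then index back into the original vector. The variable names are illustrative only.

```r
available <- c("en-GB", "zh-Hans", "FR")
candidate <- "ZH-hans"

# Compare lower-cased copies, but return the element as originally written
idx <- match(tolower(candidate), tolower(available))
matched <- available[idx]
matched  # "zh-Hans"
```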
bcp_validate() checks whether the language, script, and
region subtags in a tag appear in the IANA Language Subtag Registry. It
downloads and caches the registry on first use.
bcp_validate("en-US") # TRUE: both 'en' and 'US' are registered
bcp_validate("zh-Hans-CN") # TRUE
bcp_validate("xx-ZZ") # FALSE: 'xx' is not a registered language
bcp_validate("en-Xxxx") # FALSE: 'Xxxx' is not a registered script

Note that validation only checks that each subtag is present in the registry. It does not check whether a combination of subtags is meaningful (e.g., en-Hans would pass validation even though English is not normally written in the Han script).
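The following base R sketch illustrates why such combinations pass: each subtag is looked up independently. The miniature registry and the helper is_registered() are hypothetical stand-ins; the real registry holds thousands of entries and the package's internals may differ.

```r
# Hypothetical miniature registry, stored in lower case
registry <- list(
  language = c("en", "zh", "sr", "de"),
  script   = c("latn", "hans", "hant", "cyrl"),
  region   = c("us", "cn", "rs", "gb")
)

# Each subtag is checked on its own; combinations are never examined
is_registered <- function(language, script = NA, region = NA) {
  tolower(language) %in% registry$language &&
    (is.na(script) || tolower(script) %in% registry$script) &&
    (is.na(region) || tolower(region) %in% registry$region)
}

is_registered("en", script = "Hans")  # TRUE: both subtags exist individually
is_registered("xx", region = "ZZ")    # FALSE: 'xx' is not in the language list
```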
bcp_normalize() applies the canonicalization rules from RFC 5646:

- Replaces deprecated subtags with their preferred values (e.g., iw → he for Hebrew)
- Removes a script subtag that is redundant for the language (e.g., en-Latn-US → en-US, since Latin is the default script for English)

Both bcp_validate() and bcp_normalize() rely on the IANA Language Subtag Registry. You can access it directly with bcp_process_registry(), which returns a tidy tibble:
reg <- bcp_process_registry()
nrow(reg)
# Registry metadata
attr(reg, "last_update")
# Browse languages
reg[reg$type == "language", c("subtag", "description", "suppress_script")]
# Find deprecated languages and their preferred replacements
deprecated <- reg[reg$type == "language" & !is.na(reg$preferred_value), ]
head(deprecated[, c("subtag", "description", "preferred_value")])

To avoid downloading the registry on every call, use bcp_cache_update() to save it locally:
# Download and cache the registry
bcp_cache_update()
# Inspect the cache
bcp_cache_path() # file path
bcp_cache_size() # size on disk
# Refresh to get the latest registry
bcp_cache_update(overwrite = TRUE)
# Remove the cache
bcp_cache_clear()

After caching, bcp_validate() and bcp_normalize() will load from the local file automatically.
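As a closing illustration of how the registry drives normalization, the base R sketch below applies the preferred_value and suppress_script columns to a language and an optional script. It is a simplified sketch under stated assumptions: the two-row data frame is a hand-written excerpt, normalize_sketch() is a hypothetical helper, and region, variant, and casing rules are left out.

```r
# Hypothetical two-row excerpt; column names follow the registry tibble
reg <- data.frame(
  subtag          = c("iw", "en"),
  preferred_value = c("he", NA),
  suppress_script = c("Hebr", "Latn"),
  stringsAsFactors = FALSE
)

normalize_sketch <- function(language, script = NA) {
  row <- reg[reg$subtag == tolower(language), ]
  if (nrow(row) == 1) {
    # Rule 1: swap a deprecated subtag for its preferred value
    if (!is.na(row$preferred_value)) language <- row$preferred_value
    # Rule 2: drop the script when it is the language's default script
    if (!is.na(script) && !is.na(row$suppress_script) &&
        tolower(script) == tolower(row$suppress_script)) {
      script <- NA
    }
  }
  parts <- c(language, script)
  paste(parts[!is.na(parts)], collapse = "-")
}

normalize_sketch("iw")          # "he"
normalize_sketch("en", "Latn")  # "en"
normalize_sketch("en", "Cyrl")  # "en-Cyrl"
```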