| Title: | Work with Language Tags |
|---|---|
| Description: | Tools to parse, validate, normalize, and match language tags following the Best Current Practice 47 (BCP 47) standard, which defines the syntax (RFC 5646, <https://tools.ietf.org/html/rfc5646>) and lookup rules (RFC 4647, <https://tools.ietf.org/html/rfc4647>) for identifying human languages. Includes a bundled snapshot of the IANA Language Subtag Registry (<https://www.iana.org/assignments/language-subtag-registry/>) with optional support for updating. |
| Authors: | Christopher T. Kenny [aut, cre] (ORCID: <https://orcid.org/0000-0002-9386-6860>) |
| Maintainer: | Christopher T. Kenny <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.0.0.9000 |
| Built: | 2026-06-08 06:30:48 UTC |
| Source: | https://github.com/christopherkenny/BCP47 |
BCP47 cache
bcp_cache_path() returns the path to the cached registry file.
bcp_cache_size() reports the total size of cache files on disk.
bcp_cache_clear() deletes the cached registry file.
bcp_cache_update() downloads the latest IANA registry and writes it to cache.
bcp_cache_path()
bcp_cache_size()
bcp_cache_clear(force = FALSE)
bcp_cache_update(overwrite = FALSE)
force |
Logical. If |
overwrite |
Logical. If |
bcp_cache_path(): A character scalar giving the file path of the cached registry.
bcp_cache_size(): The cache size in bytes, invisibly. Also prints a
human-readable size message.
bcp_cache_clear(): The cache path, invisibly.
bcp_cache_update(): The registry data frame, invisibly.
bcp_cache_path()
bcp_cache_size()
Implements the "Lookup" scheme from RFC 4647 Section 3.4: given an ordered
list of language tag preferences and a set of available tags, returns the
best match. Matching proceeds by progressively stripping the rightmost
subtag from each preference until a match is found or no subtags remain.
If truncation would leave a single-character subtag (including the
private-use prefix "x") at the end of the tag, that subtag is stripped in
the same step, together with any preceding single-character subtags.
bcp_match_language(preferences, available, default = NULL)
preferences |
A character vector of BCP 47 language tags representing the caller's ordered preferences, from most to least preferred. |
available |
A character vector of BCP 47 language tags that are available to choose from. |
default |
The value to return when no preference matches any available
tag. Defaults to NULL. |
The best-matching tag from available, or default if no match is
found. Matching is case-insensitive; the returned value preserves the
original casing from available.
bcp_match_language(c('en-US', 'fr'), c('en', 'fr-FR', 'de'))
bcp_match_language('zh-Hans-CN', c('zh-TW', 'zh-Hans', 'en'))
bcp_match_language('pt-BR', c('fr', 'de'), default = 'en')
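The truncation loop described above for the RFC 4647 Section 3.4 "Lookup" scheme can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: lookup_sketch() is a hypothetical helper, it assumes hyphen-separated tags, and it ignores the "*" wildcard that RFC 4647 allows in language ranges.

```r
# Illustrative base-R sketch of RFC 4647 Section 3.4 "Lookup".
# lookup_sketch() is a hypothetical helper, not part of the BCP47 package.
lookup_sketch <- function(preferences, available, default = NULL) {
  avail_lower <- tolower(available)
  for (pref in preferences) {
    subtags <- strsplit(tolower(pref), '-', fixed = TRUE)[[1]]
    while (length(subtags) > 0) {
      candidate <- paste(subtags, collapse = '-')
      hit <- match(candidate, avail_lower)
      if (!is.na(hit)) {
        # Return the tag with the casing used in `available`.
        return(available[hit])
      }
      # Drop the rightmost subtag ...
      subtags <- subtags[-length(subtags)]
      # ... and any single-character subtag now left at the end.
      while (length(subtags) > 0 && nchar(subtags[length(subtags)]) == 1) {
        subtags <- subtags[-length(subtags)]
      }
    }
  }
  default
}

lookup_sketch(c('en-US', 'fr'), c('en', 'fr-FR', 'de'))   # "en"
lookup_sketch('zh-Hans-CN', c('zh-TW', 'zh-Hans', 'en'))  # "zh-Hans"
lookup_sketch('pt-BR', c('fr', 'de'), default = 'en')     # "en"
```

Each preference is exhausted before the next one is tried, so an earlier preference that matches only after truncation still wins over a later preference that matches exactly.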
Applies the canonicalization rules from RFC 5646: preferred values are substituted for deprecated subtags, default scripts are suppressed, and each component is cased according to convention (language lower-case, script title-case, region upper-case).
bcp_normalize(tag, registry = bcp_get_registry())
tag |
A character scalar BCP 47 language tag. |
registry |
A data frame of the IANA Language Subtag Registry, as
returned by bcp_get_registry(). |
A character scalar with the normalized BCP 47 tag.
bcp_normalize('en-us')
bcp_normalize('ZH-hans-cn')
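The casing conventions named in the description of bcp_normalize() can be reproduced from subtag shape alone, as RFC 5646 Section 2.1.1 notes. The sketch below is a minimal base-R illustration under that assumption. case_sketch() is a hypothetical helper, not a function of this package. It performs no registry lookups, so it neither substitutes preferred values nor suppresses default scripts, and it assumes hyphen-separated, well-formed input.

```r
# Hypothetical helper: applies only the casing conventions, by subtag shape.
# It does not consult the registry.
case_sketch <- function(tag) {
  subtags <- strsplit(tolower(tag), '-', fixed = TRUE)[[1]]
  for (i in seq_along(subtags)) {
    s <- subtags[i]
    # Everything from the first singleton onward stays lower-case.
    if (nchar(s) == 1) break
    # The primary language subtag stays lower-case.
    if (i == 1) next
    if (grepl('^[a-z]{4}$', s)) {
      # Four letters: treat as a script subtag and title-case it.
      subtags[i] <- paste0(toupper(substr(s, 1, 1)), substr(s, 2, 4))
    } else if (grepl('^[a-z]{2}$', s)) {
      # Two letters: treat as a region subtag and upper-case it.
      subtags[i] <- toupper(s)
    }
  }
  paste(subtags, collapse = '-')
}

case_sketch('en-us')           # "en-US"
case_sketch('ZH-hans-cn')      # "zh-Hans-CN"
case_sketch('AZ-LATN-X-LATN')  # "az-Latn-x-latn"
```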
Decomposes a BCP 47 language tag into its constituent subtags following the
syntax defined in RFC 5646. Both hyphen (-) and underscore (_) are
accepted as subtag separators.
bcp_parse(tag)
tag |
A character scalar BCP 47 language tag. |
A named list with the following elements:
language: The primary language subtag (e.g., "en", "zh"), or
NA for a pure private-use tag.
extlang: A character vector of extended language subtags
(three-letter codes following the primary language), or NULL.
script: The four-letter script subtag (e.g., "latn", "hans"),
or NA if absent.
region: The two-letter or three-digit region subtag (e.g., "us",
"419"), or NA if absent.
variants: A character vector of variant subtags, or NULL.
extensions: A named list of extension subtag sequences, keyed by the single-letter extension singleton.
private: A character vector of private-use subtags (following x-),
or NULL.
All subtags are returned in lower-case.
bcp_parse('en-US')
bcp_parse('zh-Hans-CN')
bcp_parse('de-1901')
bcp_parse('x-private')
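The decomposition described above for bcp_parse() can be approximated by classifying subtags by position and shape, following the RFC 5646 grammar. The sketch below is a simplified base-R illustration, not the package's parser. parse_sketch() is a hypothetical helper that assumes a well-formed tag, does no error reporting, and does not handle grandfathered tags.

```r
# Hypothetical helper: a simplified, shape-based parser for well-formed tags.
parse_sketch <- function(tag) {
  # Accept '-' or '_' as separators and lower-case every subtag.
  subtags <- strsplit(tolower(tag), '[-_]')[[1]]
  n <- length(subtags)
  out <- list(
    language = NA_character_, extlang = NULL, script = NA_character_,
    region = NA_character_, variants = NULL, extensions = list(),
    private = NULL
  )
  i <- 1
  # TRUE when the subtag at the current position matches `pattern`.
  matches <- function(pattern) i <= n && grepl(pattern, subtags[i])

  if (matches('^[a-z]{2,8}$')) {
    out$language <- subtags[i]
    i <- i + 1
    # Up to three extended language subtags of three letters each.
    while (length(out$extlang) < 3 && matches('^[a-z]{3}$')) {
      out$extlang <- c(out$extlang, subtags[i])
      i <- i + 1
    }
    if (matches('^[a-z]{4}$')) {
      out$script <- subtags[i]
      i <- i + 1
    }
    if (matches('^([a-z]{2}|[0-9]{3})$')) {
      out$region <- subtags[i]
      i <- i + 1
    }
    # Variants: 5 to 8 alphanumerics, or 4 characters starting with a digit.
    while (matches('^([a-z0-9]{5,8}|[0-9][a-z0-9]{3})$')) {
      out$variants <- c(out$variants, subtags[i])
      i <- i + 1
    }
  }
  while (i <= n) {
    s <- subtags[i]
    if (s == 'x') {
      # Everything after 'x' is private use.
      out$private <- subtags[seq_len(n - i) + i]
      break
    }
    if (nchar(s) == 1) {
      # An extension singleton owns the subtags up to the next singleton.
      j <- i + 1
      while (j <= n && nchar(subtags[j]) > 1) j <- j + 1
      out$extensions[[s]] <- subtags[seq_len(j - i - 1) + i]
      i <- j
    } else {
      i <- i + 1  # skip anything this sketch does not recognize
    }
  }
  out
}

parse_sketch('zh-Hans-CN')$script  # "hans"
parse_sketch('de-1901')$variants   # "1901"
parse_sketch('x-private')$private  # "private"
```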
Downloads and parses the IANA Language Subtag Registry into a tidy data
frame. Each row represents one registry entry (language, extlang, script,
region, variant, grandfathered, or redundant tag). Columns correspond to
registry fields such as type, subtag, description, added,
preferred_value, and suppress_script. When an entry has multiple values
for a single field (e.g., multiple Description lines), they are joined
with ";".
bcp_process_registry(
  url = "https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry"
)
url |
A character scalar giving the URL of the registry plain-text file. Defaults to the official IANA location. |
A tibble with one row per registry entry and one column per field.
The last_update attribute records the File-Date from the registry
header.
reg <- bcp_process_registry()
head(reg)
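The registry file is plain text in record-jar format: records are separated by lines containing only "%%", each field is a "Name: value" line, and long values continue on indented lines. The sketch below shows one way the parsing step described above could work in base R, including the ";" join for repeated fields. It is an illustration only, not the package's code. parse_registry_sketch() and sample_lines are hypothetical, the sample is an abbreviated excerpt with a placeholder File-Date, and the result is a base data frame rather than a tibble.

```r
# Hypothetical helper: parses registry text held in a character vector of
# lines, so no download is needed.
parse_registry_sketch <- function(lines) {
  # Fold continuation lines (leading whitespace) into the previous line.
  folded <- character(0)
  for (ln in lines) {
    if (grepl('^[[:space:]]', ln) && length(folded) > 0) {
      last <- length(folded)
      folded[last] <- paste(folded[last], trimws(ln))
    } else {
      folded <- c(folded, ln)
    }
  }
  # Split into records at the "%%" separator lines.
  is_sep <- folded == '%%'
  records <- split(folded[!is_sep], cumsum(is_sep)[!is_sep])

  parse_record <- function(rec) {
    # Field name is the text before the first colon, e.g. "Preferred-Value".
    keys <- tolower(gsub('-', '_', sub(':.*$', '', rec), fixed = TRUE))
    vals <- trimws(sub('^[^:]*:', '', rec))
    # Join repeated fields (e.g. several Description lines) with ";".
    vapply(split(vals, factor(keys, levels = unique(keys))),
           paste, character(1), collapse = ';')
  }
  parsed <- lapply(records, parse_record)

  # The header record carries File-Date; all other records are entries.
  is_header <- vapply(parsed, function(r) 'file_date' %in% names(r), logical(1))
  entries <- parsed[!is_header]
  fields <- unique(unlist(lapply(entries, names)))
  rows <- lapply(entries, function(r) {
    row <- as.list(r[fields])  # fields absent from a record become NA
    names(row) <- fields
    as.data.frame(row, stringsAsFactors = FALSE)
  })
  reg <- do.call(rbind, unname(rows))
  rownames(reg) <- NULL
  if (any(is_header)) {
    attr(reg, 'last_update') <- parsed[is_header][[1]][['file_date']]
  }
  reg
}

# Abbreviated sample; the File-Date value is a placeholder.
sample_lines <- c(
  'File-Date: 2024-01-01', '%%',
  'Type: language', 'Subtag: en', 'Description: English',
  'Added: 2005-10-16', 'Suppress-Script: Latn', '%%',
  'Type: language', 'Subtag: es', 'Description: Spanish',
  'Description: Castilian', 'Added: 2005-10-16',
  'Suppress-Script: Latn', '%%',
  'Type: region', 'Subtag: US', 'Description: United States',
  'Added: 2005-10-16'
)
reg_sketch <- parse_registry_sketch(sample_lines)
reg_sketch$description[2]  # "Spanish;Castilian"
```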
Checks whether the primary language, script, and region subtags of a BCP 47 tag appear in the IANA Language Subtag Registry. Extension and private-use subtags are not validated.
bcp_validate(tag, registry = bcp_get_registry())
tag |
A character scalar BCP 47 language tag. |
registry |
A data frame of the IANA Language Subtag Registry, as
returned by bcp_get_registry(). |
A logical scalar: TRUE if all checked subtags are registered,
FALSE otherwise.
bcp_validate('en-US')
bcp_validate('en-Latn-US')
bcp_validate('xx-ZZ')
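The registry check described above amounts to a membership test per subtag type. The sketch below illustrates the idea in base R against a three-row stand-in registry. Both validate_sketch() and mini_registry are hypothetical and exist only for this illustration. The sketch assumes the registry has type and subtag columns, as listed for bcp_process_registry(), and how the package itself treats malformed or private-use-only tags is not shown here.

```r
# Hypothetical three-row stand-in; the real registry has thousands of entries.
mini_registry <- data.frame(
  type = c('language', 'script', 'region'),
  subtag = c('en', 'Latn', 'US'),
  stringsAsFactors = FALSE
)

# Hypothetical helper: checks language, script, and region subtags only.
validate_sketch <- function(tag, registry) {
  subtags <- strsplit(tolower(tag), '[-_]')[[1]]
  # Case-insensitive membership test within one registry type.
  known <- function(type, value) {
    value %in% tolower(registry$subtag[registry$type == type])
  }
  if (length(subtags) == 0 || !known('language', subtags[1])) {
    return(FALSE)
  }
  rest <- subtags[-1]
  # Ignore extension and private-use subtags: stop at the first singleton.
  singleton <- which(nchar(rest) == 1)
  if (length(singleton) > 0) {
    rest <- rest[seq_len(singleton[1] - 1)]
  }
  for (s in rest) {
    # Four letters: script. Two letters or three digits: region.
    if (grepl('^[a-z]{4}$', s) && !known('script', s)) return(FALSE)
    if (grepl('^([a-z]{2}|[0-9]{3})$', s) && !known('region', s)) return(FALSE)
  }
  TRUE
}

validate_sketch('en-US', mini_registry)       # TRUE
validate_sketch('en-Latn-US', mini_registry)  # TRUE
validate_sketch('xx-ZZ', mini_registry)       # FALSE
```

Because mini_registry holds only three entries, a tag such as 'en-FR' is also rejected here, although its subtags are present in the full registry.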