---
title: "Getting Started with BCP47"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started with BCP47}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(BCP47)
```

## What is BCP 47?

BCP 47 (Best Current Practice 47) is the IETF standard that defines how human languages are identified in internet protocols.
It is specified by two RFCs:

- **RFC 5646** — the syntax for language tags (e.g., `en-US`, `zh-Hans-CN`)
- **RFC 4647** — the rules for matching language tags to available resources

A BCP 47 tag is a sequence of subtags separated by hyphens:

```
language [-extlang] [-script] [-region] [-variant]* [-extension]* [-privateuse]
```

For example:

| Tag | Meaning |
|---|---|
| `en` | English |
| `en-US` | English as used in the United States |
| `zh-Hans-CN` | Chinese, Simplified script, as used in China |
| `sr-Latn` | Serbian written in the Latin script |
| `de-1901` | German, traditional orthography (1901 variant) |
| `x-myapp` | Entirely private-use tag |

The canonical source of valid subtags is the [IANA Language Subtag Registry](https://www.iana.org/assignments/language-subtag-registry/).
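
The subtag layout above can be made concrete with a small base-R sketch that guesses each subtag's role from its shape alone (four letters for a script, two letters or three digits for a region, and so on). `sketch_subtag_roles()` is a hypothetical helper written for this illustration, not part of the package. It assumes a simple tag that starts with a language subtag and does not handle extlang, extension, or private-use subtags.

```{r sketch-subtag-shapes}
# Illustrative only: `sketch_subtag_roles()` is a hypothetical helper,
# not a BCP47 function. It classifies subtags purely by their shape.
sketch_subtag_roles <- function(tag) {
  parts <- strsplit(tag, "-", fixed = TRUE)[[1]]

  classify <- function(p) {
    if (grepl("^[A-Za-z]{4}$", p)) {
      "script" # four letters
    } else if (grepl("^([A-Za-z]{2}|[0-9]{3})$", p)) {
      "region" # two letters or three digits
    } else if (grepl("^([A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3})$", p)) {
      "variant" # 5-8 characters, or 4 characters starting with a digit
    } else {
      "other" # extlang, extension, private use: not handled here
    }
  }

  # Assumption: the first subtag is always the primary language
  roles <- c("language", vapply(parts[-1], classify, character(1), USE.NAMES = FALSE))
  names(roles) <- parts
  roles
}

sketch_subtag_roles("zh-Hans-CN")
sketch_subtag_roles("de-1901")
```

Shape alone says nothing about whether a subtag is actually registered, which is why the registry is still needed.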

## Parsing

`bcp_parse()` decomposes a tag into its named components.
All subtags are returned in lower-case.
Both hyphens (`-`) and underscores (`_`) are accepted as separators.

```{r parse-basic}
bcp_parse("en-US")
```

```{r parse-complex}
bcp_parse("zh-Hans-CN")
```

The returned list always has the same structure:

```{r parse-fields}
tag <- bcp_parse("sr-Latn-RS-rozaj-x-custom")
names(tag)
```

- **`language`**: primary language subtag (e.g., `"sr"`)
- **`extlang`**: extended language subtags (`NULL` if absent)
- **`script`**: four-letter script subtag (`NA` if absent)
- **`region`**: two-letter or three-digit region subtag (`NA` if absent)
- **`variants`**: variant subtags (`NULL` if absent)
- **`extensions`**: named list of extension sequences
- **`private`**: private-use subtags (`NULL` if absent)

```{r parse-private}
# Pure private-use tag
bcp_parse("x-myapp-v2")
```

```{r parse-variant}
# Extension subtags: the `u` singleton introduces a Unicode locale extension
bcp_parse("en-US-u-ca-gregory")
```

## Language Matching

`bcp_match_language()` implements the RFC 4647 "Lookup" scheme.
Given an ordered list of language preferences and a set of available tags, it returns the best match.

Matching works by progressively stripping the rightmost subtag from each preference until a match is found:

```
en-US → en  (strip region)
zh-Hans-CN → zh-Hans → zh  (strip region, then script)
```
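
The truncation rule can be sketched in a few lines of base R. The helpers below are hypothetical and written only to illustrate RFC 4647 Lookup; they are not the package's implementation, and `bcp_match_language()` may differ in its details. The sketch also drops a trailing single-character subtag (such as `x` or `u`) together with the subtag that followed it, as RFC 4647 describes.

```{r sketch-lookup}
# Hypothetical helpers illustrating RFC 4647 Lookup; not part of the BCP47 API.

# Build the fallback chain for one preference, most specific first.
sketch_fallback_chain <- function(range) {
  parts <- strsplit(range, "-", fixed = TRUE)[[1]]
  chain <- character(0)
  while (length(parts) > 0) {
    chain <- c(chain, paste(parts, collapse = "-"))
    parts <- parts[-length(parts)] # strip the rightmost subtag
    # A trailing single-character subtag cannot end a tag, so remove it too
    while (length(parts) > 0 && nchar(parts[length(parts)]) == 1) {
      parts <- parts[-length(parts)]
    }
  }
  chain
}

# Try each preference in order; return the first available tag that matches.
sketch_lookup <- function(preferences, available, default = NA_character_) {
  for (pref in preferences) {
    for (candidate in sketch_fallback_chain(pref)) {
      hit <- match(tolower(candidate), tolower(available)) # case-insensitive
      if (!is.na(hit)) {
        return(available[hit]) # keep the casing used in `available`
      }
    }
  }
  default
}

sketch_fallback_chain("zh-Hans-CN")
sketch_lookup(c("de-AT", "en-GB"), c("en", "fr"))
```

Each preference is exhausted before the next one is tried, so an earlier preference always wins over a later one, even when it matches only after truncation.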

```{r match-basic}
# User prefers en-US; only 'en' is available — falls back to the base language
bcp_match_language("en-US", c("en", "fr", "de"))
```

```{r match-fallback}
# Multiple preferences: neither de-AT nor de is available, so en-GB is tried next and falls back to en
bcp_match_language(c("de-AT", "en-GB"), c("en", "fr"))
```

```{r match-chinese}
# Prefer Traditional Chinese, fall back to Simplified, then English
bcp_match_language(
  c("zh-Hant-TW", "zh-Hans", "en"),
  c("zh-Hans", "en", "fr")
)
```

```{r match-default}
# No match — return a default value
bcp_match_language("pt-BR", c("fr", "de"), default = "en")
```

Matching is **case-insensitive**.
The original casing from `available` is preserved in the return value:

```{r match-case}
bcp_match_language("EN-US", c("en-US", "fr"))
```

## Validation

`bcp_validate()` checks whether the language, script, and region subtags in a tag appear in the IANA Language Subtag Registry.
It downloads and caches the registry on first use.

```{r validate, eval=FALSE}
bcp_validate("en-US") # TRUE — both 'en' and 'US' are registered
bcp_validate("zh-Hans-CN") # TRUE
bcp_validate("xx-ZZ") # FALSE — 'xx' is not a registered language
bcp_validate("en-Xxxx") # FALSE — 'Xxxx' is not a registered script
```

Validation only checks that each subtag appears in the registry.
It does not check whether a *combination* of subtags is meaningful (e.g., `en-Hans` passes validation even though English is not normally written in the Simplified Han script).
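
The difference between per-subtag membership and a meaningful combination can be shown with a tiny hand-written stand-in for the registry. Everything in this sketch is hypothetical: `sketch_registry` holds only a handful of real subtags, and `sketch_is_registered()` is not how `bcp_validate()` is implemented.

```{r sketch-validate}
# A tiny hand-written stand-in for the registry (stored in lower case);
# the real registry contains thousands of records.
sketch_registry <- list(
  language = c("en", "zh", "sr"),
  script   = c("latn", "hans", "cyrl"),
  region   = c("us", "cn", "rs")
)

# Hypothetical helper: TRUE when every supplied subtag is in the stand-in registry.
sketch_is_registered <- function(language, script = NA, region = NA) {
  ok <- function(value, type) {
    is.na(value) || tolower(value) %in% sketch_registry[[type]]
  }
  ok(language, "language") && ok(script, "script") && ok(region, "region")
}

sketch_is_registered("en", region = "US")
# Each subtag is registered on its own, so the unusual pairing still passes
sketch_is_registered("en", script = "Hans")
sketch_is_registered("xx", region = "ZZ")
```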

## Normalization

`bcp_normalize()` applies the canonicalization and formatting conventions described in RFC 5646:

1. **Deprecated languages** are replaced with their preferred values (e.g., `iw` → `he` for Hebrew)
2. **Default scripts** are suppressed (e.g., `en-Latn-US` → `en-US`, since Latin is the default script for English)
3. **Canonical casing** is applied: language lower-case, script title-case, region upper-case

```{r normalize, eval=FALSE}
bcp_normalize("en-us") # "en-US"   (region uppercased)
bcp_normalize("ZH-HANS-CN") # "zh-Hans-CN"  (language lowercased, script title-cased)
bcp_normalize("en-Latn-US") # "en-US"   (Latn is the default/suppress script for 'en')
bcp_normalize("sr-latn") # "sr-Latn" (Latn is NOT the default for Serbian)
```
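
The three rules can be sketched in base R for simple tags of the form `language[-script][-region]`. This is a minimal illustration under stated assumptions, not the package's implementation: `sketch_normalize()` is a hypothetical helper, and the two lookup vectors are hand-written stand-ins that contain only the registry entries used in this vignette.

```{r sketch-normalize}
# Hand-written stand-ins for registry data (only the entries used here).
sketch_preferred <- c(iw = "he")   # deprecated language -> preferred value
sketch_suppress  <- c(en = "Latn") # language -> default (suppress) script

# Hypothetical helper; handles language[-script][-region] tags only.
sketch_normalize <- function(tag) {
  parts  <- strsplit(tag, "-", fixed = TRUE)[[1]]
  lang   <- tolower(parts[1])
  rest   <- parts[-1]
  script <- rest[nchar(rest) == 4]         # four characters: script
  region <- rest[nchar(rest) %in% c(2, 3)] # two or three characters: region

  # Rule 1: replace a deprecated language with its preferred value
  if (lang %in% names(sketch_preferred)) {
    lang <- sketch_preferred[[lang]]
  }

  # Rule 3: canonical casing (script title-case, region upper-case)
  script <- paste0(toupper(substr(script, 1, 1)), tolower(substr(script, 2, 4)))
  region <- toupper(region)

  # Rule 2: drop the script when it is the default for this language
  if (length(script) == 1 && identical(unname(sketch_suppress[lang]), script)) {
    script <- character(0)
  }

  paste(c(lang, script, region), collapse = "-")
}

sketch_normalize("ZH-HANS-CN")
sketch_normalize("en-Latn-US")
sketch_normalize("iw")
```

Casing is applied before the default-script check so that input such as `en-latn-us` is compared against the registry's title-case `Latn`.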

## The IANA Registry

Both `bcp_validate()` and `bcp_normalize()` rely on the IANA Language Subtag Registry.
You can access it directly with `bcp_process_registry()`, which returns a tidy tibble:

```{r registry, eval=FALSE}
reg <- bcp_process_registry()
nrow(reg)

# Registry metadata
attr(reg, "last_update")

# Browse languages
reg[reg$type == "language", c("subtag", "description", "suppress_script")]

# Find deprecated languages and their preferred replacements
deprecated <- reg[reg$type == "language" & !is.na(reg$preferred_value), ]
head(deprecated[, c("subtag", "description", "preferred_value")])
```
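
For orientation, the raw registry is a plain-text file of records separated by `%%` lines, each record being a list of `Field: value` pairs. The sketch below parses a short, abridged excerpt into a long data frame. It is illustrative only: `sketch_parse_records()` is a hypothetical helper, the records are trimmed to a few fields, and the sketch ignores details of the real file such as folded continuation lines. `bcp_process_registry()` may work quite differently.

```{r sketch-registry-format}
# Abridged, illustrative records in the registry's "Field: value" layout.
sketch_lines <- c(
  "%%",
  "Type: language",
  "Subtag: en",
  "Description: English",
  "Suppress-Script: Latn",
  "%%",
  "Type: language",
  "Subtag: iw",
  "Description: Hebrew",
  "Preferred-Value: he"
)

# Hypothetical parser: number the records at each "%%" separator,
# then split every remaining line at its first ": ".
sketch_parse_records <- function(lines) {
  record_id <- cumsum(lines == "%%")
  keep <- lines != "%%"
  body <- lines[keep]
  data.frame(
    record = record_id[keep],
    field  = sub(": .*$", "", body),
    value  = sub("^[^:]*: ", "", body),
    stringsAsFactors = FALSE
  )
}

records <- sketch_parse_records(sketch_lines)
records[records$field == "Preferred-Value", ]
```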

### Caching

To avoid downloading the registry on every call, use `bcp_cache_update()` to save it locally:

```{r cache, eval=FALSE}
# Download and cache the registry
bcp_cache_update()

# Inspect the cache
bcp_cache_path() # file path
bcp_cache_size() # size on disk

# Refresh to get the latest registry
bcp_cache_update(overwrite = TRUE)

# Remove the cache
bcp_cache_clear()
```

After caching, `bcp_validate()` and `bcp_normalize()` will load from the local file automatically.