# Unicode Syntax

Elixir implements [Unicode Annex #31](http://unicode.org/reports/tr31/) for non-quoted atoms and variables as specified in this document.

## Version

The specification in this document is included in Elixir v1.5 and requires OTP 20+. To check the Unicode version of your current Elixir installation, run `String.Unicode.version()`.

## R1. Default Identifiers

Elixir identifiers are defined as:

    <Identifier> := <Start> <Continue>* <Ending>?

where `<Start>` is:

> characters derived from the Unicode General Category of uppercase letters, lowercase letters, titlecase letters, modifier letters, other letters, letter numbers, plus `Other_ID_Start`, minus `Pattern_Syntax` and `Pattern_White_Space` code points
>
> In set notation: `[[:L:][:Nl:][:Other_ID_Start:]--[:Pattern_Syntax:]--[:Pattern_White_Space:]]`

and `<Continue>` is:

> ID_Start characters, plus characters having the Unicode General Category of nonspacing marks, spacing combining marks, decimal number, connector punctuation, plus `Other_ID_Continue`, minus `Pattern_Syntax` and `Pattern_White_Space` code points.
>
> In set notation: `[[:ID_Start:][:Mn:][:Mc:][:Nd:][:Pc:][:Other_ID_Continue:]--[:Pattern_Syntax:]--[:Pattern_White_Space:]]`

`<Ending>` is an addition specific to Elixir that includes the codepoints ? (003F) and ! (0021).

Elixir does not implement requirement R1a. It does implement requirement R1b.
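
As a rough illustration of the `<Start> <Continue>* <Ending>?` shape, the grammar can be approximated with a regular expression over Unicode general categories. This is only a sketch: the module name `IdentifierSketch` is hypothetical, and the pattern deliberately ignores `Other_ID_Start`, `Other_ID_Continue`, and the `Pattern_Syntax` / `Pattern_White_Space` subtractions, so it is not the check the Elixir tokenizer performs.

```elixir
defmodule IdentifierSketch do
  # Approximates <Start> <Continue>* <Ending>? using general categories only.
  # Assumption: Other_ID_Start, Other_ID_Continue and the Pattern_Syntax /
  # Pattern_White_Space subtractions are left out for brevity.
  def identifier?(string) when is_binary(string) do
    Regex.match?(
      ~r/\A[\p{L}\p{Nl}][\p{L}\p{Nl}\p{Mn}\p{Mc}\p{Nd}\p{Pc}]*[?!]?\z/u,
      string
    )
  end
end

IO.inspect(IdentifierSketch.identifier?("josé"))   # letters only
IO.inspect(IdentifierSketch.identifier?("valid?")) # <Ending> in final position
IO.inspect(IdentifierSketch.identifier?("va?lid")) # <Ending> in the middle is rejected
IO.inspect(IdentifierSketch.identifier?("1st"))    # a decimal number is not <Start>
```

Note that `_` only qualifies as `<Continue>` here (it is connector punctuation); accepting it as the first character is one of the atom and variable modifications.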

### Atoms

Atoms in Elixir follow the identifier rule above with the following modifications:

  * `<Start>` includes the codepoint _ (005F)
  * `<Continue>` includes the codepoint @ (0040)
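
A few unquoted atoms that exercise these modifications are sketched below. The atom names are made up for illustration, and the last line shows the quoted form for text that falls outside the rules.

```elixir
# Each literal is an unquoted atom: a leading underscore, an @ in <Continue>,
# the ? and ! endings, and a non-ASCII letter.
atoms = [:_private, :user@host, :valid?, :save!, :josé]
Enum.each(atoms, &IO.inspect/1)

# A space is not part of <Continue>, so this atom has to be quoted.
IO.inspect(:"hello world")
```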

### Variables

Variables in Elixir follow the identifier rule above with the following modifications:

  * `<Start>` includes the codepoint _ (005F)
  * `<Start>` must not include Lu (letter uppercase) and Lt (letter titlecase) characters
  * `<Continue>` includes Lu (letter uppercase) and Lt (letter titlecase) characters
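
One way to see the effect of these rules is to ask the parser what it makes of a snippet. The sketch below uses `Code.string_to_quoted/1` and a hypothetical helper module, `VariableSketch`, that classifies the resulting AST node; it assumes the usual AST shapes of `{name, meta, context}` for variables and `{:__aliases__, meta, parts}` for aliases.

```elixir
defmodule VariableSketch do
  # Classifies a source snippet by the AST node the parser produces for it.
  def classify(source) when is_binary(source) do
    case Code.string_to_quoted(source) do
      {:ok, {:__aliases__, _meta, _parts}} -> :alias
      {:ok, {name, _meta, context}} when is_atom(name) and is_atom(context) -> :variable
      {:ok, _other} -> :other
      {:error, _reason} -> :error
    end
  end
end

IO.inspect(VariableSketch.classify("_unused")) # leading underscore
IO.inspect(VariableSketch.classify("fooBar"))  # uppercase letter in <Continue>
IO.inspect(VariableSketch.classify("FooBar"))  # uppercase <Start> is not a variable
```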

## R6. Filtered Normalized Identifiers

Identifiers in Elixir are case sensitive.

Elixir requires all atoms and variables to be in NFC form. Any other form will fail with a relevant error message. Quoted atoms and strings can, however, be in any form and are not verified by the parser.

In other words, the atom `:josé` can only be written with the codepoints 006A 006F 0073 00E9. On the other hand, `:"josé"` may be written as 006A 006F 0073 00E9 or 006A 006F 0073 0065 0301.
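
The two codepoint sequences can be compared at the string level with `String.normalize/2`. The snippet below is a small sketch with illustrative variable names; it builds both spellings with escapes so that the result does not depend on how an editor saves the file, and it creates the atoms with `String.to_atom/1`, which behaves like the quoted form in that it does not verify normalization.

```elixir
nfc = "jos\u00E9"       # 006A 006F 0073 00E9
nfd = "jos\u0065\u0301" # 006A 006F 0073 0065 0301

# Print the raw codepoints rather than a rendered charlist.
IO.inspect(String.to_charlist(nfc), charlists: :as_lists)
IO.inspect(String.to_charlist(nfd), charlists: :as_lists)

# The two binaries differ until the second one is normalized to NFC.
IO.inspect(nfc == nfd)
IO.inspect(String.normalize(nfd, :nfc) == nfc)

# Atoms built from the two spellings are distinct atoms.
IO.inspect(String.to_atom(nfc) == String.to_atom(nfd))
```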

## Other considerations

Elixir supports only the codepoints \t (0009), \n (000A), \r (000D) and \s (0020) as whitespace and therefore does not follow requirement R3.
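
The restriction can be observed by parsing the same expression with different separators. The following is a hedged sketch using `Code.string_to_quoted/1`: a tab is in the supported set, while NO-BREAK SPACE (00A0) is not and is therefore expected to produce an error tuple, whose exact contents may vary between Elixir versions.

```elixir
# Tab (0009) is one of the four supported whitespace codepoints.
IO.inspect(Code.string_to_quoted("1 +\t2"))

# NO-BREAK SPACE (00A0) is outside that set, so parsing is expected to fail.
IO.inspect(Code.string_to_quoted("1 +\u00A02"))
```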