\section{s:lexical-conventions}{Lexical conventions}
%HEVEA\cutname{lex.html}
\subsubsection*{sss:lex:blanks}{Blanks}

The following characters are considered blanks: space,
horizontal tabulation, carriage return, line feed and form feed. Blanks are
ignored, but they separate adjacent identifiers, literals and
keywords that would otherwise be read as a single identifier,
literal or keyword.

\subsubsection*{sss:lex:comments}{Comments}

Comments are introduced by the two characters @"(*"@, with no
intervening blanks, and terminated by the two characters @"*)"@, with
no intervening blanks. Comments are treated as blank characters.
Comments do not occur inside string or character literals. Comments
can be nested: each @"(*"@ must be closed by a matching @"*)"@.

\begin{caml_example}{verbatim}
(* single line comment *)

(* multiple line comment, commenting out part of a program, and containing a
nested comment:
let f = function
  | 'A'..'Z' -> "Uppercase"
    (* Add other cases later... *)
*)
\end{caml_example}

\subsubsection*{sss:lex:identifiers}{Identifiers}

\begin{syntax}
ident: (letter || "_") { letter || "0"\ldots"9" || "_" || "'" } ;
capitalized-ident: ("A"\ldots"Z") { letter || "0"\ldots"9" || "_" || "'" } ;
lowercase-ident:
   ("a"\ldots"z" || "_") { letter || "0"\ldots"9" || "_" || "'" } ;
letter: "A"\ldots"Z" || "a"\ldots"z"
\end{syntax}

Identifiers are sequences of letters, digits, "_" (the underscore
character), and "'" (the single quote), starting with a
letter or an underscore.
Letters include at least the 52 lowercase and uppercase
letters from the ASCII set. The current implementation
also recognizes as letters some characters from the ISO
8859-1 set (characters 192--214 and 216--222 as uppercase letters;
characters 223--246 and 248--255 as lowercase letters). This
feature is deprecated and should be avoided for future compatibility.

All characters in an identifier are
meaningful. The current implementation accepts identifiers up to
16000000 characters in length.

In many places, OCaml makes a distinction between capitalized
identifiers and identifiers that begin with a lowercase letter.  The
underscore character is considered a lowercase letter for this
purpose.
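
As a small illustration of this distinction (all names below are
arbitrary and chosen only for the example), constructor names are
capitalized identifiers, whereas value names are lowercase identifiers,
possibly starting with an underscore or containing a single quote:

\begin{caml_example}{toplevel}
type shape = Circle | Square
let x' = 1
let _unused = 2
let describe = function Circle -> "circle" | Square -> "square";;
\end{caml_example}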

\subsubsection*{sss:integer-literals}{Integer literals}

\begin{syntax}
integer-literal:
          ["-"] ("0"\ldots"9") { "0"\ldots"9" || "_" }
        | ["-"] ("0x" || "0X") ("0"\ldots"9" || "A"\ldots"F" || "a"\ldots"f")
                            { "0"\ldots"9" || "A"\ldots"F" || "a"\ldots"f" || "_" }
        | ["-"] ("0o" || "0O") ("0"\ldots"7") { "0"\ldots"7" || "_" }
        | ["-"] ("0b" || "0B") ("0"\ldots"1") { "0"\ldots"1" || "_" }
;
int32-literal: integer-literal 'l'
;
int64-literal: integer-literal 'L'
;
nativeint-literal: integer-literal 'n'
\end{syntax}

An integer literal is a sequence of one or more digits, optionally
preceded by a minus sign. By default, integer literals are in decimal
(radix 10). The following prefixes select a different radix:
\begin{tableau}{|l|l|}{Prefix}{Radix}
\entree{"0x", "0X"}{hexadecimal (radix 16)}
\entree{"0o", "0O"}{octal (radix 8)}
\entree{"0b", "0B"}{binary (radix 2)}
\end{tableau}
(The initial @"0"@ is the digit zero; the @"O"@ for octal is the letter O.)
An integer literal can be followed by one of the letters "l", "L" or "n"
to indicate that this integer has type "int32", "int64" or "nativeint"
respectively, instead of the default type "int" for integer literals.
The interpretation of integer literals that fall outside the range of
representable integer values is undefined.

For convenience and readability, underscore characters (@"_"@) are accepted
(and ignored) within integer literals.

\begin{caml_example}{toplevel}
let house_number = 37
let million = 1_000_000
let copyright = 0x00A9
let counter64bit = ref 0L;;
\end{caml_example}
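
The following sketch (with arbitrary names) writes the same value in the
four radices, and uses the three suffixes to obtain literals of type
"int32", "int64" and "nativeint":

\begin{caml_example}{toplevel}
let same_value = (255 = 0xFF) && (0xFF = 0o377) && (0o377 = 0b1111_1111)
let one_int32 = 1l
let one_int64 = 1L
let one_nativeint = 1n;;
\end{caml_example}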

\subsubsection*{sss:floating-point-literals}{Floating-point literals}

\begin{syntax}
float-literal:
          ["-"] ("0"\ldots"9") { "0"\ldots"9" || "_" } ["." { "0"\ldots"9" || "_" }]
          [("e" || "E") ["+" || "-"] ("0"\ldots"9") { "0"\ldots"9" || "_" }]
        | ["-"] ("0x" || "0X")
          ("0"\ldots"9" || "A"\ldots"F" || "a"\ldots"f")
          { "0"\ldots"9" || "A"\ldots"F" || "a"\ldots"f" || "_" } \\
          ["." { "0"\ldots"9" || "A"\ldots"F" || "a"\ldots"f" || "_" }]
          [("p" || "P") ["+" || "-"] ("0"\ldots"9") { "0"\ldots"9" || "_" }]
\end{syntax}

Floating-point decimal literals consist of an integer part, a
fractional part and
an exponent part. The integer part is a sequence of one or more
digits, optionally preceded by a minus sign. The fractional part is a
decimal point followed by zero or more digits.
The exponent part is the character @"e"@ or @"E"@ followed by an
optional @"+"@ or @"-"@ sign, followed by one or more digits.  It is
interpreted as a power of 10.
The fractional part or the exponent part can be omitted but not both, to
avoid ambiguity with integer literals.
The interpretation of floating-point literals that fall outside the
range of representable floating-point values is undefined.

Floating-point hexadecimal literals are denoted with the @"0x"@ or @"0X"@
prefix.  The syntax is similar to that of floating-point decimal
literals, with the following differences.
The integer part and the fractional part use hexadecimal
digits.  The exponent part starts with the character  @"p"@ or @"P"@.
It is written in decimal and interpreted as a power of 2.

For convenience and readability, underscore characters (@"_"@) are accepted
(and ignored) within floating-point literals.

\begin{caml_example}{toplevel}
let pi = 3.141_592_653_589_793_12
let small_negative = -1e-5
let machine_epsilon = 0x1p-52;;
\end{caml_example}
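
As a worked example of the hexadecimal notation, "0x1.8p1" denotes
$(1 + 8/16) \times 2^1 = 3$. The sketch below (names are arbitrary) checks
this, together with a decimal literal that omits the fractional part and
one that omits the exponent part; "epsilon_float" is the standard library
constant equal to $2^{-52}$:

\begin{caml_example}{toplevel}
let three = 0x1.8p1
let checks =
  (three = 3.0) && (0x1p-52 = epsilon_float) && (1e3 = 1000.);;
\end{caml_example}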

\subsubsection*{sss:character-literals}{Character literals}
\label{s:characterliteral}

\begin{syntax}
char-literal:
          "'" regular-char "'"
        | "'" escape-sequence "'"
;
escape-sequence:
          "\" ("\" || '"' || "'" || "n" || "t" || "b" || "r" || space)
        | "\" ("0"\ldots"9") ("0"\ldots"9") ("0"\ldots"9")
        | "\x" ("0"\ldots"9" || "A"\ldots"F" || "a"\ldots"f")
               ("0"\ldots"9" || "A"\ldots"F" || "a"\ldots"f")
        | "\o" ("0"\ldots"3") ("0"\ldots"7") ("0"\ldots"7")
\end{syntax}

Character literals are delimited by @"'"@ (single quote) characters.
The two single quotes enclose either one character different from
@"'"@ and @'\'@, or one of the escape sequences below:
\begin{tableau}{|l|l|}{Sequence}{Character denoted}
\entree{"\\\\"}{backslash ("\\")}
\entree{"\\\""}{double quote ("\"")}
\entree{"\\'"}{single quote ("'")}
\entree{"\\n"}{linefeed (LF)}
\entree{"\\r"}{carriage return (CR)}
\entree{"\\t"}{horizontal tabulation (TAB)}
\entree{"\\b"}{backspace (BS)}
\entree{"\\"\var{space}}{space (SPC)}
\entree{"\\"\var{ddd}}{the character with ASCII code \var{ddd} in decimal}
\entree{"\\x"\var{hh}}{the character with ASCII code \var{hh} in hexadecimal}
\entree{"\\o"\var{ooo}}{the character with ASCII code \var{ooo} in octal}
\end{tableau}

\begin{caml_example}{toplevel}
let a = 'a'
let single_quote = '\''
let copyright = '\xA9';;
\end{caml_example}
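
To illustrate the numeric escape sequences, the sketch below (names are
arbitrary) writes the character with ASCII code 65 in four different
ways: directly, in decimal, in hexadecimal and in octal.

\begin{caml_example}{toplevel}
let four_ways = ['A'; '\065'; '\x41'; '\o101']
let all_equal = List.for_all (fun c -> c = 'A') four_ways;;
\end{caml_example}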

\subsubsection*{sss:stringliterals}{String literals}

\begin{syntax}
string-literal:
          '"' { string-character } '"'
       |  '{' quoted-string-id '|' { any-char } '|' quoted-string-id '}'
;
quoted-string-id:
     { 'a'\ldots'z' || '_' }
;
string-character:
          regular-string-char
        | escape-sequence
        | "\u{" {{ "0"\ldots"9" || "A"\ldots"F" || "a"\ldots"f" }} "}"
        | '\' newline { space || tab }
\end{syntax}

String literals are delimited by @'"'@ (double quote) characters.
The two double quotes enclose a sequence of either characters
different from @'"'@ and @'\'@, or escape sequences from the
table given above for character literals, or a Unicode character
escape sequence.

A Unicode character escape sequence is substituted by the UTF-8
encoding of the specified Unicode scalar value. The Unicode scalar
value, an integer in the ranges 0x0000...0xD7FF or 0xE000...0x10FFFF,
is defined using 1 to 6 hexadecimal digits; leading zeros are allowed.

\begin{caml_example}{toplevel}
let greeting = "Hello, World!\n"
let superscript_plus = "\u{207A}";;
\end{caml_example}
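
Because the escape is replaced by its UTF-8 encoding, a single Unicode
escape may contribute several bytes to the string. As a sketch (names
are arbitrary), U+207A is encoded in UTF-8 as the three bytes E2, 81 and
BA, so the two literals below denote the same string:

\begin{caml_example}{toplevel}
let with_unicode_escape = "\u{207A}"
let with_byte_escapes = "\xE2\x81\xBA"
let byte_length = String.length with_unicode_escape
let same_string = (with_unicode_escape = with_byte_escapes);;
\end{caml_example}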

To allow splitting long string literals across lines, the sequence
"\\"\var{newline}~\var{spaces-or-tabs} (a backslash at the end of a line
followed by any number of spaces and horizontal tabulations at the
beginning of the next line) is ignored inside string literals.

\begin{caml_example}{toplevel}
let longstr =
  "Call me Ishmael. Some years ago — never mind how long \
  precisely — having little or no money in my purse, and \
  nothing particular to interest me on shore, I thought I\
  \ would sail about a little and see the watery part of t\
  he world.";;
\end{caml_example}

Quoted string literals provide an alternative lexical syntax for
string literals. They are useful to represent strings of arbitrary content
without escaping. Quoted strings are delimited by a matching pair
of @'{' quoted-string-id '|'@ and @'|' quoted-string-id '}'@ with
the same @quoted-string-id@ on both sides. Quoted strings do not interpret
any character in a special way, but require that the
sequence @'|' quoted-string-id '}'@ does not occur in the string itself.
The identifier @quoted-string-id@ is a (possibly empty) sequence of
lowercase letters and underscores that can be freely chosen to avoid
this issue.

\begin{caml_example}{toplevel}
let quoted_greeting = {|"Hello, World!"|}
let nested = {ext|hello {|world|}|ext};;
\end{caml_example}

The current implementation places practically no restrictions on the
length of string literals.

\subsubsection*{sss:labelname}{Naming labels}

To avoid ambiguities, naming labels in expressions cannot just be defined
syntactically as the sequence of the three tokens "~", @ident@ and
":", and have to be defined at the lexical level.

\begin{syntax}
label-name: lowercase-ident
;
label: "~" label-name ":"
;
optlabel: "?" label-name ":"
\end{syntax}

Naming labels come in two flavours: @label@ for normal arguments and
@optlabel@ for optional ones. They are simply distinguished by their
first character, either "~" or "?".

Despite @label@ and @optlabel@ being lexical entities in expressions,
their expansions @'~' label-name ':'@ and @'?' label-name ':'@ will be
used in grammars, for the sake of readability. Note also that inside
type expressions, this expansion can be taken literally, {\em i.e.}
there are really 3 tokens, with optional blanks between them.
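
As a minimal sketch of both flavours in expressions (the function and
argument names are arbitrary), "~width:" and "~height:" below are
@label@ tokens, and "?step:" is an @optlabel@ token:

\begin{caml_example}{toplevel}
let area ~width ~height = width * height
let bump ?(step = 1) x = x + step
let twelve = area ~width:3 ~height:4
let eleven = bump ?step:(Some 10) 1;;
\end{caml_example}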

\subsubsection*{sss:lex-ops-symbols}{Prefix and infix symbols}

%%  || '`' lowercase-ident '`'

\begin{syntax}
infix-symbol:
        (core-operator-char || '%' || '<') { operator-char }
      | "#" {{ operator-char }}
;
prefix-symbol:
        '!' { operator-char }
      | ('?' || '~') {{ operator-char }}
;
operator-char:
        '~' || '!' || '?' || core-operator-char || '%' || '<' || ':' || '.'
;
core-operator-char:
        '$' || '&' || '*' || '+' || '-' || '/' || '=' || '>' || '@' || '^' || '|'
\end{syntax}
See also the following language extensions:
\hyperref[s:ext-ops]{extension operators},
\hyperref[s:index-operators]{extended indexing operators},
and \hyperref[s:binding-operators]{binding operators}.

Sequences of ``operator characters'', such as "<=>" or "!!",
are read as a single token from the @infix-symbol@ or @prefix-symbol@
class. These symbols are parsed as prefix and infix operators inside
expressions, but otherwise behave like normal identifiers.
%% Identifiers starting with a lowercase letter and enclosed
%% between backquote characters @'`' lowercase-ident '`'@ are also parsed
%% as infix operators.
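
As an illustration (the definitions below are arbitrary and not part of
the standard library), "<=>" can be defined and used as an infix
operator, "!!" as a prefix operator, and either can be passed around like
a normal identifier when enclosed in parentheses:

\begin{caml_example}{toplevel}
let ( <=> ) a b = compare a b
let ( !! ) r = !(!r)
let comparison = 1 <=> 2
let five = !! (ref (ref 5))
let sorted = List.sort ( <=> ) [3; 1; 2];;
\end{caml_example}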

\subsubsection*{sss:keywords}{Keywords}

The identifiers below are reserved as keywords, and cannot be employed
otherwise:
\begin{verbatim}
      and         as          assert      asr         begin       class
      constraint  do          done        downto      else        end
      exception   external    false       for         fun         function
      functor     if          in          include     inherit     initializer
      land        lazy        let         lor         lsl         lsr
      lxor        match       method      mod         module      mutable
      new         nonrec      object      of          open        or
      private     rec         sig         struct      then        to
      true        try         type        val         virtual     when
      while       with
\end{verbatim}
%
\goodbreak%
%
The following character sequences are also keywords:
%
%% FIXME the token >] is not used anywhere in the syntax
%
\begin{alltt}
"    !=    #     &     &&    '     (     )     *     +     ,     -"
"    -.    ->    .     ..    .~    :     ::    :=    :>    ;     ;;"
"    <     <-    =     >     >]    >}    ?     [     [<    [>    [|"
"    ]     _     `     {     {<    |     |]    ||    }     ~"
\end{alltt}
%
Note that the following identifiers and character sequences are keywords of
the now unmaintained Camlp4 system and should be avoided for backwards
compatibility reasons.
%
\begin{verbatim}
    parser    value    $     $$    $:    <:    <<    >>    ??
\end{verbatim}

\subsubsection*{sss:lex-ambiguities}{Ambiguities}

Lexical ambiguities are resolved according to the ``longest match''
rule: when a character sequence can be decomposed into two tokens in
several different ways, the decomposition retained is the one with the
longest first token.
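
The sketch below (with arbitrary names, and deliberately written without
optional blanks) shows a few consequences of this rule:

\begin{caml_example}{toplevel}
(* One identifier, not the keyword let followed by ter. *)
let letter = 1
(* A single infix symbol "<=", not "<" followed by "=". *)
let small = letter<=2
(* The keyword "::", not two ":" tokens. *)
let numbers = 1::2::[]
(* The keyword ":=", not ":" followed by "=". *)
let counter = ref 0
let () = counter:=letter+1;;
\end{caml_example}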

\subsubsection*{sss:lex-linedir}{Line number directives}

\begin{syntax}
linenum-directive:
     '#' {{ "0"\ldots"9" }} '"' { string-character } '"'
\end{syntax}

Preprocessors that generate OCaml source code can insert line number
directives in their output so that error messages produced by the
compiler contain line numbers and file names referring to the source
file before preprocessing, instead of after preprocessing.
A line number directive starts at the beginning of a line and
is composed of a @"#"@ (sharp sign), followed by
a positive integer (the source line number), followed by a
character string (the source file name).
Line number directives are treated as blanks during lexical
analysis.
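
As a sketch, a preprocessor might emit the following fragment; the file
name "parser.mly" and the function are hypothetical. With this
directive, the compiler would be expected to report errors in the
definition that follows as located in "parser.mly", starting at line 42,
rather than in the generated file:

\begin{verbatim}
# 42 "parser.mly"
let semantic_action x = x + 1
\end{verbatim}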

% FIXME spaces and tabs are allowed before and after the number
% FIXME ``string-character'' is inaccurate: everything is allowed except
%       CR, LF, and doublequote; moreover, backslash escapes are not
% interpreted (especially backslash-doublequote)
% FIXME any number of random characters are allowed (and ignored) at the
%       end of the line, except CR and LF.