libraries/ghc-prim/GHC/CString.hs


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330

{-# LANGUAGE MagicHash, NoImplicitPrelude, BangPatterns #-}

-----------------------------------------------------------------------------
-- |
-- Module      :  GHC.CString
-- Copyright   :  (c) The University of Glasgow 2011
-- License     :  see libraries/ghc-prim/LICENSE
--
-- Maintainer  :  cvs-ghc@haskell.org
-- Stability   :  internal
-- Portability :  non-portable (GHC Extensions)
--
-- GHC C strings definitions (previously in GHC.Base).
-- Use GHC.Exts from the base package instead of importing this
-- module directly.
--
-----------------------------------------------------------------------------

module GHC.CString (
        -- * Ascii variants
        unpackCString#, unpackAppendCString#, unpackFoldrCString#,

        -- * Utf variants
        unpackCStringUtf8#, unpackAppendCStringUtf8#, unpackFoldrCStringUtf8#,

        -- * Other
        unpackNBytes#,
    ) where

import GHC.Types
import GHC.Prim

{-
Note [String literals in GHC]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
String literals get quite a bit of special handling in GHC.  This Note
summarises the moving parts.

* Desugaring: see GHC.HsToCore.Match.Literal.dsLit, which in
  turn calls GHC.Core.Make.mkStringExprFS.

  The desugarer desugars the Haskell literal "foo" into Core
     GHC.CString.unpackCString# "foo"#
  where "foo"# is primitive string literal (of type Addr#).

  When the string cannot be encoded as a C string, we use UTF8:
     GHC.CString.unpackCStringUtf8# "foo"#

* The library module ghc-prim:GHC.CString has a bunch of functions that
  work over primitive strings, including GHC.CString.unpackCString#

* GHC.Core.Op.ConstantFold has some RULES that optimise treatment of
  literal strings. These include things like:

    + Special handling of elem over string literals.
    + Constant folding the desugared form of ("foo" ++ "bar")
      into ("foobar")
    + and more

* GHC.Base has a number of regular rules for String literals.

  + a rule "eqString": (==) @String = eqString
    where GHC.Base.eqString :: String -> String -> Bool

    ConstantFold has a RULE for eqString on literals:
     eqString (Lit "foo"#) (Lit "bar"#) --> False

    This allows compile time evaluation of things like "foo" == "bar"

  + A bunch of rules to promote fusion:

    "unpack"       [~1] forall a   . unpackCString# a             = build (unpackFoldrCString# a)
    "unpack-list"  [1]  forall a   . unpackFoldrCString# a (:) [] = unpackCString# a
    "unpack-append"     forall a n . unpackFoldrCString# a (:) n  = unpackAppendCString# a n

    And UTF8 variants of these rules.

* We allow primitive (unlifted) literal strings to be top-level
  bindings, breaking out usual rule.  See GHC.Core
  Note [Core top-level string literals]

* TODO: There is work on a special code-gen path for top-level boxed strings
     str :: [Char]
     str = unpackCString# "foo"#
  so that they can all share a common code pointer

  There is a WIP MR on gitlab for this: !3012

-}

-----------------------------------------------------------------------------
-- Unpacking C strings
-----------------------------------------------------------------------------

-- This code is needed for virtually all programs, since it's used for
-- unpacking the strings of error messages.

-- Used to be in GHC.Base, but was moved to ghc-prim because the new generics
-- stuff uses Strings in the representation, so to give representations for
-- ghc-prim types we need unpackCString#

{- Note [Inlining unpackCString#]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
There's really no point in ever inlining things like unpackCString# as the loop
doesn't specialise in an interesting way and we can't deforest the list
constructors (we'd want to use unpackFoldrCString# for this). Moreover, it's
pretty small, so there's a danger that it'll be inlined at every literal, which
is a waste.

Moreover, inlining early may interfere with a variety of rules that are supposed
to match unpackCString#,

 * BuiltInRules in GHC.Core.Op.ConstantFold; e.g.
       eqString (unpackCString# (Lit s1)) (unpackCString# (Lit s2)
          = s1 == s2

 * unpacking rules; e.g. in GHC.Base,
       unpackCString# a
          = build (unpackFoldrCString# a)

 * stream fusion rules; e.g. in the `text` library,
       unstream (S.map safe (S.streamList (GHC.unpackCString# a)))
          = unpackCString# a

Moreover, we want to make it CONLIKE, so that:

* the rules in GHC.Core.Op.ConstantFold will fire when the string is let-bound.
  E.g. the eqString rule in GHC.Core.Op.ConstantFold
   eqString (unpackCString# (Lit s1)) (unpackCString# (Lit s2) = s1==s2

* exprIsConApp_maybe will see the string when we have
     let x = unpackCString# "foo"#
     ...(case x of algs)...

All of this goes for unpackCStringUtf8# too.
-}

{- Note [Inlining of unpackFoldrCString]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Usually the unpack-list rule turns unpackFoldrCString# into unpackCString#
It also has a BuiltInRule in PrelRules.hs:
     unpackFoldrCString# "foo" c (unpackFoldrCString# "baz" c n)
       =  unpackFoldrCString# "foobaz" c n

We use NOINLINE [0] on the grounds that, unlike
unpackCString#, there *is* some point in inlining
unpackFoldrCString#, because we get better code for the
higher-order function call.

This can cause a code size increase but it was minimal
when looking at nofib.

  Note [unpackCString# iterating over addr]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When unpacking unpackCString# and friends repeatedly return a cons cell
containing:
* The current character we just unpacked.
* A thunk to unpack the rest of the string.

In order to minimize the size of the thunk we do not index of
the start of the string, offsetting into it, but instead increment
the addr and always use offset 0#.

This works since these two expressions will read from the same address.
* `indexCharOffAddr# a i`
* `indexCharOffAddr (a `plusAddr#` i) 0#`

This way we avoid the need for the thunks to close over both the start of
the string and the current offset, saving a word for each character unpacked.

This has the additional advantage the we can guarantee that  only the
increment will happen in the loop.
If we use the offset start off with the increment and an addition
to get the real address. Which might not be optimized aways.

-}

unpackCString# :: Addr# -> [Char]
{-# NOINLINE CONLIKE unpackCString# #-}
unpackCString# addr
    | isTrue# (ch `eqChar#` '\0'#) = []
    | True                         = C# ch : unpackCString# (addr `plusAddr#` 1#)
      where
        -- See Note [unpackCString# iterating over addr]
        !ch = indexCharOffAddr# addr 0#


unpackAppendCString# :: Addr# -> [Char] -> [Char]
{-# NOINLINE unpackAppendCString# #-}
     -- See the NOINLINE note on unpackCString#
unpackAppendCString# addr rest
    | isTrue# (ch `eqChar#` '\0'#) = rest
    | True                         = C# ch : unpackAppendCString# (addr `plusAddr#` 1#) rest
      where
        -- See Note [unpackCString# iterating over addr]
        !ch = indexCharOffAddr# addr 0#

-- See [Inlining of unpackFoldrCString]
{-# NOINLINE[0] unpackFoldrCString# #-}
unpackFoldrCString# :: Addr# -> (Char -> a -> a) -> a -> a
unpackFoldrCString# str f z_init = go str z_init
  where
    go addr z
      | isTrue# (ch `eqChar#` '\0'#) = z
      | True                         = C# ch `f` go (addr `plusAddr#` 1#) z
      where
        -- See Note [unpackCString# iterating over addr]
        !ch = indexCharOffAddr# addr 0#

-- There's really no point in inlining this for the same reasons as
-- unpackCString. See Note [Inlining unpackCString#] above for details.
unpackCStringUtf8# :: Addr# -> [Char]
{-# NOINLINE CONLIKE unpackCStringUtf8# #-}
unpackCStringUtf8# addr
    | isTrue# (ch `eqChar#` '\0'#  ) = []
    | True =
        let !byte_count = getByteCount ch
            !utf_ch = unpackUtf8Char# byte_count ch addr
            !addr' = addr `plusBytes` byte_count
        in  C# utf_ch : unpackCStringUtf8# addr'
      where
        -- See Note [unpackCString# iterating over addr]
        !ch = indexCharOffAddr# addr 0#


unpackAppendCStringUtf8# :: Addr# -> [Char] -> [Char]
{-# NOINLINE unpackAppendCStringUtf8# #-}
     -- See the NOINLINE note on unpackCString#
unpackAppendCStringUtf8# addr rest
    | isTrue# (ch `eqChar#` '\0'#) = rest
    | True =
        let !byte_count = getByteCount ch
            !utf_ch = unpackUtf8Char# byte_count ch addr
            !addr' = (addr `plusBytes` byte_count)
        in  C# utf_ch : unpackAppendCStringUtf8# addr' rest
      where
        -- See Note [unpackCString# iterating over addr]
        !ch = indexCharOffAddr# addr 0#

-- See Note [Inlining of unpackFoldrCString]
{-# NOINLINE[0] unpackFoldrCStringUtf8# #-}
unpackFoldrCStringUtf8# :: Addr# -> (Char -> a -> a) -> a -> a
unpackFoldrCStringUtf8# addr_init f z_init
  = go addr_init z_init
  where
    go addr z
      | isTrue# (ch `eqChar#` '\0'#) = z
      | True =
          let !byte_count = getByteCount ch
              !utf_ch = unpackUtf8Char# byte_count ch addr
              !addr' = (addr `plusBytes` byte_count)
          in C# utf_ch `f` go addr' z
      where
        -- See Note [unpackCString# iterating over addr]
        !ch = indexCharOffAddr# addr 0#

-- There's really no point in inlining this for the same reasons as
-- unpackCString. See Note [Inlining unpackCString#] above for details.
unpackNBytes# :: Addr# -> Int# -> [Char]
{-# NOINLINE unpackNBytes# #-}
unpackNBytes# _addr 0#   = []
unpackNBytes#  addr len# = unpack [] (len# -# 1#)
    where
     unpack :: [Char] -> Int# -> [Char]
     unpack acc i#
      | isTrue# (i# <# 0#)  = acc
      | True                =
         case indexCharOffAddr# addr i# of
            ch -> unpack (C# ch : acc) (i# -# 1#)


------------------------------
--- UTF8 decoding utilities
------------------------------
--
-- These functions make explicit the logic that was originally
-- part of unpackCStringUtf8. Since we want the same support for ascii
-- and non-ascii a variety of functions needs the same logic. Instead
-- of C&P'in the decoding logic all over we have it here once, and then
-- force GHC to inline it.
--
-- All the overhead of the Bytes argument and calls goes away once all is
-- said and done. And what remains is readable code in Haskell land and
-- performant code in the resulting binary.

data Bytes = One | Two | Three | Four

{-# INLINE getByteCount #-}
getByteCount :: Char# -> Bytes
getByteCount ch
    | isTrue# (ch `leChar#` '\x7F'#) = One
    | isTrue# (ch `leChar#` '\xDF'#) = Two
    | isTrue# (ch `leChar#` '\xEF'#) = Three
    | True                           = Four

{-# INLINE plusBytes #-}
plusBytes :: Addr# -> Bytes -> Addr#
plusBytes addr bytes =
  case bytes of
    One   -> addr `plusAddr#` 1#
    Two   -> addr `plusAddr#` 2#
    Three -> addr `plusAddr#` 3#
    Four  -> addr `plusAddr#` 4#

-- | Take the current address, read unicode char of the given size.
-- We obviously want the number of bytes, but we have to read one
-- byte to determine the number of bytes for the current codepoint
-- so we might as well reuse it and avoid a read.
--
-- Side Note: We don't dare to decode all 4 possibilities at once.
-- Reading past the end of the addr might trigger an exception.
-- For this reason we really have to check the width first and only
-- decode after.
{-# INLINE unpackUtf8Char# #-}
unpackUtf8Char# :: Bytes -> Char# -> Addr# -> Char#
unpackUtf8Char# bytes ch addr =
  case bytes of
    One -> ch
    Two ->   (chr# (((ord# ch                                           -# 0xC0#) `uncheckedIShiftL#`  6#) +#
                     (ord# (indexCharOffAddr# (addr `plusAddr#` 1#) 0#) -# 0x80#)))
    Three -> (chr# (((ord# ch                                           -# 0xE0#) `uncheckedIShiftL#` 12#) +#
                    ((ord# (indexCharOffAddr# (addr `plusAddr#` 1#) 0#) -# 0x80#) `uncheckedIShiftL#`  6#) +#
                     (ord# (indexCharOffAddr# (addr `plusAddr#` 2#) 0#) -# 0x80#)))
    Four ->  (chr# (((ord# ch                                           -# 0xF0#) `uncheckedIShiftL#` 18#) +#
                    ((ord# (indexCharOffAddr# (addr `plusAddr#` 1#) 0#) -# 0x80#) `uncheckedIShiftL#` 12#) +#
                    ((ord# (indexCharOffAddr# (addr `plusAddr#` 2#) 0#) -# 0x80#) `uncheckedIShiftL#`  6#) +#
                     (ord# (indexCharOffAddr# (addr `plusAddr#` 3#) 0#) -# 0x80#)))