summaryrefslogtreecommitdiff
path: root/libraries/base/GHC/IO/Encoding/UTF16.hs
diff options
context:
space:
mode:
authorSimon Marlow <marlowsd@gmail.com>2009-06-12 13:56:31 +0000
committerSimon Marlow <marlowsd@gmail.com>2009-06-12 13:56:31 +0000
commit7b067f2d08d4968ab323405828fa2f053964ffb0 (patch)
tree6e316f1c66e51db05bd3b25ebc3ad0c3dfa78c16 /libraries/base/GHC/IO/Encoding/UTF16.hs
parent16f5710341ce34644311213c2006b4fb1ac4eaa8 (diff)
downloadhaskell-7b067f2d08d4968ab323405828fa2f053964ffb0.tar.gz
Rewrite of the IO library, including Unicode support
Highlights: * Unicode support for Handle I/O: ** Automatic encoding and decoding using a per-Handle encoding. ** The encoding defaults to the locale encoding (only on Unix so far, perhaps Windows later). ** Built-in UTF-8, UTF-16 (BE/LE), and UTF-32 (BE/LE) codecs. ** iconv-based codec for other encodings on Unix * Modularity: the low-level IO interface is exposed as a type class (GHC.IO.IODevice) so you can build your own low-level IO providers and make Handles from them. * Newline translation: instead of being Windows-specific wired-in magic, the translation from \r\n -> \n and back again is available on all platforms and is configurable for reading/writing independently. Unicode-aware Handles ~~~~~~~~~~~~~~~~~~~~~ This is a significant restructuring of the Handle implementation with the primary goal of supporting Unicode character encodings. The only change to the existing behaviour is that by default, text IO is done in the prevailing locale encoding of the system (except on Windows [1]). Handles created by openBinaryFile use the Latin-1 encoding, as do Handles placed in binary mode using hSetBinaryMode. We provide a way to change the encoding for an existing Handle: GHC.IO.Handle.hSetEncoding :: Handle -> TextEncoding -> IO () and various encodings (from GHC.IO.Encoding): latin1, utf8, utf16, utf16le, utf16be, utf32, utf32le, utf32be, localeEncoding, and a way to lookup other encodings: GHC.IO.Encoding.mkTextEncoding :: String -> IO TextEncoding (it's system-dependent whether the requested encoding will be available). We may want to export these from somewhere more permanent; that's a topic for a future library proposal. Thanks to suggestions from Duncan Coutts, it's possible to call hSetEncoding even on buffered read Handles, and the right thing happens. So we can read from text streams that include multiple encodings, such as an HTTP response or email message, without having to turn buffering off (though there is a penalty for switching encodings on a buffered Handle, as the IO system has to do some re-decoding to figure out where it should start reading from again). If there is a decoding error, it is reported when an attempt is made to read the offending character from the Handle, as you would expect. Performance varies. For "hGetContents >>= putStr" I found the new library was faster on my x86_64 machine, but slower on an x86. On the whole I'd expect things to be a bit slower due to the extra decoding/encoding, but probabaly not noticeably. If performance is critical for your app, then you should be using bytestring and text anyway. [1] Note: locale encoding is not currently implemented on Windows due to the built-in Win32 APIs for encoding/decoding not being sufficient for our purposes. Ask me for details. Offers of help gratefully accepted. Newline Translation ~~~~~~~~~~~~~~~~~~~ In the old IO library, text-mode Handles on Windows had automatic translation from \r\n -> \n on input, and the opposite on output. It was implemented using the underlying CRT functions, which meant that there were certain odd restrictions, such as read/write text handles needing to be unbuffered, and seeking not working at all on text Handles. In the rewrite, newline translation is now implemented in the upper layers, as it needs to be since we have to perform Unicode decoding before newline translation. This means that it is now available on all platforms, which can be quite handy for writing portable code. For now, I have left the behaviour as it was, namely \r\n -> \n on Windows, and no translation on Unix. However, another reasonable default (similar to what Python does) would be to do \r\n -> \n on input, and convert to the platform-native representation (either \r\n or \n) on output. This is called universalNewlineMode (below). The API is as follows. (available from GHC.IO.Handle for now, again this is something we will probably want to try to get into System.IO at some point): -- | The representation of a newline in the external file or stream. data Newline = LF -- ^ "\n" | CRLF -- ^ "\r\n" deriving Eq -- | Specifies the translation, if any, of newline characters between -- internal Strings and the external file or stream. Haskell Strings -- are assumed to represent newlines with the '\n' character; the -- newline mode specifies how to translate '\n' on output, and what to -- translate into '\n' on input. data NewlineMode = NewlineMode { inputNL :: Newline, -- ^ the representation of newlines on input outputNL :: Newline -- ^ the representation of newlines on output } deriving Eq -- | The native newline representation for the current platform nativeNewline :: Newline -- | Map "\r\n" into "\n" on input, and "\n" to the native newline -- represetnation on output. This mode can be used on any platform, and -- works with text files using any newline convention. The downside is -- that @readFile a >>= writeFile b@ might yield a different file. universalNewlineMode :: NewlineMode universalNewlineMode = NewlineMode { inputNL = CRLF, outputNL = nativeNewline } -- | Use the native newline representation on both input and output nativeNewlineMode :: NewlineMode nativeNewlineMode = NewlineMode { inputNL = nativeNewline, outputNL = nativeNewline } -- | Do no newline translation at all. noNewlineTranslation :: NewlineMode noNewlineTranslation = NewlineMode { inputNL = LF, outputNL = LF } -- | Change the newline translation mode on the Handle. hSetNewlineMode :: Handle -> NewlineMode -> IO () IO Devices ~~~~~~~~~~ The major change here is that the implementation of the Handle operations is separated from the underlying IO device, using type classes. File descriptors are just one IO provider; I have also implemented memory-mapped files (good for random-access read/write) and a Handle that pipes output to a Chan (useful for testing code that writes to a Handle). New kinds of Handle can be implemented outside the base package, for instance someone could write bytestringToHandle. A Handle is made using mkFileHandle: -- | makes a new 'Handle' mkFileHandle :: (IODevice dev, BufferedIO dev, Typeable dev) => dev -- ^ the underlying IO device, which must support -- 'IODevice', 'BufferedIO' and 'Typeable' -> FilePath -- ^ a string describing the 'Handle', e.g. the file -- path for a file. Used in error messages. -> IOMode -- ^ The mode in which the 'Handle' is to be used -> Maybe TextEncoding -- ^ text encoding to use, if any -> NewlineMode -- ^ newline translation mode -> IO Handle This also means that someone can write a completely new IO implementation on Windows based on native Win32 HANDLEs, and distribute it as a separate package (I really hope somebody does this!). This restructuring isn't as radical as previous designs. I haven't made any attempt to make a separate binary I/O layer, for example (although hGetBuf/hPutBuf do bypass the text encoding and newline translation). The main goal here was to get Unicode support in, and to allow others to experiment with making new kinds of Handle. We could split up the layers further later. API changes and Module structure ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ NB. GHC.IOBase and GHC.Handle are now DEPRECATED (they are still present, but are just re-exporting things from other modules now). For 6.12 we'll want to bump base to version 5 and add a base4-compat. For now I'm using #if __GLASGOW_HASKEL__ >= 611 to avoid deprecated warnings. I split modules into smaller parts in many places. For example, we now have GHC.IORef, GHC.MVar and GHC.IOArray containing the implementations of IORef, MVar and IOArray respectively. This was necessary for untangling dependencies, but it also makes things easier to follow. The new module structurue for the IO-relatied parts of the base package is: GHC.IO Implementation of the IO monad; unsafe*; throw/catch GHC.IO.IOMode The IOMode type GHC.IO.Buffer Buffers and operations on them GHC.IO.Device The IODevice and RawIO classes. GHC.IO.BufferedIO The BufferedIO class. GHC.IO.FD The FD type, with instances of IODevice, RawIO and BufferedIO. GHC.IO.Exception IO-related Exceptions GHC.IO.Encoding The TextEncoding type; built-in TextEncodings; mkTextEncoding GHC.IO.Encoding.Types GHC.IO.Encoding.Iconv GHC.IO.Encoding.Latin1 GHC.IO.Encoding.UTF8 GHC.IO.Encoding.UTF16 GHC.IO.Encoding.UTF32 Implementation internals for GHC.IO.Encoding GHC.IO.Handle The main API for GHC's Handle implementation, provides all the Handle operations + mkFileHandle + hSetEncoding. GHC.IO.Handle.Types GHC.IO.Handle.Internals GHC.IO.Handle.Text Implementation of Handles and operations. GHC.IO.Handle.FD Parts of the Handle API implemented by file-descriptors: openFile, stdin, stdout, stderr, fdToHandle etc.
Diffstat (limited to 'libraries/base/GHC/IO/Encoding/UTF16.hs')
-rw-r--r--libraries/base/GHC/IO/Encoding/UTF16.hs310
1 files changed, 310 insertions, 0 deletions
diff --git a/libraries/base/GHC/IO/Encoding/UTF16.hs b/libraries/base/GHC/IO/Encoding/UTF16.hs
new file mode 100644
index 0000000000..e3801c0937
--- /dev/null
+++ b/libraries/base/GHC/IO/Encoding/UTF16.hs
@@ -0,0 +1,310 @@
+{-# OPTIONS_GHC -fno-implicit-prelude -funbox-strict-fields #-}
+{-# LANGUAGE BangPatterns #-}
+-----------------------------------------------------------------------------
+-- |
+-- Module : GHC.IO.Encoding.UTF16
+-- Copyright : (c) The University of Glasgow, 2009
+-- License : see libraries/base/LICENSE
+--
+-- Maintainer : libraries@haskell.org
+-- Stability : internal
+-- Portability : non-portable
+--
+-- UTF-16 Codecs for the IO library
+--
+-- Portions Copyright : (c) Tom Harper 2008-2009,
+-- (c) Bryan O'Sullivan 2009,
+-- (c) Duncan Coutts 2009
+--
+-----------------------------------------------------------------------------
+
+module GHC.IO.Encoding.UTF16 (
+ utf16,
+ utf16_decode,
+ utf16_encode,
+
+ utf16be,
+ utf16be_decode,
+ utf16be_encode,
+
+ utf16le,
+ utf16le_decode,
+ utf16le_encode,
+ ) where
+
+import GHC.Base
+import GHC.Real
+import GHC.Num
+import GHC.IO
+import GHC.IO.Exception
+import GHC.IO.Buffer
+import GHC.IO.Encoding.Types
+import GHC.Word
+import Data.Bits
+import Data.Maybe
+import GHC.IORef
+
+#if DEBUG
+import System.Posix.Internals
+import Foreign.C
+import GHC.Show
+
+puts :: String -> IO ()
+puts s = do withCStringLen (s++"\n") $ \(p,len) ->
+ c_write 1 p (fromIntegral len)
+ return ()
+#endif
+
+-- -----------------------------------------------------------------------------
+-- The UTF-16 codec: either UTF16BE or UTF16LE with a BOM
+
+utf16 :: TextEncoding
+utf16 = TextEncoding { mkTextDecoder = utf16_DF,
+ mkTextEncoder = utf16_EF }
+
+utf16_DF :: IO TextDecoder
+utf16_DF = do
+ seen_bom <- newIORef Nothing
+ return (BufferCodec (utf16_decode seen_bom) (return ()))
+
+utf16_EF :: IO TextEncoder
+utf16_EF = do
+ done_bom <- newIORef False
+ return (BufferCodec (utf16_encode done_bom) (return ()))
+
+utf16_encode :: IORef Bool -> EncodeBuffer
+utf16_encode done_bom input
+ output@Buffer{ bufRaw=oraw, bufL=_, bufR=ow, bufSize=os }
+ = do
+ b <- readIORef done_bom
+ if b then utf16_native_encode input output
+ else if os - ow < 2
+ then return (input,output)
+ else do
+ writeIORef done_bom True
+ writeWord8Buf oraw ow bom1
+ writeWord8Buf oraw (ow+1) bom2
+ utf16_native_encode input output{ bufR = ow+2 }
+
+utf16_decode :: IORef (Maybe DecodeBuffer) -> DecodeBuffer
+utf16_decode seen_bom
+ input@Buffer{ bufRaw=iraw, bufL=ir, bufR=iw, bufSize=_ }
+ output
+ = do
+ mb <- readIORef seen_bom
+ case mb of
+ Just decode -> decode input output
+ Nothing ->
+ if iw - ir < 2 then return (input,output) else do
+ c0 <- readWord8Buf iraw ir
+ c1 <- readWord8Buf iraw (ir+1)
+ case () of
+ _ | c0 == bomB && c1 == bomL -> do
+ writeIORef seen_bom (Just utf16be_decode)
+ utf16be_decode input{ bufL= ir+2 } output
+ | c0 == bomL && c1 == bomB -> do
+ writeIORef seen_bom (Just utf16le_decode)
+ utf16le_decode input{ bufL= ir+2 } output
+ | otherwise -> do
+ writeIORef seen_bom (Just utf16_native_decode)
+ utf16_native_decode input output
+
+
+bomB, bomL, bom1, bom2 :: Word8
+bomB = 0xfe
+bomL = 0xff
+
+-- choose UTF-16BE by default for UTF-16 output
+utf16_native_decode :: DecodeBuffer
+utf16_native_decode = utf16be_decode
+
+utf16_native_encode :: EncodeBuffer
+utf16_native_encode = utf16be_encode
+
+bom1 = bomB
+bom2 = bomL
+
+-- -----------------------------------------------------------------------------
+-- UTF16LE and UTF16BE
+
+utf16be :: TextEncoding
+utf16be = TextEncoding { mkTextDecoder = utf16be_DF,
+ mkTextEncoder = utf16be_EF }
+
+utf16be_DF :: IO TextDecoder
+utf16be_DF = return (BufferCodec utf16be_decode (return ()))
+
+utf16be_EF :: IO TextEncoder
+utf16be_EF = return (BufferCodec utf16be_encode (return ()))
+
+
+utf16le :: TextEncoding
+utf16le = TextEncoding { mkTextDecoder = utf16le_DF,
+ mkTextEncoder = utf16le_EF }
+
+utf16le_DF :: IO TextDecoder
+utf16le_DF = return (BufferCodec utf16le_decode (return ()))
+
+utf16le_EF :: IO TextEncoder
+utf16le_EF = return (BufferCodec utf16le_encode (return ()))
+
+
+
+utf16be_decode :: DecodeBuffer
+utf16be_decode
+ input@Buffer{ bufRaw=iraw, bufL=ir0, bufR=iw, bufSize=_ }
+ output@Buffer{ bufRaw=oraw, bufL=_, bufR=ow0, bufSize=os }
+ = let
+ loop !ir !ow
+ | ow >= os || ir >= iw = done ir ow
+ | ir + 1 == iw = done ir ow
+ | otherwise = do
+ c0 <- readWord8Buf iraw ir
+ c1 <- readWord8Buf iraw (ir+1)
+ let x1 = fromIntegral c0 `shiftL` 8 + fromIntegral c1
+ if validate1 x1
+ then do writeCharBuf oraw ow (unsafeChr (fromIntegral x1))
+ loop (ir+2) (ow+1)
+ else if iw - ir < 4 then done ir ow else do
+ c2 <- readWord8Buf iraw (ir+2)
+ c3 <- readWord8Buf iraw (ir+3)
+ let x2 = fromIntegral c2 `shiftL` 8 + fromIntegral c3
+ if not (validate2 x1 x2) then invalid else do
+ writeCharBuf oraw ow (chr2 x1 x2)
+ loop (ir+4) (ow+1)
+ where
+ invalid = if ir > ir0 then done ir ow else ioe_decodingError
+
+ -- lambda-lifted, to avoid thunks being built in the inner-loop:
+ done !ir !ow = return (if ir == iw then input{ bufL=0, bufR=0 }
+ else input{ bufL=ir },
+ output{ bufR=ow })
+ in
+ loop ir0 ow0
+
+utf16le_decode :: DecodeBuffer
+utf16le_decode
+ input@Buffer{ bufRaw=iraw, bufL=ir0, bufR=iw, bufSize=_ }
+ output@Buffer{ bufRaw=oraw, bufL=_, bufR=ow0, bufSize=os }
+ = let
+ loop !ir !ow
+ | ow >= os || ir >= iw = done ir ow
+ | ir + 1 == iw = done ir ow
+ | otherwise = do
+ c0 <- readWord8Buf iraw ir
+ c1 <- readWord8Buf iraw (ir+1)
+ let x1 = fromIntegral c1 `shiftL` 8 + fromIntegral c0
+ if validate1 x1
+ then do writeCharBuf oraw ow (unsafeChr (fromIntegral x1))
+ loop (ir+2) (ow+1)
+ else if iw - ir < 4 then done ir ow else do
+ c2 <- readWord8Buf iraw (ir+2)
+ c3 <- readWord8Buf iraw (ir+3)
+ let x2 = fromIntegral c3 `shiftL` 8 + fromIntegral c2
+ if not (validate2 x1 x2) then invalid else do
+ writeCharBuf oraw ow (chr2 x1 x2)
+ loop (ir+4) (ow+1)
+ where
+ invalid = if ir > ir0 then done ir ow else ioe_decodingError
+
+ -- lambda-lifted, to avoid thunks being built in the inner-loop:
+ done !ir !ow = return (if ir == iw then input{ bufL=0, bufR=0 }
+ else input{ bufL=ir },
+ output{ bufR=ow })
+ in
+ loop ir0 ow0
+
+ioe_decodingError :: IO a
+ioe_decodingError = ioException
+ (IOError Nothing InvalidArgument "utf16_decode"
+ "invalid UTF-16 byte sequence" Nothing Nothing)
+
+utf16be_encode :: EncodeBuffer
+utf16be_encode
+ input@Buffer{ bufRaw=iraw, bufL=ir0, bufR=iw, bufSize=_ }
+ output@Buffer{ bufRaw=oraw, bufL=_, bufR=ow0, bufSize=os }
+ = let
+ done !ir !ow = return (if ir == iw then input{ bufL=0, bufR=0 }
+ else input{ bufL=ir },
+ output{ bufR=ow })
+ loop !ir !ow
+ | ir >= iw = done ir ow
+ | os - ow < 2 = done ir ow
+ | otherwise = do
+ (c,ir') <- readCharBuf iraw ir
+ case ord c of
+ x | x < 0x10000 -> do
+ writeWord8Buf oraw ow (fromIntegral (x `shiftR` 8))
+ writeWord8Buf oraw (ow+1) (fromIntegral x)
+ loop ir' (ow+2)
+ | otherwise -> do
+ if os - ow < 4 then done ir ow else do
+ let
+ n1 = x - 0x10000
+ c1 = fromIntegral (n1 `shiftR` 18 + 0xD8)
+ c2 = fromIntegral (n1 `shiftR` 10)
+ n2 = n1 .&. 0x3FF
+ c3 = fromIntegral (n2 `shiftR` 8 + 0xDC)
+ c4 = fromIntegral n2
+ --
+ writeWord8Buf oraw ow c1
+ writeWord8Buf oraw (ow+1) c2
+ writeWord8Buf oraw (ow+2) c3
+ writeWord8Buf oraw (ow+3) c4
+ loop ir' (ow+4)
+ in
+ loop ir0 ow0
+
+utf16le_encode :: EncodeBuffer
+utf16le_encode
+ input@Buffer{ bufRaw=iraw, bufL=ir0, bufR=iw, bufSize=_ }
+ output@Buffer{ bufRaw=oraw, bufL=_, bufR=ow0, bufSize=os }
+ = let
+ done !ir !ow = return (if ir == iw then input{ bufL=0, bufR=0 }
+ else input{ bufL=ir },
+ output{ bufR=ow })
+ loop !ir !ow
+ | ir >= iw = done ir ow
+ | os - ow < 2 = done ir ow
+ | otherwise = do
+ (c,ir') <- readCharBuf iraw ir
+ case ord c of
+ x | x < 0x10000 -> do
+ writeWord8Buf oraw ow (fromIntegral x)
+ writeWord8Buf oraw (ow+1) (fromIntegral (x `shiftR` 8))
+ loop ir' (ow+2)
+ | otherwise ->
+ if os - ow < 4 then done ir ow else do
+ let
+ n1 = x - 0x10000
+ c1 = fromIntegral (n1 `shiftR` 18 + 0xD8)
+ c2 = fromIntegral (n1 `shiftR` 10)
+ n2 = n1 .&. 0x3FF
+ c3 = fromIntegral (n2 `shiftR` 8 + 0xDC)
+ c4 = fromIntegral n2
+ --
+ writeWord8Buf oraw ow c2
+ writeWord8Buf oraw (ow+1) c1
+ writeWord8Buf oraw (ow+2) c4
+ writeWord8Buf oraw (ow+3) c3
+ loop ir' (ow+4)
+ in
+ loop ir0 ow0
+
+chr2 :: Word16 -> Word16 -> Char
+chr2 (W16# a#) (W16# b#) = C# (chr# (upper# +# lower# +# 0x10000#))
+ where
+ !x# = word2Int# a#
+ !y# = word2Int# b#
+ !upper# = uncheckedIShiftL# (x# -# 0xD800#) 10#
+ !lower# = y# -# 0xDC00#
+{-# INLINE chr2 #-}
+
+validate1 :: Word16 -> Bool
+validate1 x1 = (x1 >= 0 && x1 < 0xD800) || x1 > 0xDFFF
+{-# INLINE validate1 #-}
+
+validate2 :: Word16 -> Word16 -> Bool
+validate2 x1 x2 = x1 >= 0xD800 && x1 <= 0xDBFF &&
+ x2 >= 0xDC00 && x2 <= 0xDFFF
+{-# INLINE validate2 #-}