author    | Tamar Christina <tamar@zhox.com> | 2018-07-18 21:03:58 +0100
committer | Tamar Christina <tamar@zhox.com> | 2018-07-18 21:20:26 +0100
commit    | d0bbe1bf351c8b85c310afb0dd1fb1f12f9474bf (patch)
tree      | ac6f4b23241bf504a98765d497c2799ff7b4982d /compiler/nativeGen
parent    | b290f15c4d01d45c41c02ae5c5333a1fab022a32 (diff)
download  | haskell-d0bbe1bf351c8b85c310afb0dd1fb1f12f9474bf.tar.gz
stack: fix stack allocations on Windows
Summary:
On Windows one is not allowed to drop the stack by more than a page at a time.
The reason for this is that the OS only commits stack memory up to the limit
the TEB specifies. After that a guard page is placed, and the rest of the
virtual address space is unmapped.
The intention is that stack allocations hit the guard page, which makes the
kernel map in the next page and move the guard down. This is also done to
prevent what the Linux world knows as stack clash
vulnerabilities https://access.redhat.com/security/cve/cve-2017-1000364.
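The mechanics above can be sketched numerically: a drop of `amount` bytes must touch every page between the old and new stack pointer, otherwise the access jumps straight over the guard page. A minimal sketch, assuming a fixed page size; the name `probeOffsets` is illustrative and not part of the patch:

```haskell
-- Hypothetical helper: the offsets below the current stack pointer that a
-- large stack drop must probe, one per page, so each access lands on the
-- guard page and the kernel can commit the next page in turn.
probeOffsets :: Int -> Int -> [Int]
probeOffsets pageSize amount = [pageSize, 2 * pageSize .. amount]
```

An 18KB drop on a 4KB-page system needs four such probes; skipping straight to the final stack pointer lands in unmapped memory.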
There are modules in GHC for which the liveness analysis thinks the
reserved 8KB of spill slots isn't enough, one being DynFlags and the
other Cabal.
Though I think the Cabal one is likely a bug:
```
4d6544: 81 ec 00 46 00 00 sub $0x4600,%esp
4d654a: 8d 85 94 fe ff ff lea -0x16c(%ebp),%eax
4d6550: 3b 83 1c 03 00 00 cmp 0x31c(%ebx),%eax
4d6556: 0f 82 de 8d 02 00 jb 4ff33a <_cLpg_info+0x7a>
4d655c: c7 45 fc 14 3d 50 00 movl $0x503d14,-0x4(%ebp)
4d6563: 8b 75 0c mov 0xc(%ebp),%esi
4d6566: 83 c5 fc add $0xfffffffc,%ebp
4d6569: 66 f7 c6 03 00 test $0x3,%si
4d656e: 0f 85 a6 d7 02 00 jne 503d1a <_cLpb_info+0x6>
4d6574: 81 c4 00 46 00 00 add $0x4600,%esp
```
It allocates nearly 18KB of spill slots for a simple 4-line function and
doesn't even use them. Note that this doesn't happen on x64 or in a
validate build, only in a build made without validate's build.mk.
This and the allocation in DynFlags mean the stack pointer can jump over
the guard page into unmapped memory, and GHC or a compiled end program
segfaults.
The page size on x86 Windows is 4KB, so these two modules hit the problem
very easily. This explains why 32-bit GHC has been dead on arrival for the
past 3 releases, and the "random" segfaults on Windows.
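For scale, the disassembly above drops the stack by 0x4600 bytes in a single `sub`. Plain arithmetic (not part of the patch) shows how far past a 4KB guard page that lands:

```haskell
-- The single `sub $0x4600,%esp` from the Cabal disassembly above, measured
-- against the 4KB x86 Windows page size.
spillBytes :: Int
spillBytes = 0x4600            -- 17920 bytes in one stack drop

pagesSkipped :: Int
pagesSkipped = spillBytes `div` 4096  -- whole pages jumped in one instruction
```

Four whole pages are skipped at once, so the guard page is never touched.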
```
0:000> bp 00503d29
0:000> gn
Breakpoint 0 hit
WARNING: Stack overflow detected. The unwound frames are extracted from outside
normal stack bounds.
eax=03b6b9c9 ebx=00dc90f0 ecx=03cac48c edx=03cac43d esi=03b6b9c9 edi=03abef40
eip=00503d29 esp=013e96fc ebp=03cf8f70 iopl=0 nv up ei pl nz na po nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000202
setup+0x103d29:
00503d29 89442440 mov dword ptr [esp+40h],eax ss:002b:013e973c=????????
WARNING: Stack overflow detected. The unwound frames are extracted from outside
normal stack bounds.
WARNING: Stack overflow detected. The unwound frames are extracted from outside
normal stack bounds.
0:000> !teb
TEB at 00384000
ExceptionList: 013effcc
StackBase: 013f0000
StackLimit: 013eb000
```
This doesn't fix the liveness analysis, but it does fix the allocations:
when dropping the stack by more than a page we emit a call to `__chkstk_ms`,
which probes the stack once per page so the kernel maps in each successive
page.
`__chkstk_ms` is provided by `libgcc`, which is covered by the GCC Runtime
Library Exception, so it's safe to link against it, even for proprietary
code. (Technically we already do, since we link compiled C code in.)
For allocations smaller than a page we drop the stack and probe the new address.
This avoids the function call and still makes sure we hit the guard if needed.
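The decision between the two strategies comes down to a threshold check per architecture. This sketch mirrors the `needs_probe_call` helper the patch adds; the standalone `Arch` type here is for illustration only:

```haskell
-- Illustrative stand-in for GHC's platform architecture type.
data Arch = ArchX86 | ArchX86_64

-- Mirrors the patch's needs_probe_call thresholds on Windows: drops larger
-- than the threshold get a chunked __chkstk_ms call, smaller ones a plain
-- sub followed by a single probe of the new stack pointer.
needsProbeCall :: Arch -> Int -> Bool
needsProbeCall ArchX86    amount = amount > 4 * 1024
needsProbeCall ArchX86_64 amount = amount > 8 * 1024
```

Anything at or under one threshold can safely use the cheap probe, since the new stack pointer is guaranteed to be within reach of the guard page.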
PS: in case anyone is wondering why we didn't notice this before: we only
test x86_64, and only on Windows 10. On x86_64 we only probe for drops larger
than 8KB, and the Windows 10 kernel is also a bit more lenient in that it
seems to catch the segfault and resize the stack if the access was unmapped:
```
0:000> t
eax=03b6b9c9 ebx=00dc90f0 ecx=03cac48c edx=03cac43d esi=03b6b9c9 edi=03abef40
eip=00503d2d esp=013e96fc ebp=03cf8f70 iopl=0 nv up ei pl nz na po nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000202
setup+0x103d2d:
00503d2d 8b461b mov eax,dword ptr [esi+1Bh] ds:002b:03b6b9e4=03cac431
0:000> !teb
TEB at 00384000
ExceptionList: 013effcc
StackBase: 013f0000
StackLimit: 013e9000
```
Likely Windows 10 has a larger guard page than previous versions.
This fixes the stack allocations, and as soon as I get the time I will look
at the liveness analysis. I find it highly unlikely that a simple Cabal
function requires ~2200 spill slots.
Test Plan: ./validate
Reviewers: simonmar, bgamari
Reviewed By: bgamari
Subscribers: AndreasK, rwbarton, thomie, carter
GHC Trac Issues: #15154
Differential Revision: https://phabricator.haskell.org/D4917
Diffstat (limited to 'compiler/nativeGen')
-rw-r--r-- | compiler/nativeGen/Instruction.hs       |  10
-rw-r--r-- | compiler/nativeGen/PPC/Instr.hs         |  16
-rw-r--r-- | compiler/nativeGen/RegAlloc/Liveness.hs |   4
-rw-r--r-- | compiler/nativeGen/X86/Instr.hs         | 100
4 files changed, 103 insertions, 27 deletions
```diff
diff --git a/compiler/nativeGen/Instruction.hs b/compiler/nativeGen/Instruction.hs
index 63b5b0df7e..0bd99fbee8 100644
--- a/compiler/nativeGen/Instruction.hs
+++ b/compiler/nativeGen/Instruction.hs
@@ -191,14 +191,12 @@ class Instruction instr where
 
         -- Subtract an amount from the C stack pointer
         mkStackAllocInstr
-                :: Platform -- TODO: remove (needed by x86/x86_64
-                            -- because they share an Instr type)
+                :: Platform
                 -> Int
-                -> instr
+                -> [instr]
 
         -- Add an amount to the C stack pointer
         mkStackDeallocInstr
-                :: Platform -- TODO: remove (needed by x86/x86_64
-                            -- because they share an Instr type)
+                :: Platform
                 -> Int
-                -> instr
+                -> [instr]
diff --git a/compiler/nativeGen/PPC/Instr.hs b/compiler/nativeGen/PPC/Instr.hs
index d21e7f8176..8eb5e8fa8d 100644
--- a/compiler/nativeGen/PPC/Instr.hs
+++ b/compiler/nativeGen/PPC/Instr.hs
@@ -77,19 +77,19 @@ instance Instruction Instr where
         mkStackDeallocInstr     = ppc_mkStackDeallocInstr
 
 
-ppc_mkStackAllocInstr :: Platform -> Int -> Instr
+ppc_mkStackAllocInstr :: Platform -> Int -> [Instr]
 ppc_mkStackAllocInstr platform amount
   = ppc_mkStackAllocInstr' platform (-amount)
 
-ppc_mkStackDeallocInstr :: Platform -> Int -> Instr
+ppc_mkStackDeallocInstr :: Platform -> Int -> [Instr]
 ppc_mkStackDeallocInstr platform amount
   = ppc_mkStackAllocInstr' platform amount
 
-ppc_mkStackAllocInstr' :: Platform -> Int -> Instr
+ppc_mkStackAllocInstr' :: Platform -> Int -> [Instr]
 ppc_mkStackAllocInstr' platform amount
   = case platformArch platform of
-      ArchPPC      -> UPDATE_SP II32 (ImmInt amount)
-      ArchPPC_64 _ -> UPDATE_SP II64 (ImmInt amount)
+      ArchPPC      -> [UPDATE_SP II32 (ImmInt amount)]
+      ArchPPC_64 _ -> [UPDATE_SP II64 (ImmInt amount)]
       _            -> panic $ "ppc_mkStackAllocInstr' "
                            ++ show (platformArch platform)
@@ -126,7 +126,7 @@ allocMoreStack platform slots (CmmProc info lbl live (ListGraph code)) = do
 
     insert_stack_insns (BasicBlock id insns)
       | Just new_blockid <- mapLookup id new_blockmap
-         = [ BasicBlock id [alloc, BCC ALWAYS new_blockid Nothing]
+         = [ BasicBlock id $ alloc ++ [BCC ALWAYS new_blockid Nothing]
            , BasicBlock new_blockid block' ]
       | otherwise
@@ -139,8 +139,8 @@ allocMoreStack platform slots (CmmProc info lbl live (ListGraph code)) = do
            -- "labeled-goto" we use JMP, and for "computed-goto" we
            -- use MTCTR followed by BCTR. See 'PPC.CodeGen.genJump'.
            = case insn of
-               JMP _           -> dealloc : insn : r
-               BCTR [] Nothing -> dealloc : insn : r
+               JMP _           -> dealloc ++ (insn : r)
+               BCTR [] Nothing -> dealloc ++ (insn : r)
                BCTR ids label  -> BCTR (map (fmap retarget) ids) label : r
                BCCFAR cond b p -> BCCFAR cond (retarget b) p : r
                BCC    cond b p -> BCC    cond (retarget b) p : r
diff --git a/compiler/nativeGen/RegAlloc/Liveness.hs b/compiler/nativeGen/RegAlloc/Liveness.hs
index 2a2990f6ce..9d93564317 100644
--- a/compiler/nativeGen/RegAlloc/Liveness.hs
+++ b/compiler/nativeGen/RegAlloc/Liveness.hs
@@ -147,10 +147,10 @@ instance Instruction instr => Instruction (InstrSR instr) where
         mkJumpInstr target      = map Instr (mkJumpInstr target)
 
         mkStackAllocInstr platform amount =
-               Instr (mkStackAllocInstr platform amount)
+               Instr <$> mkStackAllocInstr platform amount
 
         mkStackDeallocInstr platform amount =
-               Instr (mkStackDeallocInstr platform amount)
+               Instr <$> mkStackDeallocInstr platform amount
 
 
 -- | An instruction with liveness information.
diff --git a/compiler/nativeGen/X86/Instr.hs b/compiler/nativeGen/X86/Instr.hs
index ee3e64cf24..c7000c9f4b 100644
--- a/compiler/nativeGen/X86/Instr.hs
+++ b/compiler/nativeGen/X86/Instr.hs
@@ -858,25 +858,104 @@ x86_mkJumpInstr
 x86_mkJumpInstr id
         = [JXX ALWAYS id]
 
+-- Note [Windows stack layout]
+-- | On most OSes the kernel will place a guard page after the current stack
+-- page. If you allocate larger than a page worth you may jump over this
+-- guard page. Not only is this a security issue, but on certain OSes such
+-- as Windows a new page won't be allocated if you don't hit the guard. This
+-- will cause a segfault or access fault.
+--
+-- This function defines if the current allocation amount requires a probe.
+-- On Windows (for now) we emit a call to _chkstk for this. For other OSes
+-- this is not yet implemented.
+-- See https://docs.microsoft.com/en-us/windows/desktop/DevNotes/-win32-chkstk
+-- The Windows stack looks like this:
+--
+--        +-------------------+
+--        |        SP         |
+--        +-------------------+
+--        |                   |
+--        |    GUARD PAGE     |
+--        |                   |
+--        +-------------------+
+--        |                   |
+--        |                   |
+--        |     UNMAPPED      |
+--        |                   |
+--        |                   |
+--        +-------------------+
+--
+-- In essense each allocation larger than a page size needs to be chunked and
+-- a probe emitted after each page allocation.  You have to hit the guard
+-- page so the kernel can map in the next page, otherwise you'll segfault.
+--
+needs_probe_call :: Platform -> Int -> Bool
+needs_probe_call platform amount
+  = case platformOS platform of
+     OSMinGW32 -> case platformArch platform of
+                    ArchX86    -> amount > (4 * 1024)
+                    ArchX86_64 -> amount > (8 * 1024)
+                    _          -> False
+     _         -> False
 
 x86_mkStackAllocInstr
         :: Platform
         -> Int
-        -> Instr
+        -> [Instr]
 x86_mkStackAllocInstr platform amount
-  = case platformArch platform of
-      ArchX86    -> SUB II32 (OpImm (ImmInt amount)) (OpReg esp)
-      ArchX86_64 -> SUB II64 (OpImm (ImmInt amount)) (OpReg rsp)
-      _ -> panic "x86_mkStackAllocInstr"
+  = case platformOS platform of
+      OSMinGW32 ->
+        -- These will clobber AX but this should be ok because
+        --
+        -- 1. It is the first thing we do when entering the closure and AX is
+        --    a caller saved registers on Windows both on x86_64 and x86.
+        --
+        -- 2. The closures are only entered via a call or longjmp in which case
+        --    there are no expectations for volatile registers.
+        --
+        -- 3. When the target is a local branch point it is re-targeted
+        --    after the dealloc, preserving #2.  See note [extra spill slots].
+        --
+        -- We emit a call because the stack probes are quite involved and
+        -- would bloat code size a lot.  GHC doesn't really have an -Os.
+        -- __chkstk is guaranteed to leave all nonvolatile registers and AX
+        -- untouched.  It's part of the standard prologue code for any Windows
+        -- function dropping the stack more than a page.
+        -- See Note [Windows stack layout]
+        case platformArch platform of
+            ArchX86 | needs_probe_call platform amount ->
+                       [ MOV II32 (OpImm (ImmInt amount)) (OpReg eax)
+                       , CALL (Left $ strImmLit "___chkstk_ms") [eax]
+                       , SUB II32 (OpReg eax) (OpReg esp)
+                       ]
+                    | otherwise ->
+                       [ SUB II32 (OpImm (ImmInt amount)) (OpReg esp)
+                       , TEST II32 (OpReg esp) (OpReg esp)
+                       ]
+            ArchX86_64 | needs_probe_call platform amount ->
+                          [ MOV II64 (OpImm (ImmInt amount)) (OpReg rax)
+                          , CALL (Left $ strImmLit "__chkstk_ms") [rax]
+                          , SUB II64 (OpReg rax) (OpReg rsp)
+                          ]
+                       | otherwise ->
+                          [ SUB II64 (OpImm (ImmInt amount)) (OpReg rsp)
+                          , TEST II64 (OpReg rsp) (OpReg rsp)
+                          ]
+            _ -> panic "x86_mkStackAllocInstr"
+      _ ->
+        case platformArch platform of
+          ArchX86    -> [ SUB II32 (OpImm (ImmInt amount)) (OpReg esp) ]
+          ArchX86_64 -> [ SUB II64 (OpImm (ImmInt amount)) (OpReg rsp) ]
+          _ -> panic "x86_mkStackAllocInstr"
 
 x86_mkStackDeallocInstr
         :: Platform
         -> Int
-        -> Instr
+        -> [Instr]
 x86_mkStackDeallocInstr platform amount
   = case platformArch platform of
-      ArchX86    -> ADD II32 (OpImm (ImmInt amount)) (OpReg esp)
-      ArchX86_64 -> ADD II64 (OpImm (ImmInt amount)) (OpReg rsp)
+      ArchX86    -> [ADD II32 (OpImm (ImmInt amount)) (OpReg esp)]
+      ArchX86_64 -> [ADD II64 (OpImm (ImmInt amount)) (OpReg rsp)]
       _ -> panic "x86_mkStackDeallocInstr"
 
 i386_insert_ffrees
@@ -996,7 +1075,7 @@ allocMoreStack platform slots proc@(CmmProc info lbl live (ListGraph code)) = do
 
      insert_stack_insns (BasicBlock id insns)
         | Just new_blockid <- mapLookup id new_blockmap
-        = [ BasicBlock id [alloc, JXX ALWAYS new_blockid]
+        = [ BasicBlock id $ alloc ++ [JXX ALWAYS new_blockid]
           , BasicBlock new_blockid block' ]
        | otherwise = [ BasicBlock id block' ]
@@ -1004,7 +1083,7 @@ allocMoreStack platform slots proc@(CmmProc info lbl live (ListGraph code)) = do
 
      block' = foldr insert_dealloc [] insns
 
      insert_dealloc insn r = case insn of
-        JMP _ _     -> dealloc : insn : r
+        JMP _ _     -> dealloc ++ (insn : r)
        JXX_GBL _ _ -> panic "insert_dealloc: cannot handle JXX_GBL"
        _other      -> x86_patchJumpInstr insn retarget : r
          where retarget b = fromMaybe b (mapLookup b new_blockmap)
@@ -1013,7 +1092,6 @@ allocMoreStack platform slots proc@(CmmProc info lbl live (ListGraph code)) = do
      -- in
      return (CmmProc info lbl live (ListGraph new_code))
 
-
 data JumpDest = DestBlockId BlockId | DestImm Imm
 
 getJumpDestBlockId :: JumpDest -> Maybe BlockId
```