| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* In stg_ap_0_fast, if we're evaluating a thunk, the thunk might
evaluate to a function in which case we may have to adjust its CCS.
* The interpreter has its own implementation of stg_ap_0_fast, so we
have to do the same shenanigans with creating empty PAPs and copying
PAPs there.
* GHCi creates Cost Centres as children of CCS_MAIN, which enterFunCCS()
wrongly assumed to imply that they were CAFs. Now we use the is_caf
flag for this, which we have to correctly initialise when we create a
Cost Centre in GHCi.
|
|
|
|
|
|
|
|
|
|
|
|
| |
Test Plan: Validate on lots of platforms
Reviewers: erikd, simonmar, austin
Reviewed By: erikd, simonmar
Subscribers: michalt, thomie
Differential Revision: https://phabricator.haskell.org/D2699
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
We currently have two info tables for a constructor
* XXX_con_info: the info table for a heap-resident instance of the
constructor, It has type CONSTR, or one of the specialised types like
CONSTR_1_0
* XXX_static_info: the info table for a static instance of this
constructor, which has type CONSTR_STATIC or CONSTR_STATIC_NOCAF.
I'm getting rid of the latter, and using the `con_info` info table for
both static and dynamic constructors. For rationale and more details
see Note [static constructors] in SMRep.hs.
I also removed these macros: `isSTATIC()`, `ip_STATIC()`,
`closure_STATIC()`, since they relied on the CONSTR/CONSTR_STATIC
distinction, and anyway HEAP_ALLOCED() does the same job.
Test Plan: validate
Reviewers: bgamari, simonpj, austin, gcampax, hvr, niteria, erikd
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2690
GHC Trac Issues: #12455
|
|
|
|
|
|
|
|
|
|
|
|
| |
The `nat` type was an alias for `unsigned int` with a comment saying
it was at least 32 bits. We keep the typedef in case client code is
using it but mark it as deprecated.
Test Plan: Validated on Linux, OS X and Windows
Reviewers: simonmar, austin, thomie, hvr, bgamari, hsyl20
Differential Revision: https://phabricator.haskell.org/D2166
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The following tests fail on powerpc64 and have a ticket.
Mark those tests as expect_broken.
Here are the details:
The PowerPC native code generator does not support DWARF debug
information. This is tracked in ticket #11261. Mark the respective
tests broken on powerpc64.
testsuite: mark print022 broken on powerpc64
Ticket #11262 tracks difference in stdout for print022.
testsuite: mark recomp015 broken on powerpc64
testsuite: mark recomp011 broken on powerpc64
This is tracked as ticket #11323 and #11260.
testsuite: mark linker tests broken on powerpc64
Ticket #11259 tracks tests failing because there is no RTS
linker on powerpc64.
Test Plan: validate
Reviewers: erikd, austin, bgamari
Reviewed By: bgamari
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D1928
GHC Trac Issues: #11259, #11260, #11261, #11262, #11323
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
it seems that this closure type has not been in use since 5d52d9, so all
this is dead and untested code. This removes it. Some of the code might
be useful for a counting indirection as described in #10613, so when
implementing that, have a look at what this commit removes.
Test Plan: validate on harbormaster
Reviewers: austin, bgamari, simonmar
Reviewed By: simonmar
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D1821
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
The main goal here is enable stack traces in GHCi. After this change,
if you start GHCi like this:
ghci -fexternal-interpreter -prof
(which requires packages to be built for profiling, but not GHC
itself) then the interpreter manages cost-centre stacks during
execution and can produce a stack trace on request. Call locations
are available for all interpreted code, and any compiled code that was
built with the `-fprof-auto` familiy of flags.
There are a couple of ways to get a stack trace:
* `error`/`undefined` automatically get one attached
* `Debug.Trace.traceStack` can be used anywhere, and prints the current
stack
Because the interpreter is running in a separate process, only the
interpreted code is running in profiled mode and the compiler itself
isn't slowed down by profiling.
The GHCi debugger still doesn't work with -fexternal-interpreter,
although this patch gets it a step closer. Most of the functionality
of breakpoints is implemented, but the runtime value introspection is
still not supported.
Along the way I also did some refactoring and added type arguments to
the various remote pointer types in `GHCi.RemotePtr`, so there's
better type safety and documentation in the bridge code between GHC
and ghc-iserv.
Test Plan: validate
Reviewers: bgamari, ezyang, austin, hvr, goldfire, erikd
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D1747
GHC Trac Issues: #11047, #11100
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
Breakpoints become SCCs, so we have detailed call-stack info for
interpreted code. Currently this only works when GHC is compiled with
-prof, but D1562 (Remote GHCi) removes this constraint so that in the
future call stacks will be available without building your own GHCi.
How can you get a stack trace?
* programmatically: GHC.Stack.currentCallStack
* I've added an experimental :where command that shows the stack when
stopped at a breakpoint
* `error` attaches a call stack automatically, although since calls to
`error` are often lifted out to the top level, this is less useful
than it might be (ImplicitParams still works though).
* Later we might attach call stacks to all exceptions
Other related changes in this diff:
* I reduced the number of places that get ticks attached for
breakpoints. In particular there was a breakpoint around the whole
declaration, which was often redundant because it bound no variables.
This reduces clutter in the stack traces and speeds up compilation.
* I tidied up some RealSrcSpan stuff in InteractiveUI, and made a few
other small cleanups
Test Plan: validate
Reviewers: ezyang, bgamari, austin, hvr
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D1595
GHC Trac Issues: #11047
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
(Apologies for the size of this patch, I couldn't make a smaller one
that was validate-clean and also made sense independently)
(Some of this code is derived from GHCJS.)
This commit adds support for running interpreted code (for GHCi and
TemplateHaskell) in a separate process. The functionality is
experimental, so for now it is off by default and enabled by the flag
-fexternal-interpreter.
Reaosns we want this:
* compiling Template Haskell code with -prof does not require
building the code without -prof first
* when GHC itself is profiled, it can interpret unprofiled code, and
the same applies to dynamic linking. We would no longer need to
force -dynamic-too with TemplateHaskell, and we can load ordinary
objects into a dynamically-linked GHCi (and vice versa).
* An unprofiled GHCi can load and run profiled code, which means it
can use the stack-trace functionality provided by profiling without
taking the performance hit on the compiler that profiling would
entail.
Amongst other things; see
https://ghc.haskell.org/trac/ghc/wiki/RemoteGHCi for more details.
Notes on the implementation are in Note [Remote GHCi] in the new
module compiler/ghci/GHCi.hs. It probably needs more documenting,
feel free to suggest things I could elaborate on.
Things that are not currently implemented for -fexternal-interpreter:
* The GHCi debugger
* :set prog, :set args in GHCi
* `recover` in Template Haskell
* Redirecting stdin/stdout for the external process
These are all doable, I just wanted to get to a working validate-clean
patch first.
I also haven't done any benchmarking yet. I expect there to be slight hit
to link times for byte code and some penalty due to having to
serialize/deserialize TH syntax, but I don't expect it to be a serious
problem. There's also lots of low-hanging fruit in the byte code
generator/linker that we could exploit to speed things up.
Test Plan:
* validate
* I've run parts of the test suite with
EXTRA_HC_OPTS=-fexternal-interpreter, notably tests/ghci and tests/th.
There are a few failures due to the things not currently implemented
(see above).
Reviewers: simonpj, goldfire, ezyang, austin, alanz, hvr, niteria, bgamari, gibiansky, luite
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D1562
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
Amazingly, there were zero changes to the byte code generator and very
few changes to the interpreter - mainly because we've used good
abstractions that hide the differences between profiling and
non-profiling. So that bit was pleasantly straightforward, but there
were a pile of other wibbles to get the whole test suite through.
Note that a compiler built with -prof is now like one built with
-dynamic, in that to use TH you have to build the code the same way.
For dynamic, we automatically enable -dynamic-too when TH is required,
but we don't have anything equivalent for profiling, so you have to
explicitly use -prof when building code that uses TH with a profiled
compiler. For this reason Cabal won't work with TH. We don't expect
to ship a profiled compiler, so I think that's OK.
Test Plan: validate with GhcProfiled=YES in validate.mk
Reviewers: goldfire, bgamari, rwbarton, austin, hvr, erikd, ezyang
Reviewed By: ezyang
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D1407
GHC Trac Issues: #4837, #545
|
|
|
|
|
|
|
|
|
|
| |
Rename StgArrWords to StgArrBytes (see Trac #8552)
Reviewed By: austin
Differential Revision: https://phabricator.haskell.org/D1233
GHC Trac Issues: #8552
|
| |
|
|
|
|
| |
Signed-off-by: Austin Seipp <austin@well-typed.com>
|
|
|
|
|
|
|
| |
This reverts commit 35672072b4091d6f0031417bc160c568f22d0469.
Conflicts:
compiler/main/DriverPipeline.hs
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
In preparation for indirecting all references to closures,
we rename _closure to _static_closure to ensure any old code
will get an undefined symbol error. In order to reference
a closure foobar_closure (which is now undefined), you should instead
use STATIC_CLOSURE(foobar). For convenience, a number of these
old identifiers are macro'd.
Across C-- and C (Windows and otherwise), there were differing
conventions on whether or not foobar_closure or &foobar_closure
was the address of the closure. Now, all foobar_closure references
are addresses, and no & is necessary.
CHARLIKE/INTLIKE were not changed, simply alpha-renamed.
Part of remove HEAP_ALLOCED patch set (#8199)
Depends on D265
Signed-off-by: Edward Z. Yang <ezyang@mit.edu>
Test Plan: validate
Reviewers: simonmar, austin
Subscribers: simonmar, ezyang, carter, thomie
Differential Revision: https://phabricator.haskell.org/D267
GHC Trac Issues: #8199
|
|
|
|
| |
This reverts commit 39b5c1cbd8950755de400933cecca7b8deb4ffcd.
|
|
|
|
|
|
|
|
| |
This will hopefully help ensure some basic consistency in the forward by
overriding buffer variables. In particular, it sets the wrap length, the
offset to 4, and turns off tabs.
Signed-off-by: Austin Seipp <austin@well-typed.com>
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The main change here is that the Cmm parser now allows high-level cmm
code with argument-passing and function calls. For example:
foo ( gcptr a, bits32 b )
{
if (b > 0) {
// we can make tail calls passing arguments:
jump stg_ap_0_fast(a);
}
return (x,y);
}
More details on the new cmm syntax are in Note [Syntax of .cmm files]
in CmmParse.y.
The old syntax is still more-or-less supported for those occasional
code fragments that really need to explicitly manipulate the stack.
However there are a couple of differences: it is now obligatory to
give a list of live GlobalRegs on every jump, e.g.
jump %ENTRY_CODE(Sp(0)) [R1];
Again, more details in Note [Syntax of .cmm files].
I have rewritten most of the .cmm files in the RTS into the new
syntax, except for AutoApply.cmm which is generated by the genapply
program: this file could be generated in the new syntax instead and
would probably be better off for it, but I ran out of enthusiasm.
Some other changes in this batch:
- The PrimOp calling convention is gone, primops now use the ordinary
NativeNodeCall convention. This means that primops and "foreign
import prim" code must be written in high-level cmm, but they can
now take more than 10 arguments.
- CmmSink now does constant-folding (should fix #7219)
- .cmm files now go through the cmmPipeline, and as a result we
generate better code in many cases. All the object files generated
for the RTS .cmm files are now smaller. Performance should be
better too, but I haven't measured it yet.
- RET_DYN frames are removed from the RTS, lots of code goes away
- we now have some more canned GC points to cover unboxed-tuples with
2-4 pointers, which will reduce code size a little.
|
|
|
|
| |
No size changes in the non-debug object files
|
|
|
|
|
|
|
|
| |
All the wibble seem to have cancelled out, and (non-debug) object sizes
are back to where they started.
I'm not 100% sure that the types are optimal, but at least now the
functions have types and we can fix them if necessary.
|
| |
|
|
|
|
|
| |
We may need to do this differently once we get as far as building the
RTS in the dyn ways.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This means that both time and heap profiling work for parallel
programs. Main internal changes:
- CCCS is no longer a global variable; it is now another
pseudo-register in the StgRegTable struct. Thus every
Capability has its own CCCS.
- There is a new built-in CCS called "IDLE", which records ticks for
Capabilities in the idle state. If you profile a single-threaded
program with +RTS -N2, you'll see about 50% of time in "IDLE".
- There is appropriate locking in rts/Profiling.c to protect the
shared cost-centre-stack data structures.
This patch does enough to get it working, I have cut one big corner:
the cost-centre-stack data structure is still shared amongst all
Capabilities, which means that multiple Capabilities will race when
updating the "allocations" and "entries" fields of a CCS. Not only
does this give unpredictable results, but it runs very slowly due to
cache line bouncing.
It is strongly recommended that you use -fno-prof-count-entries to
disable the "entries" count when profiling parallel programs. (I shall
add a note to this effect to the docs).
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
Based on a patch from David Terei.
Some parts are a little ugly (e.g. defining things that only ASSERTs
use only when DEBUG is defined), so we might want to tweak things a
little.
I've also turned off -Werror for didn't-inline warnings, as we now
get a few such warnings.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch makes two changes to the way stacks are managed:
1. The stack is now stored in a separate object from the TSO.
This means that it is easier to replace the stack object for a thread
when the stack overflows or underflows; we don't have to leave behind
the old TSO as an indirection any more. Consequently, we can remove
ThreadRelocated and deRefTSO(), which were a pain.
This is obviously the right thing, but the last time I tried to do it
it made performance worse. This time I seem to have cracked it.
2. Stacks are now represented as a chain of chunks, rather than
a single monolithic object.
The big advantage here is that individual chunks are marked clean or
dirty according to whether they contain pointers to the young
generation, and the GC can avoid traversing clean stack chunks during
a young-generation collection. This means that programs with deep
stacks will see a big saving in GC overhead when using the default GC
settings.
A secondary advantage is that there is much less copying involved as
the stack grows. Programs that quickly grow a deep stack will see big
improvements.
In some ways the implementation is simpler, as nothing special needs
to be done to reclaim stack as the stack shrinks (the GC just recovers
the dead stack chunks). On the other hand, we have to manage stack
underflow between chunks, so there's a new stack frame
(UNDERFLOW_FRAME), and we now have separate TSO and STACK objects.
The total amount of code is probably about the same as before.
There are new RTS flags:
-ki<size> Sets the initial thread stack size (default 1k) Egs: -ki4k -ki2m
-kc<size> Sets the stack chunk size (default 32k)
-kb<size> Sets the stack chunk buffer size (default 1k)
-ki was previously called just -k, and the old name is still accepted
for backwards compatibility. These new options are documented.
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is patch that adds support for interruptible FFI calls in the form
of a new foreign import keyword 'interruptible', which can be used
instead of 'safe' or 'unsafe'. Interruptible FFI calls act like safe
FFI calls, except that the worker thread they run on may be interrupted.
Internally, it replaces BlockedOnCCall_NoUnblockEx with
BlockedOnCCall_Interruptible, and changes the behavior of the RTS
to not modify the TSO_ flags on the event of an FFI call from
a thread that was interruptible. It also modifies the bytecode
format for foreign call, adding an extra Word16 to indicate
interruptibility.
The semantics of interruption vary from platform to platform, but the
intent is that any blocking system calls are aborted with an error code.
This is most useful for making function calls to system library
functions that support interrupting. There is no support for pre-Vista
Windows.
There is a partner testsuite patch which adds several tests for this
functionality.
|
| |
|
|
|
|
|
|
|
| |
These are no longer used: once upon a time they used to have different
layout from IND and IND_PERM respectively, but that is no longer the
case since we changed the remembered set to be an array of addresses
instead of a linked list of closures.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This replaces the global blackhole_queue with a clever scheme that
enables us to queue up blocked threads on the closure that they are
blocked on, while still avoiding atomic instructions in the common
case.
Advantages:
- gets rid of a locked global data structure and some tricky GC code
(replacing it with some per-thread data structures and different
tricky GC code :)
- wakeups are more prompt: parallel/concurrent performance should
benefit. I haven't seen anything dramatic in the parallel
benchmarks so far, but a couple of threading benchmarks do improve
a bit.
- waking up a thread blocked on a blackhole is now O(1) (e.g. if
it is the target of throwTo).
- less sharing and better separation of Capabilities: communication
is done with messages, the data structures are strictly owned by a
Capability and cannot be modified except by sending messages.
- this change will utlimately enable us to do more intelligent
scheduling when threads block on each other. This is what started
off the whole thing, but it isn't done yet (#3838).
I'll be documenting all this on the wiki in due course.
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is a batch of refactoring to remove some of the GC's global
state, as we move towards CPU-local GC.
- allocateLocal() now allocates large objects into the local
nursery, rather than taking a global lock and allocating
then in gen 0 step 0.
- allocatePinned() was still allocating from global storage and
taking a lock each time, now it uses local storage.
(mallocForeignPtrBytes should be faster with -threaded).
- We had a gen 0 step 0, distinct from the nurseries, which are
stored in a separate nurseries[] array. This is slightly strange.
I removed the g0s0 global that pointed to gen 0 step 0, and
removed all uses of it. I think now we don't use gen 0 step 0 at
all, except possibly when there is only one generation. Possibly
more tidying up is needed here.
- I removed the global allocate() function, and renamed
allocateLocal() to allocate().
- the alloc_blocks global is gone. MAYBE_GC() and
doYouWantToGC() now check the local nursery only.
|
|
|
|
| |
statically.
|
| |
|
|
|
|
| |
This gives about a 15% performance boost in GHCi for me. nice!
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The first phase of this tidyup is focussed on the header files, and in
particular making sure we are exposinng publicly exactly what we need
to, and no more.
- Rts.h now includes everything that the RTS exposes publicly,
rather than a random subset of it.
- Most of the public header files have moved into subdirectories, and
many of them have been renamed. But clients should not need to
include any of the other headers directly, just #include the main
public headers: Rts.h, HsFFI.h, RtsAPI.h.
- All the headers needed for via-C compilation have moved into the
stg subdirectory, which is self-contained. Most of the headers for
the rest of the RTS APIs have moved into the rts subdirectory.
- I left MachDeps.h where it is, because it is so widely used in
Haskell code.
- I left a deprecated stub for RtsFlags.h in place. The flag
structures are now exposed by Rts.h.
- Various internal APIs are no longer exposed by public header files.
- Various bits of dead code and declarations have been removed
- More gcc warnings are turned on, and the RTS code is more
warning-clean.
- More source files #include "PosixSource.h", and hence only use
standard POSIX (1003.1c-1995) interfaces.
There is a lot more tidying up still to do, this is just the first
pass. I also intend to standardise the names for external RTS APIs
(e.g use the rts_ prefix consistently), and declare the internal APIs
as hidden for shared libraries.
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
This reduces the latency between a context-switch being triggered and
the thread returning to the scheduler, which in turn should reduce the
cost of the GC barrier when there are many cores.
We still retain the old context_switch flag which is checked at the
end of each block of allocation. The idea is that setting HpLim may
fail if the the target thread is modifying HpLim at the same time; the
context_switch flag is a fallback. It also allows us to "context
switch soon" without forcing an immediate switch, which can be costly.
|
|
|
|
| |
Fixes crashes on Windows and Sparc
|
|
|
|
|
| |
Fixes a long-standing bug that could in some cases cause sub-optimal
scheduling behaviour.
|
|
|
|
| |
This program makes a PAP with 203 arguments :-)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This replaces the hand-rolled architecture-specific FFI support in
GHCi with the standard libffi as used in GCJ, Python and other
projects. I've bundled the complete libffi-3.0.4 tarball in the
source tree in the same way as we do for GMP, the difference being
that we always build and install our own libffi regardless of whether
there's one on the system (it's small, and we don't want
dependency/versioning headaches).
In particular this means that unregisterised builds will now have a
fully working GHCi including FFI out of the box, provided libffi
supports the platform.
There is also code in the RTS to use libffi in place of
rts/Adjustor.c, but it is currently not enabled if we already have
support in Adjustor.c for the current platform. We need to assess the
performance impact before using libffi here too (in GHCi we don't care
too much about performance).
|
|
|
|
|
|
|
|
|
|
|
|
| |
This means that an unregisterised build on a platform not directly
supported by GHC can now have full FFI support using libffi.
Also in this commit:
- use PrimRep rather than CgRep to describe FFI args in the byte
code generator. No functional changes, but PrimRep is more correct.
- change TyCon.sizeofPrimRep to primRepSizeW, which is more useful
|