chromium/docs/subtle_threading_bugs.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133

# Subtle Threading Bugs and Patterns to avoid them

[TOC]

## The Problem
We were using a number of patterns that were problematic:

1. Using `BrowserThread::GetMessageLoop` This isn't safe, since it could return
   a valid pointer, but since the caller isn't holding on to a lock anymore, the
   target MessageLoop could be destructed while it's being used.
   
1. Caching of MessageLoop pointers in order to use them later for PostTask and friends

   * This was more efficient previously (more on that later) since using
     `BrowserThread::GetMessageLoop` involved a lock.
   
   * But it spread logic about the order of thread destruction all over the
     code.  Code that moved from the IO thread to the file thread and back, in
     order to avoid doing disk access on the IO thread, ended up having to do an
     extra hop through the UI thread on the way back to the IO thread since the
     file thread outlives the IO thread.  Of course, most code learnt this the
     hard way, after doing the straight forward IO->file->IO thread hop and
     updating the code after seeing reliability or user crashes.

   * It made the browser shutdown fragile and hence difficult to update.
   
1. File thread hops using RefCountedThreadSafe objects which have non-trivial
   destructors

   * To reduce jank, frequently an object on the UI or IO thread would execute
     some code on the file thread, then post the result back to the original
     thread.  We make this easy using `base::Callback` and
     `RefCountedThreadSafe`, so this pattern happened often, but it's not always
     safe: base::Callback holds an extra reference on the object to ensure that
     it doesn't invoke a method on a deleted object.  But it's quite possible
     that before the file thread's stack unwinds and it releases the extra
     reference, that the response task on the original thread executed and
     released its own additional reference.  The file thread is then left with
     the remaining reference and the object gets destructed there.  While for
     most objects this is ok, many have non-trivial destructors, with the worst
     being ones that register with the UI-thread NotificationService.  Dangling
     pointers would be left behind and tracking these crashes from ChromeBot or
     the crash dumps has wasted several days at least for me.

1. Having browser code take different code paths if a thread didn't exist

   * This could be either deceptively harmless (i.e. execute synchronously when
     it was normally asynchronous), when in fact it makes shutdown slower
     because disk access is moved to the UI thread.
 
   * It could lead to data loss, if tasks are silently not posted because the
     code assumes this only happens in unit tests, when it could occur on
     browser shutdown as well.

## The Solution

* 1+2: Where possible, use `BrowserThread::PostTask`. Everywhere else, use
  `scoped_refptr<MessageLoopProxy>`.

  `BrowserThread::PostTask` and friends (i.e. `PostDelayedTask`, `DeleteSoon`,
  `ReleaseSoon`) are safe and efficient: no locks are grabbed if the target
  thread is known to outlive the current thread.  The four static methods have
  the same signature as the ones from `MessageLoop`, with the addition of the
  first parameter to indicate the target thread:

    `BrowserThread::PostTask(BrowserThread::FILE, FROM_HERE, task);`

  Similarly, `MessageLoopProxy` has (non-static) methods with the same signature
  as the `MessageLoop` counterparts, but is safe for caching a [reference
  counting] pointer to. You can obtain a `MessageLoopProxy` via
  `Thread::message_loop_proxy()`, `BrowserThread::GetMessageLoopProxy()`, or
  `MessageLoopProxy::CreateForCurrentThread()`.

* 3: If you want to execute a method on another thread and jump back to the
  original thread, use `PostTaskAndReply()`:

    BrowserThread::PostTaskAndReply(BrowserThread::FILE, FROM_HERE,
                                    base::Bind(&Foo::DoStuffOnFileThread, this),
                                    base::Bind(&Foo::StuffDone, this));

  `PostTaskAndReply()` will make sure that both tasks are destroyed on the
  thread they were created on, so if they hold the last reference to the object,
  it will be destroyed on the originating thread. You can also use
  `PostTaskAndReplyWithResult()` to return a value from the first task and pass
  it into the second task.
  
  Alternatively, if your object must be destructed on a specific thread, you can
  use a trait from BrowserThread:

    class Foo : public base::RefCountedThreadSafe<Foo, BrowserThread::DeleteOnIOThread>

* 4: I've removed all the special casing and always made the objects in the
  browser code behave in one way.  If you're writing a unit test and need to use
  an object that goes to a file thread (where before it would proceed
  synchronously), you just need:

    BrowserThread file_thread(BrowserThread::FILE, MessageLoop::current());
    foo->StartAsyncTaskOnFileThread();
    MessageLoop::current()->RunAllPending();

  There are plenty of examples now in the tests.

## Gotchas
Even when using `BrowseThread` or `MessageLoopProxy`, you will still likely have
messages lost (usually resulting in memory leaks) when the target thread is in
the process of shutting down: the benefit over `MessageLoop` is primarily one of
avoiding crashing in unpredictable ways. (See this thread for debate.)

`BrowseThread` and `MessageLoopProxy::PostTask` will silently delete a task if
the thread doesn't exist.  This is done to avoid having all the code that uses
it have special cases if the target thread exists or not, and to make Valgrind
happy.  As usual, the task for `DeleteSoon/ReleaseSoon` doesn't do anything in
its destructor, so this won't cause unexpected behavior with them.  But be wary
of posted `Task` objects which have non-trivial destructors or smart pointers as
members.  I'm still on the fence about this, since while the latter is
theoretical now, it could lead to problems in the future.  I might change it so
that the tasks are not deleted when I'm ready for more Valgrind fun.

If you absolutely must know if a task was posted or not, you can check the
return value of `PostTask` and friends.  But note that even if the task was
posted successfully, there's no guarantee that it will run because the target
thread could already have a `QuitTask` queued up, or be in the early stages of
quitting.

`g_browser->io_thread()` and `file_thread()` are still around (the former for
IPC code, and the latter for Linux proxy code which is in net and so can't use
`BrowserThread`).  Don't use them unless absolutely necessary.


### More information

* https://bugs.chromium.org/p/chromium/issues/detail?id=25354