# Cr50 And Chrome OS Verified Boot Troubleshooting

H1 is a Google security chip installed on most Chrome OS devices. Cr50 is the
firmware running on the H1. A high-level overview of the hardware and firmware
can be found in [this
presentation](https://2018.osfc.io/uploads/talk/paper/7/gsc_copy.pdf).

This write-up explains how Cr50 participates in the Chrome OS device boot
process, and lists possible reasons for the dreaded "Chrome OS Missing Or
Damaged" screen showing up when a Chrome OS device reboots.

## Basic overview

The H1 controls the reset lines of the EC (embedded controller) and the AP
(application processor, or SOC). During normal Chromebook operation the H1 is
always powered up as long as the battery retains even a minimal amount of
charge. In Chromeboxes the H1 powers on with the rest of the system.

One of the important functions of the H1 in the system is providing a subset
of TPM (Trusted Platform Module) functionality. The TPM stores verified boot
information, which is why any **problems communicating with the TPM during the
boot up process** result in the Chrome OS device falling into **recovery
mode**.

Another important function of the H1 in the system is CCD ([case closed
debugging](https://chromium.googlesource.com/chromiumos/platform/ec/+/fe6ca90e/docs/case_closed_debugging_gsc.md#)).

## H1 power states and CCD

During periods of inactivity H1 could enter a *sleep* or *deep sleep* state.
In *sleep* state most of the clocks are turned off and power consumption is
minimized, but SRAM contents and the CPU state are maintained. In *deep sleep*
state the H1 is practically shut down.

The H1 never enters the *deep sleep* state during the Chrome OS boot process,
but it can enter the *sleep* state if the boot process is delayed for whatever
reason, and **only when CCD is not active**. This can explain boot failures
that show up when CCD is not connected, but go away when CCD is on (the debug
cable is plugged in).

To make sure the H1 exits the *sleep* state the AP triggers a wake up event,
details of which are described below.

## H1 communications with the AP

The H1 can be connected to the AP over the I2C or the SPI bus. The same Cr50
firmware is used in both cases. The decision which of the two interfaces to
use is made based on resistor straps the Cr50 reads at startup.

Neither the I2C nor the SPI interface fully complies with its respective bus
standard: the I2C controller does not support clock stretching, and the SPI
controller cannot be clocked faster than 2 MHz.

Look for a text line like the following in the Cr50 console output right after
power up

> [0.005657 Valid strap: 0xa properties: 0x41]

to confirm that the straps were read properly.
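
When working from a saved console log, the check above can be scripted. The
following is a minimal sketch; the regular expression is derived only from the
single example line quoted above, and the function name `find_strap_report` is
made up for this illustration.

```python
import re

# Pattern for the strap report line; the timestamp and number formats are
# assumptions based on the one example shown in this document.
STRAP_RE = re.compile(
    r"\[\s*\d+\.\d+\s+Valid strap:\s+(?P<strap>0x[0-9a-fA-F]+)"
    r"\s+properties:\s+(?P<props>0x[0-9a-fA-F]+)\]"
)


def find_strap_report(console_log):
    """Return (strap, properties) as ints, or None if no such line exists."""
    match = STRAP_RE.search(console_log)
    if match is None:
        return None
    return int(match.group("strap"), 16), int(match.group("props"), 16)


log = "[0.005657 Valid strap: 0xa properties: 0x41]"
print(find_strap_report(log))  # (10, 65)
```

A `None` result means the log does not contain a valid strap report, which is
worth investigating before looking at anything else.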

A Cr50 console command shows which interface is used to communicate with the
AP:

> \> brdprop<br>
> properties = 0x1141

If the least significant bit of the value is set, the H1 is using the SPI
interface. If the bit is cleared, the H1 is using the I2C interface.

Using the H1 imposes additional requirements on the AP interface: the H1 might
have to be woken up from sleep, and it flow controls the AP using an
additional `AP_INT_L` signal. Both are described in more detail below.

## TPM reset

The H1 stays up until power is removed, unless it falls into deep sleep. The
TPM is just one of the components of the Cr50 firmware, and it must be reset
whenever the AP resets.

There are differences between ARM and X86 reset circuit architectures. ARM
SOCs have a bidirectional reset signal called `SYS_RST_L`. They (or, rather,
most of them, but let's not worry about the outliers) generate a pulse on this
line when the SOC reboots. An external device can toggle this line to reset
the SOC asynchronously, which is what the Cr50 does to reset ARM SOCs.

The X86 SOCs have two separate signals: one output, `PLT_RST_L`, which is held
low while the AP is in reset or in low power mode, and one input,
`SYS_RST_ODL`, which Cr50 toggles to reset the SOC.

On X86, when `PLT_RST_L` is held low longer than a second, the Cr50 considers
this an indication of the AP going into a low power mode (S5 or lower). This
means that the AP will start from the reset vector when it wakes up, so Cr50
can take the H1 into *deep sleep* mode as well.
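
The one second rule can be modeled in a few lines. This is a toy sketch, not
Cr50 code: the class name, the method names and the choice to count a TPM
reset on the falling edge of `PLT_RST_L` are all assumptions made for the
illustration.

```python
DEEP_SLEEP_THRESHOLD_S = 1.0  # "longer than a second", per the text


class PltRstMonitor:
    """Toy model of interpreting PLT_RST_L (hypothetical, for illustration)."""

    def __init__(self):
        self.low_since = None  # time PLT_RST_L went low, None while it is high
        self.tpm_resets = 0

    def on_level(self, now, level_high):
        """Feed the current signal level together with a timestamp in seconds."""
        if not level_high and self.low_since is None:
            self.low_since = now   # falling edge: the AP entered reset
            self.tpm_resets += 1   # assumption: TPM is reset on assertion
        elif level_high:
            self.low_since = None  # the AP is out of reset

    def may_deep_sleep(self, now):
        """True once the signal has stayed low for more than the threshold."""
        return (self.low_since is not None
                and now - self.low_since > DEEP_SLEEP_THRESHOLD_S)


monitor = PltRstMonitor()
monitor.on_level(0.0, level_high=False)
print(monitor.may_deep_sleep(0.5))  # False: could still be a plain reboot
print(monitor.may_deep_sleep(1.5))  # True: AP is treated as S5 or lower
```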

On top of that, ARM based Chrome OS devices have some additional logic which
forces `SYS_RST_L` to behave similarly to `PLT_RST_L`: it stays low when the
SOC is in a low power mode and will resume operation from the reset vector.
This allows the H1 to enter deep sleep on ARM devices as well.

Resistor straps tell the Cr50 which kind of reset architecture to expect. The
SOC reset indication is used both to reset the TPM component and to enter the
*deep sleep* mode as appropriate.

In the `brdprop` command output, bit D5, when set, signifies the `SYS_RST_L`
('regular' ARM devices) type of reset, and bit D6 signifies the `PLT_RST_L`
(X86 and modified ARM) type of reset.
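
The bits mentioned in this document can be decoded with a short helper. This
is a sketch that covers only D0 (SPI versus I2C, as described in the section
on H1 communications with the AP), D5 and D6. All other bits are ignored, the
names are made up for the illustration, and giving D6 precedence when both
reset bits are set is an assumption.

```python
# Bit positions as described in the text; other property bits are not decoded.
PROP_SPI = 1 << 0      # set: SPI, cleared: I2C
PROP_SYS_RST = 1 << 5  # SYS_RST_L ('regular' ARM devices)
PROP_PLT_RST = 1 << 6  # PLT_RST_L (X86 and modified ARM)


def describe_properties(properties):
    """Return the AP interface and reset type encoded in a brdprop value."""
    interface = "SPI" if properties & PROP_SPI else "I2C"
    if properties & PROP_PLT_RST:      # assumption: D6 wins if both are set
        reset = "PLT_RST_L"
    elif properties & PROP_SYS_RST:
        reset = "SYS_RST_L"
    else:
        reset = "unknown"
    return {"interface": interface, "reset": reset}


print(describe_properties(0x1141))
# {'interface': 'SPI', 'reset': 'PLT_RST_L'}
```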

Boot problems can arise when the AP reboots without Cr50 seeing a pulse on the
`SYS_RST_L` or `PLT_RST_L` signal: in this case the very first TPM_Startup
command sent by coreboot returns an error, and the Chrome OS device falls into
recovery mode.


## Cr50 operations synchronization

The H1 microcontroller is very slow (clocked at 24 MHz), while the AP is
usually hundreds of times faster, so there is a need to slow down the AP when
it tries to talk to the TPM during the boot up process. The issue is
complicated by the inability of the I2C controller to stretch the clock.

In both I2C and SPI modes the AP\_INT\_L H1 output signal is used to indicate
to the AP that the H1 is ready for the next I2C or SPI transaction. By default
this signal is a 4+ us long low pulse. Some X86 platforms require a pulse of
100+ us. This pulse extension mode can be configured by setting a bit in a TPM
register (I2C register address 0x1c or SPI register address 0xfe0).

In any case it is important that the AP firmware properly configures the pin
where the AP\_INT\_L signal is connected as an edge sensitive GPIO, which
latches on either the falling or the rising edge of the signal.

When the AP firmware misses these synchronization pulses, the boot process
takes a very long time and the AP firmware log includes messages like

> Timeout wait for TPM IRQ!

in case of SPI or

> Cr50 i2c TPM IRQ timeout!

in case of I2C.
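
To see why the edge sensitive configuration matters, consider a simplified
timing model. The sketch below compares sampling the pin level against
checking an edge latch; the 50 us polling interval and the pulse start time
are hypothetical numbers picked for the illustration, and only the 4 us and
100 us pulse lengths come from the text above.

```python
def level_polling_sees_pulse(pulse_start_us, pulse_len_us, poll_times_us):
    """True if any poll instant falls inside the low pulse."""
    pulse_end_us = pulse_start_us + pulse_len_us
    return any(pulse_start_us <= t < pulse_end_us for t in poll_times_us)


def edge_latch_sees_pulse(pulse_start_us, poll_times_us):
    """An edge latch keeps the event, so any poll after the edge finds it."""
    return any(t >= pulse_start_us for t in poll_times_us)


polls = range(0, 1000, 50)  # hypothetical: firmware checks every 50 us

print(level_polling_sees_pulse(110, 4, polls))    # False: 4 us pulse missed
print(level_polling_sees_pulse(110, 100, polls))  # True: extended pulse seen
print(edge_latch_sees_pulse(110, polls))          # True: latch catches it
```

In this model a level sampled pin misses the default short pulse, and the
firmware would wait until its timeout expires on every transaction.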

## Waking H1 up from sleep

The I2C Start sequence is sufficient for the H1 to resume operation, so the AP
does not have to do anything special. In case of SPI, matters are more
complicated.

Technically speaking, the assertion of the CS SPI bus signal is enough to wake
up the H1, but it takes time for the H1 to become fully operational, and the
AP could already be transmitting the message by the time the H1 SPI controller
is ready. This is why, if the previous SPI transaction was a second or more
ago, the SPI driver is required to first issue a CS pulse without transferring
any data, just to wake up the H1, then wait for 100 us to let the H1 wake up,
and only then continue with a regular SPI transaction.
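
The wake up rule can be sketched as a small helper that a driver would call
around every transaction. This is an illustration only: the class, its
callbacks (`pulse_cs`, `sleep`, `clock`) and the decision to also send a wake
pulse before the very first transaction are assumptions, not an existing
driver API.

```python
WAKE_IDLE_THRESHOLD_S = 1.0  # previous transaction "a second or more ago"
WAKE_DELAY_S = 100e-6        # let the H1 wake up for 100 us


class SpiTpmWaker:
    """Decides when a dataless CS pulse is needed (hypothetical helper)."""

    def __init__(self, pulse_cs, sleep, clock):
        self._pulse_cs = pulse_cs  # toggles CS without clocking any data
        self._sleep = sleep        # sleep(seconds)
        self._clock = clock        # returns monotonic time in seconds
        self._last_xfer = None     # None: nothing has been sent yet

    def before_transaction(self):
        now = self._clock()
        idle_too_long = (self._last_xfer is None
                         or now - self._last_xfer >= WAKE_IDLE_THRESHOLD_S)
        if idle_too_long:
            self._pulse_cs()
            self._sleep(WAKE_DELAY_S)

    def after_transaction(self):
        self._last_xfer = self._clock()


# Exercise the helper with a fake clock and recording callbacks.
events = []
fake_time = [0.0]
waker = SpiTpmWaker(lambda: events.append("cs_pulse"),
                    lambda seconds: events.append(("sleep", seconds)),
                    lambda: fake_time[0])

waker.before_transaction()  # first transaction: wake pulse
waker.after_transaction()
fake_time[0] = 0.2
waker.before_transaction()  # 0.2 s idle: no wake pulse
waker.after_transaction()
fake_time[0] = 1.5
waker.before_transaction()  # 1.3 s idle: wake pulse again
print(events.count("cs_pulse"))  # 2
```

Passing the clock and the bus hooks in as callbacks keeps the timing rule
testable without hardware.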

If the AP does not follow this protocol and starts transmitting before the H1
is ready, communication failures are likely, resulting in the Chrome OS device
falling into recovery. This often happens when the device takes a long time to
find the kernel to boot: the AP then tries to lock the TPM state before
starting up the kernel, but fails, because by this time the H1 is asleep and
was not properly woken up.

## SPI Message Synchronization

The SPI interface is synchronous, and both read and write accesses happen
within a single transaction. The Trusted Computing Group (TCG) came up with a
hardware protocol on top of the SPI specification to allow the slow device to
flow control the fast host controller.

The basic idea is that each time the AP wants to read or write a TPM register,
it sends a SPI packet, which consists of the header and data fields.

The header field is always present. It is 4 bytes in size and includes the
operation type (read or write), data length and register address.

The header is sent out as soon as the SPI transaction starts, then the AP
starts monitoring the MISO line, one byte at a time, paying attention to bit
D0. The Cr50 keeps sending zeros on that bit until it is ready to proceed with
the operation requested in the transaction header. Once the Cr50 is ready, it
responds with a byte with bit D0 set to one. At this point the AP knows that
the actual data of the transaction starts with the next byte, so it either
sends the data in case of a write or reads it from the TPM in case of a read.

This is described in detail in [TCG PC Client Platform TPM Profile (PTP)
Specification Family "2.0" Level 00 Revision
00.43](https://drive.google.com/file/d/16r1vDhf1fnggI4BkOBuTXPqOQt4LaFvk/view?usp=sharing)
in section "6.4 SPI Hardware Protocol".
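
The flow control described above can be sketched from the AP side as follows.
This is a simplified illustration and not a real driver: `xfer_byte` is a
hypothetical callback that clocks one byte out and returns the byte clocked in
on MISO, the header layout (read flag in bit 7 and length minus one in the low
bits of the first byte, followed by a 24-bit address) follows my reading of
the TCG specification, and the sketch only starts checking D0 after the whole
header has been sent, as described in this document.

```python
def build_header(is_read, address, length):
    """Build the 4-byte header: operation and length, then 24-bit address."""
    if not 1 <= length <= 64:
        raise ValueError("length must be between 1 and 64 bytes")
    first = (0x80 if is_read else 0x00) | (length - 1)
    return bytes([first]) + address.to_bytes(3, "big")


def read_register(xfer_byte, address, length, max_wait_bytes=100):
    """Read a TPM register, honoring the D0 flow control bit."""
    for out in build_header(True, address, length):
        xfer_byte(out)
    for _ in range(max_wait_bytes):
        if xfer_byte(0) & 0x01:  # D0 set: the device is ready
            break
    else:
        raise TimeoutError("TPM never signalled ready")
    return bytes(xfer_byte(0) for _ in range(length))


# Fake device: zeros during the header, three stall bytes, a ready byte,
# then four bytes of register data.
miso = iter([0, 0, 0, 0, 0, 0, 0, 1, 0xE0, 0x1A, 0x28, 0x00])
sent = []


def fake_xfer(out):
    sent.append(out)
    return next(miso)


value = read_register(fake_xfer, 0xD40F00, 4)
print(bytes(sent[:4]).hex())  # 83d40f00
print(value.hex())            # e01a2800
```

A driver that skips the wait loop would return the stall bytes as if they were
register contents.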

The AP ignoring this flow control mechanism is yet another common problem
causing failures to boot, because the driver starts sending or receiving data
before the TPM is ready. This failure is more likely to happen when developing
new SPI drivers.

## Boot up process examples

A trace of a typical Chrome OS device boot process was collected using the
[Saleae](https://www.saleae.com/) Logic Pro 16 logic analyzer.

The [full trace](./images/bobba_boot.sal) can be examined in detail using the
Saleae application in the trace analysis mode.

A few detailed snapshots of this trace are shown below (click to expand):

### Full boot sequence

[![Full boot sequence](./images/typical_boot.png)][1] shows communications
between the AP and the H1 during a typical Chrome OS boot: first a flurry of
communications between coreboot and the H1, then some time spent verifying and
loading various firmware stages, then a block of communications between
Depthcharge and the H1.

### Typical read sequence

[![Typical read sequence](./images/typical_read.png)][2] shows the 4 byte
header where the read of four bytes from register address 0xd40f00 is
requested. The TPM is not ready and sends all zeros on the MISO line for three
cycles, then sends a byte of 01, and then the AP reads four bytes of the
actual register value (0xe01a2800). Once the H1 is ready to accept the next
SPI transaction, it generates a pulse on AP\_INT\_L.

### Read with wake pulse sequence

[![Read with wake pulse](./images/read_with_wake_pulse.png)][3] is an example
of a case where the AP toggles the CS line first, without sending any data,
and then, 100 us later, starts the actual SPI transaction, which is completed
with the AP\_INT\_L pulse.

[1]:https://drive.google.com/file/d/16Z_Nw1e6z5akUnyLZyI8ivfT5frxKPQh/view
[2]:https://drive.google.com/file/d/1weBd6kBiXoQ0I3TGmbpiHZm0dimByYnI/view
[3]:https://drive.google.com/file/d/13ZSP3up4leG0Etqo4A_gkFK1MeptGDCw/view