summaryrefslogtreecommitdiff
path: root/subversion/libsvn_fs_base/notes/structure
blob: 8e2159f29e1b8819befd922b9db1cca9d9e63993 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
Subversion on Berkeley DB                                    -*- text -*-

There are many different ways to implement the Subversion filesystem
interface.  You could implement it directly using ordinary POSIX
filesystem operations; you could build it using an SQL server as a
back end; you could build it on RCS; and so on.

This implementation of the Subversion filesystem interface is built on
top of Berkeley DB (http://www.sleepycat.com).  Berkeley DB supports
transactions and recoverability, making it well-suited for Subversion.



Nodes and Node Revisions

In a Subversion filesystem, a `node' corresponds roughly to an
`inode' in a Unix filesystem:

   * A node is either a file or a directory.

   * A node's contents change over time.

   * When you change a node's contents, it's still the same node; it's
     just been changed.  So a node's identity isn't bound to a specific
     set of contents.

   * If you rename a node, it's still the same node, just under a
     different name.  So a node's identity isn't bound to a particular
     filename.

A `node revision' refers to a node's contents at a specific point in
time.  Changing a node's contents always creates a new revision of that
node.  Once created, a node revision's contents never change.

When we create a node, its initial contents are the initial revision of
the node.  As users make changes to the node over time, we create new
revisions of that same node.  When a user commits a change that deletes
a file from the filesystem, we don't delete the node, or any revision
of it --- those stick around to allow us to recreate prior revisions of
the filesystem.  Instead, we just remove the reference to the node
from the directory.



ID's

Within the database, we refer to nodes and node revisions using a
string of three unique identifiers (the "node ID", the "copy ID", and
the "txn ID"), separated by periods.

    node_revision_id ::= node_id '.' copy_id '.' txn_id

The node ID is unique to a particular node in the filesystem across
all of revision history.  That is, two node revisions who share
revision history (perhaps because they are different revisions of the
same node, or because one is a copy of the other, e.g.) have the same
node ID, whereas two node revisions who have no common revision
history will not have the same node ID.

The copy ID is a key into the `copies' table (see `Copies' below), and
identifies that a given node revision, or one of its ancestors,
resulted from a unique filesystem copy operation.

The txn ID is just an identifier that is unique to a single filesystem
commit.  All node revisions created as part of a commit share this txn
ID (which, incidentally, gets its name from the fact that this id is
the same id used as the primary key of Subversion transactions; see
`Transactions' below).

A directory entry identifies the file or subdirectory it refers to
using a node revision ID --- not a node ID.  This means that a change
to a file far down in a directory hierarchy requires the parent
directory of the changed node to be updated, to hold the new node
revision ID.  Now, since that parent directory has changed, its parent
needs to be updated, and so on to the root.  We call this process
"bubble-up".

If a particular subtree was unaffected by a given commit, the node
revision ID that appears in its parent will be unchanged.  When
doing an update, we can notice this, and ignore that entire
subtree.  This makes it efficient to find localized changes in
large trees.



A Word About Keys

Some of the Subversion database tables use base-36 numbers as their
keys.  Some debate exists about whether the use of base-36 (as opposed
to, say, regular decimal values) is either necessary or good.  It is
outside the scope of this document to make a claim for or against this
usage.  As such, the reader will please note that for the majority of
the document, the use of the term "number" when referring to keys of
database tables should be interpreted to mean "a monotonically
increasing unique key whose order with respect to other keys in the
table is irrelevant".  :-)

To determine the actual type currently in use for the keys of a given
table, you are invited to check out the "Appendix: Filesystem
structure summary" section of this document.



NODE-REVISION: how we represent a node revision

We represent a given revision of a file or directory node using a list
skel (see include/private/svn_skel.h for an explanation of skels).
A node revision skel has the form:

    (HEADER PROP-KEY KIND-SPECIFIC ...)

where HEADER is a header skel, whose structure is common to all nodes,
PROP-KEY is the key of the representation that contains this node's
properties list, and the KIND-SPECIFIC elements carry data dependent
on what kind of node this is --- file, directory, etc.

HEADER has the form:

    (KIND CREATED-PATH [PRED-ID [PRED-COUNT [HAS-MERGEINFO MERGEINFO-COUNT]]])

where:

   * KIND indicates what sort of node this is.  It must be one of the
     following:
       - "file", indicating that the node is a file (see FILE below).
       - "dir", indicating that the node is a directory (see DIR below).

   * CREATED-PATH is the canonicalized absolute filesystem path at
     which this node was created.

   * PRED-ID, if present, indicates the node revision which is the
     immediate ancestor of this node.

   * PRED-COUNT, if present, indicates the number of predecessors the
     node revision has (recursively).

   * HAS-MERGEINFO and MERGEINFO-COUNT, if present, indicate ...
     ### TODO

Note that a node cannot change its kind from one revision to the next.
A directory node is always a directory; a file node is always a file;
etc.  The fact that the node's kind is stored in each node revision,
rather than in some revision-independent place, might suggest that
it's possible for a node to change kinds from revision to revision, but
Subversion does not allow this.

PROP-KEY is a key into the `representations' table (see REPRESENTATIONS 
below), whose value is a representation pointing to a string 
(see `strings' table) that is a PROPLIST skel.

The KIND-SPECIFIC portions are discussed below.



PROPLIST: a property list is a list skel of the form:

    (NAME1 VALUE1 NAME2 VALUE2 ...)

where each NAMEi is the name of a property, and VALUEi is the value of
the property named NAMEi.  Every valid property list has an even
number of elements.



FILE: how files are represented.

If a NODE-REVISION's header's KIND is "file", then the node-revision
skel represents a file, and has the form:

    (HEADER PROP-KEY DATA-INFO [EDIT-DATA-KEY])

where

    DATA-INFO ::= DATA-KEY | (DATA-KEY DATA-KEY-UNIQID)

and DATA-KEY identifies the representation for the file's current
contents, and EDIT-DATA-KEY identifies the representation currently
available for receiving new contents for the file.

DATA-KEY-UNIQID ...
### TODO

See discussion of representations later.



DIR: how directories are represented.

If the header's KIND is "dir", then the node-revision skel
represents a directory, and has the form:

    (HEADER PROP-KEY ENTRIES-KEY)

where ENTRIES-KEY identifies the representation for the directory's
entries list (see discussion of representations later).  An entries
list has the form

    (ENTRY ...)

where each entry is

    (NAME ID)

where:

   * NAME is the name of the directory entry, in UTF-8, and

   * ID is the ID of the node revision to which this entry refers



REPRESENTATIONS: where and how Subversion stores your data.

Some parts of a node revision are essentially constant-length: for
example, the KIND field and the REV.  Other parts can have
arbitrarily varying length: property lists, file contents, and
directory entry lists.  This variable-length data is often similar
from one revision to the next, so Subversion stores just the deltas
between them, instead of successive fulltexts.

The HEADER portion of a node revision holds the constant-length stuff,
which is never deltified.  The rest of a node revision just points to
data stored outside the node revision proper.  This design makes the
repository code easier to maintain, because deltification and
undeltification are confined to a layer separate from node revisions,
and makes the code more efficient, because Subversion can retrieve
just the parts of a node it needs for a given operation.

Deltifiable data is stored in the `strings' table, as mediated by the
`representations' table.  Here's how it works:

The `strings' table stores only raw bytes.  A given string could be
any one of these:

   - a file's contents
   - a delta that reconstructs file contents, or part of a file's contents
   - a directory entry list skel
   - a delta that reconstructs a dir entry list skel, or part of same
   - a property list skel
   - a delta that reconstructs a property list skel, or part of same

There is no way to tell, from looking at a string, what kind of data
it is.  A directory entry list skel is indistinguishable from file
contents that just happen to look exactly like the unparsed form of a
directory entry list skel.  File contents that just happen to look
like svndiff data are indistinguishable from delta data.

The code is able to interpret a given string because Subversion

   a) knows whether to be looking for a property list or some
      kind-specific data,

   b) knows the `kind' of the node revision in question,

   c) always goes through the `representations' table to discover if
      any undeltification or other transformation is needed.

The `representations' table is an intermediary between node revisions
and strings.  Node revisions never refer directly into the `strings'
table; instead, they always refer into the `representations' table,
which knows whether a given string is a fulltext or a delta, and if it
is a delta, what it is a delta against.  That, combined with the
knowledge in (a) and (b) above, allows Subversion to retrieve the data
and parse it appropriately.  A representation has the form:

   (HEADER KIND-SPECIFIC)

where HEADER is

   (KIND TXN [MD5 [SHA1]])

The KIND is "fulltext" or "delta".  TXN is the txn ID for the txn in
which this representation was created.  MD5 is a checksum of the
representation's contents, that is, what the representation produces,
regardless of whether it is stored deltified or as fulltext.  (For
compatibility with older versions of Subversion, MD5 may be
absent, in which case the filesystem behaves as though the checksum is
there and is correct.) An additional kind of checksum, SHA1, is present
in newer formats, starting with version ...
### TODO

The TXN also serves as a kind of mutability flag: if txn T tries to
change a representation's contents, but the rep's TXN is not T, then
something has gone horribly wrong and T should leave the rep alone
(and probably error).  Of course, "change a representation" here means
changing what the rep's consumer sees.  Switching a representation's
storage strategy, for example from fulltext to deltified, wouldn't
count as a change, since that wouldn't affect what the rep produces.

KIND-SPECIFIC varies considerably depending on the kind of
representation.  Here are the two forms currently recognized:

   (("fulltext" TXN [MD5 [SHA1]]) STRING-KEY)
       The data is at STRING-KEY in the `strings' table.

   (("delta" TXN [MD5 [SHA1]]) (OFFSET WINDOW) ...)
       Each OFFSET indicates the point in the fulltext that this
       element reconstructs, and WINDOW says how to reconstruct it:

       WINDOW ::= (DIFF SIZE REP-KEY [REP-OFFSET]) ;
       DIFF   ::= ("svndiff" VERSION STRING-KEY)

       Notice that a WINDOW holds only metadata.  REP-KEY says what
       the window should be applied against, or none if this is a
       self-compressed delta; SIZE says how much data this window
       reconstructs; VERSION says what version of the svndiff format
       is being used (currently only version 0 is supported); and
       STRING-KEY says which string contains the actual svndiff data
       (there is no diff data held directly in the representations
       table, of course).

       Note also that REP-KEY might refer to a representation that
       itself requires undeltification.  We use a delta combiner to
       combine all the deltas needed to reproduce the fulltext from
       some stored plaintext.

       Branko says this is what REP-OFFSET is for:
       > The offsets embedded in the svndiff are stored in a string;
       > these offsets would be in the representation. The point is that
       > you get all the information you need to select the appropriate
       > windows from the rep skel -- without touching a single
       > string. This means a bit more space used in the repository, but
       > lots less memory used on the server.

       We'll see if it turns out to be necessary.

In the future, there may be other representations, for example
indicating that the text is stored elsewhere in the database, or
perhaps in an ordinary Unix file.

Let's work through an example node revision:

   (("file" REV COUNT) PROP-KEY "2345")

The entry for key "2345" in `representations' is:

   (("delta" TXN CHECKSUM) (0 (("svndiff" 0 "1729") 65 "2343")))

and the entry for key "2343" in `representations' is:

   (("fulltext" TXN CHECKSUM) "1001")

while the entry for key "1729" in `strings' is:

   <some unprintable glob of svndiff data>

which, when applied to the fulltext at key "1001" in strings, results
in this new fulltext:

   "((some text) (that looks) (deceptively like) (directory entries))"

Et voila!  Subversion knew enough, via the `representations' and
`strings' tables, to undeltify and get that fulltext; and knew enough,
because of the node revision's "file" type, to interpret the result as
file contents, not as a directory entry list.

(Note that the `strings' table stores multiple DB values per key.
That is, although it's accurate to say there is one string per key,
the string may be divided into multiple consecutive blocks, all
sharing that key.  You use a Berkeley DB cursor to find the desired
value[s], when retrieving a particular offset+len in a string.)

Representations know nothing about ancestry -- the `representations'
table never refers to node revision id's, only to strings or to other
representations.  In other words, while the `nodes' table allows
recovery of ancestry information, the `representations' and `strings'
tables together handle deltification and undeltification
*independently* of ancestry.  At present, Subversion generally stores
the youngest strings in "fulltext" form, and older strings as "delta"s
against them (unless the delta would save no space compared to the
fulltext).  However, there's nothing magic about that particular
arrangement.  Other interesting alternatives:

   * We could store the N most recently accessed strings as fulltexts,
     letting access patterns determine the most appropriate
     representation for each revision.

   * We could occasionally store deltas against the N'th younger
     revision, storing larger jumps with a frequency inverse to the
     distance covered, yielding a tree-structured history.

Since the filesystem interface doesn't expose these details, we can
change the representation pretty much as we please to optimize
whatever parameter we care about --- storage size, speed, robustness,
etc.

Representations never share strings - every string is referred to by
exactly one representation.  This is so that when we change a
representation to a different form (e.g. during deltification), we can
delete the strings containing the old form, and know that we're not
messing up any other reps by doing so.


Further Notes On Deltifying:
----------------------------

When a representation is deltified, it is changed in place.
New strings are created containing the new delta, the representation
is changed to refer to the new strings, and the original (usually
fulltext) string or strings are deleted.

The node revisions referring to that representation will not be
changed; instead, the same rep key will now be associated with
different value.  That way, we get reader locking for free: if someone
is reading a file while Subversion is deltifying that file, one of the
two sides will get a DB_DEADLOCK and svn_fs__retry_txn() will retry.

### todo: add a note about cycle-checking here, too.



The Berkeley DB "nodes" table

The database contains a table called "nodes", which is a btree indexed
by node revision ID's, mapping them onto REPRESENTATION skels.  Node 0
is always the root directory, and node revision ID 0.0.0 is always the
empty directory.  We use the value of the key 'next-key' to indicate
the next unused node ID.

Assuming that we store the most recent revision on every branch as
fulltext, and all other revisions as deltas, we can retrieve any node
revision by searching for the last revision of the node, and then
walking backwards to specific revision we desire, applying deltas as
we go.



REVISION: filesystem revisions, and the Berkeley DB "revisions" table

We represent a filesystem revision using a skel of the form:

    ("revision" TXN)

where TXN is the key into the `transactions' table (see 'Transactions' below)
whose value is the transaction that was committed to create this revision.

The database contains a table called "revisions", which is a
record-number table mapping revision numbers onto REVISION skels.
Since Berkeley DB record numbers start with 1, whereas Subversion
filesystem revision numbers start at zero, revision V is stored as
record number V+1 in the `revisions' table.  Filesystem revision zero
always has node revision 0.0.0 as its root directory; that node
revision is guaranteed to be an empty directory.



Transactions

Every transaction ends when it is either successfully committed, or
aborted.  We call a transaction which has been either committed or
aborted "finished", and one which hasn't "unfinished".  

Transactions are identified by unique numbers, called transaction
ID's.  Currently, transaction ID's are never reused, though this is
not mandated by the schema.  In the database, we always represent a
transaction ID in its shortest ASCII form.

The Berkeley DB `transactions' table records both unfinished and
committed transactions.  Every key in this table is a transaction ID.
Unfinished transactions have values that are skels of one of the
following forms:

   ("transaction" ROOT-ID BASE-ID PROPLIST COPIES)
   ("dead" ROOT-ID BASE-ID PROPLIST COPIES)

where:

   * ROOT-ID is the node revision ID of the transaction's root
     directory.  This is of the form 0.0.THIS-TXN-ID.

   * BASE-ID is the node revision ID of the root of the transaction's
     base revision.  This is of the form 0.0.BASE-TXN-ID - the base
     transaction is, of course, the transaction of the base revision.

   * PROPLIST is a skel giving the revision properties for the
     transaction.

   * COPIES contains a list of keys into the `copies' table,
     referencing all the filesystem copies created inside of this
     transaction.  If the transaction is aborted, these copies get
     removed from the `copies' table.

   * A "dead" transaction is one that has been requested to be
     destroyed, and should never, ever, be committed.

Committed transaction, however, have values that are skels of the form:

   ("committed" ROOT-ID REV PROPLIST COPIES)

where:

   * ROOT-ID is the node revision ID of the committed transaction's (or
     revision's) root node.

   * REV represents the revision that was created when the
     transaction was committed.

   * PROPLIST is a skel giving the revision properties for the
     committed transaction.

   * COPIES contains a list of keys into the `copies' table,
     referencing all the filesystem copies created by this committed
     transaction.  Nothing currently uses this information for
     committed transactions, but it could be useful in the future.

As the sole exception to the rule above, the `transactions' table
always has one entry whose key is `next-key', and whose value is the
lowest transaction ID that has never yet been used.  We use this entry
to allocate ID's for new transactions.

The `transactions' table is a btree, with no particular sort order.



Changes

As modifications are made (files and dirs added or removed, text and
properties changed, etc.) on Subversion transaction trees, the
filesystem tracks the basic change made in the Berkeley DB `changes'
table.  

The `changes' table is a btree with Berkeley's "duplicate keys"
functionality (and with no particular sort order), and maps the
one-to-many relationship of a transaction ID to a "change" item.
Change items are skels of the form:

   ("change" PATH ID CHANGE-KIND TEXT-MOD PROP-MOD)

where:

   * PATH is the path that was operated on to enact this change.

   * ID is the node revision ID of the node changed.  The precise
     meaning varies based on the kind of the change:
      - "add" or "modify": a new node revision created in the current
        txn.
      - "delete": a node revision from a previous txn.
      - "replace": a replace operation actually acts on two node
        revisions, one being deleted, one being added.  Only the added
        node-revision ID is recorded in the `changes' table - this is
        a design flaw.
      - "reset": no node revision applies.  A zero atom is used as a
        placeholder.

   * CHANGE-KIND is one of the following:

     - "add"     : PATH/ID was added to the filesystem.
     - "delete"  : PATH/ID was removed from the filesystem.
     - "replace" : PATH/ID was removed, then re-added to the filesystem.
     - "modify"  : PATH/ID was otherwise modified.
     - "reset"   : Ignore any previous changes for PATH/ID in this txn.
                   This kind is no longer created by Subversion 1.3.0
                   and later, and can probably be removed at the next
                   schema bump.

   * TEXT-MOD is a bit specifying whether or not the contents of
     this node was modified.

   * PROP-MOD is a bit specifying whether or not the properties of
     this node where modified.

In order to fully describe the changes made to any given path as part
of a single transaction, one must read all the change items associated
with the transaction's ID, and "collapse" multiple entries that refer
to that path. 



Copies

Each time a filesystem copy operation is performed, Subversion records
meta-data about that copy.  

Copies are identified by unique numbers called copy ID's.  Currently,
copy ID's are never reused, though this is not mandated by the schema.
In the database, we always represent a copy ID in its shortest ASCII
form.

The Berkeley DB `copies' table records all filesystem copies.  Every
key in this table is copy ID, and every value is a skel of one of the
following forms:

   ("copy" SRC-PATH SRC-TXN DST-NODE-ID)
   ("soft-copy" SRC-PATH SRC-TXN DST-NODE-ID)

where:

   * "copy" indicates an explicitly requested copy, and "soft-copy"
     indicates a node that was cloned internally as part of an
     explicitly requested copy of some parent directory. See the
     section "Copies and Copy IDs" in the file <fs-history> for
     details.

   * SRC-PATH and SRC-TXN are the canonicalized absolute path and
     transaction ID, respectively, of the source of the copy.

   * DST-NODE-ID represents the new node revision created as a result
     of the copy.

As the sole exception to the rule above, the `copies' table always has
one entry whose key is `next-key', and whose value is the lowest copy ID
that has never yet been used.  We use this entry to allocate new
copy ID's.

The `copies' table is a btree, with no particular sort order.



Locks

When a caller locks a file -- reserving an exclusive right to modify
or delete it -- an lock object is created in a `locks' table.

The `locks' table is a btree whose key is a UUID string known as
a "lock-token", and whose value is a skel representing a lock.  The
fields in the skel mirror those of an svn_lock__t (see svn_types.h):

    ("lock" PATH TOKEN OWNER COMMENT XML-P CREATION-DATE EXPIRATION-DATE)

where:

   * PATH is the absolute filesystem path reserved by the lock.

   * TOKEN is the universally unique identifier of the lock, known
     as the lock-token.  This is the same as the row's key.

   * OWNER is the authenticated username that "owns" the lock.

   * COMMENT is a string describing the lock.  It may be empty, or it
     might describe the rationale for locking.

   * XML-P is a boolean (either 0 or 1) indicating whether the COMMENT
     field is wrapped in an XML tag.  (This is something only used by
     the DAV layer, for webdav interoperabliity.)

   * CREATION-DATE is a string representation of the date/time when
     the lock was created.  (see svn_time_to_cstring())

   * EXPIRATION-DATE is a string representation of the date/time when
     the lock will cease to be valid.  (see svn_time_to_cstring())

In addition to creating a lock in the `locks' table, a new row is
created in a `lock-tokens' table.  The `lock-tokens' table is a btree
whose key is an absolute path in the filesystem.  The value of each
key is a lock-token (which is a key into the `locks' table.)

To test if a path is locked, simply check if the path is a key in the
`lock-tokens' table.  To see if a certain directory has any locked
children below, we ask BerkeleyDB to do a "greater or equal match" on
the directory path, and see if any results come back.  If they do,
then at least one of the directory's children is locked, and thus the
directory cannot be deleted without further investigation.

Locks are ephemeral things, not historied in any way.  They are
potentially created and deleted quite often.  When a lock is
destroyed, the appropriate row is removed from the `locks' table.
Additionally, the locked-path is removed from the `lock-tokens' table.





Merge rules

The Subversion filesystem must provide the following characteristics:

- clients can submit arbitrary rearrangements of the tree, to be
  performed as atomic changes to the filesystem tree
- multiple clients can submit non-overlapping changes at the same time,
  without blocking
- readers must never block other readers or writers
- writers must never block readers
- writers may block writers

Merging rules:

   The general principle: a series of changes can be merged iff the
   final outcome is independent of the order you apply them in.

Merging algorithm:

   For each entry NAME in the directory ANCESTOR:
  
     Let ANCESTOR-ENTRY, SOURCE-ENTRY, and TARGET-ENTRY be the IDs of
     the name within ANCESTOR, SOURCE, and TARGET respectively.
     (Possibly null if NAME does not exist in SOURCE or TARGET.)
  
     If ANCESTOR-ENTRY == SOURCE-ENTRY, then:
       No changes were made to this entry while the transaction was in
       progress, so do nothing to the target.
  
     Else if ANCESTOR-ENTRY == TARGET-ENTRY, then:
       A change was made to this entry while the transaction was in
       process, but the transaction did not touch this entry.  Replace
       TARGET-ENTRY with SOURCE-ENTRY.
  
     Else:
       Changes were made to this entry both within the transaction and
       to the repository while the transaction was in progress.  They
       must be merged or declared to be in conflict.
  
       If SOURCE-ENTRY and TARGET-ENTRY are both null, that's a
       double delete; if one of them is null, that's a delete versus
       a modification. In any of these cases, flag a conflict.
  
       If any of the three entries is of type file, declare a conflict.
  
       If either SOURCE-ENTRY or TARGET-ENTRY is not a direct
       modification of ANCESTOR-ENTRY (determine by comparing the
       node-id fields), declare a conflict.  A replacement is
       incompatible with a modification or other replacement--even
       an identical replacement.
  
       Direct modifications were made to the directory ANCESTOR-ENTRY
       in both SOURCE and TARGET.  Recursively merge these
       modifications.
  
   For each leftover entry NAME in the directory SOURCE:
  
     If NAME exists in TARGET, declare a conflict.  Even if SOURCE and
     TARGET are adding exactly the same thing, two additions are not
     auto-mergeable with each other.
  
     Add NAME to TARGET with the entry from SOURCE.
  
   Now that we are done merging the changes from SOURCE into the
   directory TARGET, update TARGET's predecessor to be SOURCE.

The following algorithm was used when the Subversion filesystem was
initially written, but has been replaced with the simpler and more
performant algorithm above:

   Merging two nodes, A and B, with respect to a common ancestor
   ANCESTOR:
   
   - First, the merge fails unless A, B, and ANCESTOR are all the same
     kind of node.
   - If A and B are text files:
     - If A is an ancestor of B, then B is the merged result.
     - If A is identical to B, then B (arbitrarily) is the merged
       result.
     - Otherwise, the merge fails.
   - If A and B are both directories:
     - For every directory entry E in either A, B, or ANCESTOR, here
       are the cases:
         - E exists in neither ANCESTOR nor A.
         - E doesn't exist in ANCESTOR, and has been added to A.
         - E exists in ANCESTOR, but has been deleted from A.
         - E exists in both ANCESTOR and A ...
           - but refers to different nodes.
           - but refers to different revisions of the same node.
           - and refers to the same node revision.
   
       The same set of possible relationships with ANCESTOR holds for B,
       so there are thirty-six combinations.  The matrix is symmetrical
       with A and B reversed, so we only have to describe one triangular
       half, including the diagonal --- 21 combinations.
   
       - (6) E exists in neither ANCESTOR nor A:
         - (1) E exists in neither ANCESTOR nor B.  Can't occur, by
           assumption that E exists in either A, B, or ancestor.
         - (1) E has been added to B.  Add E in the merged result. ***
         - (1) E has been deleted from B.  Can't occur, by assumption
           that E doesn't exist in ANCESTOR.
         - (3) E exists in both ANCESTOR and B.  Can't occur, by
           assumption that E doesn't exist in ancestor.
       - (5) E doesn't exist in ANCESTOR, and has been added to A.
         - (1) E doesn't exist in ANCESTOR, and has been added to B.
           Conflict.
         - (1) E exists in ANCESTOR, but has been deleted from B.
           Can't occur, by assumption that E doesn't exist in
           ANCESTOR.
         - (3) E exists in both ANCESTOR and B.  Can't occur, by
           assumption that E doesn't exist in ANCESTOR.
       - (4) E exists in ANCESTOR, but has been deleted from A.
         - (1) E exists in ANCESTOR, but has been deleted from B.  If
           neither delete was a result of a rename, then omit E from
           the merged tree.  *** Otherwise, conflict.
         - E exists in both ANCESTOR and B ...
           - (1) but refers to different nodes.  Conflict.
           - (1) but refers to different revisions of the same node.
             Conflict.
           - (1) and refers to the same node revision.  Omit E from
             the merged tree. ***
       - (3) E exists in both ANCESTOR and A, but refers to different
         nodes.
         - (1) E exists in both ANCESTOR and B, but refers to
           different nodes.  Conflict.
         - (1) E exists in both ANCESTOR and B, but refers to
           different revisions of the same node.  Conflict.
         - (1) E exists in both ANCESTOR and B, and refers to the same
           node revision.  Replace E with A's node revision.  ***
       - (2) E exists in both ANCESTOR and A, but refers to different
         revisions of the same node.
         - (1) E exists in both ANCESTOR and B, but refers to
           different revisions of the same node.  Try to merge A/E and
           B/E, recursively.  ***
         - (1) E exists in both ANCESTOR and B, and refers to the same
           node revision.  Replace E with A's node revision.  ***
       - (1) E exists in both ANCESTOR and A, and refers to the same
         node revision.
         - (1) E exists in both ANCESTOR and B, and refers to the same
           node revision.  Nothing has happened to ANCESTOR/E, so no
           change is necessary.
   
   *** == something actually happens


Non-Historical Properties

[[Yes, do tell.]]


UUIDs: Universally Unique Identifiers

Every filesystem has a UUID.  This is represented as record #1 in the
`uuids' table.


Layers

In previous structurings of the code, I had trouble keeping track of
exactly who has implemented which promises, based on which other
promises from whom.

I hope the arrangement below will help me keep things straight, and
make the code more reliable.  The files are arranged in order from
low-level to high-level: each file depends only on services provided
by the files before it.

skel.c, id.c, dbt.c, convert-size.c

                Low-level utility functions.

fs_skels.c      Routines for marshaling between skels and native FS types.

fs.c            Creating and destroying filesystem objects.

err.c           Error handling.

nodes-table.c, txn-table.c, rev-table.c, reps-table.c, strings-table.c

                Create and open particular database tables.
                Responsible for intra-record consistency.

node-rev.c      Creating, reading, and writing node revisions.
                Responsible for deciding what gets deltified when.

reps-strings.c
                Retrieval and storage of represented strings.
                This will handle delta-based storage,

dag.c           Operations on the DAG filesystem.  "DAG" because the
                interface exposes the filesystem's sharing structure.
                Enforce inter-record consistency.

tree.c          Operations on the tree filesystem.  This layer is
                built on top of dag.c, but transparently distinguishes
                virtual copies, making the underlying DAG look like a
                real tree.  This makes incomplete transactions behave
                like ordinary mutable filesystems.

delta.c         Computing deltas.



Appendix: Filesystem structure summary
======================================

Berkeley DB tables
------------------

                "nodes" : btree(ID -> NODE-REVISION, "next-key" -> NODE-ID)
            "revisions" : recno(REVISION)
         "transactions" : btree(TXN -> TRANSACTION, "next-key" -> TXN)
              "changes" : btree(TXN -> CHANGE)
               "copies" : btree(CPY -> COPY, "next-key" -> CPY)
              "strings" : btree(STR -> STRING, "next-key" -> STR)
      "representations" : btree(REP -> REPRESENTATION, "next-key" -> REP)
                "uuids" : recno(UUID)
                "locks" : btree(TOKEN -> LOCK)
          "lock-tokens" : btree(PATH -> TOKEN)
         "node-origins" : btree(NODE-ID -> ID)
        "checksum-reps" : btree(SHA1SUM -> REP, "next-key" -> number-36)
        "miscellaneous" : btree(STRING -> STRING)

Syntactic elements
------------------

Table keys:

                     ID ::= NODE-REV-ID ;
                    TXN ::= number-36 ;
                    CPY ::= number-36 ;
                    STR ::= number-36 ;
                    REP ::= number-36 ;
                  TOKEN ::= uuid ;

Property lists:

               PROPLIST ::= (PROP ...) ;
                   PROP ::= atom atom ;


Filesystem revisions:

               REVISION ::= ("revision" TXN) ;


Transactions:

            TRANSACTION ::= UNFINISHED-TXN | COMMITTED-TXN | DEAD-TXN
         UNFINISHED-TXN ::= ("transaction" ROOT-ID BASE-ID PROPLIST COPIES) ;
          COMMITTED-TXN ::= ("committed" ROOT-ID REV PROPLIST COPIES) ;
               DEAD-TXN ::= ("dead" ROOT-ID BASE-ID PROPLIST COPIES) ;
                ROOT-ID ::= NODE-REV-ID ;
                BASE-ID ::= NODE-REV-ID ;
                 COPIES ::= (CPY ...) ;
                    REV ::= number ;


Changes:

                 CHANGE ::= ("change" PATH ID CHANGE-KIND TEXT-MOD PROP-MOD) ;
            CHANGE-KIND ::= "add" | "delete" | "replace" | "modify" | "reset";
               TEXT-MOD ::= atom ;
               PROP-MOD ::= atom ;


Copies:

                   COPY ::= REAL-COPY | SOFT-COPY
              REAL-COPY ::= ("copy" SRC-PATH SRC-TXN DST-NODE-ID)
              SOFT-COPY ::= ("soft-copy" SRC-PATH SRC-TXN DST-NODE-ID)
               SRC-PATH ::= atom ;
                SRC-TXN ::= TXN ;
            DST-NODE-ID ::= NODE-REV-ID ;


Entries lists:

                ENTRIES ::= (ENTRY ...) ;
                  ENTRY ::= (NAME ID) ;
                   NAME ::= atom ;


Node revisions:

          NODE-REVISION ::= FILE | DIR ;
                   FILE ::= (HEADER PROP-KEY DATA-INFO [EDIT-DATA-KEY]) ;
                    DIR ::= (HEADER PROP-KEY ENTRIES-KEY) ; 
                 HEADER ::= (KIND CREATED-PATH 
                             [PRED-ID [PRED-COUNT 
                              [HAS-MERGEINFO MERGEINFO-COUNT]]]) ;
                   KIND ::= "file" | "dir" ;
                PRED-ID ::= NODE-REV-ID | "";
             PRED-COUNT ::= number | "" ;
           CREATED-PATH ::= atom ;
               PROP-KEY ::= atom ;
              DATA-INFO ::= DATA-KEY | (DATA-KEY DATA-KEY-UNIQID)
               DATA-KEY ::= atom ;
        DATA-KEY-UNIQID ::= atom ;
          EDIT-DATA-KEY ::= atom ;
          HAS-MERGEINFO ::= "0" | "1" ;
        MERGEINFO-COUNT ::= number ;


Representations:

         REPRESENTATION ::= FULLTEXT | DELTA ;
               FULLTEXT ::= (HEADER STRING-KEY) ;
                  DELTA ::= (HEADER (OFFSET WINDOW) ...) ;
                 WINDOW ::= (DIFF SIZE REP-KEY [REP-OFFSET]) ;
                   DIFF ::= ("svndiff" VERSION STRING-KEY) ;
                VERSION ::= number ;
                REP-KEY ::= atom ;
             STRING-KEY ::= atom ;
                 OFFSET ::= number ;
             REP-OFFSET ::= number ;

                 HEADER ::= (KIND TXN [MD5 [SHA1]]) ;
                   KIND ::= "fulltext" | "delta" ;
             
                   SIZE ::= number ;
                    MD5 ::= ("md5" MD5SUM) ;
                   SHA1 ::= ("sha1" SHA1SUM) ;
                 MD5SUM ::= atom ;
                SHA1SUM ::= atom ;


Strings:

                 STRING ::= RAWTEXT | LISTTEXT | DIFFTEXT
                RAWTEXT ::= /{anything.class}*/ ;
               LISTTEXT ::= list ;
               DIFFTEXT ::= /{anything.class}*/ ;


Node revision IDs:

            NODE-REV-ID ::= NODE-ID '.' CPY '.' TXN ;
                NODE-ID ::= number ;

UUIDs:
                   UUID ::= uuid ;


Locks:

                   LOCK ::= ("lock" PATH TOKEN OWNER 
                                    COMMENT XML-P CR-DATE [X-DATE]);
                   PATH ::= atom ;
                  OWNER ::= atom ;
                COMMENT ::= atom ;
                  XML-P ::= "0" | "1" ;
                CR-DATE ::= atom ;
                 X-DATE ::= atom ;

Lock tokens:

               (the value is just a lock-token, which is a uuid)


Node origins:

                NODE-ID ::= NODE-REV-ID ;


Lexical elements
----------------

UUIDs:

                   uuid ::= hexits-32 '-' hexits-16 '-' hexits-16 '-'
                            hexits-16 '-' hexits-48 ;
 
Numbers:

                 number ::= /{digit.class}+/ ;
              number-36 ::= /{base36.class}+/ ;
              hexits-32 ::= /{base16.class}{8}/ ;
              hexits-16 ::= /{base16.class}{4}/ ;
              hexits-48 ::= /{base16.class}{12}/ ;

(Note: the following are described in skel.h)
Skels:

                   skel ::= atom | list;
                   list ::= list.head list.body.opt list.tail ;
                   atom ::= atom.imp-len | atom.exp-len ;

              list.head ::= '(' spaces.opt ;
              list.tail ::= spaces.opt ')' ;
          list.body.opt ::=  | list.body ;
              list.body ::= skel | list.body spaces.opt skel ;

           atom.imp-len ::= /{name.class}[^\(\){ws.class}]*/ ;
           atom.exp-len ::= /({digit.class}+){ws.class}.{\1}/ ;

             spaces.opt ::= /{ws.class}*/ ;


Character classes:

               ws.class ::= [\t\n\f\r\ ] ;
            digit.class ::= [0-9] ;
             name.class ::= [A-Za-z] ;
           base16.class ::= [0-9a-f]
           base36.class ::= [a-z0-9]
         anything.class ::= anything at all ;



Appendix: 'miscellaneous' table contents
======================================

The 'miscellaneous' table contains string keys mapped to string
values.  Here is a table of the supported keys, the descriptions of
their values, and the filesystem format version in which they were
introduced.

   Fmt   Key                  Value
   ---   ------------------   ------------------------------------
    4    forward-delta-rev    Youngest revision in the repository as of
                              the moment when it was upgraded to support
                              forward deltas.