diff options
Diffstat (limited to 'doc/impl.html')
-rw-r--r-- | doc/impl.html | 228 |
1 files changed, 0 insertions, 228 deletions
diff --git a/doc/impl.html b/doc/impl.html deleted file mode 100644 index b190d2c..0000000 --- a/doc/impl.html +++ /dev/null @@ -1,228 +0,0 @@ -<!DOCTYPE html> -<html> -<head> -<link rel="stylesheet" type="text/css" href="doc.css" /> -<title>Leveldb file layout and compactions</title> -</head> - -<body> - -<h1>Files</h1> - -The implementation of leveldb is similar in spirit to the -representation of a single -<a href="http://labs.google.com/papers/bigtable.html"> -Bigtable tablet (section 5.3)</a>. -However the organization of the files that make up the representation -is somewhat different and is explained below. - -<p> -Each database is represented by a set of file stored in a directory. -There are several different types of files as documented below: -<p> -<h2>Log files</h2> -<p> -A log file (*.log) stores a sequence of recent updates. Each update -is appended to the current log file. When the log file reaches a -pre-determined size (approximately 1MB by default), it is converted -to a sorted table (see below) and a new log file is created for future -updates. -<p> -A copy of the current log file is kept in an in-memory structure (the -<code>memtable</code>). This copy is consulted on every read so that read -operations reflect all logged updates. -<p> -<h2>Sorted tables</h2> -<p> -A sorted table (*.sst) stores a sequence of entries sorted by key. -Each entry is either a value for the key, or a deletion marker for the -key. (Deletion markers are kept around to hide obsolete values -present in older sorted tables). -<p> -The set of sorted tables are organized into a sequence of levels. The -sorted table generated from a log file is placed in a special <code>young</code> -level (also called level-0). When the number of young files exceeds a -certain threshold (currently four), all of the young files are merged -together with all of the overlapping level-1 files to produce a -sequence of new level-1 files (we create a new level-1 file for every -2MB of data.) -<p> -Files in the young level may contain overlapping keys. However files -in other levels have distinct non-overlapping key ranges. Consider -level number L where L >= 1. When the combined size of files in -level-L exceeds (10^L) MB (i.e., 10MB for level-1, 100MB for level-2, -...), one file in level-L, and all of the overlapping files in -level-(L+1) are merged to form a set of new files for level-(L+1). -These merges have the effect of gradually migrating new updates from -the young level to the largest level using only bulk reads and writes -(i.e., minimizing expensive seeks). - -<h2>Large value files</h2> -<p> -Each large value (greater than 64KB by default) is placed in a large -value file (*.val) of its own. An entry is maintained in the log -and/or sorted tables that maps from the corresponding key to the -name of this large value file. The name of the large value file -is derived from a SHA1 hash of the value and its length so that -identical values share the same file. -<p> -<h2>Manifest</h2> -<p> -A MANIFEST file lists the set of sorted tables that make up each -level, the corresponding key ranges, and other important metadata. -A new MANIFEST file (with a new number embedded in the file name) -is created whenever the database is reopened. The MANIFEST file is -formatted as a log, and changes made to the serving state (as files -are added or removed) are appended to this log. -<p> -<h2>Current</h2> -<p> -CURRENT is a simple text file that contains the name of the latest -MANIFEST file. -<p> -<h2>Info logs</h2> -<p> -Informational messages are printed to files named LOG and LOG.old. -<p> -<h2>Others</h2> -<p> -Other files used for miscellaneous purposes may also be present -(LOCK, *.dbtmp). - -<h1>Level 0</h1> -When the log file grows above a certain size (1MB by default): -<ul> -<li>Write the contents of the current memtable to an sstable -<li>Replace the current memtable by a brand new empty memtable -<li>Switch to a new log file -<li>Delete the old log file and the old memtable -</ul> -Experimental measurements show that generating an sstable from a 1MB -log file takes ~12ms, which seems like an acceptable latency hiccup to -add infrequently to a log write. - -<p> -The new sstable is added to a special level-0 level. level-0 contains -a set of files (up to 4 by default). However unlike other levels, -these files do not cover disjoint ranges, but may overlap each other. - -<h1>Compactions</h1> - -<p> -When the size of level L exceeds its limit, we compact it in a -background thread. The compaction picks a file from level L and all -overlapping files from the next level L+1. Note that if a level-L -file overlaps only part of a level-(L+1) file, the entire file at -level-(L+1) is used as an input to the compaction and will be -discarded after the compaction. Aside: because level-0 is special -(files in it may overlap each other), we treat compactions from -level-0 to level-1 specially: a level-0 compaction may pick more than -one level-0 file in case some of these files overlap each other. - -<p> -A compaction merges the contents of the picked files to produce a -sequence of level-(L+1) files. We switch to producing a new -level-(L+1) file after the current output file has reached the target -file size (2MB). We also switch to a new output file when the key -range of the current output file has grown enough to overlap more then -ten level-(L+2) files. This last rule ensures that a later compaction -of a level-(L+1) file will not pick up too much data from level-(L+2). - -<p> -The old files are discarded and the new files are added to the serving -state. - -<p> -Compactions for a particular level rotate through the key space. In -more detail, for each level L, we remember the ending key of the last -compaction at level L. The next compaction for level L will pick the -first file that starts after this key (wrapping around to the -beginning of the key space if there is no such file). - -<p> -Compactions drop overwritten values. They also drop deletion markers -if there are no higher numbered levels that contain a file whose range -overlaps the current key. - -<h2>Timing</h2> - -Level-0 compactions will read up to four 1MB files from level-0, and -at worst all the level-1 files (10MB). I.e., we will read 14MB and -write 14MB. - -<p> -Other than the special level-0 compactions, we will pick one 2MB file -from level L. In the worst case, this will overlap ~ 12 files from -level L+1 (10 because level-(L+1) is ten times the size of level-L, -and another two at the boundaries since the file ranges at level-L -will usually not be aligned with the file ranges at level-L+1). The -compaction will therefore read 26MB and write 26MB. Assuming a disk -IO rate of 100MB/s (ballpark range for modern drives), the worst -compaction cost will be approximately 0.5 second. - -<p> -If we throttle the background writing to something small, say 10% of -the full 100MB/s speed, a compaction may take up to 5 seconds. If the -user is writing at 10MB/s, we might build up lots of level-0 files -(~50 to hold the 5*10MB). This may signficantly increase the cost of -reads due to the overhead of merging more files together on every -read. - -<p> -Solution 1: To reduce this problem, we might want to increase the log -switching threshold when the number of level-0 files is large. Though -the downside is that the larger this threshold, the larger the delay -that we will add to write latency when a write triggers a log switch. - -<p> -Solution 2: We might want to decrease write rate artificially when the -number of level-0 files goes up. - -<p> -Solution 3: We work on reducing the cost of very wide merges. -Perhaps most of the level-0 files will have their blocks sitting -uncompressed in the cache and we will only need to worry about the -O(N) complexity in the merging iterator. - -<h2>Number of files</h2> - -Instead of always making 2MB files, we could make larger files for -larger levels to reduce the total file count, though at the expense of -more bursty compactions. Alternatively, we could shard the set of -files into multiple directories. - -<p> -An experiment on an <code>ext3</code> filesystem on Feb 04, 2011 shows -the following timings to do 100K file opens in directories with -varying number of files: -<table class="datatable"> -<tr><th>Files in directory</th><th>Microseconds to open a file</th></tr> -<tr><td>1000</td><td>9</td> -<tr><td>10000</td><td>10</td> -<tr><td>100000</td><td>16</td> -</table> -So maybe even the sharding is not necessary on modern filesystems? - -<h1>Recovery</h1> - -<ul> -<li> Read CURRENT to find name of the latest committed MANIFEST -<li> Read the named MANIFEST file -<li> Clean up stale files -<li> We could open all sstables here, but it is probably better to be lazy... -<li> Convert log chunk to a new level-0 sstable -<li> Start directing new writes to a new log file with recovered sequence# -</ul> - -<h1>Garbage collection of files</h1> - -<code>DeleteObsoleteFiles()</code> is called at the end of every -compaction and at the end of recovery. It finds the names of all -files in the database. It deletes all log files that are not the -current log file. It deletes all table files that are not referenced -from some level and are not the output of an active compaction. It -deletes all large value files that are not referenced from any live -table or log file. - -</body> -</html> |