Excellent additional ideas from Greg A. Woods.

author: Martin Pool <mbp@samba.org> 2002-03-26 11:09:35 +0000
committer: Martin Pool <mbp@samba.org> 2002-03-26 11:09:35 +0000
commit: ab9c58efbbb32349d733f4c8f06b6c97acc61287 (patch)
tree: 05763752f3ca9b61b227f63ead32edca65b763c9
parent: 5b491d3451014e85eeee9a7d0d967ea51f991538 (diff)
download: rsync-ab9c58efbbb32349d733f4c8f06b6c97acc61287.tar.gz
1 files changed, 79 insertions, 15 deletions
diff --git a/rsync3.txt b/rsync3.txt
index 42d77dca..77b38595 100644
--- a/rsync3.txt
+++ b/rsync3.txt
@@ -192,8 +192,9 @@ Command-line options:
 
 Scripting issues:
 
-  - Perhaps support multiple scripting languages: candidates include
-    Perl, Python, Tcl, Scheme (guile?), sh, ...
+  - Perhaps support multiple scripting languages:  candidates include
+    Perl, Python, Tcl, lisp (librep?), Scheme (siod, guile, elk,
+    minischeme, Kali, STk?), sh, ICI, Lua, Ruby, Pike, smalltalk...
 
   - Simply running a subprocess and looking at its stdout/exit code
     might be sufficient, though it could also be pretty slow if it's
@@ -208,13 +209,30 @@ Scripting issues:
 
   - Tcl is broken Lisp.
 
+  - librep is designed for embedding.
+
   - Lots of sysadmins know Perl, though Perl can give some bizarre or
     confusing errors.  The built in stat operators and regexps might
     be useful.
 
-  - Sadly probably not enough people know Scheme.
+  - Sadly probably not enough people know Scheme, but with the number of
+    scheme-based application scripting languages they're going to have
+    to learn it anyway!
+
+    - siod is designed for embedding and is very small.
+
+    - kali is designed for handling distributed executable content.
+
+    - elk & guile are both designed for embedding.
+
+  - sh is hard to embed and even a full POSIX shell leaves a lot to be
+    desired as a useful programming language.
 
-  - sh is hard to embed.
+  - Ruby is truly object-oriented.
+
+  - ICI or Pike will keep C programmers happy.
+
+  - Lua is simple to learn and small and designed for embedding.
 
 
 Scripting hooks:
@@ -396,18 +414,64 @@ Conflict resolution:
     would be useful.
 
 
-Moved files: <http://rsync.samba.org/cgi-bin/rsync.fom?file=44>
-
-  - There's no trivial way to detect renamed files, especially if they
-    move between directories.
-
-  - If we had a picture of the remote directory from last time on
-    either machine, then the inode numbers might give us a hint about
-    files which may have been renamed.
+Moved files:
 
   - Files that are renamed and not modified can be detected by
-    examining the directory listing, looking for files with the same
-    size/date as the origin.
+    pre-calculating whole-file hash (MD5?) signatures for all files in
+    the target hierarchy (source files need only have their whole-file
+    hash calculated just before they would be transferred).
+
+    - whenever you're about to copy a whole file to the target hierarchy
+      (there's no matching filename in the target directory) first
+      search for a matching file already in the target hierarchy and if
+      one is found:
+
+      - if the matching file is missing in the source directory then
+        first try to create the new target file with a hard link
+        (presumably the source file will be deleted, if deletions in the
+        target hierarchy are permitted by the command-line/config options)
+
+      - if the source file and target directory are on different
+        machines then simply make the copy locally within the target
+        hierarchy on the target machine
+
+      - if the source file and target directory are on the same machine
+        then make the copy from whichever file is on a different
+        filesystem (st_dev) from the target directory [it is possible
+        the target hierarchy spans two filesystems and thus the existing
+        copy in the target might be in a different filesystem from the
+        target directory]
+
+    - whenever updating a target file with the normal rsync algorithm
+      first search for duplicates of the current target's whole-file
+      hash value and then update all identical targets simultaneously
+      with the same data blocks from the source file.  Remember the
+      source file's whole-file hash value so that when each of the
+      updated targets is encountered in the source hierarchy the
+      matching source file can be checked to be sure it too is still
+      identical to the initially encountered source file that the update
+      was done from.  [if the source file in the matching location for
+      an already updated duplicate turns out to be different from the
+      source file used to update the duplicate then perhaps it would be
+      good, at least when on different machines, to have a saved copy of
+      the un-touched target so that the previous updates to it can be
+      quickly undone, but this complicates cleanup quite a bit]
+      
+    - all deleted files are handled normally.
+
+    - all file meta-data are handled normally.
+
+  - There's no trivial way to detect renamed and modified files, though
+    by also pre-calculating the hash signatures for each block of each
+    file in the target hierarchy then fuzzy matching heuristics (eg. if
+    more than some percentage of blocks are identical) could identify
+    new files which have many blocks in common and thus which could
+    first be copied locally on the target and then updated with the
+    normal rsync algorithm.  Keeping all this data for very large
+    hierarchies might still be too expensive though so perhaps it should
+    only be done if some noticeable percentage of large files (savings
+    are only possible if the files are multiple blocks in length) in the
+    target hierarchy are apparently missing and would need copying.
 
 
 Filesystem migration:
@@ -466,4 +530,4 @@ Related work:
   - http://freshmeat.net/search/?site=Freshmeat&q=mirror&section=projects
 
   - BitTorrent -- p2p mirroring
-    http://bitconjurer.org/BitTorrent/ 
-\ No newline at end of file
+    http://bitconjurer.org/BitTorrent/
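
The "Moved files" notes added by this patch describe detecting renamed but unmodified files through whole-file hashes. The following is a minimal Python sketch of that planning step, not rsync code: the function names (file_digest, index_target, plan_transfer) and the action labels are hypothetical, both trees are assumed to be local directories, and the st_dev (cross-filesystem) case and the simultaneous update of duplicate targets are left out. As a simplification it hashes every source file that is missing from the target while building the plan.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 16):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def index_target(target_root):
    """Pre-calculate digests: map digest -> relative paths in the target."""
    index = defaultdict(list)
    for dirpath, _dirs, files in os.walk(target_root):
        for name in files:
            full = os.path.join(dirpath, name)
            index[file_digest(full)].append(os.path.relpath(full, target_root))
    return index


def plan_transfer(source_root, target_root):
    """Return a sorted list of (action, relative_path, local_origin_or_None).

    Actions (hypothetical labels):
      update     - same filename exists in the target; use the normal algorithm
      copy       - no identical file in the target; transfer the whole file
      hardlink   - an identical target file exists whose name is gone from
                   the source, so the file was probably renamed
      local-copy - an identical target file exists and is still present in
                   the source, so duplicate it within the target
    """
    index = index_target(target_root)
    plan = []
    for dirpath, _dirs, files in os.walk(source_root):
        for name in files:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, source_root)
            if os.path.exists(os.path.join(target_root, rel)):
                plan.append(("update", rel, None))
                continue
            # Only hash a source file when it would otherwise be copied whole.
            matches = index.get(file_digest(src), [])
            if not matches:
                plan.append(("copy", rel, None))
                continue
            origin = matches[0]
            if not os.path.exists(os.path.join(source_root, origin)):
                plan.append(("hardlink", rel, origin))
            else:
                plan.append(("local-copy", rel, origin))
    return sorted(plan)
```

Calling plan_transfer(source_root, target_root) only produces a plan; carrying it out (os.link for hardlink, a local copy otherwise, then normal deletion and meta-data handling) is a separate step.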
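
For renamed and modified files, the patch suggests per-block hashes plus a fuzzy-matching heuristic (some percentage of identical blocks). The sketch below illustrates that idea under stated assumptions: block_hashes, best_basis, the 0.5 threshold and the tiny block size are all hypothetical, files are passed in as bytes, and only block-aligned matches are counted, whereas the real rsync algorithm uses a rolling checksum to find matches at any offset.

```python
import hashlib

BLOCK_SIZE = 4  # tiny, for illustration only; real blocks would be far larger


def block_hashes(data, block_size=BLOCK_SIZE):
    """Return the set of MD5 digests of the fixed-size blocks of `data`."""
    return {
        hashlib.md5(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    }


def best_basis(new_data, candidates, threshold=0.5, block_size=BLOCK_SIZE):
    """Pick the candidate sharing the largest fraction of blocks with new_data.

    `candidates` maps a name to that file's bytes. Returns (name, fraction),
    or (None, 0.0) if no candidate reaches `threshold`.
    """
    wanted = block_hashes(new_data, block_size)
    best_name, best_fraction = None, 0.0
    if not wanted:
        return best_name, best_fraction
    for name, data in sorted(candidates.items()):
        shared = len(wanted & block_hashes(data, block_size))
        fraction = shared / len(wanted)
        if fraction >= threshold and fraction > best_fraction:
            best_name, best_fraction = name, fraction
    return best_name, best_fraction
```

A file chosen this way would first be copied locally on the target and then brought up to date with the normal algorithm; as the notes point out, keeping block hashes for a very large hierarchy may cost more than it saves.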