trying to fix the mystical bug

If prev is not null and no match is found, always encode both remaining chunks. Previously, some chunks of chunkSize could have been stored as TempChunks in the recipe instead of as StoredChunks with hashes and an Id. This did not fix the mystical bug, but it helped to find where it came from.
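
The behavioural change described in this message can be illustrated with a small, self-contained sketch. It is not the repository's code: `Chunk`, `encodeTemp`, `merge`, `encodePairOld`, `encodePairNew` and the `chunkSize` value are all hypothetical stand-ins, and delta matching is assumed to always fail so that only the fallback path is exercised.

```go
package main

// chunkSize is a hypothetical fixed chunk size used only by this sketch.
const chunkSize = 8

// Chunk is a simplified stand-in for both TempChunk and StoredChunk:
// Stored is true once the chunk has been given an ID.
type Chunk struct {
	Data   []byte
	Stored bool
	ID     int
}

// encodeTemp plays the role of encodeTempChunk in this sketch. It never
// finds a delta match (second result is always false) and promotes only
// full-size chunks to stored chunks, taking the next free ID from last.
func encodeTemp(c Chunk, last *int) (Chunk, bool) {
	if len(c.Data) == chunkSize {
		c.Stored, c.ID = true, *last
		*last++
	}
	return c, false
}

// merge concatenates two chunks into a new temporary chunk.
func merge(a, b Chunk) Chunk {
	data := append(append([]byte{}, a.Data...), b.Data...)
	return Chunk{Data: data}
}

// encodePairOld models the behaviour before the commit: when the merged
// chunk finds no match, prev and curr are returned untouched, so a
// full-size prev stays a temporary chunk.
func encodePairOld(prev, curr Chunk, last *int) []Chunk {
	if c, ok := encodeTemp(merge(prev, curr), last); ok {
		return []Chunk{c}
	}
	return []Chunk{prev, curr}
}

// encodePairNew models the behaviour after the commit: when the merged
// chunk finds no match, both chunks are encoded individually.
func encodePairNew(prev, curr Chunk, last *int) []Chunk {
	if c, ok := encodeTemp(merge(prev, curr), last); ok {
		return []Chunk{c}
	}
	p, _ := encodeTemp(prev, last)
	c, _ := encodeTemp(curr, last)
	return []Chunk{p, c}
}
```

With a full-size `prev` and a short `curr`, `encodePairOld` returns two unstored chunks, while `encodePairNew` returns `prev` as a stored chunk with an ID and leaves only `curr` temporary.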
author: n-peugnet <n.peugnet@free.fr> 2021-09-22 16:29:12 +0200
committer: n-peugnet <n.peugnet@free.fr> 2021-09-22 16:38:34 +0200
commit: 368f89466f48e8621254b04c1bca996db5c7a66a (patch)
tree: 8bc5dcc6dc80dcfef9a1a05bd15a841127a303ec
parent: 34a60695b43713dfe82713cdc59fb6ca35ad4757 (diff)
download: dna-backup-368f89466f48e8621254b04c1bca996db5c7a66a.tar.gz
dna-backup-368f89466f48e8621254b04c1bca996db5c7a66a.zip
2 files changed, 31 insertions, 6 deletions
diff --git a/TODO.md b/TODO.md
index 01c362b..f3c2d3f 100644
--- a/TODO.md
+++ b/TODO.md
@@ -61,3 +61,31 @@ ideas
 
 3. If we don't need to reduce read amplification we could compress all chunks if
     it reduces the space used.
+
+mystical bug 22/09
+------------------
+
+On the second run, delta chunks can be encoded against better matching chunks as
+more of them are present in the `sketchMap`. But we don't want this to happen,
+because this adds data to write again, even if it has already been written.
+
+Possible solutions:
+
+- keep IDs for delta chunks, calculate a hash of the target data and store it in
+    a new map. Then, when a chunk is encoded, first check if it exists in
+    the fingerprint map, then in the delta map, and only after that check for
+    matches in the sketch map.
+    This should also probably be used for `TempChunks` as they have more chance
+    to be delta-encoded on a second run.
+- wait for the end of the stream before delta-encoding chunks. So if it is not found
+    in the fingerprints map, but it is found in the sketch map, then we wait to
+    see if we find a better candidate for delta-encoding.
+    This would not fix the problem of `TempChunks` that become delta-encoded on
+    the second run. So we would need IDs and a map for these. Tail packing
+    `TempChunks` could also help solve this problem
+    (see [priority 2](#priority-2)).
+
+The first solution would have an advantage if we were directly streaming the
+output of the program into DNA, as it could start DNA-encoding it from the first
+chunk. The second solution will probably have better space-saving performance as
+waiting for better matches will probably lower the size of the patches.
diff --git a/repo.go b/repo.go
index 0d7ea65..4761c49 100644
--- a/repo.go
+++ b/repo.go
@@ -577,14 +577,11 @@ func (r *Repo) encodeTempChunks(prev BufferedChunk, curr BufferedChunk, version
 		c, success := r.encodeTempChunk(tmp, version, last, storeQueue)
 		if success {
 			return []Chunk{c}
-		} else {
-			return []Chunk{prev, curr}
 		}
-	} else {
-		prevD, _ := r.encodeTempChunk(prev, version, last, storeQueue)
-		currD, _ := r.encodeTempChunk(curr, version, last, storeQueue)
-		return []Chunk{prevD, currD}
 	}
+	prevD, _ := r.encodeTempChunk(prev, version, last, storeQueue)
+	currD, _ := r.encodeTempChunk(curr, version, last, storeQueue)
+	return []Chunk{prevD, currD}
 }
 
 func (r *Repo) matchStream(stream io.Reader, version int) []Chunk {
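
The lookup order proposed as the first solution in the TODO.md hunk above (fingerprint map, then delta map, then sketch map) could look roughly like the following. This is a sketch under assumptions, not the project's implementation: the `Index` type, its methods, the `Match` constants, the use of SHA-256 as the content hash, and the single `uint64` sketch value passed in by the caller are all hypothetical.

```go
package main

import "crypto/sha256"

// ChunkID is a hypothetical identifier for a chunk already written.
type ChunkID int

// Match describes which map, if any, answered a lookup.
type Match int

const (
	NoMatch    Match = iota // nothing known: store as a new chunk
	ExactMatch              // same content as a stored chunk
	KnownDelta              // same content as the target of an existing delta chunk
	NewDelta                // similar to a stored chunk: delta-encode against it
)

// Index groups the three maps consulted when encoding a chunk.
type Index struct {
	fingerprints map[[sha256.Size]byte]ChunkID // content hash of stored chunks
	deltas       map[[sha256.Size]byte]ChunkID // content hash of delta targets
	Sketches     map[uint64][]ChunkID          // sketch value -> candidate bases
}

// NewIndex returns an empty index.
func NewIndex() *Index {
	return &Index{
		fingerprints: map[[sha256.Size]byte]ChunkID{},
		deltas:       map[[sha256.Size]byte]ChunkID{},
		Sketches:     map[uint64][]ChunkID{},
	}
}

// AddStored records the content of a stored chunk.
func (ix *Index) AddStored(data []byte, id ChunkID) {
	ix.fingerprints[sha256.Sum256(data)] = id
}

// AddDelta records the target content of a delta chunk under its own ID.
func (ix *Index) AddDelta(target []byte, id ChunkID) {
	ix.deltas[sha256.Sum256(target)] = id
}

// Lookup checks the fingerprint map first, then the delta map, and only
// then the sketch map, so a chunk that was delta-encoded on an earlier
// run is reused instead of being re-encoded against a better candidate.
func (ix *Index) Lookup(data []byte, sketch uint64) (Match, ChunkID) {
	h := sha256.Sum256(data)
	if id, ok := ix.fingerprints[h]; ok {
		return ExactMatch, id
	}
	if id, ok := ix.deltas[h]; ok {
		return KnownDelta, id
	}
	if bases := ix.Sketches[sketch]; len(bases) > 0 {
		return NewDelta, bases[0]
	}
	return NoMatch, 0
}
```

The point of the ordering is that a `KnownDelta` hit short-circuits the sketch lookup: on a second run, data that was already written as a delta chunk is referenced by its ID even if the sketch map now holds a closer base.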