From 3da2cd3fc7ca694a7b263c099be29c9c8b46af22 Mon Sep 17 00:00:00 2001
From: n-peugnet <n.peugnet@free.fr>
Date: Mon, 13 Sep 2021 17:31:41 +0200
Subject: first tests on real data

---
 .gitignore              |   3 ++
 TODO.md                 |  28 ++++++--------
 docs/note-2021-09-13.md | 100 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 115 insertions(+), 16 deletions(-)
 create mode 100644 docs/note-2021-09-13.md

diff --git a/.gitignore b/.gitignore
index b80751f..6f0a1cb 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,5 +1,8 @@
 ## Project generated files
 dna-backup
 
+## Test directories
+test_*
+
 ## IDE files
 .vscode
diff --git a/TODO.md b/TODO.md
index adff59b..9dbecc4 100644
--- a/TODO.md
+++ b/TODO.md
@@ -5,20 +5,6 @@ priority 1
         chunks if no similar chunk is found. Thus, it will need to be of
         `chunkSize` or less. Otherwise, it could not be possibly used for
         deduplication.
-    ```
-    for each new chunk:
-        find similar in sketchMap
-        if exists:
-            delta encode
-        else:
-            calculate fingerprint
-            store in fingerprintMap
-            store in sketchMap
-    ```
-- [x] read from repo (Restore function)
-    - [x] store recipe
-    - [x] load recipe
-    - [x] read chunks in-order into a stream
 - [ ] read individual files
 - [ ] properly store information to be DNA encoded
     - [ ] tar source to keep files metadata ?
@@ -39,8 +25,8 @@ priority 2
     fact just `TempChunks` for now)
 - [ ] option to commit without deltas to save new base chunks
 - [ ] custom binary marshal and unmarshal for chunks
-- [ ] use `loadChunkContent` in `loadChunks`
-- [ ] store hashes for faster maps rebuild
+- [x] use `loadChunkContent` in `loadChunks`
+- [ ] TODO: store hashes for faster maps rebuild
 - [ ] try [Fdelta](https://github.com/amlwwalker/fdelta) and
     [Xdelta](https://github.com/nine-lives-later/go-xdelta) instead of Bsdiff
 - [ ] maybe use an LRU cache instead of the current FIFO one.
@@ -53,3 +39,13 @@ reunion 7/09
 - [ ] store recipe and files incrementally
 - [ ] compress recipe
 - [ ] make size comparison between recipe and chunks with some datasets
+
+ideas
+-----
+1. Would it be a good idea to store the compressed size for each chunk?
+    Maybe this way we could only load the chunks needed for each file read.
+
+2. Implement the `fs` interface of Go? Not sure if this will be useful.
+
+3. If we don't need to reduce read amplification we could compress all chunks if
+    it reduces the space used.
diff --git a/docs/note-2021-09-13.md b/docs/note-2021-09-13.md
new file mode 100644
index 0000000..230e6e3
--- /dev/null
+++ b/docs/note-2021-09-13.md
@@ -0,0 +1,100 @@
+First run on a big folder (1.1Go)
+==================================
+
+The program was run for the first time against a fairly big folder: my Desktop
+folder.
+
+It is a composite folder that contains a lot of different things, among which:
+
+- a lot of downloaded PDF files
+- a lot of downloaded image files
+- a massive (715.3Mio) package of binary updates
+- a TV show episode (85Mio)
+- some other fat binary executables (108Mio)
+- some compressed packages (53Mio)
+
+_I am starting to understand the poor compression performance_
+
+## Resources
+
+- CPU: 90% 4 cores
+- RAM: 0.7% 16Go
+
+## Time
+
+`16:00:04` → `16:12:33` = `00:12:29`
+
+## Size
+
+### Original
+
+```
+1050129 kio
+```
+
+### Repo
+
+```
+     18 kio	test_1/00000/files
+1017586 kio	test_1/00000/chunks
+   5908 kio	test_1/00000/recipe
+1023515 kio	test_1/00000
+1023519 kio	test_1
+```
+
+saved:
+
+```
+26610 kio
+```
+
+## Notes
+Not really impressed by the savings, even though there are a lot of deduplicated
+files.
+
+A lot of stored chunks are larger than their uncompressed content
+(8208o instead of 8192o, i.e. 16o of overhead).
+This is probably because of the added Zlib header and the fact that no
+compression was achieved.
+
+## Conclusions
+
+1. I should probably test with a folder that contains less binary and compressed
+    data.
+2. Maybe we can store the chunks uncompressed when we detect that it uses less
+    space than the compressed version.
+
+Second run on source code folder (1.0Go)
+=========================================
+
+## Resources
+
+Similar
+
+## Time
+
+`17:09:01` → `17:13:51` = `00:04:50`
+
+## Size
+
+### Original
+
+```
+925515 kio
+```
+
+### Repo
+
+```
+  6433 kio	test_2/00000/files
+272052 kio	test_2/00000/chunks
+ 17468 kio	test_2/00000/recipe
+295956 kio	test_2/00000
+295960 kio	test_2
+```
+
+saved:
+
+```
+629555 kio
+```
-- 
cgit v1.2.3