From 3da2cd3fc7ca694a7b263c099be29c9c8b46af22 Mon Sep 17 00:00:00 2001
From: n-peugnet
Date: Mon, 13 Sep 2021 17:31:41 +0200
Subject: first tests on real data

---
 .gitignore              |   3 ++
 TODO.md                 |  28 ++++++--------
 docs/note-2021-09-13.md | 100 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 115 insertions(+), 16 deletions(-)
 create mode 100644 docs/note-2021-09-13.md

diff --git a/.gitignore b/.gitignore
index b80751f..6f0a1cb 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,5 +1,8 @@
 ## Project generated files
 dna-backup
 
+## Test directories
+test_*
+
 ## IDE files
 .vscode
diff --git a/TODO.md b/TODO.md
index adff59b..9dbecc4 100644
--- a/TODO.md
+++ b/TODO.md
@@ -5,20 +5,6 @@ priority 1
     chunks if no similar chunk is found. Thus, it will need to be of
     `chunkSize` or less. Otherwise, it could not be possibly used for
     deduplication.
-    ```
-    for each new chunk:
-        find similar in sketchMap
-        if exists:
-            delta encode
-        else:
-            calculate fingerprint
-            store in fingerprintMap
-            store in sketchMap
-    ```
-- [x] read from repo (Restore function)
-    - [x] store recipe
-    - [x] load recipe
-    - [x] read chunks in-order into a stream
 - [ ] read individual files
 - [ ] properly store information to be DNA encoded
 - [ ] tar source to keep files metadata ?
@@ -39,8 +25,8 @@ priority 2
     fact just `TempChunks` for now)
 - [ ] option to commit without deltas to save new base chunks
 - [ ] custom binary marshal and unmarshal for chunks
-- [ ] use `loadChunkContent` in `loadChunks`
-- [ ] store hashes for faster maps rebuild
+- [x] use `loadChunkContent` in `loadChunks`
+- [ ] TODO: store hashes for faster maps rebuild
 - [ ] try [Fdelta](https://github.com/amlwwalker/fdelta) and
   [Xdelta](https://github.com/nine-lives-later/go-xdelta) instead of Bsdiff
 - [ ] maybe use an LRU cache instead of the current FIFO one.
@@ -53,3 +39,13 @@ reunion 7/09
 - [ ] store recipe and files incrementally
 - [ ] compress recipe
 - [ ] make size comparison between recipe and chunks with some datasets
+
+ideas
+-----
+1. Would it be a good idea to store the compressed size for each chunk?
+   Maybe this way we could only load the chunks needed for each file read.
+
+2. Implement the `fs` interface of Go? Not sure if this will be useful.
+
+3. If we don't need to reduce read amplification we could compress all chunks if
+   it reduces the space used.
diff --git a/docs/note-2021-09-13.md b/docs/note-2021-09-13.md
new file mode 100644
index 0000000..230e6e3
--- /dev/null
+++ b/docs/note-2021-09-13.md
@@ -0,0 +1,100 @@
+First run on a big folder (1.1Go)
+==================================
+
+The program was run for the first time against a fairly big folder: my Desktop
+folder.
+
+It is a composite folder that contains a lot of different things, among which:
+
+- a lot of downloaded PDF files
+- a lot of downloaded image files
+- a massive (715.3Mio) package of binary updates
+- a TV show episode (85Mio)
+- some other fat binary executables (108Mio)
+- some compressed packages (53Mio)
+
+_I am starting to understand the poor compression performance_
+
+## Resources
+
+- CPU: 90% 4 cores
+- RAM: 0.7% 16Go
+
+## Time
+
+`16:00:04` → `16:12:33` = `00:12:29`
+
+## Size
+
+### Original
+
+```
+1050129 kio
+```
+
+### Repo
+
+```
+     18 kio test_1/00000/files
+1017586 kio test_1/00000/chunks
+   5908 kio test_1/00000/recipe
+1023515 kio test_1/00000
+1023519 kio test_1
+```
+
+saved:
+
+```
+26610 kio
+```
+
+## Notes
+Not really impressed by the savings, even though there are a lot of deduplicated
+files.
+
+There are a lot of chunks larger than uncompressed ones
+(8208o instead of 8192o).
+This is probably because of the added Zlib header and the fact that no
+compression was achieved.
+
+## Conclusions
+
+1. I should probably test with a folder that contains less binary and compressed
+   data.
+2. Maybe we can store the chunks uncompressed when we detect that it uses less
+   space than the compressed version.
+
+Second run on source code folder (1.0Go)
+=========================================
+
+## Resources
+
+Similar
+
+## Time
+
+`17:09:01` → `17:13:51` = `00:04:50`
+
+## Size
+
+### Original
+
+```
+925515 kio
+```
+
+### Repo
+
+```
+  6433 kio test_2/00000/files
+272052 kio test_2/00000/chunks
+ 17468 kio test_2/00000/recipe
+295956 kio test_2/00000
+295960 kio test_2
+```
+
+saved:
+
+```
+629555 kio
+```
-- 
cgit v1.2.3