A question about content-defined chunking (CDC): the re-merged file doesn't match the source

eric107008 · 2024-02-15 21:21:57 +08:00 · 1504 clicks

I've recently been working on a network-storage project that involves CDC. Apart from FastCDC, proposed a few years ago, the mainstream algorithm is Rabin CDC, which is based on the Rabin-Karp algorithm: it computes a Rabin fingerprint and cuts chunks according to a rolling hash. A typical mature application is Seafile, whose backend uses Rabin CDC, and there are plenty of analyses of its source code online, e.g.: [1] http://www.ilovecpp.com/2019/02/03/cdc/ [2] https://blog.csdn.net/cyk0620/article/details/120813255
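For reference, the cut-point logic these implementations share boils down to rolling a fixed-width window hash across the data and declaring a chunk boundary whenever the masked hash equals a magic value. Below is a minimal sketch of that idea in C; the polynomial hash, window size, mask, and magic number are toy values I made up for illustration, not Seafile's actual Rabin fingerprint:

```c
#include <stdio.h>
#include <stdint.h>

#define WINDOW_SIZE 48      /* bytes covered by the rolling hash */
#define AVG_MASK    0x00FF  /* average chunk ~256 bytes, demo only */
#define MAGIC       0x0013  /* arbitrary target value */
#define PRIME       31      /* toy polynomial base, not Seafile's */

/* Return the length of the first chunk in buf, or len if no boundary is found. */
static size_t find_cut_point(const uint8_t *buf, size_t len)
{
    if (len < WINDOW_SIZE)
        return len;

    uint64_t hash = 0, pow = 1;
    size_t i;

    /* Hash of the first window: sum of buf[i] * PRIME^(WINDOW_SIZE-1-i). */
    for (i = 0; i < WINDOW_SIZE; i++) {
        hash = hash * PRIME + buf[i];
        if (i < WINDOW_SIZE - 1)
            pow *= PRIME;   /* pow ends up as PRIME^(WINDOW_SIZE-1) */
    }

    for (i = WINDOW_SIZE; ; i++) {
        /* Boundary condition: the masked hash hits the magic value. */
        if ((hash & AVG_MASK) == (MAGIC & AVG_MASK))
            return i;
        if (i == len)
            return len;
        /* Roll the window: drop buf[i-WINDOW_SIZE], take in buf[i]. */
        hash = (hash - buf[i - WINDOW_SIZE] * pow) * PRIME + buf[i];
    }
}

int main(void)
{
    uint8_t data[4096];
    size_t i, off = 0;

    for (i = 0; i < sizeof data; i++)
        data[i] = (uint8_t)((i * 2654435761u) >> 24); /* pseudo-random fill */

    while (off < sizeof data) {
        size_t cut = find_cut_point(data + off, sizeof data - off);
        printf("chunk at offset %zu, length %zu\n", off, cut);
        off += cut;
    }
    return 0;
}
```

Real implementations additionally clamp each chunk between a minimum and a maximum size, which is what the Min/Max/AverageChunkSize constants in my C# code below correspond to.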

I also found a GitHub repository that essentially extracts Seafile's chunking code into a standalone program: https://github.com/baixiangcpp/FileCDC

The problem: when I re-merge the chunks back into a single file, its hash no longer matches the original. Specifically, I tested with the Ubuntu 22.04 amd64 ISO from the official site. The source file's SHA256:

$ sha256sum ubuntu.iso
a435f6f393dda581172490eda9f683c32e495158a780b5a1de422ee77d98e909  ubuntu.iso

This matches the SHA256 checksum published on the Ubuntu website. But after re-merging the chunks in order, the resulting file is the same size as the source, yet its SHA256 hash differs:

$ sha256sum merged.iso
04a917c75a2d7733e0aabe6a5dbd0bbc875e8d3e63622c8d3c78b9058b92abd3  merged.iso

I also wrote my own implementation in C# against the same source ISO. Its merged file hash is wrong too, and even its chunking result (the number of chunks, and each chunk's SHA1) differs from the C implementation's.

I've been over the code several times with rubber-duck debugging and can't find a logic error. Initial breakpoint debugging and unit tests show that the Rabin fingerprint calculation produces identical results in C# and in C, yet the chunking differs and so do the merged files, and I have no idea how to narrow it down. After all, a large file splits into 500+ chunks, so it's hardly feasible to step through them one by one with breakpoints.

So the symptoms so far:
1. Merging the chunks back changes the hash (this happens in both the C implementation and the C# implementation, and the two merged files don't match each other either).
2. The C# and C implementations compute identical Rabin fingerprints, but their chunking results only partially agree: some chunks produced by the C# version don't appear in the C version's output and vice versa, and the chunk counts differ. Some chunks are identical in both, with matching SHA1 and size.

So I'd like to ask:
1. Are the two phenomena above normal?
2. If not, what could be causing them?
3. How do you debug something like this? Single-stepping with breakpoints? The Rabin split points are found byte by byte, so that hardly seems realistic. (One cheap check I can already run is sketched below.)
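The one check I can run without a debugger is to find the first byte offset at which merged.iso diverges from ubuntu.iso (cmp(1) reports the same thing). If that offset lands exactly on a chunk boundary, the split is probably fine and the merge is writing the wrong number of bytes; an offset in the middle of a chunk would instead point at the chunk contents themselves. A throwaway C sketch; the file names are just my local test files:

```c
#include <stdio.h>

/* Report the first byte offset at which two files differ, like cmp(1). */
int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s file1 file2\n", argv[0]);
        return 2;
    }

    FILE *a = fopen(argv[1], "rb");
    FILE *b = fopen(argv[2], "rb");
    if (!a || !b) {
        perror("fopen");
        return 2;
    }

    unsigned long long off = 0;
    int ca, cb;
    do {
        ca = fgetc(a);
        cb = fgetc(b);
        if (ca != cb) {
            /* Also fires when one file is a strict prefix of the other. */
            printf("first difference at offset %llu (0x%llx)\n", off, off);
            return 1;
        }
        off++;
    } while (ca != EOF);

    printf("files are identical\n");
    return 0;
}
```

Run as e.g. `./firstdiff ubuntu.iso merged.iso` (the binary name is just whatever you compile it to).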

Appendix: 1. The complete C code is based on https://github.com/baixiangcpp/FileCDC, with two changes. One is the chunk-writing function, where I write chunks into a chunks folder instead of the current working directory:

```c
int do_write_chunk (const unsigned char *checksum, const char *buf, int len)
{
    char chksum_str[41];
    int fd;
    int n;

    rawdata_to_hex (checksum, chksum_str, 20);

    /* Don't write if the block already exists.
     * (Note: this checks chksum_str relative to the current directory.) */
    if (g_access (chksum_str, F_OK) == 0)
        return 0;

    /* Build path "/workspaces/FileCDC-master/chunks/<chksum_str>" */
    char path[100];
    sprintf (path, "/workspaces/FileCDC-master/chunks/%s", chksum_str);

    fd = open (path, O_WRONLY | O_CREAT, 0666);
    if (fd == -1) {
        printf ("Failed to open block %s.\n", path);
        return -1;
    }

    n = write (fd, buf, len);
    if (n < 0) {
        printf ("Failed to write chunk %s.\n", chksum_str);
        close (fd);
        return -1;
    }

    close (fd);
    return 0;
}
```

The other is a function I added myself to merge the chunks back together:

```c
void file_merge_cdc (CDCFileDescriptor *file_desc)
{
    int i;
    int fd;
    char path[100];
    char *buf = malloc (file_desc->block_max_sz);

    /* merged.iso must already exist, since O_CREAT is not passed. */
    int merged_file = open ("merged.iso", O_WRONLY | O_APPEND, 0666);
    if (merged_file == -1) {
        printf ("Failed to open merged file.\n");
        return;
    }

    for (i = 0; i < file_desc->block_nr; i++) {
        char chksum_str[41];
        rawdata_to_hex (file_desc->blk_sha1s + i * CHECKSUM_LENGTH, chksum_str, 20);
        sprintf (path, "/workspaces/FileCDC-master/chunks/%s", chksum_str);

        fd = open (path, O_RDONLY);
        if (fd == -1) {
            printf ("Failed to open block %s.\n", path);
            return;
        }

        /* Read up to block_max_sz bytes from the chunk file. */
        read (fd, buf, file_desc->block_max_sz);
        close (fd);

        /* Append block_max_sz bytes (the full buffer) to merged.iso. */
        write (merged_file, buf, file_desc->block_max_sz);
    }

    close (merged_file);
    free (buf);
}
```

2. The C# chunking implementation:

```c#
public class RabingCDC
{
    private RabinChecksum rabinChecksum = new RabinChecksum();
    private readonly int MinChunkSize = 1024 * 1024 * 6;     // 6 MB
    private readonly int MaxChunkSize = 1024 * 1024 * 10;    // 10 MB
    private readonly int AvarageChunkSize = 1024 * 1024 * 8; // 8 MB
    private LinkedList<string> fileHashed = new LinkedList<string>();

    public async Task FileChunkCDC(Stream file)
    {
        byte[] buffer = new byte[MaxChunkSize];
        int cursor = MinChunkSize - 1;
        int bufferDataTail = 0;

        while (true)
        {
            int readSize = 4096; // 4 KB
            if (bufferDataTail < MinChunkSize)
            {
                readSize = MinChunkSize - bufferDataTail + 4096;
            }
            else if (MaxChunkSize - bufferDataTail < 4096)
            {
                readSize = MaxChunkSize - bufferDataTail;
            }

            int readed = await file.ReadAsync(buffer, bufferDataTail, readSize);
            bufferDataTail += readed;

            if (bufferDataTail < MinChunkSize || cursor >= bufferDataTail)
            {
                // The remaining data is smaller than the minimum chunk size:
                // write it all out as a single block.
                WriteBlock(buffer.Take(readed).ToArray());
                bufferDataTail = 0;
                cursor = MinChunkSize - 1;
                break;
            }

            uint fingerprint = 0;
            while (cursor < bufferDataTail)
            {
                if (cursor == MinChunkSize - 1)
                {
                    // First window: compute the fingerprint from scratch.
                    fingerprint = rabinChecksum.GetRabingChecksum(buffer.Skip(cursor).Take(48).ToArray());
                }
                else
                {
                    // Subsequent positions: roll the 48-byte window forward.
                    fingerprint = rabinChecksum.RabinRollingChecksum(fingerprint, 48, buffer[cursor - 48], buffer[cursor]);
                }

                if ((fingerprint & (AvarageChunkSize - 1)) == (0x0013 & (AvarageChunkSize - 1))
                    || cursor + 1 >= MaxChunkSize)
                {
                    WriteBlock(buffer.Take(cursor + 1).ToArray());
                    // Move the rest of the buffer to the beginning.
                    Array.Copy(buffer, cursor, buffer, 0, bufferDataTail - cursor - 1);
                    bufferDataTail -= cursor + 1;
                    cursor = MinChunkSize - 1;
                    break;
                }
                else
                {
                    cursor++;
                }
            }
        }
    }

    public void WriteBlock(byte[] block)
    {
        var path = "D:\\Code\\temp\\cdc";
        // Write the block as a new file under `path`;
        // the file name is the SHA1 of the block.
        byte[] sha1 = SHA1.Create().ComputeHash(block);
        string fileName = BitConverter.ToString(sha1).Replace("-", "").ToLower();
        fileHashed.AddLast(fileName);
        File.WriteAllBytes(Path.Combine(path, fileName), block);
    }

    public async Task FileMergeCDC(string chunksFolderPath, string resultFilePath)
    {
        var fileStream = File.OpenWrite(resultFilePath);
        foreach (var file in fileHashed)
        {
            var buffer = File.ReadAllBytes(Path.Combine(chunksFolderPath, file));
            await fileStream.WriteAsync(buffer, 0, buffer.Length);
        }
        fileStream.Close();
    }
}
```
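As a sanity check on the chunk files themselves, I can re-hash every chunk and compare the digest against its file name, since each chunk is named by its SHA1. A quick standalone check in C, assuming OpenSSL is available (compile with -lcrypto); CHUNK_DIR is my local path:

```c
#include <stdio.h>
#include <string.h>
#include <dirent.h>
#include <openssl/sha.h>

#define CHUNK_DIR "/workspaces/FileCDC-master/chunks"  /* my local chunk folder */

int main(void)
{
    DIR *dir = opendir(CHUNK_DIR);
    if (!dir) { perror("opendir"); return 1; }

    struct dirent *ent;
    int bad = 0;
    while ((ent = readdir(dir)) != NULL) {
        if (strlen(ent->d_name) != 40)   /* skip ".", ".." and anything not SHA1-named */
            continue;

        char path[512];
        snprintf(path, sizeof path, "%s/%s", CHUNK_DIR, ent->d_name);

        FILE *fp = fopen(path, "rb");
        if (!fp) { perror(path); continue; }

        /* Stream the chunk file through SHA1. */
        SHA_CTX ctx;
        SHA1_Init(&ctx);
        unsigned char chunk_buf[1 << 16];
        size_t n;
        while ((n = fread(chunk_buf, 1, sizeof chunk_buf, fp)) > 0)
            SHA1_Update(&ctx, chunk_buf, n);
        fclose(fp);

        unsigned char digest[SHA_DIGEST_LENGTH];
        SHA1_Final(digest, &ctx);

        char hex[41];
        for (int i = 0; i < SHA_DIGEST_LENGTH; i++)
            sprintf(hex + 2 * i, "%02x", digest[i]);

        /* A chunk is intact iff its contents hash to its own file name. */
        if (strcmp(hex, ent->d_name) != 0) {
            printf("MISMATCH: %s -> %s\n", ent->d_name, hex);
            bad++;
        }
    }
    closedir(dir);
    printf("%d corrupted chunk(s)\n", bad);
    return bad != 0;
}
```

If every chunk verifies, the bytes on disk match what the chunker hashed, so any corruption has to be introduced at merge time (for instance, by how many bytes get read and written per chunk); if some chunks fail, the split/write path is suspect.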
No replies yet