Diffing two versions of a Chinese (Classical Chinese) text: are there any solutions? - V2EX
  •   garywill 2024-01-09 10:19:07 +08:00 1301 views

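    I have two versions of the same Chinese text (Classical Chinese) and am looking for a program that can diff them. Are there any existing or related tools?

    The following differences should be ignored:

    • Punctuation. Classical texts were originally unpunctuated; all punctuation was added by later editors, so it is bound to differ between versions.
    • Layout: where the lines and paragraphs break.
    • Traditional versus Simplified characters.

    For example, version 1 is entirely in Simplified characters, has no punctuation, has spaces between characters, and puts each clause on its own line: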

    床 前 看 月 光

    疑 是 地 上 霜

    举 头 望 山 月

    低 头 思 故 乡

    Version 2

    牀前明月光,疑是地上霜。

    舉頭望明月,低頭思故鄉。

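    The comparison result I would like the software to produce is:

    1. 看<->明

    2. 山<->明

    The example above is five-character verse, so every line has the same number of characters. I also need to compare texts whose sentences differ in length.

    Are there any existing or related tools?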

    6 replies    2024-01-09 14:02:01 +08:00
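    vacuitym
        1
       2024-01-09 10:29:07 +08:00
    This should be fairly easy to implement yourself: first convert both texts to Simplified (or both to Traditional), then normalize the punctuation so it is the same on both sides, and then diff the results directly.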
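A minimal sketch of that normalize-then-diff idea, using only the Python standard library. Instead of unifying punctuation it simply drops it, which has the same effect on the diff. The `VARIANTS` table is a tiny hypothetical stand-in for a real Traditional/Simplified mapping (a conversion library would normally supply it), and the function names `normalize` and `substitutions` are made up for this illustration.

```python
import unicodedata
from difflib import SequenceMatcher

# Hypothetical, deliberately tiny Traditional -> Simplified table for this demo only.
VARIANTS = str.maketrans({"牀": "床", "舉": "举", "頭": "头", "鄉": "乡"})

def normalize(text):
    """Drop punctuation, symbols, separators and control characters, then unify variants."""
    # Unicode general categories: P* punctuation, S* symbols, Z* separators, C* control/format.
    kept = [ch for ch in text if unicodedata.category(ch)[0] not in "PSZC"]
    return "".join(kept).translate(VARIANTS)

def substitutions(a, b):
    """Return (old, new) pairs for every span that still differs after normalization."""
    a, b = normalize(a), normalize(b)
    # autojunk=False: the default heuristic can treat frequent characters as junk in long texts.
    ops = SequenceMatcher(None, a, b, autojunk=False).get_opcodes()
    return [(a[i1:i2], b[j1:j2]) for tag, i1, i2, j1, j2 in ops if tag != "equal"]

v1 = "床 前 看 月 光\n疑 是 地 上 霜\n举 头 望 山 月\n低 头 思 故 乡"
v2 = "牀前明月光,疑是地上霜。\n舉頭望明月,低頭思故鄉。"
print(substitutions(v1, v2))  # [('看', '明'), ('山', '明')]
```

Filtering by Unicode category rather than by a fixed punctuation list means brackets such as 「」 and full-width colons are dropped without having to enumerate them.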
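    garywill
        2
    OP
       2024-01-09 10:46:54 +08:00
    Adding a harder example: different sentence breaks and more complex punctuation. (Some versions are even missing a few sentences in the middle.)

    Version 1
    帝曰:「有其年已老而有子者,何也?」岐伯曰:「此其天度,常通,而有也。此有子,男不八八,女不七七,而天地之精皆竭矣。」

    Version 2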
    帝曰有其年已老而有子者何也
    岐伯曰此其天度常通而有也
    此有子男不八八女不七七
    而天地之精皆竭矣
    superychen
        3
       2024-01-09 13:52:04 +08:00
    Do the two versions have the same number of characters? If you ask GPT, it can generate Python code for this.
    superychen
        4
       2024-01-09 13:56:43 +08:00
    ```python
    import opencc
    import re
    from difflib import SequenceMatcher

    PATTERN_CHINESE = re.compile(r'[\u4e00-\u9fa5]')
    CONVERTER = opencc.OpenCC("t2s")

    # Keep only Chinese characters (drops punctuation, spaces and line breaks)
    def clean(text):
        return ''.join(PATTERN_CHINESE.findall(text))

    # Convert Traditional to Simplified
    def simplify(text):
        return CONVERTER.convert(text)

    # Compare the two texts and print every replaced span
    def compare_text(text1, text2):
        text1 = clean(text1)
        text2 = clean(text2)
        text1a = simplify(text1)
        text2a = simplify(text2)
        matcher = SequenceMatcher(None, text1a, text2a)
        diffs = matcher.get_opcodes()
        index = 0
        for tag, i1, i2, j1, j2 in diffs:
            if tag == 'replace':
                index += 1
                # Slicing the unconverted text assumes the conversion keeps the character count unchanged
                print(f'{index}. {text1[i1:i2]} <-> {text2[j1:j2]}')

    # Sample texts: Simplified without punctuation, Traditional with punctuation
    simplified_text = '''床 前 看 月 光

    疑 是 地 上 霜

    举 头 望 山 月

    低 头 思 故 乡'''
    traditional_text = '''牀前明月光,疑是地上霜。

    舉頭望明月,低頭思故鄉。'''

    compare_text(simplified_text, traditional_text)
    ```
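The loop above prints only `replace` opcodes. When one version is missing whole clauses, as in the harder example from reply 2, `SequenceMatcher` reports those spans as `delete` or `insert` instead, so they would go unreported. The following is a hedged sketch of a variant that lists every non-equal opcode together with a little left context. It assumes both inputs have already been stripped of punctuation and converted to one script; `describe_diff` and its `context` parameter are hypothetical names, and the shortened `v2` is constructed for the demo only.

```python
from difflib import SequenceMatcher

def describe_diff(a, b, context=3):
    """List every non-equal opcode between two cleaned strings, with a little left context."""
    report = []
    # autojunk=False: the default heuristic can treat frequent characters as junk in long texts.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        before = a[max(0, i1 - context):i1]  # characters just before the change in version 1
        report.append((tag, before, a[i1:i2], b[j1:j2]))
    return report

# Demo only: v2 is v1 with one clause removed to imitate a version that lacks some text.
v1 = "帝曰有其年已老而有子者何也岐伯曰此其天度常通而有也"
v2 = "帝曰有其年已老而有子者何也岐伯曰常通而有也"
for tag, before, old, new in describe_diff(v1, v2):
    print(tag, before, repr(old), repr(new))  # delete 岐伯曰 '此其天度' ''
```

Keeping the context string makes it easier to locate a reported difference in the original, punctuated text.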
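    geelaw
        5
       2024-01-09 13:57:31 +08:00   1
    This can be modeled with edit distance.

    Preparation: find a dictionary that records all punctuation marks, whitespace and Chinese characters, as well as the different written forms of the same character (Simplified, Traditional and variant forms).

    1. Remove all punctuation and whitespace from both strings, keeping only the Chinese characters.
    2. Compute the edit script with the minimum edit distance. Replacing a character with another written form of itself, deleting a character, and inserting a character can each be given a cost of 1 (so changing a character into an unrelated one costs 2).

    Step 2 is a standard dynamic programming problem.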
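A minimal sketch of step 2 under the costs described above: insert, delete and variant substitution cost 1 each, and an unrelated substitution costs 2 (one delete plus one insert). `VARIANT_GROUPS` is a tiny hypothetical stand-in for the dictionary from the preparation step, and `sub_cost` and `align` are names invented for this illustration. The `variant_cost` parameter is an added assumption: setting it to 0 makes variant forms count as identical, which is what the original question asks for.

```python
# Hypothetical, tiny table of characters that are written forms of the same character.
VARIANT_GROUPS = [{"床", "牀"}, {"举", "舉"}, {"头", "頭"}, {"乡", "鄉"}]
VARIANT_OF = {ch: i for i, group in enumerate(VARIANT_GROUPS) for ch in group}

def sub_cost(x, y, variant_cost=1):
    """0 for identical characters, variant_cost for variants, 2 otherwise (delete + insert)."""
    if x == y:
        return 0
    gx = VARIANT_OF.get(x)
    if gx is not None and gx == VARIANT_OF.get(y):
        return variant_cost
    return 2

def align(a, b, variant_cost=1):
    """Return (distance, edits) for one cheapest way of turning a into b."""
    n, m = len(a), len(b)
    # dist[i][j] = minimum cost of turning a[:i] into b[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(
                dist[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1], variant_cost),
                dist[i - 1][j] + 1,  # delete a[i-1]
                dist[i][j - 1] + 1,  # insert b[j-1]
            )
    # Walk back from the bottom-right corner to recover one optimal edit script.
    edits, i, j = [], n, m
    while i > 0 or j > 0:
        cost = sub_cost(a[i - 1], b[j - 1], variant_cost) if i > 0 and j > 0 else None
        if cost is not None and dist[i][j] == dist[i - 1][j - 1] + cost:
            if cost > 0:
                edits.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            edits.append((a[i - 1], ""))
            i -= 1
        else:
            edits.append(("", b[j - 1]))
            j -= 1
    return dist[n][m], edits[::-1]

print(align("床前看月光", "牀前明月光"))                  # (3, [('床', '牀'), ('看', '明')])
print(align("床前看月光", "牀前明月光", variant_cost=0))  # (2, [('看', '明')])
```

The table uses memory proportional to the product of the two lengths, so very long texts may need to be aligned in chunks.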
    superychen
        6
       2024-01-09 14:02:01 +08:00
    (Embedded carbon.now.sh rendering of the code from reply 4, with the same sample texts.)