PyTorch 高级实战教程:基于 Bi-LSTM CRF 实现命名实体识别和中文分词 - V2EX
V2EX = way to explore
V2EX 是一个关于分享和探索的地方
Sign Up Now
For Existing Member  Sign In
推荐学习书目
Learn Python the Hard Way
Python Sites
PyPI - Python Package Index
http://diveintopython.org/toc/index.html
Pocoo
值得关注的项目
PyPy
Celery
Jinja2
Read the Docs
gevent
pyenv
virtualenv
Stackless Python
Beautiful Soup
结巴中文分词
Green Unicorn
Sentry
Shovel
Pyflakes
pytest
Python 编程
pep8 Checker
Styles
PEP 8
Google Python Style Guide
Code Style from The Hitchhiker's Guide
fendouai_com

PyTorch 高级实战教程:基于 Bi-LSTM CRF 实现命名实体识别和中文分词

  •  1
     
  •   fendouai_com Apr 12, 2019 5768 views
    This topic created in 2574 days ago, the information mentioned may be changed or developed.

    前言:实测 PyTorch 代码非常简洁易懂,只需要将中文分词的数据集预处理成作者提到的格式,即可很快的就迁移了这个代码到中文分词中,相关的代码后续将会分享。

    具体的数据格式,这种方式并不适合处理很多的数据,但是对于 demo 来说非常友好,把英文改成中文,标签改成分词问题中的 “ BEMS ” 就可以跑起来了。

    # Make up some training data training_data = [( "the wall street journal reported today that apple corporation made money".split(), "B I I I O O O B I O O".split() ), ( "georgia tech is a university in georgia".split(), "B I O O O O B".split() )] 

    Pytorch 是一个动态神经网络工具包。 动态工具包的另一个例子是 Dynet (我之所以提到这一点,因为与 Pytorch 和 Dynet 的工作方式类似。如果你在 Dynet 中看到一个例子,它可能会帮助你在 Pytorch 中实现它)。 相反的是静态工具包,包括 Theano,Keras,TensorFlow 等。核心区别如下:

    在静态工具箱中,您可以定义一次计算图,对其进行编译,然后将实例流式传输给它。 在动态工具包中,您可以为每个实例定义计算图。 它永远不会被编译并且是即时执行的。 动态工具包还有一个优点,那就是更容易调试,代码更像主机语言(我的意思是 pytorch 和 dynet 看起来更像实际的 python 代码,而不是 keras 或 theano )。

    Bi-LSTM Conditional Random Field ( Bi-LSTM CRF ) 对于本节,我们将看到用于命名实体识别的 Bi-LSTM 条件随机场的完整复杂示例。 上面的 LSTM 标记符通常足以用词性标注,但是像 CRF 这样的序列模型对于 NER 上的强大性能非常重要。 假设熟悉 CRF。 虽然这个名字听起来很可怕,但所有模型都是 CRF,但是 LSTM 提供了特征。 这是一个高级模型,比本教程中的任何早期模型复杂得多。

    实现细节: 下面的例子在 log 空间中实现了计算微分函数的正向算法,以及要解码的维特比算法。反向传播将自动为我们计算梯度。我们不必用手做任何事。

    这个算法用来演示,没有优化。如果您了解正在发生的事情,您可能会很快看到,在转发算法中迭代下一个标记可能是在一个大型操作中完成的。我想用代码来提高可读性。如果你想做相关的改变,你可以用这个标记器来完成真正的任务。

    # Author: Robert Guthrie import torch import torch.autograd as autograd import torch.nn as nn import torch.optim as optim torch.manual_seed(1) 

    帮助程序函数,使代码更具可读性。

    def argmax(vec): # return the argmax as a python int _, idx = torch.max(vec, 1) return idx.item() def prepare_sequence(seq, to_ix): idxs = [to_ix[w] for w in seq] return torch.tensor(idxs, dtype=torch.long) # Compute log sum exp in a numerically stable way for the forward algorithm def log_sum_exp(vec): max_score = vec[0, argmax(vec)] max_score_broadcast = max_score.view(1, -1).expand(1, vec.size()[1]) return max_score + \ torch.log(torch.sum(torch.exp(vec - max_score_broadcast))) 

    创建模型

    class BiLSTM_CRF(nn.Module): def __init__(self, vocab_size, tag_to_ix, embedding_dim, hidden_dim): super(BiLSTM_CRF, self).__init__() self.embedding_dim = embedding_dim self.hidden_dim = hidden_dim self.vocab_size = vocab_size self.tag_to_ix = tag_to_ix self.tagset_size = len(tag_to_ix) self.word_embeds = nn.Embedding(vocab_size, embedding_dim) self.lstm = nn.LSTM(embedding_dim, hidden_dim // 2, num_layers=1, bidirectiOnal=True) # Maps the output of the LSTM into tag space. self.hidden2tag = nn.Linear(hidden_dim, self.tagset_size) # Matrix of transition parameters. Entry i,j is the score of # transitioning *to* i *from* j. self.transitiOns= nn.Parameter( torch.randn(self.tagset_size, self.tagset_size)) # These two statements enforce the constraint that we never transfer # to the start tag and we never transfer from the stop tag self.transitions.data[tag_to_ix[START_TAG], :] = -10000 self.transitions.data[:, tag_to_ix[STOP_TAG]] = -10000 self.hidden = self.init_hidden() def init_hidden(self): return (torch.randn(2, 1, self.hidden_dim // 2), torch.randn(2, 1, self.hidden_dim // 2)) def _forward_alg(self, feats): # Do the forward algorithm to compute the partition function init_alphas = torch.full((1, self.tagset_size), -10000.) # START_TAG has all of the score. init_alphas[0][self.tag_to_ix[START_TAG]] = 0. # Wrap in a variable so that we will get automatic backprop forward_var = init_alphas # Iterate through the sentence for feat in feats: alphas_t = [] # The forward tensors at this timestep for next_tag in range(self.tagset_size): # broadcast the emission score: it is the same regardless of # the previous tag emit_score = feat[next_tag].view( 1, -1).expand(1, self.tagset_size) # the ith entry of trans_score is the score of transitioning to # next_tag from i trans_score = self.transitions[next_tag].view(1, -1) # The ith entry of next_tag_var is the value for the # edge (i -> next_tag) before we do log-sum-exp next_tag_var = forward_var + trans_score + emit_score # The forward variable for this tag is log-sum-exp of all the # scores. alphas_t.append(log_sum_exp(next_tag_var).view(1)) forward_var = torch.cat(alphas_t).view(1, -1) terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]] alpha = log_sum_exp(terminal_var) return alpha def _get_lstm_features(self, sentence): self.hidden = self.init_hidden() embeds = self.word_embeds(sentence).view(len(sentence), 1, -1) lstm_out, self.hidden = self.lstm(embeds, self.hidden) lstm_out = lstm_out.view(len(sentence), self.hidden_dim) lstm_feats = self.hidden2tag(lstm_out) return lstm_feats def _score_sentence(self, feats, tags): # Gives the score of a provided tag sequence score = torch.zeros(1) tags = torch.cat([torch.tensor([self.tag_to_ix[START_TAG]], dtype=torch.long), tags]) for i, feat in enumerate(feats): score = score + \ self.transitions[tags[i + 1], tags[i]] + feat[tags[i + 1]] score = score + self.transitions[self.tag_to_ix[STOP_TAG], tags[-1]] return score def _viterbi_decode(self, feats): backpointers = [] # Initialize the viterbi variables in log space init_vvars = torch.full((1, self.tagset_size), -10000.) init_vvars[0][self.tag_to_ix[START_TAG]] = 0 # forward_var at step i holds the viterbi variables for step i-1 forward_var = init_vvars for feat in feats: bptrs_t = [] # holds the backpointers for this step viterbivars_t = [] # holds the viterbi variables for this step for next_tag in range(self.tagset_size): # next_tag_var[i] holds the viterbi variable for tag i at the # previous step, plus the score of transitioning # from tag i to next_tag. # We don't include the emission scores here because the max # does not depend on them (we add them in below) next_tag_var = forward_var + self.transitions[next_tag] best_tag_id = argmax(next_tag_var) bptrs_t.append(best_tag_id) viterbivars_t.append(next_tag_var[0][best_tag_id].view(1)) # Now add in the emission scores, and assign forward_var to the set # of viterbi variables we just computed forward_var = (torch.cat(viterbivars_t) + feat).view(1, -1) backpointers.append(bptrs_t) # Transition to STOP_TAG terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]] best_tag_id = argmax(terminal_var) path_score = terminal_var[0][best_tag_id] # Follow the back pointers to decode the best path. best_path = [best_tag_id] for bptrs_t in reversed(backpointers): best_tag_id = bptrs_t[best_tag_id] best_path.append(best_tag_id) # Pop off the start tag (we dont want to return that to the caller) start = best_path.pop() assert start == self.tag_to_ix[START_TAG] # Sanity check best_path.reverse() return path_score, best_path def neg_log_likelihood(self, sentence, tags): feats = self._get_lstm_features(sentence) forward_score = self._forward_alg(feats) gold_score = self._score_sentence(feats, tags) return forward_score - gold_score def forward(self, sentence): # dont confuse this with _forward_alg above. # Get the emission scores from the BiLSTM lstm_feats = self._get_lstm_features(sentence) # Find the best path, given the features. score, tag_seq = self._viterbi_decode(lstm_feats) return score, tag_seq 

    开始训练

    START_TAG = "<START>" STOP_TAG = "<STOP>" EMBEDDING_DIM = 5 HIDDEN_DIM = 4 # Make up some training data training_data = [( "the wall street journal reported today that apple corporation made money".split(), "B I I I O O O B I O O".split() ), ( "georgia tech is a university in georgia".split(), "B I O O O O B".split() )] word_to_ix = {} for sentence, tags in training_data: for word in sentence: if word not in word_to_ix: word_to_ix[word] = len(word_to_ix) tag_to_ix = {"B": 0, "I": 1, "O": 2, START_TAG: 3, STOP_TAG: 4} model = BiLSTM_CRF(len(word_to_ix), tag_to_ix, EMBEDDING_DIM, HIDDEN_DIM) optimizer = optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4) # Check predictions before training with torch.no_grad(): precheck_sent = prepare_sequence(training_data[0][0], word_to_ix) precheck_tags = torch.tensor([tag_to_ix[t] for t in training_data[0][1]], dtype=torch.long) print(model(precheck_sent)) # Make sure prepare_sequence from earlier in the LSTM section is loaded for epoch in range( 300): # again, normally you would NOT do 300 epochs, it is toy data for sentence, tags in training_data: # Step 1. Remember that Pytorch accumulates gradients. # We need to clear them out before each instance model.zero_grad() # Step 2. Get our inputs ready for the network, that is, # turn them into Tensors of word indices. sentence_in = prepare_sequence(sentence, word_to_ix) targets = torch.tensor([tag_to_ix[t] for t in tags], dtype=torch.long) # Step 3. Run our forward pass. loss = model.neg_log_likelihood(sentence_in, targets) # Step 4. Compute the loss, gradients, and update the parameters by # calling optimizer.step() loss.backward() optimizer.step() # Check predictions after training with torch.no_grad(): precheck_sent = prepare_sequence(training_data[0][0], word_to_ix) print(model(precheck_sent)) # We got it! 

    输出

    (tensor(2.6907), [1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1]) (tensor(20.4906), [0, 1, 1, 1, 2, 2, 2, 0, 1, 2, 2]) 

    我们没有必要在进行解码时创建计算图,因为我们不会从维特比路径得分反向传播。 因为无论如何我们都有它,尝试训练标记器,其中损失函数是维特比路径得分和测试标准路径得分之间的差异。 应该清楚的是,当预测的标签序列是正确的标签序列时,该功能是非负的和 0。 这基本上是结构感知器。

    由于已经实现了 Viterbi 和 score_sentence,因此这种修改应该很短。 这是取决于训练实例的计算图形的示例。 虽然我没有尝试在静态工具包中实现它,但我想它可能但不那么直截了当。

    拿起一些真实数据并进行比较!

    原文链接: https://pytorch.org/tutorials/beginner/nlp/advanced_tutorial.html#advanced-making-dynamic-decisions-and-the-bi-lstm-crf

    更多 PyTorch 实战教程: http://pytorchchina.com/

    3 replies    2019-04-16 09:57:01 +08:00
    diggerdu
        1
    diggerdu  
       Apr 12, 2019 via iPhone
    写的不错
    oblivious
        2
    oblivious  
       Apr 12, 2019
    NICE WORK!

    但是有一个小问题:这好像不是中文分词,是我理解错了吗 > <
    fendouai_com
        3
    fendouai_com  
    OP
       Apr 16, 2019
    @oblivious 这个是 NER,可以迁移到中文分词,文章开头有提到迁移的思路。
    About     Help     Advertise     Blog     API     FAQ     Solana     1295 Online   Highest 6679       Select Language
    创意工作者们的社区
    World is powered by solitude
    VERSION: 3.9.8.5 34ms UTC 23:43 PVG 07:43 LAX 16:43 JFK 19:43
    Do have faith in what you're doing.
    ubao msn snddm index pchome yahoo rakuten mypaper meadowduck bidyahoo youbao zxmzxm asda bnvcg cvbfg dfscv mmhjk xxddc yybgb zznbn ccubao uaitu acv GXCV ET GDG YH FG BCVB FJFH CBRE CBC GDG ET54 WRWR RWER WREW WRWER RWER SDG EW SF DSFSF fbbs ubao fhd dfg ewr dg df ewwr ewwr et ruyut utut dfg fgd gdfgt etg dfgt dfgd ert4 gd fgg wr 235 wer3 we vsdf sdf gdf ert xcv sdf rwer hfd dfg cvb rwf afb dfh jgh bmn lgh rty gfds cxv xcv xcs vdas fdf fgd cv sdf tert sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf shasha9178 shasha9178 shasha9178 shasha9178 shasha9178 liflif2 liflif2 liflif2 liflif2 liflif2 liblib3 liblib3 liblib3 liblib3 liblib3 zhazha444 zhazha444 zhazha444 zhazha444 zhazha444 dende5 dende denden denden2 denden21 fenfen9 fenf619 fen619 fenfe9 fe619 sdf sdf sdf sdf sdf zhazh90 zhazh0 zhaa50 zha90 zh590 zho zhoz zhozh zhozho zhozho2 lislis lls95 lili95 lils5 liss9 sdf0ty987 sdft876 sdft9876 sdf09876 sd0t9876 sdf0ty98 sdf0976 sdf0ty986 sdf0ty96 sdf0t76 sdf0876 df0ty98 sf0t876 sd0ty76 sdy76 sdf76 sdf0t76 sdf0ty9 sdf0ty98 sdf0ty987 sdf0ty98 sdf6676 sdf876 sd876 sd876 sdf6 sdf6 sdf9876 sdf0t sdf06 sdf0ty9776 sdf0ty9776 sdf0ty76 sdf8876 sdf0t sd6 sdf06 s688876 sd688 sdf86