Chinese Named Entity Recognition (NER)
Named Entity Recognition (NER) is the task of identifying entities with specific meaning in text, mainly person names, place names, organisation names and other proper nouns, as well as expressions of time, quantity, currency and percentage. The models that currently perform best on NER are based on deep learning or statistical learning, and what these methods have in common is that they need large amounts of data to learn from. The dataset used in this article is the resume data collected from Sina Finance, introduced in an ACL 2018 paper.
Dataset link: https://github.com/jiesutd/LatticeLSTM
The tag set uses BIOES (B marks the beginning of an entity, E the end of an entity, I a character inside an entity, O a non-entity character, and S a single-character entity), and sentences are separated by a blank line.
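For illustration, one annotated sentence can be represented as parallel character/tag lists (this reuses the example that appears in the docstrings of the code below; the actual data-loading code is not shown in this post):

```python
# One sentence from the corpus as parallel character/tag lists.
# 科员 ("staff member") is a job title, so it is a TITLE entity:
# B-TITLE marks its first character and E-TITLE its last; the other characters are O.
word_list = ['担', '任', '科', '员']
tag_list = ['O', 'O', 'B-TITLE', 'E-TITLE']
```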
Examples of other approaches to NER
Commonly used models and their main code
1. Hidden Markov Model (HMM)
A hidden Markov model describes a process in which a hidden Markov chain randomly generates an unobservable sequence of states, and each state in turn generates an observation, producing the observation sequence (Li Hang, Statistical Learning Methods). An HMM is fully determined by its initial state distribution, its state transition probability matrix, and its observation (emission) probability matrix. If that definition sounds too academic, it is enough to know that NER can essentially be treated as a sequence labelling problem (predicting the BIOES tag of every character). When using an HMM for this task, what we can observe is the sequence of characters (the observation sequence), and what we cannot observe is the tag of each character (the state sequence). The three HMM components then have a concrete interpretation: the initial state distribution gives, for each tag, the probability that it is the tag of the first character of a sentence; the state transition probability matrix gives the probability of moving from one tag to the next (if the transition matrix is A and the previous character's tag is i, then the probability that the next character's tag is j is A[i][j]); and the observation probability matrix gives the probability of generating a particular character under a particular tag. Based on these three components we can define the following HMM model:
```python
import torch


class HMM(object):
    def __init__(self, N, M):
        """
        Args:
            N: number of states, i.e. the number of distinct tags
            M: number of observations, i.e. the number of distinct characters
        """
        self.N = N
        self.M = M
        # Transition probability matrix: A[i][j] is the probability of moving from state i to state j
        self.A = torch.zeros(N, N)
        # Observation probability matrix: B[i][j] is the probability of state i generating observation j
        self.B = torch.zeros(N, M)
        # Initial state distribution: Pi[i] is the probability that the initial state is i
        self.Pi = torch.zeros(N)
```

With the model defined, the next problem is training it. Training an HMM corresponds to the learning problem of hidden Markov models (Li Hang, Statistical Learning Methods): we estimate the three components of the model from the training data by maximum likelihood, i.e. the initial state distribution, the state transition probability matrix and the observation probability matrix mentioned above. An example helps: when estimating the initial state distribution, if a certain tag occurs k times as the tag of the first character of a sentence and the dataset contains N sentences in total, then the probability of that tag starting a sentence can be approximated as k/N. Quite simple. Using this approach we estimate all three HMM components; the code is as follows (methods that have already appeared are replaced by ellipses):
```python
class HMM(object):
    def __init__(self, N, M):
        ...

    def train(self, word_lists, tag_lists, word2id, tag2id):
        """Train the HMM, i.e. estimate the model parameters from the training corpus.
        Since we have both the observation sequences and their corresponding state sequences,
        we can use maximum likelihood estimation.

        Args:
            word_lists: list whose elements are lists of characters, e.g. ['担', '任', '科', '员']
            tag_lists: list whose elements are the corresponding tag lists, e.g. ['O', 'O', 'B-TITLE', 'E-TITLE']
            word2id: dict mapping characters to ids
            tag2id: dict mapping tags to ids
        """
        assert len(tag_lists) == len(word_lists)

        # Estimate the transition probability matrix
        for tag_list in tag_lists:
            seq_len = len(tag_list)
            for i in range(seq_len - 1):
                current_tagid = tag2id[tag_list[i]]
                next_tagid = tag2id[tag_list[i+1]]
                self.A[current_tagid][next_tagid] += 1
        # An important detail: if some transition never occurs, its count stays 0,
        # which breaks later computations (taking the log of 0).
        # Fix: replace zero probabilities with a very small number.
        self.A[self.A == 0.] = 1e-10
        self.A = self.A / self.A.sum(dim=1, keepdim=True)

        # Estimate the observation probability matrix
        for tag_list, word_list in zip(tag_lists, word_lists):
            assert len(tag_list) == len(word_list)
            for tag, word in zip(tag_list, word_list):
                tag_id = tag2id[tag]
                word_id = word2id[word]
                self.B[tag_id][word_id] += 1
        self.B[self.B == 0.] = 1e-10
        self.B = self.B / self.B.sum(dim=1, keepdim=True)

        # Estimate the initial state distribution
        for tag_list in tag_lists:
            init_tagid = tag2id[tag_list[0]]
            self.Pi[init_tagid] += 1
        self.Pi[self.Pi == 0.] = 1e-10
        self.Pi = self.Pi / self.Pi.sum()
```

Once the model is trained, we use it for decoding: given a sentence the model has never seen, find the tag of every character in it. For this decoding problem we use the Viterbi algorithm; for its mathematical derivation, see Li Hang's Statistical Learning Methods.
HMM has two shortcomings:
1) The observations are assumed to be strictly independent: each character of the observed sentence is treated as independent of the others.
2) During state transitions the current state depends only on the previous state; the state at the following time step is not taken into account.
The main model code of the HMM implementation is as follows:
```python
import torch


class HMM(object):
    def __init__(self, N, M):
        """
        Args:
            N: number of states, i.e. the number of distinct tags
            M: number of observations, i.e. the number of distinct characters
        """
        self.N = N
        self.M = M
        # Transition probability matrix: A[i][j] is the probability of moving from state i to state j
        self.A = torch.zeros(N, N)
        # Observation probability matrix: B[i][j] is the probability of state i generating observation j
        self.B = torch.zeros(N, M)
        # Initial state distribution: Pi[i] is the probability that the initial state is i
        self.Pi = torch.zeros(N)

    def train(self, word_lists, tag_lists, word2id, tag2id):
        """Train the HMM, i.e. estimate the model parameters from the training corpus.
        Since we have both the observation sequences and their corresponding state sequences,
        we can use maximum likelihood estimation.

        Args:
            word_lists: list whose elements are lists of characters, e.g. ['担', '任', '科', '员']
            tag_lists: list whose elements are the corresponding tag lists, e.g. ['O', 'O', 'B-TITLE', 'E-TITLE']
            word2id: dict mapping characters to ids
            tag2id: dict mapping tags to ids
        """
        assert len(tag_lists) == len(word_lists)

        # Estimate the transition probability matrix
        for tag_list in tag_lists:
            seq_len = len(tag_list)
            for i in range(seq_len - 1):
                current_tagid = tag2id[tag_list[i]]
                next_tagid = tag2id[tag_list[i+1]]
                self.A[current_tagid][next_tagid] += 1
        # Problem: if some transition never occurs, its count stays 0, which breaks later computations
        # Fix: replace zero probabilities with a very small number
        self.A[self.A == 0.] = 1e-10
        self.A = self.A / self.A.sum(dim=1, keepdim=True)

        # Estimate the observation probability matrix
        for tag_list, word_list in zip(tag_lists, word_lists):
            assert len(tag_list) == len(word_list)
            for tag, word in zip(tag_list, word_list):
                tag_id = tag2id[tag]
                word_id = word2id[word]
                self.B[tag_id][word_id] += 1
        self.B[self.B == 0.] = 1e-10
        self.B = self.B / self.B.sum(dim=1, keepdim=True)

        # Estimate the initial state distribution
        for tag_list in tag_lists:
            init_tagid = tag2id[tag_list[0]]
            self.Pi[init_tagid] += 1
        self.Pi[self.Pi == 0.] = 1e-10
        self.Pi = self.Pi / self.Pi.sum()

    def test(self, word_lists, word2id, tag2id):
        pred_tag_lists = []
        for word_list in word_lists:
            pred_tag_list = self.decoding(word_list, word2id, tag2id)
            pred_tag_lists.append(pred_tag_list)
        return pred_tag_lists

    def decoding(self, word_list, word2id, tag2id):
        """Use the Viterbi algorithm to find the state sequence for a given observation
        sequence, i.e. the tag sequence for a sequence of characters.
        The Viterbi algorithm solves the HMM prediction problem with dynamic programming:
        it finds the maximum-probability path, and that path corresponds to a state sequence.
        """
        # Problem: for a long chain, multiplying many small probabilities can underflow.
        # Fix: use log probabilities, so tiny probabilities become large negative numbers
        # and multiplication becomes simple addition.
        A = torch.log(self.A)
        B = torch.log(self.B)
        Pi = torch.log(self.Pi)

        # Initialise the Viterbi matrix with shape [number of states, sequence length].
        # viterbi[i, j] is the maximum probability over all partial tag sequences
        # (i_1, i_2, ..., i_j) whose j-th tag is i.
        seq_len = len(word_list)
        viterbi = torch.zeros(self.N, seq_len)
        # backpointer has the same shape as viterbi.
        # backpointer[i, j] stores the id of the (j-1)-th tag when the j-th tag is i;
        # during decoding we follow the backpointers to recover the optimal path.
        backpointer = torch.zeros(self.N, seq_len).long()

        # self.Pi[i] is the probability that the first character has tag i
        # Bt[word_id] gives, for the character word_id, the probabilities under each tag
        # self.A.t()[tag_id] gives the probabilities of each state transitioning to tag_id
        # so the first step is:
        start_wordid = word2id.get(word_list[0], None)
        Bt = B.t()
        if start_wordid is None:
            # If the character is not in the vocabulary, assume a uniform distribution over states
            bt = torch.log(torch.ones(self.N) / self.N)
        else:
            bt = Bt[start_wordid]
        viterbi[:, 0] = Pi + bt
        backpointer[:, 0] = -1

        # Recurrence:
        # viterbi[tag_id, step] = max(viterbi[:, step-1] * self.A.t()[tag_id] * Bt[word])
        # where word is the character at time step (in log space the products become sums).
        # Compute the remaining steps with this recurrence.
        for step in range(1, seq_len):
            wordid = word2id.get(word_list[step], None)
            # Handle characters that are not in the vocabulary:
            # bt is the distribution over states when the character at time t is wordid
            if wordid is None:
                # If the character is not in the vocabulary, assume a uniform distribution over states
                bt = torch.log(torch.ones(self.N) / self.N)
            else:
                bt = Bt[wordid]  # otherwise take it from the observation probability matrix
            for tag_id in range(len(tag2id)):
                max_prob, max_id = torch.max(
                    viterbi[:, step-1] + A[:, tag_id], dim=0)
                viterbi[tag_id, step] = max_prob + bt[tag_id]
                backpointer[tag_id, step] = max_id

        # Termination: the maximum of viterbi[:, seq_len-1] is the probability of the optimal path
        best_path_prob, best_path_pointer = torch.max(
            viterbi[:, seq_len-1], dim=0)

        # Backtrace to recover the optimal path
        best_path_pointer = best_path_pointer.item()
        best_path = [best_path_pointer]
        for back_step in range(seq_len-1, 0, -1):
            best_path_pointer = backpointer[best_path_pointer, back_step]
            best_path_pointer = best_path_pointer.item()
            best_path.append(best_path_pointer)

        # Convert the sequence of tag ids into tags
        assert len(best_path) == len(word_list)
        id2tag = dict((id_, tag) for tag, id_ in tag2id.items())
        tag_list = [id2tag[id_] for id_ in reversed(best_path)]

        return tag_list
```
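As a rough usage sketch (not part of the original code), training and decoding with the HMM class above could look like the following; the toy sentences, the entity labels and the vocabulary-building lines are illustrative stand-ins for the project's real data pipeline:

```python
# Toy training data: each sentence is a list of characters with a parallel list of BIOES tags.
train_word_lists = [['担', '任', '科', '员'],
                    ['出', '生', '于', '北', '京']]
train_tag_lists = [['O', 'O', 'B-TITLE', 'E-TITLE'],
                   ['O', 'O', 'O', 'B-LOC', 'E-LOC']]

# Build character and tag vocabularies from the training data.
word2id, tag2id = {}, {}
for words, tags in zip(train_word_lists, train_tag_lists):
    for w in words:
        word2id.setdefault(w, len(word2id))
    for t in tags:
        tag2id.setdefault(t, len(tag2id))

# N = number of distinct tags, M = number of distinct characters.
hmm = HMM(N=len(tag2id), M=len(word2id))
hmm.train(train_word_lists, train_tag_lists, word2id, tag2id)

# Decode an unseen sentence; out-of-vocabulary characters fall back to a uniform emission.
print(hmm.decoding(['担', '任', '处', '长'], word2id, tag2id))
```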
2. Conditional Random Field (CRF)

The HMM above rests on two assumptions: first, the observations are strictly independent of each other; second, during state transitions the current state depends only on the previous state. In the NER setting this means the HMM treats every character of the observed sentence as independent and assumes the current tag depends only on the previous tag. In practice, however, NER benefits from richer features such as part of speech and the surrounding context, and the current tag should depend on both the previous and the following tags. Because of these two assumptions, the HMM is clearly limited when applied to NER.
A conditional random field (CRF) does not have this problem. By introducing user-defined feature functions it can express dependencies between observations as well as complex dependencies between the current observation and several preceding and following states, which effectively overcomes the shortcomings of the HMM. The mathematics of CRFs is not covered here; decoding again uses the Viterbi algorithm.
```python
from sklearn_crfsuite import CRF  # A full CRF implementation is complex, so we use an external library


def word2features(sent, i):
    """Extract features for a single character."""
    word = sent[i]
    prev_word = "<s>" if i == 0 else sent[i-1]
    next_word = "</s>" if i == (len(sent)-1) else sent[i+1]
    # The neighbouring characters influence the tag of the current character,
    # so we use the previous character, the current character, the next character,
    # previous + current, and current + next as features.
    features = {
        'w': word,
        'w-1': prev_word,
        'w+1': next_word,
        'w-1:w': prev_word + word,
        'w:w+1': word + next_word,
        'bias': 1
    }
    return features


def sent2features(sent):
    """Extract features for a whole sequence."""
    return [word2features(sent, i) for i in range(len(sent))]


class CRFModel(object):
    def __init__(self,
                 algorithm='lbfgs',
                 c1=0.1,
                 c2=0.1,
                 max_iterations=100,
                 all_possible_transitions=False):
        self.model = CRF(algorithm=algorithm,
                         c1=c1,
                         c2=c2,
                         max_iterations=max_iterations,
                         all_possible_transitions=all_possible_transitions)

    def train(self, sentences, tag_lists):
        """Train the model."""
        features = [sent2features(s) for s in sentences]
        self.model.fit(features, tag_lists)

    def test(self, sentences):
        """Decode: predict the tag sequence for the given sentences."""
        features = [sent2features(s) for s in sentences]
        pred_tag_lists = self.model.predict(features)
        return pred_tag_lists
```
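A quick usage sketch of the wrapper above (again with illustrative toy data; `sklearn-crfsuite` must be installed):

```python
train_sents = [['担', '任', '科', '员'],
               ['出', '生', '于', '北', '京']]
train_tags = [['O', 'O', 'B-TITLE', 'E-TITLE'],
              ['O', 'O', 'O', 'B-LOC', 'E-LOC']]

crf_model = CRFModel(max_iterations=50)
crf_model.train(train_sents, train_tags)

# Predict tag sequences for new sentences.
print(crf_model.test([['担', '任', '处', '长']]))
```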
3. BiLSTM-CRF
A plain LSTM has the advantage that, with a bidirectional setup, it can learn dependencies between the observations (the input characters); during training it automatically extracts features of the observation sequence for the target task (e.g. recognising entities). Its drawback is that it cannot learn the relationships within the state sequence (the output tags). In NER the tags are clearly interdependent: a B-type tag (the beginning of an entity), for example, cannot be followed directly by another B-type tag. So while an LSTM saves a lot of tedious feature engineering for sequence labelling tasks such as NER, it cannot model the tag context. Conversely, the strength of a CRF is that it models the hidden states and learns the structure of the state sequence, while its weakness is that it requires hand-crafted sequence features. The usual approach is therefore to add a CRF layer on top of the LSTM, combining the advantages of both.
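The implementation below imports a `BiLSTM` module from a local `bilstm.py` that is not shown in this post. As a reference point only, a minimal emission model of that kind (embedding, bidirectional LSTM, linear projection to per-tag scores) might look like the sketch below; the project's actual file may differ in its details:

```python
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence


class BiLSTM(nn.Module):
    """Sketch of an emission model: embedding -> bidirectional LSTM -> per-tag scores."""

    def __init__(self, vocab_size, emb_size, hidden_size, out_size):
        super(BiLSTM, self).__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.bilstm = nn.LSTM(emb_size, hidden_size,
                              batch_first=True, bidirectional=True)
        # Forward and backward hidden states are concatenated, hence 2 * hidden_size.
        self.lin = nn.Linear(2 * hidden_size, out_size)

    def forward(self, sents_tensor, lengths):
        emb = self.embedding(sents_tensor)                 # [B, L, emb_size]
        # The training code sorts sentences by length, so the default
        # enforce_sorted=True of pack_padded_sequence is satisfied.
        packed = pack_padded_sequence(emb, lengths, batch_first=True)
        rnn_out, _ = self.bilstm(packed)
        rnn_out, _ = pad_packed_sequence(rnn_out, batch_first=True)
        scores = self.lin(rnn_out)                         # [B, L, out_size]
        return scores
```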
Below is the code that adds a CRF layer on top of the Bi-LSTM:
```python
from itertools import zip_longest
from copy import deepcopy

import torch
import torch.nn as nn
import torch.optim as optim

from .util import tensorized, sort_by_lengths, cal_loss, cal_lstm_crf_loss
from .config import TrainingConfig, LSTMConfig
from .bilstm import BiLSTM


class BILSTM_Model(object):
    def __init__(self, vocab_size, out_size, crf=True):
        """Train and test the LSTM-based model.

        Args:
            vocab_size: size of the vocabulary
            out_size: number of tag types
            crf: whether to add a CRF layer
        """
        self.device = torch.device(
            "cuda" if torch.cuda.is_available() else "cpu")

        # Load model hyperparameters
        self.emb_size = LSTMConfig.emb_size
        self.hidden_size = LSTMConfig.hidden_size

        self.crf = crf
        # Initialise a different model and loss function depending on whether a CRF layer is used
        if not crf:
            self.model = BiLSTM(vocab_size, self.emb_size,
                                self.hidden_size, out_size).to(self.device)
            self.cal_loss_func = cal_loss
        else:
            self.model = BiLSTM_CRF(vocab_size, self.emb_size,
                                    self.hidden_size, out_size).to(self.device)
            self.cal_loss_func = cal_lstm_crf_loss

        # Load training hyperparameters
        self.epoches = TrainingConfig.epoches
        self.print_step = TrainingConfig.print_step
        self.lr = TrainingConfig.lr
        self.batch_size = TrainingConfig.batch_size

        # Initialise the optimizer
        self.optimizer = optim.Adam(self.model.parameters(), lr=self.lr)

        # Initialise other bookkeeping
        self.step = 0
        self._best_val_loss = 1e18
        self.best_model = None

    def train(self, word_lists, tag_lists,
              dev_word_lists, dev_tag_lists,
              word2id, tag2id):
        # Sort the dataset by sentence length
        word_lists, tag_lists, _ = sort_by_lengths(word_lists, tag_lists)
        dev_word_lists, dev_tag_lists, _ = sort_by_lengths(
            dev_word_lists, dev_tag_lists)

        B = self.batch_size
        for e in range(1, self.epoches+1):
            self.step = 0
            losses = 0.
            for ind in range(0, len(word_lists), B):
                batch_sents = word_lists[ind:ind+B]
                batch_tags = tag_lists[ind:ind+B]

                losses += self.train_step(batch_sents,
                                          batch_tags, word2id, tag2id)

                if self.step % TrainingConfig.print_step == 0:
                    total_step = (len(word_lists) // B + 1)
                    print("Epoch {}, step/total_step: {}/{} {:.2f}% Loss:{:.4f}".format(
                        e, self.step, total_step,
                        100. * self.step / total_step,
                        losses / self.print_step
                    ))
                    losses = 0.

            # At the end of every epoch, evaluate on the dev set and keep the best model
            val_loss = self.validate(
                dev_word_lists, dev_tag_lists, word2id, tag2id)
            print("Epoch {}, Val Loss:{:.4f}".format(e, val_loss))

    def train_step(self, batch_sents, batch_tags, word2id, tag2id):
        self.model.train()
        self.step += 1
        # Prepare the data
        tensorized_sents, lengths = tensorized(batch_sents, word2id)
        tensorized_sents = tensorized_sents.to(self.device)
        targets, lengths = tensorized(batch_tags, tag2id)
        targets = targets.to(self.device)

        # forward
        scores = self.model(tensorized_sents, lengths)

        # Compute the loss and update the parameters
        self.optimizer.zero_grad()
        loss = self.cal_loss_func(scores, targets, tag2id).to(self.device)
        loss.backward()
        self.optimizer.step()

        return loss.item()

    def validate(self, dev_word_lists, dev_tag_lists, word2id, tag2id):
        self.model.eval()
        with torch.no_grad():
            val_losses = 0.
            val_step = 0
            for ind in range(0, len(dev_word_lists), self.batch_size):
                val_step += 1
                # Prepare the batch
                batch_sents = dev_word_lists[ind:ind+self.batch_size]
                batch_tags = dev_tag_lists[ind:ind+self.batch_size]
                tensorized_sents, lengths = tensorized(
                    batch_sents, word2id)
                tensorized_sents = tensorized_sents.to(self.device)
                targets, lengths = tensorized(batch_tags, tag2id)
                targets = targets.to(self.device)

                # forward
                scores = self.model(tensorized_sents, lengths)

                # Compute the loss
                loss = self.cal_loss_func(
                    scores, targets, tag2id).to(self.device)
                val_losses += loss.item()
            val_loss = val_losses / val_step

            if val_loss < self._best_val_loss:
                print("Saving model...")
                self.best_model = deepcopy(self.model)
                self._best_val_loss = val_loss

            return val_loss

    def test(self, word_lists, tag_lists, word2id, tag2id):
        """Return the predictions of the best model on the test set."""
        # Prepare the data
        word_lists, tag_lists, indices = sort_by_lengths(word_lists, tag_lists)
        tensorized_sents, lengths = tensorized(word_lists, word2id)
        tensorized_sents = tensorized_sents.to(self.device)

        self.best_model.eval()
        with torch.no_grad():
            batch_tagids = self.best_model.test(
                tensorized_sents, lengths, tag2id)

        # Convert ids back to tags
        pred_tag_lists = []
        id2tag = dict((id_, tag) for tag, id_ in tag2id.items())
        for i, ids in enumerate(batch_tagids):
            tag_list = []
            if self.crf:
                for j in range(lengths[i] - 1):  # the end token is dropped during CRF decoding
                    tag_list.append(id2tag[ids[j].item()])
            else:
                for j in range(lengths[i]):
                    tag_list.append(id2tag[ids[j].item()])
            pred_tag_lists.append(tag_list)

        # indices stores the index mapping produced by sorting by length.
        # For example, indices = [1, 2, 0] means the element originally at index 1
        # now has index 0, the element originally at index 2 now has index 1, and so on.
        # Use indices to restore pred_tag_lists and tag_lists to their original order.
        ind_maps = sorted(list(enumerate(indices)), key=lambda e: e[1])
        indices, _ = list(zip(*ind_maps))
        pred_tag_lists = [pred_tag_lists[i] for i in indices]
        tag_lists = [tag_lists[i] for i in indices]

        return pred_tag_lists, tag_lists


class BiLSTM_CRF(nn.Module):
    def __init__(self, vocab_size, emb_size, hidden_size, out_size):
        """
        Args:
            vocab_size: size of the vocabulary
            emb_size: dimension of the character embeddings
            hidden_size: dimension of the hidden vectors
            out_size: number of tag types
        """
        super(BiLSTM_CRF, self).__init__()
        self.bilstm = BiLSTM(vocab_size, emb_size, hidden_size, out_size)

        # The CRF essentially learns one extra transition matrix of shape [out_size, out_size],
        # initialised as a uniform distribution
        self.transition = nn.Parameter(
            torch.ones(out_size, out_size) * 1/out_size)
        # self.transition.data.zero_()

    def forward(self, sents_tensor, lengths):
        # [B, L, out_size]
        emission = self.bilstm(sents_tensor, lengths)

        # Compute the CRF scores; their shape is [B, L, out_size, out_size],
        # i.e. each character gets an [out_size, out_size] matrix whose element (i, j)
        # is the score of the previous tag being i and the current tag being j.
        batch_size, max_len, out_size = emission.size()
        crf_scores = emission.unsqueeze(2).expand(
            -1, -1, out_size, -1) + self.transition.unsqueeze(0)

        return crf_scores

    def test(self, test_sents_tensor, lengths, tag2id):
        """Decode with the Viterbi algorithm."""
        start_id = tag2id['<start>']
        end_id = tag2id['<end>']
        pad = tag2id['<pad>']
        tagset_size = len(tag2id)

        crf_scores = self.forward(test_sents_tensor, lengths)
        device = crf_scores.device
        # B: batch_size, L: max_len, T: size of the tag set
        B, L, T, _ = crf_scores.size()
        # viterbi[i, j, k] is the maximum score of the j-th character of the i-th sentence having the k-th tag
        viterbi = torch.zeros(B, L, T).to(device)
        # backpointer[i, j, k] stores the id of the previous tag when the j-th character
        # of the i-th sentence has the k-th tag; used for backtracking
        backpointer = (torch.zeros(B, L, T).long() * end_id).to(device)
        lengths = torch.LongTensor(lengths).to(device)
        # Forward recursion
        for step in range(L):
            batch_size_t = (lengths > step).sum().item()
            if step == 0:
                # The tag before the first character can only be start_id
                viterbi[:batch_size_t, step,
                        :] = crf_scores[: batch_size_t, step, start_id, :]
                backpointer[: batch_size_t, step, :] = start_id
            else:
                max_scores, prev_tags = torch.max(
                    viterbi[:batch_size_t, step-1, :].unsqueeze(2) +
                    crf_scores[:batch_size_t, step, :, :],  # [B, T, T]
                    dim=1
                )
                viterbi[:batch_size_t, step, :] = max_scores
                backpointer[:batch_size_t, step, :] = prev_tags

        # Only the backpointer matrix is needed for backtracking
        backpointer = backpointer.view(B, -1)  # [B, L * T]
        tagids = []  # holds the results
        tags_t = None
        for step in range(L-1, 0, -1):
            batch_size_t = (lengths > step).sum().item()
            if step == L-1:
                index = torch.ones(batch_size_t).long() * (step * tagset_size)
                index = index.to(device)
                index += end_id
            else:
                prev_batch_size_t = len(tags_t)

                new_in_batch = torch.LongTensor(
                    [end_id] * (batch_size_t - prev_batch_size_t)).to(device)
                offset = torch.cat(
                    [tags_t, new_in_batch],
                    dim=0)  # this offset is simply the tag ids of the previous time step
                index = torch.ones(batch_size_t).long() * (step * tagset_size)
                index = index.to(device)
                index += offset.long()

            try:
                tags_t = backpointer[:batch_size_t].gather(
                    dim=1, index=index.unsqueeze(1).long())
            except RuntimeError:
                import pdb
                pdb.set_trace()
            tags_t = tags_t.squeeze(1)
            tagids.append(tags_t.tolist())

        # tagids is a list of length L-1 (L-1 because the end token is skipped);
        # each element holds the tags of the batch at that time step.
        # Reverse it and transpose the dimensions to [B, L].
        tagids = list(zip_longest(*reversed(tagids), fillvalue=pad))
        tagids = torch.Tensor(tagids).long()

        # Return the decoded result
        return tagids
```

Note: for an intuitive, accessible explanation of the Viterbi algorithm, see the article 如何通俗讲解维特比算法 (How to explain the Viterbi algorithm in plain terms).
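The training code above relies on `cal_lstm_crf_loss` from the project's `util` module, which is not shown in this post. For reference only, the quantity it needs to compute is the CRF negative log-likelihood; a minimal single-sentence sketch (my own simplification, ignoring batching, padding and the `<start>`/`<end>` tags that the real implementation handles) looks roughly like this:

```python
import torch


def crf_nll(emissions, transitions, tags):
    """Negative log-likelihood of one tag sequence under a linear-chain CRF.

    emissions:   [seq_len, num_tags] per-character tag scores from the BiLSTM
    transitions: [num_tags, num_tags], transitions[i, j] = score of tag i -> tag j
    tags:        [seq_len] gold tag ids
    """
    seq_len, num_tags = emissions.shape

    # Score of the gold path: emission scores plus transition scores along the path.
    gold_score = emissions[0, tags[0]]
    for t in range(1, seq_len):
        gold_score = gold_score + transitions[tags[t - 1], tags[t]] + emissions[t, tags[t]]

    # Log partition function via the forward algorithm (log-sum-exp over all paths).
    alpha = emissions[0]                                   # [num_tags]
    for t in range(1, seq_len):
        # alpha[j] = logsumexp_i(alpha[i] + transitions[i, j]) + emissions[t, j]
        alpha = torch.logsumexp(alpha.unsqueeze(1) + transitions, dim=0) + emissions[t]
    log_partition = torch.logsumexp(alpha, dim=0)

    # Minimising this pushes up the gold path's share of the total score.
    return log_partition - gold_score


if __name__ == "__main__":
    torch.manual_seed(0)
    emissions = torch.randn(4, 5)        # a 4-character sentence, 5 tags
    transitions = torch.randn(5, 5)
    tags = torch.tensor([0, 1, 1, 2])
    print(crf_nll(emissions, transitions, tags))
```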
Other learning resources:
Advanced: Making Dynamic Decisions and the Bi-LSTM CRF — PyTorch Tutorials 1.11.0+cu102 documentation
Bi-LSTM-CRF for Sequence Labeling - 知乎
https://github.com/jiesutd/LatticeLSTM