Notes on the Code of Knowledge Graph Embedding Based Question Answering
Preface
Recently my advisor asked me to study the paper Knowledge Graph Embedding Based Question Answering. Its key idea is to use a knowledge graph as the dataset and solve Question Answering, a natural language processing task, without needing to know the structure of the data. These notes only record the code snippets from the paper's GitHub repository that I, a very green undergraduate, found potentially useful while reading; detailed notes on the paper itself will go into a separate write-up.
I hope to learn and improve together with everyone. Let's keep at it!
Paper link:
delivery.acm.org/10.1145/330…acm=1564312374_9607150c0f9e4d7029cba11e69cb8903 (please copy the full URL)
GitHub link:
github.com/xhuang31/KE…
Updates will be added below, step by step.
Now for the main content!
For example, to strip the leading "what is" from "what is your name" and be left with "your name", we can use the following code:
```python
whhowset = [{'what', 'how', 'where', 'who', 'which', 'whom'},
            {'in which', 'what is', "what 's", 'what are', 'what was', 'what were',
             'where is', 'where are', 'where was', 'where were',
             'who is', 'who was', 'who are', 'how is', 'what did'},
            {'what kind of', 'what kinds of', 'what type of', 'what types of', 'what sort of'}]
question = ["what", "is", "your", "name"]
# Try the longest prefix first: 3-word, then 2-word, then 1-word question phrases
for j in range(3, 0, -1):
    if ' '.join(question[0:j]) in whhowset[j - 1]:
        del question[0:j]
        continue
print(question)
```

output: ['your', 'name']
Quoting Wikipedia's explanation of n-grams: an n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus.
n is configurable, e.g. unigram (n = 1) or bigram (n = 2). Concrete n-gram examples (a sketch that generates the word-level list follows this list):
- word: apple, n-gram list: ['a', 'p', 'p', 'l', 'e', 'ap', 'pp', 'pl', 'le', 'app', 'ppl', 'ple', 'appl', 'pple', 'apple']
- sentence: 'how are you', n-gram list: ['how', 'are', 'you', 'how are', 'are you', 'how are you']
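The snippet that produced the output below was lost in the copy; a minimal word-level sketch (the helper name ngrams and its signature are my own) would be:

```python
def ngrams(tokens, max_n):
    """Collect every n-gram, joined by spaces, for n = 1 .. max_n."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams

print(ngrams('how are you'.split(), 3))
```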
output: ['how', 'are', 'you', 'how are', 'are you', 'how are you']
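The code behind the next output is also missing; my guess, purely an assumption from the output format, is a loop that writes each word of a small vocabulary next to its index:

```python
# Assumed reconstruction: write "word index" pairs, one per line, to output.txt
words = ['Human', 'a', 'am', 'I']  # illustrative word list taken from the output below
with open('output.txt', 'w') as f:
    for idx, word in enumerate(words):
        f.write('{} {}\n'.format(word, idx))
```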
output: a file output.txt with the contents:

```
Human 0
a 1
am 2
I 3
```

function: parser.add_argument(name or flags...[, action][, nargs][, const][, default][, type][, choices][, required][, help][, metavar][, dest])
parameters (quoted from the Python documentation; a full example follows this list):
- const - A constant value required by some action and nargs selections.
- dest - The name of the attribute to be added to the object returned by parse_args().
- action - The basic type of action to be taken when this argument is encountered at the command line.
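The script that produced the output below is not shown in these notes; the canonical prog.py example from the Python argparse documentation reproduces it exactly:

```python
import argparse

parser = argparse.ArgumentParser(description='Process some integers.')
parser.add_argument('integers', metavar='N', type=int, nargs='+',
                    help='an integer for the accumulator')
parser.add_argument('--sum', dest='accumulate', action='store_const',
                    const=sum, default=max,  # --sum switches the reducer from max to sum
                    help='sum the integers (default: find the max)')

args = parser.parse_args()
print(args.accumulate(args.integers))
```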
output: python prog.py 1 2 3 4 --> 4 (the maximum); python prog.py 1 2 3 4 --sum --> 10 (the sum)
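The next output comes from collections.Counter; the original snippet is gone, but the standard tallying example from the Python documentation yields exactly this result:

```python
from collections import Counter

# Tally how often each colour occurs
cnt = Counter()
for word in ['red', 'blue', 'red', 'green', 'blue', 'blue']:
    cnt[word] += 1
print(cnt)
```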
output: Counter({'blue': 3, 'red': 2, 'green': 1})
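The seeding snippet itself is missing; a minimal sketch follows (the seed value 1 is my own choice, and the exact numbers printed depend on the seed and the PyTorch version):

```python
import torch

torch.manual_seed(1)  # fix the RNG seed so every run draws the same numbers
print(torch.rand(3))
```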
output: tensor([0.0043, 0.1056, 0.2858]). With the seed fixed, this tensor is the same on every run; without the manual_seed call the output differs each time.
Example:

```python
# Make cuDNN choose deterministic algorithms, for reproducibility on GPU
torch.backends.cudnn.deterministic = True
```

torchtext components:
- Field: holds the preprocessing configuration, e.g. the tokenization method, whether to lowercase, start and end tokens, the padding token, the vocabulary, and so on
- Dataset: inherits from PyTorch's Dataset and loads the data. TabularDataset loads data conveniently once you specify the path, format, and Field information. torchtext also ships pre-built Dataset objects for common datasets that can be loaded directly, and their splits method loads the training, validation, and test sets in one call.
- Iterator: the iterator that feeds data to the model, with support for customizing batches
Here the preprocessing is configured to lowercase all text:
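A minimal sketch of such a Field, assuming the legacy torchtext API this repo was written against (the label field ED is my assumption, mirroring the ('ed', ED) entry used further below):

```python
from torchtext import data

# Lowercase every token during preprocessing
TEXT = data.Field(lower=True)
# Label field for the entity-detection tags (assumed, to match the fields below)
ED = data.Field()
```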
torchtext's Dataset inherits from PyTorch's Dataset and provides a method for downloading and unpacking compressed data (.zip, .gz, and .tgz are supported).
The splits method can read the training, validation, and test sets in one call.
TabularDataset makes it easy to read files in CSV, TSV, or JSON format:
```python
train = data.TabularDataset(path=os.path.join(args.output, 'dete_train.txt'), format='tsv',
                            fields=[('text', TEXT), ('ed', ED)])
dev, test = data.TabularDataset.splits(path=args.output, validation='valid.txt', test='test.txt',
                                       format='tsv', fields=field)
```

After the data is loaded you can build the vocabulary, and while building it you can plug in pre-trained word vectors:
```python
# Build the vocabulary from the training set, initializing embeddings
# with pre-trained GloVe vectors
TEXT.build_vocab(train, vectors="glove.6B.100d")
```

Iterator is torchtext's output to the model. It provides the usual handling of the data, such as shuffling and sorting, and the batch size can be adjusted dynamically. It, too, has a splits method that can produce iterators for the training, validation, and test sets at once:
```python
train_iter = data.Iterator(train, batch_size=args.batch_size, device=torch.device('cuda', args.gpu),
                           train=True, repeat=False, sort=False, shuffle=True, sort_within_batch=False)
dev_iter = data.Iterator(dev, batch_size=args.batch_size, device=torch.device('cuda', args.gpu),
                         train=False, repeat=False, sort=False, shuffle=False, sort_within_batch=False)
```

output: 2 -4
Reposted from: https://juejin.im/post/5d3d8157f265da1ba84ada19
Summary
That's all for these notes on the code of Knowledge Graph Embedding Based Question Answering. I hope they help you solve the problems you've run into.