The N-Gram Data Structure
An ARPA n-gram file looks like this (excerpt):
    \data\
    ngram 1=64000
    ngram 2=522530
    ngram 3=173445

    \1-grams:
    -5.24036  'cause    -0.2084827
    -4.675221 'em       -0.221857
    -4.989297 'n        -0.05809768
    -5.365303 'til      -0.1855581
    -2.111539 </s>       0.0
    -99       <s>       -0.7736475
    -1.128404 <unk>     -0.8049794
    -2.271447 a         -0.6163939
    -5.174762 a's       -0.03869072
    -3.384722 a.        -0.1877073
    -5.789208 a.'s       0.0
    -6.000091 aachen     0.0
    -4.707208 aaron     -0.2046838
    -5.580914 aaron's   -0.06230035
    -5.789208 aarons    -0.07077657
    -5.881973 aaronson  -0.2173971

For a full description, see: the ARPA n-gram language model format.
The whole ARPA LM is made up of many n-gram entries. The data structures of a single entry and of the model as a whole are described in turn below.
1. The n-gram entry data structure

A single n-gram entry is laid out as follows:
typedef struct {
    real  log_prob;
    real  log_bo;
    int  *words;
} ARPALMEntry;

words: the IDs of the words making up this n-gram. A 1-gram holds a single word ID; a 2-gram holds the IDs of both words.
log_bo: the n-gram's back-off weight (as a log probability).

log_prob: the n-gram's own log probability.
2. The ARPA-LM data structure

The whole n-gram language model, built from many such entries, has the following fields:
vocab: pointer to the dictionary used to build the language model. For the dictionary definition, see: the dictionary's in-memory storage model.
entries: all of the model's n-gram entries, a two-dimensional array of ARPALMEntry. entries[0] holds the 1-grams, entries[1] the 2-grams, and so on.
n_ngrams: an integer array giving, in order, the number of 1-gram, 2-gram, 3-gram, ... entries.
unk_wrd: the dictionary word that is allowed to be absent from the language model.

unk_id: the ID of that word, assigned as the last word index in the dictionary.
n_unk_words: counted after the language model is read, the number of words that are in the dictionary but were not used to build the model. If unk_wrd is not specified, such words are not allowed, i.e. every dictionary word must appear in the language model.
unk_words: the IDs of the words counted by n_unk_words above.
words_in_lm: flags, for each dictionary word, whether the word appears in the language model.
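Gathered into one declaration, the fields above might look roughly like this. The field names come from the list above, but the exact types (and the `Vocab` type) are assumptions, since the original code snippet did not survive:

```c
typedef float real;

typedef struct {
    real  log_prob; /* log10 probability of this n-gram */
    real  log_bo;   /* log10 back-off weight            */
    int  *words;    /* word IDs making up the n-gram    */
} ARPALMEntry;

/* Hypothetical dictionary type; see the dictionary storage post. */
typedef struct Vocab Vocab;

typedef struct {
    Vocab        *vocab;       /* dictionary used to build the LM        */
    ARPALMEntry **entries;     /* entries[0] = 1-grams, entries[1] = ... */
    int          *n_ngrams;    /* n_ngrams[i] = number of (i+1)-grams    */
    char         *unk_wrd;     /* dictionary word allowed to miss the LM */
    int           unk_id;      /* its ID: last index in the dictionary   */
    int           n_unk_words; /* dictionary words not used in the LM    */
    int          *unk_words;   /* IDs of those unused words              */
    char         *words_in_lm; /* per-word flag: appears in the LM?      */
} ARPALM;
```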
Reposted from: https://www.cnblogs.com/jonky/p/10154115.html
總結
以上是生活随笔為你收集整理的N-Gram的数据结构的全部內容,希望文章能夠幫你解決所遇到的問題。
 
                            
                        