LSTM: A Translation and Commentary on "Understanding LSTM Networks"
Contents
Understanding LSTM Networks
Recurrent Neural Networks
The Problem of Long-Term Dependencies
LSTM Networks
The Core Idea Behind LSTMs
Step-by-Step LSTM Walk Through
Variants on Long Short Term Memory
Conclusion
Acknowledgments
Understanding LSTM Networks
Posted on August 27, 2015
Original post: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Recurrent Neural Networks
Humans don’t start their thinking from scratch every second. As you read this essay, you understand each word based on your understanding of previous words. You don’t throw everything away and start thinking from scratch again. Your thoughts have persistence.

Traditional neural networks can’t do this, and it seems like a major shortcoming. For example, imagine you want to classify what kind of event is happening at every point in a movie. It’s unclear how a traditional neural network could use its reasoning about previous events in the film to inform later ones.

Recurrent neural networks address this issue. They are networks with loops in them, allowing information to persist.
In the above diagram, a chunk of neural network, \(A\), looks at some input \(x_t\) and outputs a value \(h_t\). A loop allows information to be passed from one step of the network to the next.
These loops make recurrent neural networks seem kind of mysterious. However, if you think a bit more, it turns out that they aren’t all that different than a normal neural network. A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor. Consider what happens if we unroll the loop:
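To make the unrolling concrete, here is a minimal sketch of a vanilla RNN processing a sequence one step at a time; the weight matrices `W_h`, `W_x` and bias `b` are hypothetical names introduced only for illustration:

```python
import numpy as np

def rnn_forward(xs, W_h, W_x, b):
    """Unroll a vanilla RNN over a sequence of input vectors xs.

    Every step reuses the same weights (the unrolled "copies" share
    parameters) and passes its hidden state h on to the next step.
    """
    h = np.zeros(W_h.shape[0])            # initial hidden state h_0
    hs = []
    for x in xs:                          # one loop iteration = one unrolled copy
        h = np.tanh(W_h @ h + W_x @ x + b)
        hs.append(h)
    return hs                             # h_1 ... h_T

# Toy usage: hidden size 4, input size 3, sequence of 5 time steps
rng = np.random.default_rng(0)
W_h, W_x, b = rng.normal(size=(4, 4)), rng.normal(size=(4, 3)), np.zeros(4)
hidden_states = rnn_forward([rng.normal(size=3) for _ in range(5)], W_h, W_x, b)
```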
This chain-like nature reveals that recurrent neural networks are intimately related to sequences and lists. They’re the natural architecture of neural network to use for such data.

And they certainly are used! In the last few years, there has been incredible success applying RNNs to a variety of problems: speech recognition, language modeling, translation, image captioning… The list goes on. I’ll leave discussion of the amazing feats one can achieve with RNNs to Andrej Karpathy’s excellent blog post, The Unreasonable Effectiveness of Recurrent Neural Networks. But they really are pretty amazing.

Essential to these successes is the use of “LSTMs,” a very special kind of recurrent neural network which works, for many tasks, much much better than the standard version. Almost all exciting results based on recurrent neural networks are achieved with them. It’s these LSTMs that this essay will explore.
The Problem of Long-Term Dependencies
One of the appeals of RNNs is the idea that they might be able to connect previous information to the present task, such as using previous video frames might inform the understanding of the present frame. If RNNs could do this, they’d be extremely useful. But can they? It depends.

Sometimes, we only need to look at recent information to perform the present task. For example, consider a language model trying to predict the next word based on the previous ones. If we are trying to predict the last word in “the clouds are in the sky,” we don’t need any further context – it’s pretty obvious the next word is going to be sky. In such cases, where the gap between the relevant information and the place that it’s needed is small, RNNs can learn to use the past information.
But there are also cases where we need more context. Consider trying to predict the last word in the text “I grew up in France… I speak fluent French.” Recent information suggests that the next word is probably the name of a language, but if we want to narrow down which language, we need the context of France, from further back. It’s entirely possible for the gap between the relevant information and the point where it is needed to become very large.
In theory, RNNs are absolutely capable of handling such “long-term dependencies.” A human could carefully pick parameters for them to solve toy problems of this form. Sadly, in practice, RNNs don’t seem to be able to learn them. The problem was explored in depth by Hochreiter (1991) [German] and Bengio, et al. (1994), who found some pretty fundamental reasons why it might be difficult.

Thankfully, LSTMs don’t have this problem!
LSTM Networks
Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997), and were refined and popularized by many people in following work. They work tremendously well on a large variety of problems, and are now widely used.

LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!

All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module will have a very simple structure, such as a single tanh layer.
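For reference, the single-tanh repeating module of a standard RNN can be written, in one common formulation with weights \(W\) and bias \(b\) applied to the concatenation of the previous hidden state and the current input, as:

\[
h_t = \tanh\left(W \cdot [h_{t-1}, x_t] + b\right)
\]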
LSTMs also have this chain like structure, but the repeating module has a different structure. Instead of having a single neural network layer, there are four, interacting in a very special way.
Don’t worry about the details of what’s going on. We’ll walk through the LSTM diagram step by step later. For now, let’s just try to get comfortable with the notation we’ll be using.
In the above diagram, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denotes its content being copied and the copies going to different locations.
The Core Idea Behind LSTMs
The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.

The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions. It’s very easy for information to just flow along it unchanged.
The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.

Gates are a way to optionally let information through. They are composed out of a sigmoid neural net layer and a pointwise multiplication operation.
The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means “let nothing through,” while a value of one means “let everything through!”

An LSTM has three of these gates, to protect and control the cell state.
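In symbols, a gate is just a learned sigmoid layer whose output multiplies, pointwise, whatever vector it is guarding. A generic sketch (the weights \(W_g\) and bias \(b_g\) are illustrative placeholders) is:

\[
g = \sigma\left(W_g \cdot [h_{t-1}, x_t] + b_g\right), \qquad \text{filtered value} = g * v
\]

where \(*\) is pointwise multiplication and \(v\) is the vector being gated.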
Step-by-Step LSTM Walk Through
The first step in our LSTM is to decide what information we’re going to throw away from the cell state. This decision is made by a sigmoid layer called the “forget gate layer.” It looks at \(h_{t-1}\) and \(x_t\), and outputs a number between \(0\) and \(1\) for each number in the cell state \(C_{t-1}\). A \(1\) represents “completely keep this” while a \(0\) represents “completely get rid of this.”
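In the standard formulation this step is usually written as (with \(W_f\) and \(b_f\) the forget gate’s weights and bias):

\[
f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)
\]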
Let’s go back to our example of a language model trying to predict the next word based on all the previous ones. In such a problem, the cell state might include the gender of the present subject, so that the correct pronouns can be used. When we see a new subject, we want to forget the gender of the old subject.
The next step is to decide what new information we’re going to store in the cell state. This has two parts. First, a sigmoid layer called the “input gate layer” decides which values we’ll update. Next, a tanh layer creates a vector of new candidate values, \(\tilde{C}_t\), that could be added to the state. In the next step, we’ll combine these two to create an update to the state.

In the example of our language model, we’d want to add the gender of the new subject to the cell state, to replace the old one we’re forgetting.
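These two parts are commonly written as:

\[
i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right), \qquad
\tilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right)
\]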
It’s now time to update the old cell state, \(C_{t-1}\), into the new cell state \(C_t\). The previous steps already decided what to do, we just need to actually do it.

We multiply the old state by \(f_t\), forgetting the things we decided to forget earlier. Then we add \(i_t*\tilde{C}_t\). This is the new candidate values, scaled by how much we decided to update each state value.
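Putting the two previous steps together, the cell state update is:

\[
C_t = f_t * C_{t-1} + i_t * \tilde{C}_t
\]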
In the case of the language model, this is where we’d actually drop the information about the old subject’s gender and add the new information, as we decided in the previous steps.
Finally, we need to decide what we’re going to output. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we’re going to output. Then, we put the cell state through \(\tanh\) (to push the values to be between \(-1\) and \(1\)) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.

For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in case that’s what is coming next. For example, it might output whether the subject is singular or plural, so that we know what form a verb should be conjugated into if that’s what follows next.
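The output step is usually written as:

\[
o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right), \qquad
h_t = o_t * \tanh(C_t)
\]

Pulling the four steps together, a minimal NumPy sketch of one LSTM step might look like the following; the function and parameter names are illustrative, and each weight matrix acts on the concatenation \([h_{t-1}, x_t]\):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    """One step of a standard LSTM cell, following the walkthrough above."""
    z = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)        # forget gate
    i_t = sigmoid(W_i @ z + b_i)        # input gate
    C_tilde = np.tanh(W_C @ z + b_C)    # candidate values
    C_t = f_t * C_prev + i_t * C_tilde  # new cell state
    o_t = sigmoid(W_o @ z + b_o)        # output gate
    h_t = o_t * np.tanh(C_t)            # new hidden state / output
    return h_t, C_t
```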
Variants on Long Short Term Memory
What I’ve described so far is a pretty normal LSTM. But not all LSTMs are the same as the above. In fact, it seems like almost every paper involving LSTMs uses a slightly different version. The differences are minor, but it’s worth mentioning some of them.

One popular LSTM variant, introduced by Gers & Schmidhuber (2000), is adding “peephole connections.” This means that we let the gate layers look at the cell state.
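One common way of adding peepholes to every gate is to let the forget and input gates look at \(C_{t-1}\) and the output gate at \(C_t\):

\[
f_t = \sigma\left(W_f \cdot [C_{t-1}, h_{t-1}, x_t] + b_f\right), \qquad
i_t = \sigma\left(W_i \cdot [C_{t-1}, h_{t-1}, x_t] + b_i\right), \qquad
o_t = \sigma\left(W_o \cdot [C_t, h_{t-1}, x_t] + b_o\right)
\]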
The above diagram adds peepholes to all the gates, but many papers will give some peepholes and not others.

Another variation is to use coupled forget and input gates. Instead of separately deciding what to forget and what we should add new information to, we make those decisions together. We only forget when we’re going to input something in its place. We only input new values to the state when we forget something older.
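With coupled gates, the cell state update becomes:

\[
C_t = f_t * C_{t-1} + (1 - f_t) * \tilde{C}_t
\]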
A slightly more dramatic variation on the LSTM is the Gated Recurrent Unit, or GRU, introduced by Cho, et al. (2014). It combines the forget and input gates into a single “update gate.” It also merges the cell state and hidden state, and makes some other changes. The resulting model is simpler than standard LSTM models, and has been growing increasingly popular.
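One common way to write the GRU, roughly following the notation used for the LSTM above (biases omitted for brevity), with an update gate \(z_t\) and a reset gate \(r_t\), is:

\[
z_t = \sigma\left(W_z \cdot [h_{t-1}, x_t]\right), \qquad
r_t = \sigma\left(W_r \cdot [h_{t-1}, x_t]\right)
\]
\[
\tilde{h}_t = \tanh\left(W \cdot [r_t * h_{t-1}, x_t]\right), \qquad
h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t
\]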
These are only a few of the most notable LSTM variants. There are lots of others, like Depth Gated RNNs by Yao, et al. (2015). There’s also some completely different approach to tackling long-term dependencies, like Clockwork RNNs by Koutnik, et al. (2014).

Which of these variants is best? Do the differences matter? Greff, et al. (2015) do a nice comparison of popular variants, finding that they’re all about the same. Jozefowicz, et al. (2015) tested more than ten thousand RNN architectures, finding some that worked better than LSTMs on certain tasks.
Conclusion
Earlier, I mentioned the remarkable results people are achieving with RNNs. Essentially all of these are achieved using LSTMs. They really work a lot better for most tasks!

Written down as a set of equations, LSTMs look pretty intimidating. Hopefully, walking through them step by step in this essay has made them a bit more approachable.
LSTMs were a big step in what we can accomplish with RNNs. It’s natural to wonder: is there another big step? A common opinion among researchers is: “Yes! There is a next step and it’s attention!” The idea is to let every step of an RNN pick information to look at from some larger collection of information. For example, if you are using an RNN to create a caption describing an image, it might pick a part of the image to look at for every word it outputs. In fact, Xu, et al. (2015) do exactly this – it might be a fun starting point if you want to explore attention! There’s been a number of really exciting results using attention, and it seems like a lot more are around the corner…
Attention isn’t the only exciting thread in RNN research. For example, Grid LSTMs by Kalchbrenner, et al. (2015) seem extremely promising. Work using RNNs in generative models – such as Gregor, et al. (2015), Chung, et al. (2015), or Bayer & Osendorfer (2015) – also seems very interesting. The last few years have been an exciting time for recurrent neural networks, and the coming ones promise to only be more so!
Acknowledgments
I’m grateful to a number of people for helping me better understand LSTMs, commenting on the visualizations, and providing feedback on this post.
I’m very grateful to my colleagues at Google for their helpful feedback, especially Oriol Vinyals, Greg Corrado, Jon Shlens, Luke Vilnis, and Ilya Sutskever. I’m also thankful to many other friends and colleagues for taking the time to help me, including Dario Amodei, and Jacob Steinhardt. I’m especially thankful to Kyunghyun Cho for extremely thoughtful correspondence about my diagrams.

Before this post, I practiced explaining LSTMs during two seminar series I taught on neural networks. Thanks to everyone who participated in those for their patience with me, and for their feedback.