LSTM: Translation and Commentary on "Long Short-Term Memory"
Contents
Long Short-Term Memory
Abstract
1 INTRODUCTION
2 PREVIOUS WORK
3 CONSTANT ERROR BACKPROP
3.1 EXPONENTIALLY DECAYING ERROR
3.2 CONSTANT ERROR FLOW: NAIVE APPROACH
4 LONG SHORT-TERM MEMORY
5 EXPERIMENTS
5.1 EXPERIMENT 1: EMBEDDED REBER GRAMMAR
Long Short-Term Memory
Original paper
Link 01: https://arxiv.org/pdf/1506.04214.pdf
Link 02: https://www.bioinf.jku.at/publications/older/2604.pdf
Abstract
Learning to store information over extended time intervals via recurrent backpropagation takes a very long time, mostly due to insufficient, decaying error back flow. We briefly review Hochreiter's 1991 analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called "Long Short-Term Memory" (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete time steps by enforcing constant error flow through "constant error carrousels" within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with RTRL, BPTT, Recurrent Cascade-Correlation, Elman nets, and Neural Sequence Chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long time lag tasks that have never been solved by previous recurrent network algorithms.
1 INTRODUCTION
Recurrent networks can in principle use their feedback connections to store representations of recent input events in the form of activations ("short-term memory", as opposed to "long-term memory" embodied by slowly changing weights). This is potentially significant for many applications, including speech processing, non-Markovian control, and music composition (e.g., Mozer 1992). The most widely used algorithms for learning what to put in short-term memory, however, take too much time or do not work well at all, especially when minimal time lags between inputs and corresponding teacher signals are long. Although theoretically fascinating, existing methods do not provide clear practical advantages over, say, backprop in feedforward nets with limited time windows. This paper will review an analysis of the problem and suggest a remedy.
The problem. With conventional "Back-Propagation Through Time" (BPTT, e.g., Williams and Zipser 1992, Werbos 1988) or "Real-Time Recurrent Learning" (RTRL, e.g., Robinson and Fallside 1987), error signals "flowing backwards in time" tend to either (1) blow up or (2) vanish: the temporal evolution of the backpropagated error exponentially depends on the size of the weights (Hochreiter 1991). Case (1) may lead to oscillating weights, while in case (2) learning to bridge long time lags takes a prohibitive amount of time, or does not work at all (see Section 3).
The remedy. This paper presents "Long Short-Term Memory" (LSTM), a novel recurrent network architecture in conjunction with an appropriate gradient-based learning algorithm. LSTM is designed to overcome these error back-flow problems. It can learn to bridge time intervals in excess of 1000 steps even in case of noisy, incompressible input sequences, without loss of short time lag capabilities. This is achieved by an efficient, gradient-based algorithm for an architecture enforcing constant (thus neither exploding nor vanishing) error flow through internal states of special units (provided the gradient computation is truncated at certain architecture-specific points; this does not affect long-term error flow, though).
Outline of paper. Section 2 will briefly review previous work. Section 3 begins with an outline of the detailed analysis of vanishing errors due to Hochreiter (1991). It will then introduce a naive approach to constant error backprop for didactic purposes, and highlight its problems concerning information storage and retrieval. These problems will lead to the LSTM architecture as described in Section 4. Section 5 will present numerous experiments and comparisons with competing methods. LSTM outperforms them, and also learns to solve complex, artificial tasks no other recurrent net algorithm has solved. Section 6 will discuss LSTM's limitations and advantages. The appendix contains a detailed description of the algorithm (A.1), and explicit error flow formulae (A.2).
2 PREVIOUS WORK
This section will focus on recurrent nets with time-varying inputs (as opposed to nets with stationary inputs and fixpoint-based gradient calculations, e.g., Almeida 1987, Pineda 1987).
Gradient-descent variants. The approaches of Elman (1988), Fahlman (1991), Williams (1989), Schmidhuber (1992a), Pearlmutter (1989), and many of the related algorithms in Pearlmutter's comprehensive overview (1995) suffer from the same problems as BPTT and RTRL (see Sections 1 and 3).
Time-delays. Other methods that seem practical for short time lags only are Time-Delay Neural Networks (Lang et al. 1990) and Plate's method (Plate 1993), which updates unit activations based on a weighted sum of old activations (see also de Vries and Principe 1991). Lin et al. (1995) propose variants of time-delay networks called NARX networks.
Time constants. To deal with long time lags, Mozer (1992) uses time constants influencing changes of unit activations (de Vries and Principe's above-mentioned approach (1991) may in fact be viewed as a mixture of TDNN and time constants). For long time lags, however, the time constants need external fine tuning (Mozer 1992). Sun et al.'s alternative approach (1993) updates the activation of a recurrent unit by adding the old activation and the (scaled) current net input. The net input, however, tends to perturb the stored information, which makes long-term storage impractical.
Ring's approach. Ring (1993) also proposed a method for bridging long time lags. Whenever a unit in his network receives conflicting error signals, he adds a higher order unit influencing appropriate connections. Although his approach can sometimes be extremely fast, to bridge a time lag involving 100 steps may require the addition of 100 units. Also, Ring's net does not generalize to unseen lag durations.
Bengio et al.'s approaches. Bengio et al. (1994) investigate methods such as simulated annealing, multi-grid random search, time-weighted pseudo-Newton optimization, and discrete error propagation. Their "latch" and "2-sequence" problems are very similar to problem 3a with minimal time lag 100 (see Experiment 3). Bengio and Frasconi (1994) also propose an EM approach for propagating targets. With n so-called "state networks", at a given time, their system can be in one of only n different states. See also the beginning of Section 5. But to solve continuous problems such as the "adding problem" (Section 5.4), their system would require an unacceptable number of states (i.e., state networks).
Kalman filters. Puskorius and Feldkamp (1994) use Kalman filter techniques to improve recurrent net performance. Since they use "a derivative discount factor imposed to decay exponentially the effects of past dynamic derivatives," there is no reason to believe that their Kalman Filter Trained Recurrent Networks will be useful for very long minimal time lags.

Second order nets. We will see that LSTM uses multiplicative units (MUs) to protect error flow from unwanted perturbations. It is not the first recurrent net method using MUs, though; for instance, Watrous and Kuhn (1992) use MUs in second order nets, although their architecture and per-step update complexity differ from LSTM's in several respects.
Simple weight guessing. To avoid long time lag problems of gradient-based approaches we may simply randomly initialize all network weights until the resulting net happens to classify all training sequences correctly. In fact, recently we discovered (Schmidhuber and Hochreiter 1996, Hochreiter and Schmidhuber 1996, 1997) that simple weight guessing solves many of the problems in (Bengio 1994, Bengio and Frasconi 1994, Miller and Giles 1993, Lin et al. 1995) faster than the algorithms proposed therein. This does not mean that weight guessing is a good algorithm. It just means that the problems are very simple. More realistic tasks require either many free parameters (e.g., input weights) or high weight precision (e.g., for continuous-valued parameters), such that guessing becomes completely infeasible.
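To make this baseline concrete, here is a minimal sketch of the weight-guessing procedure (my own illustration; the `classifies_all_correctly` predicate is a hypothetical stand-in for whatever evaluation a given task defines):

```python
import numpy as np

def guess_weights(num_weights, training_data, classifies_all_correctly,
                  weight_range=0.1, max_trials=100_000, seed=0):
    """Re-draw all weights at random until the net classifies every
    training sequence correctly (the 'simple weight guessing' baseline)."""
    rng = np.random.default_rng(seed)
    for trial in range(1, max_trials + 1):
        weights = rng.uniform(-weight_range, weight_range, size=num_weights)
        if classifies_all_correctly(weights, training_data):
            return weights, trial   # the trial count measures how 'easy' the task is
    return None, max_trials         # guessing failed within the budget
```

The point of the paragraph above is precisely that a task solvable by this loop in a few thousand trials is too easy to validate a long time lag algorithm.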
Adaptive sequence chunkers. Schmidhuber's hierarchical chunker systems (1992b, 1993) do have a capability to bridge arbitrary time lags, but only if there is local predictability across the subsequences causing the time lags (see also Mozer 1992). For instance, in his postdoctoral thesis (1993), Schmidhuber uses hierarchical recurrent nets to rapidly solve certain grammar learning tasks involving minimal time lags in excess of 1000 steps. The performance of chunker systems, however, deteriorates as the noise level increases and the input sequences become less compressible. LSTM does not suffer from this problem.
3 CONSTANT ERROR BACKPROP
3.1 EXPONENTIALLY DECAYING ERROR
Conventional BPTT (e.g., Williams and Zipser 1992). Output unit k's target at time t is denoted by $d_k(t)$. Using mean squared error, k's error signal is

$$\vartheta_k(t) = f'_k(net_k(t))\,(d_k(t) - y^k(t)),$$

where $y^i(t) = f_i(net_i(t))$ is the activation of a non-input unit i with differentiable activation function $f_i$, $net_i(t) = \sum_j w_{ij}\, y^j(t-1)$ is unit i's current net input, and $w_{ij}$ is the weight on the connection from unit j to unit i. Some non-output unit j's backpropagated error signal is

$$\vartheta_j(t) = f'_j(net_j(t)) \sum_i w_{ij}\, \vartheta_i(t+1).$$
The corresponding contribution to $w_{jl}$'s total weight update is $\alpha\, \vartheta_j(t)\, y^l(t-1)$, where $\alpha$ is the learning rate, and l stands for an arbitrary unit connected to unit j.

Outline of Hochreiter's analysis (1991, pages 19-21). Suppose we have a fully connected net whose non-input unit indices range from 1 to n. Let us focus on local error flow from unit u to unit v (later we will see that the analysis immediately extends to global error flow). The error occurring at an arbitrary unit u at time step t is propagated "back into time" for q time steps, to an arbitrary unit v. This will scale the error by the following factor:
$$\frac{\partial \vartheta_v(t-q)}{\partial \vartheta_u(t)} =
\begin{cases}
f'_v(net_v(t-1))\, w_{uv} & q = 1,\\[4pt]
f'_v(net_v(t-q)) \sum_{l=1}^{n} \dfrac{\partial \vartheta_l(t-q+1)}{\partial \vartheta_u(t)}\, w_{lv} & q > 1.
\end{cases}$$

Unrolling this recursion (with $l_q = v$ and $l_0 = u$) yields a sum over paths of products $\prod_{m=1}^{q} f'_{l_m}(net_{l_m}(t-m))\, w_{l_m l_{m-1}}$. If each factor $|f'_{l_m}(net_{l_m}(t-m))\, w_{l_m l_{m-1}}|$ is smaller than 1.0, the largest product decreases exponentially with q and the error vanishes; if the factors exceed 1.0, the error blows up.
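As a purely numerical illustration of this analysis (not code from the paper), the sketch below multiplies the per-step factors $f'(net)\,w$ for a single unit with logistic sigmoid activation. Since $f'$ is at most 0.25, the factor stays below 1 in magnitude whenever $|w| < 4$ and the backpropagated error decays exponentially; for larger weights and near-zero net inputs it blows up instead:

```python
import numpy as np

def logistic_prime(net):
    y = 1.0 / (1.0 + np.exp(-net))
    return y * (1.0 - y)           # maximal value 0.25, attained at net = 0

def backprop_scaling(w, q, net_value=0.0):
    """Product of the per-step factors f'(net) * w over q steps back in time
    (single unit, net input held fixed for simplicity)."""
    factor = 1.0
    for _ in range(q):
        factor *= logistic_prime(net_value) * w
    return factor

if __name__ == "__main__":
    for w in (1.0, 3.9, 5.0):
        print(f"w = {w}: scaling after 100 steps = {backprop_scaling(w, 100):.3e}")
    # w = 1.0 and w = 3.9 give per-step factors below 1 (the error decays);
    # w = 5.0 gives 1.25 per step (the error blows up).
```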
3.2 CONSTANT ERROR FLOW: NAIVE APPROACH
A single unit. To avoid vanishing error signals, how can we achieve constant error flow through a single unit j with a single connection to itself? According to the rules above, at time t, j's local error back flow is $\vartheta_j(t) = f'_j(net_j(t))\, \vartheta_j(t+1)\, w_{jj}$. To enforce constant error flow through j, we require $f'_j(net_j(t))\, w_{jj} = 1.0$.
In the experiments, this will be ensured by using the identity function $f_j$: $f_j(x) = x,\ \forall x$, and by setting $w_{jj} = 1.0$. We refer to this as the constant error carrousel (CEC). CEC will be LSTM's central feature (see Section 4). Of course unit j will not only be connected to itself but also to other units. This invokes two obvious, related problems (also inherent in all other gradient-based approaches):
1. Input weight conflict: a single incoming weight $w_{ji}$ has to be used both for storing certain inputs and for ignoring others, so it will often receive conflicting weight update signals during training, which makes learning difficult.
2. Output weight conflict: likewise, a single outgoing weight $w_{kj}$ has to be used both for retrieving j's content at certain times and for preventing j from disturbing unit k at other times, again producing conflicting update signals.
Of course, input and output weight conflicts are not specific to long time lags, but occur for short time lags as well. Their effects, however, become particularly pronounced in the long time lag case: as the time lag increases, (1) stored information must be protected against perturbation for longer and longer periods, and, especially in advanced stages of learning, (2) more and more already correct outputs also require protection against perturbation. Due to the problems above, the naive approach does not work well except in case of certain simple problems involving local input/output representations and non-repeating input patterns (see Hochreiter 1991 and Silva et al. 1996). The next section shows how to do it right.
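A minimal sketch of the naive constant error carrousel itself (my own illustration): a self-connected linear unit with $f_j(x) = x$ and $w_{jj} = 1.0$ preserves both its activation and a backpropagated error signal over arbitrarily many steps, whereas a logistic unit with $|f'\,w| < 1$ loses them:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def run_cec(steps=1000, state=1.0, error=1.0):
    """Self-connected linear unit with w_jj = 1.0: the constant error carrousel."""
    w_jj = 1.0
    for _ in range(steps):
        state = w_jj * state            # f_j is the identity, so the state is preserved
        error = 1.0 * w_jj * error      # f_j' == 1, so the error signal is preserved too
    return state, error

def run_sigmoid_unit(steps=1000, state=1.0, error=1.0, w_jj=0.9):
    """The same loop with a logistic unit: the error signal decays towards zero."""
    for _ in range(steps):
        net = w_jj * state
        state = sigmoid(net)
        error = sigmoid(net) * (1.0 - sigmoid(net)) * w_jj * error   # |f'(net) * w_jj| < 1
    return state, error

print("CEC unit     :", run_cec())           # state and error both stay at 1.0
print("sigmoid unit :", run_sigmoid_unit())  # error has shrunk to (almost) 0.0
```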
4 LONG SHORT-TERM MEMORY
Memory cells and gate units. To construct an architecture that allows for constant error flow through special, self-connected units without the disadvantages of the naive approach, we extend the constant error carrousel CEC embodied by the self-connected, linear unit j from Section 3.2 by introducing additional features. A multiplicative input gate unit is introduced to protect the memory contents stored in j from perturbation by irrelevant inputs. Likewise, a multiplicative output gate unit is introduced which protects other units from perturbation by currently irrelevant memory contents stored in j.
Figure 1: Architecture of memory cell $c_j$ (the box) and its gate units $in_j$, $out_j$. The self-recurrent connection (with weight 1.0) indicates feedback with a delay of 1 time step. It builds the basis of the "constant error carrousel" CEC. The gate units open and close access to CEC. See text and appendix A.1 for details.
Why gate units? To avoid input weight conflicts, $in_j$ controls the error flow to memory cell $c_j$'s input connections $w_{c_j i}$. To circumvent $c_j$'s output weight conflicts, $out_j$ controls the error flow from unit j's output connections. In other words, the net can use $in_j$ to decide when to keep or override information in memory cell $c_j$, and $out_j$ to decide when to access memory cell $c_j$ and when to prevent other units from being perturbed by $c_j$ (see Figure 1).
Error signals trapped within a memory cell's CEC cannot change, but different error signals flowing into the cell (at different times) via its output gate may get superimposed. The output gate will have to learn which errors to trap in its CEC, by appropriately scaling them. The input gate will have to learn when to release errors, again by appropriately scaling them. Essentially, the multiplicative gate units open and close access to constant error flow through CEC.
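The forward dynamics described so far can be written down in a few lines. The sketch below is my own reconstruction of a single 1997-style memory cell (no forget gate), with input gate $in_j$, output gate $out_j$, cell input squashing function g, cell output squashing function h, and internal state $s_{c_j}$; the weight shapes and the use of tanh in place of the paper's scaled logistic squashing functions are assumptions made for brevity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def memory_cell_forward(x_seq, W_in, W_out, W_c):
    """Forward pass of one LSTM-1997 memory cell.

    x_seq            : (T, D) input vectors (assumed to already contain any
                       recurrent and bias components, for simplicity).
    W_in, W_out, W_c : (D,) weight vectors of input gate, output gate, cell input.
    """
    g = np.tanh                     # cell input squashing (stand-in for the paper's g)
    h = np.tanh                     # cell output squashing (stand-in for the paper's h)
    s_c = 0.0                       # internal state: the constant error carrousel
    outputs = []
    for x in x_seq:
        y_in  = sigmoid(W_in  @ x)  # input gate activation  y^{in_j}
        y_out = sigmoid(W_out @ x)  # output gate activation y^{out_j}
        net_c = W_c @ x             # memory cell net input  net_{c_j}
        s_c   = s_c + y_in * g(net_c)   # self-connection with fixed weight 1.0 (the CEC)
        outputs.append(y_out * h(s_c))  # cell output, gated by the output gate
    return np.array(outputs), s_c
```

When the input gate is driven towards zero, $s_{c_j}$ stays constant no matter what arrives at $net_{c_j}$; when the output gate is driven towards zero, downstream units are shielded from $c_j$. These are exactly the two protections motivated above.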
Distributed output representations typically do require output gates. Not always are both gate types necessary, though; one may be sufficient. For instance, in Experiments 2a and 2b in Section 5, it will be possible to use input gates only. In fact, output gates are not required in case of local output encoding; preventing memory cells from perturbing already learned outputs can be done by simply setting the corresponding weights to zero. Even in this case, however, output gates can be beneficial: they prevent the net's attempts at storing long time lag memories (which are usually hard to learn) from perturbing activations representing easily learnable short time lag memories. (This will prove quite useful in Experiment 1, for instance.)
Network topology. We use networks with one input layer, one hidden layer, and one output layer. The (fully) self-connected hidden layer contains memory cells and corresponding gate units (for convenience, we refer to both memory cells and gate units as being located in the hidden layer). The hidden layer may also contain "conventional" hidden units providing inputs to gate units and memory cells. All units (except for gate units) in all layers have directed connections (serve as inputs) to all units in the layer above (or to all higher layers; Experiments 2a and 2b).

Memory cell blocks. S memory cells sharing the same input gate and the same output gate form a structure called a "memory cell block of size S". Memory cell blocks facilitate information storage; as with conventional neural nets, it is not so easy to code a distributed input within a single cell. Since each memory cell block has as many gate units as a single memory cell (namely two), the block architecture can be even slightly more efficient (see paragraph "computational complexity"). A memory cell block of size 1 is just a simple memory cell. In the experiments (Section 5), we will use memory cell blocks of various sizes.
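A sketch of a "memory cell block of size S" under the same illustrative assumptions as the previous snippet: S internal states share one input gate and one output gate, so the number of gate units does not grow with S:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class MemoryCellBlock:
    """S memory cells sharing a single input gate and a single output gate."""

    def __init__(self, input_dim, block_size, rng=None):
        rng = rng or np.random.default_rng(0)
        self.W_in  = rng.uniform(-0.1, 0.1, input_dim)                # shared input gate weights
        self.W_out = rng.uniform(-0.1, 0.1, input_dim)                # shared output gate weights
        self.W_c   = rng.uniform(-0.1, 0.1, (block_size, input_dim))  # one weight row per cell
        self.s     = np.zeros(block_size)                             # S internal states (CECs)

    def step(self, x):
        y_in  = sigmoid(self.W_in  @ x)
        y_out = sigmoid(self.W_out @ x)
        self.s = self.s + y_in * np.tanh(self.W_c @ x)   # all S CECs gated by the same input gate
        return y_out * np.tanh(self.s)                   # S cell outputs behind the shared output gate
```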
Learning. We use a variant of RTRL (e.g., Robinson and Fallside 1987) which properly takes into account the altered, multiplicative dynamics caused by input and output gates. However, to ensure non-decaying error backprop through the internal states of memory cells, as with truncated BPTT (e.g., Williams and Peng 1990), errors arriving at "memory cell net inputs" (for cell $c_j$, this includes $net_{c_j}$, $net_{in_j}$, $net_{out_j}$) do not get propagated back further in time (although they do serve to change the incoming weights). Only within memory cells are errors propagated back through previous internal states $s_{c_j}$. To visualize this: once an error signal arrives at a memory cell output, it gets scaled by the output gate activation and $h'$. Then it is within the memory cell's CEC, where it can flow back indefinitely without ever being scaled. Only when it leaves the memory cell through the input gate and g is it scaled once more, by the input gate activation and $g'$. It then serves to change the incoming weights before it is truncated (see the appendix for explicit formulae).

Computational complexity. As with Mozer's focused recurrent backprop algorithm (Mozer 1989), only the derivatives $\partial s_{c_j} / \partial w_{il}$ need to be stored and updated. Hence the LSTM algorithm is very efficient, with an excellent update complexity of O(W), where W is the number of weights (see details in Appendix A.1). Hence, LSTM and BPTT for fully recurrent nets have the same update complexity per time step (while RTRL's is much worse). Unlike full BPTT, however, LSTM is local in space and time: there is no need to store activation values observed during sequence processing in a stack with potentially unlimited size.
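To illustrate why only O(W) quantities need to be stored: because the CEC's self-connection is fixed at 1.0, the truncated partial derivative of the internal state with respect to each incoming weight can be accumulated online by simple addition. The sketch below is my own simplification (it tracks only the cell-input weights and ignores the input gate's own partials, which the full algorithm in Appendix A.1 also maintains):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh_prime(x):
    return 1.0 - np.tanh(x) ** 2

def accumulate_state_partials(x_seq, W_in, W_c):
    """Online accumulation of d s_c / d W_c under LSTM's gradient truncation.

    Storage is one number per weight in W_c, i.e. O(W); nothing has to be
    pushed onto a stack as the sequence length grows (unlike full BPTT).
    """
    s_c = 0.0
    ds_dWc = np.zeros_like(W_c)
    for x in x_seq:
        y_in  = sigmoid(W_in @ x)
        net_c = W_c @ x
        s_c   = s_c + y_in * np.tanh(net_c)
        # previous partial is carried through the CEC unchanged (factor 1.0),
        # plus the immediate contribution of the current step's net input
        ds_dWc = ds_dWc + y_in * tanh_prime(net_c) * x
    return s_c, ds_dWc
```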
Abuse problem and solutions. In the beginning of the learning phase, error reduction may be possible without storing information over time. The network will thus tend to abuse memory cells, e.g., as bias cells (i.e., it might make their activations constant and use the outgoing connections as adaptive thresholds for other units). The potential difficulty is: it may take a long time to release abused memory cells and make them available for further learning. A similar "abuse problem" appears if two memory cells store the same (redundant) information. There are at least two solutions to the abuse problem: (1) Sequential network construction (e.g., Fahlman 1991): a memory cell and the corresponding gate units are added to the network whenever the error stops decreasing (see Experiment 2 in Section 5). (2) Output gate bias: each output gate gets a negative initial bias, to push initial memory cell activations towards zero. Memory cells with more negative bias automatically get "allocated" later (see Experiments 1, 3, 4, 5, 6 in Section 5).
Internal state drift and remedies. If memory cell $c_j$'s inputs are mostly positive or mostly negative, then its internal state $s_j$ will tend to drift away over time. This is potentially dangerous, for $h'(s_j)$ will then adopt very small values, and the gradient will vanish. One way to circumvent this problem is to choose an appropriate function h. But $h(x) = x$, for instance, has the disadvantage of unrestricted memory cell output range. Our simple but effective way of solving drift problems at the beginning of learning is to initially bias the input gate $in_j$ towards zero. Although there is a tradeoff between the magnitudes of $h'(s_j)$ on the one hand and of $y^{in_j}$ and $f'_{in_j}$ on the other, the potential negative effect of input gate bias is negligible compared to the one of the drifting effect. With logistic sigmoid activation functions, there appears to be no need for fine-tuning the initial bias, as confirmed by Experiments 4 and 5 in Section 5.4.
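Both remedies boil down to how the gate biases are initialized. A minimal sketch (the specific numbers are my assumptions; the paper lists its per-experiment choices in Section 5.7): output gate biases start negative and increasingly so from cell to cell, so that cells are "allocated" one after another, while input gate biases start at or below zero so the internal states do not drift at the start of learning:

```python
import numpy as np

def init_gate_biases(num_cells, out_start=-1.0, out_step=-1.0, in_bias=-1.0):
    """Staggered negative output gate biases plus negative input gate biases."""
    out_biases = out_start + out_step * np.arange(num_cells)   # e.g. -1, -2, -3, ...
    in_biases  = np.full(num_cells, in_bias)                   # keep input gates nearly closed at first
    return in_biases, out_biases

in_b, out_b = init_gate_biases(4)
print("input gate biases :", in_b)    # [-1. -1. -1. -1.]
print("output gate biases:", out_b)   # [-1. -2. -3. -4.]
```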
5 EXPERIMENTS
Introduction. Which tasks are appropriate to demonstrate the quality of a novel long time lag algorithm? First of all, minimal time lags between relevant input signals and corresponding teacher signals must be long for all training sequences. In fact, many previous recurrent net algorithms sometimes manage to generalize from very short training sequences to very long test sequences. See, e.g., Pollack (1991). But a real long time lag problem does not have any short time lag exemplars in the training set. For instance, Elman's training procedure, BPTT, offline RTRL, online RTRL, etc., fail miserably on real long time lag problems. See, e.g., Hochreiter (1991) and Mozer (1992). A second important requirement is that the tasks should be complex enough such that they cannot be solved quickly by simple-minded strategies such as random weight guessing.
Guessing can outperform many long time lag algorithms. Recently we discovered (Schmidhuber and Hochreiter 1996, Hochreiter and Schmidhuber 1996, 1997) that many long time lag tasks used in previous work can be solved more quickly by simple random weight guessing than by the proposed algorithms. For instance, guessing solved a variant of Bengio and Frasconi's "parity problem" (1994) much faster than the seven methods tested by Bengio et al. (1994) and Bengio and Frasconi (1994). Similarly for some of Miller and Giles' problems (1993). Of course, this does not mean that guessing is a good algorithm. It just means that some previously used problems are not extremely appropriate to demonstrate the quality of previously proposed algorithms.
What's common to Experiments 1-6. All our experiments (except for Experiment 1) involve long minimal time lags; there are no short time lag training exemplars facilitating learning. Solutions to most of our tasks are sparse in weight space. They require either many parameters/inputs or high weight precision, such that random weight guessing becomes infeasible.
We always use on-line learning (as opposed to batch learning), and logistic sigmoids as activation functions. For Experiments 1 and 2, initial weights are chosen in the range [-0.2, 0.2], for the other experiments in [-0.1, 0.1]. Training sequences are generated randomly according to the various task descriptions. In slight deviation from the notation in Appendix A.1, each discrete time step of each input sequence involves three processing steps.
For comparisons with recurrent nets taught by gradient descent, we give results only for RTRL, except for comparison 2a, which also includes BPTT. Note, however, that untruncated BPTT (see, e.g., Williams and Peng 1990) computes exactly the same gradient as offline RTRL. With long time lag problems, offline RTRL (or BPTT) and the online version of RTRL (no activation resets, online weight changes) lead to almost identical, negative results (as confirmed by additional simulations in Hochreiter 1991; see also Mozer 1992). This is because offline RTRL, online RTRL, and full BPTT all suffer badly from exponential error decay.
Our LSTM architectures are selected quite arbitrarily. If nothing is known about the complexity of a given problem, a more systematic approach would be: start with a very small net consisting of one memory cell. If this does not work, try two cells, etc. Alternatively, use sequential network construction (e.g., Fahlman 1991).
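A small sketch of the shared experimental setup described above (the weight ranges are the ones just quoted; the learning rate value is an illustrative assumption): uniform weight initialization, and on-line updates applied after every time step rather than accumulated over a batch:

```python
import numpy as np

def init_weights(shape, experiment_id, rng=None):
    """Uniform initial weights: [-0.2, 0.2] for Experiments 1 and 2, [-0.1, 0.1] otherwise."""
    rng = rng or np.random.default_rng(0)
    r = 0.2 if experiment_id in (1, 2) else 0.1
    return rng.uniform(-r, r, size=shape)

def online_update(weights, gradient, learning_rate=0.1):
    """On-line learning: the weight change is applied immediately, not batched."""
    return weights - learning_rate * gradient
```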
Outline of experiments
Subsection 5.7 will provide a detailed summary of experimental conditions in two tables for reference.
5.1 EXPERIMENT 1: EMBEDDED REBER GRAMMAR
Task. Our first task is to learn the "embedded Reber grammar", e.g., Smith and Zipser (1989), Cleeremans et al. (1989), and Fahlman (1991). Since it allows for training sequences with short time lags (of as few as 9 steps), it is not a long time lag problem. We include it for two reasons:
1. It is a popular recurrent net benchmark used by many authors; we wanted to have at least one experiment where RTRL and BPTT do not fail completely.
2. It shows nicely how output gates can be beneficial.
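For reference, below is a sketch of a generator for embedded Reber strings (my own implementation of the grammar as it is usually drawn in the literature; the paper gives the grammar only as a figure, so the transition table here is a reconstruction). An embedded string is B, then T or P, then an ordinary Reber string, then the same T or P again, then E; predicting the second-to-last symbol therefore requires remembering the second symbol across the whole embedded part:

```python
import random

# Ordinary Reber grammar: per state, the possible (next_state, emitted_symbol) pairs.
REBER = {
    0: [(1, "T"), (2, "P")],
    1: [(1, "S"), (3, "X")],
    2: [(2, "T"), (4, "V")],
    3: [(2, "X"), (5, "S")],
    4: [(3, "P"), (5, "V")],
}

def reber_string(rng=random):
    symbols, state = ["B"], 0
    while state != 5:                       # state 5 is the accepting state
        state, symbol = rng.choice(REBER[state])
        symbols.append(symbol)
    symbols.append("E")
    return "".join(symbols)

def embedded_reber_string(rng=random):
    wrapper = rng.choice(["T", "P"])        # must be remembered until the end
    return "B" + wrapper + reber_string(rng) + wrapper + "E"

if __name__ == "__main__":
    random.seed(0)
    for _ in range(3):
        print(embedded_reber_string())      # prints three sample embedded Reber strings
```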