LSTM: Translation and Commentary on "Long Short-Term Memory"
Contents
Long Short-Term Memory
Abstract
1 INTRODUCTION
2 PREVIOUS WORK
3 CONSTANT ERROR BACKPROP
3.1 EXPONENTIALLY DECAYING ERROR
3.2 CONSTANT ERROR FLOW: NAIVE APPROACH
4 LONG SHORT-TERM MEMORY
5 EXPERIMENTS
5.1 EXPERIMENT 1: EMBEDDED REBER GRAMMAR
Long Short-Term Memory
Original paper
Link 01: https://arxiv.org/pdf/1506.04214.pdf
Link 02: https://www.bioinf.jku.at/publications/older/2604.pdf
Abstract
Learning to store information over extended time intervals via recurrent backpropagation takes a very long time, mostly due to insufficient, decaying error back flow. We briefly review Hochreiter's 1991 analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called "Long Short-Term Memory" (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete time steps by enforcing constant error flow through "constant error carrousels" within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with RTRL, BPTT, Recurrent Cascade-Correlation, Elman nets, and Neural Sequence Chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long time lag tasks that have never been solved by previous recurrent network algorithms.
1 INTRODUCTION
Recurrent networks can in principle use their feedback connections to store representations of recent input events in the form of activations ("short-term memory", as opposed to "long-term memory" embodied by slowly changing weights). This is potentially significant for many applications, including speech processing, non-Markovian control, and music composition (e.g., Mozer 1992). The most widely used algorithms for learning what to put in short-term memory, however, take too much time or do not work well at all, especially when minimal time lags between inputs and corresponding teacher signals are long. Although theoretically fascinating, existing methods do not provide clear practical advantages over, say, backprop in feedforward nets with limited time windows. This paper will review an analysis of the problem and suggest a remedy.
The problem. With conventional "Back-Propagation Through Time" (BPTT, e.g., Williams and Zipser 1992, Werbos 1988) or "Real-Time Recurrent Learning" (RTRL, e.g., Robinson and Fallside 1987), error signals "flowing backwards in time" tend to either (1) blow up or (2) vanish: the temporal evolution of the backpropagated error exponentially depends on the size of the weights (Hochreiter 1991). Case (1) may lead to oscillating weights, while in case (2) learning to bridge long time lags takes a prohibitive amount of time, or does not work at all (see Section 3).
The remedy. This paper presents "Long Short-Term Memory" (LSTM), a novel recurrent network architecture in conjunction with an appropriate gradient-based learning algorithm. LSTM is designed to overcome these error back-flow problems. It can learn to bridge time intervals in excess of 1000 steps even in case of noisy, incompressible input sequences, without loss of short time lag capabilities. This is achieved by an efficient, gradient-based algorithm for an architecture enforcing constant (thus neither exploding nor vanishing) error flow through internal states of special units (provided the gradient computation is truncated at certain architecture-specific points; this does not affect long-term error flow, though).
Outline of paper. Section 2 will briefly review previous work. Section 3 begins with an outline of the detailed analysis of vanishing errors due to Hochreiter (1991). It will then introduce a naive approach to constant error backprop for didactic purposes, and highlight its problems concerning information storage and retrieval. These problems will lead to the LSTM architecture as described in Section 4. Section 5 will present numerous experiments and comparisons with competing methods. LSTM outperforms them, and also learns to solve complex, artificial tasks no other recurrent net algorithm has solved. Section 6 will discuss LSTM's limitations and advantages. The appendix contains a detailed description of the algorithm (A.1), and explicit error flow formulae (A.2).
2 PREVIOUS WORK
This section will focus on recurrent nets with time-varying inputs (as opposed to nets with stationary inputs and fixpoint-based gradient calculations, e.g., Almeida 1987, Pineda 1987).
Gradient-descent variants. The approaches of Elman (1988), Fahlman (1991), Williams (1989), Schmidhuber (1992a), Pearlmutter (1989), and many of the related algorithms in Pearlmutter's comprehensive overview (1995) suffer from the same problems as BPTT and RTRL (see Sections 1 and 3).
Time-delays. Other methods that seem practical for short time lags only are Time-Delay Neural Networks (Lang et al. 1990) and Plate's method (Plate 1993), which updates unit activations based on a weighted sum of old activations (see also de Vries and Principe 1991). Lin et al. (1995) propose variants of time-delay networks called NARX networks.
Time constants. To deal with long time lags, Mozer (1992) uses time constants influencing changes of unit activations (de Vries and Principe's above-mentioned approach (1991) may in fact be viewed as a mixture of TDNN and time constants). For long time lags, however, the time constants need external fine tuning (Mozer 1992). Sun et al.'s alternative approach (1993) updates the activation of a recurrent unit by adding the old activation and the (scaled) current net input. The net input, however, tends to perturb the stored information, which makes long-term storage impractical.
Ring's approach. Ring (1993) also proposed a method for bridging long time lags. Whenever a unit in his network receives conflicting error signals, he adds a higher order unit influencing appropriate connections. Although his approach can sometimes be extremely fast, to bridge a time lag involving 100 steps may require the addition of 100 units. Also, Ring's net does not generalize to unseen lag durations.
Bengio et al.'s approaches. Bengio et al. (1994) investigate methods such as simulated annealing, multi-grid random search, time-weighted pseudo-Newton optimization, and discrete error propagation. Their "latch" and "2-sequence" problems are very similar to problem 3a with minimal time lag 100 (see Experiment 3). Bengio and Frasconi (1994) also propose an EM approach for propagating targets. With n so-called "state networks", at a given time, their system can be in one of only n different states. See also the beginning of Section 5. But to solve continuous problems such as the "adding problem" (Section 5.4), their system would require an unacceptable number of states (i.e., state networks).
Kalman filters. Puskorius and Feldkamp (1994) use Kalman filter techniques to improve recurrent net performance. Since they use "a derivative discount factor imposed to decay exponentially the effects of past dynamic derivatives," there is no reason to believe that their Kalman Filter Trained Recurrent Networks will be useful for very long minimal time lags.

Second order nets. We will see that LSTM uses multiplicative units (MUs) to protect error flow from unwanted perturbations. It is not the first recurrent net method using MUs, though; for instance, Watrous and Kuhn (1992) use MUs in second order nets, although their architecture and per-step update complexity differ from LSTM's in several respects.
Simple weight guessing. To avoid long time lag problems of gradient-based approaches we may simply randomly initialize all network weights until the resulting net happens to classify all training sequences correctly. In fact, recently we discovered (Schmidhuber and Hochreiter 1996, Hochreiter and Schmidhuber 1996, 1997) that simple weight guessing solves many of the problems in (Bengio 1994, Bengio and Frasconi 1994, Miller and Giles 1993, Lin et al. 1995) faster than the algorithms proposed therein. This does not mean that weight guessing is a good algorithm. It just means that the problems are very simple. More realistic tasks require either many free parameters (e.g., input weights) or high weight precision (e.g., for continuous-valued parameters), such that guessing becomes completely infeasible.
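To make this baseline concrete, here is a minimal sketch of the weight-guessing procedure (my own illustration; the `classifies_all_correctly` predicate is a hypothetical stand-in for whatever evaluation a given task defines):

```python
import numpy as np

def guess_weights(num_weights, training_data, classifies_all_correctly,
                  weight_range=0.1, max_trials=100_000, seed=0):
    """Re-draw all weights at random until the net classifies every
    training sequence correctly (the 'simple weight guessing' baseline)."""
    rng = np.random.default_rng(seed)
    for trial in range(1, max_trials + 1):
        weights = rng.uniform(-weight_range, weight_range, size=num_weights)
        if classifies_all_correctly(weights, training_data):
            return weights, trial   # the trial count measures how 'easy' the task is
    return None, max_trials         # guessing failed within the budget
```

The point of the paragraph above is precisely that a task solvable by this loop in a few thousand trials is too easy to validate a long time lag algorithm.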
Adaptive sequence chunkers. Schmidhuber's hierarchical chunker systems (1992b, 1993) do have a capability to bridge arbitrary time lags, but only if there is local predictability across the subsequences causing the time lags (see also Mozer 1992). For instance, in his postdoctoral thesis (1993), Schmidhuber uses hierarchical recurrent nets to rapidly solve certain grammar learning tasks involving minimal time lags in excess of 1000 steps. The performance of chunker systems, however, deteriorates as the noise level increases and the input sequences become less compressible. LSTM does not suffer from this problem.
3 CONSTANT ERROR BACKPROP
3.1 EXPONENTIALLY DECAYING ERROR
Conventional BPTT (e.g., Williams and Zipser 1992). Output unit k's target at time t is denoted by $d_k(t)$. Using mean squared error, k's error signal is

$$\vartheta_k(t) = f'_k(net_k(t))\,(d_k(t) - y^k(t)),$$

where $y^i(t) = f_i(net_i(t))$ is the activation of a non-input unit i with differentiable activation function $f_i$, $net_i(t) = \sum_j w_{ij}\, y^j(t-1)$ is unit i's current net input, and $w_{ij}$ is the weight on the connection from unit j to unit i. Some non-output unit j's backpropagated error signal is

$$\vartheta_j(t) = f'_j(net_j(t)) \sum_i w_{ij}\, \vartheta_i(t+1).$$
The corresponding contribution to $w_{jl}$'s total weight update is $\alpha\, \vartheta_j(t)\, y^l(t-1)$, where $\alpha$ is the learning rate, and l stands for an arbitrary unit connected to unit j.

Outline of Hochreiter's analysis (1991, pages 19-21). Suppose we have a fully connected net whose non-input unit indices range from 1 to n. Let us focus on local error flow from unit u to unit v (later we will see that the analysis immediately extends to global error flow). The error occurring at an arbitrary unit u at time step t is propagated "back into time" for q time steps, to an arbitrary unit v. This will scale the error by the following factor:
$$\frac{\partial \vartheta_v(t-q)}{\partial \vartheta_u(t)} =
\begin{cases}
f'_v(net_v(t-1))\, w_{uv} & q = 1,\\[4pt]
f'_v(net_v(t-q)) \sum_{l=1}^{n} \dfrac{\partial \vartheta_l(t-q+1)}{\partial \vartheta_u(t)}\, w_{lv} & q > 1.
\end{cases}$$

Unrolling this recursion (with $l_q = v$ and $l_0 = u$) yields a sum over paths of products $\prod_{m=1}^{q} f'_{l_m}(net_{l_m}(t-m))\, w_{l_m l_{m-1}}$. If each factor $|f'_{l_m}(net_{l_m}(t-m))\, w_{l_m l_{m-1}}|$ is smaller than 1.0, the largest product decreases exponentially with q and the error vanishes; if the factors exceed 1.0, the error blows up.
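As a purely numerical illustration of this analysis (not code from the paper), the sketch below multiplies the per-step factors $f'(net)\,w$ for a single unit with logistic sigmoid activation. Since $f'$ is at most 0.25, the factor stays below 1 in magnitude whenever $|w| < 4$ and the backpropagated error decays exponentially; for larger weights and near-zero net inputs it blows up instead:

```python
import numpy as np

def logistic_prime(net):
    y = 1.0 / (1.0 + np.exp(-net))
    return y * (1.0 - y)           # maximal value 0.25, attained at net = 0

def backprop_scaling(w, q, net_value=0.0):
    """Product of the per-step factors f'(net) * w over q steps back in time
    (single unit, net input held fixed for simplicity)."""
    factor = 1.0
    for _ in range(q):
        factor *= logistic_prime(net_value) * w
    return factor

if __name__ == "__main__":
    for w in (1.0, 3.9, 5.0):
        print(f"w = {w}: scaling after 100 steps = {backprop_scaling(w, 100):.3e}")
    # w = 1.0 and w = 3.9 give per-step factors below 1 (the error decays);
    # w = 5.0 gives 1.25 per step (the error blows up).
```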
3.2 CONSTANT ERROR FLOW: NAIVE APPROACH
A single unit. To avoid vanishing error signals, how can we achieve constant error flow through a single unit j with a single connection to itself? According to the rules above, at time t, j's local error back flow is $\vartheta_j(t) = f'_j(net_j(t))\, \vartheta_j(t+1)\, w_{jj}$. To enforce constant error flow through j, we require $f'_j(net_j(t))\, w_{jj} = 1.0$.
In the experiments, this will be ensured by using the identity function $f_j$: $f_j(x) = x,\ \forall x$, and by setting $w_{jj} = 1.0$. We refer to this as the constant error carrousel (CEC). CEC will be LSTM's central feature (see Section 4). Of course unit j will not only be connected to itself but also to other units. This invokes two obvious, related problems (also inherent in all other gradient-based approaches):
1. Input weight conflict: a single incoming weight $w_{ji}$ has to be used both for storing certain inputs and for ignoring others, so it will often receive conflicting weight update signals during training, which makes learning difficult.
2. Output weight conflict: likewise, a single outgoing weight $w_{kj}$ has to be used both for retrieving j's content at certain times and for preventing j from disturbing unit k at other times, again producing conflicting update signals.
Of course, input and output weight conflicts are not specific to long time lags, but occur for short time lags as well. Their effects, however, become particularly pronounced in the long time lag case: as the time lag increases, (1) stored information must be protected against perturbation for longer and longer periods, and, especially in advanced stages of learning, (2) more and more already correct outputs also require protection against perturbation. Due to the problems above, the naive approach does not work well except in case of certain simple problems involving local input/output representations and non-repeating input patterns (see Hochreiter 1991 and Silva et al. 1996). The next section shows how to do it right.
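A minimal sketch of the naive constant error carrousel itself (my own illustration): a self-connected linear unit with $f_j(x) = x$ and $w_{jj} = 1.0$ preserves both its activation and a backpropagated error signal over arbitrarily many steps, whereas a logistic unit with $|f'\,w| < 1$ loses them:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def run_cec(steps=1000, state=1.0, error=1.0):
    """Self-connected linear unit with w_jj = 1.0: the constant error carrousel."""
    w_jj = 1.0
    for _ in range(steps):
        state = w_jj * state            # f_j is the identity, so the state is preserved
        error = 1.0 * w_jj * error      # f_j' == 1, so the error signal is preserved too
    return state, error

def run_sigmoid_unit(steps=1000, state=1.0, error=1.0, w_jj=0.9):
    """The same loop with a logistic unit: the error signal decays towards zero."""
    for _ in range(steps):
        net = w_jj * state
        state = sigmoid(net)
        error = sigmoid(net) * (1.0 - sigmoid(net)) * w_jj * error   # |f'(net) * w_jj| < 1
    return state, error

print("CEC unit     :", run_cec())           # state and error both stay at 1.0
print("sigmoid unit :", run_sigmoid_unit())  # error has shrunk to (almost) 0.0
```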
4 LONG SHORT-TERM MEMORY
Memory cells and gate units. To construct an architecture that allows for constant error flow through special, self-connected units without the disadvantages of the naive approach, we extend the constant error carrousel CEC embodied by the self-connected, linear unit j from Section 3.2 by introducing additional features. A multiplicative input gate unit is introduced to protect the memory contents stored in j from perturbation by irrelevant inputs. Likewise, a multiplicative output gate unit is introduced which protects other units from perturbation by currently irrelevant memory contents stored in j.
Figure 1: Architecture of memory cell $c_j$ (the box) and its gate units $in_j$, $out_j$. The self-recurrent connection (with weight 1.0) indicates feedback with a delay of 1 time step. It builds the basis of the "constant error carrousel" CEC. The gate units open and close access to CEC. See text and appendix A.1 for details.
Why gate units? To avoid input weight conflicts, $in_j$ controls the error flow to memory cell $c_j$'s input connections $w_{c_j i}$. To circumvent $c_j$'s output weight conflicts, $out_j$ controls the error flow from unit j's output connections. In other words, the net can use $in_j$ to decide when to keep or override information in memory cell $c_j$, and $out_j$ to decide when to access memory cell $c_j$ and when to prevent other units from being perturbed by $c_j$ (see Figure 1).
Error signals trapped within a memory cell's CEC cannot change, but different error signals flowing into the cell (at different times) via its output gate may get superimposed. The output gate will have to learn which errors to trap in its CEC, by appropriately scaling them. The input gate will have to learn when to release errors, again by appropriately scaling them. Essentially, the multiplicative gate units open and close access to constant error flow through CEC.
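The forward dynamics described so far can be written down in a few lines. The sketch below is my own reconstruction of a single 1997-style memory cell (no forget gate), with input gate $in_j$, output gate $out_j$, cell input squashing function g, cell output squashing function h, and internal state $s_{c_j}$; the weight shapes and the use of tanh in place of the paper's scaled logistic squashing functions are assumptions made for brevity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def memory_cell_forward(x_seq, W_in, W_out, W_c):
    """Forward pass of one LSTM-1997 memory cell.

    x_seq            : (T, D) input vectors (assumed to already contain any
                       recurrent and bias components, for simplicity).
    W_in, W_out, W_c : (D,) weight vectors of input gate, output gate, cell input.
    """
    g = np.tanh                     # cell input squashing (stand-in for the paper's g)
    h = np.tanh                     # cell output squashing (stand-in for the paper's h)
    s_c = 0.0                       # internal state: the constant error carrousel
    outputs = []
    for x in x_seq:
        y_in  = sigmoid(W_in  @ x)  # input gate activation  y^{in_j}
        y_out = sigmoid(W_out @ x)  # output gate activation y^{out_j}
        net_c = W_c @ x             # memory cell net input  net_{c_j}
        s_c   = s_c + y_in * g(net_c)   # self-connection with fixed weight 1.0 (the CEC)
        outputs.append(y_out * h(s_c))  # cell output, gated by the output gate
    return np.array(outputs), s_c
```

When the input gate is driven towards zero, $s_{c_j}$ stays constant no matter what arrives at $net_{c_j}$; when the output gate is driven towards zero, downstream units are shielded from $c_j$. These are exactly the two protections motivated above.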
Distributed output representations typically do require output gates. Not always are both gate types necessary, though; one may be sufficient. For instance, in Experiments 2a and 2b in Section 5, it will be possible to use input gates only. In fact, output gates are not required in case of local output encoding; preventing memory cells from perturbing already learned outputs can be done by simply setting the corresponding weights to zero. Even in this case, however, output gates can be beneficial: they prevent the net's attempts at storing long time lag memories (which are usually hard to learn) from perturbing activations representing easily learnable short time lag memories. (This will prove quite useful in Experiment 1, for instance.)
Network topology. We use networks with one input layer, one hidden layer, and one output layer. The (fully) self-connected hidden layer contains memory cells and corresponding gate units (for convenience, we refer to both memory cells and gate units as being located in the hidden layer). The hidden layer may also contain "conventional" hidden units providing inputs to gate units and memory cells. All units (except for gate units) in all layers have directed connections (serve as inputs) to all units in the layer above (or to all higher layers; Experiments 2a and 2b).

Memory cell blocks. S memory cells sharing the same input gate and the same output gate form a structure called a "memory cell block of size S". Memory cell blocks facilitate information storage; as with conventional neural nets, it is not so easy to code a distributed input within a single cell. Since each memory cell block has as many gate units as a single memory cell (namely two), the block architecture can be even slightly more efficient (see paragraph "computational complexity"). A memory cell block of size 1 is just a simple memory cell. In the experiments (Section 5), we will use memory cell blocks of various sizes.
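A sketch of a "memory cell block of size S" under the same illustrative assumptions as the previous snippet: S internal states share one input gate and one output gate, so the number of gate units does not grow with S:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class MemoryCellBlock:
    """S memory cells sharing a single input gate and a single output gate."""

    def __init__(self, input_dim, block_size, rng=None):
        rng = rng or np.random.default_rng(0)
        self.W_in  = rng.uniform(-0.1, 0.1, input_dim)                # shared input gate weights
        self.W_out = rng.uniform(-0.1, 0.1, input_dim)                # shared output gate weights
        self.W_c   = rng.uniform(-0.1, 0.1, (block_size, input_dim))  # one weight row per cell
        self.s     = np.zeros(block_size)                             # S internal states (CECs)

    def step(self, x):
        y_in  = sigmoid(self.W_in  @ x)
        y_out = sigmoid(self.W_out @ x)
        self.s = self.s + y_in * np.tanh(self.W_c @ x)   # all S CECs gated by the same input gate
        return y_out * np.tanh(self.s)                   # S cell outputs behind the shared output gate
```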
Learning. We use a variant of RTRL (e.g., Robinson and Fallside 1987) which properly takes into account the altered, multiplicative dynamics caused by input and output gates. However, to ensure non-decaying error backprop through the internal states of memory cells, as with truncated BPTT (e.g., Williams and Peng 1990), errors arriving at "memory cell net inputs" (for cell $c_j$, this includes $net_{c_j}$, $net_{in_j}$, $net_{out_j}$) do not get propagated back further in time (although they do serve to change the incoming weights). Only within memory cells are errors propagated back through previous internal states $s_{c_j}$. To visualize this: once an error signal arrives at a memory cell output, it gets scaled by the output gate activation and $h'$. Then it is within the memory cell's CEC, where it can flow back indefinitely without ever being scaled. Only when it leaves the memory cell through the input gate and g is it scaled once more, by the input gate activation and $g'$. It then serves to change the incoming weights before it is truncated (see the appendix for explicit formulae).

Computational complexity. As with Mozer's focused recurrent backprop algorithm (Mozer 1989), only the derivatives $\partial s_{c_j} / \partial w_{il}$ need to be stored and updated. Hence the LSTM algorithm is very efficient, with an excellent update complexity of O(W), where W is the number of weights (see details in Appendix A.1). Hence, LSTM and BPTT for fully recurrent nets have the same update complexity per time step (while RTRL's is much worse). Unlike full BPTT, however, LSTM is local in space and time: there is no need to store activation values observed during sequence processing in a stack with potentially unlimited size.
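To illustrate why only O(W) quantities need to be stored: because the CEC's self-connection is fixed at 1.0, the truncated partial derivative of the internal state with respect to each incoming weight can be accumulated online by simple addition. The sketch below is my own simplification (it tracks only the cell-input weights and ignores the input gate's own partials, which the full algorithm in Appendix A.1 also maintains):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh_prime(x):
    return 1.0 - np.tanh(x) ** 2

def accumulate_state_partials(x_seq, W_in, W_c):
    """Online accumulation of d s_c / d W_c under LSTM's gradient truncation.

    Storage is one number per weight in W_c, i.e. O(W); nothing has to be
    pushed onto a stack as the sequence length grows (unlike full BPTT).
    """
    s_c = 0.0
    ds_dWc = np.zeros_like(W_c)
    for x in x_seq:
        y_in  = sigmoid(W_in @ x)
        net_c = W_c @ x
        s_c   = s_c + y_in * np.tanh(net_c)
        # previous partial is carried through the CEC unchanged (factor 1.0),
        # plus the immediate contribution of the current step's net input
        ds_dWc = ds_dWc + y_in * tanh_prime(net_c) * x
    return s_c, ds_dWc
```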
Abuse problem and solutions. In the beginning of the learning phase, error reduction may be possible without storing information over time. The network will thus tend to abuse memory cells, e.g., as bias cells (i.e., it might make their activations constant and use the outgoing connections as adaptive thresholds for other units). The potential difficulty is: it may take a long time to release abused memory cells and make them available for further learning. A similar "abuse problem" appears if two memory cells store the same (redundant) information. There are at least two solutions to the abuse problem: (1) Sequential network construction (e.g., Fahlman 1991): a memory cell and the corresponding gate units are added to the network whenever the error stops decreasing (see Experiment 2 in Section 5). (2) Output gate bias: each output gate gets a negative initial bias, to push initial memory cell activations towards zero. Memory cells with more negative bias automatically get "allocated" later (see Experiments 1, 3, 4, 5, 6 in Section 5).
Internal state drift and remedies. If memory cell $c_j$'s inputs are mostly positive or mostly negative, then its internal state $s_j$ will tend to drift away over time. This is potentially dangerous, for $h'(s_j)$ will then adopt very small values, and the gradient will vanish. One way to circumvent this problem is to choose an appropriate function h. But $h(x) = x$, for instance, has the disadvantage of unrestricted memory cell output range. Our simple but effective way of solving drift problems at the beginning of learning is to initially bias the input gate $in_j$ towards zero. Although there is a tradeoff between the magnitudes of $h'(s_j)$ on the one hand and of $y^{in_j}$ and $f'_{in_j}$ on the other, the potential negative effect of input gate bias is negligible compared to the one of the drifting effect. With logistic sigmoid activation functions, there appears to be no need for fine-tuning the initial bias, as confirmed by Experiments 4 and 5 in Section 5.4.
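Both remedies boil down to how the gate biases are initialized. A minimal sketch (the specific numbers are my assumptions; the paper lists its per-experiment choices in Section 5.7): output gate biases start negative and increasingly so from cell to cell, so that cells are "allocated" one after another, while input gate biases start at or below zero so the internal states do not drift at the start of learning:

```python
import numpy as np

def init_gate_biases(num_cells, out_start=-1.0, out_step=-1.0, in_bias=-1.0):
    """Staggered negative output gate biases plus negative input gate biases."""
    out_biases = out_start + out_step * np.arange(num_cells)   # e.g. -1, -2, -3, ...
    in_biases  = np.full(num_cells, in_bias)                   # keep input gates nearly closed at first
    return in_biases, out_biases

in_b, out_b = init_gate_biases(4)
print("input gate biases :", in_b)    # [-1. -1. -1. -1.]
print("output gate biases:", out_b)   # [-1. -2. -3. -4.]
```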
5 EXPERIMENTS
Introduction. Which tasks are appropriate to demonstrate the quality of a novel long time lag algorithm? First of all, minimal time lags between relevant input signals and corresponding teacher signals must be long for all training sequences. In fact, many previous recurrent net algorithms sometimes manage to generalize from very short training sequences to very long test sequences. See, e.g., Pollack (1991). But a real long time lag problem does not have any short time lag exemplars in the training set. For instance, Elman's training procedure, BPTT, offline RTRL, online RTRL, etc., fail miserably on real long time lag problems. See, e.g., Hochreiter (1991) and Mozer (1992). A second important requirement is that the tasks should be complex enough such that they cannot be solved quickly by simple-minded strategies such as random weight guessing.
Guessing can outperform many long time lag algorithms. Recently we discovered (Schmidhuber and Hochreiter 1996, Hochreiter and Schmidhuber 1996, 1997) that many long time lag tasks used in previous work can be solved more quickly by simple random weight guessing than by the proposed algorithms. For instance, guessing solved a variant of Bengio and Frasconi's "parity problem" (1994) much faster than the seven methods tested by Bengio et al. (1994) and Bengio and Frasconi (1994). Similarly for some of Miller and Giles' problems (1993). Of course, this does not mean that guessing is a good algorithm. It just means that some previously used problems are not extremely appropriate to demonstrate the quality of previously proposed algorithms.
What's common to Experiments 1-6. All our experiments (except for Experiment 1) involve long minimal time lags; there are no short time lag training exemplars facilitating learning. Solutions to most of our tasks are sparse in weight space. They require either many parameters/inputs or high weight precision, such that random weight guessing becomes infeasible.
We always use on-line learning (as opposed to batch learning), and logistic sigmoids as activation functions. For Experiments 1 and 2, initial weights are chosen in the range [-0.2, 0.2], for the other experiments in [-0.1, 0.1]. Training sequences are generated randomly according to the various task descriptions. In slight deviation from the notation in Appendix A.1, each discrete time step of each input sequence involves three processing steps.
For comparisons with recurrent nets taught by gradient descent, we give results only for RTRL, except for comparison 2a, which also includes BPTT. Note, however, that untruncated BPTT (see, e.g., Williams and Peng 1990) computes exactly the same gradient as offline RTRL. With long time lag problems, offline RTRL (or BPTT) and the online version of RTRL (no activation resets, online weight changes) lead to almost identical, negative results (as confirmed by additional simulations in Hochreiter 1991; see also Mozer 1992). This is because offline RTRL, online RTRL, and full BPTT all suffer badly from exponential error decay.
Our LSTM architectures are selected quite arbitrarily. If nothing is known about the complexity of a given problem, a more systematic approach would be: start with a very small net consisting of one memory cell. If this does not work, try two cells, etc. Alternatively, use sequential network construction (e.g., Fahlman 1991).
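A small sketch of the shared experimental setup described above (the weight ranges are the ones just quoted; the learning rate value is an illustrative assumption): uniform weight initialization, and on-line updates applied after every time step rather than accumulated over a batch:

```python
import numpy as np

def init_weights(shape, experiment_id, rng=None):
    """Uniform initial weights: [-0.2, 0.2] for Experiments 1 and 2, [-0.1, 0.1] otherwise."""
    rng = rng or np.random.default_rng(0)
    r = 0.2 if experiment_id in (1, 2) else 0.1
    return rng.uniform(-r, r, size=shape)

def online_update(weights, gradient, learning_rate=0.1):
    """On-line learning: the weight change is applied immediately, not batched."""
    return weights - learning_rate * gradient
```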
Outline of experiments
Subsection 5.7 will provide a detailed summary of experimental conditions in two tables for reference.
5.1 EXPERIMENT 1: EMBEDDED REBER GRAMMAR
Task. Our first task is to learn the "embedded Reber grammar", e.g., Smith and Zipser (1989), Cleeremans et al. (1989), and Fahlman (1991). Since it allows for training sequences with short time lags (of as few as 9 steps), it is not a long time lag problem. We include it for two reasons:
1. It is a popular recurrent net benchmark used by many authors; we wanted to have at least one experiment where RTRL and BPTT do not fail completely.
2. It shows nicely how output gates can be beneficial.
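For reference, below is a sketch of a generator for embedded Reber strings (my own implementation of the grammar as it is usually drawn in the literature; the paper gives the grammar only as a figure, so the transition table here is a reconstruction). An embedded string is B, then T or P, then an ordinary Reber string, then the same T or P again, then E; predicting the second-to-last symbol therefore requires remembering the second symbol across the whole embedded part:

```python
import random

# Ordinary Reber grammar: per state, the possible (next_state, emitted_symbol) pairs.
REBER = {
    0: [(1, "T"), (2, "P")],
    1: [(1, "S"), (3, "X")],
    2: [(2, "T"), (4, "V")],
    3: [(2, "X"), (5, "S")],
    4: [(3, "P"), (5, "V")],
}

def reber_string(rng=random):
    symbols, state = ["B"], 0
    while state != 5:                       # state 5 is the accepting state
        state, symbol = rng.choice(REBER[state])
        symbols.append(symbol)
    symbols.append("E")
    return "".join(symbols)

def embedded_reber_string(rng=random):
    wrapper = rng.choice(["T", "P"])        # must be remembered until the end
    return "B" + wrapper + reber_string(rng) + wrapper + "E"

if __name__ == "__main__":
    random.seed(0)
    for _ in range(3):
        print(embedded_reber_string())      # prints three sample embedded Reber strings
```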