Paper: Translation and Interpretation of "Adam: A Method for Stochastic Optimization"
Contents
Adam: A Method for Stochastic Optimization
ABSTRACT
1. INTRODUCTION
3. CONCLUSION
Adam: A Method for Stochastic Optimization
Paper source: Adam: A Method for Stochastic Optimization
ABSTRACT
We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.
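To make the update the abstract refers to concrete, here is a minimal NumPy sketch of one Adam step written for this post (it is not the paper's pseudocode verbatim): exponential moving averages of the gradient and of the squared gradient, bias correction, then an element-wise scaled update. The function and variable names are illustrative; the defaults shown (alpha = 0.001, beta1 = 0.9, beta2 = 0.999, eps = 1e-8) are the values the paper suggests.

```python
import numpy as np

def adam_step(theta, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. theta: parameters, g: gradient at theta,
    m, v: first- and second-moment estimates, t: 1-based timestep."""
    m = beta1 * m + (1 - beta1) * g        # decaying average of the gradient (1st moment)
    v = beta2 * v + (1 - beta2) * g ** 2   # decaying average of the squared gradient (2nd raw moment)
    m_hat = m / (1 - beta1 ** t)           # bias-corrected 1st moment
    v_hat = v / (1 - beta2 ** t)           # bias-corrected 2nd moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

A training loop would initialize m and v to zeros with the same shape as theta and call adam_step once per mini-batch, incrementing t each time. AdaMax, mentioned at the end of the abstract, replaces the square-root-of-second-moment denominator with a quantity based on the infinity norm of past gradients.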
1. INTRODUCTION
Stochastic gradient-based optimization is of core practical importance in many fields of science and engineering. Many problems in these fields can be cast as the optimization of some scalar parameterized objective function requiring maximization or minimization with respect to its parameters. If the function is differentiable w.r.t. its parameters, gradient descent is a relatively efficient optimization method, since the computation of first-order partial derivatives w.r.t. all the parameters is of the same computational complexity as just evaluating the function. Often, objective functions are stochastic. For example, many objective functions are composed of a sum of subfunctions evaluated at different subsamples of data; in this case optimization can be made more efficient by taking gradient steps w.r.t. individual subfunctions, i.e. stochastic gradient descent (SGD) or ascent. SGD proved itself as an efficient and effective optimization method that was central in many machine learning success stories, such as recent advances in deep learning (Deng et al., 2013; Krizhevsky et al., 2012; Hinton & Salakhutdinov, 2006; Hinton et al., 2012a; Graves et al., 2013). Objectives may also have other sources of noise than data subsampling, such as dropout (Hinton et al., 2012b) regularization. For all such noisy objectives, efficient stochastic optimization techniques are required. The focus of this paper is on the optimization of stochastic objectives with high-dimensional parameter spaces. In these cases, higher-order optimization methods are ill-suited, and discussion in this paper will be restricted to first-order methods.
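As a small illustration of "taking gradient steps w.r.t. individual subfunctions", the following is a hedged sketch of a plain SGD loop over mini-batches. The names (sgd, grad_fn, batch_size) are hypothetical placeholders for this post, not anything defined in the paper; data is assumed to be a NumPy array of training examples.

```python
import numpy as np

def sgd(theta, grad_fn, data, lr=0.01, batch_size=32, epochs=10):
    """Minimal stochastic gradient descent: one gradient step per data subsample.
    grad_fn(theta, batch) is assumed to return the gradient of the sub-objective
    evaluated on `batch`."""
    n = len(data)
    for _ in range(epochs):
        perm = np.random.permutation(n)              # reshuffle the subfunctions each epoch
        for i in range(0, n, batch_size):
            batch = data[perm[i:i + batch_size]]
            theta = theta - lr * grad_fn(theta, batch)   # step w.r.t. one subfunction
    return theta
```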
We propose Adam, a method for efficient stochastic optimization that only requires first-order gradients with little memory requirement. The method computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients; the name Adam is derived from adaptive moment estimation. Our method is designed to combine the advantages of two recently popular methods: AdaGrad (Duchi et al., 2011), which works well with sparse gradients, and RMSProp (Tieleman & Hinton, 2012), which works well in on-line and non-stationary settings; important connections to these and other stochastic optimization methods are clarified in section 5. Some of Adam's advantages are that the magnitudes of parameter updates are invariant to rescaling of the gradient, its stepsizes are approximately bounded by the stepsize hyperparameter, it does not require a stationary objective, it works with sparse gradients, and it naturally performs a form of step size annealing.
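For comparison with the Adam sketch above, here is a hedged sketch of the two per-parameter scalings Adam builds on. The function names and the default constants are illustrative choices for this post, not values taken from the paper.

```python
import numpy as np

def adagrad_step(theta, g, G, lr=0.01, eps=1e-8):
    """AdaGrad: scale by the square root of the lifetime sum of squared gradients,
    so rarely updated (sparse) coordinates keep relatively large effective steps."""
    G = G + g ** 2
    theta = theta - lr * g / (np.sqrt(G) + eps)
    return theta, G

def rmsprop_step(theta, g, v, lr=0.001, decay=0.9, eps=1e-8):
    """RMSProp: scale by the square root of an exponentially decaying average of
    squared gradients, which can track non-stationary objectives."""
    v = decay * v + (1 - decay) * g ** 2
    theta = theta - lr * g / (np.sqrt(v) + eps)
    return theta, v
```

Adam keeps an RMSProp-style decaying average for the second moment, adds a decaying average of the gradient itself (the first moment), and bias-corrects both, which is what yields the approximately bounded step sizes mentioned above.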
3. CONCLUSION
We have introduced a simple and computationally efficient algorithm for gradient-based optimization of stochastic objective functions. Our method is aimed towards machine learning problems with large datasets and/or high-dimensional parameter spaces. The method combines the advantages of two recently popular optimization methods: the ability of AdaGrad to deal with sparse gradients, and the ability of RMSProp to deal with non-stationary objectives. The method is straightforward to implement and requires little memory. The experiments confirm the analysis on the rate of convergence in convex problems. Overall, we found Adam to be robust and well-suited to a wide range of non-convex optimization problems in the field of machine learning.