Contrastive Learning Paper Series, CPC (Part 2): Representation Learning with Contrastive Predictive Coding

Published: 2025/4/5


0.Abstract

0.1 Sentence-by-sentence walkthrough

While supervised learning has enabled great progress in many applications, unsupervised learning has not seen such widespread adoption, and remains an important and challenging endeavor for artificial intelligence.
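Supervised learning has driven great progress in many applications, but unsupervised learning has not been adopted nearly as widely and remains an important, difficult goal for artificial intelligence.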

In this work, we propose a universal unsupervised learning approach to extract useful representations from high-dimensional data, which we call Contrastive Predictive Coding.
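In this work we propose a general-purpose unsupervised learning method for extracting useful representations from high-dimensional data, which we call Contrastive Predictive Coding.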

The key insight of our model is to learn such representations by predicting the future in latent space by using powerful autoregressive models.
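The key idea of our model is to learn these representations by predicting the future in latent space with powerful autoregressive models.
(The word "future" seems to carry a specific meaning here, but I do not fully understand it yet at this point.)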

We use a probabilistic contrastive loss which induces the latent space to capture information that is maximally useful to predict future samples.
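We use a probabilistic contrastive loss that pushes the latent space to capture the information most useful for predicting future samples.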

It also makes the model tractable by using negative sampling.
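Negative sampling also keeps the model computationally tractable.
(Roughly: using negative samples makes the whole model easier to handle. "Tractable" here means that training stays computationally feasible.)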

While most prior work has focused on evaluating representations for a particular modality, we demonstrate that our approach is able to learn useful representations achieving strong performance on four distinct domains: speech, images, text and reinforcement learning in 3D environments.
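Most earlier work evaluated representations for one particular modality, whereas we show that our approach learns useful representations that perform strongly in four distinct domains: speech, images, text, and reinforcement learning in 3D environments.
(Roughly: earlier methods were shown to work well only in one specific setting, while the method proposed here is validated across several different domains.)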

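0.2 Summary

1. This work studies unsupervised representation learning.
2. It centers on predicting something called the "future" in latent space.
3. It uses negative samples, which keeps the model tractable to train.
4. It has been tested and works well across a range of domains.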

1.Introduction

Paragraph 1 (acknowledges the progress of supervised feature learning and points out its shortcomings)

Learning high-level representations from labeled data with layered differentiable models in an end-to-end fashion is one of the biggest successes in artificial intelligence so far.
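Learning high-level representations from labeled data, end to end, with layered differentiable models is one of the biggest successes in artificial intelligence to date. (This acknowledges how well supervised learning works for learning features.)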

These techniques made manually specified features largely redundant and have greatly improved state-of-the-art in several real-world applications [1, 2, 3].
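These techniques have made hand-specified features largely redundant and have greatly advanced the state of the art in several real-world applications.
(That is, automatic feature extraction can already replace hand-crafted features in many applications.)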

However, many challenges remain, such as data efficiency, robustness or generalization.
Even so, many challenges are still open, including data efficiency, robustness, and generalization.

Paragraph 2 (unsupervised features are not specialized to one task, so they may be more robust)

Improving representation learning requires features that are less specialized towards solving a single supervised task.
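Better representation learning needs features that are not specialized to solving a single supervised task.
(On how to improve representation learning.)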

For example, when pre-training a model to do image classification, the induced features transfer reasonably well to other image classification domains, but also lack certain information such as color or the ability to count that are irrelevant for classification but relevant for e.g. image captioning [4].
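For example, when a model is pre-trained for image classification, the induced features transfer reasonably well to other image classification domains, but they lack certain information, such as color or the ability to count, that is irrelevant for classification yet relevant for tasks like image captioning [4].
(There is a problem here: when we train a network for one specific classification task, the features it extracts are not comprehensive, even though the network has upsampling and downsampling stages. We only extract features for the current classification task, and many features that would be useful in other domains are discarded.)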

Similarly, features that are useful to transcribe human speech may be less suited for speaker identification, or music genre prediction.
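Similarly, features that are useful for transcribing human speech may be less suitable for speaker identification or music genre prediction.
(That is, features extracted for one task may transfer poorly to other domains.)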

Thus, unsupervised learning is an important stepping stone towards robust and generic representation learning.
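Unsupervised learning is therefore an important stepping stone towards robust and generic representation learning.
(That is, robustness suffers because we unintentionally discard some features during feature extraction. Unsupervised learning can avoid this, so it can help solve the problem.)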

Paragraph 3 (but there is currently no good unsupervised learning method, nor a good way to evaluate one)

Despite its importance, unsupervised learning is yet to see a breakthrough similar to supervised learning: modeling high-level representations from raw observations remains elusive.
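Despite its importance, unsupervised learning has not yet seen a breakthrough comparable to supervised learning: modeling high-level representations from raw observations remains elusive.
(The analysis above shows that unsupervised learning matters a great deal, yet there is still no effective way to obtain unsupervised representations from raw observations.)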

Further, it is not always clear what the ideal representation is and if it is possible that one can learn such a representation without additional supervision or specialization to a particular data modality.
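Moreover, it is not always clear what the ideal representation is, or whether such a representation can be learned without additional supervision or specialization to a particular data modality.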

Paragraph 4 (introduces the idea behind the proposed method)

One of the most common strategies for unsupervised learning has been to predict future, missing or contextual information.
Predicting future, missing, or contextual information is one of the most common strategies in unsupervised learning.

This idea of predictive coding [5, 6] is one of the oldest techniques in signal processing for data compression.
This idea, predictive coding [5, 6], is one of the oldest data compression techniques in signal processing.

In neuroscience, predictive coding theories suggest that the brain predicts observations at various levels of abstraction [7, 8].
In neuroscience, predictive coding theories hold that the brain predicts observations at different levels of abstraction [7, 8].

Recent work in unsupervised learning has successfully used these ideas to learn word representations by predicting neighboring words [9].
Recent unsupervised learning work has used these ideas successfully to learn word representations by predicting neighboring words [9].

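For images, predicting color from grey-scale or the relative position of image patches has also been shown useful [10, 11].
For images, predicting color from a grey-scale input, or predicting the relative position of image patches, has also proved useful [10, 11].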

We hypothesize that these approaches are fruitful partly because the context from which we predict related values are often conditionally dependent on the same shared high-level latent information.
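We hypothesize that these approaches work partly because the context we predict from and the values we predict often depend on the same shared high-level latent information.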

And by casting this as a prediction problem, we automatically infer these features of interest to representation learning.
By casting this as a prediction problem, we automatically infer the features that are of interest for representation learning.

Paragraph 5 (introduces the proposed CPC)

In this paper we propose the following: first, we compress high dimensional data into a much more compact latent embedding space in which conditional predictions are easier to model.
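In this paper we propose the following:
First, we compress high-dimensional data into a much more compact latent embedding space in which conditional predictions are easier to model.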

Secondly, we use powerful autoregressive models in this latent space to make predictions many steps in the future.
Second, we use powerful autoregressive models in this latent space to predict many steps into the future.

Finally, we rely on Noise-Contrastive Estimation [12] for the loss function in similar ways that have been used for learning word embeddings in natural language models, allowing for the whole model to be trained end-to-end.
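Finally, we rely on Noise-Contrastive Estimation [12] for the loss function, in a way similar to how it has been used to learn word embeddings in natural language models, which allows the whole model to be trained end to end.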

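We apply the resulting model, Contrastive Predictive Coding (CPC) to widely different data modalities, images, speech, natural language and reinforcement learning, and show that the same mechanism learns interesting high-level information on each of these domains, outperforming other approaches.
We apply the resulting model, Contrastive Predictive Coding (CPC), to very different data modalities (images, speech, natural language, and reinforcement learning) and show that the same mechanism learns interesting high-level information in each domain, outperforming other approaches.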
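The contrastive loss mentioned in this paragraph can be sketched as follows. This is only a minimal illustration under my own assumptions: the score is a log-bilinear form z^T W c, the loss is the negative log-softmax of the true future sample against the negatives, and the names `info_nce_loss`, `W`, `c_t`, and `z_future` are mine rather than taken from the text above.

```python
import math

def info_nce_loss(context, positive, negatives, W):
    """Negative log-probability of picking the positive among positive + negatives.

    Each candidate z is scored with the (assumed) log-bilinear form z^T W c.
    """
    # W c is the prediction made from the context vector.
    Wc = [sum(w * cj for w, cj in zip(row, context)) for row in W]

    def score(z):
        return sum(zi * pi for zi, pi in zip(z, Wc))

    logits = [score(positive)] + [score(n) for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]

# Toy usage: identity W, so the score is just the dot product z . c
W = [[1.0, 0.0], [0.0, 1.0]]
c_t = [1.0, 0.0]                        # context at time t
z_future = [2.0, 0.0]                   # encoding of the true future sample
negatives = [[-1.0, 0.0], [0.0, 1.0]]   # encodings of unrelated samples

loss_good = info_nce_loss(c_t, z_future, negatives, W)
loss_bad = info_nce_loss(c_t, negatives[0], [z_future, negatives[1]], W)
print(loss_good < loss_bad)  # prints True
```

The loss is small when the context scores the true future sample higher than the negatives, and it only ever compares a handful of candidates, which is where the tractability from negative sampling comes from.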

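1.2 Summary

The rough logic is:

1. Supervised feature extraction already works well, but the robustness and generalization of the resulting features still fall short.
2. The authors attribute this to training on task-specific labels: we may extract only the information relevant to the current label domain and discard features that matter in other domains.
3. Unsupervised learning has no task-specific labels, so the features are not biased toward any single domain, which avoids this problem. The authors therefore argue that unsupervised learning can yield more robust features.
4. However, unsupervised learning has no mature method yet, nor a mature way to evaluate one. (This was probably true when the paper was written. The CPC preprint appeared in 2018, before SimCLR and MoCo. As I understand it, those two were accepted without trouble while this paper was rejected several times, so by the time of its later submissions contrastive learning had already matured.)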

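From the above, the authors motivate their own method:

1. First, taking inspiration from the old technique of predictive coding (in compression, inferring neighboring information from one position), a model can be trained effectively by predicting neighboring content.
2. The authors therefore propose a method based on predicting what comes before and after.

The authors then describe the method in detail:

1. First, compress all the data into a compact latent space. As I understand it, this may also improve training efficiency to some extent.
2. Use the resulting information to predict the neighboring content (the authors presumably assume that neighboring parts share some underlying information).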

2 Contrastive Predictive Coding

We start this section by motivating and giving intuitions behind our approach.
We begin this section with the motivation and intuitions behind our approach.

Next, we introduce the architecture of Contrastive Predictive Coding (CPC).
We then present the architecture of Contrastive Predictive Coding (CPC).

After that we explain the loss function that is based on Noise-Contrastive Estimation.
After that, we explain the loss function, which is based on Noise-Contrastive Estimation.

Lastly, we discuss related work to CPC.
Finally, we discuss work related to CPC.

2.1 Motivation and Intuitions

2.1.1 Sentence-by-sentence walkthrough

Paragraph 1 (features that span long ranges may better reflect global information and are less affected by noise)

The main intuition behind our model is to learn the representations that encode the underlying shared information between different parts of the (high-dimensional) signal.
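The main intuition behind our model is to learn representations that encode the underlying information shared between different parts of the (high-dimensional) signal.
(That is, different parts of a high-dimensional signal share a lot of underlying information.)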

At the same time it discards low-level information and noise that is more local.
At the same time, it discards low-level information and noise, which are more local.

In time series and high-dimensional modeling, approaches that use next step prediction exploit the local smoothness of the signal. When predicting further in the future, the amount of shared information becomes much lower, and the model needs to infer more global structure.
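In time series and high-dimensional modeling, methods based on next-step prediction exploit the local smoothness of the signal. When predicting further into the future, the amount of shared information becomes much lower, and the model has to infer more global structure.
(Because the signal is locally smooth, inferring nearby values is relatively easy. Predicting values further away requires capturing more global information.)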
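To make "the amount of shared information becomes much lower" concrete, here is a small sketch using a toy signal that is not from the paper: a stationary Gaussian AR(1) process. For such a process the correlation between x_t and x_{t+k} is phi^k, and the mutual information of two jointly Gaussian variables with correlation rho is -0.5 * ln(1 - rho^2). The coefficient `phi = 0.95` is an arbitrary choice for illustration.

```python
import math

def gaussian_ar1_mutual_information(phi: float, k: int) -> float:
    """Mutual information (in nats) between x_t and x_{t+k}
    for a stationary Gaussian AR(1) process with coefficient phi."""
    rho = phi ** k                        # correlation at lag k
    return -0.5 * math.log(1.0 - rho ** 2)

phi = 0.95  # assumed smoothness of the toy signal
for k in (1, 5, 20, 100):
    print(k, round(gaussian_ar1_mutual_information(phi, k), 5))
```

The printed values shrink quickly as k grows: the next step is almost determined by local smoothness, while a distant step shares very little with the present unless the model captures slower, more global structure.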

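These 'slow features' [13] that span many time steps are often more interesting (e.g., phonemes and intonation in speech, objects in images, or the story line in books).
These "slow features" [13], which span many time steps, are usually the more interesting ones (for example, phonemes and intonation in speech, objects in images, or the story line in a book).
(That is, information that spans a long stretch of time tends to say more about the global content.)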

Paragraph 2

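One of the challenges of predicting high-dimensional data is that unimodal losses such as mean squared error and cross-entropy are not very useful, and powerful conditional generative models which need to reconstruct every detail in the data are usually required.
One challenge in predicting high-dimensional data is that unimodal losses (such as mean squared error and cross-entropy) are not very useful, so one usually needs powerful conditional generative models that have to reconstruct every detail of the data.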

But these models are computationally intense, and waste capacity at modeling the complex relationships in the data x, often ignoring the context c.
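But these models are computationally expensive and spend capacity on modeling the complex relationships within the data x, often ignoring the context c.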

For example, images may contain thousands of bits of information while the high-level latent variables such as the class label contain much less information (10 bits for 1,024 categories).
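For example, an image may contain thousands of bits of information, while a high-level latent variable such as the class label contains far less (10 bits are enough to distinguish 1,024 categories).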
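The "10 bits for 1,024 categories" figure is just log2(1024). The sketch below checks that number and, for contrast, counts the raw storage bits of an image. The 64x64 RGB size is a hypothetical choice of mine, and raw storage is only an upper bound on the information an image carries, not the paper's "thousands of bits" estimate.

```python
import math

num_classes = 1024
label_bits = math.log2(num_classes)   # bits needed to name one of 1,024 classes

# Hypothetical image size, chosen only for illustration.
height, width, channels, bits_per_value = 64, 64, 3, 8
raw_image_bits = height * width * channels * bits_per_value

print(label_bits)       # 10.0
print(raw_image_bits)   # 98304
```

Even with this small image, the pixels take up orders of magnitude more bits than the label, which is why spending model capacity on reconstructing every pixel is a poor way to recover the high-level variable.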

This suggests that modeling p(x|c) directly may not be optimal for the purpose of extracting shared information between x and c.
This suggests that modeling p(x|c) directly may not be optimal when the goal is to extract the information shared between x and c.

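When predicting future information we instead encode the target x (future) and context c (present) into compact distributed vector representations (via non-linear learned mappings) in a way that maximally preserves the mutual information of the original signals x and c defined as

$$I(x; c) = \sum_{x, c} p(x, c) \log \frac{p(x \mid c)}{p(x)}$$

When predicting future information, we instead encode the target x (the future) and the context c (the present) into compact distributed vector representations (through learned non-linear mappings) so that as much as possible of the mutual information between the original signals x and c, as defined by the formula above, is preserved.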

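By maximizing the mutual information between the encoded representations (which is bounded by the MI between the input signals), we extract the underlying latent variables the inputs have in common.
By maximizing the mutual information between the encoded representations (which is bounded above by the MI between the input signals), we extract the latent variables that the inputs share.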
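The remark that the MI between encoded representations is bounded by the MI between the inputs can be checked on a tiny discrete example. The joint distribution and the two encoders below are made up for illustration: one encoder halves the size of x but keeps everything x shares with c, the other halves it and throws that information away.

```python
import math
from collections import defaultdict

def mutual_information(joint):
    """I(a; b) in bits for a dict {(a, b): probability}."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

def encode(joint, f):
    """Push x through a deterministic encoder f; return the joint of (f(x), c)."""
    out = defaultdict(float)
    for (x, c), p in joint.items():
        out[(f(x), c)] += p
    return dict(out)

# Made-up joint distribution p(x, c): x in {0, 1, 2, 3}, c in {0, 1}
joint_xc = {
    (0, 0): 0.20, (1, 0): 0.20, (2, 0): 0.05, (3, 0): 0.05,
    (0, 1): 0.05, (1, 1): 0.05, (2, 1): 0.20, (3, 1): 0.20,
}

mi_raw = mutual_information(joint_xc)
mi_good = mutual_information(encode(joint_xc, lambda x: x // 2))  # keeps the shared part
mi_bad = mutual_information(encode(joint_xc, lambda x: x % 2))    # discards the shared part
print(round(mi_raw, 4), round(mi_good, 4), round(mi_bad, 4))
```

Both encoders compress x to one bit, and neither exceeds the MI of the raw pair. The first one reaches the bound and the second drops to zero, which is the sense in which maximizing MI between the encodings selects the latent variables the inputs have in common.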

2.1.2 Summary

Roughly: slowly varying information can be extracted across time steps, which amounts to identifying global features.

2.2 Contrastive Predictive Coding

Figure 1 shows the architecture of Contrastive Predictive Coding models.
Figure 1 of the paper gives an overview of the CPC model architecture.

Paragraph 1 (describes the network architecture)

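First, a non-linear encoder g_enc maps the input sequence of observations x_t to a sequence of latent representations z_t = g_enc(x_t), potentially with a lower temporal resolution.
First, a non-linear encoder g_enc turns the input sequence of observations x_t into a sequence of latent representations z_t = g_enc(x_t), which may have a lower temporal resolution than the input.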

Next, an autoregressive model g_ar summarizes all z_{≤t} in the latent space and produces a context latent representation c_t = g_ar(z_{≤t}).
Then an autoregressive model g_ar summarizes all z_{≤t} in the latent space and produces a context latent representation c_t = g_ar(z_{≤t}).
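The two stages described above can be sketched with toy stand-ins. Nothing here is the paper's network: `g_enc` is a hand-written window summary (mean and range of each window) and `g_ar` is a fixed tanh recurrence, chosen only to show the data flow, the lower temporal resolution of z, and the fact that c_t depends only on z up to time t.

```python
import math

def g_enc(x, window=4, stride=4):
    """Toy encoder: maps each window of raw samples to a 2-d latent (mean, range).
    With stride > 1 the latent sequence is shorter than the input."""
    latents = []
    for start in range(0, len(x) - window + 1, stride):
        chunk = x[start:start + window]
        latents.append([sum(chunk) / window, max(chunk) - min(chunk)])
    return latents

def g_ar(z_seq, decay=0.5):
    """Toy autoregressive summarizer: c_t is computed from c_{t-1} and z_t,
    so it depends only on z_1..z_t."""
    contexts, c = [], [0.0] * len(z_seq[0])
    for z in z_seq:
        c = [math.tanh(decay * ci + zi) for ci, zi in zip(c, z)]
        contexts.append(c)
    return contexts

x = [math.sin(0.3 * t) for t in range(32)]   # raw observations x_t
z = g_enc(x)                                 # latent sequence z_t
c = g_ar(z)                                  # context sequence c_t
print(len(x), len(z), len(c))                # 32 8 8
```

In a real model both functions would be learned networks trained jointly through the contrastive loss; the sketch only fixes the shapes and the causal structure.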

3.Experiments

Paragraph 1 (how the experiments are designed and run)

We present benchmarks on four different application domains: speech, images, natural language and reinforcement learning.
We report benchmarks in four application domains: speech, images, natural language, and reinforcement learning.

For every domain we train CPC models and probe what the representations contain with either a linear classification task or qualitative evaluations, and in reinforcement learning we measure how the auxiliary CPC loss speeds up learning of the agent.
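For each domain we train CPC models and probe what the representations contain, using either a linear classification task or qualitative evaluation. In reinforcement learning we measure how much the auxiliary CPC loss speeds up the agent's learning.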
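A linear classification probe of the kind mentioned here can be sketched as follows. This is my own minimal version, not the paper's evaluation code: the representations are treated as frozen inputs, and only a logistic-regression layer (weights `w`, bias `b`) is trained on top. The six 2-d feature vectors and their labels are invented toy data.

```python
import math

def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Logistic regression on frozen features; only w and b are learned."""
    w, b = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for f, y in zip(features, labels):
            logit = sum(wi * fi for wi, fi in zip(w, f)) + b
            p = 1.0 / (1.0 + math.exp(-logit))
            g = p - y                      # gradient of the log loss w.r.t. the logit
            w = [wi - lr * g * fi for wi, fi in zip(w, f)]
            b -= lr * g
    return w, b

def accuracy(w, b, features, labels):
    """Fraction of examples on the correct side of the linear decision boundary."""
    preds = [1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0
             for f in features]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Invented stand-ins for representations produced by a frozen encoder
frozen_features = [[0.9, 0.1], [0.8, 0.3], [0.7, 0.2],
                   [0.2, 0.9], [0.1, 0.7], [0.3, 0.8]]
labels = [1, 1, 1, 0, 0, 0]

w, b = train_linear_probe(frozen_features, labels)
print(accuracy(w, b, frozen_features, labels))
```

Because the probe is linear and the encoder is never updated, high probe accuracy says the label is already linearly readable from the representation, which is the point of this kind of evaluation.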

3.1 Audio (experiments on audio)

3.1.1 Sentence-by-sentence walkthrough

For audio, we use a 100-hour subset of the publicly available LibriSpeech dataset [30].
For the audio experiments, we use a 100-hour subset of the publicly available LibriSpeech dataset [30].

Although the dataset does not provide labels other than the raw text, we obtained force-aligned phone sequences
