Lecture 4 - Exercises in the tutorials
For the 24 people involved, the local encoding is created using a sparse 24-dimensional vector with all components zero, except one. E.g.
Colin $\equiv (1,0,0,0,0,\ldots,0)$, Charlotte $\equiv (0,0,1,0,0,\ldots,0)$, Victoria $\equiv (0,0,0,0,1,\ldots,0)$,
and so on.
Why don't we use a more succinct encoding like the ones computers use for representing numbers in binary?
Colin $\equiv (0,0,0,0,1)$, Charlotte $\equiv (0,0,0,1,1)$, Victoria $\equiv (0,0,1,0,1)$
etc., even though this encoding uses 5-dimensional vectors as opposed to 24-dimensional ones.
Check all that apply.
It's always better to have more input dimensions.
Un-selected is correct: The 24-d encoding makes each subset of persons linearly separable from every other disjoint subset, while the 5-d one does not.
Correct: Considering the way this encoding is used, the 24-d encoding asserts no a-priori knowledge about the persons, while the 5-d one does.
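To make the contrast concrete, here is a minimal NumPy sketch of the two encodings (added for these notes, not part of the quiz; the 1-based indices Colin = 1, Charlotte = 3, Victoria = 5 are inferred from the examples above):

```python
import numpy as np

def one_hot(index, num_people=24):
    """Local (one-hot) encoding: all zeros except a single 1 at the person's 1-based position."""
    v = np.zeros(num_people)
    v[index - 1] = 1.0
    return v

def binary_code(index, num_bits=5):
    """'Succinct' encoding: the person's index written in binary, most significant bit first."""
    return np.array([(index >> b) & 1 for b in reversed(range(num_bits))], dtype=float)

# Indices assumed from the examples: Colin = 1, Charlotte = 3, Victoria = 5.
print(one_hot(1)[:6])      # [1. 0. 0. 0. 0. 0.]  (remaining 18 components are also 0)
print(binary_code(1))      # [0. 0. 0. 0. 1.]
print(binary_code(3))      # [0. 0. 0. 1. 1.]
print(binary_code(5))      # [0. 0. 1. 0. 1.]
```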
In what ways is the task of predicting 'B' given 'A R' different from predicting a class label given the pixels of an image? Check all that apply.
'A' and 'R' are symbols rather than dense vectors of real numbers.
Correct: 'B' given 'A R' involves predicting a set of targets from a set of inputs.
This option is incorrect: In the case of 'A R', the input dimension is the number of possible values for A plus the number of possible values for R, whereas for images, the input dimension is exponentially smaller than the number of possible values.
This should be selected: The ordering of the elements of A (i.e., the fact that 'John' might correspond to input index 1 and 'Mary' to input index 2) provides no additional information, while the spatial ordering of the pixels does provide information.
For $E = \frac{1}{2}(y-t)^2$, where $y = \sigma(z) = \frac{1}{1+\exp(-z)}$, the derivatives tend to "plateau out" when $y$ is close to 0 or 1.
Which of the following statements are true?
$\frac{dE}{dz} = (y-t) \cdot y \cdot (1-y)$
Correct: The first option can be seen to be true just by taking derivatives; similarly, the third option can be trivially shown to be wrong. The second option is subtle, but in general this is not a good way to fix the problem, since it will amplify the gradients for training cases where $y$ is not close to 0 or 1. The cost function used in the last option is called cross-entropy, and it has a nice-looking derivative that doesn't suffer from the plateau problem. Don't worry if it is not immediately obvious how we arrived at it.
A good way to fix the problem is by having a large global learning rate.
Un-selected is correct: $\frac{dE}{dz} = (y-t) \cdot y$
Un-selected is correct: Using the loss function $E = -t\log(y) - (1-t)\log(1-y)$ will fix the problem, because then $\frac{dE}{dz} = y - t$.
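As an added illustration of the plateau (not part of the quiz), the sketch below compares $\frac{dE}{dz}$ under the squared-error loss with the cross-entropy gradient $y - t$ for a target $t = 1$; the particular $z$ values are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

t = 1.0                       # arbitrary target for this illustration
for z in [-6.0, 0.0, 6.0]:    # confidently wrong, uncertain, and confidently right cases
    y = sigmoid(z)
    squared_error_grad = (y - t) * y * (1 - y)   # dE/dz for E = 0.5*(y-t)^2
    cross_entropy_grad = y - t                   # dE/dz for E = -t*log(y) - (1-t)*log(1-y)
    print(f"z={z:+.1f}  y={y:.4f}  "
          f"squared-error dE/dz={squared_error_grad:+.6f}  "
          f"cross-entropy dE/dz={cross_entropy_grad:+.6f}")

# At z = -6 the prediction is badly wrong (y ~ 0.0025), yet the squared-error
# gradient is tiny (~ -0.0025) -- the plateau -- while the cross-entropy
# gradient stays large (~ -0.9975).
```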
If $\mathbf{z} = (z_1, z_2, \ldots, z_k)$ is the input to a k-way softmax unit, the output distribution is $\mathbf{y} = (y_1, y_2, \ldots, y_k)$, where
$y_i = \dfrac{\exp(z_i)}{\sum_j \exp(z_j)}$
Which of the following statements are true?
The output distribution would still be the same if the input vector was $c\mathbf{z}$ for any positive constant $c$.
This option is incorrect: The output distribution would still be the same if the input vector was $c + \mathbf{z}$ for any positive constant $c$.
Correct: Regarding the first two options:
Let's say we have two $z$'s: $z_1 = 2$, $z_2 = -2$. Now let's take a softmax over them: $\frac{\exp(z_1)}{\exp(z_1) + \exp(z_2)} = \frac{\exp(2)}{\exp(2)+\exp(-2)}$. If we add some positive constant $c$ to each $z_i$, then this becomes:
$$\frac{\exp(2+c)}{\exp(2+c) + \exp(-2+c)} = \frac{\exp(2)\exp(c)}{(\exp(2)+\exp(-2))\exp(c)} = \frac{\exp(2)}{\exp(2)+\exp(-2)}.$$
Multiplying each $z_i$ by $c$ gives:
$$\frac{\exp(2c)}{\exp(2c) + \exp(-2c)} = \frac{\exp(2)^c}{\exp(2)^c + \exp(-2)^c} \neq \frac{\exp(2)}{\exp(2)+\exp(-2)}.$$
Any probability distribution $P$ over discrete states ($P(x) > 0 \ \forall x$) can be represented as the output of a softmax unit for some inputs.
Each output of a softmax unit always lies in $(0,1)$.
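A small NumPy check of the shift-versus-scale behaviour discussed above (added here as an illustration; the vector $\mathbf{z}$ and the constant $c$ are arbitrary):

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())          # subtracting the max is for numerical stability; output unchanged
    return e / e.sum()

z = np.array([2.0, -2.0, 0.5])       # arbitrary inputs for the illustration
c = 3.0                              # arbitrary positive constant

print(softmax(z))                    # baseline distribution
print(softmax(z + c))                # identical to the baseline (shift invariance)
print(softmax(c * z))                # different: scaling sharpens or flattens the distribution
print(np.allclose(softmax(z), softmax(z + c)))   # True
print(np.allclose(softmax(z), softmax(c * z)))   # False
```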
Consider the following two networks with no bias weights. The network on the left takes 3 n-length word vectors corresponding to the previous 3 words, computes 3 d-length individual word-feature embeddings and then a k-length joint hidden layer, which it uses to predict the 4th word. The network on the right is comparatively simpler: it takes the previous 3 words and uses them directly to predict the 4th word.
If $n = 100{,}000$, $d = 1{,}000$ and $k = 10{,}000$, which network has more parameters?
The network on the left.
The network on the right.
Correct: The network on the left has $3nd + 3dk + nk$ parameters, which comes out to 1,330,000,000, while the network on the right has $3n \cdot n$ = 30,000,000,000 parameters, an order of magnitude more. One advantage of the neural representation is that we can get much more compact representations of our data while still making good predictions.
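For completeness, a quick arithmetic check of those counts (a sketch added to these notes; the layer shapes follow the description above):

```python
n, d, k = 100_000, 1_000, 10_000

# Left network: 3 one-hot words -> 3 separate d-dim embeddings -> k hidden units -> n-way output.
left = 3 * n * d + 3 * d * k + k * n
# Right network: 3 one-hot words (3n inputs) connected directly to the n-way output.
right = 3 * n * n

print(f"left  = {left:,}")    # 1,330,000,000
print(f"right = {right:,}")   # 30,000,000,000
```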