Tuesday, September 1, 2015

Recurrent Neural Nets

I've been looking for a way to up my haiku game for a while now, and I think I've found it. Recurrent neural networks (RNNs) are a novel (for me) neural network architecture that can be "'deep' in both space and time, in the sense that every piece of information passing either vertically or horizontally through the computation graph will be acted on by multiple successive weight matrices and nonlinearities" (Graves, 2014). Sorry for the long quote there, but I feel like it encapsulates the architecture of RNNs pretty well. Instead of simply passing input vectors up through a series of layers, like in a feedforward network, the activations at a hidden layer are computed from both the current input and that same layer's activations at the previous time step (via a separate recurrent weight matrix), and the result is then fed forward through the network in the usual manner. If you want to visualize this a little better, look at this page (which is also the source of my inspiration for this little project). A lot of articles, like the one I quoted above, talk about propagation in time and space. This is just a convenient way of talking about 'moving through' (another way of saying 'doing lots of matrix-vector products') a neural network. Normal multilayer perceptrons (MLPs) like the one that I made/borrowed only move in the spatial direction, which means that successive inputs have no effect on each other. RNNs are essentially feedforward networks whose copies across time steps are connected at some level, usually through the hidden layer. This means that the hidden layers have their own recurrent weight matrices. If you think that my understanding of RNNs or MLPs is incorrect, please let me know, as I'm still trying to learn and understand this stuff.

I often find that a little math helps to ground me, so I'll explain how inputs are turned into outputs in RNNs. I'm closely following the description in the article I cited above. I'm going to use the example of character sequences in text to further ground the math. Imagine that I want to use a relatively large body of text, say all of F. Scott Fitzgerald's work, as my training data. A big question that I've been asking myself is 'how do I quantify something like a word?' My first response is to make a list of all the unique words in the text, and then identify a specific word by an index. If 'tofurky' is the 232nd word in my list of unique words, then 'tofurky' would be represented as the number 232. Or, if I wanted to represent it as a vector, it would be a vector of length $L_w$, where $L_w$ is the number of unique words in the text, whose entries are all zero, except for the 232nd entry, which would be a one. Now I say 'hold up, that could be a helluva lot of zeros.' If I have 10,000 unique words in a text, that means that the input vector to my neural network will be a super big vector that's mostly zeros. So why not just use individual characters? The number of unique characters is bound to be less than the number of unique words. It seems crazy to think that we would be able to generate anything meaningful by assembling characters, but RNNs are capable of doing this. Anyways, characters are now represented by vectors of length $L_c$, where $L_c$ is the number of unique characters in a text.
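To make this concrete, here's a minimal sketch in Python/numpy of the character-to-vector scheme I just described (the text, helper names, and everything else here are just my own toy choices for illustration):

```python
import numpy as np

text = "an old silent pond / a frog jumps into the pond / splash! silence again"

# build the character 'vocabulary': every unique character gets an index
chars = sorted(set(text))
char_to_ix = {ch: i for i, ch in enumerate(chars)}
L_c = len(chars)  # length of each character vector

def one_hot(ch):
    """Return the length-L_c vector that is all zeros except at ch's index."""
    v = np.zeros(L_c)
    v[char_to_ix[ch]] = 1.0
    return v

print(L_c, one_hot('a'))
```

The same idea works for whole words; the vectors just get a lot longer, which is exactly the problem I was complaining about above.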

Conceptually, the task of training the network becomes relatively straightforward. The idea is to get the network to correctly predict the next character in a sequence. Following Andrej Karpathy's example, if I have the sequence "dean", I would want to be able to feed the network "dea" and have it spit out "ean" (in vectorized form, of course). If you've been wondering why we would convert our nice character indices into vectors in the first place, this is why. The output of the network has to be a vector, so that we can have a probability distribution across all available characters. In the example I gave above, in order to tweak the weights to get the correct output, the input 'd' needs to produce a distribution over possible next characters, not just a single index. I only emphasize this point because I spent some time being confused by it.
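Building on the little snippet above (and reusing its made-up `one_hot` and `char_to_ix` helpers), the training pairs for "dean" might be set up roughly like this:

```python
seq = "dean"
# the input at step t is the character at position t...
inputs = [one_hot(ch) for ch in seq[:-1]]       # vectors for 'd', 'e', 'a'
# ...and the target is the *next* character, as an index into the vocabulary
targets = [char_to_ix[ch] for ch in seq[1:]]    # indices of 'e', 'a', 'n'
```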

As I indicated above, RNNs are more architecturally complicated than standard NNs. In addition to the usual feedforward weights, each hidden layer has a recurrent weight matrix that carries its activations forward in time, so earlier parts of the sequence influence the calculation for later ones; the same weight matrices are reused at every time step. Let's dive in, following Graves. Say I have an $L_c\times T$ matrix $\mathbf{X}$ whose columns correspond to character vectors. $T$ is the length of the input sequence, i.e., the number of time steps fed into the network. To compute the activation $h_t$ for the input vector $x_t$ I apply the following equation. Boldface letters indicate matrices.
$$
h_t  = \mathcal{H}_h(\mathbf{W_{xh}}\cdot x_t + \mathbf{W_{hh}}\cdot h_{t-1} + b_h)
$$
Here, the subscripts on the matrices indicate the direction we're moving in the layers. The $\mathbf{W_{xh}}$ matrix, for example, is the matrix that relates inputs to hidden activations, and $\mathbf{W_{hh}}$ is the matrix that acts on the activations from the previous time step. In a hot sec we'll use the $\mathbf{W_{hy}}$ matrix, which takes the activation $h_t$ and produces our output vector $y_t$. Note that $b_h$ is a bias vector belonging to the hidden layer. $\mathcal{H}_h$ is a function, like $\tanh$ or the sigmoid, used to "squash" vectors. Computing the output proceeds as follows.
$$
\hat{y}_t = b_y + \mathbf{W_{hy}} \cdot h_t \\
y_t = \mathcal{H}_y(\hat{y}_t)
$$
Here $\mathcal{H}_y$ is some function, like the softmax, that's used at the final layer to produce the output. Depending on the nature of the phenomenon you're modeling, $\mathcal{H}_y$ could be different. Something to note: I've only done the math for an RNN with a single hidden layer. Extending this to multiple layers is not particularly hard. If you want to see it, let me know.
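To check my own understanding, here's a bare-bones numpy sketch of that forward pass for a single hidden layer, with randomly initialized weights, $\tanh$ for $\mathcal{H}_h$, and softmax for $\mathcal{H}_y$. The sizes and names are just placeholders of mine, not anything from Graves.

```python
import numpy as np

L_c, H, T = 30, 50, 10            # vocabulary size, hidden units, time steps
rng = np.random.RandomState(0)

# the weight matrices and biases from the equations above
W_xh = rng.randn(H, L_c) * 0.01   # input  -> hidden
W_hh = rng.randn(H, H) * 0.01     # hidden -> hidden (the recurrent part)
W_hy = rng.randn(L_c, H) * 0.01   # hidden -> output
b_h = np.zeros(H)
b_y = np.zeros(L_c)

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

# X holds one one-hot character vector per column, as in the text
X = np.zeros((L_c, T))
X[rng.randint(L_c, size=T), np.arange(T)] = 1.0

h = np.zeros(H)                   # h_0: no history before the first character
outputs = []
for t in range(T):
    x_t = X[:, t]
    h = np.tanh(W_xh.dot(x_t) + W_hh.dot(h) + b_h)   # hidden activation h_t
    y_t = softmax(W_hy.dot(h) + b_y)                 # distribution over next chars
    outputs.append(y_t)
```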

Training the RNN involves stochastic gradient descent. In Theano (and this is the reason why I'm struggling through learning Theano) computing the gradients is a one-line command. RNNs are notoriously difficult to train, because the gradients tend to blow up (or vanish). One way that researchers have mitigated this problem is by using something called Long Short-Term Memory (LSTM) units at the hidden layer. In practice, this involves throwing in a series of new terms when calculating the activation vector at the hidden layer of the net. I'm not entirely sure how it helps solve the gradient problem, however. Once I get a basic version of my simple RNN up and running, I plan on implementing this.
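For what it's worth, here's roughly what that "one line" looks like in Theano, shown on a toy cost rather than the full RNN (the variable names and the stand-in cost are just mine for illustration):

```python
import numpy as np
import theano
import theano.tensor as T

# a stand-in cost: squared norm of W.x, just to show the gradient/update pattern
x = T.vector('x')
W = theano.shared(0.01 * np.random.randn(5, 5), name='W')
cost = T.sum(T.dot(W, x) ** 2)

grad_W = T.grad(cost, W)                      # the one-line gradient
lr = 0.01                                     # learning rate for plain SGD
train_step = theano.function([x], cost,
                             updates=[(W, W - lr * grad_W)])

train_step(np.ones(5))                        # one gradient descent step
```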

Creating novel sequences of characters simply involves sampling a character from the output distribution at one time step and feeding it back in as the input for the next step. This is the functionality that I want to use for my haikus.
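Continuing the numpy sketch from above (same made-up weights and the same `softmax` helper), sampling might look something like this:

```python
def sample(seed_ix, n_chars):
    """Feed the network's own (sampled) output back in as the next input."""
    h = np.zeros(H)
    x = np.zeros(L_c)
    x[seed_ix] = 1.0
    sampled = []
    for _ in range(n_chars):
        h = np.tanh(W_xh.dot(x) + W_hh.dot(h) + b_h)
        p = softmax(W_hy.dot(h) + b_y)        # distribution over next characters
        ix = np.random.choice(L_c, p=p)       # sample the next character index
        sampled.append(ix)
        x = np.zeros(L_c)                     # the sampled character becomes
        x[ix] = 1.0                           # the next input
    return sampled
```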

