In most discussions I’ve seen of deep learning, and certainly most of the models demonstrated, there is no discussion of temporal sequence modeling. I was curious where the state of the art was for this task, and thought to compare with some of my own intuitions about sequence modeling.
As a first pass, I found a handful of papers that discuss using stacked Restricted Boltzmann Machines in various configurations to achieve temporal learning – they are “Robust Generation of Dynamical Patterns in Human Motion by a Deep Belief Nets”, “Temporal Convolution Machines for Sequence Learning”, and “Sequential Deep Belief Networks”. The last of these three lays out their approach nicely in one sentence: “An L-layer SDBN is formed by stacking multiple layers of SRBMs.” It is, in essence, the stacking game continued.
While the approaches described in the three papers above would seem to yield decent results, I often wonder about extendability and scalability. RBMs / DBNs have lovely analytical properties, and cleverly get around intractable subproblems with use of sampling schemes, but their topology is nonetheless locked in all setups I’ve seen. This leaves no room for simulated neurogenesis.
Why would simulated neurogenesis be important? Moving the synapse weights around in a fixed topology yields nice results, but there may be situations where we would want the topology to grow as needed – namely, if the network did not have a representation for a given pattern, it could create one. This assumes that we are OK with some inefficiency: new neurons may be generated before the existing representation space is adequately filled. The tradeoff is a greater amount of space in memory.
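To make the idea of growing a topology on demand concrete, here is a minimal sketch. It is not the network I am working on, and not an RBM: it is a toy prototype layer that adds a unit whenever no existing unit is close enough to the input. The class name `GrowingPrototypeLayer`, the Euclidean distance measure, and the `match_threshold` and `learning_rate` parameters are all assumptions made for illustration.

```python
import math


class GrowingPrototypeLayer:
    """Hypothetical layer that adds a unit when no existing unit matches an input."""

    def __init__(self, match_threshold=0.5, learning_rate=0.2):
        self.match_threshold = match_threshold  # assumed cutoff for "has a representation"
        self.learning_rate = learning_rate
        self.prototypes = []  # one weight vector per unit

    def observe(self, pattern):
        """Return the index of the unit representing `pattern`, growing if needed."""
        best_index, best_distance = None, math.inf
        for index, prototype in enumerate(self.prototypes):
            distance = math.dist(prototype, pattern)
            if distance < best_distance:
                best_index, best_distance = index, distance

        if best_index is None or best_distance > self.match_threshold:
            # No adequate representation exists, so grow a new unit.
            self.prototypes.append(list(pattern))
            return len(self.prototypes) - 1

        # Otherwise nudge the winning unit toward the input (ordinary weight update).
        winner = self.prototypes[best_index]
        for i, value in enumerate(pattern):
            winner[i] += self.learning_rate * (value - winner[i])
        return best_index


layer = GrowingPrototypeLayer(match_threshold=0.5)
first = layer.observe([0.0, 0.0])   # nothing stored yet, so a unit is created
second = layer.observe([0.1, 0.0])  # close to the first unit, so it is reused
third = layer.observe([1.0, 1.0])   # far from everything, so a second unit is created
```

The memory tradeoff shows up directly: `layer.prototypes` only ever gets longer, and a lower `match_threshold` makes it grow faster.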
Lately I’ve been working on these types of networks, and it is most certainly difficult to do. First you have the basic foundation of any neural network – learning spatial patterns. That part is relatively easy: you can quickly build something that learns to model spatial patterns and produce labels for them.
The next step is creating the dynamics that model temporal sequences. For example, a sequence of words: if I say “four score and-”, many of you will instantly think “seven years ago”. This is a sequence. The spatial patterns are the specific combinations of letters to form words, and the temporal pattern is the sequence of those words. We learn this kind of thing with relative ease, but for machines this is a huge task.
Unsurprisingly, it is in this step that things get complicated. The noted Deep Learning approach is to specify a given time depth. In one paper T=3, meaning it can model sequences 3 time steps deep – in our example, up to 2 words in advance. When the network receives “four score and”, if it knows the sequence in question then it is thinking of “seven years”. When it gets to “seven”, it is thinking “years ago”.
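A fixed time depth can be pictured with a much simpler stand-in than a stacked RBM. The sketch below is a plain lookup table, not the method from any of the papers: it remembers which words followed a given context and predicts a fixed number of steps ahead. The class name `FixedDepthPredictor` and the split into `context_size` and `horizon` are my own assumptions; the papers define their time depth differently.

```python
from collections import Counter, defaultdict


class FixedDepthPredictor:
    """Toy predictor with a hard-coded look-ahead, for illustration only."""

    def __init__(self, context_size=3, horizon=2):
        self.context_size = context_size  # how many recent words are used as input
        self.horizon = horizon            # how many words ahead are predicted
        self.table = defaultdict(Counter)

    def train(self, tokens):
        """Count which `horizon`-word continuation follows each context window."""
        span = self.context_size + self.horizon
        for start in range(len(tokens) - span + 1):
            context = tuple(tokens[start:start + self.context_size])
            future = tuple(tokens[start + self.context_size:start + span])
            self.table[context][future] += 1

    def predict(self, recent_tokens):
        """Return the most frequent continuation, or None for an unseen context."""
        context = tuple(recent_tokens[-self.context_size:])
        if context not in self.table:
            return None
        return list(self.table[context].most_common(1)[0][0])


model = FixedDepthPredictor(context_size=3, horizon=2)
model.train("four score and seven years ago".split())

after_and = model.predict(["four", "score", "and"])
after_seven = model.predict(["score", "and", "seven"])
unknown = model.predict(["to", "be", "or"])
```

Here `after_and` is `["seven", "years"]` and `after_seven` is `["years", "ago"]`, while `unknown` is `None`. The limitation is the same one described above: nothing beyond `horizon` steps can ever be anticipated, no matter how well the sequence is known.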
Note that in one of the papers above, they use a different subnetwork entirely to model temporal constraints. Again, while this is nice from an analytical point of view, it likely has little to do with the way the brain works. The brain is essentially a hundred billion cells that know nothing but how to behave in response to electrical and chemical signals. Their emergent behavior gives rise to your consciousness. The brain, at least as far as anybody can tell, does not have a “temporal network”. Temporal information is learned as a natural consequence of the dynamics of the vast network of cells. Somehow, we need to figure out a way to model temporal information inline with the spatial information, and make it all fit together nicely.
That said, the approach I’ve come to borrows a bit from Deep Learning, a little bit from Jeff Hawkins’ and Dileep George’s work on spatiotemporal modeling, and a little from complex systems in general. I’ve been searching for the core information overlap between different approaches, and have found some commonalities. From that, I’ve come to some notes of practice.
First, it almost always becomes necessary to sacrifice analytical elegance for emergent behavior. In many ways, emergent behavior is innately chaotic, and therefore difficult to model mathematically. Stochastic methods yield some insight, but there are higher-level states of emergence that may not be obvious from analysis of a single equation or definition of a system. In this case, it is simpler in practice to use heuristic methods to find emergent complexity, and attempt to characterize it as it is discovered, rather than attempt to discover all possible states from the definition of the system. A characteristic of chaotic systems is that you have to actually advance/evolve them in order to derive their behavior, as opposed to knowing in advance via analytical methods.
Second, and further emphasizing the use of heuristic methods, tuning the model with genetic algorithms tends to yield better results than attempting to solve for optimal parameters explicitly. Perhaps this is merely a difference in style, and if I’d spent more time studying complex systems I might know of better ways to do this, but at my current level of understanding a simple genetic algorithm that swarms over parameter configurations yields better results than attempting to understand what the “perfect network” might look like. There is a philosophical difference in that with this method you’re using the machine to understand the machine, in some sense surrendering control and rendering of insights to the machine itself. Genetic algorithms may be able to find subtleties that my slightly-more-evolved-ape-brain will miss or otherwise fail to conceptualize merely from the definition of the system and associated intuitions.
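For readers who have not seen one, a simple genetic algorithm over parameter configurations might look roughly like the sketch below. It is a generic version with elitism, uniform crossover, and Gaussian mutation, not my actual tuning code. The function name `evolve`, its parameters, and the toy fitness function (which pretends the best settings are a learning rate of 0.3 and a threshold of 0.7) are all made up for illustration; in practice the fitness would come from training and scoring a network.

```python
import random


def evolve(fitness, bounds, population_size=30, generations=40,
           mutation_scale=0.1, elite_fraction=0.2, seed=0):
    """Search for a parameter vector within `bounds` that maximizes `fitness`."""
    rng = random.Random(seed)

    def random_individual():
        return [rng.uniform(low, high) for low, high in bounds]

    population = [random_individual() for _ in range(population_size)]
    elite_count = max(2, int(population_size * elite_fraction))

    for _ in range(generations):
        # Keep the best configurations unchanged (elitism).
        population.sort(key=fitness, reverse=True)
        elite = population[:elite_count]

        children = []
        while len(elite) + len(children) < population_size:
            mother, father = rng.sample(elite, 2)
            child = []
            for (low, high), m_gene, f_gene in zip(bounds, mother, father):
                gene = rng.choice((m_gene, f_gene))                     # uniform crossover
                gene += rng.gauss(0.0, mutation_scale * (high - low))  # mutation
                child.append(max(low, min(high, gene)))                # stay in bounds
            children.append(child)
        population = elite + children

    return max(population, key=fitness)


def toy_fitness(params):
    """Stand-in score: higher is better, peaking at the hypothetical optimum."""
    learning_rate, threshold = params
    return -((learning_rate - 0.3) ** 2 + (threshold - 0.7) ** 2)


bounds = [(0.0, 1.0), (0.0, 1.0)]  # assumed ranges for the two parameters
best = evolve(toy_fitness, bounds)
```

The appeal is that `evolve` needs no gradient and no closed-form understanding of the system: it only needs a way to score a configuration, which fits the case where behavior has to be observed by running the system.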
Deep Learning is advancing quickly, and while it offers some interesting food for thought when attempting to solve temporal modeling problems, I am not yet sold on the notion that it is the final answer to this more general problem. Choices of representation and methods of optimization may be trivial in some cases, but when they differ greatly from the norm they may yield some advantage. Not only that, but sticking closer to the way the brain represents information has done nothing but improve the performance and capabilities of the resulting systems. The path I’m on may all come to nothing, or it may shine light on some new ways to think about temporal modeling problems.
A closing note: A whitepaper describing my work is underway, so you can stare in awe at some formidable-looking equations and cryptic diagrams. I’ve been several years down this path now, and it’s high time to encapsulate all of the work done in a comprehensive overview.