Concepts as Programs

The awesomeness that is Bayesian Program Learning

In December 2015, this nifty paper showed up on the scene, titled “Human-level concept learning through probabilistic program induction”. I read it, then read it again, then read it 20 more times. (Not even kidding here.) It’s difficult to convey why this is so important, but I’ll try my best to explain.

First, the best way for you to get introduced to this is to check out this short video:

You can grab the accompanying Matlab code here.

They call their approach Bayesian Program Learning, or BPL. It’s a huge step forward, and here are some reasons why:

  1. The current machine learning approach that’s all the rage, Deep Learning, still requires hundreds or even thousands of examples in most cases. BPL, in this context, requires as few as one example.
  2. The model is “learning to learn”: “Learning proceeds by constructing programs that best explain the observations under a Bayesian criterion, and the model “learns to learn” (23, 24) by developing hierarchical priors that allow previous experience with related concepts to ease learning of new concepts (25, 26). These priors represent a learned inductive bias (27) that abstracts the key regularities and dimensions of variation holding across both types of concepts and across instances (or tokens) of a concept in a given domain.” In short, its past experience helps inform its learning process. While some Deep Learning models have demonstrated this, they do so in a far more limited way.
  3. Explicitly representing concepts as compositionally built programs starts to look a lot like the dreams of old functional AI realized, this time through probabilistic methods.
  4. The approach of learning nested programs is highly generalizable, and could even be made hierarchical a la Deep Belief Nets.
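The compositional idea in points 2 and 3 can be sketched in a few lines: a concept is a small program built from reusable primitives, and each example (a “token”) is one noisy execution of that program. The sketch below is an illustrative toy of my own, not the paper’s actual model — the primitive names, the crude prior over part counts, and the jitter are all placeholders:

```python
import random

# Toy sketch of "concept as program": a concept is a composition of
# reusable primitives; a token is one noisy run of that composition.
PRIMITIVES = {
    "stroke_a": [(0.0, 0.0), (1.0, 1.0)],
    "stroke_b": [(1.0, 0.0), (0.0, 1.0)],
    "dot":      [(0.5, 0.5)],
}

def sample_concept(max_parts=3):
    """Sample a concept 'type': which primitives, in what order."""
    n = random.randint(1, max_parts)  # stand-in for a learned prior over part count
    return [random.choice(list(PRIMITIVES)) for _ in range(n)]

def sample_token(concept, jitter=0.05):
    """Sample a 'token': one noisy rendering of the concept."""
    return [(x + random.uniform(-jitter, jitter),
             y + random.uniform(-jitter, jitter))
            for part in concept
            for (x, y) in PRIMITIVES[part]]

concept = sample_concept()
token_1 = sample_token(concept)   # same underlying program,
token_2 = sample_token(concept)   # different surface details
```

In the real BPL model the prior over programs is learned from experience rather than hard-coded, and inference runs in the other direction — searching for the program that best explains an observed image.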


What immediately stood out to me about BPL is that the “primitives” could be anything, and the images could likewise be anything. In the BPL paper, the primitives are pen strokes, and the images are handwritten characters. There is no reason I’m aware of that the primitives couldn’t be partials of some other function, or that the images couldn’t be inputs to another system – possibly even another BPL node. Instead of the model learning to compose handwritten characters from pen strokes, it could be learning to compose solutions to puzzles based on available moves in context.
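To make that concrete, here’s a hypothetical sketch (my own illustration, not anything from the paper or its code) where the “primitives” are partially applied move functions and the “image” is simply the puzzle state that the composed program produces:

```python
from functools import partial

# Hypothetical sketch: the same compose-from-primitives idea, but the
# primitives are moves on a puzzle state rather than pen strokes.
def shift(state, delta):
    """Add `delta` to every cell of the state."""
    return tuple(x + delta for x in state)

def swap(state, i, j):
    """Swap two cells of the state."""
    s = list(state)
    s[i], s[j] = s[j], s[i]
    return tuple(s)

# Primitives as partials of other functions, as suggested above.
MOVES = [partial(shift, delta=1), partial(swap, i=0, j=1)]

def run_program(program, start):
    """The 'image' is whatever state the composed program yields."""
    state = start
    for move in program:
        state = move(state)
    return state

run_program([MOVES[0], MOVES[1]], (0, 5, 9))  # → (6, 1, 10)
```

The search and scoring machinery of BPL would sit on top of this: proposing sequences of moves and scoring them by how well the resulting state explains the observation.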

In order to generalize BPL beyond handwriting, we’ll need to drop some of the domain-specific constraints:

  • In the algorithm, they sample noise/variance in the motor programs. This is a natural thing to do in order to mimic human-like variability in the output of the learned motor program, but may not be valid in other contexts. This step in the algorithm can be made optional.
  • The spatial trajectories are all smooth and, generally, connected. In a more abstract domain, such as learning a sequence of actions to take in order to solve a puzzle, the resulting ‘image’ will not necessarily be smooth at all, but could end up looking more like a QR code. There is no reason that I’m aware of to have a requirement for smooth trajectories in the general case; this should be optional. (If the smoothness requirement stems from the random-walk algorithm used, it may be possible to swap in a different process to generate the required samples.)
  • Not every application will require continuous distributions. There may be cases where discrete distributions are more apt. It is worth testing this in the future.
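All three relaxations above can be expressed as little more than flags on the rendering step. This is a hypothetical sketch of what I mean, with invented names throughout — motor noise becomes optional, and a discrete mode produces the binary, QR-code-like cells mentioned above:

```python
import random

def render(program, *, motor_noise=True, discrete=False, sigma=0.1):
    """Render a program's parts with the domain constraints made optional."""
    cells = []
    for part in program:
        x = part
        if motor_noise and not discrete:
            x += random.gauss(0.0, sigma)   # human-like variability, now optional
        if discrete:
            x = round(x) % 2                # binary, QR-code-like cell
        cells.append(x)
    return cells

render([0.2, 1.7, 0.9], motor_noise=False, discrete=True)  # → [0, 0, 1]
```

Keeping these as options rather than hard-wired behavior means the handwriting case is still recoverable as one particular configuration.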

To start exploring more general forms of the BPL algorithm and try them in real-world, production use cases, I’ve started a Python implementation. My hope is to further explore BPL and build an ecosystem around it, to encourage developers to make their own variants and test them on specific use cases. As the project matures and I do my own tests, I’ll blog about them here.

Representing concepts as programs is not a completely new idea, but BPL pulls it off unlike anything else prior. The algorithm offers a solid base to build from, and a new way to start thinking about general learning capabilities in the machine.