The Octet System: A Way to Think About AI

You see countless headlines about AI these days, littered with references to “deep learning”, “neural networks”, “bots”, “Q&A systems”, “virtual assistant”, and all manner of other proxy terms. What’s missing from this entire discussion is a way to gauge what each system is really capable of.

In the spirit of the Kardashev Scale, I’ve put together my own ranking system for AIs, which we’ll be using at Machine Colony.

(Note: I’ll try to provide as much background and example information as I can without this post reading like a cog-sci textbook. For those savvy in AI these examples will no doubt seem pedestrian, however I do try to illustrate concepts as much as possible, in a perhaps ill-fated effort to refrain from being too esoteric.)

Introducing the Octet System

I’m not much for fancy names, but in this case it was fitting: list the qualitative capabilities of an information system, and break it down into eight distinct ranks, or “classes”. They are as follows:

Class Null

The zeroth class is something which does not qualify as an intelligent system whatsoever. While this can cover any manner of programs – be they in software or manifested in processes emerging from hardware – I choose to focus on software for this example.

Programs that fall into class null have the following characteristics:

  • They are only able to follow explicit, predetermined / deterministic logic.
  • They follow simple rules – “if this then that” – with no capacity to ever learn anything.
  • They have no capacity to make nuanced decisions, i.e. based on probability and/or data.
  • They have no internal model of the world (this is related to learning).
  • They do not have their own agency.

This is by far the broadest class, as it covers the vast majority of our software systems today. Most of the world’s software is programmed for a specific task and does not really need to be a learning, decision-making system.

Class I

Programs of this class have the following characteristics that are different from Class Null:

  • They have the ability to make rudimentary decisions based on data, based on some trained model.
  • They have the ability to learn from the outcomes of their decisions, and thus to update their core model.
  • As such, their behavior may vary over time, as the data changes and their model changes.
  • They are trained for a small number of narrow tasks, and do not have the capability to go outside those tasks.

This covers things like fraud detection agents, decent spam detection, basic crawling bots (assuming they’re at least using decision trees or something similar). The decision could be a classification action – marking something as spam, for instance – or it could be deciding how well a website ranks in the universe of websites.

Class II

Classes I and II are the most similar on this scale, because the distinction is subtle: a Class II program has all of the capabilities of a Class I, but may have more than one core model and more than one domain of action. For instance, the new Google Translate app has natural language and vision capabilities, with different models for each. These models are linked and ‘cooperate’ to translate the words in the field of view of your smartphone’s camera.

Class Is, by contrast, only have one area that they’re focused on, and make decisions only based on the model relevant to that domain.

Class III

Programs of Class III have a two main distinctions from Class IIs:

  • They have a basic memory mechanism, and the ability to learn from their history in those memories. This is more advanced than simply referring to data; these programs are actually building up heuristics from their own behavioral patterns.
  • They persist some form of internal model of the world. This assists in creation of memories and new heuristics in their repertoire.

Class IV

This class starts to loosely resemble the intelligence level of insects. Class IVs not only have some kind of internal model of the world, but they gain an ability that was essential to the evolution of all complex life on earth: collaboration.

Thus, their characteristics are:

  • They have the ability to collaborate with other agents/programs. That is, they have mechanisms with which to become aware of other agents, and a medium through which to communicate signals. (Think of ants and bees leaving chemical traces, for instance.)
  • Like Class IIIs, they have an internal model of the world. However, a Class IV’s model is more closely linked to its goal structure, and not merely ad hoc / bound in one model. Its internal model may be distributed across several subsystems / mathematical models; representations of complex phenomena or experiences are encoded across various components in its cognitive architecture (vision systems, memory components, tactile systems, etc).
  • They have the ability to perform rudimentary planning, driven by fairly rigid heuristics but with a little flexibility for learning.
  • They have the ability to form basic concepts, schema, and prototypes.

While it is not a prerequisite that Class IVs have multiple distinct sensory modalities – optic, auditory, tactile, olfactory systems – that serves as a good example of the complexity level these programs start to achieve. In an AI setting, a program could have hundreds of different types of inputs, each with their own data type and respective subsystem for processing the input. The key distinction is that in Class IV programs, these subsystems have a high degree of connectivity, and thus generate more complex behavior.

Many robotics software systems could also be placed in Class IV.

Class V

The capabilities of Class Vs begin to resemble more complex animal behavior, such as rats (but not as intelligent as many apes). The primary characteristics of note are:

  • They have the ability to reflect on their own ‘thoughts’. In an AI program setting, this would mean it has the ability to optimize its own metaheuristics. In rats, for example, this is manifested as a basic form of metacognition.
  • They have the ability to perform complex planning, especially in which they are able to simulate the world and themselves in it. Which leads to:
  • They have the ability, to some degree, to simulate the world in their minds. That is, they can perturb their internal model of the world without actually taking action, and play out the results of hypothetical actions. They can imagine scenarios based on their knowledge of the world, which is intimately related to their memories (recall the memory capability from Class III).
  • Related to their planning and internal simulation capabilities, they have the ability to set their own goals and take steps to achieve them. For instance, a rat may see two different pieces of food, decide that it likes the looks of one of them better than the other, set its goal to acquire the better-looking morsel, and subsequently plan a path to get it. The planning part relies on actions it knows it can do – how fast it can or run, how far can it jump – and the terrain ahead of it, as well as memories of how it may have conquered that type of terrain before. Thus goal-setting and planning rely heavily on memory and the internal model.
  • They have a rudimentary awareness of their own agency in the environment. That is, when they are planning, they treat themselves as a factor in the environment they are simulating.

By this time, you have a program which is able to reflect, plan, collaborate with other agents, set goals, learn new behaviors and strategies for achieving its goals, and simulate hypothetical scenarios.

Class VI

You’re a class VI. So am I. Almost every human being is a Class VI – ‘almost’ because, well, this designation is questionable when applied to some politicians.

Programs of this class will start to resemble human-level intelligence and capability, though not necessarily human-like in nature. While in humans a major difference is more complex emotions, this scale does not consider emotions directly.

Artificial intelligence has not yet reached this level, and there are varying predictions as to when it will. The good news is that the expert consensus is clear on the idea that it will happen, it’s just that no one knows exactly when.

Key components of Class VI agents/AIs are:

  • Full self-awareness. The agent is fully aware of itself, its history, where the environment ends and it begins, and so on. This is related to consciousness, though Class VIs need not necessarily be conscious in a strict sense.
  • They have the ability to plan in the extremely long term, thinking ahead in ways that more basic systems cannot. Specific timescales are relative to its natural domain: for a person, decades; for an AI program, perhaps, seconds or hours.
  • Class VIs are able to invent new behaviors, processes, and even create other ‘programs’. In the case of a human, this is obviously an inventor creating a new way of solving a problem, or a software developer programming AIs somewhere in Brooklyn…

Class VII

This is what might well be referred to as ‘superintelligence‘. While some AI experts are skeptical of whether or not this can be achieved, there does seem to be broad agreement that it is imminent. Nick Bostrom writes elegantly about the subject in his book of the same name.

While nobody knows exactly what this may look like, there are two major distinguishing factors which would almost certainly be present:

  • They have the ability to systematically control their own evolution.
  • They have the ability to recursively improve themselves, perhaps even at alarmingly minuscule timescales.

Their first ability is perhaps their most profound. While humans do in some sense control our own fate, we do not yet have fine-grained control over the evolution of our brains and hence our cognitive abilities (though CRISPR may soon change that). With an artificial superintelligence, many limitations are removed. They can arbitrarily copy-paste themselves, ad finitum, and perform risk-free simulations of their new versions. They also will be essentially immortal, so long as their hardware persists and has a supply of energy.

With respect to the second ability, one might imagine an ASI (artificial superintelligence) making multiple clones of itself, each clone independently applying a self-improving strategy, and then each one in turn performing a set of benchmark tests to determine which one improved the most from the original copy. Whichever agent performed the best would become the new master copy, while the others would be taken out of the running.

This is a supercharged evolutionary algorithm, essentially. The tests would be agreed upon in advance, and even perhaps written as a cryptographically secure contract (blockchain-based or otherwise) to prevent cheating. In doing so, the agent would keep improving up to hardware limits or some theoretical asymptote.

The kind of scenario above is not at all unlikely in the near future.



AI capabilities currently exist somewhere between the Class IV and Class V marks, but are quickly marching toward Class VI. DeepMind and Facebook are leading the way in this direction, though other notable players are making important contributions. Certainly the brand-new OpenAI will have some interesting insights as well.

My hope is that this type of classification system, and others like it, will help bring some structure to the conversation around fast-emerging AI. With deeper clarity in our common language, we can have more meaningful and productive conversations about how we wish for this technology to advance and how it ought to be used. We owe it to ourselves to have the linguistic tools to accurately describe our progress.

A World Inside the Mind

Short post today, but a few things occurred to me as I was reading the paper on Bayesian Program Learning:

  • This form of recursive program induction starts to look suspiciously like simulation – something we do in our minds all the time.
  • Simulation may be a better framing for concept formation than via the classification route.
  • Mapping the ‘inner world’ to the ‘outer world’ seems a more sensible approach to understanding what’s going on. If you look at the paper, you also see some thought-provoking examples of new concept generation, such as the single-wheel motorbike example (in the images). This is the most exciting point of all.

A final design?

Combine all the elements together, along with ideas from my last post, and you get something that:

  1. Simulates an internal version of the world
  2. Is able to synthesize concepts and simulate the results, or literally ‘imagine’ the results – much like we do
  3. Is able to learn concepts from few examples
  4. Has memories of events in its lifetime / runtime, and can reference those events to recall the specific context of what else was happening at that time. That is, memories have deep linkage to one another.
  5. Is able to act of its own volition, i.e. in the absence of external stimulus. It may choose to kick off imagination routines – ‘dreaming’, if you will – optimize its internal connections, or do some other maintenance work in its downtime. Again, similar to how our brains do it while we sleep.

This starts to look like a pretty solid recipe for a complete cognitive architecture. Every requirement has been covered in some way or another, though in different models and in different situations. To really put the pieces together into a robust architecture will require many years of work, but it is worth exploring multi-model cognitive approaches.

If it results in a useful AI, then I’m all in.


Deep Learning & Temporal Modeling

In most discussions I’ve seen of deep learning, and certainly most of the models demonstrated, there is no discussion of temporal sequence modeling. I was curious where the state of the art was for this task, and thought to compare with some of my own intuitions about sequence modeling.

As a first pass, I found a handful of papers that discuss using stacked Restricted Boltzmann Machines in various configurations to achieve temporal learning – they are “Robust Generation of Dynamical Patterns in Human Motion by a Deep Belief Nets“, “Temporal Convolution Machines for Sequence Learning“, and “Sequential Deep Belief Networks“. The last of these three lays out their approach nicely in one sentence: “An L-layer SDBN is formed by stacking multiple layers of SRBMs.” It is, in essence, the stacking game continued.

While the approaches described in the three papers above would seem to yield decent results, I often wonder about extendability and scalability. RBMs / DBNs have lovely analytical properties, and cleverly get around intractable subproblems with use of sampling schemes, but their topology is nonetheless locked in all setups I’ve seen. This leaves no room for simulated neurogenesis.

Why would simulated neurogenesis be important? Moving the synapse weights around in a fixed topology yields nice results, but there may situations where we would want the topology to grow as needed – namely, if the network did not have a representation for a given pattern, it could create one. This assumes that we are ok with being slightly inefficient in terms of ensuring that the representation space is adequately filled before we generate new neurons. The tradeoff is a greater amount of space in memory.

Lately I’ve been working on these types of networks, and it is most certainly difficult to do. First you have the basic foundation of any neural network – learning spatial patterns. That part is relatively easy: you can quickly build something to learns to model spatial patterns and produce labels for them.

The next step is creating the dynamics that model temporal sequences. For example, a sequence of words: if I say “four score and-“, many of you will instantly think “seven years ago”. This is a sequence. The spatial patterns are the specific combinations of letters to form words, and the temporal pattern is sequence of those words. We learn this kind of thing with relative ease, but for machines this is a huge task.

Unsurprisingly, it is in this step that things get complicated. The noted Deep Learning approach is to specify a given time depth. In one paper T=3, meaning it can model sequences 3 time steps deep – in our example, up to 2 words in advance. When the network receives “four score and”, if it knows the sequence in question then it is thinking of “seven years”. When it gets to “seven”, it is thinking “years ago”.

Note that in one of the papers above, they use a different subnetwork entirely to model temporal constraints. Again, while this is nice from an analytical point of view, it likely has little to do with the way the brain works. The brain is essentially a hundred billion cells who knows nothing but how to behave in response to electrical and chemical signals. Their emergent behavior gives rise to your consciousness. The brain, at least as far as anybody can tell, does not have a “temporal network”. Temporal information is learned as a natural consequence of the dynamics of the vast network of cells. Somehow, we need to figure out a way to model temporal information inline with the spatial information, and make it all fit together nicely.

That said, the approach I’ve come to borrows a bit from Deep Learning, a little bit from Jeff Hawkins’ and Dileep George’s work on spatiotemporal modeling, and a little from complex systems in general. I’ve been searching for the core information overlap between different approaches, and have found some commonalities. From that, I’ve come to some notes of practice.

First, it almost always becomes necessary to sacrifice analytical elegance for emergent behavior. In many ways, emergent behavior is innately chaotic, and therefore difficult to model mathematically. Stochastic methods yield some insight, but there are higher-level states of emergence that may not be obvious from analysis of a single equation or definition of a system. In this case, it is simpler in practice to use heuristic methods to find emergent complexity, and attempt to characterize it as it is discovered, rather than attempt to discover all possible states from the definition of the system. A characteristic of chaotic systems is that you have to actually advance/evolve them in order to derive their behavior, as opposed to knowing in advance via analytical methods.

Second, and further emphasizing the use of heuristic methods, tuning the model with genetic algorithms tends to yield better results than attempting to solve for optimal parameters explicitly. Perhaps this is merely a difference in style, and if I’d spent more time studying complex systems I might know of better ways to do this, but at my current level of understanding a simple genetic algorithm that swarms over parameter configurations yields better results than attempting to understand what the “perfect network” might look like. There is a philosophical difference in that with this method you’re using the machine to understand the machine, in some sense surrendering control and rendering of insights to the machine itself. Genetic algorithms may be able to find subtleties that my slightly-more-evolved-ape-brain will miss or otherwise fail to conceptualize merely from the definition of the system and associated intuitions.

Deep Learning is advancing quickly, and while it offers some interesting food for thought when attempting to solve temporal modeling problems, I am not yet sold on the notion that it is the final answer to this more general problem. Choices of representation and methods of optimization may be trivial in some cases, but when they differ greatly from the norm they may yield some advantage. Not only that, but sticking closer to the way the brain represents information has done nothing but improve the performance and capabilities of the resulting systems. The path I’m on may all come to nothing, or it may shine light on some new ways to think about temporal modeling problems.

A closing note: A whitepaper describing my work is underway, so you can stare in awe at some formidable-looking equations and cryptic diagrams. I’ve been several years down this path now, and it’s high time to encapsulate all of the work done in a comprehensive overview.