Error Risk


 

In most machine learning discussions I have with people, I find that the notion of error risk is new to them. Here’s the basic idea: you have a trained machine learning model that’s processing incoming data, and it naturally has an error rate. Let’s say the error rate is 5%, or, equivalently, the model is correct about 95% of the time. Error risk, then, is the set of possible negative consequences of incorrect predictions or decisions.

At one extreme, the error risk for a 747’s autopilot system is perilously high. For a model predicting user shopping behavior on an e-commerce site, the risk is rather low – maybe it recommends the wrong product once or twice, but nobody gets hurt.

The depth of the model’s integration and the speed at which it makes decisions are both correlated with the amount of error risk. If the program in question is running some analytics off to the side, merely supplying supplementary information to a human decision-maker, the risk is almost zero. However, if the program is itself making decisions, such as how much to bank right in a 45mph crosswind or how much of a certain inventory to order from a supplier, the risk increases substantially.

I’ve taken to quantifying error risk by asking the following questions:

  1. Is the program or system making autonomous decisions? If yes, what happens when the wrong decision is made?
  2. If it is making decisions, what is the cycle time / how quickly are those decisions being made?
  3. If it is not making decisions, is the information it’s providing critical or supplementary? (Critical information could be things like cancer diagnostics, whereas supplementary information could be providing simple reports to a digital marketing team.)

Other questions come up in these situations, but the above are the most important.
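To make those questions a little more concrete, here is a minimal sketch of how one might turn the answers into a rough, relative score. The weights and thresholds are purely illustrative assumptions on my part, not a validated methodology.

```python
# Illustrative only: toy weights, not a validated methodology.
def error_risk_score(autonomous, decisions_per_hour=0.0, info_criticality="supplementary"):
    """Rough, relative score – higher means more error risk to manage."""
    score = 0.0
    if autonomous:
        score += 5.0                                   # autonomous decisions carry most of the risk
        score += min(decisions_per_hour / 100.0, 5.0)  # faster decision cycles compound it
    elif info_criticality == "critical":
        score += 3.0                                   # advisory but critical (e.g. diagnostics)
    else:
        score += 0.5                                   # supplementary reports to a human
    return score

print(error_risk_score(autonomous=True, decisions_per_hour=1000))            # 10.0
print(error_risk_score(autonomous=False, info_criticality="supplementary"))  # 0.5
```

The point is not the specific numbers, but that autonomy, decision speed, and criticality each push the risk in a predictable direction.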

Optimal use of machine learning in applications means gaining maximal benefit at minimum risk wherever possible. To get as close as possible to “pure upside” in implementing machine learning, what’s required is some strategic thinking around where the opportunities lie and what the error tolerance might be in those applications. Even state-of-the-art machine learning systems have intrinsic error. Therefore error risk must always be accounted for, even if the error size is tiny.

My ideas about optimal implementation of machine learning borrow heavily from ideas in portfolio optimization, especially the efficient frontier. This is the “sweet spot” in the tradeoff between rewards and risks. As machine learning makes its way into more applications, it’s worth taking the time to consider both the upside and the downside. Measures of optimality can only help to make more informed decisions about how to apply the latest technology.

From Paper to Production: Shortening the Ramp


One of the things that strikes me about the current state of machine learning is how long it still takes to get a new algorithm or model into production. From the time that 1) a paper is published to when 2) its contents are evaluated by those doing machine learning in industry and 3) they subsequently commit to developing it, years have passed. It does not need to be this way.

Those doing machine learning are understandably wary of newer methods, and I can see why they might opt to give a new method time for hidden problems to be discovered before committing. The long-term viability of a model is often judged by its very ability to remain on the scene after many years, which, though vaguely tautological, remains valid. Models that survive the process are deemed essential, even timeless. The rest are disregarded.

There are two sides from which this can be viewed: the business risk side, and the development side. These are deeply intertwined.

The Risk Side

Product managers and tech leadership who are coming to grips with the pervasiveness of machine learning face an increasing number of considerations, many of them in technology areas outside their own expertise. There is a bit of a silver lining, however: the discussion about which machine learning technology makes its way into new products is fundamentally one of risk. To the extent that they can work with those within their organization, or its allies, who have the expertise to accurately assess newer methods, they can quantify the level of risk for when things go wrong.

For instance, if a new method can drop the error rate of a certain type of prediction down to 3% (as opposed to a previous 5%), how does that affect the risk statistics of the business? Does it enable broader distribution or reach into a higher market segment? Does it enable new products entirely? These questions must be answered.
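As a back-of-the-envelope illustration of how such an error-rate improvement translates into business terms, consider the following sketch; the prediction volume and cost-per-error figures are hypothetical placeholders you would replace with your own.

```python
# Hypothetical figures for illustration only.
predictions_per_month = 200_000
cost_per_error = 2.50          # e.g. support cost, refund, or lost conversion

old_error_rate = 0.05
new_error_rate = 0.03

old_expected_cost = predictions_per_month * old_error_rate * cost_per_error
new_expected_cost = predictions_per_month * new_error_rate * cost_per_error

print(f"Expected monthly cost at 5% error: ${old_expected_cost:,.0f}")   # $25,000
print(f"Expected monthly cost at 3% error: ${new_expected_cost:,.0f}")   # $15,000
print(f"Expected monthly savings:          ${old_expected_cost - new_expected_cost:,.0f}")
```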

New qualitative capabilities may seem more difficult to judge, but that is not necessarily true. For instance, some newer capabilities involve an AI system describing what’s in a photograph, using completely natural language with understandable sentences. If a product manager or CTO is considering using this capability in a new product or feature, the error rate of the method can still be used to assess the risk the new feature exposes the business to. The degree of risk will vary widely by the specific industry and application, but the process remains the same.

The Development Side

Even if all parties can agree that a new method is tempting enough to use in a feature, somebody still has to code the thing. This is where progress is sluggish. Developing new, unfamiliar models and validating them is a nontrivial effort, even for experienced ML programmers. Assuming you’re doing the first implementation in a given language or environment, it requires a degree of getting into the thought process of the researchers. Often, direct correspondence is needed to clarify details.

While many papers include pseudocode that can be readily translated into a programming language, just as many do not. From there, you are left to develop a deep understanding of the model’s description and translate its mathematical definition and data structures into a complete implementation. It’s hard work.

This is the part where things can slow down: without a clear understanding of the model and its behavior, it is not possible for a tech lead, data scientist, or ML developer to make accurate judgements about the level of risk or the likelihood of bugs or other surprise behavior. Beyond the error rate itself, one has to assume that the resulting implementation will have its own quirks and bugs. To assume otherwise would be both unrealistic and foolish.

Many companies may be slower to adopt “bleeding edge” methods, then, because it is simply too difficult to enumerate the implied capabilities and to quantify the risk they impose. How can this be solved?

Shorten the Ramp

Consider the situation where some new deep learning model X has just been published and a company really wants to use it in their products, but may not have a good way of thinking through the consequences of doing so. We can point out the main issues:

  • It can be a challenge to arrive at an exact error rate for the specific application before an implementation has been made. The paper will use test datasets, but the model will almost surely behave differently with the data specific to a feature.
  • There is often a breakdown in communication between those gaining an understanding of the model and those assessing how it may affect the business overall. The impact could be anything from a smash hit to a total disaster.
  • Even when a model is finished, it will need to land in an environment in which to run. Engineers should keep the infrastructure requirements in mind from the beginning.
  • In a waterfall or waterfall-like process, it is of course not possible to create requirements in the absence of understanding of the capabilities involved. This stalls progress.
  • Agile development is out the window, due to the high sensitivity of the relationship between model performance and feature risk or cost. These aren’t really the kinds of things you can just “ship first, iterate later”. Much needs to be worked out before it goes into the hands of feature developers.

All of this points toward two ways to shorten the ramp to deployment:

  1. If a company is genuinely interested in adopting new algorithms and models in their offerings, they need to provide representative data as soon as possible.
  2. Their engineering and/or data science team(s) need to have tools and infrastructure to support rapid prototyping of new models.

Only with datasets that are representative of what will occur in a production setting can a team judge their implementation and profile its performance. In the paper, a model may boast 90-something percent accuracy, but you may find that for your problem it is a little less, thereby affecting how risky the investment in developing the model is.

This can happen before the implementation phase, by looking at the test data used in a paper. A talented developer or data scientist can format the internal dataset to be similar to that used in a test dataset from the paper, thereby reducing opportunity for errors to arise from differences in data formatting.
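As a sketch of what “formatting the internal dataset to look like the paper’s test set” might involve, consider something like the following. The field names, file name, and model object are all hypothetical stand-ins for whatever your data and prototype actually look like.

```python
import csv

def to_paper_format(internal_row: dict) -> dict:
    """Map our internal record layout onto the (hypothetical) layout used by
    the paper's test set, so a reference implementation can consume it."""
    return {
        "text": internal_row["product_description"].strip().lower(),
        "label": 1 if internal_row["was_returned"] == "yes" else 0,
    }

def estimate_error_rate(model, internal_rows) -> float:
    """Run a candidate model over reformatted internal data and report the
    observed error rate, which is often a bit worse than the paper's figure."""
    examples = [to_paper_format(row) for row in internal_rows]
    errors = sum(1 for ex in examples
                 if model.predict(ex["text"]) != ex["label"])
    return errors / len(examples)

# Example usage (file name and model are placeholders):
# with open("internal_sample.csv") as f:
#     rows = list(csv.DictReader(f))
# print(estimate_error_rate(candidate_model, rows))
```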

An example process, then, might look like this:

  1. Product manager decides they need X new capability in their next phase of features.
  2. Their first order of business, then, is to gather and build datasets that are close to the actual problem. At the very least, they should ensure that whoever can build that dataset has access to all the tools and data sources needed to complete it quickly.
  3. Product manager hands over dataset, high-level requirements, and asks data science and/or engineering team(s) to begin investigating models.
  4. Technical team either begins profiling models they already know about or scouting for models that are known to enable the required capabilities.
  5. Development / prototyping begins with the selected model(s) and the datasets provided.
  6. Throughout development process, error rates and other important metrics are reported back to the product manager (or whomever is overseeing the process).
  7. Risk calculations are adjusted as this information flows in. For example, if it’s a photo auto-tagging feature in question, one can determine how many users are likely to experience incorrectly tagged photos and how often, based on the volume of photos and the error rate of the model (a back-of-the-envelope sketch follows this list). From that, one can determine how much of a risk it is to the business – are the users likely to leave if they experience the error, or is it not a deal-breaker?
  8. Once all models have been profiled and tested, a decision can be reached about whether or not to proceed with the feature.
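For step 7, a quick back-of-the-envelope calculation often suffices. Here is a sketch using the photo auto-tagging example; the error rate and usage figures are invented for illustration.

```python
# Hypothetical figures: substitute your measured error rate and actual usage.
error_rate = 0.04                 # per-photo probability of a wrong tag
photos_per_user_per_month = 60
monthly_active_users = 50_000

# Probability a given user sees at least one mis-tagged photo in a month.
p_affected = 1 - (1 - error_rate) ** photos_per_user_per_month
users_affected = p_affected * monthly_active_users

print(f"{p_affected:.1%} of active users likely see at least one bad tag")  # ~91.4%
print(f"~{users_affected:,.0f} users affected per month")                   # ~45,700
```

In a case like this, nearly every active user will eventually see a bad tag, so the real question becomes how damaging a single bad tag is to the user experience.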

Of course not every company has a product manager or entire teams for data science and engineering, but the overall structure can be applied by those filling the roles – even if it’s all the same person.

In summary, the best way I see to shorten the time from a published paper to a viable production implementation is to 1) provide data as early as possible and 2) ensure engineering has tools to quickly prototype and test models. Both are difficult, but both will pay off considerably for those willing to put in the effort.

I hope this has been helpful to you. Please ask questions in the comments.

 

 

The Octet System: A Way to Think About AI


You see countless headlines about AI these days, littered with references to “deep learning”, “neural networks”, “bots”, “Q&A systems”, “virtual assistants”, and all manner of other proxy terms. What’s missing from this entire discussion is a way to gauge what each system is really capable of.

In the spirit of the Kardashev Scale, I’ve put together my own ranking system for AIs, which we’ll be using at Machine Colony.

(Note: I’ll try to provide as much background and example information as I can without this post reading like a cog-sci textbook. For those savvy in AI these examples will no doubt seem pedestrian, however I do try to illustrate concepts as much as possible, in a perhaps ill-fated effort to refrain from being too esoteric.)

Introducing the Octet System

I’m not much for fancy names, but in this case it was fitting: list the qualitative capabilities of an information system, and break them down into eight distinct ranks, or “classes”. They are as follows:

Class Null

The zeroth class is something which does not qualify as an intelligent system whatsoever. While this can cover any manner of programs – be they in software or manifested in processes emerging from hardware – I choose to focus on software for this example.

Programs that fall into class null have the following characteristics:

  • They are only able to follow explicit, predetermined / deterministic logic.
  • They follow simple rules – “if this then that” – with no capacity to ever learn anything.
  • They have no capacity to make nuanced decisions, i.e. based on probability and/or data.
  • They have no internal model of the world (this is related to learning).
  • They do not have their own agency.

This is by far the broadest class, as it covers the vast majority of our software systems today. Most of the world’s software is programmed for a specific task and does not really need to be a learning, decision-making system.

Class I

Programs of this class have the following characteristics that are different from Class Null:

  • They have the ability to make rudimentary decisions based on data, via some trained model.
  • They have the ability to learn from the outcomes of their decisions, and thus to update their core model.
  • As such, their behavior may vary over time, as the data changes and their model changes.
  • They are trained for a small number of narrow tasks, and do not have the capability to go outside those tasks.

This covers things like fraud detection agents, decent spam detection, basic crawling bots (assuming they’re at least using decision trees or something similar). The decision could be a classification action – marking something as spam, for instance – or it could be deciding how well a website ranks in the universe of websites.
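To make the Class I idea concrete, here is a minimal sketch of a spam-style classifier along these lines: one narrow task, one trained core model, and a retraining step so that behavior can drift as labeled outcomes accumulate. The features and data are toy stand-ins, and I’m assuming scikit-learn purely for brevity.

```python
from sklearn.tree import DecisionTreeClassifier

# Toy features: [number of links, ALL-CAPS word count, message length]
X = [[8, 5, 40], [0, 0, 300], [6, 3, 60], [1, 0, 250]]
y = [1, 0, 1, 0]                                 # 1 = spam, 0 = not spam

model = DecisionTreeClassifier().fit(X, y)       # the single, narrow core model

def handle_message(features, true_label=None):
    decision = model.predict([features])[0]      # rudimentary data-driven decision
    if true_label is not None:                   # learn from the outcome...
        X.append(features)
        y.append(true_label)
        model.fit(X, y)                          # ...so behavior drifts as the data changes
    return decision

print(handle_message([7, 4, 55]))                # classified as spam (1)
```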

Class II

Classes I and II are the most similar on this scale, because the distinction is subtle: a Class II program has all of the capabilities of a Class I, but may have more than one core model and more than one domain of action. For instance, the new Google Translate app has natural language and vision capabilities, with different models for each. These models are linked and ‘cooperate’ to translate the words in the field of view of your smartphone’s camera.

Class Is, by contrast, focus on only one area, and make decisions based only on the model relevant to that domain.

Class III

Programs of Class III have two main distinctions from Class IIs:

  • They have a basic memory mechanism, and the ability to learn from their history in those memories. This is more advanced than simply referring to data; these programs are actually building up heuristics from their own behavioral patterns.
  • They persist some form of internal model of the world. This assists in creation of memories and new heuristics in their repertoire.

Class IV

This class starts to loosely resemble the intelligence level of insects. Class IVs not only have some kind of internal model of the world, but they gain an ability that was essential to the evolution of all complex life on earth: collaboration.

Thus, their characteristics are:

  • They have the ability to collaborate with other agents/programs. That is, they have mechanisms with which to become aware of other agents, and a medium through which to communicate signals. (Think of ants and bees leaving chemical traces, for instance.)
  • Like Class IIIs, they have an internal model of the world. However, a Class IV’s model is more closely linked to its goal structure, and not merely ad hoc / bound in one model. Its internal model may be distributed across several subsystems / mathematical models; representations of complex phenomena or experiences are encoded across various components in its cognitive architecture (vision systems, memory components, tactile systems, etc).
  • They have the ability to perform rudimentary planning, driven by fairly rigid heuristics but with a little flexibility for learning.
  • They have the ability to form basic concepts, schema, and prototypes.

While it is not a prerequisite that Class IVs have multiple distinct sensory modalities – optic, auditory, tactile, olfactory systems – that serves as a good example of the complexity level these programs start to achieve. In an AI setting, a program could have hundreds of different types of inputs, each with their own data type and respective subsystem for processing the input. The key distinction is that in Class IV programs, these subsystems have a high degree of connectivity, and thus generate more complex behavior.

Many robotics software systems could also be placed in Class IV.

Class V

The capabilities of Class Vs begin to resemble more complex animal behavior, such as rats (but not as intelligent as many apes). The primary characteristics of note are:

  • They have the ability to reflect on their own ‘thoughts’. In an AI program setting, this would mean it has the ability to optimize its own metaheuristics. In rats, for example, this is manifested as a basic form of metacognition.
  • They have the ability to perform complex planning, especially planning in which they are able to simulate the world and themselves in it. Which leads to:
  • They have the ability, to some degree, to simulate the world in their minds. That is, they can perturb their internal model of the world without actually taking action, and play out the results of hypothetical actions. They can imagine scenarios based on their knowledge of the world, which is intimately related to their memories (recall the memory capability from Class III).
  • Related to their planning and internal simulation capabilities, they have the ability to set their own goals and take steps to achieve them. For instance, a rat may see two different pieces of food, decide that it likes the looks of one of them better than the other, set its goal to acquire the better-looking morsel, and subsequently plan a path to get it. The planning part relies on actions it knows it can do – how fast it can run, how far it can jump – and the terrain ahead of it, as well as memories of how it may have conquered that type of terrain before. Thus goal-setting and planning rely heavily on memory and the internal model.
  • They have a rudimentary awareness of their own agency in the environment. That is, when they are planning, they treat themselves as a factor in the environment they are simulating.

By this time, you have a program which is able to reflect, plan, collaborate with other agents, set goals, learn new behaviors and strategies for achieving its goals, and simulate hypothetical scenarios.

Class VI

You’re a Class VI. So am I. Almost every human being is a Class VI – ‘almost’ because, well, this designation is questionable when applied to some politicians.

Programs of this class will start to resemble human-level intelligence and capability, though not necessarily human-like in nature. While in humans a major difference is more complex emotions, this scale does not consider emotions directly.

Artificial intelligence has not yet reached this level, and there are varying predictions as to when it will. The good news is that the expert consensus is clear on the idea that it will happen; it’s just that no one knows exactly when.

Key components of Class VI agents/AIs are:

  • Full self-awareness. The agent is fully aware of itself, its history, where the environment ends and it begins, and so on. This is related to consciousness, though Class VIs need not necessarily be conscious in a strict sense.
  • They have the ability to plan in the extremely long term, thinking ahead in ways that more basic systems cannot. Specific timescales are relative to its natural domain: for a person, decades; for an AI program, perhaps, seconds or hours.
  • Class VIs are able to invent new behaviors, processes, and even create other ‘programs’. In the case of a human, this is obviously an inventor creating a new way of solving a problem, or a software developer programming AIs somewhere in Brooklyn…

Class VII

This is what might well be referred to as ‘superintelligence’. While some AI experts are skeptical of whether it can be achieved at all, there does seem to be broad agreement that it is imminent. Nick Bostrom writes elegantly about the subject in his book of the same name.

While nobody knows exactly what this may look like, there are two major distinguishing factors which would almost certainly be present:

  • They have the ability to systematically control their own evolution.
  • They have the ability to recursively improve themselves, perhaps even at alarmingly minuscule timescales.

The first ability is perhaps the most profound. While humans do in some sense control our own fate, we do not yet have fine-grained control over the evolution of our brains, and hence our cognitive abilities (though CRISPR may soon change that). With an artificial superintelligence, many of those limitations are removed. They can arbitrarily copy-paste themselves, ad infinitum, and perform risk-free simulations of their new versions. They will also be essentially immortal, so long as their hardware persists and has a supply of energy.

With respect to the second ability, one might imagine an ASI (artificial superintelligence) making multiple clones of itself, each clone independently applying a self-improving strategy, and then each one in turn performing a set of benchmark tests to determine which one improved the most from the original copy. Whichever agent performed the best would become the new master copy, while the others would be taken out of the running.

This is a supercharged evolutionary algorithm, essentially. The tests would be agreed upon in advance, and even perhaps written as a cryptographically secure contract (blockchain-based or otherwise) to prevent cheating. In doing so, the agent would keep improving up to hardware limits or some theoretical asymptote.
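Purely as a thought experiment, the clone-improve-benchmark-select loop described above maps onto a very ordinary evolutionary structure. The sketch below is illustrative pseudocode in Python; attempt_self_improvement() and run_benchmarks() are hypothetical methods standing in for capabilities nobody has built.

```python
import copy

def evolve(master, generations=10, population=8):
    """Sketch of the selection loop described above: clone the master, let each
    clone try to improve itself, benchmark everyone, promote the best copy."""
    for _ in range(generations):
        clones = [copy.deepcopy(master) for _ in range(population)]
        for clone in clones:
            clone.attempt_self_improvement()               # hypothetical capability
        candidates = clones + [master]
        scores = {c: c.run_benchmarks() for c in candidates}  # pre-agreed test suite
        master = max(scores, key=scores.get)               # winner becomes the new master
    return master
```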

The kind of scenario above is not at all unlikely in the near future.

 

Summary

AI capabilities currently exist somewhere between the Class IV and Class V marks, but are quickly marching toward Class VI. DeepMind and Facebook are leading the way in this direction, though other notable players are making important contributions. Certainly the brand-new OpenAI will have some interesting insights as well.

My hope is that this type of classification system, and others like it, will help bring some structure to the conversation around fast-emerging AI. With deeper clarity in our common language, we can have more meaningful and productive conversations about how we wish for this technology to advance and how it ought to be used. We owe it to ourselves to have the linguistic tools to accurately describe our progress.

A World Inside the Mind


Short post today, but a few things occurred to me as I was reading the paper on Bayesian Program Learning:

  • This form of recursive program induction starts to look suspiciously like simulation – something we do in our minds all the time.
  • Simulation may be a better framing for concept formation than the classification route.
  • Mapping the ‘inner world’ to the ‘outer world’ seems a more sensible approach to understanding what’s going on. If you look at the paper, you also see some thought-provoking examples of new concept generation, such as the single-wheel motorbike example (in the images). This is the most exciting point of all.

A final design?

Combine all these elements, along with ideas from my last post, and you get something that:

  1. Simulates an internal version of the world
  2. Is able to synthesize concepts and simulate the results, or literally ‘imagine’ the results – much like we do
  3. Is able to learn concepts from few examples
  4. Has memories of events in its lifetime / runtime, and can reference those events to recall the specific context of what else was happening at that time. That is, memories have deep linkage to one another.
  5. Is able to act of its own volition, i.e. in the absence of external stimulus. It may choose to kick off imagination routines – ‘dreaming’, if you will – optimize its internal connections, or do some other maintenance work in its downtime. Again, similar to how our brains do it while we sleep.

This starts to look like a pretty solid recipe for a complete cognitive architecture. Every requirement has been covered in some way or another, though in different models and in different situations. To really put the pieces together into a robust architecture will require many years of work, but it is worth exploring multi-model cognitive approaches.
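As a rough structural sketch, and nothing more, the five requirements above might map onto an agent skeleton like the one below. Every method is a placeholder for a substantial research problem, not a working component.

```python
class CognitiveAgent:
    """Skeleton only: each method stands in for one requirement above."""

    def __init__(self):
        self.world_model = {}   # 1. an internal version of the world
        self.concepts = {}      # 3. concepts learned from few examples
        self.memories = []      # 4. episodic memories with linked context

    def simulate(self, hypothetical_action):
        # 2. 'imagine' outcomes by perturbing the world model, not the world
        return self.world_model.get(hypothetical_action, "unknown outcome")

    def learn_concept(self, name, examples):
        # 3. few-shot concept formation (e.g. via program induction)
        self.concepts[name] = list(examples)

    def remember(self, event, context):
        # 4. memories carry their context so they can link to one another
        self.memories.append({"event": event, "context": context})

    def idle(self):
        # 5. act without external stimulus: 'dream', consolidate, reorganize
        for memory in self.memories:
            self.simulate(memory["event"])
```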

If it results in a useful AI, then I’m all in.

 

Context & Permutations


In the pursuit of Artificial General Intelligence, one of the challenges that comes up again and again is how to deal with context.  To illustrate: telling a robot to cross the street would seem simple enough.  But consider the context that five minutes ago somebody else told this robot not to cross the street because there was some kind of construction work happening on the other side.  What does the robot decide to do?  Whose instruction does it consider more important?

A robot whose ‘brain’ did not account for context properly would naively go crossing the street as soon as you told it to, ignoring whatever had come before.  This example is simple enough, but you can easily imagine other situations in which the consequences would be catastrophic.

The difficulty in modeling context in a mathematical sense is that the state space can quickly explode, meaning that the number of ways that things can occur and the sequences they can occur in is essentially infinite.  Reducing these effective infinities down to a manageable size is where the magic occurs.  The holy grail in this case is to have the compute cost of the main algorithm remain constant (or at least linear) even as the number of possible permutations of contextual state explodes.

How is this done?  Conceptually, one needs to represent things sparsely, and have the algorithm that traverses this representation only take into account a small subset of possibilities at a time.  In practice, this means representing the state space as transitions in a large graph, and only traversing small walks through the graph at any given time.  In this space-time tradeoff, space is favored heavily.
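A minimal sketch of that idea, assuming nothing about any particular AGI framework: store only the contextual transitions that have actually been observed, and expand only short, bounded walks from the current state, so per-decision work stays roughly constant even as the full space of permutations explodes.

```python
from collections import defaultdict

# Sparse representation: only transitions that have actually been observed
# are stored, never the full (effectively infinite) space of permutations.
transitions = defaultdict(list)

def observe(state, next_state):
    transitions[state].append(next_state)

def relevant_context(current_state, max_depth=3, max_branch=4):
    """Walk only a small neighborhood around the current state; the work is
    bounded by max_branch ** max_depth no matter how large the graph grows."""
    frontier, seen = [current_state], {current_state}
    for _ in range(max_depth):
        next_frontier = []
        for state in frontier:
            for nxt in transitions[state][:max_branch]:
                if nxt not in seen:
                    seen.add(nxt)
                    next_frontier.append(nxt)
        frontier = next_frontier
    return seen

observe("told_to_cross_street", "construction_warning_nearby")
print(relevant_context("told_to_cross_street"))
```

In the street-crossing example, the robot would expand the small neighborhood around the new instruction and surface the earlier construction warning before acting.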

The ability to adeptly handle context is of utmost importance for current and future AIs, especially as they take on more responsibility in our world.  I hope that AI developers can form a common set of idioms for dealing with context in intelligent systems, so that they can be collaboratively improved upon.

The Best & Worst of Tech in 2013


Keeping with tradition, I’ll review some of the trends I noticed this past year, and remark on what they might mean for those of us working in technology.

THE BEST

JavaScript becomes a real language!

To some this might seem trivial, but there’s a lot to be said here.  With the massive growth of Node.js and many associated libraries, Google’s V8 engine has been stirring up the web world.  Write a Node.js program, and I guarantee you’ll never think of a web server the same way again.

Why is this good?  This isn’t an advertisement for Node.js, but I would posit that these developments are good because they open up entire new worlds of productivity – rapid prototyping, readable code, and entirely new ways of thinking about web servers.  Some folks are even running JavaScript on microcontrollers now, a la Arduino.  JavaScript has been unleashed from the confines of the browser, and is maturing into a powerful tool for creating production-quality systems with high scalability and developer productivity.  Exciting!

Cognitive Computing Begins to Take Form

Earlier this year I stated a belief that 2013 would be the year of cognitive systems.  Well, that hasn’t been fulfilled completely, but we’ve nonetheless seen some intriguing developments in that direction.  IBM continues to chug away at their cognitive platforms, and Watson is now deployed, working full time as an AI M.D. of sorts.  Siri has notably improved from earlier versions.  Vicarious used their algorithms to crack CAPTCHA.  Two rats communicated techepathically (I just made that word up) with each other from huge distances, and people have been controlling robots with their minds.  It’s been an amazing year.

The cognitive computing/cybernetics duo is going to change, well, everything.  I would argue that cybernetics may just top the list of most transformative technologies, but it has a ways to go before we go full Borg.

Wearables Start to Become a Thing

Ah, wearables.  We’ve waited for nifty sci-fi watches for so long – and lo!  They have come.  Sort of.  They’re on their way, and we’re starting to catch glimpses of what this will actually mean for technology.  I agree with Sergey Brin here: it’ll get the technology out of our hands and integrated into our environment.  Personally I envision tech becoming completely seamless and unnoticeable, nature-friendly and powerful, much like our own biological systems, but that’s another article entirely.

Wearable technology will combine with the “Internet of Things” in ways we can’t yet imagine, and will make life a little easier for some and much, much better for others.

Internet of Things

The long-awaited Internet of Things is finally starting to coalesce into something real.  Apple is filing patents left and right for connected home gear, General Electric is making their way into the space with new research, and plenty of startups are sprouting to address the challenges in the space (and presumably be acquired by one of the big players).

This development is so huge it’s almost difficult to say what it will bring.  One thing is for sure: the possibilities are limited only by one’s imagination.

21st Century Medicine is Shaping up to be AWESOME

Aside from the fact that we now have an artificial intelligence assisting in medical diagnosis, there have been myriad amazing developments in medicine.  From numerous prospects for cures for cancer, HIV, and many other diseases, to advances in regenerative medicine and bionanotechnology, we’re on the fast track to a future wherein medical issues can be resolved quickly and with relatively little pain.  There’s also a different perspective: solve the issue at the deepest root, instead of treating symptoms with drugs.

THE WORST

Every Strategy is a Sell Strategy

This year, tech giants went acquisition-mad.  It seems like every day one of them has blown another few billion dollars on some startup somewhere.

Why is this bad?  It may be good for the little guy (startup) in the short term – they walk away with loads of cash – but in the long term I suspect it will have a curious effect.  It’s almost like business one-night-stand-ism.  You build a company knowing full well that you’re just going to sell it to Google or Facebook.  If not, you fold.

You can see where this goes.  People are often saying they look forward to ‘the next Google’, or ‘the next Facebook’, or whatever.  Well there might not be any.  That is, all the big fish are eating the little fish before they have the chance to become big fish.  Result?  Insanely huge fish.

It’s great that a couple of smart kids can run off, Macbook Pros in hand, and [potentially] make a few billion bucks in a few years, with or without revenue.  But who is going to outlast the barrage of acquisition offers and become the next generation of companies?

Big Data is Still not Clearly Defined

Big Data.  Big data.  BIG.  DATA.

What does it mean?

The buzzword and its ilk have been floating around for a couple of years now, and still nobody can really define what it does.  Most seem to agree it goes something like: prop up a Hadoop cluster, mine a bunch of stale SQL records in a massive company/organization, cast the MapReduce spell and – Hadoopra cadabra!  Sparkling magical insights of pure profit glory appear, fundamentally altering life and the universe forever – and sending you home with bigger paychecks.

I’m all for data analysis.  In fact I believe that a society that makes decisions based on hard evidence and good data-crunching is a smart society indeed.  But the ‘Big Data’ hype has yet to form into anything definitive, and remains a source of noise.  (Big data fanboys, go ahead and flame in the comments.)

America’s Innovation Edge Dulls

It’s true.  I hate to admit it, but it is, undeniably, absolutely true.  America has dropped the ball when it comes to innovation.  That’s not to say we’re not innovating cool things, generating economic growth and all of that – we are.  But that shine has started to fade.  Specifically, America has a problem with denying talented people the right to be here and work.

It could be our hyper-paranoid foreign policy in the wake of 9/11, it could be the flawed immigration system, it could be Washington gridlock or a million other things.  It’s not particularly fruitful to pass blame around now.  We’re turning away the best and the brightest from around the world, and simultaneously continuing to outsource some of what used to be our core competencies.  The bright spot in all of this is that high-tech manufacturing seems to be making a comeback, perhaps in part thanks to 3D printing, but it’s not quite enough.  We need more engineers, more inventors, and more people from outside our borders.  This has always been the place people come to plant the seeds of great ideas.  Let’s stay true to that.