From Paper to Production: Shortening the Ramp

One of the things that strikes me about the current state of machine learning is how long it still takes to get a new algorithm or model into production. Between the time that 1) a paper is published, 2) its contents are evaluated by those doing machine learning in industry, and 3) they commit to developing it, years have passed. It does not need to be this way.

Those doing machine learning are understandably wary of newer methods, and I can see why they might opt to give a method time for hidden problems to be discovered before committing. The long-term viability of a model is often judged by its very ability to remain on the scene after many years, which, though vaguely tautological, remains valid. Models that survive this process are deemed essential, timeless. The rest are disregarded.

There are two sides from which this can be viewed: the business risk side, and the development side. These are deeply intertwined.

The Risk Side

Product managers and tech leadership coming to grips with the pervasiveness of machine learning face an increasing number of considerations, many of them in technology areas outside their expertise. There may be a bit of silver lining, however: the discussion of which machine learning technology makes its way into new products is fundamentally a discussion of risk. To the extent that they can work with people within their organization or its allies who have the expertise to accurately assess newer methods, they can quantify the risk they take on when things go wrong.

For instance, if a new method can drop the error rate of a certain type of prediction down to 3% (as opposed to a previous 5%), how does that affect the risk statistics of the business? Does it enable broader distribution or reach into a higher market segment? Does it enable new products entirely? These questions must be answered.
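One way to make that question concrete is to translate error rates into expected cost. The sketch below is purely illustrative: the prediction volume and cost-per-error figures are invented assumptions, not numbers from any real business.

```python
# Hypothetical sketch: turning an error-rate improvement into expected cost.
# The volume and cost-per-error figures below are made-up assumptions.

def expected_error_cost(error_rate: float, predictions_per_month: int,
                        cost_per_error: float) -> float:
    """Expected monthly cost of incorrect predictions."""
    return error_rate * predictions_per_month * cost_per_error

old_cost = expected_error_cost(0.05, predictions_per_month=100_000, cost_per_error=2.0)
new_cost = expected_error_cost(0.03, predictions_per_month=100_000, cost_per_error=2.0)

print(f"old: ${old_cost:,.0f}/mo, new: ${new_cost:,.0f}/mo, "
      f"saved: ${old_cost - new_cost:,.0f}/mo")
```

Even a toy calculation like this forces the conversation onto shared terms: once error rate, volume, and the cost of a single mistake are on the table, the business impact of adopting the new method stops being a matter of intuition.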

New qualitative capabilities may seem more difficult to judge, but that is not necessarily true. For instance, some newer capabilities involve an AI system describing what’s in a photograph in natural, understandable sentences. If a product manager or CTO is considering using this capability in a new product or feature, the error rate of the method can still be used to assess the risk the new feature exposes the business to. The degree of risk will vary widely by the specific industry and application, but the process remains the same.

The Development Side

Even if all parties can agree that a new method is tempting enough to use in a feature, somebody still has to code the thing. This is where progress is sluggish. Developing new, unfamiliar models and validating them is a nontrivial effort, even for experienced ML programmers. Assuming you’re doing the first implementation in a given language or environment, it requires getting inside the thought process of the researchers, and direct correspondence is often needed to clarify details.

While many papers include pseudocode that can be readily translated into a programming language, just as many do not. From there, you are left to develop a deep understanding of the model’s description and translate its mathematical definition and data structures into a complete implementation. It’s hard work.

This is the part where things can slow down: without a clear understanding of the model and its behavior, a tech lead, data scientist, or ML developer cannot make accurate judgements about the level of risk or the likelihood of bugs and other surprise behavior. Beyond the error rate itself, one has to assume that the resulting implementation will have its own quirks and bugs. To assume otherwise would be both unrealistic and foolish.

Many companies may be slower to adopt “bleeding edge” methods, then, because it is simply too difficult to enumerate the capabilities they imply and to quantify the risks they impose. How can this be solved?

Shorten the Ramp

Consider the situation where there is some new deep learning model that a company really wants to use in its products, but it lacks a good way of thinking through the consequences of doing so. We can point out the main issues:

  • It can be a challenge to arrive at an exact error rate for the specific application before an implementation has been made. The paper will use test datasets, but the model will almost surely behave differently with the data specific to a feature.
  • There is often a break in the communication between those gaining understanding of the model and those assessing how it may affect the business overall. It could be anything from a smash hit to total disaster.
  • Even when a model is finished, it will need to land in an environment in which to run. Engineers should keep the infrastructure requirements in mind from the beginning.
  • In a waterfall or waterfall-like process, it is of course not possible to create requirements in the absence of understanding of the capabilities involved. This stalls progress.
  • Agile development is out the window, due to the high sensitivity of the relationship between model performance and feature risk or cost. These aren’t really the kinds of things you can just “ship first, iterate later”. Much needs to be worked out before it goes into the hands of feature developers.

All of this points toward two ways to shorten the ramp to deployment:

  1. If a company is genuinely interested in adopting new algorithms and models in their offerings, they need to provide representative data as soon as possible.
  2. Their engineering and/or data science team(s) need to have tools and infrastructure to support rapid prototyping of new models.

Only with datasets that are representative of what will occur in a production setting can a team judge their implementation and profile its performance. In the paper, a model may boast 90-something percent accuracy, but you may find that for your problem it is a little less, thereby affecting how risky the investment in developing the model is.

This can happen before the implementation phase, by looking at the test data used in a paper. A talented developer or data scientist can format the internal dataset to be similar to that used in a test dataset from the paper, thereby reducing opportunity for errors to arise from differences in data formatting.
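A minimal sketch of that formatting step might look like the following. Everything here is an assumption for illustration: the internal field names, the JSON-lines export, and the two-column CSV layout stand in for whatever your data and the paper’s benchmark actually use.

```python
# Hypothetical sketch: reshaping an internal dataset to match a paper's
# test-set layout. Field names and the target CSV format are assumptions;
# substitute whatever the paper actually specifies.
import csv
import io
import json

# Internal export: one JSON record per line (made-up fields).
internal = io.StringIO(
    '{"photo_url": "img/001.jpg", "tags": ["cat"]}\n'
    '{"photo_url": "img/002.jpg", "tags": ["dog"]}\n'
)

# Suppose the paper's benchmark uses a two-column CSV: path,label.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["path", "label"])
for line in internal:
    rec = json.loads(line)
    # Keep only the first tag, mirroring a single-label benchmark.
    writer.writerow([rec["photo_url"], rec["tags"][0]])

print(out.getvalue())
```

The point is not the specific format but the discipline: once internal data looks exactly like the paper’s test data, any difference in measured accuracy reflects the data itself rather than a formatting mistake.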

An example process, then, might look like this:

  1. Product manager decides they need X new capability in their next phase of features.
  2. Their first order of business, then, is to gather and build datasets that are close to the actual problem. At the very least, they should ensure that whoever can build that dataset has access to all the tools and data sources needed to complete it quickly.
  3. Product manager hands over dataset, high-level requirements, and asks data science and/or engineering team(s) to begin investigating models.
  4. Technical team either begins profiling models they already know about or scouting for models that are known to enable the required capabilities.
  5. Development / prototyping begins with the selected model(s) and the datasets provided.
  6. Throughout development process, error rates and other important metrics are reported back to the product manager (or whomever is overseeing the process).
  7. Risk calculations are adjusted as this information flows in. For example, if the feature in question is photo auto-tagging, one can determine how many users are likely to experience incorrectly tagged photos, and how often, based on the volume of photos and the error rate of the model. From that, one can determine how much risk the feature poses to the business – are users likely to leave if they experience the error, or is it not a deal-breaker?
  8. Once all models have been profiled and tested, a decision can be reached about whether or not to proceed with the feature.
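The risk calculation in step 7 can be sketched as a simple probability estimate. The error rate and usage figures below are invented assumptions, and the independence assumption is a simplification; real errors often cluster by photo type or user.

```python
# Illustrative sketch of step 7: estimating what share of users will see
# a mistagged photo. Error rate and usage numbers are made-up assumptions.

def share_of_users_affected(error_rate: float,
                            photos_per_user_per_week: int) -> float:
    """Probability a user sees at least one mistag in a week,
    assuming errors are independent across photos."""
    return 1 - (1 - error_rate) ** photos_per_user_per_week

p = share_of_users_affected(error_rate=0.03, photos_per_user_per_week=20)
print(f"{p:.1%} of weekly active users see at least one mistagged photo")
```

Note how quickly a seemingly small per-photo error rate compounds into a large share of affected users; this is exactly the kind of result that should flow back to whoever is maintaining the risk calculations.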

Of course not every company has a product manager or entire teams for data science and engineering, but the overall structure can be applied by those filling the roles – even if it’s all the same person.

In summary, the best way I see to shorten the time from a published paper to a viable production implementation is to 1) provide data as early as possible and 2) ensure engineering has tools to quickly prototype and test models. Both are difficult, but both will pay off considerably for those willing to put in the effort.

I hope this has been helpful to you. Please ask questions in the comments.