Building Human Curiosity into Artificial Intelligence
Source: Scott Zoldi
Advances in artificial intelligence (AI) and machine learning are improving the ability of predictive analytics to boost bottom lines. Does that mean that smart machines are about to replace humans in higher-complexity jobs? No doubt, smart machines are getting smarter. But even the smartest machines lack fundamental human characteristics that are absolutely critical to solving certain types of problems. One of these key capabilities is curiosity ― but how can we replicate it?
For the answer, we need to look at neuro-dynamic programming. It’s an analytic method for learning and anticipating how current and future actions are likely to contribute to a long-term cumulative reward. This technique is related to advanced AI reinforcement learning methods, which take inspiration from behaviorist psychology to connect future reward/penalty back to earlier steps in a decision-making process. That contrasts with traditional supervised learning, which attributes reward only to the current decision.
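The core idea ― credit for a late reward flowing back to earlier decisions ― can be sketched in a few lines of Python. This is a toy temporal-difference update, not any production system; the states, reward, learning rate and discount factor are illustrative assumptions:

```python
def td_update(values, state, next_state, reward, alpha=0.1, gamma=0.9):
    """One temporal-difference step: pull the value of the current state
    toward the reward plus the discounted value of the next state."""
    target = reward + gamma * values.get(next_state, 0.0)
    old = values.get(state, 0.0)
    values[state] = old + alpha * (target - old)
    return values[state]

# A three-step episode where the reward arrives only at the very end.
# Repeated sweeps propagate value backward to the earliest state.
values = {}
episode = [("s0", "s1", 0.0), ("s1", "s2", 0.0), ("s2", "end", 1.0)]
for _ in range(50):
    for state, next_state, reward in episode:
        td_update(values, state, next_state, reward)
```

After training, the first state carries value inherited from the final reward, even though nothing rewarding ever happened there directly ― which is exactly the contrast with supervised learning drawn above.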
These advanced methods rely on repeated experimentation and prediction; ultimately, these chains of actions produce much more complex decisions, strategies and outcomes. In robotics, for example, they let a robot learn to stabilize, grasp and manipulate an object. These analytic methods mimic the way the brain learns complex task sequences through pleasurable or painful feedback signals that may arrive later in time -- essentially, how humans seek and achieve long-term positive results. Think about how you learned to ride a bike -- gradually mastering balance, braking, mounting and dismounting (and falling safely).
Clearly, analytics that can “think” well ahead and focus on the most favorable long-term outcomes are highly valued. That’s particularly true in the many operational decisions about customers that have long-term consequences and where loyalty is earned over repeated interactions with an organization.
High customer lifetime value and healthy, sustainable cash flow are both produced by a series of interactions: The business takes an action, the customer reacts, the business responds to the new state of the relationship with another action, the customer reacts … and so on. In this way, neuro-dynamic programming enables smart machines to think ahead -- potentially making moves early in the decision chain that may not appear optimal in the short run but lead to better decisions in the long term.
Another way to think about this concept is to consider a group of dumb software agents, similar to individual ants. The agents interact with their environment, rewarded or penalized against a small set of success criteria. Gradually, sequences of successful behavior emerge as the agents map out the risk and reward of various interrelated activities -- many paths are explored, and non-optimal ones are learned and abandoned in pursuit of the best chains of actions. Agents with few successes receive a low “fitness” score and die out, whereas those with many successful sequences score high and are allowed to reproduce, mutate, or combine with other high-scoring agents. In this way, the overall performance of the group increases.
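This select-and-mutate cycle can be sketched as a minimal evolutionary loop. The fitness landscape here is a stand-in assumption (an agent is scored on how many of its eight actions match a hidden rewarded sequence), purely to make the mechanics concrete:

```python
import random

random.seed(0)

# Toy fitness landscape (an assumption for illustration): an agent is a
# sequence of 8 binary actions, scored by how many match a hidden target.
TARGET = [1, 0, 1, 1, 0, 1, 0, 1]

def fitness(agent):
    return sum(a == t for a, t in zip(agent, TARGET))

def evolve(population, generations=30, mutation_rate=0.1):
    for _ in range(generations):
        # Low scorers die out; the top half survives and reproduces.
        population.sort(key=fitness, reverse=True)
        survivors = population[: len(population) // 2]
        children = [[1 - a if random.random() < mutation_rate else a
                     for a in parent]  # mutate around the parent's strategy
                    for parent in survivors]
        population = survivors + children
    return population

pop = [[random.randint(0, 1) for _ in range(8)] for _ in range(20)]
before = max(fitness(a) for a in pop)
pop = evolve(pop)
after = max(fitness(a) for a in pop)
```

Because unmutated survivors are always retained, the best score never regresses, while mutation keeps probing for better chains of actions.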
All the while, their environment is changing. So these agents not only act in the optimal way based on their current best “map of the world”; they also experiment to deal with changing conditions. Using probabilities, they make slight variations and mutate around the optimal strategies. As these activities result in rewards and penalties, the agents learn from the experiments and continually adjust to a changing fitness landscape.
As you can see in Figure 1, at any point in the sequence, the current state of the customer relationship is the result not only of the just-taken action, but also of the string of previous actions. Just as in a chess match, where a checkmate could be rooted 10 moves back ― or even in the first move ― the loss of a valuable customer may have started with actions taken months ago. To be successful, a business needs to understand and track this dynamic.
Figure 1: Learning to Make Better Decisions from Long-Term Results
Figure 2 depicts how these analytics learn about long-term effects by assigning credit for successful outcomes and penalties for unsuccessful ones. Although the action immediately before the outcome may receive a larger share of the credit or penalty, reinforcement learning principles require distributing some amount of reward or penalty across the entire sequence of actions.
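One simple way to realize this distribution -- a sketch of the general idea, not the article's exact scheme -- is to walk backward from the outcome, giving the last action the largest share and each earlier action a geometrically discounted one. The action names below are hypothetical:

```python
def assign_credit(actions, outcome_reward, gamma=0.8):
    """Return {action: credit}, walking backward from the outcome so the
    most recent action gets the biggest share and earlier ones get less."""
    credits = {}
    share = outcome_reward
    for action in reversed(actions):
        credits[action] = share
        share *= gamma  # earlier actions receive a discounted share
    return credits

# Hypothetical chain of business actions that ended in a good outcome.
sequence = ["offer_upgrade", "send_reminder", "grant_fee_waiver"]
credits = assign_credit(sequence, outcome_reward=1.0)
```

Every action in the chain receives some credit, so early moves that set up the win are not ignored.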
Figure 2: Predicting the Outcome of a Sequence of Actions and Reactions
During training with historical data, the model learns to associate value (total discounted rewards and penalties) with a customer state and with each of the potential actions the business can take at that particular point. After training, when presented with new data indicating a customer's state, the model can predict the long-term value of taking one action over another -- and select the next action most likely to maximize that long-term value.
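After training, action selection reduces to a lookup and a comparison. The sketch below assumes a simple tabular form; the customer states, action names, and values are illustrative, not from any real model:

```python
# Learned long-term values for (state, action) pairs -- hypothetical numbers
# standing in for the "total discounted rewards and penalties" above.
q_table = {
    ("at_risk", "offer_retention_discount"): 0.62,
    ("at_risk", "send_survey"): 0.41,
    ("at_risk", "do_nothing"): 0.15,
}

def best_next_action(state, actions, q_table):
    """Pick the action with the highest learned long-term value,
    defaulting to 0.0 for pairs never seen in training."""
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))

actions = ["offer_retention_discount", "send_survey", "do_nothing"]
choice = best_next_action("at_risk", actions, q_table)
```

In practice the table would be replaced by a neural network generalizing over many customer features, but the decision rule is the same.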
To improve business actions and results quickly, analytics must have a way to learn causal relationships from the data (this change in action A causes outcome Y to change in this specific way -- usually expressed as an expectation, because Y is uncertain).
To do this, the algorithm performs a controlled amount of deliberate experimentation. Customers in similar states with similar characteristics would normally be targeted with the same action under deterministic rules -- creating targeting bias and, with it, difficulty in identifying causal effects. Advanced reinforcement-learning algorithms instead assign a small fraction of similar customers to somewhat different actions. In neuro-dynamic programming, these miniature experiments are essential: they help the neural networks -- models that mimic brain function, using high-speed computers and algorithms that learn to recognize complex patterns of behavior across a large number of inputs -- understand the causal effect of actions on state-to-state transition probabilities, and thus on customer value.
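This "small fraction to different actions" idea is commonly implemented as an epsilon-greedy rule. The sketch below assumes hypothetical offer names and a 5% exploration rate; it simply shows how a small, controlled slice of customers is routed to non-default actions:

```python
import random

random.seed(42)

def choose_action(best_action, alternatives, epsilon=0.05):
    """With probability epsilon, explore a randomly chosen alternative
    action; otherwise exploit the currently best-known action."""
    if random.random() < epsilon:
        return random.choice(alternatives)
    return best_action

# Assign 10,000 similar customers; roughly 5% land in an experiment arm.
assignments = [choose_action("standard_offer", ["premium_offer", "no_offer"])
               for _ in range(10_000)]
explored = sum(a != "standard_offer" for a in assignments)
```

The experimental arm stays small enough to limit cost, but large enough that the resulting reward differences reveal causal effects rather than targeting bias.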