
Example of an Infinite Horizon MDP

An infinite horizon Markov Decision Process (MDP) is a mathematical framework for modeling sequential decision-making problems that continue indefinitely. The decision-maker's goal is to maximize the expected cumulative (discounted) reward over an infinite time horizon. Problems of this type arise in many fields, including operations research, economics, and artificial intelligence.

Infinite Horizon MDP Definition

An infinite horizon MDP is defined by the following components:

  • State space: A set of states, denoted by S, that the system can be in.
  • Action space: A set of actions, denoted by A, that the decision-maker can take in each state.
  • Transition function: A function P(s′ | s, a) that gives the probability of moving to state s′ when action a is taken in state s.
  • Reward function: A function R(s, a) that gives the reward received for taking action a in state s.
  • Discount factor: A parameter γ ∈ [0, 1) that weights future rewards relative to immediate ones; values closer to 1 place more emphasis on the long run.

The decision-maker’s goal is to find a policy, denoted by π, that maps each state to an action, such that the expected cumulative reward over an infinite time horizon is maximized.
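Formally, the value of a policy π starting from state s is the expected discounted return

    V^π(s) = E[ Σ_{t=0..∞} γ^t · R(s_t, a_t) | s_0 = s, a_t = π(s_t), s_{t+1} ~ P(· | s_t, a_t) ]

and an optimal policy π* is one for which V^{π*}(s) ≥ V^π(s) for every state s and every policy π. The discount factor γ < 1 ensures this infinite sum is finite whenever the rewards are bounded.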

Example of Infinite Horizon MDP

Consider a simple example of an infinite horizon MDP in which a decision-maker has to manage an inventory of products. The state space S is the current inventory level, which can be 0, 1, or 2. The action space A consists of two actions: order one new product or do nothing. The transition function P is defined as follows:

Current State | Action     | Next State | Probability
0             | Order      | 1          | 1.0
0             | Do nothing | 0          | 1.0
1             | Order      | 2          | 1.0
1             | Do nothing | 0          | 0.5
1             | Do nothing | 1          | 0.5
2             | Order      | 2          | 1.0
2             | Do nothing | 1          | 1.0

The reward function R is defined as follows:

State | Action     | Reward
0     | Order      | -10
0     | Do nothing | 0
1     | Order      | -10
1     | Do nothing | 10
2     | Order      | -10
2     | Do nothing | 10

The discount factor γ is set to 0.9. The decision-maker’s goal is to find a policy that maximizes the expected cumulative reward over an infinite time horizon.

💡 In this example, the decision-maker has to balance the cost of ordering new products with the reward of having products in stock. The optimal policy will depend on the specific values of the transition and reward functions, as well as the discount factor.
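To make the example concrete, the two tables above can be written down as a small data structure. The sketch below uses plain Python dictionaries; the names STATES, ACTIONS, P, R, and GAMMA are just one possible encoding chosen for illustration:

    # States: inventory levels 0, 1, 2. Actions: order one product or do nothing.
    STATES = [0, 1, 2]
    ACTIONS = ["order", "nothing"]
    GAMMA = 0.9  # discount factor

    # Transition function: P[(state, action)] -> {next_state: probability}
    P = {
        (0, "order"):   {1: 1.0},
        (0, "nothing"): {0: 1.0},
        (1, "order"):   {2: 1.0},
        (1, "nothing"): {0: 0.5, 1: 0.5},
        (2, "order"):   {2: 1.0},
        (2, "nothing"): {1: 1.0},
    }

    # Reward function: R[(state, action)] -> immediate reward
    R = {
        (0, "order"): -10, (0, "nothing"): 0,
        (1, "order"): -10, (1, "nothing"): 10,
        (2, "order"): -10, (2, "nothing"): 10,
    }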

Solving Infinite Horizon MDPs

There are several methods for solving infinite horizon MDPs, including:

  • Value iteration: An iterative algorithm that computes the optimal value function, which represents the expected cumulative reward for each state.
  • Policy iteration: An iterative algorithm that computes the optimal policy, which represents the best action to take in each state.
  • Linear programming: A method that formulates the infinite horizon MDP as a linear program, which can be solved using standard linear programming techniques.

These methods apply to infinite horizon MDPs with finite state and action spaces; a short value-iteration sketch for the inventory example follows. In practice, however, the state and action spaces may be large or even infinite, which calls for approximation methods or specialized algorithms.
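A minimal value-iteration sketch for the inventory example is shown below. It assumes the STATES, ACTIONS, P, R, and GAMMA structures from the previous section; the convergence tolerance is an arbitrary choice:

    def value_iteration(tol=1e-6):
        """Compute the optimal value function and a greedy policy for the inventory MDP."""
        V = {s: 0.0 for s in STATES}
        while True:
            delta = 0.0
            for s in STATES:
                # Bellman optimality backup: best immediate reward plus discounted future value
                q = {a: R[(s, a)] + GAMMA * sum(p * V[s2] for s2, p in P[(s, a)].items())
                     for a in ACTIONS}
                best = max(q.values())
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < tol:
                break
        # Read a greedy policy off the converged values
        policy = {s: max(ACTIONS, key=lambda a: R[(s, a)]
                         + GAMMA * sum(p * V[s2] for s2, p in P[(s, a)].items()))
                  for s in STATES}
        return V, policy

    V, policy = value_iteration()
    print(V)       # value of each inventory level under the optimal policy
    print(policy)  # best action in each state

Each sweep applies the Bellman optimality backup to every state, stopping once the largest change drops below the tolerance; the greedy policy with respect to the converged values is then optimal.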

Approximation Methods

There are several approximation methods for solving infinite horizon MDPs with large or infinite state and action spaces, including:

  • Function approximation: Approximating the value function or policy with a parametric function, such as a linear model or a neural network.
  • Sampling-based methods: Approximating the value function or policy by sampling trajectories through the state and action spaces, rather than sweeping over them exhaustively.
  • Model-based methods: Approximating the transition and reward functions with a learned model, such as a Gaussian process, and then planning against that model.

These approximation methods can be used to solve infinite horizon MDPs with large or infinite state and action spaces, but may require significant computational resources and careful tuning of hyperparameters.
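As a rough sketch of the first idea, the snippet below runs semi-gradient Q-learning with a linear function approximator on the inventory example, reusing STATES, ACTIONS, P, R, and GAMMA from above. The one-hot feature map, learning rate, exploration rate, and step count are illustrative assumptions; with one-hot features this reduces to the tabular case, and a realistic application would substitute richer features or a neural network:

    import random
    import numpy as np

    def features(s, a):
        # One-hot encoding of the (state, action) pair: 3 states x 2 actions = 6 features.
        x = np.zeros(len(STATES) * len(ACTIONS))
        x[s * len(ACTIONS) + ACTIONS.index(a)] = 1.0
        return x

    def sample_next(s, a):
        # Sample a successor state from the transition table P.
        next_states, probs = zip(*P[(s, a)].items())
        return random.choices(next_states, probs)[0]

    w = np.zeros(len(STATES) * len(ACTIONS))  # weights of the linear Q approximation
    alpha, epsilon = 0.1, 0.1                 # learning rate and exploration rate (assumed values)

    s = 0
    for _ in range(50_000):
        # Epsilon-greedy action selection under the current approximation
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda b: w @ features(s, b))
        s_next = sample_next(s, a)
        # Semi-gradient Q-learning update toward the one-step TD target
        target = R[(s, a)] + GAMMA * max(w @ features(s_next, b) for b in ACTIONS)
        w += alpha * (target - w @ features(s, a)) * features(s, a)
        s = s_next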

What is the difference between a finite horizon MDP and an infinite horizon MDP?

A finite horizon MDP is optimized over a fixed, finite number of decision steps, whereas an infinite horizon MDP is optimized over an unbounded number of steps. In the finite horizon case the optimal policy generally depends on how many steps remain, and no discounting is required; in the discounted infinite horizon case a discount factor γ < 1 keeps the cumulative reward finite, and there exists an optimal policy that is stationary, i.e., one that depends only on the current state.

How do you solve an infinite horizon MDP with a large state and action space?

There are several methods for solving infinite horizon MDPs with large state and action spaces, including function approximation, sampling-based methods, and model-based methods. These methods can be used to approximate the value function or policy, but may require significant computational resources and careful tuning of hyperparameters.
