Markov Decision Processes#
(MDPs: A Structured Framework for Sequential Decision‑Making)#
Markov Decision Processes (MDPs) form the mathematical backbone of much of reinforcement learning, operations research, and decision‑analytic modeling. They formalize how an agent interacts with an environment using states, actions, and stochastic transitions that obey the Markov property.
1. Basic Intuition#
An MDP is a four‑tuple ((S, A, T, R)) describing a sequential decision‑making problem where:
- S – Finite (or countable) set of states.
- A – For each (s\in S), a finite set of available actions (A(s)).
- T – Transition function (T(s,a,s') = P(s' \mid s,a)), giving the probability of moving to state (s') after taking action (a) in state (s).
- R – Reward function (R(s,a,s’)), a scalar value returned after the transition.
The agent’s goal is to choose actions over time to maximize some notion of cumulative reward (e.g., expected total discounted reward).
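To make the four‑tuple concrete, the sketch below encodes a tiny two‑state MDP as plain Python dictionaries. The state names, actions, transition probabilities, and rewards are illustrative assumptions, not drawn from any standard problem.

```python
# A minimal two-state, two-action MDP; all numbers are chosen for illustration.

S = ["low", "high"]                      # state set S
A = {"low": ["wait", "work"],            # available actions A(s)
     "high": ["wait", "work"]}

# T[s][a] lists (next_state, probability) pairs, i.e. P(s' | s, a).
T = {
    "low":  {"wait": [("low", 1.0)],
             "work": [("high", 0.7), ("low", 0.3)]},
    "high": {"wait": [("high", 0.9), ("low", 0.1)],
             "work": [("high", 1.0)]},
}

# R[s][a][s'] is the scalar reward received on the transition (s, a, s').
R = {
    "low":  {"wait": {"low": 0.0},
             "work": {"high": 2.0, "low": -1.0}},
    "high": {"wait": {"high": 1.0, "low": 0.0},
             "work": {"high": 3.0}},
}
```

Dictionaries keep the model explicit and readable; the solver sketches later in the section switch to index‑based NumPy arrays, which is the more common tabular representation.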
2. The Markov Property#
The Markov property states that the probability of reaching the next state (and the associated reward) depends only on the current state and action, not on the past history:
[ P(s_{t+1}, r_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots) = P(s_{t+1}, r_{t+1} \mid s_t, a_t). ]
This memoryless property simplifies analysis and enables dynamic programming solutions.
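As a small illustration of the property, the sketch below samples the next state from (P(s' \mid s, a)) using only the current state and action; a history argument is accepted but deliberately ignored. The transition table is a made‑up assumption.

```python
import random

# P(s' | s, a) for a toy problem; the entries are illustrative assumptions.
T = {("low", "work"):  [("high", 0.7), ("low", 0.3)],
     ("high", "wait"): [("high", 0.9), ("low", 0.1)]}

def step(state, action, history=None):
    """Sample the next state. Only (state, action) is consulted; the
    history argument is ignored, which is exactly what the Markov
    property licenses."""
    next_states, probs = zip(*T[(state, action)])
    return random.choices(next_states, weights=probs, k=1)[0]

print(step("low", "work"))                              # uses only ("low", "work")
print(step("low", "work", history=[("high", "wait")]))  # same distribution; history is irrelevant
```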
3. Key Components in Detail#
| Component | Role | Example |
|---|---|---|
| State | Full description of the environment at a moment. | Position of a robot in a grid. |
| Action | Decision an agent can make. | Move north, turn left. |
| Transition | Model of uncertainty. | Probability that a north move succeeds or the robot slips. |
| Reward | Objective signal. | +1 for reaching goal, –1 for hitting an obstacle. |
Policy (\pi(s)): Mapping from states to actions (deterministic or stochastic).
Value function (V^\pi(s)): Expected cumulative reward following policy (\pi) from state (s).
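Because (V^\pi) satisfies the linear Bellman expectation equation (V^\pi = r_\pi + \gamma P_\pi V^\pi), it can be computed exactly for small tabular problems. The NumPy sketch below does so for a hypothetical two‑state MDP under a fixed policy; the matrix (P_\pi), reward vector (r_\pi), and discount factor are assumed values.

```python
import numpy as np

gamma = 0.9  # discount factor (assumed)

# Under a fixed policy pi, the MDP reduces to a Markov chain with
# P_pi[s, s'] = P(s' | s, pi(s)) and expected one-step rewards r_pi[s].
P_pi = np.array([[0.3, 0.7],    # transitions from state 0 under pi(0)
                 [0.1, 0.9]])   # transitions from state 1 under pi(1)
r_pi = np.array([1.1, 0.9])     # illustrative expected rewards

# V^pi is the unique solution of (I - gamma * P_pi) V = r_pi.
V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print(V_pi)   # expected discounted return from each state under pi
```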
4. Solving an MDP#
4.1 Value Iteration#
Iteratively update state values using the Bellman optimality backup:
[ V_{k+1}(s) = \max_a \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s,a,s') + \gamma V_k(s') \bigr], ]
where (\gamma \in [0,1)) is the discount factor. After convergence, an optimal policy is derived greedily:
[ \pi^*(s) = \arg\max_a \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s,a,s') + \gamma V^*(s') \bigr]. ]
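A compact NumPy sketch of this backup on a small tabular MDP follows, including the greedy policy extraction; the transition tensor, rewards, discount factor, and stopping tolerance are all illustrative assumptions.

```python
import numpy as np

gamma, tol = 0.9, 1e-8

# P[a, s, s'] = P(s' | s, a); R[a, s, s'] = reward for that transition.
# All numbers are made up for illustration.
P = np.array([[[1.0, 0.0], [0.1, 0.9]],    # action 0
              [[0.3, 0.7], [0.0, 1.0]]])   # action 1
R = np.array([[[0.0, 0.0], [0.0, 1.0]],    # action 0
              [[-1.0, 2.0], [0.0, 3.0]]])  # action 1

V = np.zeros(P.shape[1])
while True:
    # Q[a, s] = sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * V(s'))
    Q = np.einsum("ast,ast->as", P, R + gamma * V)
    V_new = Q.max(axis=0)                  # Bellman optimality backup
    if np.max(np.abs(V_new - V)) < tol:
        break
    V = V_new

pi_star = Q.argmax(axis=0)                 # greedy policy extraction
print(V, pi_star)
```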
4.2 Policy Iteration#
1. Policy Evaluation: Compute (V^\pi) for a fixed policy.
2. Policy Improvement: Greedy update against (V^\pi).
Repeat until the policy stops changing.
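A hedged sketch of that loop on the same kind of toy model, with the evaluation step done by an exact linear solve; all model numbers are assumptions.

```python
import numpy as np

gamma = 0.9
# P[a, s, s'] and R[a, s, s'] as in the value-iteration sketch (illustrative numbers).
P = np.array([[[1.0, 0.0], [0.1, 0.9]],
              [[0.3, 0.7], [0.0, 1.0]]])
R = np.array([[[0.0, 0.0], [0.0, 1.0]],
              [[-1.0, 2.0], [0.0, 3.0]]])
n_states = P.shape[1]
idx = np.arange(n_states)

pi = np.zeros(n_states, dtype=int)          # start from an arbitrary policy
while True:
    # Policy evaluation: solve (I - gamma * P_pi) V = r_pi exactly.
    P_pi = P[pi, idx]                                   # rows P(. | s, pi(s))
    r_pi = np.einsum("st,st->s", P_pi, R[pi, idx])      # expected one-step rewards
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

    # Policy improvement: act greedily with respect to V^pi.
    Q = np.einsum("ast,ast->as", P, R + gamma * V)
    pi_new = Q.argmax(axis=0)
    if np.array_equal(pi_new, pi):          # policy stable -> optimal
        break
    pi = pi_new

print(pi, V)
```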
5. Applications#
| Domain | Why MDPs Work |
|---|---|
| Robotics | Navigation, grasping tasks under uncertainty. |
| Finance | Portfolio allocation with stochastic markets. |
| Game AI | Planning in board games (e.g., chess, Go). |
| Healthcare | Treatment planning, sequential therapies. |
6. Extensions#
| Extension | Motivation |
|---|---|
| Partially Observable MDP (POMDP) | When state is hidden; uses belief states. |
| Continuous MDPs | Continuous state/action spaces; solved by approximate dynamic programming. |
| Multi‑Agent MDPs | Multiple agents with joint actions; leads to stochastic games. |
7. Summary#
A Markov Decision Process offers a clean, tractable abstraction for problems where decisions have sequential and probabilistic effects. By pairing the Markov property with explicit rewards, agents can compute optimal strategies through well‑understood algorithms such as value iteration, policy iteration, or Q‑learning. The framework’s versatility, from robotic control to economic modeling, makes MDPs a cornerstone of modern decision theory.