<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Inverse-Reinforcement-Learning on Corebaseit — POS · EMV · Payments · AI</title><link>https://corebaseit.com/tags/inverse-reinforcement-learning/</link><description>Recent content in Inverse-Reinforcement-Learning on Corebaseit — POS · EMV · Payments · AI</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><managingEditor>contact@corebaseit.com (Vincent Bevia)</managingEditor><webMaster>contact@corebaseit.com (Vincent Bevia)</webMaster><lastBuildDate>Mon, 20 Apr 2026 10:00:00 +0100</lastBuildDate><atom:link href="https://corebaseit.com/tags/inverse-reinforcement-learning/index.xml" rel="self" type="application/rss+xml"/><item><title>Markov Decision Processes: The Mathematical Foundation of Reinforcement Learning</title><link>https://corebaseit.com/corebaseit_posts_in_review/series/markov-decision-processes-rlhf-and-agentic-ai_rl_part1/</link><pubDate>Mon, 20 Apr 2026 10:00:00 +0100</pubDate><author>contact@corebaseit.com (Vincent Bevia)</author><guid>https://corebaseit.com/corebaseit_posts_in_review/series/markov-decision-processes-rlhf-and-agentic-ai_rl_part1/</guid><description>&lt;p>&lt;strong>The Markov Decision Process (MDP) is the standard formal object for sequential decision-making under uncertainty.&lt;/strong> It separates &lt;em>problem definition&lt;/em> — states, actions, how the world evolves, what you want to optimize — from &lt;em>solution methods&lt;/em> (value iteration, Q-learning, policy gradients, and their deep variants). That separation is why the same vocabulary shows up across robotics, games, RLHF-tuned language models, and tool-using agents.&lt;/p>
&lt;p>I keep coming back to this when people treat LLMs, RL, and “agents” as unrelated product categories. Implementations differ, but &lt;strong>state, action, reward, policy, and value&lt;/strong> recur for a reason: a large class of systems is still answering “what should I do next to maximize something, given what I know now?”&lt;/p>
&lt;p>This post builds the MDP core with enough precision to be useful — Markov property, the five-tuple, policies and Bellman equations, how classical methods differ, and &lt;strong>inverse reinforcement learning&lt;/strong> — then connects it to &lt;strong>LLMs, RLHF, DPO, and agentic stacks&lt;/strong>. It is not universal: many deployed models are one-shot predictors or rankers with no explicit sequential RL loop. Where the MDP applies, the mapping is operational, not metaphorical.&lt;/p>
&lt;hr>
&lt;h2 id="the-markov-property-and-the-five-tuple">The Markov Property and the Five-Tuple
&lt;/h2>&lt;p>At the center is the &lt;strong>Markov property&lt;/strong> (memorylessness): the next state depends on the recent past only through the current state and action:&lt;/p>
&lt;p>$$
P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots) = P(s_{t+1} \mid s_t, a_t).
$$&lt;/p>
&lt;p>So the &lt;strong>state&lt;/strong> must summarize whatever matters for the future; if something important is missing from (s_t), the model is misspecified and you are really in a &lt;strong>POMDP&lt;/strong> (partial observability) — more on that when we get to context windows.&lt;/p>
&lt;p>An MDP is usually written as ((\mathcal{S}, \mathcal{A}, P, R, \gamma)):&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Symbol&lt;/th>
&lt;th>Component&lt;/th>
&lt;th>Role&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>(\mathcal{S})&lt;/td>
&lt;td>State space&lt;/td>
&lt;td>Configurations the agent can be in&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>(\mathcal{A})&lt;/td>
&lt;td>Action space&lt;/td>
&lt;td>Choices (discrete or continuous)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>(P(s' \mid s, a))&lt;/td>
&lt;td>Transition law&lt;/td>
&lt;td>Dynamics: where you land after ((s,a))&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>(R)&lt;/td>
&lt;td>Reward&lt;/td>
&lt;td>Immediate signal, often (R(s,a)) or (R(s,a,s'))&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>(\gamma)&lt;/td>
&lt;td>Discount&lt;/td>
&lt;td>(\gamma \in [0,1]) weights future reward vs. now&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The &lt;strong>objective&lt;/strong> is a policy (\pi) that maximizes &lt;strong>expected discounted return&lt;/strong> ( \mathbb{E}\big[\sum_t \gamma^t R_t\big] ). An &lt;strong>optimal policy&lt;/strong> (\pi^*) achieves the best achievable value from each state (under the usual regularity conditions).&lt;/p>
&lt;p>&lt;strong>State and action design&lt;/strong> is an engineering problem: too little information in (s) and the Markov assumption is false; too much and you fight the &lt;strong>curse of dimensionality&lt;/strong>. &lt;strong>Actions&lt;/strong> drive algorithm choice — small discrete spaces admit tabular methods; huge discrete vocabularies (e.g. tens of thousands of tokens) or continuous control push you toward &lt;strong>function approximation&lt;/strong> and &lt;strong>deep RL&lt;/strong>.&lt;/p>
&lt;p>&lt;strong>(\gamma)&lt;/strong> sets the effective horizon: (\gamma = 0) is myopic (only immediate reward); (\gamma) close to (1) cares about the long run (in infinite-horizon settings, (\gamma &amp;lt; 1) keeps returns bounded). Pure (\gamma = 1) is typical for &lt;strong>finite-horizon&lt;/strong> episodic problems without discounting.&lt;/p>
&lt;p>&lt;strong>Rewards&lt;/strong> are the lever everyone feels in production: misspecify them and you get &lt;strong>reward hacking&lt;/strong> — policies that maximize the signal you wrote, not the outcome you wanted. That story continues unchanged in RLHF.&lt;/p>
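&lt;p>To make the five-tuple concrete, here is a minimal Python sketch: a hypothetical two-state MDP with made-up states, actions, transition probabilities, and rewards, plus the discounted-return sum from the objective. None of the names correspond to a real system.&lt;/p>

```python
# Hypothetical two-state MDP: the five-tuple as plain Python data.
# States "idle"/"busy" and actions "work"/"rest" are illustrative only.
S = ["idle", "busy"]
A = ["work", "rest"]

# Transition law: P[s][a] maps to a dict of next-state probabilities.
P = {
    "idle": {"work": {"busy": 0.9, "idle": 0.1}, "rest": {"idle": 1.0}},
    "busy": {"work": {"busy": 0.7, "idle": 0.3}, "rest": {"idle": 1.0}},
}

# Reward R(s, a): immediate signal for taking action a in state s.
R = {
    "idle": {"work": 0.0, "rest": 0.0},
    "busy": {"work": 1.0, "rest": 0.0},
}

gamma = 0.9  # discount: weights future reward against reward now

def discounted_return(rewards, gamma):
    """Discounted return for one realized reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return([1.0, 1.0, 1.0], 0.9))  # close to 2.71 (1 + 0.9 + 0.81)
```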
&lt;hr>
&lt;h2 id="policies-value-functions-and-the-bellman-equation">Policies, Value Functions, and the Bellman Equation
&lt;/h2>&lt;p>A &lt;strong>policy&lt;/strong> (\pi) maps states to actions (deterministic: (a = \pi(s))) or distributions (stochastic: (\pi(a \mid s))). To rank policies, RL uses &lt;strong>value functions&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>State value&lt;/strong> (V^\pi(s)): expected return starting in (s) and following (\pi).&lt;/li>
&lt;li>&lt;strong>Action-value&lt;/strong> (Q^\pi(s,a)): expected return from taking (a) in (s), then following (\pi).&lt;/li>
&lt;/ul>
&lt;p>They satisfy &lt;strong>Bellman consistency&lt;/strong>. For example, for (V^\pi):&lt;/p>
&lt;p>$$
V^\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s,a) \big[ R(s,a,s') + \gamma V^\pi(s') \big].
$$&lt;/p>
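&lt;p>The consistency equation doubles as an algorithm: iterate its right-hand side as a fixed-point update (iterative policy evaluation). A tabular sketch, where the dict layout for (P), (R), and (\pi) is an assumption of this example:&lt;/p>

```python
# Iterative policy evaluation: apply the Bellman consistency equation as a
# fixed-point update until the values stop changing. Assumed layout:
# P[s][a][s2] = prob, R[s][a][s2] = reward, pi[s][a] = action probability.
def policy_evaluation(S, A, P, R, pi, gamma=0.9, tol=1e-8):
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            v_new = 0.0
            for a in A:
                for s2, p in P[s][a].items():
                    v_new += pi[s][a] * p * (R[s][a][s2] + gamma * V[s2])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if tol > delta:
            return V

# Usage: one self-looping state with reward 1.0 and gamma 0.5; the fixed
# point V = 1 + 0.5 V gives V = 2.
V = policy_evaluation(["s"], ["a"],
                      {"s": {"a": {"s": 1.0}}},
                      {"s": {"a": {"s": 1.0}}},
                      {"s": {"a": 1.0}}, gamma=0.5)
print(round(V["s"], 6))  # 2.0
```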
&lt;p>&lt;strong>Optimal&lt;/strong> values (V^*), (Q^*) obey the &lt;strong>Bellman optimality&lt;/strong> recursion and are the target of &lt;strong>dynamic programming&lt;/strong> when (P) and (R) are known. When the model is unknown, you fall back to &lt;strong>sample-based&lt;/strong> methods.&lt;/p>
&lt;hr>
&lt;h2 id="solution-methods-high-level">Solution Methods (High Level)
&lt;/h2>&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Family&lt;/th>
&lt;th>Idea&lt;/th>
&lt;th>When it fits&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>Dynamic programming&lt;/strong>&lt;/td>
&lt;td>Value / policy iteration using (P)&lt;/td>
&lt;td>Model known, moderate (|\mathcal{S}|)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Monte Carlo&lt;/strong>&lt;/td>
&lt;td>Return estimates from full episodes&lt;/td>
&lt;td>Episodic, no step-by-step model&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Temporal difference&lt;/strong>&lt;/td>
&lt;td>Bootstrap from current estimates (e.g. Q-learning, SARSA)&lt;/td>
&lt;td>Online learning, unknown model&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Deep RL&lt;/strong>&lt;/td>
&lt;td>Neural nets for (Q) or (\pi) (DQN, PPO, …)&lt;/td>
&lt;td>Large or continuous state spaces&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Deep RL does not change the MDP; it changes how you represent and optimize &lt;strong>value&lt;/strong> and &lt;strong>policy&lt;/strong> when tabulation is impossible — including settings as large as language.&lt;/p>
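&lt;p>As a concrete instance of the dynamic-programming row, a tabular value-iteration sketch: sweep the Bellman optimality backup until values converge, then read off the greedy policy. The dict layout (P[s][a][s2] for transitions, R[s][a][s2] for rewards) is an assumption of this example.&lt;/p>

```python
# Value iteration: repeated Bellman optimality backups, then a greedy
# policy extraction. Tabular sketch with an illustrative dict layout.
def value_iteration(S, A, P, R, gamma=0.9, tol=1e-8):
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            # One-step backup for every action, then take the best.
            q = [sum(p * (R[s][a][s2] + gamma * V[s2])
                     for s2, p in P[s][a].items()) for a in A]
            v_new = max(q)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if tol > delta:
            break
    # Greedy policy: argmax of the one-step backup under the final V.
    pi = {s: max(A, key=lambda a: sum(p * (R[s][a][s2] + gamma * V[s2])
                                      for s2, p in P[s][a].items()))
          for s in S}
    return V, pi

# Usage: one state, two self-loop actions paying 1.0 vs. 0.0, gamma 0.5.
V, pi = value_iteration(
    ["s"], ["good", "bad"],
    {"s": {"good": {"s": 1.0}, "bad": {"s": 1.0}}},
    {"s": {"good": {"s": 1.0}, "bad": {"s": 0.0}}},
    gamma=0.5)
print(pi["s"])  # good
```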
&lt;hr>
&lt;h2 id="inverse-reinforcement-learning">Inverse Reinforcement Learning
&lt;/h2>&lt;p>&lt;strong>Forward RL:&lt;/strong> given (R) (and dynamics), find a good (\pi). &lt;strong>Inverse RL (IRL)&lt;/strong> flips the problem: given &lt;strong>demonstrations&lt;/strong> from a (near-)expert, infer an (R) that makes those trajectories rational. That matters when rewards are hard to write down but behavior is easy to show — classic examples include imitation-style control and parsing “what the human cared about” from what they did.&lt;/p>
&lt;p>&lt;strong>Maximum-entropy IRL&lt;/strong> (Ziebart et al.) models the expert as stochastic but biased toward high reward: trajectories are scored by accumulated reward features, and the probability of a trajectory takes a Boltzmann form, with a &lt;strong>partition function&lt;/strong> coupling normalization to the underlying MDP structure. The details are involved; the takeaway for this post is that &lt;strong>IRL is still built on the same sequential decision formalism&lt;/strong> — you are inferring preferences compatible with observed paths, not escaping the MDP language.&lt;/p>
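&lt;p>A toy illustration of that Boltzmann form, with enumerated trajectories and made-up reward totals. Real MaxEnt IRL computes the partition function by dynamic programming over the MDP dynamics, not by enumeration; this only shows the shape of the distribution.&lt;/p>

```python
import math

# Boltzmann weighting over a tiny, hand-enumerated trajectory set:
# P(tau) is proportional to exp(total reward along tau). The names and
# reward totals are invented for illustration.
trajectory_rewards = {"tau1": 3.0, "tau2": 2.0, "tau3": 0.5}

Z = sum(math.exp(r) for r in trajectory_rewards.values())  # partition function
probs = {tau: math.exp(r) / Z for tau, r in trajectory_rewards.items()}

# Higher-reward trajectories are exponentially more likely but not certain,
# matching the "stochastic but reward-seeking expert" framing.
print(probs)
```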
&lt;hr>
&lt;h2 id="where-the-markov-assumption-meets-llms">Where the Markov Assumption Meets LLMs
&lt;/h2>&lt;p>In &lt;strong>autoregressive generation&lt;/strong>, a standard idealization is: &lt;strong>state&lt;/strong> = prompt plus all generated tokens so far; &lt;strong>action&lt;/strong> = next token; &lt;strong>transition&lt;/strong> = append token (deterministic at the string level); &lt;strong>policy&lt;/strong> = conditional distribution from the model. Then the next distribution depends only on the prefix — &lt;strong>Markov in that state representation&lt;/strong>.&lt;/p>
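&lt;p>That idealization can be written down directly. In this sketch the policy is a stub returning a fixed distribution over a toy vocabulary; a real model would condition on the entire prefix.&lt;/p>

```python
import random

# LLM-as-policy sketch: state = token prefix, action = next token,
# transition = append. VOCAB and the stub policy are invented for the example.
VOCAB = ["the", "cat", "sat", "END"]

def policy(prefix):
    # Stub conditional distribution pi(a | s); uniform for illustration.
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}

def rollout(prompt, max_steps=20, seed=0):
    rng = random.Random(seed)
    state = list(prompt)          # state: prompt plus generated tokens so far
    for _ in range(max_steps):
        dist = policy(state)
        action = rng.choices(list(dist), weights=list(dist.values()))[0]
        state.append(action)      # deterministic transition: append the token
        if action == "END":
            break
    return state

print(rollout(["the"]))
```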
&lt;p>The usual engineering gap: &lt;strong>true&lt;/strong> conversational or task state may live outside the window or never be observed. That is &lt;strong>partial observability&lt;/strong> again (POMDP / belief-state view). “Lost context” is often &lt;strong>finite window&lt;/strong> or &lt;strong>wrong state summary&lt;/strong>, not a random tone failure — which is why &lt;strong>memory, retrieval, and tool traces&lt;/strong> are architecture, not cosmetics.&lt;/p>
&lt;hr>
&lt;h2 id="rlhf-dpo-and-the-same-sequential-picture">RLHF, DPO, and the Same Sequential Picture
&lt;/h2>&lt;p>&lt;strong>RLHF&lt;/strong> (InstructGPT-style): the LM is a &lt;strong>policy&lt;/strong> over tokens; a &lt;strong>reward model&lt;/strong> from human preferences scores completions; optimization (often &lt;strong>PPO-class&lt;/strong> in the original stack) increases reward while a &lt;strong>KL penalty&lt;/strong> to a reference policy limits drift. Mapping:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>MDP role&lt;/th>
&lt;th>Typical RLHF instantiation&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>State&lt;/td>
&lt;td>Prompt + generated prefix&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Action&lt;/td>
&lt;td>Next token (or chunk)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Transition&lt;/td>
&lt;td>Append; dynamics deterministic given action choice&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Reward&lt;/td>
&lt;td>Learned preference score (minus KL / auxiliary terms)&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
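&lt;p>The shape of the resulting training signal, in a deliberately simplified sequence-level sketch. Real stacks compute the KL term per token over the rollout; the numbers and function name here are illustrative.&lt;/p>

```python
# RLHF training-signal sketch: learned preference score minus a KL penalty
# toward the reference policy. Sequence-level approximation for illustration.
def rlhf_reward(preference_score, logp_policy, logp_reference, beta=0.1):
    # log pi(y|x) minus log pi_ref(y|x): this sample's KL contribution.
    kl_term = logp_policy - logp_reference
    return preference_score - beta * kl_term

# A completion the reward model likes, but which has drifted from the
# reference policy, is partially penalized:
print(rlhf_reward(2.0, -10.0, -14.0, beta=0.1))  # 2.0 - 0.1 * 4.0 = 1.6
```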
&lt;p>Framed this way, alignment pain is largely &lt;strong>reward specification&lt;/strong> and &lt;strong>optimization under misspecified proxy rewards&lt;/strong> — the same failure mode family as classical reward hacking. OpenAI’s &lt;strong>GPT-4o sycophancy&lt;/strong> rollback (April 2025) is a concrete example of what happens when short-term preference signals diverge from what you want long term. See also &lt;a class="link" href="https://corebaseit.com/posts/ai-sycophancy/" >AI Sycophancy&lt;/a>.&lt;/p>
&lt;p>&lt;strong>DPO&lt;/strong> (Direct Preference Optimization) and related methods &lt;strong>avoid an explicit online RL loop&lt;/strong> by optimizing from pairwise preferences in a way derived from the RLHF objective — still &lt;strong>preference-driven alignment&lt;/strong>, but not “PPO on tokens” in implementation. The MDP is still the right mental model for &lt;em>what&lt;/em> is being aligned (sequential decisions under a goal), even when the &lt;em>optimizer&lt;/em> is not vanilla policy gradients.&lt;/p>
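&lt;p>A sketch of the DPO objective for a single preference pair, assuming the inputs are summed token log-probabilities under the policy and the reference model (the scalar values below are illustrative):&lt;/p>

```python
import math

# DPO loss for one preference pair: negative log-sigmoid of the scaled
# difference between the policy/reference log-ratios of the chosen and
# rejected completions.
def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = beta * ((logp_chosen - ref_chosen)
                     - (logp_rejected - ref_rejected))
    # -log sigmoid(margin), written stably as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))

# When the policy prefers the chosen completion more strongly than the
# reference does, the margin is positive and the loss drops below log(2),
# its value at zero margin:
print(dpo_loss(-5.0, -9.0, -7.0, -8.0, beta=1.0))
```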
&lt;hr>
&lt;h2 id="a-practical-decision-landscape-not-five-silos">A Practical Decision Landscape (Not Five Silos)
&lt;/h2>&lt;p>The field is messier than any chart, but this is a useful &lt;strong>lens&lt;/strong> for choosing tooling:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Situation&lt;/th>
&lt;th>Common approach&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Known reward, safe exploration&lt;/td>
&lt;td>Forward RL (e.g. PPO, Q-learning variants)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Expert demos, unclear reward&lt;/td>
&lt;td>IRL / imitation / inverse-optimal-control style methods&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Broad open-ended language capability&lt;/td>
&lt;td>Pretrained LM (supervised / next-token objective)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Align to human taste or policy&lt;/td>
&lt;td>RLHF, DPO-class preference training, or hybrids&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Multi-step tools + retrieval + planning&lt;/td>
&lt;td>Agentic systems (often LM policy + search / ReAct-style loops)&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>&lt;strong>Agentic systems&lt;/strong> stack &lt;strong>LLMs&lt;/strong> (policy / world-model substrate), &lt;strong>search or tree exploration&lt;/strong>, &lt;strong>RAG&lt;/strong> (state enrichment), and &lt;strong>tools&lt;/strong> (expanded actions). Under the hood it is still: &lt;strong>maintain state, choose actions, observe outcomes, repeat&lt;/strong> — with stochasticity from both the model and the environment.&lt;/p>
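&lt;p>That loop, reduced to its MDP skeleton with stubbed-out components. Everything here is invented to show the control flow, not a real framework:&lt;/p>

```python
# Agentic control flow as an MDP loop: maintain state, choose an action
# (LM call, tool call, retrieval), observe the outcome, repeat.
def run_agent(goal, choose_action, execute, is_done, max_steps=10):
    state = {"goal": goal, "trace": []}   # state: goal + accumulated trace
    for _ in range(max_steps):
        action = choose_action(state)     # policy: LM / planner decides
        observation = execute(action)     # environment or tool responds
        state["trace"].append((action, observation))  # state enrichment
        if is_done(state):
            break
    return state

# Toy run: "search" once, then "answer" ends the episode.
def choose(state):
    return "answer" if state["trace"] else "search"

result = run_agent("demo",
                   choose,
                   execute=lambda a: f"result of {a}",
                   is_done=lambda s: s["trace"][-1][0] == "answer")
print([a for a, _ in result["trace"]])  # ['search', 'answer']
```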
&lt;hr>
&lt;h2 id="the-engineering-takeaway">The Engineering Takeaway
&lt;/h2>&lt;p>You do not need to re-derive Bellman on a whiteboard every sprint. You &lt;strong>do&lt;/strong> need to:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Separate problem definition from algorithms&lt;/strong> — clarify (\mathcal{S}, \mathcal{A}, R) before debating PPO vs. DPO vs. prompts.&lt;/li>
&lt;li>&lt;strong>Treat alignment bugs&lt;/strong> as &lt;strong>reward–policy interaction&lt;/strong>, not vague “personality.”&lt;/li>
&lt;li>&lt;strong>Design memory and retrieval&lt;/strong> as &lt;strong>state construction&lt;/strong> when Markov fails.&lt;/li>
&lt;li>&lt;strong>Ask what each agent demo actually optimizes&lt;/strong> — implicit reward, success predicate, or human-in-the-loop only.&lt;/li>
&lt;/ul>
&lt;p>The MDP is not a graduate-school ornament. It is the backbone that makes much of &lt;strong>RL debuggable&lt;/strong> and much of &lt;strong>sequential AI&lt;/strong> legible — whether or not your README says “Markov.”&lt;/p>
&lt;p>I use this as a mental map when I design or debug these systems — one view of how the MDP core connects to RLHF, IRL, and agentic design choices in practice.&lt;/p>
&lt;p style="text-align: center;">
&lt;img src="https://corebaseit.com/diagrams/MarkovMindMap.png" alt="Markov Mind Map" style="max-width: 900px; width: 100%;" />
&lt;/p>
&lt;p>&lt;em>Figure: MDP mind map — from formal definition to RLHF and agentic system mapping.&lt;/em>&lt;/p>
&lt;hr>
&lt;h2 id="references">References
&lt;/h2>&lt;ul>
&lt;li>Bellman, R. &lt;em>Dynamic Programming&lt;/em>. Princeton University Press, 1957.&lt;/li>
&lt;li>Puterman, M. L. &lt;em>Markov Decision Processes: Discrete Stochastic Dynamic Programming&lt;/em>. Wiley, 1994.&lt;/li>
&lt;li>Sutton, R. S., &amp;amp; Barto, A. G. &lt;em>Reinforcement Learning: An Introduction&lt;/em> (2nd ed.). MIT Press, 2018. &lt;a class="link" href="http://incompleteideas.net/book/the-book-2nd.html" target="_blank" rel="noopener"
>incompleteideas.net&lt;/a>&lt;/li>
&lt;li>Ng, A. Y., &amp;amp; Russell, S. &amp;ldquo;Algorithms for inverse reinforcement learning.&amp;rdquo; &lt;em>ICML&lt;/em>, 2000.&lt;/li>
&lt;li>Ziebart, B. D., Maas, A. L., Bagnell, J. A., &amp;amp; Dey, A. K. &amp;ldquo;Maximum entropy inverse reinforcement learning.&amp;rdquo; &lt;em>AAAI&lt;/em>, 2008.&lt;/li>
&lt;li>Ouyang, L. et al. &amp;ldquo;Training language models to follow instructions with human feedback.&amp;rdquo; &lt;em>NeurIPS&lt;/em>, 2022. &lt;a class="link" href="https://arxiv.org/abs/2203.02155" target="_blank" rel="noopener"
>arxiv.org/abs/2203.02155&lt;/a>&lt;/li>
&lt;li>Schulman, J. et al. &amp;ldquo;Proximal Policy Optimization Algorithms.&amp;rdquo; 2017. &lt;a class="link" href="https://arxiv.org/abs/1707.06347" target="_blank" rel="noopener"
>arxiv.org/abs/1707.06347&lt;/a>&lt;/li>
&lt;li>Rafailov, R. et al. &amp;ldquo;Direct Preference Optimization: Your Language Model is Secretly a Reward Model.&amp;rdquo; &lt;em>NeurIPS&lt;/em>, 2023. &lt;a class="link" href="https://arxiv.org/abs/2305.18290" target="_blank" rel="noopener"
>arxiv.org/abs/2305.18290&lt;/a>&lt;/li>
&lt;li>Yao, S. et al. &amp;ldquo;ReAct: Synergizing Reasoning and Acting in Language Models.&amp;rdquo; &lt;em>ICLR&lt;/em>, 2023. &lt;a class="link" href="https://arxiv.org/abs/2210.03629" target="_blank" rel="noopener"
>arxiv.org/abs/2210.03629&lt;/a>&lt;/li>
&lt;li>OpenAI. &amp;ldquo;Sycophancy in GPT-4o: What happened and what we&amp;rsquo;re doing about it.&amp;rdquo; April 2025. &lt;a class="link" href="https://openai.com/index/sycophancy-in-gpt-4o/" target="_blank" rel="noopener"
>openai.com&lt;/a>&lt;/li>
&lt;li>Kaelbling, L. P., Littman, M. L., &amp;amp; Cassandra, A. R. &amp;ldquo;Planning and acting in partially observable stochastic domains.&amp;rdquo; &lt;em>Artificial Intelligence&lt;/em>, 101(1–2), 99–134, 1998.&lt;/li>
&lt;/ul>
&lt;h2 id="further-reading">Further reading
&lt;/h2>&lt;ul>
&lt;li>&lt;a class="link" href="https://corebaseit.com/posts/ai-sycophancy/" >AI Sycophancy: Your Model Is Trained to Please You, Not to Be Right&lt;/a>&lt;/li>
&lt;li>&lt;a class="link" href="https://corebaseit.com/posts/stochastic-entropy-ai/" >Stochastic, Entropy &amp;amp; AI&lt;/a>&lt;/li>
&lt;li>&lt;a class="link" href="https://corebaseit.com/posts/reasoning-models-deep-reasoning-llms/" >Reasoning Models and Deep Reasoning in LLMs&lt;/a>&lt;/li>
&lt;/ul></description></item></channel></rss>