<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Maximum-Entropy on Corebaseit — POS · EMV · Payments · AI</title><link>https://corebaseit.com/tags/maximum-entropy/</link><description>Recent content in Maximum-Entropy on Corebaseit — POS · EMV · Payments · AI</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><managingEditor>contact@corebaseit.com (Vincent Bevia)</managingEditor><webMaster>contact@corebaseit.com (Vincent Bevia)</webMaster><lastBuildDate>Thu, 23 Apr 2026 10:00:00 +0100</lastBuildDate><atom:link href="https://corebaseit.com/tags/maximum-entropy/index.xml" rel="self" type="application/rss+xml"/><item><title>Maximum Entropy Inverse Reinforcement Learning: Understanding the Trajectory Formula</title><link>https://corebaseit.com/corebaseit_posts/maximum-entropy-irl-trajectory-formula_rl_part2/</link><pubDate>Thu, 23 Apr 2026 10:00:00 +0100</pubDate><author>contact@corebaseit.com (Vincent Bevia)</author><guid>https://corebaseit.com/corebaseit_posts/maximum-entropy-irl-trajectory-formula_rl_part2/</guid><description>&lt;p>&lt;strong>Inverse reinforcement learning (IRL) asks a different question from classical RL: instead of assuming a reward function and learning a policy, you observe expert behavior and infer what reward would make that behavior look rational.&lt;/strong> One of the cleanest probabilistic formulations is the &lt;strong>maximum entropy&lt;/strong> trajectory model. This post is a practical engineering note on what the formula means, why entropy matters, and where the &lt;strong>Markov decision process (MDP)&lt;/strong> shows up under the hood.&lt;/p>
&lt;p style="text-align: center;">
&lt;img src="https://corebaseit.com/diagrams/maximun_entropy.png" alt="Markov Diagram" style="max-width: 900px; width: 100%;" />
&lt;/p>
&lt;hr>
&lt;h2 id="the-formula">The formula
&lt;/h2>&lt;p>Over candidate trajectories \(T_i\), the model defines:&lt;/p>
$$
P(T_i \mid \mathbf{w}) = \frac{1}{Z(\mathbf{w})} \exp\Bigl( \sum_{(s,a) \in T_i} \mathbf{w} \cdot \boldsymbol{\phi}(s,a) \Bigr)
$$&lt;ul>
&lt;li>\(T_i\) — a complete &lt;strong>trajectory&lt;/strong> (a sequence of state–action pairs).&lt;/li>
&lt;li>\(\mathbf{w}\) — a &lt;strong>weight vector&lt;/strong>; in linear IRL, \(\mathbf{w}\) parameterizes the implied reward \(R(s,a) = \mathbf{w} \cdot \boldsymbol{\phi}(s,a)\).&lt;/li>
&lt;li>\(\boldsymbol{\phi}(s,a)\) — a &lt;strong>feature vector&lt;/strong> for taking action \(a\) in state \(s\) (progress toward a goal, proximity to obstacles, smoothness, etc.).&lt;/li>
&lt;li>\(\mathbf{w} \cdot \boldsymbol{\phi}(s,a)\) — scalar &lt;strong>reward contribution&lt;/strong> for one step.&lt;/li>
&lt;li>\(\sum_{(s,a) \in T_i} \mathbf{w} \cdot \boldsymbol{\phi}(s,a)\) — &lt;strong>trajectory score&lt;/strong> (total reward along the path).&lt;/li>
&lt;li>\(Z(\mathbf{w})\) — the &lt;strong>partition function&lt;/strong>: sum of \(\exp(\text{score})\) over all trajectories so probabilities normalize to one.&lt;/li>
&lt;/ul>
&lt;h3 id="notation-at-a-glance">Notation at a glance
&lt;/h3>&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Term&lt;/th>
&lt;th>Meaning&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>\(T_i\)&lt;/td>
&lt;td>One candidate trajectory, path, or behavior sequence&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>\(\mathbf{w}\)&lt;/td>
&lt;td>Learned weights; defines the linear reward in this setup&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>\(\boldsymbol{\phi}(s,a)\)&lt;/td>
&lt;td>Features for \((s,a)\): progress, risk, smoothness, compliance, …&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>\(\mathbf{w} \cdot \boldsymbol{\phi}(s,a)\)&lt;/td>
&lt;td>Scalar reward for one step&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>\(\sum_{(s,a) \in T_i}\)&lt;/td>
&lt;td>Sum of step rewards along the trajectory&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>\(\exp(\cdot)\)&lt;/td>
&lt;td>Turns scores into positive, unnormalized masses&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>\(Z(\mathbf{w})\)&lt;/td>
&lt;td>Normalizer over trajectories (the hard part in large spaces)&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;hr>
&lt;h2 id="intuition-a-softmax-over-trajectories">Intuition: a softmax over trajectories
&lt;/h2>&lt;p>Operationally, the recipe is:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Score&lt;/strong> each trajectory by summing \(\mathbf{w} \cdot \boldsymbol{\phi}(s,a)\) along the path.&lt;/li>
&lt;li>&lt;strong>Exponentiate&lt;/strong> each score.&lt;/li>
&lt;li>&lt;strong>Divide&lt;/strong> by the sum of all exponentiated scores.&lt;/li>
&lt;/ol>
&lt;p>That is a &lt;strong>softmax&lt;/strong> over trajectories: higher-scoring paths get higher probability, but nothing is forced to probability one unless the data and weights make that inevitable.&lt;/p>
&lt;p>That matters for real experts. Demonstrations are rarely perfectly deterministic; several trajectories can be good. Maximum-entropy modeling keeps that &lt;strong>uncertainty&lt;/strong> instead of collapsing everything onto a single “best” path.&lt;/p>
&lt;hr>
&lt;h2 id="irl-mdps-and-feature-matching">IRL, MDPs, and feature matching
&lt;/h2>&lt;p>&lt;strong>Forward RL:&lt;/strong> given dynamics and reward, find a good policy. &lt;strong>IRL:&lt;/strong> given demonstrations, infer a reward (or reward parameters) that rationalizes them.&lt;/p>
&lt;p>In &lt;strong>maximum-entropy IRL&lt;/strong>, one assumes \(R(s,a) = \mathbf{w} \cdot \boldsymbol{\phi}(s,a)\). The trajectory score is the sum of per-step rewards. Expert data are treated as if they were drawn from a distribution where &lt;strong>higher-reward trajectories are exponentially more likely&lt;/strong> — exactly the Boltzmann form above.&lt;/p>
&lt;p>&lt;strong>Estimation goal:&lt;/strong> find \(\mathbf{w}\) such that observed demonstrations are &lt;strong>likely&lt;/strong> under \(P(T \mid \mathbf{w})\). In practice, that means learning what the expert seems to care about — safety, efficiency, smooth motion, rule compliance — &lt;strong>only through the features you encode&lt;/strong>.&lt;/p>
&lt;p>A standard theoretical consequence is &lt;strong>feature matching&lt;/strong> (in expectation): the model’s distribution aligns &lt;strong>expected feature counts&lt;/strong> with those implied by the demonstrations. If experts consistently avoid obstacles and move smoothly, the inferred reward should induce similar statistics. &lt;strong>Feature design is not cosmetic&lt;/strong>; it is the language in which preferences become identifiable.&lt;/p>
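&lt;p>To make feature matching concrete, here is a minimal sketch that enumerates a tiny candidate set and computes the model's &lt;strong>expected feature counts&lt;/strong> under \(P(T \mid \mathbf{w})\). The trajectories and weights are invented for readability, and full enumeration is viable only at toy scale:&lt;/p>

```python
import math

# Hypothetical toy trajectories: each is a list of per-step feature vectors phi(s, a).
trajectories = [
    [[1, 0], [1, 1]],
    [[0, 1], [1, 0]],
    [[1, 1], [0, 0]],
]
w = [1.0, -0.5]  # hypothetical reward weights


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def feature_counts(T):
    # Sum of feature vectors along one trajectory.
    return [sum(step[k] for step in T) for k in range(len(T[0]))]


def model_expected_features(trajs, w):
    # Expected feature counts under P(T | w), by full enumeration (toy scale only).
    scores = [sum(dot(w, phi) for phi in T) for T in trajs]
    Z = sum(math.exp(s) for s in scores)
    probs = [math.exp(s) / Z for s in scores]
    dim = len(trajs[0][0])
    expected = [0.0] * dim
    for p, T in zip(probs, trajs):
        fc = feature_counts(T)
        for k in range(dim):
            expected[k] += p * fc[k]
    return expected


print(model_expected_features(trajectories, w))
```

&lt;p>In a real fit, these expected counts would be compared against empirical counts from demonstrations; equality at the optimum is exactly the feature-matching condition.&lt;/p>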
&lt;h3 id="where-the-mdp-enters">Where the MDP enters
&lt;/h3>&lt;p>Trajectories \(T_i\) are usually &lt;strong>feasible paths in an environment&lt;/strong>: states, actions, and transitions. In the finite case, that is a &lt;strong>finite MDP&lt;/strong>: discrete states and actions, and transitions \(P(s' \mid s, a)\) constrain which sequences are valid. The formula does not reference the transition model explicitly, but &lt;strong>candidate trajectories are generated inside that structure&lt;/strong>. When the state space is huge, you cannot enumerate all trajectories; \(Z(\mathbf{w})\) is approximated with &lt;strong>dynamic programming&lt;/strong>, sampling, or other tractable surrogates — that is the central engineering bottleneck.&lt;/p>
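&lt;p>To see the dynamic-programming idea in miniature, the sketch below uses a hypothetical three-state, deterministic-transition MDP with a fixed horizon, computes \(\log Z(\mathbf{w})\) with a backward pass, and checks it against brute-force enumeration (which is only possible because the example is tiny):&lt;/p>

```python
import math
from itertools import product

# Hypothetical toy MDP: 3 states on a line, 2 actions, deterministic moves, horizon 3.
states = [0, 1, 2]
actions = [-1, 1]   # move left / move right (clipped at the boundary)
H = 3               # number of steps per trajectory
w = [1.0, -0.3]     # hypothetical reward weights: reward progress, penalize each move


def step(s, a):
    return min(max(s + a, 0), len(states) - 1)


def phi(s, a):
    # Hypothetical features: did we actually advance toward state 2, plus a move cost.
    return [float(step(s, a) == s + 1), 1.0]


def reward(s, a):
    return sum(wi * fi for wi, fi in zip(w, phi(s, a)))


def log_Z_dp(s0):
    # Backward pass: z_t(s) = sum_a exp(r(s,a)) * z_{t+1}(next(s,a)), with z_H = 1.
    z = {s: 1.0 for s in states}
    for _ in range(H):
        z = {s: sum(math.exp(reward(s, a)) * z[step(s, a)] for a in actions)
             for s in states}
    return math.log(z[s0])


def log_Z_brute(s0):
    # Enumerate every action sequence of length H (only viable at toy scale).
    total = 0.0
    for seq in product(actions, repeat=H):
        s, score = s0, 0.0
        for a in seq:
            score += reward(s, a)
            s = step(s, a)
        total += math.exp(score)
    return math.log(total)


print(log_Z_dp(0), log_Z_brute(0))
```

&lt;p>The backward pass visits each state once per step instead of enumerating exponentially many trajectories, which is the same structural trick real MaxEnt IRL implementations rely on.&lt;/p>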
&lt;hr>
&lt;h2 id="what-maximum-entropy-means-here">What “maximum entropy” means here
&lt;/h2>&lt;p>&lt;strong>Entropy&lt;/strong> measures how spread out a distribution is. High entropy: mass over many trajectories. Low entropy: mass concentrated on a few.&lt;/p>
&lt;p>&lt;strong>Maximum entropy&lt;/strong> means: among all distributions that satisfy chosen &lt;strong>constraints&lt;/strong> (typically, matching statistics of the demonstrations, especially feature expectations), pick the distribution that is &lt;strong>otherwise least committal&lt;/strong> — it adds no extra assumptions beyond those constraints.&lt;/p>
&lt;p>If several trajectories fit the data, probability should &lt;strong>spread&lt;/strong> across them instead of assigning one path probability one and the rest zero. The &lt;strong>exponential-family&lt;/strong> form is not arbitrary: it arises from maximizing entropy subject to &lt;strong>feature-matching&lt;/strong> constraints. The same structure appears in maximum-entropy IRL, &lt;strong>conditional random fields&lt;/strong>, &lt;strong>Gibbs distributions&lt;/strong>, and &lt;strong>softmax&lt;/strong> policies.&lt;/p>
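&lt;p>A small numerical illustration of "spread" (hypothetical scores): scaling trajectory scores up concentrates the softmax and drives entropy toward zero, while scaling them toward zero spreads mass uniformly, reaching the maximum possible entropy over the candidates:&lt;/p>

```python
import math

# Hypothetical scores for four candidate trajectories.
scores = [4.0, 3.5, 3.4, 1.0]


def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    Z = sum(exps)
    return [e / Z for e in exps]


def entropy(ps):
    # Shannon entropy in nats; skip zero-probability terms.
    return -sum(p * math.log(p) for p in ps if p)


# beta = 0 gives the uniform distribution (entropy log 4);
# large beta concentrates nearly all mass on the top trajectory.
for beta in [0.0, 1.0, 10.0]:
    p = softmax([beta * s for s in scores])
    print(beta, round(entropy(p), 4))
```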
&lt;hr>
&lt;h2 id="tiny-worked-example-features-and-weights">Tiny worked example (features and weights)
&lt;/h2>&lt;p>Suppose step-level features capture: &lt;strong>progress toward goal&lt;/strong>, &lt;strong>collision / obstacle contact&lt;/strong>, and &lt;strong>smooth motion&lt;/strong>. A weight vector might be \(\mathbf{w} = [2.0,\,-5.0,\,1.0]\): progress is good, collisions are strongly penalized, smoothness is mildly rewarded.&lt;/p>
&lt;ul>
&lt;li>Trajectory &lt;strong>A&lt;/strong> reaches the goal while avoiding obstacles → total score &lt;strong>6&lt;/strong>.&lt;/li>
&lt;li>Trajectory &lt;strong>B&lt;/strong> reaches the goal but clips an obstacle → total score &lt;strong>1&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;p>Because \(\exp(6) \gg \exp(1)\), &lt;strong>A&lt;/strong> gets much higher probability under \(P(T \mid \mathbf{w})\). &lt;strong>B&lt;/strong> is not impossible — only less likely. That is the maximum-entropy mindset: &lt;strong>bias&lt;/strong> toward what explains the expert, &lt;strong>without&lt;/strong> zeroing out plausible alternatives.&lt;/p>
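&lt;p>Plugging the two scores into the formula (score, exponentiate, normalize) makes the gap concrete:&lt;/p>

```python
import math

# Trajectory A scored 6, trajectory B scored 1 (from the example above).
score_A, score_B = 6.0, 1.0
Z = math.exp(score_A) + math.exp(score_B)
p_A = math.exp(score_A) / Z
p_B = math.exp(score_B) / Z
print(round(p_A, 4), round(p_B, 4))  # 0.9933 0.0067
```

&lt;p>A dominates, but B keeps nonzero mass: the five-point score gap becomes roughly a 148-to-1 probability ratio, not a hard exclusion.&lt;/p>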
&lt;hr>
&lt;h2 id="python-scores-partition-function-probabilities">Python: scores, partition function, probabilities
&lt;/h2>&lt;p>The snippet below scores a &lt;strong>small finite set&lt;/strong> of candidate trajectories (full enumeration is feasible only at toy scale).&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">import&lt;/span> math
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>w &lt;span style="color:#f92672">=&lt;/span> [&lt;span style="color:#ae81ff">0.8&lt;/span>, &lt;span style="color:#f92672">-&lt;/span>&lt;span style="color:#ae81ff">0.2&lt;/span>]
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>T1 &lt;span style="color:#f92672">=&lt;/span> [[&lt;span style="color:#ae81ff">1&lt;/span>, &lt;span style="color:#ae81ff">0&lt;/span>], [&lt;span style="color:#ae81ff">0&lt;/span>, &lt;span style="color:#ae81ff">1&lt;/span>], [&lt;span style="color:#ae81ff">1&lt;/span>, &lt;span style="color:#ae81ff">1&lt;/span>]]
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>T2 &lt;span style="color:#f92672">=&lt;/span> [[&lt;span style="color:#ae81ff">0&lt;/span>, &lt;span style="color:#ae81ff">1&lt;/span>], [&lt;span style="color:#ae81ff">0&lt;/span>, &lt;span style="color:#ae81ff">1&lt;/span>], [&lt;span style="color:#ae81ff">1&lt;/span>, &lt;span style="color:#ae81ff">0&lt;/span>]]
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>T3 &lt;span style="color:#f92672">=&lt;/span> [[&lt;span style="color:#ae81ff">1&lt;/span>, &lt;span style="color:#ae81ff">1&lt;/span>], [&lt;span style="color:#ae81ff">1&lt;/span>, &lt;span style="color:#ae81ff">0&lt;/span>]]
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>trajectories &lt;span style="color:#f92672">=&lt;/span> [T1, T2, T3]
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">dot&lt;/span>(u, v):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> sum(ui &lt;span style="color:#f92672">*&lt;/span> vi &lt;span style="color:#66d9ef">for&lt;/span> ui, vi &lt;span style="color:#f92672">in&lt;/span> zip(u, v))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">trajectory_score&lt;/span>(T, w):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> sum(dot(w, phi) &lt;span style="color:#66d9ef">for&lt;/span> phi &lt;span style="color:#f92672">in&lt;/span> T)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>scores &lt;span style="color:#f92672">=&lt;/span> [trajectory_score(T, w) &lt;span style="color:#66d9ef">for&lt;/span> T &lt;span style="color:#f92672">in&lt;/span> trajectories]
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>unnorm &lt;span style="color:#f92672">=&lt;/span> [math&lt;span style="color:#f92672">.&lt;/span>exp(score) &lt;span style="color:#66d9ef">for&lt;/span> score &lt;span style="color:#f92672">in&lt;/span> scores]
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Z &lt;span style="color:#f92672">=&lt;/span> sum(unnorm)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>probs &lt;span style="color:#f92672">=&lt;/span> [u &lt;span style="color:#f92672">/&lt;/span> Z &lt;span style="color:#66d9ef">for&lt;/span> u &lt;span style="color:#f92672">in&lt;/span> unnorm]
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">for&lt;/span> i, (score, prob) &lt;span style="color:#f92672">in&lt;/span> enumerate(zip(scores, probs), start&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">1&lt;/span>):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> print(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;T&lt;/span>&lt;span style="color:#e6db74">{&lt;/span>i&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">: score=&lt;/span>&lt;span style="color:#e6db74">{&lt;/span>score&lt;span style="color:#e6db74">:&lt;/span>&lt;span style="color:#e6db74">.3f&lt;/span>&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">, P(T&lt;/span>&lt;span style="color:#e6db74">{&lt;/span>i&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">|w)=&lt;/span>&lt;span style="color:#e6db74">{&lt;/span>prob&lt;span style="color:#e6db74">:&lt;/span>&lt;span style="color:#e6db74">.3f&lt;/span>&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>
&lt;li>Each step is a feature vector \(\boldsymbol{\phi}\).&lt;/li>
&lt;li>The trajectory score is \(\sum \mathbf{w} \cdot \boldsymbol{\phi}\).&lt;/li>
&lt;li>\(\exp(\text{score})\) gives an unnormalized weight; &lt;strong>\(Z\)&lt;/strong> sums them; division yields &lt;strong>valid probabilities&lt;/strong>.&lt;/li>
&lt;/ul>
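&lt;p>One practical caveat: with realistic trajectory scores, &lt;code>math.exp&lt;/code> overflows. Implementations typically apply the &lt;strong>log-sum-exp&lt;/strong> trick: subtract the maximum score before exponentiating. A minimal sketch (the scores are invented to force the overflow):&lt;/p>

```python
import math

# math.exp(810) would raise OverflowError, so normalize in a shifted space.
scores = [810.0, 805.0, 790.0]

m = max(scores)
shifted = [math.exp(s - m) for s in scores]   # all values are now at most 1
Z_shifted = sum(shifted)
probs = [e / Z_shifted for e in shifted]       # identical to the unshifted softmax
log_Z = m + math.log(Z_shifted)                # log Z(w), computed without overflow

print(probs, log_Z)
```

&lt;p>The shift cancels in the ratio, so the probabilities are unchanged; only the intermediate arithmetic becomes safe.&lt;/p>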
&lt;hr>
&lt;h2 id="application-sketch-warehouse-navigation">Application sketch: warehouse navigation
&lt;/h2>&lt;p>A mobile robot moves to a packing station while avoiding shelves, slowing near people, and preferring smooth motion. A natural feature vector per \((s,a)\) might include: &lt;strong>moves toward goal&lt;/strong>, &lt;strong>close to obstacle&lt;/strong>, &lt;strong>near human worker&lt;/strong>, &lt;strong>smooth motion&lt;/strong>. Learned \(\mathbf{w}\) encodes implicit priorities from demonstrations (positive weights reward preferred behavior, negative weights penalize risk or inefficiency).&lt;/p>
&lt;p>&lt;strong>Pipeline (conceptual):&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>Collect expert demonstrations (human teleop, planner traces, etc.).&lt;/li>
&lt;li>Represent each trajectory by summing \(\boldsymbol{\phi}(s,a)\) along the path (or equivalently summing \(\mathbf{w}\cdot\boldsymbol{\phi}\) if optimizing \(\mathbf{w}\)).&lt;/li>
&lt;li>Use \(P(T \mid \mathbf{w}) \propto \exp(\sum \mathbf{w}\cdot\boldsymbol{\phi})\) so expert-like paths get higher mass.&lt;/li>
&lt;li>Fit \(\mathbf{w}\) so demonstrated trajectories are likely and &lt;strong>expected features&lt;/strong> match the data.&lt;/li>
&lt;li>Deploy the implied reward for planning in &lt;strong>new&lt;/strong> layouts.&lt;/li>
&lt;/ol>
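&lt;p>Step 4 can be sketched with plain gradient ascent: the gradient of the log-likelihood is the difference between &lt;strong>empirical&lt;/strong> and &lt;strong>model-expected&lt;/strong> feature counts. The candidate set, demonstration, and learning rate below are invented for illustration, and enumeration again stands in for the approximate inference a real system would need:&lt;/p>

```python
import math

# Hypothetical candidate trajectories, as lists of per-step feature vectors.
candidates = [
    [[1, 0], [1, 0]],
    [[1, 1], [0, 1]],
    [[0, 1], [0, 0]],
]
demo_idx = 0  # pretend the expert always demonstrated the first trajectory


def counts(T):
    # Total feature counts along one trajectory.
    return [sum(step[k] for step in T) for k in range(2)]


def probs(w):
    # Softmax over trajectory scores, with a max-shift for stability.
    scores = [sum(wi * fi for wi, fi in zip(w, counts(T))) for T in candidates]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    Z = sum(exps)
    return [e / Z for e in exps]


w = [0.0, 0.0]
empirical = counts(candidates[demo_idx])
for _ in range(200):
    p = probs(w)
    # Gradient of the log-likelihood: empirical minus expected feature counts.
    expected = [sum(p[i] * counts(T)[k] for i, T in enumerate(candidates))
                for k in range(2)]
    w = [wi + 0.1 * (emp - exp_) for wi, emp, exp_ in zip(w, empirical, expected)]

print([round(x, 2) for x in w], round(probs(w)[demo_idx], 3))
```

&lt;p>As the fit proceeds, the demonstrated trajectory's probability climbs toward one and the gradient shrinks, which is the feature-matching condition asserting itself.&lt;/p>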
&lt;p>In a real warehouse, &lt;strong>many&lt;/strong> paths can be safe. Maximum entropy avoids pretending there is a single “correct” route; it learns a reward that ranks paths while &lt;strong>preserving uncertainty&lt;/strong> where data are ambiguous.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Feature (example)&lt;/th>
&lt;th>Typical meaning&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Moves toward goal&lt;/td>
&lt;td>Progress to target&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Close to obstacle&lt;/td>
&lt;td>Penalize risk near shelves or walls&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Near human worker&lt;/td>
&lt;td>Tighter safety margin&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Smooth motion&lt;/td>
&lt;td>Predictable, executable motion&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;hr>
&lt;h2 id="numerical-walkthrough-two-trajectories-two-features">Numerical walkthrough (two trajectories, two features)
&lt;/h2>&lt;p>Let \(\boldsymbol{\phi}(s,a) = [\text{towardGoal}, \text{energyCost}]\) and \(\mathbf{w} = [2,\,-1]\): reward progress toward the goal, penalize energy.&lt;/p>
&lt;p>Two &lt;strong>toy&lt;/strong> three-step trajectories (chosen by hand for readability — in real IRL they would come from demos, simulation, or a planner):&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Trajectory&lt;/th>
&lt;th>Step 1 \(\boldsymbol{\phi}\)&lt;/th>
&lt;th>Step 2 \(\boldsymbol{\phi}\)&lt;/th>
&lt;th>Step 3 \(\boldsymbol{\phi}\)&lt;/th>
&lt;th>Total score \(\sum \mathbf{w}\cdot\boldsymbol{\phi}\)&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>\(T_1\)&lt;/td>
&lt;td>\([1,1]\)&lt;/td>
&lt;td>\([1,0]\)&lt;/td>
&lt;td>\([1,1]\)&lt;/td>
&lt;td>&lt;strong>4&lt;/strong>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>\(T_2\)&lt;/td>
&lt;td>\([1,1]\)&lt;/td>
&lt;td>\([0,1]\)&lt;/td>
&lt;td>\([1,1]\)&lt;/td>
&lt;td>&lt;strong>1&lt;/strong>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Step scores for \(T_1\): \(1,\,2,\,1\) → sum &lt;strong>4&lt;/strong>. For \(T_2\): \(1,\,-1,\,1\) → sum &lt;strong>1&lt;/strong>.&lt;/p>
&lt;p>Unnormalized weights: \(\exp(4) \approx 54.60\), \(\exp(1) \approx 2.72\). So \(Z(\mathbf{w}) \approx 57.32\), and&lt;/p>
$$
P(T_1 \mid \mathbf{w}) \approx 0.953,\qquad P(T_2 \mid \mathbf{w}) \approx 0.047.
$$&lt;p>The lower-scoring trajectory is unlikely but &lt;strong>not zero&lt;/strong> — by design.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">import&lt;/span> math
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>w &lt;span style="color:#f92672">=&lt;/span> [&lt;span style="color:#ae81ff">2&lt;/span>, &lt;span style="color:#f92672">-&lt;/span>&lt;span style="color:#ae81ff">1&lt;/span>]
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>T1 &lt;span style="color:#f92672">=&lt;/span> [[&lt;span style="color:#ae81ff">1&lt;/span>, &lt;span style="color:#ae81ff">1&lt;/span>], [&lt;span style="color:#ae81ff">1&lt;/span>, &lt;span style="color:#ae81ff">0&lt;/span>], [&lt;span style="color:#ae81ff">1&lt;/span>, &lt;span style="color:#ae81ff">1&lt;/span>]]
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>T2 &lt;span style="color:#f92672">=&lt;/span> [[&lt;span style="color:#ae81ff">1&lt;/span>, &lt;span style="color:#ae81ff">1&lt;/span>], [&lt;span style="color:#ae81ff">0&lt;/span>, &lt;span style="color:#ae81ff">1&lt;/span>], [&lt;span style="color:#ae81ff">1&lt;/span>, &lt;span style="color:#ae81ff">1&lt;/span>]]
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">dot&lt;/span>(a, b):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> sum(x &lt;span style="color:#f92672">*&lt;/span> y &lt;span style="color:#66d9ef">for&lt;/span> x, y &lt;span style="color:#f92672">in&lt;/span> zip(a, b))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">score&lt;/span>(T, w):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> sum(dot(w, phi) &lt;span style="color:#66d9ef">for&lt;/span> phi &lt;span style="color:#f92672">in&lt;/span> T)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>s1, s2 &lt;span style="color:#f92672">=&lt;/span> score(T1, w), score(T2, w)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>e1, e2 &lt;span style="color:#f92672">=&lt;/span> math&lt;span style="color:#f92672">.&lt;/span>exp(s1), math&lt;span style="color:#f92672">.&lt;/span>exp(s2)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Z &lt;span style="color:#f92672">=&lt;/span> e1 &lt;span style="color:#f92672">+&lt;/span> e2
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>print(round(e1 &lt;span style="color:#f92672">/&lt;/span> Z, &lt;span style="color:#ae81ff">3&lt;/span>), round(e2 &lt;span style="color:#f92672">/&lt;/span> Z, &lt;span style="color:#ae81ff">3&lt;/span>)) &lt;span style="color:#75715e"># 0.953, 0.047&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;hr>
&lt;h2 id="engineering-takeaways">Engineering takeaways
&lt;/h2>&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Takeaway&lt;/th>
&lt;th>Why it matters&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>Feature quality dominates&lt;/strong>&lt;/td>
&lt;td>Weak or wrong features → weak or misleading inferred rewards.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Maximum entropy reduces brittleness&lt;/strong>&lt;/td>
&lt;td>Multiple near-optimal behaviors can coexist instead of a forced deterministic story.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>\(Z(\mathbf{w})\) is the hard part&lt;/strong>&lt;/td>
&lt;td>Exact enumeration is intractable in large MDPs; implementations use DP, sampling, or approximations.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>IRL targets objectives, not only actions&lt;/strong>&lt;/td>
&lt;td>A learned reward often &lt;strong>generalizes&lt;/strong> to new situations better than pure behavior cloning — when the model is right.&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;hr>
&lt;h2 id="where-else-this-pattern-appears">Where else this pattern appears
&lt;/h2>&lt;p>The same &lt;strong>score then normalize&lt;/strong> structure shows up wherever you need a &lt;strong>distribution over structured sequences&lt;/strong>: imitation learning under ambiguity, trajectory prediction in driving, user / gameplay sequence modeling, and &lt;strong>structured prediction&lt;/strong> in ML (exponential models over outputs). The unifying pattern: meaningful features, learned weights, trajectory scores, and a normalized probability over candidates.&lt;/p>
&lt;hr>
&lt;h2 id="bottom-line">Bottom line
&lt;/h2>&lt;p>The maximum-entropy trajectory model gives a precise way to say: &lt;strong>expert-like trajectories should be more probable when they score higher under a hidden linear reward&lt;/strong>, while the model &lt;strong>stays honest about uncertainty&lt;/strong> when the data do not support a sharper conclusion.&lt;/p>
&lt;p>For builders: &lt;strong>define features carefully&lt;/strong>, &lt;strong>infer \(\mathbf{w}\)&lt;/strong> so demonstrations and feature statistics are explained, and use the &lt;strong>maximum-entropy&lt;/strong> distribution to avoid overfitting a single story to limited data.&lt;/p>
&lt;hr>
&lt;h2 id="references">References
&lt;/h2>&lt;ul>
&lt;li>Ziebart, B. D., Maas, A. L., Bagnell, J. A., &amp;amp; Dey, A. K. &amp;ldquo;Maximum entropy inverse reinforcement learning.&amp;rdquo; &lt;em>AAAI&lt;/em>, 2008.&lt;/li>
&lt;li>Ng, A. Y., &amp;amp; Russell, S. &amp;ldquo;Algorithms for inverse reinforcement learning.&amp;rdquo; &lt;em>ICML&lt;/em>, 2000.&lt;/li>
&lt;li>Sutton, R. S., &amp;amp; Barto, A. G. &lt;em>Reinforcement Learning: An Introduction&lt;/em> (2nd ed.). MIT Press, 2018. &lt;a class="link" href="http://incompleteideas.net/book/the-book-2nd.html" target="_blank" rel="noopener"
>incompleteideas.net&lt;/a>&lt;/li>
&lt;li>Puterman, M. L. &lt;em>Markov Decision Processes: Discrete Stochastic Dynamic Programming&lt;/em>. Wiley, 1994.&lt;/li>
&lt;/ul>
&lt;h2 id="further-reading">Further reading
&lt;/h2>&lt;ul>
&lt;li>&lt;em>Markov Decision Processes: The Mathematical Foundation of Reinforcement Learning&lt;/em> — forthcoming companion post on corebaseit covering the MDP tuple, Bellman structure, and RLHF.&lt;/li>
&lt;li>&lt;a class="link" href="https://corebaseit.com/posts/reasoning-models-deep-reasoning-llms/" >Reasoning Models and Deep Reasoning in LLMs&lt;/a> — sequential structure and verification&lt;/li>
&lt;/ul></description></item></channel></rss>