Beyond Markowitz: Why LLM Supported Reinforcement Learning Is the Missing Piece in Portfolio Optimization

Why Markowitz fails in production

The Eurekahedge AI Hedge Fund Index has tracked funds that use machine learning as a primary driver of investment decisions since 2010. From late 2009 through mid 2024 it returned roughly 9.8% annualized, against the S&P 500's roughly 13.7% over the same window. Fifteen years of underperformance from the segment of the industry that spends the most on machine learning. A small number of firms, including Bridgewater's AIA Labs, Man Group's AlphaGPT, and BlackRock's Asimov, have built systems that generate genuine alpha, but the median AI driven portfolio strategy has not beaten passive equity exposure. For a CTO evaluating where to invest engineering effort, the question is not whether to adopt AI for portfolio decisions; it is what specific architectural choices separate the systems that work from those that have collectively lost to an index fund. The starting point for that diagnosis is the framework every quantitative team builds on: Markowitz mean variance optimization. It has been the bedrock of institutional asset allocation since 1952, and it is broken in at least four specific ways.

The mean variance formulation solves a quadratic program:

Equation 1: Markowitz mean variance optimization
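In standard notation, and using the symbols defined just below, the program can be written as follows. The target return symbol μ* is a label assumed here for "the desired return"; it is not named in the text.

```latex
\begin{aligned}
\min_{w \in \mathbb{R}^n} \quad & w^\top \Sigma\, w \\
\text{subject to} \quad & w^\top \mu = \mu^{*}, \\
& \mathbf{1}^\top w = 1
\end{aligned}
```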

where w ∈ ℝⁿ is the weight vector, Σ ∈ ℝⁿˣⁿ is the covariance matrix, and μ ∈ ℝⁿ is the expected return vector. The equation finds the lowest risk mix of assets that still hits the desired return, with the constraint that the weights sum to one (the portfolio is fully invested). The mathematics is clean; the failures show up in estimation and in use.

Figure 1 — Three efficient frontiers produced by small perturbations to μ under the same Σ. Each tangency portfolio w* lands at a dramatically different risk level. This behavior is what practitioners call "error maximizing."

Estimation error is amplified, not averaged. The tangency portfolio w* ∝ Σ⁻¹(μ − r_f 𝟙), where r_f is the risk free rate and the weights are rescaled to sum to one, is notoriously sensitive to μ; a handful of basis points of error in expected returns produces dramatically different allocations, as Figure 1 illustrates. DeMiguel, Garlappi, and Uppal (2009) found that none of 14 optimized models consistently beat a naive 1/N equal weight portfolio out of sample across 7 datasets.
The covariance matrix is typically ill conditioned. When the number of assets approaches the number of observations, Σ̂ becomes near singular. Random matrix theory, going back to Laloux et al. (1999), shows that the majority of eigenvalues in an empirical covariance matrix are indistinguishable from noise. Inverting such a matrix is mathematically legal and operationally dangerous.
The framework is fundamentally single period. Markowitz answers: what is the best portfolio right now? The question a portfolio manager actually faces is different: what action should I take now to produce the best outcome over the next several rebalancing decisions, accounting for transaction costs, tax lots, turnover limits, and regime shifts?
It has no mechanism for unstructured information. Earnings calls, central bank minutes, annual filings, analyst reports, private market documents, none of this enters a covariance matrix.
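The first failure mode can be reproduced in a few lines. The sketch below is illustrative only: the three asset universe, the 20% volatility, the 0.95 correlation, the 2% risk free rate, and the 10 basis point bump are all assumed numbers, and the small linear solver is there only to keep the example dependency free.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def tangency_weights(mu, cov, rf):
    """Weights proportional to inverse(cov) @ (mu - rf), rescaled to sum to one."""
    excess = [m - rf for m in mu]
    raw = solve(cov, excess)
    total = sum(raw)
    return [r / total for r in raw]


# Assumed toy universe: three assets, equal volatility, high pairwise correlation.
VOL, RHO, RF = 0.20, 0.95, 0.02
cov = [[VOL * VOL * (1.0 if i == j else RHO) for j in range(3)] for i in range(3)]

base = tangency_weights([0.060, 0.060, 0.060], cov, RF)
bumped = tangency_weights([0.061, 0.060, 0.060], cov, RF)  # +10 bps on asset 0

print([round(w, 3) for w in base])
print([round(w, 3) for w in bumped])
```

Under these assumptions a 10 basis point change in one expected return moves that asset from one third of the portfolio to roughly 65%, which is the "error maximizing" behavior in miniature.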

Fixes exist for each. Ledoit and Wolf (2004) shrinkage stabilizes the covariance estimate, which addresses the second failure mode and dampens the first. Hierarchical Risk Parity (López de Prado, 2016) sidesteps matrix inversion entirely. Black and Litterman (1992) provide a principled way to inject investor views. Multi period extensions and model predictive control address the third. These are useful patches, but the first three failures motivate reinforcement learning as a native framework, and the fourth is where large language models earn their place.

Portfolio management as a Markov decision process

Reinforcement learning is purpose built for sequential decision problems with uncertain outcomes, formalized as a Markov decision process (MDP). An agent observes the state of the world, takes an action, receives feedback, and updates its strategy. Over many iterations, it learns a policy, namely a mapping from states to actions, that maximizes cumulative long term reward rather than any single decision in isolation.

Figure 2 — The RL feedback loop applied to portfolio management. The agent continuously learns from the outcomes of its rebalancing decisions.

The state at time t might include current portfolio weights, recent returns, volatility estimates, regime indicators, and transaction cost parameters. The action is the rebalancing decision. The reward function shapes learned behavior most profoundly, and the right choice is rarely raw realized return.

Reward design matters. The Sharpe ratio is normally computed over a full window of returns, which makes it useless as a per step signal for an agent that needs feedback at every rebalance. Moody and Saffell (1998, 2001) introduced the differential Sharpe ratio, which gives the marginal contribution of the latest period to the running Sharpe. The agent gets a usable feedback signal at every step rather than at the end of an episode. For commodities and derivatives portfolios where return distributions are heavy tailed, formulations constrained on Conditional Value at Risk (CVaR), the expected loss in the worst α-tail of the distribution, are preferable.
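A minimal sketch of the differential Sharpe ratio as a per step reward, using the exponential moving estimates of the first and second return moments from Moody and Saffell's formulation. The function name, the adaptation rate `eta`, and the zero reward during warm-up are choices made for this illustration.

```python
def differential_sharpe(returns, eta=0.01):
    """Return the per-step differential Sharpe ratio for a sequence of returns."""
    a = 0.0  # exponential moving estimate of the mean return
    b = 0.0  # exponential moving estimate of the mean squared return
    rewards = []
    for r in returns:
        delta_a = r - a
        delta_b = r * r - b
        variance = b - a * a
        if variance > 1e-12:
            d = (b * delta_a - 0.5 * a * delta_b) / variance ** 1.5
        else:
            d = 0.0  # no usable variance estimate yet (warm-up); an assumed convention
        rewards.append(d)
        # Update the moving estimates only after scoring the current return.
        a += eta * delta_a
        b += eta * delta_b
    return rewards


history = [0.010, -0.005, 0.007, -0.002]
print(differential_sharpe(history + [0.02], eta=0.1)[-1])   # reward for a good period
print(differential_sharpe(history + [-0.02], eta=0.1)[-1])  # reward for a bad period
```

Each element of the result is available as soon as that period's return is known, which is what makes it usable as a reward at every rebalance.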

Figure 3 — Portfolio management as a Markov decision process. Each state transition carries a reward; the agent optimizes the cumulative sequence, not any single step.

Crucially, portfolio management is not a series of independent decisions. It is a chain where each action reshapes the next decision's landscape. Setting an aggressive allocation today and watching volatility spike tomorrow constrains the next rebalance through liquidity and costs that a more conservative prior action would have avoided. That chain of state, action, new state, next action is what makes the problem sequential, and what makes reinforcement learning the right framework.

Institutional portfolios also face hard risk constraints that cannot be expressed as soft penalties in a reward function. The appropriate formulation is the Constrained Markov Decision Process:

Equation 2: constrained MDP with CVaR limit
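One common way to write this problem and its Lagrangian relaxation is sketched below. The symbols are assumptions for illustration: L^π is the portfolio loss under policy π, κ is the CVaR limit, γ is a discount factor, and ν ≥ 0 is the Lagrange multiplier.

```latex
\begin{aligned}
\max_{\pi} \quad & \mathbb{E}^{\pi}\!\left[ \sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t) \right]
\qquad \text{subject to} \qquad \mathrm{CVaR}_{\alpha}\!\left(L^{\pi}\right) \le \kappa, \\[4pt]
\mathcal{L}(\pi, \nu) \;=\; & \mathbb{E}^{\pi}\!\left[ \sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t) \right]
\;-\; \nu \left( \mathrm{CVaR}_{\alpha}\!\left(L^{\pi}\right) - \kappa \right),
\qquad \nu \ge 0 .
\end{aligned}
```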

The unconstrained objective tells the agent to chase return; the CVaR constraint adds "but not at the cost of catastrophic tail losses." Lagrangian relaxation turns the constraint into a penalty term whose weight grows whenever the policy violates the limit, which lets the same RL machinery handle both the objective and the safety bound. This formulation is the training time counterpart of the deterministic constraint layer described in §3, which enforces hard limits in code at decision time.
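The penalty mechanism can be sketched with an empirical CVaR estimate and a projected dual ascent step on the multiplier. This is a simplified illustration, not a full constrained RL trainer; the function names, the sign convention (losses are positive when money is lost), and the step size are assumptions.

```python
import math


def empirical_cvar(losses, alpha):
    """Mean of the worst alpha fraction of losses (positive values are losses)."""
    if not losses or not 0.0 < alpha <= 1.0:
        raise ValueError("need a non-empty sample and alpha in (0, 1]")
    # Number of tail observations; the small offset guards against float round-up.
    k = max(1, math.ceil(alpha * len(losses) - 1e-9))
    worst = sorted(losses, reverse=True)[:k]
    return sum(worst) / k


def dual_ascent_step(multiplier, cvar, limit, step_size):
    """Raise the multiplier when CVaR exceeds the limit; shrink it toward zero otherwise."""
    return max(0.0, multiplier + step_size * (cvar - limit))


def penalized_reward(reward, multiplier, cvar, limit):
    """Lagrangian objective for one evaluation: reward minus the weighted violation."""
    return reward - multiplier * (cvar - limit)


losses = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
cvar = empirical_cvar(losses, alpha=0.2)  # mean of the two worst losses
multiplier = dual_ascent_step(0.0, cvar, limit=8.0, step_size=0.1)
print(cvar, multiplier, penalized_reward(1.0, multiplier, cvar, limit=8.0))
```

The multiplier only grows while the constraint is violated, so a policy that respects the limit is eventually trained on something close to the unpenalized reward.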

Even with the right formalism, four specific failure modes have prevented naive RL from delivering in production:

Non stationarity. Markets do not hold still. A policy fit on 2015 through 2019 data has no basis for behaving sensibly in March 2020. No published RL approach has formally solved this; practical mitigations combine regime aware training, ensemble policy selection, and change point detection.
Backtest overfitting. RL agents have enough flexibility to memorize patterns that will not recur. Quantopian's study of 888 live algorithms found in sample Sharpe had near zero predictive power for live results (R² below 0.025); Bailey, Borwein, López de Prado, and Zhu (2014) formalized why.
Exploration cost. Standard RL improves by trial and error. In a portfolio, every exploratory action is real P&L, regulated, and potentially career ending. This shapes the entire training paradigm described in §3.
Lookahead bias from pretraining. Frontier language models were trained on text that may include outcome data from your test windows. Glasserman and Lin (2023) documented this empirically. Mitigation requires verified training cutoffs and leak robust feature categories.

These four problems are why the academic literature on reinforcement learning in finance is enormous while live deployments remain rare. The architecture proposed in the remainder of this piece is built around them. Each subsequent section maps to at least one failure mode it mitigates.

The corrective architecture

The architecture has four components: a language model feature extractor that ingests unstructured text, a continuous time RL policy that produces target weights, a deterministic constraint layer that enforces hard risk limits, and a human approval gate. Each handles what it excels at, and each is owned by a different team.

Figure 4 — Production architecture. Data flows top to bottom through four layers; monitoring observes features and actions in parallel; the deterministic constraint layer between the policy and execution is the auditability boundary.

The execution layer at the bottom of the stack translates the policy's target weights into child orders routed through the firm's Order Management System (OMS) and submitted to venues over the Financial Information eXchange (FIX) protocol, the institutional standard for electronic trading messages. The four components above it, in the order this section covers them, are the policy itself, the language model feature extractor, the deterministic constraints, and the human approval gate, with monitoring observing every layer in parallel. The remainder of this section walks through each, explaining what naive implementations get wrong and what the production version does instead.

The continuous time RL policy

This is where the exploration cost and non stationarity failure modes from §2 are addressed. Standard RL improves by trial and error on real capital, which no institution will tolerate, and discretizing markets into step wise transitions introduces errors that compound with rebalancing frequency. Both problems become far more tractable when the formulation is rebuilt in continuous time, which is what a body of work by Zhou and collaborators at Columbia has done over the past five years.

Figure 5 — Discrete time samples the process at rebalancing points and interpolates linearly between them; continuous time models the full trajectory Xₜ as an SDE. Transaction costs, volatility clustering, and exploration all behave differently under the two formulations.

The state evolves according to a stochastic differential equation (SDE), and the policy optimizes an entropy regularized objective on top of it. Wang, Zariphopoulou, and Zhou (2020) introduced both:

Equation 3: state dynamics and entropy regularized objective
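In the notation used in the next paragraph, one standard way to write the pair is sketched below. The running reward r, terminal value h, horizon T, and action space 𝒜 are symbols assumed here; the published formulation works with averaged ("exploratory") dynamics, which this simplified form glosses over.

```latex
\begin{aligned}
dX_t &= b(t, X_t, a_t)\,dt + \sigma(t, X_t, a_t)\,dW_t, \qquad a_t \sim \pi_t, \\[4pt]
J(\pi) &= \mathbb{E}\!\left[ \int_0^T e^{-\beta t}\Big( r(t, X_t, a_t) + \lambda\, \mathcal{H}(\pi_t) \Big)\,dt
        \;+\; e^{-\beta T} h(X_T) \right], \\[4pt]
\mathcal{H}(\pi_t) &= -\int_{\mathcal{A}} \pi_t(a)\, \ln \pi_t(a)\, da .
\end{aligned}
```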

Drift b is the deterministic average tendency, diffusion σ is the random shock, and Wₜ is a Brownian motion. The objective scores every policy by the expected total reward it earns over time, with future rewards discounted, plus a bonus for staying flexible, ending with whatever the final state is worth. The discount factor e^(−βt) shrinks the value of returns the further they sit in the future, and the entropy term ℋ(πₜ) rewards a policy that maintains uncertainty rather than collapsing onto a single action, with temperature λ controlling how much exploration the agent gets paid for. The entropy term is not an engineering trick; it falls out of the exploration exploitation tradeoff when the problem is formulated properly in continuous time. As λ shrinks toward zero, the policy converges to the deterministic optimal control.

The optimal policy is a Gibbs measure over a continuous time analog of the discrete Q function, the little q function q(t, x, a) introduced by Jia and Zhou (2023):

Equation 4: Gibbs policy
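Written out with the temperature λ from Equation 3, the Gibbs form is as follows; the integral over the action space 𝒜 is the normalizing constant, and 𝒜 is a symbol assumed here.

```latex
\pi^{*}(a \mid t, x) \;=\;
\frac{\exp\!\big( q(t, x, a) / \lambda \big)}
     {\displaystyle \int_{\mathcal{A}} \exp\!\big( q(t, x, a') / \lambda \big)\, da'}
```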

This is the same softmax form used in Soft Actor Critic, but here it is derived from first principles rather than chosen by design. Tang and Zhou (2024) give explicit regret bounds for annealing schedules, showing that as λ decreases, the policy converges to the optimal deterministic control at a rate characterized by the temperature schedule. The exploration schedule therefore comes with provable convergence, which is the kind of guarantee a risk committee finds reassuring.
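The effect of the temperature is easy to see on a discretized action grid. The sketch below assumes a finite set of candidate actions with given q values, which is a simplification of the continuous action case; the function name and the numbers are illustrative.

```python
import math


def gibbs_policy(q_values, temperature):
    """Probabilities proportional to exp(q / temperature) over a finite action grid."""
    if temperature <= 0.0:
        raise ValueError("temperature must be positive")
    top = max(q_values)  # subtract the maximum for numerical stability
    weights = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


q_values = [0.1, 0.3, 0.2]  # assumed q values for three candidate rebalances
for temperature in (10.0, 0.1, 0.01):
    probs = gibbs_policy(q_values, temperature)
    print(temperature, [round(p, 4) for p in probs])
```

At a high temperature the policy is close to uniform and keeps exploring; as the temperature is annealed toward zero, the probability mass concentrates on the action with the highest q value.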

Training the policy without exploring on real capital. Standard Q learning uses Bellman targets of the form r + γ max_a' Q(s', a'). Online, the errors the max operator makes on untried actions get corrected by the next episode of exploration; offline, there is no next episode, and the policy chases hallucinated optima. Three discrete time families address this rigorously: Batch Constrained Q learning (BCQ; Fujimoto et al., 2019) restricts proposals to the data distribution, Conservative Q learning (CQL; Kumar et al., 2020) penalizes out of distribution Q values, and Implicit Q learning (IQL; Kostrikov et al., 2021) avoids querying Q on out of distribution actions entirely.
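IQL gets its "no out of distribution queries" property by fitting a value function with expectile regression over the Q values of actions that actually appear in the data, instead of taking a max over all actions. The scalar sketch below shows only that building block: it computes the τ expectile of a sample by fixed point iteration. The function name, the iteration count, and the sample are assumptions for illustration.

```python
def expectile(samples, tau, iterations=100):
    """Minimizer v of sum_i w_i * (x_i - v)**2, with w_i = tau if x_i > v else 1 - tau."""
    v = sum(samples) / len(samples)  # tau = 0.5 recovers the plain mean
    for _ in range(iterations):
        weights = [tau if x > v else 1.0 - tau for x in samples]
        v = sum(w * x for w, x in zip(weights, samples)) / sum(weights)
    return v


# Assumed Q values of the actions observed in the dataset for a single state.
observed_q = [1.0, 2.0, 3.0, 10.0]
print(expectile(observed_q, 0.5))   # the mean of the observed values
print(expectile(observed_q, 0.99))  # approaches the best observed value
```

As τ moves toward one, the expectile approaches the maximum over observed actions, so the value target behaves like a max restricted to the support of the data.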

The continuous time formulation offers a fourth approach distinct in mechanism. Jia and Zhou (2022) characterized policy evaluation as maintaining a martingale condition. A martingale is a process whose expected value at any future time equals its current value: a fair game with no upward or downward bias. The process below is a martingale under policy π if and only if V and q are the correct value and q functions; train the network so that this process does not drift on average, and the value and q estimates must be right.

Equation 5: martingale loss
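A sketch of the process and the corresponding loss, in the notation of Equation 3 and following the martingale characterization of Jia and Zhou. The parameter names θ and ψ for the value and q networks are assumptions, and the consistency condition that ties q to the policy is omitted for brevity.

```latex
\begin{aligned}
M_t &= e^{-\beta t}\, V(t, X_t) \;+\; \int_0^t e^{-\beta s}\,\big[\, r(s, X_s, a_s) - q(s, X_s, a_s) \,\big]\, ds, \\[4pt]
\mathrm{ML}(\theta, \psi) &= \tfrac{1}{2}\; \mathbb{E}\!\left[ \int_0^T \big|\, M_T^{\theta,\psi} - M_t^{\theta,\psi} \,\big|^{2}\, dt \right],
\qquad V^{\theta}(T, x) = h(x) .
\end{aligned}
```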

This objective avoids the bootstrapping pathologies of temporal difference learning entirely, which is why for continuous time q learning the martingale loss is the natural offline training objective; IQL serves as the fallback when the martingale characterization is not tractable.

Validation discipline is non negotiable. Walk forward cross validation with strict purging of overlapping samples and embargo periods around test windows, following López de Prado's combinatorial purged cross validation, is the baseline. Every feature in the training set must be reconstructed point in time, as it would have been visible at the decision moment. Without this discipline, any reported Sharpe is fiction.
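The purging and embargo logic can be sketched as an index splitter. This is a simplified purged k fold, not the full combinatorial scheme: it assumes samples are ordered in time, that the label for sample i depends on data through index i + horizon, and that an extra embargo gap is dropped after each test block. All names and parameters are illustrative.

```python
def purged_kfold(n_samples, n_folds, horizon, embargo):
    """Yield (train, test) index lists with overlapping-label samples removed."""
    fold_size = n_samples // n_folds
    for k in range(n_folds):
        start = k * fold_size
        end = n_samples if k == n_folds - 1 else start + fold_size
        test = list(range(start, end))
        train = [
            i for i in range(n_samples)
            # Before the test block: keep only samples whose label window ends before it starts.
            if i + horizon < start
            # After the test block: skip samples overlapping test labels, plus the embargo gap.
            or i >= end + horizon + embargo
        ]
        yield train, test


for train, test in purged_kfold(n_samples=20, n_folds=4, horizon=2, embargo=1):
    print(test[0], test[-1], train)
```

A pure walk forward variant would additionally drop every training index after the test block, so that the model is only ever fit on the past.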

Synthetic environments answer the data sparsity problem. Offline RL solves how to train safely from a fixed dataset; it does not solve the problem that the fixed dataset is sparse in exactly the regimes that matter most: crises, regime transitions, liquidity shocks. A policy trained on 2015 through 2019 data may never have seen a March 2020 analog. Domain randomization on simple SDEs such as geometric Brownian motion is insufficient because those models fail to capture fat tails, volatility clustering, and regime structure.

Generative models of market dynamics address this directly. A score based diffusion model learns a score function ∇ₓ log pₜ(x), the gradient of the log density of the data at each level of a noise schedule, and generates samples by reversing the noising process. Unlike point forecast models trained on next step prediction, diffusion models learn the full conditional distribution of market trajectories, producing ensembles of plausible futures while preserving fat tails, autocorrelation structure, and cross asset dependencies.

Figure 6 — Offline training pipeline. Historical trajectories alone underrepresent rare regimes; diffusion generated synthetic trajectories expand the distribution, and the trainer fits the policy by minimizing the martingale loss.

Gao, Zha, and Zhou (2025) close the loop: training a diffusion model is equivalent to solving an entropy regularized continuous time RL problem, with the score function playing the role of a q function and the reverse time SDE as the policy. Kronos (Tsinghua, 2025) is the financial empirical anchor: a foundation model pretrained on 12 billion K line records across 45 global exchanges, reporting 22% improved synthetic trajectory fidelity and a 93% improvement in forecasting RankIC, a rank correlation metric standard in quantitative equity research, over leading time series foundation models. Reward directed diffusion extends this further by biasing generation toward trajectories with specified properties: stress scenarios with large drawdowns, regimes with elevated correlations, periods of illiquidity.

LLM features for unstructured data

This subsection addresses the fourth Markowitz failure: the framework has no mechanism for unstructured information. Earnings calls, central bank minutes, analyst notes, and regulatory filings all contain signal a covariance matrix cannot see. Language models can read what a covariance matrix cannot, but only as feature extractors that transform text into dense vector representations the RL policy learns from, never as standalone trading agents.

At each rebalancing period the system ingests the relevant text corpus and produces fixed dimensional embeddings concatenated with numerical state features to form the full state vector. This is a meaningful departure from the conventional two stage pipeline that extracts a sentiment score or a handful of topic labels and feeds those as columns into a statistical model. The extraction step discards most of the information; embeddings preserve the high dimensional structure and let the policy learn end to end which dimensions are predictive of future portfolio performance.

What gets lost in a sentiment score. A score of 0.6 indicates mild positivity. It does not indicate which entities are affected, why, or how the conditional risk structure has changed. The embedding preserves the high dimensional structure; the policy learns end to end which dimensions matter.

Three caveats matter for production deployment.

Embeddings are not interchangeable. Even at the same dimensionality, swapping the underlying model requires retraining the policy or training a learned adapter between the embedding spaces.

Pretraining leakage is real. As discussed in §2, base model weights may have been trained on information from after your backtest window. Mitigation requires verified training cutoffs predating test windows, and restricting LLM features to categories robust to leakage. Sentiment of a document at publication time is fine; LLM generated forward expectations are not.

Hallucination must be actively controlled. The FailSafeQA benchmark (2025) documented hallucination rates up to 41% on finance specific queries across frontier models. Effective mitigations stack: retrieval-augmented generation (RAG), in which the model is given source documents at inference time rather than relying on its training data, so every extracted feature cites specific source chunks; structured output schemas with programmatic validation; critic agents that check for unsupported claims; and tool calls for any arithmetic. For anything touching numerical values, the right pattern is to route through computational tools rather than ask the LLM to do the math.

Beyond these caveats, latency forecloses high frequency applications. Language model inference runs at 50 to 150 milliseconds per token, three to six orders of magnitude slower than the microsecond and sub microsecond budgets of high frequency execution. The architecture proposed here targets rebalancing horizons measured in hours and days. This is where Bridgewater's AIA Labs, Man Group's AlphaGPT, and BlackRock's Asimov have placed language model features, and it is where I would suggest starting.
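The structured output and citation checks described under hallucination control can be sketched as a validator that runs before any extracted feature reaches the state vector. The schema here (entity, label, score, citations) and the chunk identifiers are hypothetical; a real deployment would define its own fields.

```python
ALLOWED_LABELS = {"positive", "neutral", "negative"}  # hypothetical label set


def validate_feature(feature, retrieved_chunk_ids):
    """Return a list of problems; an empty list means the feature passes."""
    problems = []
    entity = feature.get("entity")
    if not isinstance(entity, str) or not entity:
        problems.append("entity missing")
    if feature.get("label") not in ALLOWED_LABELS:
        problems.append("label not in schema")
    score = feature.get("score")
    is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
    if not is_number or not -1.0 <= score <= 1.0:
        problems.append("score out of range")
    citations = feature.get("citations")
    if not isinstance(citations, list) or not citations:
        problems.append("no citations")
    elif any(c not in retrieved_chunk_ids for c in citations):
        # The model cited a chunk that retrieval never supplied: treat as unsupported.
        problems.append("cites unretrieved chunk")
    return problems


retrieved = {"10K-2023-p12", "call-Q4-p3"}  # hypothetical chunk identifiers
good = {"entity": "ACME", "label": "negative", "score": -0.4, "citations": ["call-Q4-p3"]}
bad = {"entity": "ACME", "label": "negative", "score": -0.4, "citations": ["call-Q1-p9"]}
print(validate_feature(good, retrieved))
print(validate_feature(bad, retrieved))
```

Rejecting a feature is cheap at rebalancing horizons of hours or days, so the validator can afford to be strict and send failures back for re-extraction or human review.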

Deterministic constraints and human in the loop

This is the single most differentiating component of the architecture relative to academic RL papers, and it addresses a class of failures naive RL formulations cannot represent: regulatory limits, fiduciary mandates, and operational guardrails that must hold in all market conditions, not on average. A learned policy cannot be trusted to enforce hard constraints. Constraints belong in code, not in a value function.

The deterministic constraint layer enforces exposure limits, turnover caps, liquidity screens, sector concentrations, and jurisdictional regulatory rules as explicit code that cannot be learned around. The CMDP formulation from §2 is what the policy optimizes; this layer is what catches anything the policy proposes that violates a hard limit, regardless of how attractive the predicted return looks. It is also the auditability boundary: a risk committee can inspect exactly which constraints were active at each decision, and the natural language explanation generated alongside satisfies the documentation requirements of model risk management frameworks such as the US SR 11-7 guidance.

The auditability boundary. A risk committee cannot inspect a neural network's reasoning. They can inspect a list of rules of the form "if exposure > limit: reject". This is why the constraint layer sits between the policy and execution and not inside the policy: it is the line a regulator can read.
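A minimal sketch of what such a layer might look like: plain, readable checks that return the names of every violated limit, so the same output can drive rejection and the audit log. The three limits, the field names, and the example numbers are assumptions; a production layer would cover liquidity screens and jurisdictional rules as well.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    max_position: float  # cap on any single asset weight
    max_turnover: float  # cap on the sum of absolute weight changes
    max_sector: float    # cap on total weight in any one sector


def find_violations(current, target, sectors, limits):
    """Return the names of all hard limits the proposed target weights would breach."""
    found = []
    for name, weight in target.items():
        if weight > limits.max_position:
            found.append(f"position:{name}")
    names = set(current) | set(target)
    turnover = sum(abs(target.get(n, 0.0) - current.get(n, 0.0)) for n in names)
    if turnover > limits.max_turnover:
        found.append("turnover")
    sector_totals = {}
    for name, weight in target.items():
        sector_totals[sectors[name]] = sector_totals.get(sectors[name], 0.0) + weight
    for sector, total in sorted(sector_totals.items()):
        if total > limits.max_sector:
            found.append(f"sector:{sector}")
    return found


limits = Limits(max_position=0.6, max_turnover=0.5, max_sector=0.8)
sectors = {"A": "tech", "B": "energy"}  # hypothetical sector map
current = {"A": 0.5, "B": 0.5}
proposal = {"A": 0.7, "B": 0.3}  # what the policy asks for
violations = find_violations(current, proposal, sectors, limits)
print("reject" if violations else "pass", violations)
```

Because the function is deterministic and has no learned parameters, the list it returns is exactly what was enforced, which is the property the risk committee needs.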

Human in the loop. Each proposed trade is presented with a feature attributed explanation, and an out of distribution flag escalates any action outside the policy's historical distribution rather than executing it. This is what gives an institution a credible incremental path from AI as advisor to AI as delegated agent, with the trust accumulating one approved trade at a time.

Production monitoring

Backtest overfitting and non stationarity are training time problems that recur as runtime problems. A policy that performed well in backtest can degrade silently when input feature distributions shift, when regime changes invalidate learned correlations, or when actions drift outside the historical distribution the policy was trained on. Production monitoring is what catches this before it costs money.

Three things run continuously. Drift detection on both input feature distributions and the action distribution flags when the live state moves outside what training data covered. An audit log records every state, action, and constraint evaluation, supporting both regulatory disclosure and post hoc model debugging. Shadow mode A/B rollout validates any new policy against live market data without real execution, giving an honest comparison before promotion. This is the unglamorous infrastructure that makes the difference between a system one can defend in front of regulators and a system one cannot.
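Drift detection on a single feature can be sketched with the Population Stability Index, one common choice among several. The bin edges, the floor used to avoid taking the log of zero, and the alert threshold are assumptions; thresholds around 0.1 and 0.25 are an industry rule of thumb rather than anything derived here.

```python
import math
from bisect import bisect_right


def bin_fractions(values, edges, floor=1e-6):
    """Share of values in each bin defined by sorted edges, floored to avoid log(0)."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    return [max(c / len(values), floor) for c in counts]


def population_stability_index(reference, live, edges):
    """PSI = sum over bins of (live - ref) * ln(live / ref)."""
    ref = bin_fractions(reference, edges)
    cur = bin_fractions(live, edges)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


ALERT_THRESHOLD = 0.25  # assumed rule-of-thumb level for a significant shift

training_feature = [i / 100 for i in range(100)]
live_feature = [v + 0.5 for v in training_feature]  # simulated regime shift
edges = [0.25, 0.5, 0.75]

psi = population_stability_index(training_feature, live_feature, edges)
print(round(psi, 3), "ALERT" if psi > ALERT_THRESHOLD else "ok")
```

The same function applies unchanged to the action distribution: bin the policy's proposed weight changes and compare live proposals against those seen in training.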

Backtest fiction versus production fact. A policy with a backtested Sharpe of 2.0 and no drift monitoring is fiction. A policy with a backtested Sharpe of 1.0 and live drift detection on every feature is a production system. The first number is easier to put in a pitch deck; the second is what survives the first regime change.

The architecture generalizes across asset classes; the class specific work is in the state encoding and the constraint layer. Fixed income requires features encoding yield curve dynamics, duration, and convexity. Foreign exchange needs carry features and central bank language as text input. Commodities encode term structure, seasonality, and inventory data.

Future directions: from pipeline to agentic system

The architecture above is a pipeline. It is also, by construction, ready to become a component in an agentic system. Three phases describe the progression.

Phase one, today, is the architecture just described. The language model is a passive feature extractor; the RL policy is the decision maker; the human approves. Shippable now, bounded enough to satisfy compliance.

Phase two, over the next twelve to eighteen months, is the LLM moving up the stack from feature extraction to hypothesis generation. It proposes new features, reward adjustments, and strategy variants, codes them, runs them through the offline RL pipeline under the same walk forward validation discipline, and surfaces promising candidates to a human researcher for review. Man Group's AlphaGPT operates at this level in production today; Microsoft's RD Agent integrated with Qlib demonstrates the pattern in open source.

Phase three, beyond, decomposes the trading desk into specialized agents: research, risk, execution, and compliance, orchestrated by a planner LLM that routes tasks. FinCon (NeurIPS 2024) and TradingAgents prototype this architecture academically. The RL policy from Phase 1 does not disappear in Phase 3; it becomes one tool among several that the planner can call.

Three architectural choices in Phase 1 determine whether Phases 2 and 3 are reachable: the offline RL training infrastructure, since Phase 2 needs a safe fast training loop to evaluate strategies it proposes; the deterministic constraint layer, since autonomous research without enforced constraints is unbounded liability; and the feature attributed explanation interface, since Phase 3 requires agents that can explain their reasoning to other agents and to humans. Build these now and the agentic future is incremental. Skip them and it is a rewrite.

What a serious pilot looks like: a staged timeline

A pilot designed to answer whether this works for a specific mandate, rather than whether the technology is interesting, has a specific shape over approximately ninety days.

Weeks one through three: data and validation infrastructure. Point in time data plumbing for whatever subset of the asset universe the pilot covers, the walk forward validation harness with combinatorial purged cross validation, and a baseline notebook demonstrating the validation discipline. This is unglamorous and it is where most projects that later fail were already failing.

Weeks four through seven: offline RL baseline. An IQL or continuous time q learning policy trained on a simple numerical state representation, no LLM features yet, with differential Sharpe reward and CVaR constraint. Benchmarked against equal weight, mean variance with Ledoit Wolf shrinkage, and momentum. The goal is not to beat these benchmarks but to verify that the training pipeline works, the validation is honest, and the constraint layer rejects what it should.

Weeks eight through eleven: LLM features and ablation. Embeddings added to the state vector, policy retrained, comparisons rerun. Critically, the ablation isolates how much improvement comes from LLM features versus the offline RL framework versus the reward design. Most published results conflate these contributions; the conflation hides where the value actually lives.

Week twelve: paper trading. The resulting policy runs in shadow mode against live market data with no real execution, comparing proposed trades against a human portfolio manager's actual trades. The deliverable is not a deployed system. It is a written assessment of three things: whether the architecture beats the firm's existing baseline under honest validation, what production deployment would cost, and which of the failure modes identified in §2 the specific use case is most exposed to.

Summary

The fifteen year underperformance gap between AI driven hedge funds and the S&P 500 is not evidence that machine learning has nothing to offer portfolio management. It is evidence that the architectures most teams have shipped do not match what production deployment in finance actually requires. Markowitz is broken in four specific ways; naive RL is broken in four more; large language models alone introduce three more on top of that.

The architecture proposed here addresses each. Continuous time RL with the martingale loss handles offline training and exploration cost. Diffusion based synthetic environments expand training coverage into the rare regimes that decide whether a system survives a crisis. Language models as feature extractors capture unstructured signal that classical state representations cannot. Deterministic constraints in code, not in a value function, give a regulator something to inspect. Production monitoring closes the runtime loop. And the staged twelve week pilot tests the whole thing in about one calendar quarter without putting capital at risk.

An architecture of this shape ships. It defends in front of a risk committee and a regulator. It captures signal a classical stack cannot see. And it gives an institution a credible incremental path from AI as advisor to AI as delegated agent. The architectures most teams have shipped since 2010 do none of these things. That is the gap, and it is closeable now.
