Why Reinforcement Learning Might be the Most Overlooked AI Breakthrough Yet
A Survey of Advancements in GenAI with Reinforcement Learning: How RLHF and Reasoning LLMs Use Reinforcement Learning to Push the Capabilities of AI
Why is Reinforcement Learning So Important?
While many different mathematical frameworks have evolved over time, Reinforcement Learning is the one I feel has the most potential and has also been massively overlooked. I am actually a total fangirl when it comes to this mathematical framework. Throughout my career I have been surprised that more people aren’t using it in day-to-day Data Science projects across industries: it elegantly captures the world we live in and casts it into a mathematical representation that can then be used to learn and solve for optimal outcomes that can be measured and guided. And thinking in a Reinforcement Learning framework has always helped my mathematically inclined brain make sense of the world around me.
But recently, Reinforcement Learning has made its way onto the buzzy AI scene with groundbreaking applications that advance the capabilities of GenAI. With RLHF and the latest breakthrough LLM reasoning capabilities, Reinforcement Learning is now at the center of the GenAI hype, driving improvements and enabling new applications.
GenAI + RL — a Match Made in Heaven
The two significant frameworks driving innovation in AI are both built on mathematics that mimics human behavior: Reinforcement Learning and Neural Networks.
Reinforcement Learning borrows from psychology by modeling the reward-driven learning process of dopamine receptors, providing a framework for agents to learn from experience. The algorithm focuses on sequential decision-making to maximize long-term rewards. With this reward-based mechanism, Agents learn somewhat by trial and error, just like I do!
Neural Networks, such as the LLMs we are familiar with today, have a mathematical architecture that mimics the structure of neurons in the human brain. These neurons mathematically encode “learning” from example data fed into the algorithm during training, and the learned patterns can then be used to make predictions on new data. GenAI models like LLMs are Neural Networks that model data distributions to generate realistic and coherent outputs. It is key to realize that GenAI models are non-deterministic predictive Neural Networks that are purpose-built to generate content such as language and images.
Figure 1: Image from Reinforcement Learning An Introduction by Richard S. Sutton and Andrew G. Barto
What’s Wrong With GenAI?
Although GenAI models have recently captured the hearts and minds of the tech community, I do not personally find these algorithms particularly captivating beyond their appropriate application to content generation, or as a small component in a larger AI system, where I do believe they will be truly disruptive. Despite the noteworthy buzz and amazing breakthroughs in the GenAI space, GenAI has some limitations that Reinforcement Learning can help address.
In particular, I’ve observed that GenAI suffers in four areas: the Alignment Problem, the Observability Problem, the Feedback Loop Challenge, and the Echo Chamber Effect. None of these issues should prevent you from developing and applying AI; instead, an awareness of these gaps should guide you on where to focus and what issues to anticipate and mitigate as you roll out high-quality, effective AI solutions.
The Alignment Problem
- What’s the issue? LLMs are purpose-built to generate language, and they are often trained over a very large corpus of text data in order to do so. Ensuring AI systems pursue objectives that align with human values and morals is complex, context-dependent, and often difficult to specify formally. Even with well-defined instructions in the prompt, LLMs can still generate unintended output that is not aligned with the purpose of your LLM app or the business goals you have for it. These general models also might not contain, or focus on, the niche knowledge required for the task you set for them. So there is a challenge in aligning a general LLM with a specific purpose and making sure the model consistently produces outputs that align with human values, goals, and safety requirements. Even with fine-tuning or a RAG system, LLMs may produce outputs that don’t fully align with the user’s intent. Without an explicit reward function that is in alignment with the intended use of the model, it is very difficult to optimize the model to perform the expected task.
- How does RL help? A technique called Reinforcement Learning from Human Feedback (RLHF), described in the section below, is used to align models to produce appropriate content and to teach them to perform well on specific tasks.
The Observability Problem
- What’s the issue? If you haven’t defined something explicitly in mathematical terms, you can’t always measure it mathematically. A prompt defines the purpose of an LLM app, like “generate a recipe,” but how do we then create a metric that measures whether the output is actually a reasonable recipe? That is a difficult thing to measure. Personally, I like to use a mix of heuristic metrics as well as LLM-as-a-Judge to measure and monitor LLMs for accuracy over time. There is an art to determining what exactly your LLM should be monitored for, which I address here. Yet without an explicit reward function that is in alignment with the intended use of the model, it is very difficult to measure the model’s performance on that task.
- How does RL help? Reinforcement Learning allows us to define clear objectives for an AI system in terms of a reward function. This creates a mathematical framework both for measuring part of the system’s performance and for controlling it.
The Feedback Loop Challenge
- What’s the issue? GenAI models do not naturally learn over time at scale. Once these models are trained, they remain static until manually updated. So fundamentally, LLMs can become a bit frozen in time; they aren’t naturally self-improving AI systems. AI systems would be more powerful if they could learn from us as we interact with them, which is how many people interacting with these models assume they already work. There are methods to build a feedback and model-updating loop into the software infrastructure around the models, and many models today are updated over time through fine-tuning with new data. However, fine-tuning LLMs at scale is expensive and resource-intensive.
- How does RL help? RLHF involves collecting human-annotated data and using this “Alignment Dataset” to further tune the system’s performance over time.
The Echo Chamber Effect
- What’s the issue? This is a well-studied effect in recommender systems that is now also relevant in the context of generative models. Like recommender systems, GenAI models learn from vast datasets, and as a result they can unintentionally reinforce existing biases found in the data. If a model is trained on biased data, it can perpetuate and even exaggerate those biases in its outputs. This creates echo chambers by amplifying dominant viewpoints while filtering out diverse perspectives, without exploring new patterns that could be optimal or more beneficial. Data created by AI will be fed back into AI systems, further reinforcing the patterns it is producing. Yikes.
- How does RL help? The explore/exploit paradigm of Reinforcement Learning described below could be leveraged to inject new unseen patterns into AI systems.
Reinforcement Learning algorithms, however, stand out from many AI and ML methods as self-improving algorithms. They are adaptive systems that optimize decision-making by learning through trial and error, using user-defined rewards and penalties to refine their behavior over time in alignment with their purpose. They have the built-in ability to learn from experience and adapt to changing patterns in data dynamically. It’s me. It’s how I learn. By trial and error and bumping into a lot of walls. Reinforcement Learning agents even feel things like “regret,” which is kinda cute and relatable, and also useful if you’re trying to teach the system to take actions that reduce undesirable outcomes relative to its defined purpose.
Many recent breakthroughs in AI sit at the intersection of GenAI and Reinforcement Learning. We are discovering that Reinforcement Learning can be applied with Generative AI to create more powerful, self-improving AI systems. But before I dive into the latest breakthroughs and how they address these pain points in GenAI, let’s pause and review the timeline of Reinforcement Learning.
A Timeline of Reinforcement Learning
Reinforcement Learning had been growing in popularity long before the discovery of its applications to improve GenAI, and the field has evolved over decades. Here are some key milestones worth noting.
🔹 1950: Alan Turing publishes “Computing Machinery and Intelligence” (Mind, Vol. 59, No. 236), proposing an approach to machine learning based on rewards and punishments while addressing the question “Can machines think?” This early work laid conceptual groundwork for reinforcement learning principles.
🔹 1957–1962: Richard Bellman develops dynamic programming, establishing mathematical foundations with the Bellman equation, a cornerstone of modern RL.
🔹 1988: Richard Sutton formalizes temporal difference learning methods. (Sutton, R. S. (1988). “Learning to predict by the methods of temporal differences.” Machine Learning)
🔹 1989: Christopher Watkins introduces Q-learning in his doctoral thesis, a breakthrough algorithm that allows learning optimal policies without requiring a model of the environment. (Watkins, C.J.C.H. (1989). “Learning from Delayed Rewards.” PhD thesis, Cambridge University)
🔹 1998: The publication of “Reinforcement Learning: An Introduction” by Richard Sutton and Andrew Barto is considered a defining moment, as it formalized and unified the field.
🔹 2013: DeepMind publishes their work on Deep Q-Networks (DQN), combining deep neural networks with Q-learning to master Atari games. (Mnih, V. et al. (2013). “Playing Atari with Deep Reinforcement Learning.” arXiv:1312.5602)
🔹 2016: AlphaGo (DeepMind) uses Deep Reinforcement Learning and Monte Carlo Tree Search to defeat world champion Lee Sedol in Go. (Silver, D. et al. (2016). “Mastering the game of Go with deep neural networks and tree search.” Nature)
🔹 2019–present: Deep Reinforcement Learning is adopted for various application areas including robotics, self-driving cars, and social media feed optimization.
🔹 2022–present: Application in Generative AI through Reinforcement Learning from Human Feedback (RLHF), becoming instrumental in training large language models.
🔹 2024: The Turing Award is awarded to Richard Sutton and Andrew Barto for their work laying the foundations of reinforcement learning.
It is also worth noting the significant application areas that are actively developing. Robotics has seen advances using Reinforcement Learning techniques, Autonomous Vehicles are incorporating Reinforcement Learning for perception, decision-making, and control systems, and Social Media Platforms are driving user engagement, and increasing your propensity for “doom scrolling,” with Reinforcement Learning algorithms for personalized content selection and recommendation.
Reinforcement Learning 101
To understand how Reinforcement Learning can advance GenAI, we must start with understanding the proper nomenclature and foundational mathematical definitions within RL. I’m going to simplify this as much as possible.
Figure 2: Image from Reinforcement Learning An Introduction by Richard S. Sutton and Andrew G. Barto.
This diagram from Sutton’s book illustrates the core Reinforcement Learning cycle, showing how an agent interacts with its environment. Here are the definitions of the key nomenclature:
- Agent: The learning entity or algorithm that observes states, makes decisions, and takes actions to maximize rewards. The Agent is an Acting Entity. Its ability to take actions independently differentiates it from other AI apps or predictive Machine Learning algorithms.
- Environment: The external system with which the agent interacts; it responds to the agent’s actions by transitioning to new states and providing rewards.
- State (S₍ₜ₎): The current situation or configuration of the environment at time t that the agent can observe. Note that (S₍ₜ₊₁₎) defines the new state of the environment at the next time step (t+1) after the agent has taken action A₍ₜ₎.
- Action (A₍ₜ₎): The decision or move made by the agent at time t in response to the current state.
- Reward (R₍ₜ₎): The immediate feedback signal received by the agent after taking an action at time t, indicating how good or bad that action was. Note that (R₍ₜ₊₁₎) defines the reward received at the next time step (t+1).
This feedback loop is the fundamental process in Reinforcement Learning, because it is how the Agent learns and improves its actions through trial and error. The Agent can ultimately converge to make better decisions over time, while still randomly exploring outside the optimal path to discover any new patterns it should exploit. This is all done by updating a Policy (think of this as the Agent’s strategy) to maximize reward and minimize regret.
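To make this loop concrete, here is a minimal sketch in Python using the Gymnasium library (the maintained successor to OpenAI’s Gym) and its CartPole environment. The randomly sampled actions stand in for a real policy; the point is simply to show how S₍ₜ₎, A₍ₜ₎, R₍ₜ₊₁₎, and S₍ₜ₊₁₎ map onto code.

```python
import gymnasium as gym

# Create an environment; CartPole is a classic toy control problem.
env = gym.make("CartPole-v1")

# S_t: the initial state the agent observes.
state, info = env.reset(seed=42)

total_reward = 0.0
for t in range(200):
    # A_t: the agent picks an action (here: at random, i.e., no real policy yet).
    action = env.action_space.sample()

    # The environment responds with R_{t+1} and the next state S_{t+1}.
    next_state, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    state = next_state

    # The episode ends when the pole falls (terminated) or a time limit is hit (truncated).
    if terminated or truncated:
        break

print(f"Episode finished after {t + 1} steps with total reward {total_reward}")
env.close()
```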
- Value Function (V): Represents how good it is for the agent to be in a particular state. It’s the expected total reward the agent can accumulate starting from that state and following a specific policy. The value function helps the agent evaluate states beyond just immediate rewards by considering long-term outcomes.
- Policy (π): The strategy the agent follows to decide which action to take in each state. It’s essentially the agent’s decision-making rule: formally, it maps states to actions (or probability distributions over actions). The policy guides the agent’s behavior and is what the agent is trying to optimize. Policies can be deterministic (always picking the same action in a given state) or stochastic (sampling an action from a probability distribution).
There are many learning algorithms built on this framework, such as Q-learning, SARSA, Policy Gradient methods, and more.
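To show what one of these algorithms looks like in practice, here is a minimal tabular Q-learning sketch, again using Gymnasium, this time with the small FrozenLake grid world. The learning rate, discount factor, exploration rate, and episode count are arbitrary example values, not tuned settings.

```python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1", is_slippery=False)

n_states = env.observation_space.n
n_actions = env.action_space.n
Q = np.zeros((n_states, n_actions))  # Q-table: estimated return for each (state, action) pair

alpha, gamma, epsilon = 0.1, 0.99, 0.1  # learning rate, discount factor, exploration rate

for episode in range(5000):
    state, info = env.reset()
    done = False
    while not done:
        # Explore/exploit: with probability epsilon take a random action, otherwise the greedy one.
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))

        next_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated

        # Q-learning update: move Q(s, a) toward the reward plus the discounted best next value.
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state

# The greedy policy simply picks the highest-value action in each state.
policy = np.argmax(Q, axis=1)
print(policy.reshape(4, 4))  # best action for each cell of the default 4x4 map
```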
Beyond this, Deep Reinforcement Learning merges deep learning with these reinforcement learning approaches, enabling such systems to capture complex patterns from large datasets to inform their decision-making.
A Survey of Current Advancements in the Generative AI Landscape using Reinforcement Learning
Now that we understand the basics of Reinforcement Learning, let’s talk about how it is relevant to GenAI today. Recent research at the intersection of reinforcement learning and LLMs has led to significant advancements making GenAI more adaptable, useful, and controllable. Below are some notable developments worth following:
RLHF: Reinforcement Learning from Human Feedback was the first major application of Reinforcement Learning to improving LLMs. This approach enhances LLMs by aligning their outputs with human guidance through a Reinforcement Learning feedback loop.
Figure 3: Image from Training language models to follow instructions with human feedback. Long Ouyang, et al. 2022. https://arxiv.org/pdf/2203.02155
RLHF involves three key steps:
- Collect human feedback. Human feedback on model-generated responses is collected, and the outputs are ranked based on quality and alignment with human preferences. This creates an Alignment Dataset that will be used to tune the model. The alignment dataset typically consists of pairs of model-generated responses ranked by human annotators, but it can also include curated examples that help prevent biased, harmful, or misleading outputs, as RLHF is often used to steer a model away from producing unwanted responses.
- Train a reward model. A reward model is defined and trained to predict human preferences using the alignment dataset. The reward model learns to approximate human judgment by mimicking the feedback provided (a minimal code sketch of this step is shown below the figure).
- Fine-tune the LLM using Reinforcement Learning. The model is fine-tuned using RL techniques like Proximal Policy Optimization (PPO) or Group Relative Policy Optimization (GRPO), sometimes guided by Process Reward Models (PRMs), to maximize the reward signal from the trained reward model. This nudges the LLM to produce results more in alignment with human feedback.
Figure 4: Image from “Reinforcement Learning in the Era of LLMs: What is Essential? What is needed? An RL Perspective on RLHF, Prompting, and Beyond”. Hao Sun. 2023. https://arxiv.org/pdf/2310.06147
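To make the reward-model step a bit more concrete, here is a minimal PyTorch sketch of the pairwise preference loss typically used to train it: the model is pushed to score the human-preferred (“chosen”) response above the rejected one. The tiny scoring network and the random feature tensors are placeholders for illustration only; in a real RLHF pipeline the reward model is usually a full LLM with a scalar head, trained on token sequences from the alignment dataset.

```python
import torch
import torch.nn as nn

# Placeholder reward model: in practice this is an LLM backbone with a scalar value head.
reward_model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

# Placeholder features for a batch of (chosen, rejected) response pairs from the alignment dataset.
chosen_features = torch.randn(8, 768)    # stands in for embeddings of human-preferred responses
rejected_features = torch.randn(8, 768)  # stands in for embeddings of dispreferred responses

chosen_scores = reward_model(chosen_features)      # r(x, y_chosen)
rejected_scores = reward_model(rejected_features)  # r(x, y_rejected)

# Pairwise ranking loss: -log sigmoid(r_chosen - r_rejected).
# Minimizing it pushes the model to score preferred responses higher than rejected ones.
loss = -torch.nn.functional.logsigmoid(chosen_scores - rejected_scores).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"reward model loss: {loss.item():.4f}")
```

The trained reward model then supplies the reward signal that PPO (or GRPO) maximizes during the fine-tuning step.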
One potential limitation of the RLHF approach is the need for human-annotated data to align the algorithm with its defined purpose. A truly automatic feedback loop that leverages the explore/exploit framework of Reinforcement Learning could improve an LLM’s behavior over time while still building in mechanisms to measure, align, and intervene on the algorithm’s output.
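As a purely illustrative sketch of what such a loop might look like, the toy example below treats choosing between a few candidate system prompts as a multi-armed bandit: the application occasionally explores an alternative prompt and otherwise exploits the one with the best average user feedback. The prompt names, the simulated feedback values, and the overall setup are hypothetical placeholders, not a description of any production system.

```python
import random

# Hypothetical candidate system prompts the application could route requests through.
candidate_prompts = ["prompt_a", "prompt_b", "prompt_c"]
counts = {p: 0 for p in candidate_prompts}        # how often each prompt has been used
avg_reward = {p: 0.0 for p in candidate_prompts}  # running average of user feedback per prompt

epsilon = 0.1  # exploration rate

def select_prompt():
    # Explore: occasionally try a random prompt to discover better options.
    if random.random() < epsilon:
        return random.choice(candidate_prompts)
    # Exploit: otherwise use the prompt with the best average feedback so far.
    return max(candidate_prompts, key=lambda p: avg_reward[p])

def record_feedback(prompt, reward):
    # Reward could be a thumbs-up/down (1.0 / 0.0) or any scalar feedback signal.
    counts[prompt] += 1
    avg_reward[prompt] += (reward - avg_reward[prompt]) / counts[prompt]

# Simulated interaction loop; in a real system the reward would come from actual users.
for _ in range(1000):
    prompt = select_prompt()
    simulated_reward = {"prompt_a": 0.3, "prompt_b": 0.7, "prompt_c": 0.5}[prompt]
    record_feedback(prompt, simulated_reward + random.gauss(0, 0.1))

print(avg_reward)  # the best-performing prompt should end up with the highest average
```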
Reasoning Models: Most standard LLMs (like GPT-3) had weak reasoning capabilities, particularly in math, logic, and multi-step problem-solving. Although powerful for language generation, these models were unable to “reason” effectively enough to accurately solve mathematical problems or perform complex reasoning tasks. This was a known limitation of LLMs, but recently new models have emerged, such as OpenAI’s o1 and o3 series, DeepSeek R1, and Google’s Gemini 2.0 Flash Thinking, that all leverage Reinforcement Learning to improve their reasoning capabilities and better perform these types of tasks.
I was specifically impressed with the new reasoning models’ performance on the mathematical benchmark datasets MATH-500 and AIME 2024.
Figure 5: Image from the Deepseek R1 release paper https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf
Reinforcement Learning for Multimodal Learning: Multimodal models are trained to process multiple types of data such as text, audio, images, and video. These models provide a powerful way to incorporate multiple types of data to produce meaningful output. They are used in text-to-image models, for example, to train a model to generate an image based on a text description.
How to Get Started in Reinforcement Learning Today
I don’t see a powerful future for LLMs and GenAI if the brain of the system cannot easily be given a specific purpose that can also be represented mathematically, since that is what allows us to control and measure the performance of the system. Nor do I see that future if these systems cannot self-improve and learn from new data at scale.
If you want to dive deeper into Reinforcement Learning and explore its applications to LLMs or contribute to advancements in this area, here are some tips to get started.
📚 First, understand the foundational mathematics of Reinforcement Learning and LLMs. This can be done through a combination of reading textbooks and papers. The references linked at the bottom of this blog are a great place to start. I like to read papers aloud and discuss them with a reading group to solidify my understanding.
💻 Second, start coding with it. Once you have a foundational understanding of the underlying mathematics and these concepts begin to feel intuitive to you, start playing around with practical applications. Get coding. Some good Python frameworks for implementing your own Reinforcement Learning project are Gymnasium (the maintained successor to OpenAI’s Gym) and Google’s TF-Agents. Work through some predefined code examples, or come up with your own personal project to apply these concepts in a real-world scenario. There are a lot of fun application areas for Reinforcement Learning, so I think you will have no problem coming up with a creative and engaging project to guide your hands-on learning experience.
I am personally excited to see how Reinforcement Learning will continue to develop, not only in applications of GenAI, but in generally improving AI capabilities across the board. Since my undergraduate studies, I’ve considered Reinforcement Learning my favorite family of algorithms because, after studying many forms of mathematics, I believe RL is the most elegant and complete mathematical framework for enabling AI and modeling the world we live in. I do not think AI will be able to continue to advance without the ability to mathematically define the purpose of an algorithm, measure its performance more directly, and have a built-in mechanism to align it and intervene to guide the results. The Reinforcement Learning framework lends itself well to all these tasks, addresses the major gaps in GenAI today, and will probably unlock more AI advancements to come, even beyond LLM and GenAI applications.
I’m excited to see what you build! 🚀
References:
- Introduction to Reinforcement Learning Video Lectures and slides by David Silver of DeepMind
- Mathematical-Foundation-of-Reinforcement-Learning book and video lectures by Shiyu Zhao
- Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto
- Training language models to follow instructions with human feedback. Long Ouyang, et al. 2022. https://arxiv.org/pdf/2203.02155
- Reinforcement Learning in the Era of LLMs: What is Essential? What is needed? An RL Perspective on RLHF, Prompting, and Beyond. Hao Sun. 2023. https://arxiv.org/pdf/2310.06147
- DeepSeek R1: Features, o1 Comparison, Distilled Models & More. Alex Olteanu. 2025. https://www.datacamp.com/blog/deepseek-r1
- Release paper for Deepseek R1. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, DeepSeek-AI, 2025. https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf
