Reinforcement learning: The security layer for LLMs

...and the emergence of MLSecOps

👋 Semi-regular writing about technology and early-stage investing. Investor at Seedcamp, Europe’s seed fund.

Reinforcement learning (RL) is fast becoming the linchpin for integrating LLMs in the enterprise. It allows a business to layer its own preferences and policies on top of any off-the-shelf model, relatively painlessly.

Off-the-shelf LLMs will enable every employee to query disparate institutional knowledge via chat and produce near-perfect output in natural language. In doing so, they create a substantial risk that employees will access data they shouldn’t, violate policies and share non-compliant material.

With reinforcement learning, a business can manage access, curate output and preserve brand integrity. RL may become a de facto security and compliance layer for LLMs.

What is reinforcement learning?

Reinforcement learning is a subset of ML in which an AI agent learns through feedback on its actions. That feedback might come from a ‘human-in-the-loop’, from another AI program, or from contextual data. These methods fall into three buckets: reinforcement learning with human feedback (RLHF), reinforcement learning with machine feedback (aka RLAIF), and inverse reinforcement learning.

‘Feedback’ takes the form of critiquing variants of a model’s output, typically by ranking them. Those rankings train a ‘reward model’, and feeding its scores back into the LLM nudges it towards outputs with higher ‘reward scores’. This is effective at improving and/or adapting model performance; by applying RLHF, Meta’s Llama 2-Chat improved with successive feedback loops.
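To make the ranking step concrete, here is a minimal sketch of how ranked comparisons can train a reward model using a pairwise (Bradley-Terry style) loss in PyTorch. The `RewardHead` class, embedding shapes and data are illustrative stand-ins, not any vendor’s actual pipeline.

```python
# Minimal sketch: ranking feedback trains a reward model via a pairwise loss
# that pushes the score of the preferred ('chosen') output above the rejected one.
import torch
import torch.nn as nn

class RewardHead(nn.Module):
    """Scores an output embedding; in practice this sits on top of an LLM."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.score(emb).squeeze(-1)

hidden = 768
reward_model = RewardHead(hidden)
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-4)

# Stand-in embeddings for a batch of (chosen, rejected) output pairs.
chosen_emb = torch.randn(8, hidden)
rejected_emb = torch.randn(8, hidden)

# Pairwise ranking loss: reward(chosen) should exceed reward(rejected).
loss = -torch.nn.functional.logsigmoid(
    reward_model(chosen_emb) - reward_model(rejected_emb)
).mean()
loss.backward()
optimizer.step()
```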

Reinforcement learning is different to supervised learning, which is the primary technology used to train a foundational model. Supervised learning uses labelled examples with a clear right or wrong answer. Reinforcement learning explores a new environment with unlabelled examples and learns a policy, via feedback, that dictates what action should be taken at a given moment. Whereas reinforcement learning builds a reward model to compare right and wrong answers, supervised learning depends on a labeller actually drafting the desired behaviour.

The InstructGPT paper breaks the process into three steps: step 1 is supervised learning; steps 2 and 3 are reinforcement learning.
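Schematically, the three steps chain together roughly like this. The function and variable names below are placeholders for illustration, not OpenAI’s actual code.

```python
# A schematic of the three-step InstructGPT-style pipeline. Names are placeholders.

def supervised_fine_tune(base_model, demonstrations):
    """Step 1 (supervised learning): labellers write desired answers; train on them directly."""
    ...

def train_reward_model(sft_model, ranked_comparisons):
    """Step 2 (reinforcement learning): labellers rank several outputs per prompt; fit a reward model to the rankings."""
    ...

def rl_fine_tune(sft_model, reward_model, prompts):
    """Step 3 (reinforcement learning): optimise the policy against the reward model, e.g. with PPO."""
    ...

base_model = demonstrations = ranked_comparisons = prompts = None  # placeholders

policy = supervised_fine_tune(base_model, demonstrations)      # Step 1
reward_model = train_reward_model(policy, ranked_comparisons)  # Step 2
policy = rl_fine_tune(policy, reward_model, prompts)           # Step 3
```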

Why is reinforcement learning important for securing LLMs?

What counts as ‘good’ and ‘bad’ model output differs across enterprises. Employees at two different organisations, doing different roles, might need the same underlying ‘performance’ from a standard model, but their definitions of a ‘good’ output will vary according to clearance, relevance, risk of leakage, confidentiality, etc.

Rather than custom fine-tuning a model by having an employee write out ‘good’ answers (supervised learning), an employee can more quickly rank outputs across different tasks and feed those rankings back into a model that is effectively locally customised. This is significantly cheaper and probably more responsive and adaptive, enabling each organisation to achieve security and confidence in local model output.
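As an illustration of how lightweight that ranking step can be, here is a sketch (with made-up field names and example data) that expands an employee’s per-prompt ranking into the (chosen, rejected) pairs a reward model would train on.

```python
# Turning quick employee rankings into preference pairs for local RLHF fine-tuning.
from itertools import combinations

# One record per prompt: candidate outputs ranked best-first by an employee.
rankings = [
    {
        "prompt": "Summarise Q3 pipeline for the sales team",
        "outputs_ranked": [
            "Top-line summary with no client PII",         # best
            "Summary naming individual client contacts",   # worse
            "Full export including contract values",       # policy breach
        ],
    },
]

def to_preference_pairs(records):
    """Expand each ranking into (chosen, rejected) pairs for reward-model training."""
    pairs = []
    for record in records:
        for better, worse in combinations(record["outputs_ranked"], 2):
            pairs.append(
                {"prompt": record["prompt"], "chosen": better, "rejected": worse}
            )
    return pairs

print(to_preference_pairs(rankings))
```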

Llama 2-Chat and GPT-4 quickly improved with this type of RLHF fine-tuning, relative to given reward models.

This process of RL fine-tuning can also be automated with ‘inverse’ reinforcement learning. Rather than an employee or a machine applying labels and ranks manually, an algorithm can read an enterprise’s data, extract the reward function based on observed behaviour and then perform this feedback process based on the context. For example, because no analyst has ever seen the company’s P&L in their inbox, the algorithm would feed back to the model that the P&L is not an acceptable output for an analyst’s prompt.
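Here is a deliberately simplified sketch of that idea, with invented roles and document types. Real inverse RL fits a reward function to observed behaviour; this toy version just counts who has been seen accessing what and scores candidate outputs accordingly, purely to show where the signal comes from.

```python
# A simplified stand-in for inverse RL: infer which (role, document type)
# pairs are acceptable from observed behaviour, then score outputs against
# that inferred reward.
from collections import Counter

# Observed enterprise behaviour, e.g. mined from email or document logs.
observed_access = [
    ("analyst", "market_report"),
    ("analyst", "market_report"),
    ("cfo", "p_and_l"),
    ("cfo", "market_report"),
]

access_counts = Counter(observed_access)

def inferred_reward(role: str, doc_type: str) -> float:
    """Reward is positive only if this role has been observed seeing this doc type."""
    return 1.0 if access_counts[(role, doc_type)] > 0 else -1.0

# No analyst has ever seen the P&L, so surfacing it scores negatively.
print(inferred_reward("analyst", "p_and_l"))   # -1.0
print(inferred_reward("cfo", "p_and_l"))       #  1.0
```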

Inverse reinforcement learning is often preferred as it sidesteps philosophical challenges. Who decides the rules? Can human values be represented as rewards, and does doing so invite reward hacking? Additionally, reward functions generated by humans are often not “dense and well-shaped” enough for compelling fine-tuning use cases. By definition, labellers are constrained to incremental/numerical ranking systems that may literally be binary. This can obscure nuance in ‘good’ and ‘bad’ that organisational context can unlock.

Crucially, other ways to ’secure’ LLMs may degrade performance. For example, if confidential data that shouldn’t be accessed by certain parties is excluded from the dataset altogether, there will be functional limits: HR can’t search for someone’s salary, and Finance can’t access financial statements.

Will OpenAI build this?

Maybe. Probably not. Security - along with analytics - for enterprise products with end-users is typically sold by a 3rd party vendor. It rarely makes sense for a vendor to adjudicate the safety and performance of user interactions with its own product.

A company that offers security to resolve misuse of its own product is conflicted by its vested interest. It creates fundamental misalignment. The more performant a model becomes at retrieving relevant data, the smaller the incentive to provide a layer that limits its capabilities. Why make the core product less exciting?

This trust problem is the reason that Gmail didn’t build Tessian to protect people against phishing. Gmail wants more emails to reach their destination, not fewer!

That said, in a world where a reinforcement learning product is pitched as the customisation engine for LLMs - rather than security - it’s plausible that Meta, Google, Anthropic, etc. all offer enterprise users the opportunity to rewire their local models with reinforcement learning.

Other uses of reinforcement learning

Reinforcement learning is also used to optimise off-the-shelf models when they are being trained for production. The result in this case is improved performance rather than security. The ‘feedback’ in this scenario is very general rather than specific to a given enterprise.

PPO (Proximal Policy Optimization) and FARL (Final-Answer Reinforcement Learning) are RL techniques that improve performance relative to given benchmark criteria.
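For a flavour of what PPO optimises during this kind of fine-tuning, here is a minimal sketch of the clipped PPO objective with a KL penalty against a frozen reference model. The tensors are random stand-ins for per-token log-probabilities and advantages; this shows the objective arithmetic only, not a full training loop.

```python
# Clipped PPO objective for RLHF-style fine-tuning, with a KL penalty
# keeping the policy close to the frozen (pre-RL) reference model.
import torch

clip_eps = 0.2   # PPO clipping range
kl_coef = 0.1    # weight of the KL penalty

# Per-token log-probs from the current policy, the policy that sampled the
# data ("old"), and the frozen reference model, plus advantages derived
# from the reward model's scores. All random stand-ins here.
logp_new = torch.randn(4, 16, requires_grad=True)
logp_old = torch.randn(4, 16)
logp_ref = torch.randn(4, 16)
advantages = torch.randn(4, 16)

ratio = torch.exp(logp_new - logp_old)
clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)

# Clipped surrogate objective (to be maximised), so the loss is its negation.
policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()

# Approximate KL penalty to stop the model drifting from its pre-RL behaviour.
kl_penalty = (logp_new - logp_ref).mean()

loss = policy_loss + kl_coef * kl_penalty
loss.backward()
```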

This application of reinforcement learning creates challenges where the end user is unspecified. In the case of OpenAI, for example, the deployment of reinforcement learning led to accusations of bias. Biased model output could result from the way a labeller ranks outputs. RLHF rankings may skew towards a labeller’s personal perspectives and opinions rather than just the relevant reward criteria.

Minecraft and other games are also starting to use RLHF based on feedback implied by how players move through the game. This enables rapid game updates and may at some point in the future allow everyone to play slightly different versions of the game based on their own preferences. RLHF for gaming has also been examined in academic settings using pre-defined motivations - i.e. an intrinsic reward model - in the cases of games like Crafter and Housekeep.

Reinforcement learning is by no means new, but has perhaps been searching for an independent use case. DeepMind first caught people’s attention with an algorithm that could beat humans at Atari games when given only players’ scores and the images they saw while playing, which soon led to AlphaGo. In many other cases, it was less obvious where reinforcement learning would have an advantage over supervised learning.

LLM security could be that use case. Unlike other forms of MLSecOps, which provide ‘security’ in a more classic sense - protecting against vulnerabilities like insecure endpoints, pipelines or infrastructure misconfigurations - reinforcement learning stops the wrong people seeing the wrong stuff.

In a world where enterprise data continues to roughly double in volume every two years, the technology to secure it will be immensely important.
