• Will Bennett

Small language models and building defensibility

We want models on our laptops! Takeaways from Oxford GenAI Conference

👋 Semi-regular writing about technology and early-stage investing. Investor at Seedcamp, Europe’s seed fund.

Sir Nigel Shadbolt recently kicked off Oxford’s Generative AI Summit with a keynote on emerging large language model research. Skipping over the arms race for the most performant mega-models, Sir Nigel focused on smaller models.

Compared to large models, whose billions of parameters rely on vast quantities of training data and tens of thousands of processors, smaller models target equivalent performance at a tiny fraction of the cost. For many enterprise use cases, this approach may be a much better fit.

It is difficult to build a moat around large foundation models

The largest language models are battling each other to be the most ‘accurate’, a proxy for performance as measured by benchmarking tools like Google’s BIG-bench. Since the release of GPT-4 in March, the top spot for model accuracy has been hotly contested.

February’s performance rankings for open-source models

Most recently, Mistral’s launch propelled its 7B-parameter model to parity with its closest open-source competitors, Meta’s LLaMA and Google’s PaLM. Mistral is part-way through consuming $113m to get there. Yet-to-launch models such as Poolside’s, and closed-source models like OpenAI’s ChatGPT, likely achieve similar benchmark scores.

Currently, ‘performance’ correlates with the number of parameters in a model. Training a model with lots of parameters requires lots of GPUs. GPU supply is finite, so the competitive advantage reduces to cash availability. In today’s market, models are growing in size much faster than GPUs can leverage Moore’s Law. As a result, it is increasingly likely that at some point GPU capability will not keep pace with growing model sizes.
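To make the cash-availability point concrete, here is a back-of-envelope sketch using the widely cited ~6 × parameters × tokens FLOPs heuristic for training cost. The GPU throughput, price, and utilisation figures are illustrative assumptions, not quotes from the post:

```python
# Rough training-cost estimate using the ~6 * params * tokens FLOPs heuristic.
# All constants below are illustrative assumptions (roughly A100-class numbers),
# not figures from the article.

def training_cost_usd(params, tokens, gpu_flops=3.12e14,
                      gpu_hour_usd=2.0, utilization=0.4):
    """Rough dollar cost to train `params` parameters on `tokens` tokens.

    gpu_flops: peak throughput of one GPU (~312 TFLOPs, BF16).
    utilization: fraction of peak throughput achieved in practice.
    """
    total_flops = 6 * params * tokens          # heuristic total compute
    gpu_seconds = total_flops / (gpu_flops * utilization)
    gpu_hours = gpu_seconds / 3600
    return gpu_hours * gpu_hour_usd

# A 7B-parameter model on 1T tokens vs a 70B model on 2T tokens:
small = training_cost_usd(7e9, 1e12)
large = training_cost_usd(70e9, 2e12)
print(f"~${small:,.0f} vs ~${large:,.0f}")
```

The exact dollar figures matter less than the scaling: compute cost grows multiplicatively with both parameter count and dataset size, which is why bigger models quickly become a capital game.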

So how can AI companies build an edge without infinite capital?

Small models

Small models currently receive less attention than their larger peers, but are emerging as the ground-layer of very defensible businesses.

Small models are created by distilling or ‘fine-tuning’ open-source LLMs like Meta’s LLaMA, and usually trade a small amount of accuracy for enormous cost savings.
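The ‘condensing’ idea here is usually knowledge distillation: a small student model is trained to match the output distribution of a large teacher, not just the ground-truth labels. A pure-Python toy of the core loss (real pipelines use PyTorch or JAX on actual model logits):

```python
# Toy knowledge-distillation loss: KL divergence between the teacher's and
# student's temperature-softened output distributions. Illustrative sketch
# only; production distillation operates on full model logits per token.

import math

def softmax(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# The closer the student's logits are to the teacher's, the smaller the loss:
teacher = [2.0, 1.0, 0.1]
close_student = [1.9, 1.1, 0.2]
far_student = [0.1, 1.0, 2.0]
assert distillation_loss(teacher, close_student) < distillation_loss(teacher, far_student)
```

Because the student only has to imitate the teacher’s behaviour on a target distribution of inputs, it can be orders of magnitude smaller than the model it learned from.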

Small models fall into two approximate groupings:

Group 1: General-purpose versions of bigger models like Vicuna, Koala or Alpaca, that enhance speed and security.

Group 2: Use-case-specific versions of bigger models designed for specific applications, like Sec-PaLM and Med-PaLM.

The most obvious benefit of both groups is that small models are just cheaper. In some cases, they can be trained on a single GPU, and at most, the cost lands somewhere in the tens of millions rather than the hundreds of millions. Vicuna-13B, for example, cost just $300 to train. Smaller models are also able to run locally, giving CISOs immense peace of mind.

Both groups are also able to achieve a level of accuracy comparable with larger models.

For example, Vicuna achieves roughly 90% of ChatGPT’s quality. BAIR, the Berkeley research group behind Koala, found that over half of its sample users felt the Koala and Alpaca models were as good as, or better than, ChatGPT.

Emerging research even shows signs that the throughput of smaller models can be enhanced to the point of outperforming their older sisters.

Test sets for Alpaca and Koala used open-source and ‘distillation’ data. The distillation data included 60k dialogues shared by users via ShareGPT and the HC3 English dataset, which contains around 60k human answers and 27k ChatGPT answers to around 24k questions.

The few percentage points of difference between small and large models may not matter much at all, and the jury is out on whether organisations will settle for enterprise tooling built on small models.

Apple, for example, is currently exploring how to incorporate small models that live locally on your device. Details of this strategy, led by Tatsunori Hashimoto of Stanford University, are still emerging, but the model is likely to be device-constrained, with only enough compute to work on context lengths of a small number of tokens. ChatGPT, by contrast, works on context lengths of up to 8,192 sub-word tokens, or roughly 6,000 words.

The subject-specific datasets of Group 2 can enhance accuracy even more dramatically, since the model input is designed for the desired output. This approach can lead to models that are especially performant - and still cheaper - for specific use cases. As Elad Gil notes, we could land in a world where “niche models end up performing roughly as well as, or better than, the large scale models for most applications”.

The reason for this is that generalist LLMs ingest a little bit of everything. The consensus is that much of this is not useful for enterprises, as demonstrated by the data ingested by Google’s PaLM below.

Lots of it is human-to-human chatter. Even some of the non-chatter data is likely to contain its publishers’ biases. At Oxford’s GenAI conference, Nigel Toon, CEO of Graphcore, even reckoned that more than 80% of Wikipedia’s contributing editors are white, middle-aged Americans.

If you eliminate this chatter with fine-tuning, you arrive at a much more focused model. For example, Google’s Med-PaLM and Sec-PaLM are fine-tuned for medical and security use cases. Sec-PaLM focuses on specific security intelligence, such as Mandiant’s intelligence on vulnerabilities, malware and threat indicators. Similarly, Writer’s Palmyra-Med is based on Palmyra-40B, fine-tuned on the curated medical datasets PubMedQA and MedQA.

Med-PaLM famously passed the US Medical Licensing Examination, and Med-PaLM 2 reached 86.5% accuracy on the MedQA medical exam benchmark.

In a similar example, models trained on TinyStories, a very small synthetic dataset of tales containing only words typical of toddlers, can produce production-worthy passages. TinyStories models generate plausible text at ~10M parameters and, at ~30M parameters, compete directly with GPT-2 XL and its 1.5B parameters.

What does a world of small models look like?

If a world of small models materializes, the way we all interface with LLMs is likely to require a few extra layers of technology.

Rather than prompting ChatGPT directly via OpenAI, or prompting LLaMA to generate output aligned to enterprise ‘guardrails’, we might prompt a routing layer. A routing layer would direct each prompt to the small model that best fits its purpose. Equally, the routing layer might decide whether you need the full-size LLaMA or just a smaller, fine-tuned version.
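A minimal sketch of what such a routing layer might look like. The model names and the keyword heuristic are entirely hypothetical; a production router would more likely use a classifier or embedding similarity to pick a model:

```python
# Hypothetical routing layer: inspect a prompt and dispatch it to the small
# model best suited to it, falling back to a general-purpose model.
# All model names and keywords below are illustrative assumptions.

ROUTES = {
    "medical": {"diagnosis", "symptom", "drug", "patient"},
    "security": {"vulnerability", "malware", "cve", "threat"},
}

MODELS = {
    "medical": "med-model-13b",
    "security": "sec-model-13b",
    "general": "general-model-70b",   # full-size fallback
}

def route(prompt: str) -> str:
    """Return the name of the model that should handle this prompt."""
    words = set(prompt.lower().split())
    for domain, keywords in ROUTES.items():
        if words & keywords:          # any domain keyword present?
            return MODELS[domain]
    return MODELS["general"]

print(route("Summarise this cve and its threat level"))  # sec-model-13b
print(route("Which symptom suggests this diagnosis"))    # med-model-13b
print(route("Write a poem about autumn"))                # general-model-70b
```

The interesting design question is where this layer lives: in the application, in middleware, or offered as a product in its own right.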

This world requires a “constellation of models” in the words of Tomasz Tunguz. In order for a routing layer to work, a critical mass of models must exist to rival the breadth of a larger model across every use case. Whatever this future looks like, we almost certainly won’t all need a gazillion parameters for everything we want to do.

