The Definitive Guide to Large Language Models and High-Performance Marketing Content
- November 15, 2023
- Reading time: 15 min
By Toby Coulthard, Jasper Pye, and Neil Yager, PhD
As we pass through the “trough of disillusionment” in generative AI, enterprise marketers are beginning to understand the inherent limitations of today’s probabilistic Large Language Models (LLMs) with regard to trust, scale, and performance for marketing engagement. LLMs were built to be generally good at many different tasks. In reality, AI models leveraging symbolic logic may be necessary to counter the challenges LLMs face with trust and scale. Performant short-form text content is the last remaining vector for marketers to leverage AI to improve open and conversion rates through digital channels. This paper first demonstrates that LLMs alone aren’t well suited to the needs of the modern marketer due to their inherent limitations. Second, it proposes an alternative solution for performant content generation at scale.
In the rapidly evolving landscape of artificial intelligence (AI), few innovations have captured the attention of both tech enthusiasts and the public as much as Large Language Models (LLMs). These algorithms have dominated the AI news cycle and found their way into the mainstream media, leading to a mix of justified excitement and unjustified speculation. Almost everyone in the AI community agrees that LLMs are not just a passing trend. LLMs are here to stay. Amara’s Law states that people will tend to overestimate the impact of a technology in the short term and underestimate the impact in the long term. Only through understanding the inherent limitations of a novel technology can we accurately understand (and predict) what kind of impact it may have.
LLMs are ushering in a new era of marketing innovation. This is one of the first use cases that emerged after the release of OpenAI’s GPT-3 in 2020 and continues to be one of the most exciting applications. Understanding this space is not an option but a necessity for any modern and forward-thinking company.
Some companies will proceed with skepticism and caution, perhaps banning LLMs entirely. These companies may be forced to hire additional staff to enable content generation at scale, which is unlikely to be a viable or competitive long-term strategy. Most companies have accepted that they will increasingly rely on generative AI for content over the coming months and years. According to Gartner, by 2025, 30% of outbound marketing will be machine-generated, up from around 2% in 2022.
Companies adopting generative AI are faced with two options:
- Use a 3rd party solution
- Develop an in-house solution using commercial or open-source LLMs
This paper focuses on the second option: developing robust and effective solutions with LLMs for the real world.
AI vs LLMs: What’s the difference?
The terms AI and LLM are often used interchangeably in current discourse. However, these are two distinct concepts:
Artificial Intelligence (AI) encompasses a broad range of technologies and applications, including but not limited to machine learning, natural language processing, computer vision, and robotics. AI aims to create systems capable of performing tasks that typically require human intelligence.
Large Language Models (LLMs) are a subset of AI and machine learning technologies specializing in language-related tasks. LLMs are powered by deep learning and neural networks and are trained on massive datasets. Due to their ability to generate text, they fall under the “Generative AI” category.
Developing with Large Language Models (LLMs)
Companies with internal development resources may wish to investigate building an in-house solution. In theory, this might be more cost-effective and fit for purpose. There are many tools and platforms available for developing with LLMs. For example, OpenAI has an easy-to-use API, and several open-source models, such as Meta’s Llama, are available for commercial use.
Considerations and constraints
It is possible to build prototypes with LLMs quickly. This can lead to impressive demos, with the thinking: “If we can do this in a week, imagine what we can do in a few months!” However, this logic is risky and often leads to disappointment. LLMs make it easy to support common cases, but performance can degrade quickly for the long tail of corner cases.
Under the hood, LLMs are probability distributions over the words in a vocabulary. They do not have a knowledge base or contain dedicated algorithms for common sense reasoning and abstraction. They find the most likely output for a given input (usually a prompt). This imposes some fundamental limitations. In fact, Bill Gates stated in February 2023 that he does not expect LLMs to scale without knowledge representation or symbolic logic. Here are some challenges to be aware of when building LLM-based content generation systems for marketing:
There is a risk of LLMs outputting biased or inappropriate language. Progress has been made since early LLMs, and techniques such as “reinforcement learning from human feedback” (RLHF) can help. However, these are imperfect solutions. Furthermore, some open-source models provide raw output. In today’s climate, a viral misstep can take a steep toll on a brand’s reputation.
The output of an LLM can range from misleading to patently false. This common phenomenon is known as a “hallucination.” The core problem is that LLMs don’t have an underlying knowledge base. As mentioned above, they are probability distributions, not a set of facts. LLMs will output anything that fits the context, regardless of the underlying truth.
The output of an LLM will not always be appropriate, correct, or acceptable. For example, it may contain factual errors, or its tone may not match the intended brand voice. Developing with LLMs is unlike traditional software development, where such a problem is a “bug” that can be corrected with a code fix. Since LLMs are probabilistic models, fixing the problem requires changes to the prompt or retraining the model. However, these updates may change the output for all inputs. Therefore, by solving one problem, you are likely to introduce new ones.
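The probabilistic behavior described above can be made concrete with a toy sketch (illustrative numbers only, nothing like a real LLM): the model simply emits the most likely continuation of the context, with no notion of whether that continuation is true.

```python
# Toy illustration of a language model as a conditional probability
# distribution over the next token given the preceding context.
# The contexts and probabilities below are invented for illustration.
NEXT_TOKEN_PROBS = {
    ("free",): {"shipping": 0.7, "returns": 0.2, "gift": 0.1},
    ("shipping",): {"today": 0.5, "now": 0.3, "included": 0.2},
}

def most_likely_next(context: tuple) -> str:
    """Greedy decoding: return the highest-probability continuation,
    regardless of whether it is factually correct."""
    dist = NEXT_TOKEN_PROBS[context]
    return max(dist, key=dist.get)
```

The same mechanism that makes the output fluent is what makes hallucination possible: the model selects what is likely, not what is true.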
The above are inherent limitations of LLMs in general. There are also practical development and engineering limitations:
- When building a solution, it is risky to have a critical step that is outside of your control. For example, if your solution is built around a specific LLM accessed via a 3rd party API, the costs can change, the API will have downtime, or the vendor may drop support for a model or feature you rely on. On the 8th and 9th of November 2023, OpenAI had over 90 minutes of API downtime.
- LLMs require significant computing resources for inference. A single request may take several seconds. This is known as “inference latency.” It may not be a problem for some use cases, but it is orders of magnitude too slow for most real-time applications, especially those that must scale.
- The cost of a call to an LLM can vary widely. A single call might cost a fraction of a penny. However, prices can climb quickly depending on the provider, model, and use cases. For example, high-volume applications (such as personalization) may be prohibitively expensive.
- Hosting your own open-source model gives some degree of protection and control. However, hosting and deployment are technically difficult and require a specialized skill set that not all companies have. Furthermore, due to economies of scale, hosting and serving your own model can be significantly more expensive than using a 3rd party API.
- Open-source models are not currently at the cutting edge. Training LLMs from scratch is very expensive – in the 10s of millions of dollars. The organizations behind open-source models do not have the same budget and computing resources as major players like OpenAI, Meta, and Google. Open-source models are suitable for many applications, but they are not currently state-of-the-art for content generation.
- There is a rapid release cycle for closed and open LLMs, with new models coming out every few months. If you spend too long developing your solution around a single, specific LLM, it will likely be obsolete by the time you release it.
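The cost consideration above can be checked with back-of-envelope arithmetic. The prices and volumes below are illustrative assumptions, not quoted vendor rates:

```python
# Back-of-envelope API cost estimate. Token price is an assumed
# example value; real pricing varies by provider and model.
def monthly_cost(calls_per_month: int, tokens_per_call: int,
                 price_per_1k_tokens: float) -> float:
    """Estimated monthly spend on LLM API calls."""
    return calls_per_month * tokens_per_call / 1000 * price_per_1k_tokens

# Example: 10M personalized messages at 500 tokens each,
# at an assumed $0.002 per 1K tokens, is roughly $10,000 per month.
```

At personalization-scale volumes, even fractions of a penny per call compound into significant recurring costs, which is why high-volume use cases need careful cost modeling up front.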
Fine-tuning with training data
Fine-tuning Large Language Models (LLMs) involves additional training of pre-existing language models for specific tasks or domains. In other words, instead of training an LLM from scratch (with a price tag in the millions), fine-tuning builds on the parameters already learned by the base (or “foundation”) model. This is done through a process of supervised learning. The first step is to create a training set. This is a set of input examples and expected (or acceptable) output. The training algorithm adjusts the weights of the underlying model in response to the new training data. It is important not to confuse the machine learning definition of fine-tuning with the dictionary definition – in machine learning, to fine-tune does not simply mean “to improve.”
Some companies use their proprietary data to fine-tune an LLM. They hope this will improve the output quality and build a defensive moat between themselves and their competitors. There are some areas where fine-tuning is known to be helpful. For example, it can significantly improve classification task accuracy when specific domain expertise is required (e.g., processing legal documents). However, the process is not as straightforward or effective as one might hope for text-generation applications.
The quality of training data is a key variable for the success of fine-tuning. Significant manual effort is often required to ensure the training data is suitable. The training data should be:
- Accurate and free from errors. Furthermore, outliers and noisy data should be removed.
- The data must be consistent with the input expected when the model is released.
- There must be a sufficient volume of data for the model to learn from
- The data points should be balanced. In other words, the training data should not be overly focused on a few possible input types. It is better to have a broad range of data.
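The checks above can be partially automated before fine-tuning begins. A minimal sketch, assuming each record is a simple dict with `input`, `output`, and `category` fields (the field names are assumptions for illustration):

```python
from collections import Counter

def clean_training_set(records):
    """Drop empty or duplicate input/output pairs and report class
    balance so skewed datasets can be spotted before training."""
    seen, cleaned = set(), []
    for r in records:
        key = (r["input"].strip(), r["output"].strip())
        if not key[0] or not key[1] or key in seen:
            continue  # remove empties and exact duplicates
        seen.add(key)
        cleaned.append(r)
    balance = Counter(r["category"] for r in cleaned)
    return cleaned, balance
```

Automated checks like this catch mechanical problems (duplicates, empties, imbalance); judging whether the data is representative of future inputs still requires human review.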
As with all machine learning applications, the fine-tuned model’s performance will decrease if the training data is not properly structured, comprehensive, and of high quality.
Case Study: Fine-tuning to generate high-performing marketing language
Marketers have a message they want to deliver to their audience. However, there are many possible ways to craft that message. For example, assume there is a promotion that includes free shipping. The phrases “free shipping” and “we’ll cover the shipping costs” are equivalent from a semantics perspective. However, one of the phrases will resonate with a customer better than the other, making them more likely to convert. This is what we mean by “high performance.” The goal is to maximize engagement in some way that can be measured: opens, clicks, likes, shares, etc.
This is a potential application for fine-tuning. The idea is conceptually simple: show an LLM many examples of high-performing content, and the system will “learn” how to maximize engagement when generating future campaigns. This is a compelling idea, so we performed several experiments to evaluate it.
A training set is required to fine-tune an LLM. You can’t simply use examples of campaigns that have performed well in the past. Many factors contribute to a campaign’s success, and the wording of a message is just one of them. In fact, the same content can perform differently depending on when it is delivered, in what channel it is delivered, and to whom it is delivered. Here are some of the factors that impact the performance of a piece of content:
- Timing: This has short-term (time of day and day of the week) and long-term (seasonal trends) components.
- Campaign content history & frequency: The history of messages sent to individuals will impact their future behavior.
- Audience: Audiences can be split by demographics (e.g., age ranges or geographies) or engagement levels (e.g., loyalty tiers). These audiences will respond to content uniquely.
- Channel: A promotion sent through different channels (e.g., email vs push) will have different results.
Therefore, raw engagement metrics do not necessarily correlate with the quality of the content. The solution is to train on split test data. In this case, the message variants are sent to different random subsets of an audience simultaneously. Since there is only one independent variable, any difference in performance can be attributed to the content variations themselves.
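A split test of this kind can be evaluated with a standard two-proportion z-test; a minimal sketch using only the standard library:

```python
import math

def split_test_zscore(opens_a, sends_a, opens_b, sends_b):
    """Two-proportion z-test: is variant B's open rate significantly
    different from variant A's, given a randomized audience split?"""
    p_a = opens_a / sends_a
    p_b = opens_b / sends_b
    pooled = (opens_a + opens_b) / (sends_a + sends_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / sends_a + 1 / sends_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

For example, 200 opens from 1,000 sends versus 260 opens from 1,000 sends yields a statistically significant difference, so the performance gap can reasonably be attributed to the content variation.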
One way to frame the high-performance generation problem is as machine translation. Instead of translating between two languages (e.g., English and Icelandic), the goal is to re-write the input into a new version expected to perform better. As with translation between languages, the training data is language pairs: one variant with poor performance and a corresponding variant with high performance. For example:
- Input (low performing): Running shoes are on sale.
- Output (high performing): Get ready to hit the track since running shoes are on sale now.
The training data for this experiment is from a database of real-world email subject line split test results. We used 5000 data points, each containing an email subject line that achieved a low open rate when delivered to an audience segment and a corresponding variation with a high open rate when sent to a different audience segment. We used GPT-3.5 Turbo, which, at the time of writing, is the best model from OpenAI that supports fine-tuning.
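For reference, one record of such a training set might be serialized in the chat-style JSONL format used for fine-tuning chat models; the system prompt below is an assumption for illustration, not the one used in the experiment:

```python
import json

def to_finetune_record(low_performer: str, high_performer: str) -> str:
    """Serialize one split-test pair as a chat-format JSONL line:
    the low performer is the user input, the high performer is the
    target assistant output."""
    return json.dumps({"messages": [
        {"role": "system",
         "content": "Rewrite the subject line to maximize open rate."},
        {"role": "user", "content": low_performer},
        {"role": "assistant", "content": high_performer},
    ]})
```

Each of the 5000 pairs becomes one such line, and the fine-tuning job adjusts the base model’s weights toward reproducing the high-performing variants.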
The results showed that this is not a suitable approach to the problem. Out of 100 generations, more than 20 contained obvious factual errors. Many of the others included more subtle problems. Here are some examples:
|Input|Output – Generated Content|Comment|
|---|---|---|
|💸 Original is BEST – clothes to make your own… Style and savings with up to 50% off!|Don’t let up to 50% off pass you by.|This is an example of valid output – it contains no contradictory information and maintains the tone of the message.|
|🎄 Get into focus… Enjoy the holiday season!|Because We’re Festive… Here’s What You Need for The Holidays🎄|The message content is fine. However, it has been converted to title case, which may not be acceptable, depending on brand guidelines.|
|We love a good deal—get up to 70% off!|Your brain boost has just landed in your inbox—check out with TOTAL serenity, gal! This way for something heavenly. 💭|The input made no mention of a “brain boost.” Furthermore, the use of “gal” changes the tone of the message.|
|Watch the Golden Globe Award winners|* Just in * Did you predict these winners? Here is what we loved from the movies in 2017 …|The year 2017 has been hallucinated.|
|🏀 Find everything NBA! Thursdays are for saving.|Please confirm your email address – dunk & save entered the paint…|The reference to “confirm your email” has been hallucinated. Also, the phrase “entered the paint” is awkward.|
These results are not surprising when you consider the training data. Consider this example again:
- Input (low performing): Running shoes are on sale.
- Output (high performing): Get ready to hit the track since running shoes are on sale now.
Even though the underlying messages between the input and output are consistent (“shoes are on sale”), the high-performing output contains additional words and phrases to add context and flavor (without fundamentally changing the meaning). By structuring the training data in this way, the system has “learned” that you can add language to improve the performance of the content. However, it cannot distinguish between permissible and impermissible content additions.
This problem cannot be solved by increasing the amount of training data. Third-party results corroborate these findings, with recent research finding that 3% to 27% of generated outputs contain hallucinations.
There are other ways to structure the training data. For example, you can use pattern matching and named entity recognition (NER) to ensure the input and output have the same information content. However, in this case, the system learns that only small tweaks can be made to improve content performance. In reality, tweaks are unlikely to impact the performance of the language. Alternatively, the training data can be structured so the system learns commonplace expressions. In this case, the output is highly generic and suffers from a lack of diversity.
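The pattern-matching idea above can be sketched with a simple regex-based check that flags “hard facts” (percentages, prices, years) present in one variant but not the other. The pattern is illustrative; a production system would use proper named entity recognition:

```python
import re

# Illustrative pattern for "hard facts": percentages, dollar
# amounts, and four-digit years. Real systems would use NER.
FACT_PATTERN = re.compile(r"\d+%|\$\d+(?:\.\d+)?|\b\d{4}\b")

def fact_mismatch(source: str, variant: str) -> set:
    """Return facts that appear in one text but not the other,
    i.e., potential hallucinations or dropped information."""
    src = set(FACT_PATTERN.findall(source))
    gen = set(FACT_PATTERN.findall(variant))
    return src.symmetric_difference(gen)
```

Enforcing this constraint on the training pairs keeps the information content consistent, but, as noted, the model then learns only to make small tweaks that rarely move performance.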
The experiments in the previous section demonstrate that fine-tuning is not a promising approach for coaxing LLMs into generating high-performance content. The solution is to logically separate the problems of generation and performance (more on this below). However, perhaps we can fine-tune an LLM to learn to distinguish high-performing content from low-performing content. This would be an alternative to a smaller and dedicated machine learning model, with the motivation being that perhaps an LLM has an implicit notion of language and performance that would give it an edge over smaller models.
We ran an experiment to compare the accuracy of a fine-tuned LLM against our task-specific performance prediction engine. The task-specific engine uses an attention-based Long Short-Term Memory (LSTM) neural network.
The training data for fine-tuning contains examples of low and high-performing content from real-world split test results. The output is the variant that achieved a higher open rate when sent to real-world test groups.
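Assembling such pairwise examples from split-test results might look like the following sketch (the field names are assumptions for illustration):

```python
# Sketch: convert split-test results into pairwise training examples
# for a dedicated performance-prediction classifier.
def make_pairwise_examples(split_tests):
    """Each test holds two variants; the label records which variant
    won on open rate when sent to randomized test groups."""
    examples = []
    for test in split_tests:
        a, b = test["variants"]
        label = "b" if b["open_rate"] > a["open_rate"] else "a"
        examples.append({"text_a": a["text"],
                         "text_b": b["text"],
                         "winner": label})
    return examples
```

Because both variants in a pair were tested under identical conditions, the winner label reflects the content itself rather than timing, audience, or channel effects.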
The results were that the LLM-based solution had an error rate 30% higher than our dedicated machine learning model. This is not a surprising result. It has been observed in academia and industry that task-specific models often outperform fine-tuned LLMs.
Content generation and performance optimization are distinct and orthogonal problems. Our experiments suggest that conflating these problems in a single LLM leads to a higher hallucination rate. This is because the model cannot distinguish between core content (what to say) and stylistics (how to say it). The former is fixed, while the latter can be tailored to suit an audience. By trying to solve both problems simultaneously, the system is prone to maximize performance by changing the message’s meaning. BCG has done a similar analysis highlighting the challenges of fine-tuning your own model and how responses are bound by the model’s capability and original dataset.
There is another risk when content and performance are intertwined. A well-known principle in business is Goodhart’s Law: when a measure becomes a target, it ceases to be a good measure. This manifests itself in AI-powered content generation in two ways:
- It is usually possible to improve one metric at the expense of another. For example, if the AI system is designed to maximize open rates, this can be done at a cost to click rates by using spammy and misleading subject lines. An example would be a subject line “Congratulations, you’ve won a $100 gift card!” with the body of the email containing a message along the lines of “Gotcha! While you’re here, check out these deals”. A strategy like this would achieve high open rates in the short term. However, it would also alienate and anger an audience. This is an extreme example for illustration, but it demonstrates that there is a risk when an AI system optimizes towards a single metric without checks and balances in place.
- Another risk of treating every campaign in isolation is that it can lead to repetition over time. The system might decide on the “best” content and use it repeatedly. Eventually, this will lead to ad fatigue and a disengaged audience.
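A simple guard against the repetition risk above is to compare each candidate against recently sent content. A sketch using Jaccard similarity over word sets; the threshold is an arbitrary assumption that would be tuned in practice:

```python
def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two pieces of content (0 to 1)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def too_repetitive(candidate: str, history: list, threshold=0.6) -> bool:
    """Reject candidates too similar to recently sent campaigns,
    guarding against ad fatigue."""
    return any(jaccard(candidate, past) >= threshold for past in history)
```

A check like this gives the system institutional memory across campaigns, so “best” content cannot simply be recycled until the audience disengages.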
For these reasons, one must take a long-term view of performance rather than treat each campaign as a one-off optimization. To accomplish this, a clear separation between content and performance is essential. A monolithic neural network that handles both content generation and campaign planning is doomed to fail. As illustrated in Figure 1, a modular system with specialized sub-systems for generation, performance, historical diversity, and brand safety will lead to greater control and better long-term outcomes.
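The modular separation can be sketched as a simple pipeline in which each sub-system is an independent, swappable function (the function names are illustrative, not a real API):

```python
# Sketch of a modular architecture: generation, brand safety,
# historical diversity, and performance scoring as separate,
# independently replaceable components.
def pick_best_variant(variants, score, is_safe, is_fresh):
    """Filter candidates through brand-safety and diversity checks,
    then rank the survivors with the performance model."""
    candidates = [v for v in variants if is_safe(v) and is_fresh(v)]
    return max(candidates, key=score) if candidates else None
```

Because each component has a narrow responsibility, any one of them (e.g., the underlying generation model) can be upgraded or replaced without retraining or destabilizing the others.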
Here are some best practices to keep in mind when developing a system to maximize the performance of marketing content:
- Beware of LLM maximalism. LLMs are undoubtedly powerful models. However, they are not the best solution for everything; new technologies are seldom a panacea. As we saw in the experiments above, sometimes a task-specific model can significantly outperform even the latest LLMs.
- Be model agnostic. Risks are associated with being tied to a single LLM model or vendor. Also, new models are being released all the time. Therefore, design your system to work with a variety of underlying models. This will allow you to be nimble and quickly adopt new models as they are released.
- When it comes to brand voice and hallucination, it is best to use NLP technology to post-process your data and catch problems. Using a rule-based solution will be more reliable and give you more control than attempting to force an LLM to output perfect content every time.
- Keep a long-term view of “performance.” Do not focus on a single metric that is maximized independently for each campaign. Furthermore, implement systems to avoid excessive repetition over time.
- A dedicated machine learning model should handle performance prediction. This way, it will not interfere with content generation. Make sure only to train the model on carefully controlled and curated split test data so that any differences in engagement can be attributed to differences in the content variations.
- Your learning system should have a feedback loop. The results from new campaigns should be automatically added to your training data so your system continues improving over time.
- Make sure you have the right team for generation and optimization. This will include data scientists, linguists, MLOps engineers, etc.
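As an illustration of the rule-based post-processing mentioned above, a minimal guardrail might scan generated text for brand-voice violations before anything is sent. The deny list and title-case threshold are assumptions for illustration:

```python
import re

BANNED = {"gal", "gotcha"}  # assumed brand-specific deny list

def guardrail_issues(text: str) -> list:
    """Return a list of rule violations found in generated content."""
    issues = []
    words = re.findall(r"[A-Za-z']+", text)
    if any(w.lower() in BANNED for w in words):
        issues.append("banned word")
    # Flag title case if most words are capitalized (brand voice check)
    caps = sum(1 for w in words if w[:1].isupper())
    if words and caps / len(words) > 0.7:
        issues.append("title case")
    return issues
```

Deterministic rules like these are auditable and never drift: a rejected candidate can simply be regenerated, which is far more reliable than hoping the LLM outputs acceptable content every time.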
We are in the early days of a technological revolution driven by AI. However, many open questions exist about how to use this powerful new technology. There is no shortage of vendors who are offering impressive-sounding products. However, many of these are little more than a thin user interface in front of technology they do not own (or even fully understand).
When developing in-house solutions, it may sound compelling to say: “We have data we have been collecting for years. Let’s feed that into an LLM.” Firstly, we have demonstrated in this white paper that this isn’t a fruitful approach to the problem. Secondly, and more importantly, this isn’t the correct way to approach R&D. There is a lot of hype around AI, and a fear of missing out is driving many companies’ technical decisions. They are looking for any way they might be able to integrate LLMs into their tech stack. However, this is backward.

The correct approach is to start with a well-defined, valuable, and pre-existing problem. Next, evaluate a range of possible solutions. LLMs will be integral to solving complex problems like generating high-performing marketing content. However, the solution will also require various other technologies, such as natural language processing, reinforcement learning, optimization algorithms, and heuristic rules. Sequoia Capital declared that the next generation of performant technologies will use foundational models as a piece of a more comprehensive solution rather than the entire solution. Embracing a holistic approach ensures the effective utilization of AI in solving complex challenges.
Neil Yager, PhD
A 20-year veteran of the tech industry, Neil Yager, PhD is one of the world’s leading experts in the commercialization of AI. He has worked in various roles for prominent innovative and forward-thinking technology companies, including Canon Information Systems Research Australia (CiSRA), BT Imaging, and Biometix. He has written over a dozen academic publications, authored a book on data mining, and holds several patents. He holds a PhD in Computer Science from the University of New South Wales in Australia, and in 2009, Neil was awarded the Malcolm Chaikin Prize for Research Excellence in Engineering, a prestigious innovation award.
Toby Coulthard
With experience at IBM, Salesforce, McLaren Applied Technologies and Braze, from fledgling business to post-IPO – and with an academic background in Electronic Engineering and Nanotechnology at University College London and the London Centre for Nanotechnology, Toby Coulthard excels in taking innovative technologies to market. Toby has spent the last 9 years in marketing technology pushing the boundaries of what’s possible for today’s modern marketers. Toby also holds professional certifications from Harvard and Cornell universities.
Jasper Pye
A passionate product and technology leader with 14 years of experience spanning Product Management, Design, UX, and Brand in various industries, Jasper has specialized in Generative AI over the last eight years as a founding team member of Phrasee, where he leads product strategy, design, engineering, and data science. Born in Adelaide, a London native, he is usually found enjoying the finest parts of Italy with his wife Elisa or watching a band you’ve probably never heard of.