Subject line split tests: which performance metric matters most?

Why split test?


Assumptions

  • Let’s assume we have a mailing list of 200k subscribers, and that we’ve set aside 30% of that database for split testing. 
  • We’ll split test 10 subject line variants to discover which subject line performs best on our audience.
  • We will declare the “winner” (the highest-performing subject line) based on its open rate, and we will then send that subject line to the remaining 70% of our audience.
  • Let’s further assume that there is a clear winner in our set of 10 subject line variations. Nine of the subject lines have an open rate of 15%, and one has an open rate that is 10% better (a 16.5% open rate).

Sampling error and noise

The reason we’re split testing is to help us pick the “true winner” (the highest-performing subject line) from the 10 variants we began with.

In statistics, there is a concept called “sampling error”, which is an error caused by observing a sample (the split tests) instead of observing the whole population.  

For example, consider one of our 9 subject lines with a 15% open rate. In our imaginary situation, each of our split test groups is made up of 6,000 people ((200,000×0.3)/10). In theory – based on our 15% open rate – exactly 900 people from each group should open the email. In practice, however, the actual number of opens generated varies in almost all cases.

Maybe 873 people will open, or maybe 926. This is partly due to sampling error: the members of our test pool are randomly assigned to smaller split test groups, and some groups will randomly get more “openers” (folks who are more likely to open a marketing email for whatever reason). There are other sources of *noise* as well. For example, perhaps one subscriber clicked on an email by mistake, while another didn’t sleep well the night before and skipped an email they normally would have opened.

The real world is messy. 
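To make this concrete, here is a minimal simulation sketch (assuming Python with NumPy, and the hypothetical figures above: 6,000-person groups and a true 15% open rate) showing how the observed number of opens bounces around the expected 900:

```python
import numpy as np

rng = np.random.default_rng(42)

GROUP_SIZE = 6_000      # (200,000 * 0.3) / 10 subscribers per split test group
TRUE_OPEN_RATE = 0.15   # the "true" open rate of one of our nine ordinary lines

# Simulate five independent sends of the same subject line to groups of 6,000.
# Each subscriber opens with probability 0.15, so opens ~ Binomial(6000, 0.15).
opens = rng.binomial(GROUP_SIZE, TRUE_OPEN_RATE, size=5)
print(opens)               # five observed open counts, scattered around 900
print(opens / GROUP_SIZE)  # observed open rates, scattered around 0.15
```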
 

Let’s take a look at an example. In the following graphs, we’ve “simulated” an email subject line split test. The top graph shows the true open rates; the bottom graph shows the observed results from one simulated split test experiment.



As you can see above, the best-performing email subject line did indeed have the highest open rate in this case, but in the observed split test data there is quite a bit of variation among all the other lines. This is due to the sampling error and “noise” we’ve just discussed.

It is important to keep the following in mind: sometimes a split test can give us the wrong result.

The probability of a split test selecting the subject line which will actually perform best when sent to a larger audience depends on a number of factors: 

  1. The baseline open rate (in our case ~15%). Rarer events tend to be noisier. The higher the baseline, the cleaner the split test data.
  2. The difference between the open rate of the “winning” subject line and the other subject lines used in the split test (in our case a 10% difference). Smaller differences are harder to detect. The bigger the gap between our “winning” subject line and the others, the cleaner the split test data.
  3. The size of the split test groups. Using bigger groups provides a more robust sample size and gives us more confidence in the results of our split test. (A simulation sketch after this list illustrates all three factors.)
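To see how these three factors interact, here is a small Monte Carlo sketch, again assuming Python with NumPy; `prob_pick_true_winner` is a hypothetical helper written for this illustration, not part of any library. It estimates how often a split test picks the true winner under a given baseline rate, lift and group size:

```python
import numpy as np

def prob_pick_true_winner(base_rate, lift, group_size, n_variants=10,
                          n_sims=20_000, seed=0):
    """Estimate how often the truly best variant wins a split test.

    base_rate  : true rate of the n_variants - 1 "ordinary" lines (e.g. 0.15)
    lift       : relative advantage of the best line (e.g. 0.10 for 10% better)
    group_size : subscribers per split test group
    """
    rng = np.random.default_rng(seed)
    # True rates: the last variant is the genuine winner.
    rates = np.full(n_variants, base_rate)
    rates[-1] = base_rate * (1 + lift)

    # Simulate observed successes for every variant in every experiment.
    successes = rng.binomial(group_size, rates, size=(n_sims, n_variants))
    # The test "succeeds" when the true winner has the highest observed count.
    wins = (successes.argmax(axis=1) == n_variants - 1)
    return wins.mean()

# Our running example: 15% baseline open rate, 10% lift, 6,000 per group.
print(prob_pick_true_winner(0.15, 0.10, 6_000))
```

Lowering the baseline rate or the group size in this sketch makes the estimated accuracy drop, which is exactly the pattern the click-rate and conversion-rate sections below illustrate.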

The problem with picking a winning subject line based on click rates

The urge to pick the winner of a subject line split test based on which subject line generated the highest click rate is very real. The reasoning behind such decisions is “opens are great, but what we really care about is getting people to visit our site”. 

This logic does appear sound in theory. However, consider factor #1 above: the likelihood of picking the true winner depends in large part on our baseline rate. Here’s the problem: click rates are invariably lower than open rates (because you can’t get a click without an open!), and the gap between the two is often quite large. Therefore, there is substantially more uncertainty in the “observed” values (the results of the split test experiment).

Let’s take a look at an example. Assume the same setup as above, but in this case the 9 subject line variations have a true click rate of 1%, and one is 10% better (click rate = 1.1%).  

Here is a sample result: 



As you can see, there is a lot more variation in the observed click rates in this case. Furthermore, another subject line has a higher observed result than the real winner. In other words, in this case our split test failed to pick the best-performing subject line.
 
That is not an ideal outcome. 
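Under the same hedged assumptions, the compact sketch below reruns the simulation for the click-rate version of the test (true rates of 1% for nine lines versus 1.1% for the real winner), estimating how often the true winner tops the observed click rates:

```python
import numpy as np

rng = np.random.default_rng(7)

GROUP_SIZE, N_VARIANTS, N_SIMS = 6_000, 10, 20_000
click_rates = np.full(N_VARIANTS, 0.01)  # nine ordinary lines at a 1% click rate
click_rates[-1] = 0.011                  # the true winner: 10% better (1.1%)

# Observed clicks per variant in each simulated split test.
clicks = rng.binomial(GROUP_SIZE, click_rates, size=(N_SIMS, N_VARIANTS))
accuracy = (clicks.argmax(axis=1) == N_VARIANTS - 1).mean()
print(f"True winner tops the observed click rates in {accuracy:.0%} of simulated tests")
```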


What about conversion rates?

As with click rates, sometimes email marketers are tempted to think “opens and clicks are great, but what we really care about is getting people to buy our products”.  

Again, the logic here is sound, but you can probably see where this is going. Conversions are even rarer than clicks, so the likelihood of encountering “noise” when split testing for conversions is even higher.  

Let’s assume a conversion rate of 0.075% and see what happens:  




Based on the number of subscribers we are dealing with here, the expected number of conversions for each split test group is very low. In this case, the best-performing email subject line didn’t generate any actual orders!

Picking a winning email subject line based on the highest observed conversion rate is akin to simply picking a subject line at random. 
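A quick back-of-envelope check (a sketch using the article’s hypothetical numbers) shows just how sparse conversion data gets at this scale:

```python
import math

GROUP_SIZE = 6_000
BASE_CONV = 0.00075            # 0.075% conversion rate for an ordinary line
WINNER_CONV = BASE_CONV * 1.1  # the true winner, 10% better (0.0825%)

expected_base = GROUP_SIZE * BASE_CONV      # ~4.5 orders per ordinary group
expected_winner = GROUP_SIZE * WINNER_CONV  # ~4.95 orders for the true winner

# Standard deviation of orders in one group (Binomial): sqrt(n * p * (1 - p)).
noise = math.sqrt(GROUP_SIZE * BASE_CONV * (1 - BASE_CONV))  # ~2.1 orders

print(expected_base, expected_winner, noise)
# The true advantage (~0.45 orders per group) is far smaller than the per-group
# noise (~2.1 orders), so rankings based on observed conversions are close to random.
```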


How does split test accuracy change over time?

In the examples above, we’ve looked at individual subject line split test outcomes. In these cases, the correct winner was selected using open rates, but using click and conversion rates led to errors.

If we repeat this experiment many times, how likely is it that we will uncover the true winner over time? 

As a general rule, conducting repeated split tests over a longer period of time produces much more robust performance data than any single subject line split test can. The insights we can glean into the subject line language that engages our audience most effectively carry much more weight and will have a greater impact on the long-term performance of an email marketing program.
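One way to make this concrete is a toy model, an assumption on our part rather than anything from the article’s data, in which the same ten variants are re-tested campaign after campaign and the cumulative open counts decide the winner. The sketch below estimates how the chance of identifying the true winner grows as results accumulate:

```python
import numpy as np

rng = np.random.default_rng(3)

GROUP_SIZE, N_VARIANTS, N_SIMS = 6_000, 10, 5_000
open_rates = np.full(N_VARIANTS, 0.15)
open_rates[-1] = 0.165  # the true winner, 10% better

for n_campaigns in (1, 5, 10, 20):
    # Cumulative opens per variant after n_campaigns repeated split tests.
    opens = rng.binomial(GROUP_SIZE, open_rates,
                         size=(N_SIMS, n_campaigns, N_VARIANTS)).sum(axis=1)
    accuracy = (opens.argmax(axis=1) == N_VARIANTS - 1).mean()
    print(f"{n_campaigns:>2} campaigns -> true winner identified {accuracy:.0%} of the time")
```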



Picking the right testing strategy for your list size

In the section on “Sampling error and noise” above, we looked at three factors that impact how likely it is for the best email subject line to win in a split test.  

The first factor we considered was baseline open rates. This is what we’ve been looking at so far: open rates are invariably higher than click rates or conversion rates, so we have a higher probability of generating a correct outcome when we use open rates as our key performance metric.

Another factor worth considering is the size of the subscriber list we’re dealing with. So far, we’ve been assuming a list size of 200,000, which is pretty small. For those brands with a substantially larger list size, using observed click rates or conversion rates as the key performance metrics to test for becomes more practical. 

In this interactive chart, we’ll see how our split test results change depending on the total database size we are working with. The left-hand side shows how our experiment changes depending on the database size, while the right-hand side shows the long-run accuracy of the test.


Pick a strategy based on your audience parameters

In all of our previous examples, we have used a 15% open rate, 1% click rate and 0.075% conversion rate for our simulations. Now, let’s customize this experiment for your brand’s unique audience parameters!

By choosing your average open, click and conversion rates below, you’ll be able to see the long-run accuracies of these metric-based tests for your brand and its unique audience:


Bottom line impact

All this theory is well and good, but why would a profit-driven company care?

The answer is quite simple: the decisions you make while conducting email subject line split tests, including the sample sizes you use and the metrics you choose to test for, will directly impact your email marketing bottom line. 

Let’s go back to our original example, in which one subject line performs 10% better than the others in the test and the average open, click and conversion rates were 15%, 1% and 0.075% respectively. To be able to estimate the expected revenue these subject lines will generate, we’ll need to make a couple more assumptions (a rough worked example follows the list):

– Our full list size is 700,000. 

– Each conversion generates $100 in profit. 

– Assume the underlying true open and click rates have a correlation of 0.9. Based on extensive Phrasee split test data, this is a plausible estimate for the underlying relationship between opens and clicks: the more people open your emails, the more will click through. Note that this assumes the variants are on-brand, not misleading, and contain comparable messaging.

– Finally, assume the same 0.9 correlation between clicks and conversions. Again, it is logical to assume that as more people come to your website, more purchases will be made. 
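As a rough worked example (a sketch that ignores the correlation structure above and simply applies the 10% lift straight through to conversions, which is an assumption on our part), here is the per-campaign profit gap between sending the true winner and sending an ordinary line to the remaining 70% of a 700,000-person list:

```python
LIST_SIZE = 700_000
HOLDOUT_SHARE = 0.70           # share of the list that receives the chosen subject line
PROFIT_PER_CONVERSION = 100    # dollars of profit per conversion

BASE_CONV = 0.00075            # 0.075% conversion rate for an ordinary line
LIFT = 0.10                    # assumption: the winner's 10% lift carries through to conversions

audience = LIST_SIZE * HOLDOUT_SHARE  # 490,000 subscribers receive the "winner"

profit_ordinary = audience * BASE_CONV * PROFIT_PER_CONVERSION
profit_winner = audience * BASE_CONV * (1 + LIFT) * PROFIT_PER_CONVERSION

print(profit_ordinary)                  # ~$36,750 per campaign with an ordinary line
print(profit_winner)                    # ~$40,425 per campaign with the true winner
print(profit_winner - profit_ordinary)  # ~$3,675 of profit at stake per campaign
```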

The chart below shows the expected profit for a single campaign send, depending on which metric you used to find your “winning” subject line. This demonstrates that decision-making accuracy should be prioritized over simply picking the metric that is furthest down the sales funnel and therefore closest to final revenue-driving events.



Final considerations and conclusion

All of the examples above have assumed that the best-performing email subject line generates results 10% better than the other variants you started with. If the true difference between variants is smaller, split test accuracy will be lower still.

There’s one last point to consider: we have assumed that we’re using 30% of our total database for our split tests. This is a reasonable assumption for average-sized datasets, but for particularly large datasets, we might consider decreasing the split percentage to optimize our potential revenue once we’ve identified our highest-performing email subject line.

In this article, we have investigated in detail the effects of noise and sampling error on email subject line split tests. We have seen that, without an effective, rigorous strategy, making subject line decisions based on split test results doesn’t always work.

The last couple of interactive charts allowed you to tailor mathematical simulations to your specific dataset parameters. Hopefully, this article will help you to build a statistically sound split testing strategy and apply it to your business in the future!

Good luck!
Phrasee Data Science Team

