Misadventures in experiments for growth

by MICHAEL FORTE

Large-scale live experimentation is a big part of online product development. In fact, this blog has published posts on this very topic. With the right experiment methodology, a product can make continuous improvements, as Google and others have done. But what works for established products may not work for a product that is still trying to find its audience: many of the assumptions on which the "standard" experiment methodology is premised simply do not hold. This means a small and growing product has to use experimentation differently and very carefully. Indeed, failure to do so may cause experiments to mislead rather than guide. This blog post is about experimentation in this regime.


Established versus fledgling products

For the purpose of this post, "established products" are products that have found viable segments of their target user populations, and have sustained retention among those segments. These established products fill a particular need for a particular set of users, and while these products would want to expand, they do not need to as a matter of existence. Product viability these days does not necessarily mean being a financially viable standalone product either. Fulfilling unmet user needs is often enough to be of value to a larger company that might someday acquire you. For established products, growth is incremental rather than a search for viability or a matter of survival.

In contrast, "fledgling products" are products that are still trying to find their market. Now how is it possible for these fledgling products to exist, do something, have enough users that one could contemplate experimentation, and yet still not have market fit? Wonders of the internet (and VC funding)! Modern products often don't start with set-in-stone business models because starting and scaling costs are low. They may begin with one idea, but then gather enough momentum to pivot and fill emergent needs. You do some cool stuff and then try to figure out from usage patterns what hole your product is filling in the world (so-called "paving the cowpath"). Instrumentation and analysis are critical to finding this unexpected use.


Why does anyone use experiments?

Let's revisit the various reasons for running experiments to see how relevant they are for a fledgling product:

To decide on incremental product improvements

This is the classic use case of experimentation. Such decisions involve an actual hypothesis test on specific metrics (e.g. version A has better task completion rates than B) that is administered by means of an experiment. Are the potential improvements realized and worthwhile? This scenario is typical for an established product. Often, an established product will have an overall evaluation criterion (OEC) that incorporates trade-offs among important metrics and between short- and long-term success. If so, decision making is further simplified.

On the other hand, fledgling products often have neither the statistical power to identify the effects of small incremental changes, nor the luxury to contemplate small improvements. They are usually making big changes in an effort to provide users a reason to try and stick with their fledgling product.

To do something sizable

Sizable changes are the bulk of changes a fledgling product is making. But these are not usually amenable to A/B experimentation. The metrics to measure the impact of the change might not yet be established. Typically, it takes a period of back-and-forth between logging and analysis to gain confidence that a metric is actually measuring what we designed it to measure. Only after such validation would a product make decisions based on a metric. With major changes, the fledgling product is basically building the road as it travels on it.

That said, there might still be reasons to run a long-term holdback experiment (i.e. withhold the change from a subset of users). It can provide a post hoc measure of eventual impact, and hence insight into what the product might try next. This is not the classic case of hypothesis testing via experimentation, and thus the measured effects are subject to considerations that come with the territory of unintentional data.

To roll out a change

We have a change we know we want to launch — we just want to make sure we didn't break anything. We might use randomization at the user level to spread the rollout. Unlike, say, rolling out data center by data center, user-level randomization produces metrics that are free of sampling bias and can therefore detect regressions with fewer treated units. This is a good use of existing experiment infrastructure, but it is not really an experiment as in hypothesis testing.


In summary, classic experimentation is applicable to fledgling products, but in a much more limited way than to established products. With continual large product changes, a fledgling product's metrics may not be mature enough for decision making, let alone amenable to an OEC. Focusing on the long term is great advice, but only once immediate survival can be assumed.

Your users aren't who you think they are!

A more fundamental problem with live experiments is that the users whose behavior they measure might not be who we imagine them to be. To illustrate the issues that arise from naively using experimentation in a fledgling product, let us imagine a toy example:



We have an MP3 music sales product that we have just launched in a "beta" state. Our specialty is to push users through a series of questions and then recommend, for purchase, tracks that we think they will like. We back up our belief by offering a full refund if they don't love the song. Each page-view of recommendations consists of an appealing display of tracks, from which the user may click one to purchase. The product is premised on making it a no-brainer to purchase a single song ("for the price of chewing gum", according to our marketing message).

We define an impression as a recommendation page-view and a sale as the purchase of a track. Of particular interest to us is the conversion rate, defined as the fraction of impressions resulting in sales. To grow, we paid for a small amount of advertising and now have a slow but steady stream of sales, roughly 5,000 sales per day from about 100K impressions (a 5% conversion rate).

The design team decides it wants to add BPM (beats per minute) to the song list page but isn't sure how to order it relative to the title (e.g. should it be [Artist Title BPM] or [BPM Artist Title]). So they set up an experiment to see which one our users prefer. This is expected to be a small change: it does not alter the sort order, it just adds a little extra information.

The experiment had 10,000 impressions in each arm, with results shown below along with 95% confidence intervals. These are binomial confidence intervals computed naively under the assumption that impressions are independent (this is usually a poor assumption, but for now let us proceed with it):

Treatment                  | Impressions | Sales | Conversion Rate | Delta From Control
[Artist Title] (control)   | 10,000      | 400   | 4.00±0.38%      | -
[Artist Title BPM]         | 10,000      | 500   | 5.00±0.43%      | +1.00±0.57%
[BPM Artist Title]         | 10,000      | 600   | 6.00±0.47%      | +2.00±0.60%
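
To make the arithmetic behind these intervals explicit, here is a minimal Python sketch of the naive calculation (the helper names conv_rate_ci and delta_ci are illustrative, and the normal approximation mirrors the independence assumption stated above):

```python
import math

def conv_rate_ci(sales, impressions, z=1.96):
    """Naive 95% CI half-width for a conversion rate, treating impressions
    as independent Bernoulli trials (normal approximation)."""
    p = sales / impressions
    half_width = z * math.sqrt(p * (1 - p) / impressions)
    return p, half_width

def delta_ci(treatment, control):
    """Difference in conversion rates; half-widths combine in quadrature."""
    (p_t, hw_t), (p_c, hw_c) = treatment, control
    return p_t - p_c, math.sqrt(hw_t**2 + hw_c**2)

control   = conv_rate_ci(400, 10_000)  # (0.0400, ±0.0038)
bpm_last  = conv_rate_ci(500, 10_000)  # (0.0500, ±0.0043)
bpm_first = conv_rate_ci(600, 10_000)  # (0.0600, ±0.0047)

print(delta_ci(bpm_last, control))   # roughly (+0.0100, ±0.0057)
print(delta_ci(bpm_first, control))  # roughly (+0.0200, ±0.0060)
```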

Given just this information, it seems obvious that we should pick "[BPM Artist Title]" going forward and that we can expect roughly 2% more of our impressions to turn into sales. Going from 4% to 6% seems like a big win.

Unfortunately this analysis missed one subtle but very important caveat. Early in our product's life cycle we have a user population that strongly prefers EDM (electronic dance music), to the point that roughly 80% of the 5,000 songs we sell each day are EDM. Given this information, it might seem obvious in retrospect that adding BPM to the song list would lead to more sales (BPM is an important selection parameter for EDM music).

How could more sales be a problem? Putting BPM first in the song list came at the expense of putting artist first, and if we had broken out our user population by EDM listener and non-EDM listener we would have seen something very telling:

EDM users (8,000 impressions per arm):
Treatment                  | Impressions | Sales | Conversion Rate | Delta From Control
[Artist Title] (control)   | 8,000       | 320   | 4.00±0.43%      | -
[Artist Title BPM]         | 8,000       | 440   | 5.50±0.50%      | +1.50±0.66%
[BPM Artist Title]         | 8,000       | 570   | 7.12±0.56%      | +3.12±0.71%


Non-EDM users (2,000 impressions per arm):
Treatment                  | Impressions | Sales | Conversion Rate | Delta From Control
[Artist Title] (control)   | 2,000       | 80    | 4.00±0.86%      | -
[Artist Title BPM]         | 2,000       | 60    | 3.00±0.75%      | -1.00±1.14%
[BPM Artist Title]         | 2,000       | 30    | 1.50±0.53%      | -2.50±1.01%


From this it is clear that we have sacrificed sales from non-EDM users for EDM users. This might be an acceptable trade-off if we have looked at the marketplace and decided to make a niche product for EDM users. But the charts indicate that EDM music makes up only 4% of total music sales (source), which means our product might not appeal to 96% of the market. So by optimizing short-term metrics such as sales volume we might have actually hurt our long-term growth potential.

The underlying principle at play is that your current user base is different from your target user base. This fact will always be true, but the bias is dramatically worse for fledgling products as early growth tends to be in specific pockets of users (often due to viral effects) and not uniformly spread across the planet. Those specific pockets won't behave like the broader population along some dimension (here it is EDM vs non-EDM music preference).


And how to do it right (or at least better)

Continuing with our MP3 product, how can we undo this bias that our non-representative users are injecting?

There are a few ways to de-bias the data to make the experimental results usable. The easiest approach is to identify the segments and reweight them based on the target population distribution.

Since we don't particularly want to build a product optimized for EDM users, we can reweight back to the mean of the broader population: separate the two segments and take a weighted mean of their effects to project them onto the target user population. Here the target population is 96% non-EDM and 4% EDM, so the reweighted conversion rate is
$$
0.04 \times EDMrate + 0.96 \times nonEDMrate.
$$
The confidence intervals must also be adjusted. Since the standard errors add in quadrature, the reweighted standard error is
$$
\sqrt{(0.04 \times EDMrateSE)^2 + (0.96 \times nonEDMrateSE)^2}.
$$
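
As a sanity check on these formulas, here is a small Python sketch that projects the per-segment rates onto the assumed 4%/96% target mix (the helper name reweight is illustrative; the half-widths are the 1.96-standard-error values from the tables above):

```python
import math

TARGET_MIX = {"edm": 0.04, "non_edm": 0.96}  # assumed target population shares

def reweight(rates, half_widths, mix=TARGET_MIX):
    """Weighted mean of per-segment conversion rates, with the CI
    half-widths combined in quadrature."""
    rate = sum(mix[s] * rates[s] for s in mix)
    hw = math.sqrt(sum((mix[s] * half_widths[s]) ** 2 for s in mix))
    return rate, hw

# [BPM Artist Title] arm, per-segment numbers (in %) from the tables above
rate, hw = reweight({"edm": 7.12, "non_edm": 1.50},
                    {"edm": 0.56, "non_edm": 0.53})
print(f"{rate:.2f} ± {hw:.2f}%")  # ~1.72 ± 0.51%, matching the table below
```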

Weighted average conversion rates:
Treatment                  | EDM Conversion Rate | Non-EDM Conversion Rate | Weighted Average
[Artist Title] (control)   | 4.00±0.43%          | 4.00±0.86%              | 4.00±0.83%
[Artist Title BPM]         | 5.50±0.50%          | 3.00±0.75%              | 3.10±0.72%
[BPM Artist Title]         | 7.12±0.56%          | 1.50±0.53%              | 1.72±0.51%


From this it becomes clear that we might not want to add BPM at all, but if we needed the change for some reason other than conversion rate, we should put it after the title.

Also notice the change in confidence intervals for the weighted averages versus the original conversion rates; in the original control group we had ±0.38%, whereas now it is ±0.83%. This large increase reflects the fact that we don't have much data from the "target" user base, so we cannot speak very confidently about its behavior.

This strategy only works if we have the ability to identify EDM users. If, for example, we were optimizing the first interaction with our product, we wouldn't know if a new user was an EDM lover or not since they would not have purchased anything yet.

This early user classification problem goes hand in hand with product personalization. Luckily the user segmentation (e.g. "EDM fans") that we aim to use for experiments can also be useful for personalizing our user interface. For our product this might mean simply asking the user when they sign up what their favorite song is. This can then be used for tailoring the product to the user, but also for weighting experimental analysis.

This example with EDM users is clearly a cartoon. In reality, there will be more than two slices. This reweighting technique generalizes to the case when users fall into a small number of slices. But often there are multiple dimensions whose Cartesian product is large, leading to sparse observations within slices. In this case, we need a propensity-score model to provide the appropriate weight for each user.
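
As a rough illustration of that last idea, one common approach is to train a classifier that distinguishes current users from a sample of the target population and use its predicted odds as importance weights. The sketch below is a generic version of this technique rather than a prescribed model; the feature matrices are hypothetical placeholders, and it assumes scikit-learn is available.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-user feature vectors (e.g. genre affinities, tenure,
# platform). X_current comes from our own logs; X_target is a sample we
# believe represents the target population.
rng = np.random.default_rng(0)
X_current = rng.normal(loc=0.5, size=(1000, 3))  # placeholder data
X_target = rng.normal(loc=0.0, size=(1000, 3))   # placeholder data

# Label 1 = target population, 0 = current user base, then fit a
# propensity model for "looks like a target user" given the features.
X = np.vstack([X_current, X_target])
y = np.concatenate([np.zeros(len(X_current)), np.ones(len(X_target))])
model = LogisticRegression().fit(X, y)

# Importance weight for each current user: the odds p / (1 - p) of
# resembling a target user. Experiment metrics are then computed as
# weighted means over the current users in each arm.
p = model.predict_proba(X_current)[:, 1]
weights = p / (1 - p)
weights /= weights.mean()  # normalize so the weights average to 1
```

Such a weighted analysis is only as good as the features: if the attributes that drive the behavioral difference (here, EDM affinity) are not captured, the weights cannot correct for it.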

Do you even want those users?

The idea that your current users aren't your target users can be taken a step further. For our music example, we imagined that EDM users don't approximate the target population for some experiments. But what if certain users didn't even represent the kind of user we wanted (e.g. their lifetime value was negative)?

One example of this for our music product could be die-hard fans of the American rock band Tool. Tool does not allow any digital sales of their albums, so users coming to our site looking for this band's music will leave with negative sentiment of our product. They may subsequently return any tracks they purchased, leading to an actual cost to our business. These users might further share their experiences with non-Tool fans on social media, causing more damage.

Early in our product's lifecycle, this population of users will contribute to our active user numbers as they explore our product and perhaps even make a few purchases. But since we cannot offer the music they actually came for, they will likely churn.

Gaining more of these users may boost our short-term metrics, but they do not offer long-term, stable revenue and may hurt our ability to attract non-Tool users in the future.

The tech-savvy users' siren call

Hopefully it is now clear that using experiments without understanding how the existing user population differs from the target population is a dangerous exercise.

On top of this idiosyncratic population bias due to uneven population growth rates, there is a more persistent early adopter bias. These early adopters tend to be much more tech-savvy than the general population, trying out new products to be on the cutting edge of technology. This tech-savvy population desires features that can be detrimental to the target population. In our music example, tech-savvy users will want to select the specific bit-rate and sampling frequency of the song they are buying, but forcing our target population through this flow would lead to confusion and decreased conversion rates.

Where the average user would walk away, tech-savvy users may be more willing to look past issues to find value in your product. For example, if we add roadblocks to the purchase flow, the average user will abandon the purchase. In contrast, the tech-savvy user is capable of navigating the complicated process without dropping out or developing negative sentiment. If we assume that these early users represent our target users, we will miss the fact that our product is actually churning our target population.

Unfortunately, since these tech-savvy users often have a larger-than-average social media megaphone, we need to be delicate in how we react to them. In evaluating product changes, we will rarely make trade-offs in their favor at the cost of most users, but we still want the product to work well enough for them that they don't have negative experiences. This might mean burying the bit-rate setting in the fine print: available if absolutely needed, but not distracting to the target users.

Conversions are not independent

Further complicating matters, when products are small they are much more susceptible to error from individual "power" users. In our music product, most users will buy a single song: that one track they heard on the radio. Indeed, that is the premise of our product and how we built the user experience. But every once in a while there will be a user who decides to rebuy his or her entire CD collection on MP3. This wasn't what we intended and our UI doesn't make it easy, but there it is. The behavior of this single user appears in our data as a large number of impressions with conversions.

Imagine that early in our product's lifecycle we have one such user per week who buys 1,000 tracks, even though in a given week we only sell 2,000 tracks total. In other words, this single user represents half our sales. If we run a null A/B experiment (one with no real difference between arms) in which users are randomly assigned to A and B and the collection buyer happens to land in the A arm, the remaining 1,000 sales split roughly evenly and we end up with about 1,500 sales in A and 500 in B. This makes it look as though the A arm performs 3x better than the B arm, even though the arms are identical. As our product grows, it becomes less likely that a single user's behavior will swing aggregate metrics, but this example illustrates why we usually should not assume conversions are independent across impressions. The binomial confidence intervals we computed earlier may greatly underestimate the uncertainty in our inference. It is imperative that we use techniques such as resampling entire users to correct for this kind of user effect. This applies to a product of any size, but it is a greater concern when sample sizes are smaller.
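
One standard way to account for this dependence is a bootstrap that resamples whole users rather than individual impressions, so that all of a user's impressions and sales stay together. Below is a minimal sketch, assuming each user can be summarized as an (impressions, sales) pair; the function name user_bootstrap_ci is hypothetical.

```python
import numpy as np

def user_bootstrap_ci(users, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile CI for the conversion rate, resampling entire users.

    `users` is a sequence of (impressions, sales) pairs, one per user.
    """
    rng = np.random.default_rng(seed)
    users = np.asarray(users, dtype=float)
    n = len(users)
    rates = []
    for _ in range(n_boot):
        sample = users[rng.integers(0, n, size=n)]  # resample users with replacement
        rates.append(sample[:, 1].sum() / sample[:, 0].sum())
    return tuple(np.quantile(rates, [alpha / 2, 1 - alpha / 2]))

# Toy arm: one "collection buyer" alongside many ordinary single-track buyers.
arm = [(1200, 1000)] + [(10, 1)] * 500
print(user_bootstrap_ci(arm))  # far wider than a naive binomial interval
```

Because the heavy buyer is either in or out of each resample (possibly more than once), the resulting interval reflects how much a single user can move the aggregate rate.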

A word on growth hacking

Growth hacking is an emergent field that attempts to optimize product growth, and it often comes up in discussions of a fledgling product's adoption rates. Unfortunately this space has mainly functioned as a way to optimize marketing spend and small product changes under the assumption that a product has already found product-market fit. This mentality does not mesh well with our earlier description of a fledgling product: modern software products do not come onto the market as fixed, immutable things but instead evolve iteratively (and sometimes drastically) to find their niche.

Of particular concern in growth hacking is the focus on influencers for driving growth. Influencers very rarely represent your target user base, and focusing your product features too much on them can leave you with a product that influencers love but your target users don't find compelling (e.g. Twitter). This doesn't mean you shouldn't try to attract them, but you should not design for them at the expense of your target user.

Conclusion

Creating something from nothing is the hardest thing humans do. It takes imagination, execution, and a dose of luck to build a successful product. While barriers to entry for a new product have come down, success is always elusive. This means there will be an increasing number of fledgling products out there trying to make it. In this post, we described several ways in which such products may not be able to leverage experiment methodology to the same extent as established products. Nor does growth hacking provide ready answers. But if there is one thing that I have learned from my experience working on fledgling products it is to be explicit and vigilant about the population for whom the product is built. Those with expertise in large-scale experimentation are typically mindful of evaluation metrics. My experience suggests that to a fledgling product being mindful of the target user population is just as important. Never stop asking, "do the users I want, want this product?"

