Please introduce yourself.
My name is Leemay, pronounced “Lee-mah.” I’m a senior engineering manager at Spotify, and I wrote a book about A/B testing, Practical A/B Testing for The Pragmatic Bookshelf.
What is A/B testing?
A/B testing is a method for evaluating the impact or effectiveness of a change in production on a subset of users. It’s often done on user-facing products to understand whether a new feature, a software architecture, or any change in general is better, worse, or the same as the control.
How do you grow a user-focused environment within a company by incorporating A/B testing?
A/B testing is a catalyst for understanding the impact your ideas have on user engagement. A/B testing helps product and engineering teams think about the impact of their decisions on users. A big part of A/B testing is metrics. There are three types of metrics in particular.
- There are system metrics, so those are engineering-oriented metrics like the response time of your services, latency, and all that good stuff that enables us to understand how engineering systems are performing.
- Then there are business metrics, like revenue.
- And, then there are user metrics, like engagement. Are they clicking? Are they scrolling a lot? Is it taking a lot of time for them to find the feature or the thing that they came to your product to do?
A/B testing helps engineering and product organizations become more aware of how design, product, and engineering decisions influence the user experience.
So, it seems that A/B testing is centered around making choices and more particularly making informed choices.
How does that differ from simply consuming metrics and looking for weak spots?
Let’s say you have a risky change, or you have a new feature and you’re unsure of its validity or how effective it will be. Or perhaps your company is very risk averse and doesn’t want to launch changes that could really disrupt how a user engages with the product. If that’s the case, let A/B testing be your safety net.
Instead of launching that product feature and then doing data collection after the fact and then realizing, “Oh no, we made a mistake,” and then rolling the feature back, A/B testing enables you to evaluate that risky change or that risky idea on a subset of users. You get data on the impact of a change before you increase the blast radius or increase the stakes and enable it for all your users.
What are some ways where stakes might be very expensive, that would demand a test group, an A/B test group, to be added into the process?
Let’s think of examples of risky changes where the stakes are high. Maybe you’re completely changing the purchase process on your application, the process in which a user puts something into the basket to eventually buy it. Let’s say you introduce a new step there. You’d want to evaluate that, because it could make the process less seamless and keep users from buying the product. Or maybe the process is actually easier for the user and therefore results in increased purchases. You’d want to understand which outcome is indeed the case by evaluating this in the scope of an A/B test.
When I think A/B testing, I usually think about rolling out new features or changing design. It sounds like A/B testing has a much broader scope than just those things. Can you talk about where the most common places are that A/B testing is currently used and where it should be used and where it’s underutilized?
That’s a good question. I’d say the most common area that teams use A/B testing is on the product side where the work directly influences the user’s experience. That’s the obvious. Exactly what you said, teams build a new feature, they want to evaluate the effectiveness of that feature by enabling it to a subset of users, then measuring the effect using metrics, and comparing to the control group who has not received the new feature.
The places where A/B testing is not as frequently used but should be are definitely cases where you can use A/B testing to evaluate engineering architectures.
For example, let’s say you rearchitect a complex part of your system, and that system is invoked on the user request path or in one way or another influences the user experience. You can use A/B testing to understand whether those system design changes, if they make things slower, mean users will use the product less. Past studies suggest that latency in loading or scrolling has a direct correlation with user metrics. Most users expect a product to load quickly, and if you roll out a new architecture that degrades the user experience, you’d want to know sooner rather than later by evaluating it in the scope of an experiment.
How many users do you need? How important is it to find the right sample size?
Determining the right sample size and test duration is key. There are statistics that go into defining the duration of a test and the number of users allocated to your test and control variants, to ensure the test results are not just chance but actually statistically significant.
The reason sample size and duration are important is because you want to ensure the experiment’s outcome is not by random chance but rather suggestive of the impact in a real-world product setting when the feature has been enabled for all your users.
So, to answer your question in a few words: understanding these nuances is important to ensure your test configuration is valid so you can really have confidence in your test results. Partnering with your data scientists can help a lot with the statistical aspects of A/B testing. The more you A/B test, the more you realize that data scientists will be your best friends.
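To make the statistics a little more concrete, here’s a minimal Python sketch using the common normal-approximation formula for a two-sided, two-proportion test. The function name and defaults are illustrative; in practice you’d lean on your data scientists or a power-analysis library rather than this back-of-the-envelope version.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate users needed per variant to detect a shift in a
    conversion rate, via the normal approximation for a two-sided
    two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

# Detecting a lift from a 5% to a 5.5% conversion rate takes tens of
# thousands of users per variant; a bigger expected effect needs far fewer.
small_effect = sample_size_per_variant(0.05, 0.055)
large_effect = sample_size_per_variant(0.05, 0.06)
```

Note how quickly the required sample size shrinks as the expected effect grows; that’s one reason subtle tweaks demand long-running, high-traffic tests.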
What are the costs involved? Servers and employees and money and such?
The cost of A/B testing is definitely time and energy. It does take time to set up an experiment. If you’re introducing A/B testing at your company, you’re adding a key step to your product development life cycle, one that did not exist before you built an A/B testing platform. You gain the leverage to say this feature is not going to production until we have data insights on its impact from running an experiment.
So obviously time is a factor. Your team will spend time ensuring the A/B test is set up correctly, monitoring it throughout its duration, analyzing data after the test is complete, and so forth.
There’s also a cost from an engineering perspective: you have to ensure your code is written in such a way that the feature is only available to that subset of users, and if you end up not launching the feature, you’ll have to remove that code from your code base. Again, that’s time and energy.
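For illustration, one common way to implement that subset gating is deterministic bucketing by hashing the user ID, sketched below in Python. All names here are hypothetical, and real experimentation platforms layer much more on top (exposure logging, mutual exclusion, ramp-up controls).

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_pct: float = 0.5) -> str:
    """Deterministically bucket a user: hashing (experiment, user_id) gives
    the same answer on every request, with no assignment state to store."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = (int(digest, 16) % 10_000) / 10_000  # roughly uniform in [0, 1)
    return "treatment" if bucket < treatment_pct else "control"

def checkout_flow(user_id: str) -> str:
    # The gate in product code: the branch you have to delete later
    # if the feature doesn't ship.
    if assign_variant(user_id, "new-checkout-step") == "treatment":
        return "new checkout"  # hypothetical new flow
    return "old checkout"
```

Because assignment is a pure function of the user and experiment IDs, a user sees the same variant on every visit, which keeps the experience consistent and the metrics clean.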
So the cost element is definitely there. And I think the cost element is one reason why a lot of teams don’t incorporate A/B testing into their product development life cycle. This evaluation step, although very important, does delay getting a new feature in front of your users, but it’s better to delay than to launch a feature that doesn’t provide the impact you think it will. It’s a cost you have to be willing to pay.
There’s often a feeling that once you ship, you are married to a feature…that this is a promise or a contract to the consumers. How does that tie in to the A/B step in terms of cost benefit analysis?
There definitely is a sentiment in some engineering or product organizations that when you ship a feature there’s not only an emotional attachment to it but also that it must exist in production forever. But I think the more you A/B test, the less you get attached to these ideas and these features and the more you get attached to impact. It’s OK to deprecate a feature if you realize the value of that feature is low, right? Why would you want a feature that provides little impact from a metrics perspective to be in the product?
Once you start framing your work so that impact and value add is greater than simply delivering new features, the less attached you will get to whether that feature is in production forever. I think you start to build a muscle where you’re like, it doesn’t matter, it didn’t work, let’s just try something new.
How do you plan for measuring metrics in your design of the A/B experience?
A/B tests have two types of metrics. First, your success metrics, which could be clickthrough rate, revenue, or consumption of a specific type of content on your product. Second, your test will have guardrail metrics.
Guardrail metrics can be engineering system–oriented metrics like app crashes, GCP, or AWS cost to support the feature. It’s essentially any metric that is holistic to your product or engineering systems that you don’t want to degrade. If they do degrade, given some predefined margin usually, then the feature shouldn’t be shipped, even if the success metrics are improving. It’s about trade-offs and where you’re willing to give a little in favor of improvements in other areas.
Let’s say your success metric is revenue and the outcome of the experiment suggests it has increased by x percent, but the trade-off is that the feature is now double the cost to support in production from an engineering perspective. Said otherwise, the feature, although it’s improving revenue, is actually very costly from an AWS perspective, which was your guardrail metric. With this data insight, the feature may not be worth shipping, given the heavy engineering cost outweighs the revenue gains.
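That trade-off can be made mechanical. Here’s a hypothetical ship/hold rule in Python with predefined guardrail margins; the rule, names, and numbers are purely illustrative, not a standard policy.

```python
def ship_decision(success_lift_pct, guardrails):
    """Hypothetical ship/hold rule. `guardrails` maps a metric name to
    (observed_change_pct, allowed_margin_pct, higher_is_worse). Hold the
    feature if any guardrail moves in its bad direction beyond its
    predefined margin, even when the success metric improved."""
    for name, (change, margin, higher_is_worse) in guardrails.items():
        harm = change if higher_is_worse else -change
        if harm > margin:
            return f"hold: guardrail '{name}' breached"
    return "ship" if success_lift_pct > 0 else "hold: no success-metric lift"

# Revenue is up 4%, but AWS cost doubled against a 20% allowed margin.
decision = ship_decision(4.0, {"aws_cost_pct": (100.0, 20.0, True)})
```

Writing the margins down before the test starts is the point: it keeps the ship decision from being re-litigated after the results are in.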
Taco Bell just released a new kind of food, it’s got the shell of their apple pies and inside is chicken and cheese.
Is this real?
This is real. This is real and it’s in Knoxville.
And the question is, what are the benefits and the costs of choosing particular countries or locations to limit your A/B test? How can that skew your results or how can that benefit by creating a sense of urgency and jealousy that, hey, they got to test it but nobody else did?
Well, it’s exactly as you described. One of the possible downsides to only evaluating changes in specific countries or localities is that there might be biases toward the new feature. One part of the country may have an affinity for a specific type of food, to use the Taco Bell analogy, that differs from another part of the country, or another country entirely. You may not be able to say for certain that regionally specific metrics translate to every location.
There may be biases, there may be different tastes or different engagement behaviors. So that’s one downside. I’d say probably that’s the biggest downside. But then the pros are that you start to really understand your users and you start to really understand your users per location.
There’s also a concept known as the Hawthorne effect, which, if I remember correctly, is when users are aware they’re in an experiment and change their habits as a result. Time is one way to combat this: is the engagement a novelty effect, or will it continue on a longer timeline?
But yeah, Taco Bell, that’s a great example of testing changes in specific regions. Kind of like market rollouts in software. Have you tried out this new…I don’t even know…taco?
I live 2,000 miles away from Nashville.
Got you. That’s quite far. Okay, so you have not.
It’s a little hard to get to the corner. How does user awareness play into the design of an A/B test? You mentioned that they may act differently because of the novelty effect. How do you plan for awareness? How do you work around awareness? How big can changes be while you still have valid testing?
I think the biggest thing here is if novelty effect is a factor, then you need to ensure that you’re running your test long enough so that the novelty effect wears off and that you get a better baseline understanding of the impact this feature has on business, product, and user metrics.
I’m making these numbers up, but let’s say a new feature has been introduced and there is some fear of a novelty effect occurring with your users, which will affect their engagement. Let’s say you run it for four weeks and you work closely with your data scientists to get an understanding, okay, after the first week we’ll say that’s the novelty effect. We see that spike in metrics. Let’s see if that’s sustained after week two, three, four, or if it dips, and then how much does it dip by?
I think that the most ideal approach is calibrating how long you run your test for and considering how much of that time was a novelty effect and how much of that time was, “Now this is a normal feature for them, the novelty has worn off.” But you do have to work closely with your data scientists to understand that and use your research partners.
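As a toy illustration of that calibration, here’s a Python sketch that flags a suspected novelty effect when the first week’s lift sits well above the steady state of later weeks. The threshold and numbers are made up for illustration; a data scientist would do this far more rigorously.

```python
def novelty_check(weekly_lifts_pct, tolerance_pct=1.0):
    """Compare the first week's metric lift to the average of later weeks;
    a big gap hints the early spike was novelty, not durable impact."""
    first, rest = weekly_lifts_pct[0], weekly_lifts_pct[1:]
    steady = sum(rest) / len(rest)
    return {"week1": first,
            "steady_state": steady,
            "novelty_suspected": first - steady > tolerance_pct}

# Hypothetical four-week test: an 8% spike in week one settling to ~3%.
result = novelty_check([8.0, 3.1, 2.9, 3.0])
```

In that hypothetical, the durable impact you’d plan around is the ~3% steady state, not the week-one spike.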
When you’re developing your test, do you start by creating user stories of how a user might use this feature? If so, how does that drive the development of your plan: creating your hypotheses and specifying the outcomes you’re looking for?
I think definitely good products are backed by two things: qualitative and quantitative insights. Qualitative can be derived from user research studies. This is where you can start to really work with your user researcher and your product peer together to understand those user stories and pinpoint the bottlenecks, the problem areas, the pain points for users, which would all, ideally, influence the features that are built.
Then, when you have a sense of what features you want to build, that’s when you start to scope the work to design your A/B test. It’s a process. Those two pieces, offline user research and online experiments, build great products when they work in tandem.
It’s like a partnership. You want this offline understanding with qualitative metrics to influence, to direct you where you should be putting your time toward and what experiments are worth evaluating, just because, like we spoke about earlier, A/B testing is expensive.
There is a cost to A/B testing, and the more information you can gather before you run an online experiment, the better. By partnering closely with your user researchers and conducting studies that inform your product roadmap, you’re doing your best to ensure you’re spending engineering, design, and product time on the right initiatives.
By incorporating this experimentation methodology in your product development process, you build a culture that really embraces A/B testing when you’re okay to be wrong. There’s less fear of being wrong and more fear of deploying something to production that’s not good for the user experience.
“You build a culture that really embraces A/B testing when you’re okay to be wrong.”
How do you build metrics both to test a hypothesis and to extend those metrics to catch anything else that might affect your experiment?
The thing about A/B testing is it tells you what happened, but it doesn’t tell you why it happened. This is a classic statement made by data scientists. If you want to understand why the outcome was what it was beyond your success and guardrail metrics, you have to do ad hoc data analysis.
Doing post-analysis can enable you to derive the effect on specific user groups, or on specific days of the week or times of the month. Post-analysis, I think, plays a big role in A/B testing.
For this to be feasible, you have to make sure that your data infrastructure is adequately set up for ad hoc analysis. That’s a key step in getting everything you can out of A/B testing. You need to be able to query raw datasets, join engagement metrics with user metadata to provide that richer context and so forth.
You don’t just want to listen to your metrics, but you also want to understand why and what really happened by digging in a little bit deeper. You could answer questions like, What was the impact of this new feature on users who had iPhone versus Android? Or the effect on users in specific regions or demographics?
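As a toy sketch of that kind of ad hoc query, here’s Python with an in-memory SQLite database joining engagement metrics to user metadata and segmenting by platform and variant. The schema and rows are invented purely for illustration; real experiment data lives in a warehouse, not SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id TEXT, platform TEXT, variant TEXT);
    CREATE TABLE engagement (user_id TEXT, clicks INTEGER);
    INSERT INTO users VALUES ('u1','iphone','treatment'), ('u2','android','treatment'),
                             ('u3','iphone','control'),   ('u4','android','control');
    INSERT INTO engagement VALUES ('u1', 12), ('u2', 3), ('u3', 7), ('u4', 4);
""")

# Join raw engagement with user metadata, then slice by platform and variant.
rows = conn.execute("""
    SELECT u.platform, u.variant, AVG(e.clicks) AS avg_clicks
    FROM engagement e JOIN users u ON u.user_id = e.user_id
    GROUP BY u.platform, u.variant
    ORDER BY u.platform, u.variant
""").fetchall()
```

This is the kind of slicing that answers “iPhone versus Android” or per-region questions once the test is done, provided the raw datasets are queryable in the first place.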
What are your thoughts about A/B testing in a two-stage deployment? First with internal testing and deployment (“dogfooding”), and then through external testing?
I think that’s a key step, but you have to take it with a grain of salt. It’s pretty common before you roll out an A/B test that you incorporate your own team members, employees within the company into the experiment, but you do so to catch bugs.
By including employees, you can get early insights and catch bugs, but be mindful of how much weight you put on these data points, because there is a clear bias. If you’re engaging with a product that you built, you’re engaging with it in a different mindset than real users. I think it’s important that you include yourself in A/B tests to make sure the feature works as expected, not to derive metric insights.
What is a guardrail metric and how do you develop them?
A guardrail metric is a metric that you want to monitor but not necessarily optimize toward by introducing this new feature. So the way you should think about your guardrail metrics is: what are the things we want to keep an eye on, even though this new feature isn’t built to optimize toward them?
Whether an increase is good or bad really depends on the metric. Let’s say your guardrail metric is mobile app crashes: you don’t want that to increase. If it decreases, great, that’s awesome! The goal is to incorporate that performance engineering metric as a guardrail metric to make sure you don’t harm it by introducing this new feature.
Another example of a guardrail metric could be revenue. Let’s say you introduce a feature that wasn’t necessarily implemented to increase revenue. You should still include revenue as a guardrail metric because you want to make sure you don’t do harm to that metric as it’s a critical metric for your company.
Guardrail metrics play two roles. First, to measure whether you’re doing harm or degrading a metric. And second, to tell you whether you’re positively influencing a metric that you didn’t expect you would.
Let’s say you have an A/B test out there and it’s deployed, and there’s a certain segment, and everybody who’s using this feature is just in love with it and going on and on about it. However, your guardrail metrics tell you that, oops, this is influencing other things that are critical to business operations. And then you stop the A/B test early and create what is essentially an experience degradation. How do you deal with that?
It’s great that you evaluated the change on just a subset of users and did it in an A/B test versus simply launching it straight to production for all your users to engage with. Catching a degradation early on just a subset of users should be considered a win.
When you launch changes that aren’t performing as expected, you learn from them. And that’s a good thing! It’s better to learn than to be unaware of the impact.
There is no such thing as a failed experiment. The only failure is not learning. And one of the beauties with A/B testing is ideally you’re learning. You’re learning about the impact your changes have on your users, your product, your business, what works, what doesn’t work, and that should influence future decisions.
“The only failure is not learning.”
How can people stay in touch with you?
Folks can find me on Twitter, I’m @LeemayNassery. I have an experimentation Substack where I talk more about A/B testing. That’s experimenting.substack.com.