
The ultimate guide to A/B testing | Ronny Kohavi (Airbnb, Microsoft, Amazon)


July 27, 2023

TLDR: Ronny Kohavi explains how to foster a culture of experimentation and why trust is critical to it, discussing common pitfalls, how to interpret experiment results, and concepts such as A/B testing, the overall evaluation criterion (OEC), p-values, and statistics on the impact of tests.

  • The Importance of A/B Testing and Experimentation: Conducting experiments and A/B testing is crucial for product companies to make informed decisions, as even small changes can have unexpected impacts. It is important to allocate time to high-risk ideas and be prepared for failures. Trust and an experiment-driven culture are essential for successful experiments.

    Conducting experiments and A/B testing is crucial for product companies to drive growth and make informed decisions. Ronny Kohavi emphasizes the importance of testing every code change and new feature, as even small changes can have unexpected impacts. He recommends allocating time to high-risk, high-reward ideas, understanding that most experiments will fail. Ronny advises being ready to fail 80% of the time when pursuing significant innovations. He also highlights the significance of trust in successful experiments, along with the importance of creating an experiment-driven culture within a company. This conversation highlights the practicality of running experiments and the potential for surprising results that can lead to breakthroughs.

  • Uncovering Revenue-Boosting Changes Through Experiments: Small changes can have a big impact on revenue. Conducting experiments and not underestimating seemingly trivial changes can lead to significant long-term benefits without compromising user experience.

    A small, seemingly insignificant change can have a substantial impact on revenue without compromising the user experience. In the case discussed, shifting the order of two lines in the search results increased Bing's revenue by around 12%, amounting to $100 million. This change did not negatively affect user metrics, unlike superficial methods like displaying more ads. It highlights the importance of conducting experiments to uncover unexpected results and learn from them. Additionally, the conversation emphasizes the significance of not underestimating the value of seemingly trivial changes and the need to prioritize and remember successful experiments to capitalize on their potential long-term benefits.

  • The Power of Small Improvements and Constant Experimentation: Small gains in metrics and experiments can lead to significant improvements in revenue, but it requires constant experimentation and evaluation to find the few ideas that truly make a difference.

    Small improvements can have a significant impact. Both Ronny Kohavi and Lenny discuss examples where small gains in metrics and experiments led to substantial improvements in revenue. Ronny mentions that at Bing, the relevance team's goal was to improve their metric by just 2% each year, a target met by adding up many small gains from individual experiments. Similarly, at Airbnb, running 250 experiments resulted in a 6% increase in revenue. However, it's important to note that the vast majority of experiments (92%) fail to improve the intended metric. So, while small improvements can be powerful, they require constant experimentation and evaluation to find those few ideas that truly make a difference.

  • The importance of institutional memory and documentation for organizational learning and improvement: Embrace experimentation, document surprises, maintain a comprehensive database, and stay data-driven to drive innovation and make informed decisions in an organization.

    Institutional memory and documentation are crucial for organizational learning and improvement. Ronny Kohavi emphasizes the importance of summarizing the learnings from experiments and conducting regular meetings to discuss the most surprising experiments, whether they were successful or not. Surprising experiments, where the expected outcome differs significantly from the actual result, provide valuable insights and opportunities for learning. It is essential to document these surprises and remember them when designing future iterations or making decisions. Kohavi suggests maintaining a comprehensive database of successes and failures and enabling keyword search to easily retrieve experiment history. By actively embracing experimentation and staying data-driven, organizations can drive innovation and make informed decisions, even for seemingly small changes or bug fixes.

  • Balancing Incremental Changes and High-Risk Ideas in Experimentation: Experimentation is essential for success, but it requires finding a balance between small changes and risky ideas. Learning from failures and using controlled experiments can determine if users benefit from an idea. A/B testing is valuable in the software industry. Startups should implement it once they have enough users and a supportive platform.

    Experimentation is crucial for success, but it's important to strike a balance between incremental changes and high-risk, high-reward ideas. Ronny Kohavi emphasizes the need for a portfolio of experiments, some of which may lead to significant breakthroughs, while others may fail. It's important to allocate efforts to both types of experiments and be prepared for a high failure rate, especially when attempting big ideas. The conversation highlights the importance of learning from failures and using controlled experiments as the ultimate oracle to determine if users are benefiting from a particular idea. While A/B testing may not be suitable for all domains, it is highly valuable in the software industry, especially when a mature platform with low incremental costs is in place. Startups should consider implementing A/B testing once they have enough users and a platform that supports it.

  • The Importance of a Clear Overall Evaluation Criterion for Experimentation and A/B Testing in Startups: When conducting experimentation and A/B testing, startups should consider a clear overall evaluation criterion that balances revenue growth with user experience and long-term factors such as user satisfaction.

    Experimentation and A/B testing can be valuable tools for startups, but it's important to have a clear overall evaluation criterion (OEC) in place. Simply optimizing for revenue is not enough, as it can lead to actions that harm the user experience in the long run. For example, adding more ads to a search page may increase revenue initially, but it can negatively impact user satisfaction and result in increased churn. The OEC should consider various metrics, such as time to successful result and percentage of successful sessions, to strike a balance between revenue growth and user experience. It's also crucial to consider long-term factors, such as user satisfaction after a purchase or stay.

  • The importance of long-term value and metrics in accurate predictions and decisions: Considering the long-term value of users and incorporating metrics such as retention rates and time to achieve a task is crucial for making accurate predictions and decisions.

    In order to make accurate predictions and decisions, it is crucial to consider the long-term value of users and the countervailing metrics associated with a particular action. Ronny Kohavi emphasizes the importance of defining the OEC (Overall Evaluation Criterion) in a way that causally predicts the lifetime value of the user. By incorporating metrics such as retention rates and time to achieve a task, the OEC becomes more useful in driving long-term success. Moreover, Ronny suggests two approaches for understanding long-term metrics: running long-term experiments to learn and building models based on historical data and background knowledge. This conversation highlights the significance of considering both short-term and long-term impacts when making strategic decisions.

  • The importance of cautious and iterative redesigns: By implementing smaller changes and learning from each iteration, teams can avoid negative outcomes and find successful alterations. A data-driven approach is crucial in recognizing the value of incremental changes.

    It is crucial to approach redesigns and large-scale changes in a cautious and iterative manner. Both Ronny Kohavi and Lenny highlight the negative consequences of full redesigns, emphasizing that they often lead to negative outcomes and require significant effort to rectify. Instead, they advocate for incrementally testing and adjusting changes along the way. By implementing smaller changes and learning from each iteration, teams can identify the ideas that actually work and avoid the negative impact of unsuccessful alterations. Additionally, they stress the importance of being open to failure and adopting a data-driven approach to decision-making. Running experiments and analyzing results can help organizations recognize the value of incremental changes and overcome the resistance to them.

  • Striking a Balance in Product and Process Redesign: Allocating resources to both known optimizations and high-risk endeavors, prioritizing experiments, and understanding project purpose and impact can optimize processes effectively.

    When considering redesigning a product or process, it is important to strike a balance between taking big bets and iterating towards improvement. While completely redesigning something may offer the potential for breakthrough success, it is crucial to recognize that 80% of the time such attempts fail. Allocating resources to both known optimizations and high-risk, high-reward endeavors is a wise approach. This rule of thumb applies to many organizations and is evident in the allocation of resources at Google. Additionally, it is essential to prioritize running experiments and avoiding the shipment of features that do not provide value or even have a negative impact. By understanding the purpose and potential impact of each project, organizations can make informed decisions and optimize their processes effectively.

  • The Importance of a Data-Driven Approach for Airbnb's Success: Companies like Airbnb need to prioritize data-driven decision-making, including conducting controlled experiments, to ensure long-term growth and success, especially during uncertain times like the COVID-19 pandemic.

    Airbnb's shift towards a more top-down, vision-oriented approach may have hindered its potential for success. While Brian Chesky focused on design, the search team, which was responsible for neural networks and search algorithms, relied heavily on A/B testing before launching anything. The absence of controlled experiments in other teams, coupled with the departure of data-driven advocates like Greg Greeley, may have impacted Airbnb's overall performance. Ronny Kohavi also emphasizes the importance of continuing to experiment during the COVID-19 pandemic, as experimentation enables companies to make informed decisions even during uncertain times. In retrospect, these insights highlight the importance of maintaining a data-driven approach and conducting controlled experiments for the long-term growth and success of a company like Airbnb.

  • Importance of Trust in Experimentation Platform and Results Analysis: Trust is crucial in an experimentation platform as it provides a safety net for aborting and promoting safe deployments. Trustworthy results are essential for informed decision-making, highlighting the need for organizational trust in the experiment's scientific nature and control.

    Trust is crucial in running experiments and building an experimentation platform. Ronny Kohavi emphasized the importance of trust in two aspects. Firstly, he mentioned that the experimentation platform serves as a safety net, allowing quick aborts when something goes wrong, promoting safe deployments and velocity. Secondly, the platform provides trustworthy results at the end of an experiment, analyzing key metrics and supporting debugging. Kohavi highlighted the need for organizational trust in the experiment's scientific nature and control. He cautioned against using real-time p-value monitoring, which could lead to inflated error rates and false positive results, damaging trust in the platform. The conversation serves as a reminder that accurate and reliable experimentation is paramount to making informed decisions.
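
    The warning about real-time p-value monitoring can be illustrated with a small simulation. The sketch below is an assumption-laden illustration, not the platform's actual method: it runs A/A tests (no true effect), uses a simple z-test with known unit variance, and compares declaring a win at any interim look against checking only at the planned end. The function name and all parameters are hypothetical.

```python
import math
import random


def simulate_peeking(num_tests=500, users_per_arm=500, look_every=50,
                     z_crit=1.96, seed=0):
    """Run A/A tests (no true effect) and compare two stopping rules.

    Returns (peeking_rate, fixed_rate): the share of tests declared
    significant at ANY interim look versus only at the planned end.
    Assumes users_per_arm is a multiple of look_every.
    """
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(num_tests):
        diff_sum = 0.0  # running sum of (treatment - control)
        crossed_at_any_look = False
        z = 0.0
        for n in range(1, users_per_arm + 1):
            # Both arms draw from the same distribution: the null is true.
            diff_sum += rng.gauss(0, 1) - rng.gauss(0, 1)
            if n % look_every == 0:
                # z-statistic for a difference in means, known unit variance.
                z = diff_sum / math.sqrt(2 * n)
                if abs(z) > z_crit:
                    crossed_at_any_look = True
        peeking_hits += crossed_at_any_look
        fixed_hits += abs(z) > z_crit  # z now holds the final look
    return peeking_hits / num_tests, fixed_hits / num_tests


peeking_rate, fixed_rate = simulate_peeking()
print(f"false positives when peeking: {peeking_rate:.1%}")
print(f"false positives at fixed end: {fixed_rate:.1%}")
```

    Because the final look is one of the interim looks, the peeking rate can never be lower than the fixed-horizon rate; with ten looks it is typically several times the nominal 5%.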

  • Addressing Sample Ratio Mismatch in Experiments: Sample ratio mismatch in experiments can indicate a problem with the experiment. By using a formula/spreadsheet, one can calculate the probability of mismatch occurring by chance and take necessary actions to address it.

    A common issue when running experiments is sample ratio mismatch, which occurs when the distribution of users between control and treatment groups is not as designed. This can be a red flag that something is wrong with the experiment. By using a formula or spreadsheet, one can determine the probability of such a mismatch occurring by chance. It was found that approximately 8% of experiments at Microsoft suffered from this issue. Bots and problems with the data pipeline are often the causes of sample ratio mismatch. To address this, a warning banner was initially added, but people ignored it. Eventually, a compromise was reached by highlighting the numbers in the scorecard with a red line to signal a sample ratio mismatch.
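
    The summary does not say which formula or spreadsheet was used. One common way to compute the probability of a mismatch occurring by chance is a chi-squared goodness-of-fit test on the two group sizes, sketched below. The function name, the example counts, and the 0.001 alert threshold are assumptions for illustration.

```python
import math


def srm_p_value(control_users, treatment_users, expected_control_share=0.5):
    """Probability of a split at least this uneven if assignment is as designed.

    Uses a chi-squared test with one degree of freedom; for one degree of
    freedom the tail probability equals erfc(sqrt(chi2 / 2)).
    """
    total = control_users + treatment_users
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((control_users - expected_control) ** 2 / expected_control
            + (treatment_users - expected_treatment) ** 2 / expected_treatment)
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical counts for an experiment designed as a 50/50 split.
p = srm_p_value(50_500, 49_500)
if p < 0.001:  # assumed alert threshold
    print(f"possible sample ratio mismatch (p = {p:.4f})")
else:
    print(f"split looks consistent with the design (p = {p:.4f})")
```

    A very small p-value says the observed split is unlikely under the designed ratio, so the scorecard should be treated with suspicion until the cause (for example bots or a data pipeline problem) is found.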

  • The Importance of Being Skeptical and Investigative in Data Interpretation: Don't be fooled by impressive results. Investigate and analyze the data thoroughly before drawing conclusions, as they may contain flaws or be misleading. Be cautious of relying solely on p-values for statistical significance.

    We should be cautious when interpreting results that appear too good to be true. Ronny Kohavi explains that people have a natural bias towards wanting to see success, which can lead them to overlook flaws in the data. Twyman's law, named after Tony Twyman, a radio and television audience researcher, states that figures that look interesting or different are usually wrong. This emphasizes the need to investigate and not immediately celebrate extraordinary results, as there is a high probability of finding flaws in the experiment. Additionally, Ronny highlights a common misconception about p-values and advises against relying solely on them. The false positive risk, which tends to be much higher than commonly thought, should be considered when interpreting statistical significance.
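
    The gap between a p-value threshold and the false positive risk can be made concrete with Bayes' rule. The sketch below is a minimal illustration under stated assumptions: a one-sided significance level of 0.025 (only wins in the intended direction count), 80% power, and a prior success rate supplied by the caller. The function name and default values are hypothetical, not taken from the episode.

```python
def false_positive_risk(success_rate, alpha=0.025, power=0.8):
    """Probability that a statistically significant win is not a real effect.

    success_rate: assumed share of tested ideas that truly improve the metric.
    alpha: assumed chance a no-effect idea shows a significant win.
    power: assumed chance a real improvement is detected.
    """
    false_wins = alpha * (1 - success_rate)
    true_wins = power * success_rate
    return false_wins / (false_wins + true_wins)


# If only 8% of ideas truly work (92% fail), a significant win is wrong
# far more often than the significance level alone suggests.
print(f"{false_positive_risk(0.08):.1%}")
```

    The lower the prior success rate, the larger the share of significant results that are false positives, which is why a high failure rate makes skepticism about any single win more important.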

  • The Power and Benefits of Experimentation and A/B Testing: Implementing experimentation and A/B testing can lead to success in a company, even with some uncertainty. Lowering the p-value threshold and consulting experts can increase success, and starting with a focused team can shift the culture towards embracing experimentation.

    Implementing experimentation and A/B testing in a company can be highly valuable and lead to success. Although data scientists are aware that experiments are not perfect and there is some uncertainty, launching positive experiments can still have a positive impact. It's okay to be occasionally wrong as long as the overall balance is in favor of successful experiments. Lowering the p-value threshold and implementing replication can increase success and reduce the false positive rate. Additionally, keeping track of experiment failure rates and consulting internal experts can help in starting experiments. If resistance to experimentation exists, starting with a team or department that frequently launches and has a clear optimization goal can help shift the culture towards embracing experimentation. The success of experimentation in one area, like Bing, can influence and inspire other teams in the company. Using third-party experimentation platforms is also a viable option today.
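
    The summary does not say how results from a replication are combined with the original run. One standard option, shown here as an assumption rather than as the method discussed in the episode, is Fisher's method for combining p-values from independent experiments. The function name is hypothetical.

```python
import math


def fisher_combined_p(p_values):
    """Combine p-values from independent experiments with Fisher's method.

    The statistic -2 * sum(ln p) follows a chi-squared distribution with
    2k degrees of freedom under the null, whose tail has the closed form
    used below. Each p-value must be greater than 0 and at most 1.
    """
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)
    return math.exp(-half_stat) * sum(
        half_stat ** i / math.factorial(i) for i in range(k)
    )


# An original run and a replication that each reach p = 0.05.
print(round(fisher_combined_p([0.05, 0.05]), 4))
```

    Two borderline results that agree give stronger combined evidence than either alone, while a replication with a large p-value pulls the combined value back up.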

  • The Importance of Building an Experimentation Platform for Effective Decision-Making: Building a platform for running experiments is crucial for effective decision-making. It should focus on self-service, reducing costs, strong analysis capabilities, trust, speed, and variance reduction techniques.

    Building a platform for running experiments is crucial for effective and efficient experimentation. Ronny Kohavi emphasizes the importance of self-service and reducing the marginal cost of experiments to zero. By providing a platform that allows users to easily set up, run, and analyze experiments, organizations can streamline the experimentation process. Kohavi also highlights the need for strong analysis capabilities within the platform to avoid the reliance on data scientists. Trust is vital in running experiments, but speed is also important. Kohavi recommends having a scorecard soon after the experiment finishes and utilizing variance reduction techniques, such as capping metrics and using pre-experiment data to adjust results. Overall, the conversation emphasizes the value of building a comprehensive experimentation platform to drive effective decision-making.
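
    The two variance reduction techniques mentioned above can be sketched briefly. Capping limits the influence of extreme values, and the pre-experiment adjustment shown here follows the approach commonly called CUPED, which is an assumption about the technique meant, since the summary does not name it. The helper names and the synthetic data are hypothetical; in practice the adjustment coefficient is estimated on control and treatment users pooled together.

```python
import random
import statistics


def cap_metric(values, cap):
    """Limit the influence of extreme users by capping each value."""
    return [min(v, cap) for v in values]


def adjust_with_pre_data(metric, pre_metric):
    """Subtract the part of each user's metric predicted by pre-experiment data."""
    mean_pre = statistics.fmean(pre_metric)
    mean_post = statistics.fmean(metric)
    cov = sum((x - mean_pre) * (y - mean_post)
              for x, y in zip(pre_metric, metric))
    var = sum((x - mean_pre) ** 2 for x in pre_metric)
    theta = cov / var  # regression slope of metric on pre-experiment metric
    return [y - theta * (x - mean_pre) for x, y in zip(pre_metric, metric)]


# Synthetic users whose in-experiment metric tracks their earlier behavior.
rng = random.Random(0)
pre = [rng.gauss(10, 3) for _ in range(2000)]
post = [x + rng.gauss(0, 1) for x in pre]
adjusted = adjust_with_pre_data(post, pre)

print("variance before adjustment:", round(statistics.variance(post), 2))
print("variance after adjustment: ", round(statistics.variance(adjusted), 2))
```

    The adjustment leaves the mean unchanged but removes variance explained by prior behavior, so the same experiment can detect smaller effects or reach a conclusion sooner.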

  • Recommended Books and TV Series by Ronny Kohavi: Ronny Kohavi highly recommends books like "Calling Bullshit," "Hard Facts, Dangerous Half-Truths And Total Nonsense," and "Mistakes Were Made (But Not by Me)" for insightful perspectives and challenging commonly held beliefs. He also praises the TV series "Chernobyl" and emphasizes the impact of using structured narratives in product development.

    There are several books and a TV series that Ronny Kohavi highly recommends. One book called "Calling Bullshit" provides insightful perspectives on extreme claims and encourages skepticism. Another book, "Hard Facts, Dangerous Half-Truths And Total Nonsense," challenges commonly held beliefs, showing that many things we consider well-understood may lack justification. The book "Mistakes Were Made (But Not by Me)" explores the fallacies we often succumb to, leading to humbling outcomes. Additionally, the TV series "Chernobyl" is highly praised by Kohavi for its portrayal of the disaster. Furthermore, Kohavi discusses using structured narratives, a concept he learned at Amazon, as a minor change that has had a significant impact on their product development process.

  • Shifting from PowerPoint to structured documents improves feedback and decision-making: Using structured documents improves feedback, decision-making, and the ability to reference information, while prioritizing controlled experiments over anecdotal or observational studies leads to more data-driven decisions.

    Implementing a structured document instead of PowerPoint presentations can have a significant impact. Ronny Kohavi shared his experience at Amazon, where they shifted from using paper-based presentations to Word or Google Docs. This change allowed team members to provide honest feedback and made that feedback easy to document and reference after the meeting. Additionally, Ronny emphasized the importance of the hierarchy of evidence when making decisions based on information. Trusting controlled experiments, and especially multiple controlled experiments, over anecdotes or observational studies is crucial. Understanding the concept of controlled experiments can help individuals make data-driven decisions and improve their overall decision-making processes.


