AB testing is a fundamental tool in data science and product optimization. This comprehensive guide covers the statistical foundations, sample size calculations, and practical implementation strategies.
📊 Complete Implementation: All code examples and implementations discussed in this guide are available in the AB Testing GitHub Repository.
Computing Sample Size and Duration
Before running any AB test, we need to determine the appropriate sample size based on several key parameters:
```python
# Current baseline metrics
current_conv_rate = 0.1       # Base conversion rate (10%)
std_dev = 1                   # Standard deviation of the metric

# Desired uplift: minimum detectable effect
desired_uplift = 0.5          # 50% relative improvement

# Statistical parameters
statistical_power = 0.8       # Power (1 - beta)
alpha = 0.05                  # Significance level
confidence_level = 1 - alpha  # 95% confidence
side = 2                      # Two-sided test

# Business metrics
number_of_events_per_week = 1000
```
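These parameters plug directly into the standard normal-approximation sample size formula for comparing two proportions. A minimal sketch (the function name `sample_size_per_group` is mine; the formula is the textbook two-proportion power calculation, worth cross-checking against a library such as statsmodels):

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(p_control, relative_uplift, alpha=0.05, power=0.8):
    """Sample size per group for a two-sided two-proportion test,
    using the normal approximation."""
    p_variant = p_control * (1 + relative_uplift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)            # critical value for power
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    n = (z_alpha + z_beta) ** 2 * variance / (p_variant - p_control) ** 2
    return ceil(n)

n = sample_size_per_group(0.10, 0.50)  # 10% baseline, 50% relative uplift
weeks = ceil(2 * n / 1000)             # two groups, 1000 events per week
print(n, weeks)                        # → 683 2
```

With the parameters above, each group needs 683 users, so at 1,000 events per week the test runs for roughly two weeks.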
Setting Up Hypotheses
The foundation of any AB test lies in proper hypothesis formulation:
Null Hypothesis (H₀): The treatment has no effect (the two versions are equal)
Alternative Hypothesis (H₁): The treatment has an effect (the versions differ)
The alternative hypothesis can be:
- Directional: Specifies direction (greater than or less than)
- Non-directional: Only specifies difference (not equal to)
Significance Level and Statistical Power
The significance level (α = 0.05) represents the probability of rejecting H₀ when it's true (Type I error). Statistical power (1-β = 0.8) is the probability of correctly rejecting H₀ when it's false.
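One way to build intuition for α is an A/A simulation: when there is no true difference, a correctly calibrated test should still reject H₀ about 5% of the time. A small sketch (function name and parameters are my own, for illustration):

```python
import random
from math import sqrt
from statistics import mean

def type_i_error_rate(n=500, trials=2000, z_crit=1.96, seed=42):
    """Simulate A/A tests: both groups are drawn from the same
    distribution, so every rejection is a Type I error."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        a = [rng.gauss(0.0, 1.0) for _ in range(n)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = (mean(b) - mean(a)) / sqrt(2.0 / n)  # known sigma = 1
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials

rate = type_i_error_rate()
print(rate)  # close to 0.05
```

The observed rejection rate hovers around α = 0.05, which is exactly the false-positive rate we agreed to tolerate.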
Choosing the Right Test
For Continuous Data:
Z-Test (N > 30):
When the sample size exceeds roughly 30, the Central Limit Theorem makes the sampling distribution of the mean approximately normal, so we can use the Z-test:
- H₀: Average metric is the same for both versions
- H₁: Average metric differs between versions
The Z-score measures how many standard errors the observed difference lies from the value expected under H₀. Higher absolute Z-scores indicate stronger evidence against H₀.
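The two-sample Z-test for means can be written in a few lines. A sketch using only the standard library (the function name and the sample values are illustrative):

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z_test(mean_a, mean_b, std_a, std_b, n_a, n_b):
    """Two-sided z-test for the difference of two means (large samples)."""
    se = sqrt(std_a ** 2 / n_a + std_b ** 2 / n_b)  # standard error of the difference
    z = (mean_b - mean_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))    # two-sided p-value
    return z, p_value

z, p = two_sample_z_test(10.0, 10.5, 2.0, 2.0, 500, 500)
print(round(z, 2), p < 0.05)  # → 3.95 True
```

Here a 0.5 difference in means with n = 500 per group yields z ≈ 3.95, well beyond the 1.96 threshold for α = 0.05.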
Student's t-Test (N < 30):
For smaller samples:
- One-sample t-test: Compare sample mean to population mean
- Two-sample t-test: Compare means of two samples
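For small samples the t-statistic has the same shape as the z-statistic but uses a pooled variance estimate and a t-distribution critical value. A minimal sketch of the two-sample case (sample data and function name are mine; the 2.101 critical value is the tabled t(0.975, df=18)):

```python
from math import sqrt
from statistics import mean, stdev

def two_sample_t_stat(a, b):
    """Pooled two-sample t statistic (assumes equal variances)."""
    n_a, n_b = len(a), len(b)
    sp2 = ((n_a - 1) * stdev(a) ** 2 + (n_b - 1) * stdev(b) ** 2) / (n_a + n_b - 2)
    return (mean(b) - mean(a)) / sqrt(sp2 * (1 / n_a + 1 / n_b))

control = [9.1, 10.2, 8.7, 9.9, 10.5, 9.4, 10.1, 9.8, 9.0, 10.3]
variant = [10.4, 11.0, 9.8, 10.9, 11.3, 10.2, 10.8, 10.6, 9.9, 11.1]
t = two_sample_t_stat(control, variant)
# Compare |t| to the t-table critical value: t(0.975, df=18) ≈ 2.101
print(round(t, 2), abs(t) > 2.101)  # → 3.56 True
```

Since |t| exceeds the critical value, we reject H₀ at the 5% level even with only 10 observations per group.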
For Binary Data (Conversion Rates):
We can adapt the Z-test using the moments of the binomial distribution (mean p, variance p(1−p)/n). For the two-proportion test, the squared z-statistic equals the Pearson chi-square statistic of the 2×2 contingency table, so the two tests are equivalent.
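We can verify that equivalence numerically. A sketch with hypothetical conversion counts (function names are mine):

```python
from math import sqrt

def proportions_z(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z statistic."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

def chi2_2x2(conv_a, n_a, conv_b, n_b):
    """Pearson chi-square statistic for the 2x2 conversion table
    (no continuity correction)."""
    table = [[conv_a, n_a - conv_a], [conv_b, n_b - conv_b]]
    total = n_a + n_b
    col = [conv_a + conv_b, total - (conv_a + conv_b)]
    chi2 = 0.0
    for i, row_n in enumerate((n_a, n_b)):
        for j in range(2):
            expected = row_n * col[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2

z = proportions_z(100, 1000, 130, 1000)
chi2 = chi2_2x2(100, 1000, 130, 1000)
print(abs(z ** 2 - chi2) < 1e-9)  # → True
```

The identity z² = χ² holds exactly for the pooled statistic, which is why z-test and chi-square p-values for a 2×2 table agree.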
Non-Inferiority Testing
Sometimes we want to prove a new solution is "not worse" than the current one:
- Null Hypothesis: Variant ≤ Control − δ (worse than control by more than the tolerance)
- Alternative Hypothesis: Variant > Control − δ (not worse by more than the tolerance)
The non-inferiority margin (δ) represents the maximum acceptable difference while still considering performance equivalent.
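A non-inferiority test is simply a one-sided z-test with the margin shifted into the numerator. A sketch under hypothetical counts (the function name is mine; it uses the unpooled standard error, one common convention):

```python
from math import sqrt
from statistics import NormalDist

def non_inferiority_p(conv_c, n_c, conv_v, n_v, delta):
    """One-sided p-value for H0: p_variant <= p_control - delta
    versus H1: p_variant > p_control - delta (unpooled standard error)."""
    p_c, p_v = conv_c / n_c, conv_v / n_v
    se = sqrt(p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v)
    z = (p_v - p_c + delta) / se   # shift the difference by the margin
    return 1 - NormalDist().cdf(z)

# Variant converts slightly below control, but within a 2-point margin
p = non_inferiority_p(1100, 10000, 1050, 10000, delta=0.02)
print(p < 0.05)  # → True
```

Even though the variant's observed rate is lower, rejecting H₀ lets us conclude it is not worse than control by more than δ.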
Practical Implementation
Key considerations for successful AB testing:
- Sample Size Planning: Use power analysis to determine required sample size
- Test Duration: Balance statistical significance with business timelines
- Multiple Testing: Apply corrections when running multiple tests
- Practical Significance: Ensure detected effects are business-relevant
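The multiple-testing point can be made concrete with the Holm–Bonferroni step-down procedure, which controls the family-wise error rate while being uniformly more powerful than plain Bonferroni. A minimal sketch (function name and p-values are illustrative):

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm step-down correction: returns which hypotheses are rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending p-values
    rejected = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):  # shrinking threshold
            rejected[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return rejected

print(holm_bonferroni([0.010, 0.040, 0.030]))  # → [True, False, False]
```

Note that 0.040 and 0.030 would each be "significant" in isolation, but after correction only the smallest p-value survives.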
Conclusion
Proper AB testing requires careful attention to statistical foundations, from hypothesis formulation to test selection. By understanding these principles, you can design experiments that provide reliable, actionable insights for your business decisions.
View Complete Implementation on GitHub