AB testing is a fundamental tool in data science and product optimization. This comprehensive guide covers the statistical foundations, sample size calculations, and practical implementation strategies.
📊 Complete Implementation: All code examples and implementations discussed in this guide are available in the AB Testing GitHub Repository.
Computing Sample Size and Duration
Before running any AB test, we need to determine the appropriate sample size based on several key parameters:
```python
# Current baseline metrics
current_conv_rate = 0.1       # Base conversion rate (10%)
std_dev = 1                   # Standard deviation of the metric

# Desired uplift: minimum detectable effect
desired_uplift = 0.5          # 50% relative improvement

# Statistical parameters
statistical_power = 0.8       # Power (1 - beta)
alpha = 0.05                  # Significance level
confidence_level = 1 - alpha  # 95% confidence
side = 2                      # Two-sided test

# Business metrics
number_of_events_per_week = 1000
```
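These parameters plug directly into the standard normal-approximation sample size formula for comparing two proportions. A minimal sketch (the function name `sample_size_per_group` is mine; the formula is the textbook two-proportion power calculation, worth cross-checking against a library such as statsmodels):

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(p_control, relative_uplift, alpha=0.05, power=0.8):
    """Sample size per group for a two-sided two-proportion test,
    using the normal approximation."""
    p_variant = p_control * (1 + relative_uplift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)            # critical value for power
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    n = (z_alpha + z_beta) ** 2 * variance / (p_variant - p_control) ** 2
    return ceil(n)

n = sample_size_per_group(0.10, 0.50)  # 10% baseline, 50% relative uplift
weeks = ceil(2 * n / 1000)             # two groups, 1000 events per week
print(n, weeks)                        # → 683 2
```

With the parameters above, each group needs 683 users, so at 1,000 events per week the test runs for roughly two weeks.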
Setting Up Hypotheses
The foundation of any AB test lies in proper hypothesis formulation:
Null Hypothesis (H₀): The treatment has no effect (the two versions are equal)
Alternative Hypothesis (H₁): The treatment has an effect (the versions differ)
The alternative hypothesis can be:
- Directional: Specifies direction (greater than or less than)
- Non-directional: Only specifies difference (not equal to)
Significance Level and Statistical Power
The significance level (α = 0.05) represents the probability of rejecting H₀ when it's true (Type I error). Statistical power (1-β = 0.8) is the probability of correctly rejecting H₀ when it's false.
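One way to build intuition for α is an A/A simulation: when there is no true difference, a correctly calibrated test should still reject H₀ about 5% of the time. A small sketch (function name and parameters are my own, for illustration):

```python
import random
from math import sqrt
from statistics import mean

def type_i_error_rate(n=500, trials=2000, z_crit=1.96, seed=42):
    """Simulate A/A tests: both groups are drawn from the same
    distribution, so every rejection is a Type I error."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        a = [rng.gauss(0.0, 1.0) for _ in range(n)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = (mean(b) - mean(a)) / sqrt(2.0 / n)  # known sigma = 1
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials

rate = type_i_error_rate()
print(rate)  # close to 0.05
```

The observed rejection rate hovers around α = 0.05, which is exactly the false-positive rate we agreed to tolerate.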
Choosing the Right Test
For Continuous Data:
Z-Test (N > 30):
When the sample size exceeds roughly 30, the Central Limit Theorem makes the sampling distribution of the mean approximately normal, so we can use the Z-test:
- H₀: Average metric is the same for both versions
- H₁: Average metric differs between versions
The Z-score measures how many standard errors the observed difference lies from the value expected under H₀. Higher absolute Z-scores indicate stronger evidence against H₀.
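The two-sample Z-test for means can be written in a few lines. A sketch using only the standard library (the function name and the sample values are illustrative):

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z_test(mean_a, mean_b, std_a, std_b, n_a, n_b):
    """Two-sided z-test for the difference of two means (large samples)."""
    se = sqrt(std_a ** 2 / n_a + std_b ** 2 / n_b)  # standard error of the difference
    z = (mean_b - mean_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))    # two-sided p-value
    return z, p_value

z, p = two_sample_z_test(10.0, 10.5, 2.0, 2.0, 500, 500)
print(round(z, 2), p < 0.05)  # → 3.95 True
```

Here a 0.5 difference in means with n = 500 per group yields z ≈ 3.95, well beyond the 1.96 threshold for α = 0.05.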
Student's t-Test (N < 30):
For smaller samples:
- One-sample t-test: Compare sample mean to population mean
- Two-sample t-test: Compare means of two samples
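For small samples the t-statistic has the same shape as the z-statistic but uses a pooled variance estimate and a t-distribution critical value. A minimal sketch of the two-sample case (sample data and function name are mine; the 2.101 critical value is the tabled t(0.975, df=18)):

```python
from math import sqrt
from statistics import mean, stdev

def two_sample_t_stat(a, b):
    """Pooled two-sample t statistic (assumes equal variances)."""
    n_a, n_b = len(a), len(b)
    sp2 = ((n_a - 1) * stdev(a) ** 2 + (n_b - 1) * stdev(b) ** 2) / (n_a + n_b - 2)
    return (mean(b) - mean(a)) / sqrt(sp2 * (1 / n_a + 1 / n_b))

control = [9.1, 10.2, 8.7, 9.9, 10.5, 9.4, 10.1, 9.8, 9.0, 10.3]
variant = [10.4, 11.0, 9.8, 10.9, 11.3, 10.2, 10.8, 10.6, 9.9, 11.1]
t = two_sample_t_stat(control, variant)
# Compare |t| to the t-table critical value: t(0.975, df=18) ≈ 2.101
print(round(t, 2), abs(t) > 2.101)  # → 3.56 True
```

Since |t| exceeds the critical value, we reject H₀ at the 5% level even with only 10 observations per group.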
For Binary Data (Conversion Rates):
We can adapt the Z-test using the moments of the binomial distribution (mean p, variance p(1−p)/n). For the two-proportion test, the squared z-statistic equals the Pearson chi-square statistic of the 2×2 contingency table, so the two tests are equivalent.
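We can verify that equivalence numerically. A sketch with hypothetical conversion counts (function names are mine):

```python
from math import sqrt

def proportions_z(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z statistic."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

def chi2_2x2(conv_a, n_a, conv_b, n_b):
    """Pearson chi-square statistic for the 2x2 conversion table
    (no continuity correction)."""
    table = [[conv_a, n_a - conv_a], [conv_b, n_b - conv_b]]
    total = n_a + n_b
    col = [conv_a + conv_b, total - (conv_a + conv_b)]
    chi2 = 0.0
    for i, row_n in enumerate((n_a, n_b)):
        for j in range(2):
            expected = row_n * col[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2

z = proportions_z(100, 1000, 130, 1000)
chi2 = chi2_2x2(100, 1000, 130, 1000)
print(abs(z ** 2 - chi2) < 1e-9)  # → True
```

The identity z² = χ² holds exactly for the pooled statistic, which is why z-test and chi-square p-values for a 2×2 table agree.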
Non-Inferiority Testing
Sometimes we want to prove a new solution is "not worse" than the current one:
- Null Hypothesis: Variant ≤ Control − δ (worse than control by more than the tolerance)
- Alternative Hypothesis: Variant > Control − δ (not worse by more than the tolerance)
The non-inferiority margin (δ) represents the maximum acceptable difference while still considering performance equivalent.
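A non-inferiority test is simply a one-sided z-test with the margin shifted into the numerator. A sketch under hypothetical counts (the function name is mine; it uses the unpooled standard error, one common convention):

```python
from math import sqrt
from statistics import NormalDist

def non_inferiority_p(conv_c, n_c, conv_v, n_v, delta):
    """One-sided p-value for H0: p_variant <= p_control - delta
    versus H1: p_variant > p_control - delta (unpooled standard error)."""
    p_c, p_v = conv_c / n_c, conv_v / n_v
    se = sqrt(p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v)
    z = (p_v - p_c + delta) / se   # shift the difference by the margin
    return 1 - NormalDist().cdf(z)

# Variant converts slightly below control, but within a 2-point margin
p = non_inferiority_p(1100, 10000, 1050, 10000, delta=0.02)
print(p < 0.05)  # → True
```

Even though the variant's observed rate is lower, rejecting H₀ lets us conclude it is not worse than control by more than δ.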
Practical Implementation
Key considerations for successful AB testing:
- Sample Size Planning: Use power analysis to determine required sample size
- Test Duration: Balance statistical significance with business timelines
- Multiple Testing: Apply corrections when running multiple tests
- Practical Significance: Ensure detected effects are business-relevant
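The multiple-testing point can be made concrete with the Holm–Bonferroni step-down procedure, which controls the family-wise error rate while being uniformly more powerful than plain Bonferroni. A minimal sketch (function name and p-values are illustrative):

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm step-down correction: returns which hypotheses are rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending p-values
    rejected = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):  # shrinking threshold
            rejected[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return rejected

print(holm_bonferroni([0.010, 0.040, 0.030]))  # → [True, False, False]
```

Note that 0.040 and 0.030 would each be "significant" in isolation, but after correction only the smallest p-value survives.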
Conclusion
Proper AB testing requires careful attention to statistical foundations, from hypothesis formulation to test selection. By understanding these principles, you can design experiments that provide reliable, actionable insights for your business decisions.
View Complete Implementation on GitHub