In this three-part series we go into Experiment Driven Development (EDD) and A/B Testing. EDD is fact-based development: decisions are based on evidence, not intuition.
We've previously explored the foundations of Experiment Driven Development and A/B Testing. Today we'll dig into A/B Testing with Python, because the analysis can be easily automated with existing open-source Python libraries. In this post we will explore their usage with an example.
To orient the reader, we will state a few assumptions to anchor the example:
Let us assume that we ran an A/B test feature experiment for two weeks. The UI modification consisted of rearranging a button used in the process of drawing polygons around objects during image annotation.
Let us assume these were the recorded annotation times per image, in minutes, for users of each variant (each set can be represented as a Python list):
A. Control group (original arrangement):

```python
variant_a = [...]  # recorded annotation times in minutes, one entry per image
```

B. Experiment group (rearranged button):

```python
variant_b = [...]  # recorded annotation times in minutes, one entry per image
```
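Before running any test, it helps to look at basic descriptive statistics for both variants. The sketch below only summarizes the two lists introduced above with NumPy; the actual values come from your own experiment.

```python
import numpy as np

# Quick descriptive summary of both variants (sample statistics).
for name, sample in (("variant_a (control)", variant_a),
                     ("variant_b (experiment)", variant_b)):
    values = np.asarray(sample, dtype=float)
    print(f"{name}: n={values.size}, "
          f"mean={values.mean():.2f} min, std={values.std(ddof=1):.2f} min")
```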
1. Run the t-test from the scipy.stats module of SciPy (a mathematical, scientific and engineering library).

```python
import scipy.stats as stats

# Welch's t-test: equal_var=False does not assume equal population variances.
t_value, p = stats.ttest_ind(variant_a, variant_b, equal_var=False)
```
2. Calculate the degrees of freedom according to Welch's t-test definition, which is the one implemented in stats.ttest_ind when equal_var=False.

```python
import numpy as np

# Welch-Satterthwaite degrees of freedom; for illustrative details see Wikipedia.
s1 = np.std(variant_a, ddof=1)  # sample standard deviations
s2 = np.std(variant_b, ddof=1)
n1 = len(variant_a)
n2 = len(variant_b)
df = np.floor(((((s1 ** 2) / n1) + ((s2 ** 2) / n2)) ** 2) /
              (((s1 ** 4) / ((n1 ** 2) * (n1 - 1))) +
               ((s2 ** 4) / ((n2 ** 2) * (n2 - 1)))))
```
3. Now, using the same scipy.stats module, get the t-critical value for a 95% confidence level, i.e. an alpha of 0.05 (1 - confidence level), from the t distribution's ppf (percent point function), and evaluate the t statistic obtained in step 1 against it. If the statistic falls inside the range, H0 cannot be rejected; if it falls outside, we can reject H0 in favor of H1.

```python
alpha = 0.05

# Two-tailed critical value for the chosen alpha and degrees of freedom.
t_critical_value = stats.t.ppf(1 - (alpha / 2), df)

# True if the t statistic falls inside the acceptance region (H0 not rejected).
null_hypothesis = bool(t_critical_value >= t_value >= -t_critical_value)
```
4. The confidence interval of variant_b (experiment) will help us visualize the difference between the two variants. If the mean of the control group doesn't fall inside this interval, then the means of the two groups are significantly apart from each other, which suggests that the results are statistically significant.

```python
s = np.std(variant_b, ddof=1)  # sample standard deviation
x = np.mean(variant_b)         # sample mean
n = len(variant_b)

# Margin of error and confidence interval around the experiment group's mean.
rho = (t_critical_value * s) / np.sqrt(n)
conf_int = x - rho, x + rho
```
5. Statistical power is the probability that the test correctly rejects the null hypothesis, in other words, the probability of a true positive result. This is only useful when the null hypothesis is rejected. A low value of power could be an indication that the sample size is not yet big enough to validate the results (see the sample-size sketch after these steps). To calculate the statistical power we use the class TTestIndPower from the module statsmodels.stats.power of the statsmodels library.

```python
from statsmodels.stats.power import TTestIndPower

# Effect size based on Cohen's d formula (pooled sample standard deviation).
x1 = np.mean(variant_a)
x2 = np.mean(variant_b)
s1 = np.std(variant_a, ddof=1)
s2 = np.std(variant_b, ddof=1)
n1 = len(variant_a)
n2 = len(variant_b)
s = np.sqrt((((n1 - 1) * (s1 ** 2)) + ((n2 - 1) * (s2 ** 2))) / (n1 + n2 - 2))
effect = np.abs((x1 - x2) / s)

# Achieved power given the observed effect size and sample sizes.
power = TTestIndPower().power(effect, nobs1=n1, ratio=n2 / n1,
                              df=(n1 + n2 - 2), alpha=alpha)
```
6. Plot the sample distributions with the confidence interval as a visual aid, using the matplotlib library.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(12, 6))

# Control
xA = np.linspace(40, x1 + 3 * s1, 100)
yA = stats.norm(loc=x1, scale=s1).pdf(xA)
ax.plot(xA, yA, c='red', label='Variant A Distribution')
ax.axvline(x=x1, c='red', alpha=0.5, linestyle='--', label='Variant A')

# Experimental
xB = np.linspace(40, x2 + 3 * s2, 100)
yB = stats.norm(loc=x2, scale=s2).pdf(xB)
ax.plot(xB, yB, c='blue', label='Variant B Distribution')
ax.axvline(x=x2, c='blue', alpha=0.5, linestyle='--', label='Variant B')

# Confidence interval bounds around variant B's mean
ax.axvline(conf_int[0], c='green', linestyle='--', alpha=0.5)
ax.axvline(conf_int[1], c='green', linestyle='--', alpha=0.5, label='Confidence Interval')

plt.xlabel('Annotation Time')
plt.ylabel('Percent of Tasks per Annotation Time')
plt.title('Annotation Time Distributions')
plt.legend()
plt.show()
```
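The same TTestIndPower class can also be turned around to answer the question raised in step 5: how many samples per variant would we need to detect a given effect with adequate power? The sketch below is illustrative; the target effect size of 0.5 and the power of 0.8 are common conventions, not values from the experiment.

```python
import numpy as np
from statsmodels.stats.power import TTestIndPower

# Solve for the number of samples per group needed to detect a target effect
# size at the chosen significance level with the desired power.
target_effect = 0.5  # hypothetical Cohen's d we would like to be able to detect
required_n = TTestIndPower().solve_power(effect_size=target_effect,
                                         alpha=0.05, power=0.8, ratio=1.0)
print(f"Samples needed per variant: {int(np.ceil(required_n))}")
```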
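As a recap, the individual steps above can be collected into a single helper. The function below is a minimal sketch under the same assumptions as the steps (two independent samples, Welch's t-test, two-sided alpha); the name run_ab_test and its return dictionary are illustrative, not part of any library.

```python
import numpy as np
import scipy.stats as stats
from statsmodels.stats.power import TTestIndPower


def run_ab_test(variant_a, variant_b, alpha=0.05):
    """Welch's t-test plus the derived quantities from steps 1-5."""
    a = np.asarray(variant_a, dtype=float)
    b = np.asarray(variant_b, dtype=float)
    n1, n2 = len(a), len(b)
    s1, s2 = np.std(a, ddof=1), np.std(b, ddof=1)

    # Step 1: Welch's t-test.
    t_value, p = stats.ttest_ind(a, b, equal_var=False)

    # Step 2: Welch-Satterthwaite degrees of freedom.
    df = np.floor(((s1 ** 2 / n1 + s2 ** 2 / n2) ** 2) /
                  (s1 ** 4 / (n1 ** 2 * (n1 - 1)) +
                   s2 ** 4 / (n2 ** 2 * (n2 - 1))))

    # Step 3: two-tailed critical value and hypothesis decision.
    t_critical = stats.t.ppf(1 - alpha / 2, df)
    keep_null = bool(t_critical >= t_value >= -t_critical)

    # Step 4: confidence interval around the experiment group's mean.
    rho = (t_critical * s2) / np.sqrt(n2)
    conf_int = (np.mean(b) - rho, np.mean(b) + rho)

    # Step 5: Cohen's d effect size and achieved power.
    pooled = np.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
    effect = np.abs(np.mean(a) - np.mean(b)) / pooled
    power = TTestIndPower().power(effect, nobs1=n1, ratio=n2 / n1,
                                  df=n1 + n2 - 2, alpha=alpha)

    return {"t": t_value, "p": p, "df": df, "reject_h0": not keep_null,
            "conf_int": conf_int, "effect_size": effect, "power": power}
```

Calling run_ab_test(variant_a, variant_b) on the lists above returns the same quantities computed step by step in a single dictionary.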