Why and how to use the Bland-Altman plot for A/B testing (Python code)

Published: 2024/1/1

Table of contents

Why use the Bland-Altman plot

The data used

How the Bland-Altman plot is built

How to interpret the Bland-Altman plot for the A/B test

Conclusion

Why use the Bland-Altman plot

The Bland-Altman plot comes from the medical field, where it is used to compare the measurements of two instruments. The first objective of John Martin Bland and Douglas Altman was to answer this question:

"Do the two methods of measurement agree sufficiently closely?" (D. G. Altman and J. M. Bland [1])

In practice: if you have two instruments, one state of the art but expensive and the other 10 times cheaper, are the results of the cheaper method comparable to the reference, and could it replace the reference with sufficient accuracy? For example, is the heart rate given by a $20 connected watch sufficiently close to the result of an electrocardiogram? The second objective was to produce a method whose results are easily understandable by non-statisticians.

In analytics, A/B testing (also known as Champion-Challenger) is a common test methodology that compares the results of a new action / a new treatment / a new design / … applied to population_A with those of population_B, which keeps the current action. Once we have the results of the test, they have to be analysed and presented to a decision-making team mostly composed of non-statisticians. That is why the Bland-Altman plot is relevant: it compares the results of the A/B test on a single plot, with all of the statistical measures displayed in an understandable way.

In their paper, they also showed why the correlation coefficient, statistical tests comparing means and regression are inappropriate for deciding whether two measures agree, which in our A/B testing case means deciding on the performance of the challenger compared to that of the champion.

The data used

For this article I will use a dataset available on Kaggle (coming from a DataCamp project) called "Mobile Games A/B Testing with Cookie Cats". The link is in the references [2].

Cookie Cats is a popular mobile puzzle game in which, as a player progresses through the levels, he encounters "gates" that force him to wait for some time or make a purchase before continuing to play. In such an industry retention is one of the key metrics, and the team in charge of the game wanted to see the impact on retention at 7 days of moving the first gate from level 30 to level 40. To measure the effect of such a move they ran an A/B test and provided us with the dataset of the results. We will see how the Bland-Altman plot answers the following question: "How to analyze the A/B results on the level of retention at 7 days when the waiting time moves from level 30 to level 40?"

The dataset is composed of 90,189 rows containing the player's unique id, the version of the A/B test (waiting time at gate_30 or gate_40), the total number of game rounds, retention_1 (a boolean saying whether the player came back the next day) and retention_7 (a boolean saying whether the player came back after 7 days). To get the data relevant to our question, some cleaning is necessary. I only keep the players with retention_1 = True (on the assumption that if retention_1 = False, retention_7 is False as well), a number of game rounds ≥ 30 (because players who never reach level 30 are not affected by the gate) and a number of game rounds < 170 (because if a game lasts 5 minutes and a player plays 2 hours per day for 7 days, he plays 120*7/5 = 168 games; a higher number is considered abnormal usage). After this filter, the dataset has 20,471 rows, as in Figure 1 below. Moreover, the dataset is evenly balanced between gate_30 and gate_40.

Figure 1. CookieCats dataset
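The cleaning rules described above can be sketched with pandas. This is only an illustration: the column names (userid, version, sum_gamerounds, retention_1, retention_7) are assumed to match the Kaggle CSV, and the small frame below merely stands in for the real data.

```python
import pandas as pd


def clean_cookie_cats(df: pd.DataFrame) -> pd.DataFrame:
    """Keep players who came back on day 1 and played between 30 and 169 rounds."""
    mask = (
        df["retention_1"]
        & (df["sum_gamerounds"] >= 30)
        & (df["sum_gamerounds"] < 170)
    )
    return df.loc[mask].reset_index(drop=True)


# Toy stand-in for the real dataset (column names are an assumption).
toy = pd.DataFrame({
    "userid": [1, 2, 3, 4, 5],
    "version": ["gate_30", "gate_40", "gate_30", "gate_40", "gate_30"],
    "sum_gamerounds": [12, 45, 169, 170, 80],
    "retention_1": [True, True, True, True, False],
    "retention_7": [False, True, False, True, False],
})

cleaned = clean_cookie_cats(toy)
print(cleaned)
```

With the real file, the same function would be applied to the frame returned by pd.read_csv on the downloaded CSV.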

How the Bland-Altman plot is built

In this section we will see how to adapt the original Bland-Altman plot in order to apply it to an A/B test. First I will explain how the plot is built in its original version [1], [2], and then how to build it with the data of our A/B test.

Because the original Bland-Altman plot compares the measurements of 2 instruments, the two series have the same length by design. For example, when comparing heart rate between the $20 connected watch and the electrocardiogram, the measurements are taken at the same time under the same conditions, which gives the same number of measurements for the 2 methods. So we can represent each row of the dataset as one experiment, as in the example in Figure 2 below.

Figure 2. Instrument measurements per experiment

This is where we encounter the first "pain point". An A/B test is a single experiment, whereas, as we saw above, we need several experiments to build the plot. To bypass this limitation, we will create several bootstrapped samples from the A/B test, some with the same length and some with different lengths.

We generate 300 non-unique random integers between 200 and 1,000. These integers are the lengths of the bootstrapped samples and, in order to benefit from the statistical properties of the bootstrap, each one is repeated 50 times. These numbers are chosen to give some diversity of samples, but they are arbitrary and the lengths depend on the size of the original dataset. The resulting 15,000 (300*50) bootstrapped samples, with lengths between 200 and 1,000, are obtained by random sampling with replacement from the original dataset and are concatenated together. This can be represented as in Figure 3.

Figure 3. Bootstrapped dataset building

Creating the bootstrapped dataset from the original data can take time, because it has 9,184,350 rows (with a different random_state we would get on average ((200+1,000)/2)*300*50 = 9,000,000 rows).
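A minimal sketch of this bootstrap step is given below. It is not necessarily the author's implementation: the helper name build_bootstrap and its parameters are hypothetical, only the column names n_sample and n_sample_2 come from the text, and the demo uses a tiny toy frame with far fewer samples so that it runs instantly.

```python
import numpy as np
import pandas as pd


def build_bootstrap(df, n_sizes=300, n_repeats=50, low=200, high=1000, seed=0):
    """Concatenate n_sizes * n_repeats samples drawn with replacement from df."""
    rng = np.random.default_rng(seed)
    # Non-unique sample lengths in [low, high], each repeated n_repeats times.
    sizes = np.repeat(rng.integers(low, high + 1, size=n_sizes), n_repeats)
    samples = []
    for sample_id, size in enumerate(sizes):
        draw_seed = int(rng.integers(0, 2**32 - 1))  # one seed per draw
        sample = df.sample(n=int(size), replace=True, random_state=draw_seed)
        # n_sample: id of the bootstrapped sample, n_sample_2: its length.
        samples.append(sample.assign(n_sample=sample_id, n_sample_2=int(size)))
    return pd.concat(samples, ignore_index=True)


# Toy data standing in for the cleaned dataset (assumed column names).
toy = pd.DataFrame({
    "version": ["gate_30", "gate_40"] * 50,
    "retention_7": [True, False, False, False] * 25,
})

boot = build_bootstrap(toy, n_sizes=3, n_repeats=2, low=20, high=40)
print(boot["n_sample"].nunique(), len(boot))
```

With the default arguments the function would produce the 15,000 samples described in the text, at the cost of a much longer run time.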

Then, we group by n_sample (the id of each of the 15,000 bootstrapped samples), n_sample_2 (the length of each bootstrapped sample) and version in order to get the number of players retained at 7 days per gate, as in Figure 4.

Figure 4. Bootstrapped dataset after groupby

We can read this output as: bootstrapped sample n°0 (ids run from 0 to 14,999) has 564 rows, in which 98 players with the waiting time at gate_30 are still playing at 7 days, against 105 players with the waiting time at gate_40.
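The grouping described above might look like the following with pandas. The toy rows and the retention_7 column name are assumptions used only to show the shape of the operation.

```python
import pandas as pd

# Toy bootstrapped rows: two samples with ids 0 and 1 (assumed column names).
boot = pd.DataFrame({
    "n_sample": [0, 0, 0, 0, 1, 1, 1],
    "n_sample_2": [4, 4, 4, 4, 3, 3, 3],
    "version": ["gate_30", "gate_30", "gate_40", "gate_40",
                "gate_30", "gate_40", "gate_40"],
    "retention_7": [True, False, True, True, True, False, True],
})

# Number of players still active at day 7, per sample and per gate.
retained = (
    boot.groupby(["n_sample", "n_sample_2", "version"])["retention_7"]
    .sum()
    .reset_index()
)
print(retained)
```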

Then, we use a statistical property of the bootstrap: the mean over bootstrap samples is a good estimator of the true mean of a distribution. We group by n_sample_2 and version to get, for each unique sample length, the average number of players retained at 7 days per gate, as in Figure 5.

Figure 5. Average number of players still playing at 7 days for each unique sample length and per gate

We can read this output as: when the sample has 200 rows, on average 34.60 players with the waiting time at gate_30 are still playing at 7 days, against 34.38 players with the waiting time at gate_40.

Then we unstack the dataset to get a clearer format, as in Figure 6.

Figure 6. Unstacked version of the dataset
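As an illustration, the averaging and unstacking steps could be written as below. The toy counts have the same layout as Figure 4; the variable name db follows the text, while the values are made up.

```python
import pandas as pd

# Toy per-sample counts with the layout of Figure 4 (assumed column names).
retained = pd.DataFrame({
    "n_sample": [0, 0, 1, 1, 2, 2],
    "n_sample_2": [200, 200, 200, 200, 300, 300],
    "version": ["gate_30", "gate_40"] * 3,
    "retention_7": [34, 36, 36, 34, 52, 50],
})

# Average count per sample length and gate, then one column per gate.
db = (
    retained.groupby(["n_sample_2", "version"])["retention_7"]
    .mean()
    .unstack("version")
)
print(db)
```

Each row of db now corresponds to one sample length, with one column per gate, which is the layout the Bland-Altman plot expects.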

At this stage we have all the information needed to build the Bland-Altman plot, and the dataset has the same layout as in Figure 2 above.

The Bland-Altman plot has 2 axes. The x axis is the average of the two methods being compared, so for each row i:

x_i = (gate_30_i + gate_40_i) / 2

The y axis is the difference between method A and method B, so for each row i:

y_i = gate_30_i - gate_40_i

And here is our second "pain point". If we keep the y axis as it is, the variability of the differences grows with the sample size. As a result, the statistics we compute later would be dominated by the biggest samples. To bypass this limitation, we express the y axis as a percentage [3]. The calculation of y for each row becomes:

y_i = (gate_30_i - gate_40_i) * 100 / ((gate_30_i + gate_40_i) / 2)

The dataset then looks like Figure 7.

Figure 7. x and y axes
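The two axes can be computed in a couple of lines. The sketch below assumes an unstacked frame db with gate_30 and gate_40 columns, filled here with made-up values.

```python
import pandas as pd

# Toy unstacked table: one row per sample length (assumed column names).
db = pd.DataFrame(
    {"gate_30": [35.0, 52.0, 90.0], "gate_40": [34.0, 50.0, 88.0]},
    index=pd.Index([200, 300, 500], name="n_sample_2"),
)

# x axis: mean of the two gates; y axis: difference as a percentage of that mean.
db["x"] = (db["gate_30"] + db["gate_40"]) / 2
db["y"] = (db["gate_30"] - db["gate_40"]) * 100 / db["x"]
print(db.round(2))
```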

We have to check that y is normally distributed in order to trust the confidence interval that will be displayed. You can assess it with the Shapiro-Wilk test, or at least with a histogram. If the distribution is not Normal, you can apply a transformation such as a logarithmic transformation. In our case, I consider the distribution to be Normal.

Figure 8. Histogram of the y axis
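For the Shapiro-Wilk check, a minimal sketch with scipy.stats.shapiro might look like this. The synthetic values only stand in for the real y column, and the 0.05 threshold is a common convention rather than something stated in the article.

```python
import numpy as np
from scipy import stats

# Synthetic percentage differences standing in for db["y"] (an assumption).
rng = np.random.default_rng(42)
y = rng.normal(loc=2.9, scale=1.9, size=300)

statistic, p_value = stats.shapiro(y)

# A p-value above the chosen threshold (often 0.05) means that
# normality is not rejected; it does not prove normality.
print(f"W = {statistic:.3f}, p = {p_value:.3f}")
```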

The Bland-Altman plot is composed of 3 lines (see Figure 9):

the average value of y
the upper bound of the confidence interval of y (here at 95%, given the factor 1.96)
the lower bound of the confidence interval of y (at 95%)

Figure 9. Values of the Bland-Altman plot
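The three lines can be computed directly from y. The sketch below assumes the usual Bland-Altman definition, mean ± 1.96 × standard deviation of the differences; the function name bland_altman_lines is hypothetical.

```python
import numpy as np


def bland_altman_lines(y, limit_of_agreement=1.96):
    """Return (lower, mean, upper) for the differences y."""
    y = np.asarray(y, dtype=float)
    mean = y.mean()
    sd = y.std(ddof=1)  # sample standard deviation
    return mean - limit_of_agreement * sd, mean, mean + limit_of_agreement * sd


lower, mean, upper = bland_altman_lines([1.0, 2.0, 3.0, 4.0, 5.0])
print(lower, mean, upper)
```

Changing limit_of_agreement widens or narrows the band around the mean, which is the idea used later when the lower bound is moved to 0.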

Putting it all together, the pyCompare package draws the Bland-Altman plot very easily, without having to build db['y'] by hand.

It takes first method A (the champion) and then method B (the challenger). Then, if percentage = True, it automatically does the calculation we made above. There are some other parameters that we will discuss later.
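A call along these lines would produce the plot, assuming the pyCompare package is installed. The wrapper name draw_bland_altman and the title text are illustrative; the keyword arguments are those of pyCompare.blandAltman as I understand its documented interface, so treat the exact signature as an assumption to verify against the installed version.

```python
def draw_bland_altman(champion, challenger, limit_of_agreement=1.96):
    """Draw the percentage Bland-Altman plot with pyCompare (assumed installed)."""
    import pyCompare  # imported here so the module loads without the package

    pyCompare.blandAltman(
        champion,                 # method A, e.g. db["gate_30"]
        challenger,               # method B, e.g. db["gate_40"]
        limitOfAgreement=limit_of_agreement,
        confidenceInterval=95,
        percentage=True,          # y axis as a percentage of the mean
        title="Bland-Altman plot for the A/B test",
    )
```

For example, draw_bland_altman(db["gate_30"], db["gate_40"]) would plot the champion against the challenger with the default 1.96 limits.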

How to interpret the Bland-Altman plot for the A/B test

Here we are! Here is the Bland-Altman plot for the A/B test, generated with pyCompare:

Figure 10. Bland-Altman plot applied to the A/B test

First of all, the mean and its confidence interval (light blue stripe) are different from 0 (higher, in our case). This means that the retention levels of gate_30 and gate_40 are significantly different (the mean difference is named bias in the original paper). Since 2.93 > 0, A > B, i.e. Champion > Challenger, and as a result a waiting period at gate_30 gives a higher retention than a waiting period at gate_40.

The two salmon stripes mark the 95% confidence interval (named limits of agreement in the original paper): we expect 95% of the values to lie within [-0.78% ; 6.63%]. In our example, this is very powerful because we can say that the retention of gate_30 will almost always be higher than that of gate_40.

As you can see, there are 2 values above the upper salmon stripe and 4 below the lower one, i.e. 6/300 = 0.02 < 0.05. Since we expect 95% of the values to fall within the 2 boundaries, up to 5% can fall above or below them; here it is 2%, so this is perfectly normal ;)

In the pyCompare package, the parameter limitOfAgreement changes the boundaries of the confidence interval. Here, a relevant question would be: "With what percentage of confidence can I be sure that the retention of gate_30 will always be higher than that of gate_40?" To answer this question, the lower boundary has to be equal to 0, so we have to find the value of limitOfAgreement that gives 0, as shown in Figure 11.

Figure 11. Output when changing the limitOfAgreement parameter
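Rather than searching by trial and error, the value can be estimated from the numbers reported for Figure 10. This sketch assumes the boundaries are mean ± limitOfAgreement × standard deviation, so the lower boundary reaches 0 when limitOfAgreement equals mean / standard deviation.

```python
# Values read from the Bland-Altman plot above (percentage scale).
mean_diff = 2.93
lower, upper = -0.78, 6.63
default_limit = 1.96

# The boundaries are mean +/- limit * sd, so sd follows from their width.
sd = (upper - lower) / (2 * default_limit)

# The lower boundary is zero when mean - k * sd == 0.
k = mean_diff / sd
print(round(k, 2))
```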

We see that when limitOfAgreement = 1.55, the lower boundary is almost equal to 0. Looking up 1.55 in the standard Normal distribution table gives 0.9394, so we are sure at (1 - (1 - 0.9394)*2)*100 = 87.88% that the retention of gate_30 will always be higher than that of gate_40.
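The table lookup and the resulting percentage can be reproduced with the standard library. The small gap with the 87.88% quoted above comes from rounding the table value to four decimals.

```python
from statistics import NormalDist

k = 1.55

# Cumulative probability of the standard normal at k (the table value).
phi = NormalDist().cdf(k)

# Share of values expected within mean +/- k standard deviations.
two_sided = 1 - 2 * (1 - phi)

print(f"{phi:.4f} {two_sided:.4f}")
```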

A last point: whatever the average value of the sample, the points are spread uniformly across the plot, which means that our interpretation holds whatever the sample size. Had we seen a cone-shaped pattern in the values, we could have concluded that the sample size affects the results, and our interpretation would not be valid.

Conclusion

We saw why the Bland-Altman plot is relevant for viewing the results of an A/B test on one simple plot, how to create the plot from an A/B test and how to interpret it. This only works if the differences are normally distributed; otherwise it will be necessary to transform the data.

Moreover, I checked the app and the first gate appears to be at level 40, even though we showed that retention at 7 days was better with the gate at level 30. This suggests that retention is maybe not the best metric to follow compared to profitability!

Translated from: https://towardsdatascience.com/why-how-to-use-the-bland-altman-plot-for-a-b-testing-python-code-78712d28c362