Power calculation for difference in proportions


I almost posted the following on stackexchange, but as so often before, the process of writing down my question led to its natural resolution!
Dear all, I am trying to understand the origin of the actual formula used to compute the power of the power.prop.test() function in R, which is defined in lines 12-14 of the source:

 tside <- switch(alternative, one.sided = 1, two.sided = 2)
 p.body <- quote(pnorm(((sqrt(n) * abs(p1 - p2) - (qnorm(sig.level/tside, 
        lower.tail = FALSE) * sqrt((p1 + p2) * (1 - (p1 + p2)/2))))/sqrt(p1 * 
        (1 - p1) + p2 * (1 - p2)))))

I took the liberty of rewriting the expression inside the pnorm() function in readable notation: \[
\frac{ \sqrt{n} \cdot |p_1 - p_2| - z_{1-\alpha/k}
\cdot \sqrt{(p_1 + p_2) \cdot (1 - (p_1 + p_2)/2)}}{\sqrt{p_1 \cdot (1 - p_1) + p_2 \cdot (1 - p_2)}}
\]
where k (= tside) is either 1 or 2 depending on the alternative argument. I do not quite understand the second term in the numerator; there is no reference to it in the man page. In particular, if I interpret it as a reference value, I am surprised to see it depend on \(p_2\)!
In addition, this power function does not agree with my naive version below. Let us fix \(p_1=0.04, p_2=0.05\); power.prop.test(p1=0.04, p2=0.05, power=0.95) tells us that we need a sample size of 11166 in each group to achieve a power of 0.95. If I were to test the hypothesis \(H_0: p_1=p_2\), my critical value would simply be tr = qnorm(1-0.05/2, mean=0, sd=sr), which does NOT depend on \(p_2\). Here \(sr=\sqrt{p_1 \cdot (1 - p_1) + p_1 \cdot (1 - p_1)}\) is the sample standard deviation of the difference in proportions (under the assumption that \(H_0\) holds).
This is where I realized my fallacy: I was about to substitute \(p_1=0.04\) in the expression for sr. But that would amount to testing the null hypothesis \(H_0: p_1=p_2=0.04\), which is a very different conjecture! Instead, we simply estimate the common \(p\) under \(H_0\) by the average \((p_1 + p_2)/2\), and that is exactly how the second term in the numerator of the formula above arises.
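To double-check, here is a minimal R sketch (using the numbers from the example above) that reproduces the power.prop.test() result from the formula, with the common p estimated by the pooled average:

    # Verify the formula by hand for p1 = 0.04, p2 = 0.05 (two-sided, alpha = 0.05)
    p1 <- 0.04; p2 <- 0.05; alpha <- 0.05
    n  <- 11166                                   # per-group sample size reported by power.prop.test

    pbar <- (p1 + p2) / 2                         # pooled p under H0: p1 = p2
    sd0  <- sqrt(2 * pbar * (1 - pbar))           # identical to sqrt((p1 + p2) * (1 - (p1 + p2)/2))
    sd1  <- sqrt(p1 * (1 - p1) + p2 * (1 - p2))   # sd under the alternative
    z    <- qnorm(1 - alpha / 2)

    pnorm((sqrt(n) * abs(p1 - p2) - z * sd0) / sd1)   # ~0.95
    power.prop.test(p1 = p1, p2 = p2, power = 0.95)$n # ~11166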



Inverse Regression revisited

Inverse Regression, Regression to the Mean and stochastic dependent variables

I am trying to reconcile the following observations/viewpoints:

  • Inverse Regression: It is well known that the slope of the inverse regression (of x on y) is not simply the reciprocal 1/β of the forward slope β; instead the two slopes satisfy
    \[ \beta_{yx} \cdot \beta_{xy} = R^2 \]
  • Regression to the Mean: The relationship from above is intimately connected with the phenomenon of Regression to the Mean, which can be stated in many ways:
    1. \(\hat{y} = \rho x\) if both variables are standardized, i.e.
       \[ \frac{\hat{y} - \bar{y}}{s_y} = \rho \cdot \frac{x - \bar{x}}{s_x} \]
    2. Let \(X_1\), \(X_2\) be random variables with identical marginal distributions with mean \(\mu\). In this formalization, the bivariate distribution of \(X_1\) and \(X_2\) is said to exhibit regression toward the mean if, for every number \(c > \mu\), we have
       \[ \mu \le E(X_2 \mid X_1 = c) < c \]
    3. Regression toward the mean simply says that, following an extreme random event, the next random event is likely to be less extreme.
    • Stochastic Regressors: Note the identical-marginal-distributions condition from above. This would imply that the independent variable x “needs” to be a (normally distributed) random variable as well! But regression can be (and often is) defined for non-stochastic, fixed exogenous x; in fact the Auer book explicitly makes that assumption (Annahme C1). And yet, the mathematics of the slope computation does not change whether x is stochastic or not; the small simulation below illustrates this. Read on: https://dl.dropboxusercontent.com/u/8089659/stuff/InvRegression.html
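As a sanity check, here is a minimal R simulation (my own sketch, not taken from the linked note) of the slope identity: the product of the forward and inverse regression slopes equals R², whether x is drawn at random or laid out as a fixed design.

    # Slope of y on x times slope of x on y equals R^2
    set.seed(1)
    n <- 1000
    x <- rnorm(n)                  # stochastic regressor; a fixed grid such as seq(-3, 3, length.out = n) behaves the same
    y <- 2 * x + rnorm(n, sd = 3)

    b_yx <- coef(lm(y ~ x))[2]     # forward regression slope
    b_xy <- coef(lm(x ~ y))[2]     # inverse regression slope
    b_yx * b_xy                    # product of the two slopes ...
    cor(x, y)^2                    # ... agrees with R^2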

Type-I error in bandits


Aggregating multiple markets in microeconomics

The following technical note addresses the algorithm proposed in Chapter 3.2 of the
Oz Shy book “How to Price”. A common thread throughout the book is the
suggestion of pseudo-code algorithms designed to help the reader implement the
theory into working code. Unfortunately, this worthy idea is frequently less valuable
than it could be, as little or no effort has been invested in pursuing a minimal
standard of algorithmic efficiency. We now point out the shortcomings of Algorithm
3.3 (in particular its exponential complexity) and suggest an alternative that
scales polynomially in the number of markets. Let us summarize the section
first:

A computer algorithm for M markets

We now proceed to describe a computer algorithm for selecting
the most profitable price when there are M markets, and where
each market is composed of homogeneous consumers with identical
maximum willingness to pay, as was already illustrated in Figure 3.3
for the example of M = 3 markets. The user must run Algorithm
3.2 before running Algorithm 3.3 for the purpose of generating and
storing the 2^M possible selections of markets that can be served.
Recalling Algorithm 3.2, the results are stored by the array of arrays
of binary variables Sel[i,l] of dimension 2^M × M, where i is the index
of a specific selection of markets, i = 0, 1, …, 2^M - 1, and l is a
specific market. Market l is served under selection i if Sel[i, l] = 1
and is not served if Sel[i, l] = 0.

In the following, the basic profit formula is illustrated for a single market i:

Y_i = (p_i - μ) * q_i - ϕ_i - ϕ

  • Y_i –> profit in this market
  • p_i –> price per product
  • μ –> marginal cost
  • q_i –> number of customers in this market
  • ϕ_i –> fixed costs in this specific market
  • ϕ –> production fixed cost (2000$)

We use the following sum formula:

Y_MC = Σ_{i ∈ MC} [ (p_MC - μ) * q_i - ϕ_i ] - ϕ     (1)
In order to evaluate all possible market constellations MC, 2^M combinations arise, as every market can either be included or excluded and each constellation has to be summed up separately. In the formula above, the price p_MC corresponds to the lowest price among the included markets, so that all customers of the included markets are indeed willing to buy the product. This process can be made more efficient by utilizing the following three observations.

  1. Only markets corresponding to positive terms of the total sum Eq. (1) will
    be included.
  2. In every market the profit decreases monotonically with decreasing price.
  3. Therefore, once an individual market profit becomes non-positive, Y_i ≤ 0,
    it will remain so and its computation can be omitted.

Since the markets have been ordered by decreasing price level, only the lower
triangular part of the price-by-market matrix is computed: for a given candidate
price, only those markets need to be calculated whose customers are willing to pay
that price. This creates at most M^2/2 individual market evaluations in the
worst-case scenario. Within a market (a column of the matrix), the profit is now
calculated with the price as the only variable, so the values in each column are
monotonically decreasing. The algorithm terminates the calculation for a certain
market as soon as that market's internal profit is less than or equal to zero.
This creates a nearly banded matrix, as there is no longer any need to calculate
every single market constellation. Finally, the rows are summed and the maximum of
the resulting profit column is selected. This yields a much more efficient
computing scheme which scales polynomially instead of exponentially:
O(M^2) << O(2^M), where the multiplying constant of the M^2 term can be much less
than 1/2.
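The following R sketch implements this structured search (my own code, not Shy's; the marginal cost μ = 5 and the production fixed cost ϕ = 2000 are the values used in the examples below):

    # Structured O(M^2) search; markets must be passed in order of decreasing price
    structured_search <- function(price, q, fc, mu = 5, FC = 2000) {
      M      <- length(price)
      active <- rep(TRUE, M)             # markets whose individual profit is still positive
      best   <- list(profit = -Inf)
      for (l in seq_len(M)) {            # candidate price = price of market l (row of the matrix)
        idx <- which(active[seq_len(l)]) # only markets 1..l can afford this price
        y   <- (price[l] - mu) * q[idx] - fc[idx]   # individual market profits
        active[idx[y <= 0]] <- FALSE     # observation 3: never compute these again
        total <- sum(y[y > 0]) - FC      # observation 1: only positive terms contribute
        if (total > best$profit)
          best <- list(profit = total, price = price[l], markets = idx[y > 0])
      }
      best
    }

    # Shy's three-market example from the next section: markets 1 and 2 at price 20, profit 4000
    structured_search(price = c(30, 20, 10), q = c(200, 600, 200), fc = c(1000, 5000, 0))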

Small example proposed by Shy

As an introductory example, the calculation from Oz Shy, Chapter 3.2, page 72, is
displayed here.

  • Markets: M1 M2 M3
  • Prices: 30$ 20$ 10$
  • Quantities: 200 600 200
  • Fixed costs: 1000$ 5000$ 0$
  • production fixed costs: 2000$

Here is the calculation for a single market at a single price, namely market M1 at
a price of 30$:

Y_M1 = (30$ - 5$) * 200 - 1000$ = 4000$

This calculation for each market and for each price is shown in the table below. In
the resulting-profit column the production fixed cost of 2000$ has already been
subtracted, so it does not need to be removed from each market individually.
(Prices are in $.)

Price    M1     M2    M3   Profit   Resulting profit
   30  4000      0     0     4000       2000
   20  2000   4000     0     6000       4000
   10     0  -2000  1000    -1000       1000

Example with 15 markets

  • Prices: 150 140 130 120 110 100 90 80 70 60 50 40 30 20 10
  • Quantities: 60 10 10 60 10 60 60 10 60 60 10 10 60 60 60
  • Fixed costs: 0 800 500 600 800 200 100 200 500 800 400 600 800 300 100

Exhaustive search proposed by Shy

After searching through all 2^15 - 1 = 32767 possible market combinations, the
maximum profit was found to be 17950 by including markets 1,2,3,4,5,6,7. This
took 16.25 secs.
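For comparison, a brute-force enumeration of Eq. (1) over all non-empty selections can be sketched as follows (again my own code; μ = 5 and ϕ = 2000 assumed as above):

    # Exhaustive search over all 2^M - 1 non-empty market selections
    exhaustive_search <- function(price, q, fc, mu = 5, FC = 2000) {
      M    <- length(price)
      best <- list(profit = -Inf)
      for (i in 1:(2^M - 1)) {
        sel <- which(bitwAnd(i, 2^(0:(M - 1))) > 0)   # markets served under selection i
        p   <- min(price[sel])                        # lowest price among the included markets
        y   <- sum((p - mu) * q[sel] - fc[sel]) - FC  # Eq. (1)
        if (y > best$profit) best <- list(profit = y, price = p, markets = sel)
      }
      best
    }

    # The 15-market example above: maximum profit 17950 by serving markets 1-7 at price 90
    price <- seq(150, 10, by = -10)
    q     <- c(60, 10, 10, 60, 10, 60, 60, 10, 60, 60, 10, 10, 60, 60, 60)
    fc    <- c(0, 800, 500, 600, 800, 200, 100, 200, 500, 800, 400, 600, 800, 300, 100)
    exhaustive_search(price, q, fc)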

Structured search proposed in this note

After evaluating only 116 individual market-price combinations, the maximum profit was
found to be 17950 by including markets 1,2,3,4,5,6,7. This took essentially 0 secs.
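Running the structured sketch from above on the same 15-market data (with the vectors price, q and fc from the previous code block) returns the identical solution almost instantly:

    structured_search(price, q, fc)   # profit 17950, markets 1-7, price 90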

The search matrix is displayed here:

Price    M1    M2    M3    M4    M5    M6    M7    M8    M9   M10   M11   M12   M13   M14   M15  Profit
  150  8700     .     .     .     .     .     .     .     .     .     .     .     .     .     .    6700
  140  8100   550     .     .     .     .     .     .     .     .     .     .     .     .     .    6650
  130  7500   450   750     .     .     .     .     .     .     .     .     .     .     .     .    6700
  120  6900   350   650  6300     .     .     .     .     .     .     .     .     .     .     .   12200
  110  6300   250   550  5700   250     .     .     .     .     .     .     .     .     .     .   11050
  100  5700   150   450  5100   150  5500     .     .     .     .     .     .     .     .     .   15050
   90  5100    50   350  4500    50  4900  5000     .     .     .     .     .     .     .     .   17950
   80  4500   -50   250  3900   -50  4300  4400   550     .     .     .     .     .     .     .   15800
   70  3900     .   150  3300     .  3700  3800   450  3400     .     .     .     .     .     .   16700
   60  3300     .    50  2700     .  3100  3200   350  2800  2500     .     .     .     .     .   16000
   50  2700     .   -50  2100     .  2500  2600   250  2200  1900    50     .     .     .     .   12250
   40  2100     .     .  1500     .  1900  2000   150  1600  1300   -50  -250     .     .     .    8250
   30  1500     .     .   900     .  1300  1400    50  1000   700     .     .   700     .     .    5550
   20   900     .     .   300     .   700   800   -50   400   100     .     .   100   600     .    1850
   10   300     .     .  -300     .   100   200     .  -200  -500     .     .  -500     0   200   -2700

(A dot marks an entry that was never computed: either the market's customers cannot afford that price, or the market was dropped after its profit turned non-positive at a higher price.)


Graphing a structural break (chow test) in gretl

While gretl allows you to test for a structural break in your data via the Chow test, it does not easily visualize the resulting two regression lines.

The following steps enable you to create a piecewise linear graph of the two regions defined in your test. As an example we load the sample data data3-7 (Ramanathan) into gretl. Graphing cost against miles suggests that the region to the left of miles=60 is different from the region to the right. So let us define the following two new variables.

[screenshot]

The indicator (binary) variable:

[screenshot]

and an interaction term which allows gretl to estimate different slopes for the two regions.

[screenshot]

At this point we should have 5 variables in our dataset (see the screenshot below). The Chow test is nothing more than (i) fitting an OLS model with three regressors, namely miles plus our two newly defined variables, and (ii) performing an F-test on the joint significance of the two extra coefficients. For the graph we will need to execute only step (i):

[screenshots]

One very useful feature of gretl is its ability to save output from the model as new variables; in particular we will save the fitted values as the new variable yhat2:

[screenshots]

We are almost there. Let us graph both the original cost data as well as the fitted values against the x-variable miles:

[screenshots]

And this is our scatterplot where the blue dots represent the two regression lines.

[screenshot]

I find it more pleasing to have a blue line instead of the points, so I change this option in the plot controls:

[screenshot]

And this is the final graph:

[screenshot]

And the command log for this session is simply:

[screenshot]

As a postscript, I probably should have shown you how to run and interpret the actual Chow test in the first place. Here is the sequence of steps:
We first fit a simple OLS model of cost against just miles. (Unless asked, I am omitting the instructions on how to do this.)
Next we find the Chow test in the Tests pulldown menu:

[screenshot]

Specify the indicator variable for the structural break regions:

[screenshot]

Let us inspect the output summary for this test:
[screenshot]
As I mentioned above, the actual fit is just the OLS model where the original variable miles is augmented by the dummy/indicator as well as the interaction term.
Most important, though, is the reported F-test on the joint significance of the two extra coefficients, which in this case is highly significant.
The individual t-ratios and p-values would suggest otherwise.

Could we run the same F-test manually in gretl without the predefined Chow test? Yes, we can, and here is how:
[Screenshots: the omit-variables test menu and its output]
Reassuring that the values from this explicit F-test are identical to the Chow test summary.
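For readers without gretl at hand, the same construction can be sketched in R (a minimal sketch of my own, assuming a data frame d that holds the data3-7 variables cost and miles; the break point at miles = 60 is the one used above):

    # Piecewise fit and the Chow-type F-test as an augmented OLS regression
    d$dum   <- as.numeric(d$miles > 60)    # indicator for the right-hand region (assumed definition)
    d$inter <- d$dum * d$miles             # interaction term: allows a different slope

    m0 <- lm(cost ~ miles, data = d)                # restricted model
    m1 <- lm(cost ~ miles + dum + inter, data = d)  # unrestricted (piecewise) model
    anova(m0, m1)                                   # F-test on the two extra coefficients

    plot(d$miles, d$cost)                           # data plus the two fitted segments
    lines(sort(d$miles), fitted(m1)[order(d$miles)], col = "blue")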


Regression to the mean

  • Intro
    A person who scored 750 out of a possible 800 on the quantitative portion of the SAT takes the SAT again (a different form of the test is used). Assuming the second test is of the same difficulty as the first and that there was no learning or practice effect, what score would you expect the person to get on the second test? The surprising answer is that the person is more likely to score below 750 than above 750; the best guess is that the person would score about 725. If this surprises you, you are not alone. This phenomenon, called regression to the mean, is counterintuitive and confusing to many professionals as well as students.
  • Case Study
    Consider an actual study that received considerable media attention. This study sought to determine whether a drug that reduces anxiety could raise SAT scores by reducing test anxiety. A group of students whose SAT scores were surprisingly low (given their grades) was chosen to be in the experiment. These students, who presumably scored lower than expected on the SAT because of test anxiety, were administered the anti-anxiety drug before taking the SAT for the second time. The results supported the hypothesis that the drug could improve SAT scores by lowering anxiety: the SAT scores were higher the second time than the first time. Since SAT scores normally go up from the first to the second administration of the test, the researchers compared the improvement of the students in the experiment with nationwide data on how much students usually improve. The students in the experiment improved significantly more than the average improvement nationwide. The problem with this study is that by choosing students who scored lower than expected on the SAT, the researchers inadvertently chose students whose scores on the SAT were lower than their “true” scores. The increase on the retest could just as easily be due to regression toward the mean as to the effect of the drug. The degree to which there is regression toward the mean depends on the relative role of skill and luck in the test.
  • Distributional Dependence
    Recall the main argument from above: the top decile of the first-year scores
    contains MORE students who came into it from below than from above (through
    noise), hence chances are high that in the next year they will, on average,
    score lower. That argument holds true only for unimodal, bell-shaped
    distributions, where there are more students close to the mean than at the
    extremes of the distribution. For a uniform distribution one might expect the
    probabilities of students moving between deciles to cancel, i.e. as many top
    students are lost to a lower decile as lower-performing students are bumped up.
    Empirical insights along these lines are given by Figures 1 and 2.

Figure 1: Scores for a hypothetical example of successive testing of 4th and 5th graders for Gaussian distributions of true ability with additive (i) Gaussian and (ii) uniform noise. The observed scores in 4th grade are assigned to their empirical deciles.

I have an R package that allows you to explore your own favorite distributions and produce graphs analogous to the above. If I do not get around to posting it to CRAN in the next few weeks, feel free to ask me directly for it.
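Until then, here is a minimal sketch of the successive-testing simulation described above (the parameters are my own choices, not those of the package):

    # Gaussian true ability, additive noise, two independent test administrations
    set.seed(42)
    n       <- 10000
    ability <- rnorm(n)                       # true ability
    score4  <- ability + rnorm(n, sd = 0.5)   # observed 4th-grade score; swap in runif(n, -1, 1) for the uniform-noise variant
    score5  <- ability + rnorm(n, sd = 0.5)   # fresh noise for the 5th-grade test

    top <- score4 >= quantile(score4, 0.9)    # top decile of the observed 4th-grade scores
    c(mean_4th = mean(score4[top]), mean_5th = mean(score5[top]))   # the 5th-grade mean is noticeably lower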


Latex in wordpress

The one thing that was holding me back in becoming a prolific blogger: LaTeX math had to come in via file upload… until now.

The codecogs equation editor has an HTML integration scheme that will let you import a SWF image by entering LaTeX code directly in the URL. If you like this, please donate to CodeCogs…

http://latex.codecogs.com/swf.latex?\sum_{i=1}^{N}{\alpha_i}


JSM 2012, RgoogleMaps

This is the paper I submitted to JSM 2012.


Why most published research findings are false

This post is intended to clarify some of the notation and results in the intriguing article “Why most published research findings are false”, written by John P. A. Ioannidis and published in the open-access journal PLOS Medicine. It was not entirely obvious to me how the truth table (Table 1) in the paper was derived, so here is my attempt at saving you the time and work of rederiving it. The paper states:

Consider a 2 × 2 table in which research findings are compared against the gold standard of true relationships in a scientific field. In a research field both true and false hypotheses can be made about the presence of relationships. Let R be the ratio of the number of “true relationships” to “no relationships” among those tested in the field. R is characteristic of the field and can vary a lot depending on whether the field targets highly likely relationships or searches for only one or a few true relationships among thousands and millions of hypotheses that may be postulated. Let us also consider, for computational simplicity, circumscribed fields where either there is only one true relationship (among many that can be hypothesized) or the power is similar to find any of the several existing true relationships. The pre-study probability of a relationship being true is R/(R + 1). The probability of a study finding a true relationship reflects the power 1 – β (one minus the Type II error rate). The probability of claiming a relationship when none truly exists reflects the Type I error rate, α. Assuming that c relationships are being probed in the field, the expected values of the 2 × 2 table are given in Table 1. After a research finding has been claimed based on achieving formal statistical significance, the post-study probability that it is true is the positive predictive value, PPV. The PPV is also the complementary probability of what Wacholder et al. have called the false positive report probability [10]. According to the 2 × 2 table, one gets PPV = (1 – β)R/(R – βR + α). A research finding is thus more likely true than false if (1 – β)R > α. Since usually the vast majority of investigators depend on α = 0.05, this means that a research finding is more likely true than false if (1 – β)R > 0.05.

Table 1. Research Findings and True Relationships from journal.pmed.0020124

Given the two types of errors, the Type I error rate α and the Type II error rate β, the following truth table is the easiest to write down (each column is conditional on whether the relationship is true):

                           True relationship    No relationship
Claim a relationship            1 - β                  α
Claim no relationship             β                  1 - α

Note that the columns are conditional probabilities, so the total probability does not add up to 1. To change the conditional table above into an absolute or unconditional table, we need to multiply the columns by their respective marginals, i.e. column 1 by N_T, the number of actually true relationships, and column 2 by N_F, the number of tested relationships that were false. We then obtain the following truth table:

                           True relationship    No relationship
Claim a relationship          (1 - β) N_T             α N_F
Claim no relationship            β N_T             (1 - α) N_F

Let us define R = N_T/N_F as the ratio of true to no relationships. Simple algebra then yields R + 1 = c/N_F and R/(R+1) = N_T/c. If we now replace N_T by c \cdot R/(R+1) and N_F by c/(R+1), we end up with the truth table as written in the paper:

                           True relationship     No relationship          Total
Claim a relationship       c(1 - β)R/(R + 1)       cα/(R + 1)       c(R + α - βR)/(R + 1)
Claim no relationship        cβR/(R + 1)         c(1 - α)/(R + 1)   c(1 - α + βR)/(R + 1)
Total                         cR/(R + 1)            c/(R + 1)                c

One of the main findings/surprises from the paper is the low positive predictive value PPV, even for reasonably high power and low Type I error probability α. This is yet another instance where the well-known inequalities P(A|B) \neq P(B|A) and P(A|B) \neq 1 - P(A|\bar{B}) play against intuition, especially in the context of massive testing and low prior probabilities. The second inequality here implies PPV = P(relationship truly exists | claiming a relationship) ≠ 1 - P(claiming a relationship | none truly exists) = 0.95, and in particular PPV can be MUCH smaller. To actually compute PPV = P(A|B) = P(relationship truly exists | claiming a relationship) we need to apply Bayes' Theorem:
P(A|B) = \frac{P(A)}{P(B)} P(B|A)
Let us compute one item at a time:

P(A) = P(relationship truly exists) = N_T/c = R/(R+1)

P(B|A) = P(claiming a relationship | relationship truly exists) = 1 - β

P(B) = P(claiming a relationship) = P(B|A) P(A) + P(B|\bar{A}) P(\bar{A}) = [(1 - β)R + α]/(R + 1)

Putting the pieces together: PPV = P(A|B) = (1 - β)R/(R - βR + α), exactly the expression quoted from the paper.

So the posterior probability depends on all three parameters: α, β and R.
For example, with α = 0.05, β = 0.1 and R = 0.02 the probability that a “discovered” association is actually true is only about 26% (PPV ≈ 0.26).
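A one-line R function makes it easy to explore this dependence (a small sketch; the curve merely mimics the flavor of Figure 1, without the bias term):

    # PPV as a function of alpha, beta and the pre-study odds R (no bias, u = 0)
    ppv <- function(alpha, beta, R) (1 - beta) * R / (R - beta * R + alpha)

    ppv(alpha = 0.05, beta = 0.1, R = 0.02)   # ~0.26, the example above

    # PPV against the pre-study odds R for a study with 80% power
    curve(ppv(0.05, 0.2, x), from = 0.001, to = 1, xlab = "R", ylab = "PPV")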
The exact dependence of PPV on these parameters is depicted in Figure 1 from the paper, which we reproduce here without the bias (i.e. u=0):

Note that these findings are not that different from the canonical disease diagnostic case as explained here.
