Power calculation for difference in proportions


I almost posted the following on stackexchange, but as so often before, the process of writing down my question led to its natural resolution!
Dear all, I am trying to understand the origin of the actual formula used to compute the power of the power.prop.test() function in R, which is defined in lines 12-14 of the source:

 tside <- switch(alternative, one.sided = 1, two.sided = 2)
 p.body <- quote(pnorm(((sqrt(n) * abs(p1 - p2) - (qnorm(sig.level/tside, 
        lower.tail = FALSE) * sqrt((p1 + p2) * (1 - (p1 + p2)/2))))/sqrt(p1 * 
        (1 - p1) + p2 * (1 - p2)))))

I took the liberty of rewriting the expression inside the pnorm() function in readable notation: \[
\frac{ \sqrt{n} \cdot |p_1 - p_2| - z_{1-\alpha/k}
\cdot \sqrt{(p_1 + p_2) \cdot (1 - (p_1 + p_2)/2)}}{\sqrt{p_1 \cdot (1 - p_1) + p_2 \cdot (1 - p_2)}}
\]
where k (= tside) is either 1 or 2 depending on the alternative argument. I do not quite understand the second term in the numerator; there is no reference to it in the man page. In particular, if I interpret it as a reference value, I am surprised to see it depend on \(p_2\)!
In addition, this power function does not agree with my naive version below. Let us fix \(p_1=0.04, p_2=0.05\); power.prop.test(p1=0.04, p2=0.05, power=0.95) tells us that we need a sample size of 11166 in each group to achieve a power of 0.95. If I were to test the hypothesis \(H_0: p_1=p_2\), my critical value would simply be tr = qnorm(1-0.05/2, mean=0, sd=sr), which does NOT depend on \(p_2\). Here \(sr=\sqrt{p_1 \cdot (1 - p_1) + p_1 \cdot (1 - p_1)}\) is the sample standard deviation of the difference in proportions (under the assumption that \(H_0\) holds).
This is where I realized my fallacy: I was about to substitute \(p_1=0.04\) in the expression for sr. But that would amount to testing the null hypothesis \(H_0: p_1=p_2=0.04\), which is a very different conjecture! Instead, we simply estimate the common \(p\) under \(H_0\) by the average \((p_1 + p_2)/2\), and that is exactly how the second term in the numerator of the formula above arises.
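To double-check, here is a minimal R sketch (using the numbers from the example above) that reproduces the power.prop.test() result from the formula, with the common p estimated by the pooled average:

    # Verify the formula by hand for p1 = 0.04, p2 = 0.05 (two-sided, alpha = 0.05)
    p1 <- 0.04; p2 <- 0.05; alpha <- 0.05
    n  <- 11166                                   # per-group sample size reported by power.prop.test

    pbar <- (p1 + p2) / 2                         # pooled p under H0: p1 = p2
    sd0  <- sqrt(2 * pbar * (1 - pbar))           # identical to sqrt((p1 + p2) * (1 - (p1 + p2)/2))
    sd1  <- sqrt(p1 * (1 - p1) + p2 * (1 - p2))   # sd under the alternative
    z    <- qnorm(1 - alpha / 2)

    pnorm((sqrt(n) * abs(p1 - p2) - z * sd0) / sd1)   # ~0.95
    power.prop.test(p1 = p1, p2 = p2, power = 0.95)$n # ~11166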



Inverse Regression revisited

Inverse Regression, Regression to the Mean and stochastic dependent variables

I am trying to reconcile the following observations/viewpoints:

  • Inverse Regression: It is well known that the slope of the inverse regression (of x on y) is not simply the reciprocal 1/β of the forward slope β; instead the two slopes satisfy
    \[ \beta_{yx} \cdot \beta_{xy} = R^2 \]
  • Regression to the Mean: The relationship from above is intimately connected with the phenomenon of Regression to the Mean, which can be stated in many ways:
    1. \(\hat{y} = \rho x\) if both variables are standardized, i.e.
       \[ \frac{\hat{y} - \bar{y}}{s_y} = \rho \cdot \frac{x - \bar{x}}{s_x} \]
    2. Let \(X_1\), \(X_2\) be random variables with identical marginal distributions with mean \(\mu\). In this formalization, the bivariate distribution of \(X_1\) and \(X_2\) is said to exhibit regression toward the mean if, for every number \(c > \mu\), we have
       \[ \mu \le E(X_2 \mid X_1 = c) < c \]
    3. Regression toward the mean simply says that, following an extreme random event, the next random event is likely to be less extreme.
    • Stochastic Regressors: Note the identical-marginal-distributions condition from above. This would imply that the independent variable x “needs” to be a (normally distributed) random variable as well! But regression can be (and often is) defined for non-stochastic, fixed exogenous x; in fact the Auer book explicitly makes that assumption (Annahme C1). And yet, the mathematics of the slope computation does not change whether x is stochastic or not; the small simulation below illustrates this. Read on: https://dl.dropboxusercontent.com/u/8089659/stuff/InvRegression.html
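As a sanity check, here is a minimal R simulation (my own sketch, not taken from the linked note) of the slope identity: the product of the forward and inverse regression slopes equals R², whether x is drawn at random or laid out as a fixed design.

    # Slope of y on x times slope of x on y equals R^2
    set.seed(1)
    n <- 1000
    x <- rnorm(n)                  # stochastic regressor; a fixed grid such as seq(-3, 3, length.out = n) behaves the same
    y <- 2 * x + rnorm(n, sd = 3)

    b_yx <- coef(lm(y ~ x))[2]     # forward regression slope
    b_xy <- coef(lm(x ~ y))[2]     # inverse regression slope
    b_yx * b_xy                    # product of the two slopes ...
    cor(x, y)^2                    # ... agrees with R^2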

Type-I error in bandits


Aggregating multiple markets in microeconomics

The following technical note addresses the algorithm proposed in Chapter 3.2 of the
Oz Shy book “How to Price”. A common thread throughout the book is the
suggestion of pseudo-code algorithms designed to help the reader implement the
theory into working code. Unfortunately, this worthy idea is frequently less valuable
than it could be, as little or no effort has been invested in pursuing a minimal
standard of algorithmic efficiency. We now point out the shortcomings of Algorithm
3.3 (in particular its exponential complexity) and suggest an alternative that
scales polynomially in the number of markets. Let us summarize the section
first:

A computer algorithm for M markets

We now proceed to describe a computer algorithm for selecting
the most profitable price when there are M markets, and where
each market is composed of homogeneous consumers with identical
maximum willingness to pay, as was already illustrated in Figure 3.3
for the example of M = 3 markets. The user must run Algorithm
3.2 before running Algorithm 3.3 for the purpose of generating and
storing the 2^M possible selections of markets that can be served.
Recalling Algorithm 3.2, the results are stored by the array of arrays
of binary variables Sel[i,l] of dimension 2^M × M, where i is the index
of a specific selection of markets, i = 0, 1, …, 2^M - 1, and l is a
specific market. Market l is served under selection i if Sel[i, l] = 1
and is not served if Sel[i, l] = 0.

In the following, the basic profit formula is illustrated for a single market i:

Y_i = (p_i - μ) * q_i - ϕ_i - ϕ

  • Y_i –> profit in this market
  • p_i –> price per product
  • μ –> marginal cost
  • q_i –> number of customers in this market
  • ϕ_i –> fixed costs in this specific market
  • ϕ –> production fixed cost (2000$)

We use the following sum formula:

Y_MC = Σ_{i ∈ MC} [ (p_MC - μ) * q_i - ϕ_i ] - ϕ     (1)
In order to evaluate all possible market constellations MC, 2^M combinations arise, as every market can either be included or excluded and each constellation has to be summed up separately. In the formula above, the price p_MC corresponds to the lowest price among the included markets, so that all customers of the included markets are indeed willing to buy the product. This process can be made more efficient by utilizing the following three observations.

  1. Only markets corresponding to positive terms of the total sum Eq. (1) will
    be included.
  2. In every market the profit decreases monotonically with decreasing price.
  3. Therefore, once an individual market profit becomes non-positive, Y_i ≤ 0,
    it will remain so and its computation can be omitted.

Since the markets have been ordered by decreasing price level, only the lower
triangular part of the price-by-market matrix is computed: for a given candidate
price, only those markets need to be calculated whose customers are willing to pay
that price. This creates at most M^2/2 individual market evaluations in the
worst-case scenario. Within a market (a column of the matrix), the profit is now
calculated with the price as the only variable, so the values in each column are
monotonically decreasing. The algorithm terminates the calculation for a certain
market as soon as that market's internal profit is less than or equal to zero.
This creates a nearly banded matrix, as there is no longer any need to calculate
every single market constellation. Finally, the rows are summed and the maximum of
the resulting profit column is selected. This yields a much more efficient
computing scheme which scales polynomially instead of exponentially:
O(M^2) << O(2^M), where the multiplying constant of the M^2 term can be much less
than 1/2.
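The following R sketch implements this structured search (my own code, not Shy's; the marginal cost μ = 5 and the production fixed cost ϕ = 2000 are the values used in the examples below):

    # Structured O(M^2) search; markets must be passed in order of decreasing price
    structured_search <- function(price, q, fc, mu = 5, FC = 2000) {
      M      <- length(price)
      active <- rep(TRUE, M)             # markets whose individual profit is still positive
      best   <- list(profit = -Inf)
      for (l in seq_len(M)) {            # candidate price = price of market l (row of the matrix)
        idx <- which(active[seq_len(l)]) # only markets 1..l can afford this price
        y   <- (price[l] - mu) * q[idx] - fc[idx]   # individual market profits
        active[idx[y <= 0]] <- FALSE     # observation 3: never compute these again
        total <- sum(y[y > 0]) - FC      # observation 1: only positive terms contribute
        if (total > best$profit)
          best <- list(profit = total, price = price[l], markets = idx[y > 0])
      }
      best
    }

    # Shy's three-market example from the next section: markets 1 and 2 at price 20, profit 4000
    structured_search(price = c(30, 20, 10), q = c(200, 600, 200), fc = c(1000, 5000, 0))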

Small example proposed by Shy

As an introductory example, the calculation from Oz Shy, Chapter 3.2, page 72, is
displayed here.

  • Markets: M1 M2 M3
  • Prices: 30$ 20$ 10$
  • Quantities: 200 600 200
  • Fixed costs: 1000$ 5000$ 0$
  • production fixed costs: 2000$

Here is the calculation for a single market at a single price, namely market M1 at
a price of 30$:

Y_M1 = (30$ - 5$) * 200 - 1000$ = 4000$

This calculation for each market and for each price is shown in the table below. In
the resulting-profit column the production fixed cost of 2000$ has already been
subtracted, so it does not need to be removed from each market individually.
(Prices are in $.)

Price    M1     M2    M3   Profit   Resulting profit
   30  4000      0     0     4000       2000
   20  2000   4000     0     6000       4000
   10     0  -2000  1000    -1000       1000

Example with 15 markets

  • Prices: 150 140 130 120 110 100 90 80 70 60 50 40 30 20 10
  • Quantities: 60 10 10 60 10 60 60 10 60 60 10 10 60 60 60
  • Fixed costs: 0 800 500 600 800 200 100 200 500 800 400 600 800 300 100

Exhaustive search proposed by Shy

After searching through all 2^15 - 1 = 32767 possible market combinations, the
maximum profit was found to be 17950 by including markets 1,2,3,4,5,6,7. This
took 16.25 secs.
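For comparison, a brute-force enumeration of Eq. (1) over all non-empty selections can be sketched as follows (again my own code; μ = 5 and ϕ = 2000 assumed as above):

    # Exhaustive search over all 2^M - 1 non-empty market selections
    exhaustive_search <- function(price, q, fc, mu = 5, FC = 2000) {
      M    <- length(price)
      best <- list(profit = -Inf)
      for (i in 1:(2^M - 1)) {
        sel <- which(bitwAnd(i, 2^(0:(M - 1))) > 0)   # markets served under selection i
        p   <- min(price[sel])                        # lowest price among the included markets
        y   <- sum((p - mu) * q[sel] - fc[sel]) - FC  # Eq. (1)
        if (y > best$profit) best <- list(profit = y, price = p, markets = sel)
      }
      best
    }

    # The 15-market example above: maximum profit 17950 by serving markets 1-7 at price 90
    price <- seq(150, 10, by = -10)
    q     <- c(60, 10, 10, 60, 10, 60, 60, 10, 60, 60, 10, 10, 60, 60, 60)
    fc    <- c(0, 800, 500, 600, 800, 200, 100, 200, 500, 800, 400, 600, 800, 300, 100)
    exhaustive_search(price, q, fc)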

Structured search proposed in this note

After evaluating only 116 individual market-price combinations, the maximum profit was
found to be 17950 by including markets 1,2,3,4,5,6,7. This took essentially 0 secs.
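Running the structured sketch from above on the same 15-market data (with the vectors price, q and fc from the previous code block) returns the identical solution almost instantly:

    structured_search(price, q, fc)   # profit 17950, markets 1-7, price 90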

The search matrix is displayed here:

Price    M1    M2    M3    M4    M5    M6    M7    M8    M9   M10   M11   M12   M13   M14   M15  Profit
  150  8700     .     .     .     .     .     .     .     .     .     .     .     .     .     .    6700
  140  8100   550     .     .     .     .     .     .     .     .     .     .     .     .     .    6650
  130  7500   450   750     .     .     .     .     .     .     .     .     .     .     .     .    6700
  120  6900   350   650  6300     .     .     .     .     .     .     .     .     .     .     .   12200
  110  6300   250   550  5700   250     .     .     .     .     .     .     .     .     .     .   11050
  100  5700   150   450  5100   150  5500     .     .     .     .     .     .     .     .     .   15050
   90  5100    50   350  4500    50  4900  5000     .     .     .     .     .     .     .     .   17950
   80  4500   -50   250  3900   -50  4300  4400   550     .     .     .     .     .     .     .   15800
   70  3900     .   150  3300     .  3700  3800   450  3400     .     .     .     .     .     .   16700
   60  3300     .    50  2700     .  3100  3200   350  2800  2500     .     .     .     .     .   16000
   50  2700     .   -50  2100     .  2500  2600   250  2200  1900    50     .     .     .     .   12250
   40  2100     .     .  1500     .  1900  2000   150  1600  1300   -50  -250     .     .     .    8250
   30  1500     .     .   900     .  1300  1400    50  1000   700     .     .   700     .     .    5550
   20   900     .     .   300     .   700   800   -50   400   100     .     .   100   600     .    1850
   10   300     .     .  -300     .   100   200     .  -200  -500     .     .  -500     0   200   -2700

(A dot marks an entry that was never computed: either the market's customers cannot afford that price, or the market was dropped after its profit turned non-positive at a higher price.)


Graphing a structural break (chow test) in gretl

While gretl allows you to test for a structural break in your data via the Chow test, it does not easily visualize the resulting two regression lines.

The following steps enable you to create a piecewise linear graph of the two regions defined in your test. As an example we load the sample data data3-7 (Ramanathan) into gretl. Graphing cost against miles suggests that the region to the left of miles=60 is different from the region to the right. So let us define the following two new variables.

[screenshot]

The indicator (binary) variable:

[screenshot]

and an interaction term which allows gretl to estimate different slopes for the two regions.

[screenshot]

At this point we should have 5 variables in our dataset (see the screenshot below). The Chow test is nothing more than (i) fitting an OLS model with three regressors, namely miles plus our two newly defined variables, and (ii) performing an F-test on the joint significance of the two extra coefficients. For the graph we will need to execute only step (i):

[screenshots]

One very useful feature of gretl is its ability to save output from the model as new variables; in particular we will save the fitted values as the new variable yhat2:

[screenshots]

We are almost there. Let us graph both the original cost data as well as the fitted values against the x-variable miles:

[screenshots]

And this is our scatterplot where the blue dots represent the two regression lines.

[screenshot]

I find it more pleasing to have a blue line instead of the points, so I change this option in the plot controls:

[screenshot]

And this is the final graph:

[screenshot]

And the command log for this session is simply:

[screenshot]

As a postscript, I probably should have shown you how to run and interpret the actual Chow test in the first place. Here is the sequence of steps:
We first fit a simple OLS model of cost against just miles. (Unless asked, I am omitting the instructions on how to do this.)
Next we find the Chow test in the Tests pulldown menu:

[screenshot]

Specify the indicator variable for the structural break regions:

[screenshot]

Let us inspect the output summary for this test:
[screenshot]
As I mentioned above, the actual fit is just the OLS model where the original variable miles is augmented by the dummy/indicator as well as the interaction term.
Most important, though, is the reported F-test on the joint significance of the two extra coefficients, which in this case is highly significant.
The individual t-ratios and p-values would suggest otherwise.

Could we run the same F-test manually in gretl without the predefined Chow test? Yes, we can, and here is how:
[Screenshots: the omit-variables test menu and its output]
Reassuring that the values from this explicit F-test are identical to the Chow test summary.
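For readers without gretl at hand, the same construction can be sketched in R (a minimal sketch of my own, assuming a data frame d that holds the data3-7 variables cost and miles; the break point at miles = 60 is the one used above):

    # Piecewise fit and the Chow-type F-test as an augmented OLS regression
    d$dum   <- as.numeric(d$miles > 60)    # indicator for the right-hand region (assumed definition)
    d$inter <- d$dum * d$miles             # interaction term: allows a different slope

    m0 <- lm(cost ~ miles, data = d)                # restricted model
    m1 <- lm(cost ~ miles + dum + inter, data = d)  # unrestricted (piecewise) model
    anova(m0, m1)                                   # F-test on the two extra coefficients

    plot(d$miles, d$cost)                           # data plus the two fitted segments
    lines(sort(d$miles), fitted(m1)[order(d$miles)], col = "blue")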


Regression to the mean

  • Intro
    A person who scored 750 out of a possible 800 on the quantitative portion of the SAT takes the SAT again (a different form of the test is used). Assuming the second test is of the same difficulty as the first and that there was no learning or practice effect, what score would you expect the person to get on the second test? The surprising answer is that the person is more likely to score below 750 than above 750; the best guess is that the person would score about 725. If this surprises you, you are not alone. This phenomenon, called regression to the mean, is counterintuitive and confusing to many professionals as well as students.
  • Case Study
    Consider an actual study that received considerable media attention. This study sought to determine whether a drug that reduces anxiety could raise SAT scores by reducing test anxiety. A group of students whose SAT scores were surprisingly low (given their grades) was chosen to be in the experiment. These students, who presumably scored lower than expected on the SAT because of test anxiety, were administered the anti-anxiety drug before taking the SAT for the second time. The results supported the hypothesis that the drug could improve SAT scores by lowering anxiety: the SAT scores were higher the second time than the first time. Since SAT scores normally go up from the first to the second administration of the test, the researchers compared the improvement of the students in the experiment with nationwide data on how much students usually improve. The students in the experiment improved significantly more than the average improvement nationwide. The problem with this study is that by choosing students who scored lower than expected on the SAT, the researchers inadvertently chose students whose scores on the SAT were lower than their “true” scores. The increase on the retest could just as easily be due to regression toward the mean as to the effect of the drug. The degree to which there is regression toward the mean depends on the relative role of skill and luck in the test.
  • Distributional Dependence
    Recall the main argument from above: the top decile of the first-year scores
    contains MORE students who came into it from below than from above (through
    noise), hence chances are high that in the next year they will, on average,
    score lower. That argument holds true only for unimodal, bell-shaped
    distributions, where there are more students close to the mean than at the
    extremes of the distribution. For a uniform distribution one might expect the
    probabilities of students moving between deciles to cancel, i.e. as many top
    students are lost to a lower decile as lower-performing students are bumped up.
    Empirical insights along these lines are given by Figures 1 and 2.

Figure 1: Scores for a hypothetical example of successive testing of 4th and 5th graders for Gaussian distributions of true ability with additive (i) Gaussian and (ii) uniform noise. The observed scores in 4th grade are assigned to their empirical deciles.

I have an R package that allows you to explore your own favorite distributions and produce graphs analogous to the above. If I do not get around to posting it to CRAN in the next few weeks, feel free to ask me directly for it.
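Until then, here is a minimal sketch of the successive-testing simulation described above (the parameters are my own choices, not those of the package):

    # Gaussian true ability, additive noise, two independent test administrations
    set.seed(42)
    n       <- 10000
    ability <- rnorm(n)                       # true ability
    score4  <- ability + rnorm(n, sd = 0.5)   # observed 4th-grade score; swap in runif(n, -1, 1) for the uniform-noise variant
    score5  <- ability + rnorm(n, sd = 0.5)   # fresh noise for the 5th-grade test

    top <- score4 >= quantile(score4, 0.9)    # top decile of the observed 4th-grade scores
    c(mean_4th = mean(score4[top]), mean_5th = mean(score5[top]))   # the 5th-grade mean is noticeably lower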


Latex in wordpress

The one thing that was holding me back in becoming a prolific blogger: LaTeX math had to come in via file upload… until now.

The codecogs equation editor has an HTML integration scheme that will let you import a SWF image by entering LaTeX code directly in the URL. If you like this, please donate to CodeCogs…

http://latex.codecogs.com/swf.latex?\sum_{i=1}^{N}{\alpha_i}


JSM 2012, RgoogleMaps

This is the paper I submitted to JSM 2012.


Why most published research findings are false

This post is intended to clarify some of the notation and results in the intriguing article “Why most published research findings are false”, written by John P. A. Ioannidis and published in the open-access journal PLOS Medicine. It was not entirely obvious to me how the truth table (Table 1) in the paper was derived, so here is my attempt at saving you the time and work of rederiving it. The paper states:

Consider a 2 × 2 table in which research findings are compared against the gold standard of true relationships in a scientific field. In a research field both true and false hypotheses can be made about the presence of relationships. Let R be the ratio of the number of “true relationships” to “no relationships” among those tested in the field. R is characteristic of the field and can vary a lot depending on whether the field targets highly likely relationships or searches for only one or a few true relationships among thousands and millions of hypotheses that may be postulated. Let us also consider, for computational simplicity, circumscribed fields where either there is only one true relationship (among many that can be hypothesized) or the power is similar to find any of the several existing true relationships. The pre-study probability of a relationship being true is R/(R + 1). The probability of a study finding a true relationship reflects the power 1 – β (one minus the Type II error rate). The probability of claiming a relationship when none truly exists reflects the Type I error rate, α. Assuming that c relationships are being probed in the field, the expected values of the 2 × 2 table are given in Table 1. After a research finding has been claimed based on achieving formal statistical significance, the post-study probability that it is true is the positive predictive value, PPV. The PPV is also the complementary probability of what Wacholder et al. have called the false positive report probability [10]. According to the 2 × 2 table, one gets PPV = (1 – β)R/(R – βR + α). A research finding is thus more likely true than false if (1 – β)R > α. Since usually the vast majority of investigators depend on α = 0.05, this means that a research finding is more likely true than false if (1 – β)R > 0.05.

Table 1. Research Findings and True Relationships from journal.pmed.0020124

Given the two types of errors, the Type I error rate α and the Type II error rate β, the following truth table is the easiest to write down (each column is conditional on whether the relationship is true):

                           True relationship    No relationship
Claim a relationship            1 - β                  α
Claim no relationship             β                  1 - α

Note that the columns are conditional probabilities, so the total probability does not add up to 1. To change the conditional table above into an absolute or unconditional table, we need to multiply the columns by their respective marginals, i.e. column 1 by N_T, the number of actually true relationships, and column 2 by N_F, the number of tested relationships that were false. We then obtain the following truth table:

                           True relationship    No relationship
Claim a relationship          (1 - β) N_T             α N_F
Claim no relationship            β N_T             (1 - α) N_F

Let us define R = N_T/N_F as the ratio of true to no relationships. Simple algebra then yields R + 1 = c/N_F and R/(R+1) = N_T/c. If we now replace N_T by c \cdot R/(R+1) and N_F by c/(R+1), we end up with the truth table as written in the paper:

                           True relationship     No relationship          Total
Claim a relationship       c(1 - β)R/(R + 1)       cα/(R + 1)       c(R + α - βR)/(R + 1)
Claim no relationship        cβR/(R + 1)         c(1 - α)/(R + 1)   c(1 - α + βR)/(R + 1)
Total                         cR/(R + 1)            c/(R + 1)                c

One of the main findings/surprises from the paper is the low positive predictive value PPV, even for reasonably high power and low Type I error probability α. This is yet another instance where the well-known inequalities P(A|B) \neq P(B|A) and P(A|B) \neq 1 - P(A|\bar{B}) play against intuition, especially in the context of massive testing and low prior probabilities. The second inequality here implies PPV = P(relationship truly exists | claiming a relationship) ≠ 1 - P(claiming a relationship | none truly exists) = 0.95, and in particular PPV can be MUCH smaller. To actually compute PPV = P(A|B) = P(relationship truly exists | claiming a relationship) we need to apply Bayes' Theorem:
P(A|B) = \frac{P(A)}{P(B)} P(B|A)
Let us compute one item at a time:

P(A) = P(relationship truly exists) = N_T/c = R/(R+1)

P(B|A) = P(claiming a relationship | relationship truly exists) = 1 - β

P(B) = P(claiming a relationship) = P(B|A) P(A) + P(B|\bar{A}) P(\bar{A}) = [(1 - β)R + α]/(R + 1)

Putting the pieces together: PPV = P(A|B) = (1 - β)R/(R - βR + α), exactly the expression quoted from the paper.

So the posterior probability depends on all three parameters: α, β and R.
For example, with α = 0.05, β = 0.1 and R = 0.02 the probability that a “discovered” association is actually true is only about 26% (PPV ≈ 0.26).
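A one-line R function makes it easy to explore this dependence (a small sketch; the curve merely mimics the flavor of Figure 1, without the bias term):

    # PPV as a function of alpha, beta and the pre-study odds R (no bias, u = 0)
    ppv <- function(alpha, beta, R) (1 - beta) * R / (R - beta * R + alpha)

    ppv(alpha = 0.05, beta = 0.1, R = 0.02)   # ~0.26, the example above

    # PPV against the pre-study odds R for a study with 80% power
    curve(ppv(0.05, 0.2, x), from = 0.001, to = 1, xlab = "R", ylab = "PPV")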
The exact dependence of PPV on these parameters is depicted in Figure 1 from the paper, which we reproduce here without the bias (i.e. u=0):

Note that these findings are not that different from the canonical disease diagnostic case as explained here.
