Tuesday, January 5, 2010

Least-Squares



All those years. All this result.

For those of you who still have a memory, undaunted by either age or drugs, there is a simple formula for determining the mean (average) value of a sampled set.



Take a measurement of a thing (X). The average value of X is equal to Mu (μ). Say you have ten things, like beans, and you measure the length of each. One bean is 3 inches long. Another is 4 inches long. The others are 3.2, 4.2, 2.6, 3.7, 3.9, 2.1, 4.9 and 3.3 inches long. μ = the average value of all those lengths, or, the combined value of all those lengths divided evenly among the number of observations made.

If you make a single observation, the mean is equal to the observation. That is to say, if you measure one bean, the average length of that bean is equal to its length. While it would be true that the length of that bean is equal to its average length, the question should be asked, "why do we care?"

And, in fact, we don't.

The length of any one single thing is always equal to its mean (or average) value.

The question then becomes, how do we take a look at a class of things, in this case beans, and determine an expectation of value? This is important if we are designers of bean cans. If we intend to can whole beans it only makes sense to manufacture cans that allow us to put entire beans inside those cans. If we only intended to sell partial beans, can size is less important. Cut beans can pretty much be crammed into any size can we produce.

But let's take a look at the value of beans listed above. What are the chances (ρ, or rho) that the values of bean length are the values of bean length? Since we measured the beans, the chance of the lengths being the same as the actual lengths is 1. This is a mathematical way of saying that the length of the beans is equal to the lengths of the beans. (Do you remember the Identity Property?)

The Sigma (Σ), or sum, of any equation when multiplied by 1 is equal to the sum. (1 is the Rho part. We measured the beans, so the probability that the length we measured is equal to the length we measured is 1. When a thing is a thing, the mathematical expression of this is 1.) So, for our n=10 beans, the sum of the lengths of the beans we counted is 34.9 inches (Sigma of X).

So, a restatement of the mean:

$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$$

says that the sum of all the numbers x, from the first number to the nth (last or final) number, when multiplied by the reciprocal, or inverse, of the number of observations n (our n, again, is equal to ten) gives us the mean, or average, of our observations.

So, we did the sum of the first x (shown with a subscript 1) plus the second x (with subscript 2) until we added all ten numbers together. The sum of x is 34.9, n = 10. The inverse of 10 is one over ten.

Or, 34.9 divided by ten.

Average length? 3.49, or, rounding up, 3.5 inches.
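If you'd rather let a computer do the adding, here is a minimal Python sketch of the same arithmetic. The variable names are made up for illustration; the ten lengths are the ones listed above.

```python
# The ten bean lengths from the post, in inches.
bean_lengths = [3.0, 4.0, 3.2, 4.2, 2.6, 3.7, 3.9, 2.1, 4.9, 3.3]

n = len(bean_lengths)        # number of observations
total = sum(bean_lengths)    # Sigma of x
mean = total / n             # multiply the sum by 1/n

print(n, round(total, 1), round(mean, 2))  # prints: 10 34.9 3.49
```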

How many of our beans were 3.5 inches long? Well, none. But we're not done yet. We're going to do something with our beans. We're going to can them, and to make sure we order cans that will allow us to fit entire beans by length into these cans, we need to order cans that will serve our needs, in most cases.

One of our beans was pretty long. 4.9 inches! If we made our cans to include this monster, our cans would be bigger than necessary for more normal-sized beans! Talk about a waste.

So how do we develop an understanding of what "normal sized beans" means in terms of our demand for cans?

Old guys who do math see this as a problem that can be solved with a question. How normal is the size of this sample of beans? Or, if we look at these beans as being demonstrative of the length of beans, how "normal" are these length values?

Didja ever wonder about the word "normal"? I'm either normal or not. Are you "normal"? And what are the attributes of normalcy that you must adhere to--voluntarily or not--in order for you to claim adherence to some outwardly conceived admission of normalcy?

When we examine beans, we have limited descriptive statements that can be used to determine the limits of what is or isn't a bean.

It's green. It's a longer, rather than fatter, vegetable. It is green. And it has a normal length.

But, what is a bean's normal length?

In our example of testing, we found an average length of bean to be 3 and a half inches long. (Which, if you know anything about beans, depending upon the type of bean you're growing, is a pretty average length!)

But, how "normal" is our average length, in terms of our sample?


This is a graph known as the normal Bell Curve. If you are reading this, I know that you've come across this curve at least once in your life. Mebbe it was when you took a standardized test back in high school and learned that you should end your life pumping gas. As a high-light.

There are variations on the "normally" distributed curves. These are known as "skewed" curves. If you have any idea how curves could be skewed beyond or above the normal mean value, you can skip the rest of this test. (Give yourself a B+.)

One of the more interesting questions that can be asked of the graph above is, what is meant by 1SDV, 2SDV...etc.

SDV in this case refers to Standard Deviation. (I usually use the shorthand sd.) For all you deviates out there, this could be good news. We can measure how close to "normal" your "deviation" may be.

And cooler still, we can take your deviation, or the deviation in the length of beans, and determine, statistically, whether or not you, or the beans, fall within 68 percent of all weird deviancies--or bean length--or not.

How to do this?



First, we compute the mean for the data. We did this. 'Member the number? (3.5)

Then, we compute the deviation by subtracting the mean from each value.

Recall the lengths: one bean is 3 inches long, another is 4 inches long, and the others are 3.2, 4.2, 2.6, 3.7, 3.9, 2.1, 4.9 and 3.3 inches long. So, we get

3 minus 3.5.

4 minus 3.5.

3.2 minus 3.5...etc.

"Standard Deviation" isn't some magic number that only math guys can do. If we have a mean (average) value, then any value that we have in our sample that isn't exactly the same value of that mean (average) value, deviates from that value. Ain't the same, it deviates.

We're going to find out what the deviance is--the difference between each value in our sample and the mean--for each sampled value. And then we're going to "normalize" the difference of these sums. We're not simply going to take the average (mean) of the difference, we're going to take a look at the average value of the difference in terms of the mean.

Huh?

Some of the differences that we came up with were negative. Our average was 3.5. Some of our beans were only 3.2 or 2.6 inches long. What we deal with is that the "positive" or "negative" signs of the differences get erased, because what we're looking for isn't a value that is described as either negative or positive, but as an absolute: how far from the mean, not in which direction.

(For any of you who don't get the idea that negative one times negative one is equal to one, give me a note. It took me two years of asking stupid questions of professors until I found one that took the time to take me past "doing the math" into understanding the math. )

So, we're going to take the differences of each sampled value, less the mean, and come up with a number that we're going to square, in order to remove the negative sign...that is, to come up with an "absolute" value of the difference.

So you get differences like -0.5, 0.5...-0.2. We're going to remove the positive and negative signs by squaring (d², where d is a difference) the differences. We've ten differences. We're going to "square" (multiply each difference by that same difference) each of the individual differences.

So, the first difference is -0.5.

What is -0.5 squared? It's -0.5 x -0.5. Or, 0.25.

The second bean length was 4.00 inches.

The difference is 0.5 inches. 0.5 x 0.5 equals 0.25. (So, the square of the difference comes out the same whether the difference was negative or positive!)

We go ahead and finish off the rest of the differences between our mean value and the actual value of our sample and find that the values of the differences are

-0.3

0.7

-0.9

0.2

0.4

-1.4

1.4

-0.2

So, what are the squares of these values?

We've already done the first two.

0.25 and 0.25.

What are the next eight? I'll do a couple, then you do the rest. Math isn't hard. And statistics is just math.

-0.3 squared? 0.09.

0.7 squared? (0.7 x 0.7) 0.49.

Did you find all the squares of the differences?

Cool. Here's a note.

Most of the kids I TA'd in beginning Stats didn't have the math to do this. College Algebra is a misnomer. It is neither "college" level nor taught to a level that allows children to understand math. But, I digress.

So, what is the sum of our squares? (You're not doing the math. The answer is 6.05.) (Slacker.)
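For the slackers, a short Python sketch of the squaring step. It assumes the rounded mean of 3.5 used above, and the names are hypothetical. (With the unrounded mean of 3.49 the sum comes to about 6.049, which changes nothing once we round at the end.)

```python
# The ten bean lengths from the post, in inches.
bean_lengths = [3.0, 4.0, 3.2, 4.2, 2.6, 3.7, 3.9, 2.1, 4.9, 3.3]
rounded_mean = 3.5  # assumption: the rounded mean the post works with

# Difference between each bean and the mean (value minus mean),
# rounded to one decimal to hide floating-point fuzz.
differences = [round(x - rounded_mean, 1) for x in bean_lengths]

# Squaring removes the sign of each difference.
squares = [round(d * d, 2) for d in differences]

sum_of_squares = round(sum(squares), 2)

print(differences)
print(squares)
print(sum_of_squares)  # prints: 6.05
```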

So, now we divide by n-1, or one less than the sample size. (Our sample size was ten.)

Or, 6.05 divided by 9.

Shall we? 6.05 over nine is 0.672.

We have a number! We have a number!

But, what does this number mean?

Hella good question.


See the graph?

We have one more step to take. Remember that we took the "squares" of the differences? One last step. We're now going to find the square root of the average of our squared differences.

Yep. We transformed our differences into squares, added them up, divided by one less than n, and now we're going to find the square root of the result.
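Written out in one line, the whole recipe looks like this. The notation is an assumption borrowed from the usual textbook form: s stands for the sample standard deviation, μ for the mean computed earlier in the post, and x with a subscript i for each individual bean.

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \mu\right)^2}$$

For our beans, that is the square root of 6.05 divided by 9.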

What is the square root of 0.672? Remember, we're talking about the length of beans.

0.8197560612767679

But, we don't deal with all of those digits. We went in with three digits to the right of the decimal, and our beans were only measured to the tenth of an inch. So we round, and our answer is 0.82.
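The last two steps, as a hedged Python sketch. The sum of squares (6.05) and the sample size (10) come from the text above; the variable names are made up. As a cross-check it also calls `statistics.stdev` from Python's standard library, which works from the unrounded mean and still lands on the same two-decimal answer.

```python
import math
import statistics

# The ten bean lengths from the post, in inches.
bean_lengths = [3.0, 4.0, 3.2, 4.2, 2.6, 3.7, 3.9, 2.1, 4.9, 3.3]

sum_of_squares = 6.05   # from the post, computed with the rounded mean 3.5
n = len(bean_lengths)

# Divide by n - 1 (one less than the sample size), then take the square root.
variance = sum_of_squares / (n - 1)
sd = math.sqrt(variance)
print(round(variance, 3), round(sd, 2))  # prints: 0.672 0.82

# Cross-check with the standard library's sample standard deviation.
library_sd = statistics.stdev(bean_lengths)
print(round(library_sd, 2))  # prints: 0.82
```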

So, we can add or subtract 0.82 inches from our mean to come up with the range of lengths that falls within the first Standard Deviation.

The mean or average length of our beans was 3.49 inches. And we have found that if the lengths of beans are normally distributed, that 68 percent of all beans are within 0.82 inches of 3.49 inches, or at the low end of the range 2.67 inches and at the high end of the range, 4.31 inches.

But "normal" distribution has nothing to do with "normal" beans. All of our beans were within the same field, from the same seeds. Geographic distances were small. Field watering was consistent. But three of our beans, at 2.1 inches, 2.6 inches and 4.9 inches, fell outside of the first standard deviation of "beans" in our normally distributed curve.
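To check which beans land inside that range, a small Python sketch (hypothetical names again; the mean of 3.49 and the sd of 0.82 are the values worked out above):

```python
# The ten bean lengths from the post, in inches.
bean_lengths = [3.0, 4.0, 3.2, 4.2, 2.6, 3.7, 3.9, 2.1, 4.9, 3.3]

mean = 3.49
sd = 0.82

# One standard deviation either side of the mean.
low = mean - sd
high = mean + sd
print(round(low, 2), round(high, 2))  # prints: 2.67 4.31

inside = [x for x in bean_lengths if low <= x <= high]
outside = [x for x in bean_lengths if not (low <= x <= high)]

print(outside)                          # prints: [2.6, 2.1, 4.9]
print(len(inside) / len(bean_lengths))  # prints: 0.7
```

Seven beans out of ten is 70 percent, close to the 68 percent a normal curve would lead us to expect.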

Words that mathematicians and statisticians use have very explicit meaning. Otherwise, we wouldn't know what we were talking about.

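Normal distribution refers to results that pile up symmetrically around the mean and thin out the farther you get from it, which is what unbiased sampling of something like bean length tends to produce. If we found we had biased sampling results, we have a vocabulary to deal with that bias.

But normal in a mathematician's vernacular (a statistician is a mathematician) has to do with those things we would expect to be distributed symmetrically around the mean. As if the differences were random. That there isn't a bias.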

And that is, my friend, the purpose for this post.

If you have been following the "Climaquiddick" or "Climategate" posts occurring around the intertubes, you may have noticed that a great deal of the criticism being brought upon the Global Warmers has been based upon criticisms of their mathematical models.

In my simple explanation of statistics in this post, we looked at the values of lengths of beans at one particular time. We determined an average (mean) length and the expected values for bean length within one standard deviation (sd). We found that if bean length was normally distributed around the mean, the prediction that about 68 percent of the sampled bean lengths would be within one sd of the mean was roughly confirmed: seven of our ten beans were. Three of the bean lengths were found to be outside one sd of the mean.

What happens to a statistical model as the number of beans sampled in a statistical measurement increases? Are there tests to find out if our statistical findings are significant? Can we draw a statistical inference about the reliability of our measurements?

Yep.

There are a lot of things that occur within the study of statistics. There are schools of statistics. Not schools, like George Washington University versus Texas Tech. There are fundamental assumptions about what we use to test for significance, difference and distribution that can vary widely and give us different results as to whether or not our findings are of any interest. That is, there are closely held beliefs within certain schools of thought as to how some information should, or should not, be interpreted.

What I've shared with you today are the fundamentals of statistical analysis that are shared by most schools of statistical inference. There aren't two schools of what the definition of "mean" is. Nor are there two schools--or more--of what defines standard deviation.

But what does this analysis leave us with?

What would happen to our analysis of bean length if I'd only sampled one bean, and its length was 4.9 inches?

What would happen if I came back to this same field the following year and found another bean of 4.9 inches in length, and I decided to report my findings?

A coupla things would have happened. And this is why the discussions being undertaken by serious mathematicians, statisticians and climatologists are occurring around the globe.

There is significant attention being directed by the "skeptics" at the methods that were employed by the "consensus" scientists who had gained the main stage in the debate on global climate change. One of these issues centers around the number of sites that were used in reporting temperature data. That is, in some instances, only one bean length was counted.

The "science" of statistics is readily accessible to all of us. If you have a background in calculus, obviously a lot of the equations are simpler. But if you rely upon algebra, there is nothing in this post that will stop you from determining for yourself that limits upon sample size will obviously affect the reported values of any statistical analysis.
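You can see the sample-size problem for yourself with Python's standard `statistics` module. A minimal sketch, with made-up variable names: one bean gives you a mean but no standard deviation at all, and two identical beans give you a standard deviation of zero, which says nothing about how much bean lengths really vary.

```python
import statistics

single_bean = [4.9]      # one bean, one year
two_beans = [4.9, 4.9]   # the same length found again the next year

print(statistics.mean(single_bean))  # prints: 4.9

try:
    one_bean_sd = statistics.stdev(single_bean)
except statistics.StatisticsError as err:
    # stdev needs at least two data points; with one there is no spread to measure.
    one_bean_sd = None
    print("No standard deviation for a sample of one:", err)

two_bean_sd = statistics.stdev(two_beans)
print(two_bean_sd)  # prints: 0.0
```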

Your choice is to do the math. Or, at least have enough conversance with the process that you can read a post about the statistical method and follow the criticisms of the authors. The folks who are attempting to claim consensus rely upon you to have a certain deficiency in math tools to sway you with their moral suasion.

Please, don't fall for their planned moral suasion. It's easily falsifiable. But, it's up to you to understand the basics, and then ask questions.

12 comments:

ZZMike said...

One problem is that statistics is not a trivial science. Consider one of the more widely-used statistics books in colleges:

Essentials of Statistics (3rd Edition) (Paperback), by Triola.

Note the type: paperback.
Price (new - which is what the incoming student is likely to pay): $113.33
720 pp.
Weight: 3.2 lb.

Amazon has used ones starting at $60.

(That's really my rant on the cost of textbooks, but the number of pages attests to the subject's non-triviality.)

For example, I read a post about the t-test a few days ago. Can we say that if t=.049, we absolutely reject the hypothesis, but if it's 0.051 we absolutely accept it (or is it the other way around?).

Statistics are perhaps the most intentionally-misused data. (Somewhere else I have a long article on the use of statistics in court cases - most jurors ignore them because they can't understand them, and some judges throw them out because they don't think evidence can be reduced to equations. Especially in cases of DNA evidence - which shows likelihood, not certainty.)

Then there are advertisements, which claim "our product is 30% better!!!"

OregonGuy said...

ZZ--

When you bring up t tests, the question of homogeneity and heteroskedasticity need to be asked. What is the normal fit?

And yet unasked, what is the confidence interval, and whether or not you can demonstrate statistical significance?

My point in the post is that some questions are easy. Like how big was the sample?

A lot of the debate going on in the war on Man Made Global Warming is based on sample sizes that would make your mind spin. Like 1.

What is the normal distribution of a sample size of one? Hmm.

Thinking about statistics doesn't require much more than a basic understanding of the four operators. You don't need to stand in awe of an elite.

Especially one that is attempting to hornswoggle you.
.

g said...

I'm sorry, I couldn't get past the first picture. What was the question again?

OregonGuy said...

G--

Loser.
.

ZZMike said...

[Me, earlier] "One problem is that statistics is not a trivial science."

[OG] "... the question of homogeneity and heteroskedasticity need to be asked."

I rest my case.

Here's another example: The last 3 digits on my car's license plate are the same as those on my wife's. (California has a dcccddd pattern; the cars were bought several years apart.)

What are the chances of that happening? Well, the chances must be 1, since it happened.

"... sample sizes that would make your mind spin. Like 1."

Exactly. There's probably a pretty good reason why statistics is sometimes called the law of large numbers. (Or is that "the lore ..."?)

OregonGuy said...

ZZ--

I'm pretty sure you can figure out on your own the likelihood of your wife's and your cars having license plates with the same last three digits, in the same order.

(For those of you who haven't remembered, there's a formula that can predict the chance of drawing a Heart, if you've already drawn a Heart. The likelihood of drawing a Heart from a deck of fifty-two is one in four. Why does the probability change if you've drawn a Heart?)

One of my favourite daily reads is Briggs. His link is over there on the right.

Math isn't hard. Statistics isn't hard.

You just can't make shit up.
.

g said...

I'm stuck. Damn.

OregonGuy said...

G--

You found the comments button.

Now, go back to the top, and s-l-o-w-l-y scroll down the page.

Start anywhere, but remember, there are words for you to read.

Try it a couple of times.

Then take two aspirin and call me in the morning.
.

Gordon R. Durand said...

Perhaps "g" is attempting to point out the one glaring error in your analysis, that is, the statement that "The 'science' of statistics is readily accessible to all of us."

Remember that IQ is also normally distributed with a mean of 100 and a SD of (I think) 15 or 16 depending on the test. My guess is that anyone with an IQ less than 115 would have considerable difficulty with the math, and that anyone whose IQ is below average--fully 50% of the population, as dismal as that seems--would struggle with the concepts themselves.

I would nevertheless recommend to everyone Darrell Huff's classic How to Lie with Statistics, written in 1954 and still in print.

ZZMike said...

"... Why does the probability change if you've drawn a Heart?)"
Because now there are only 12 Hearts left? (But also one fewer card. S'poze you've drawn 5 cards and then a Heart. Have the odds changed? Clearly, if you draw 39 cards and no Hearts, the odds go way up for the next one being a Heart.)

"Math isn't hard. Statistics isn't hard."
Neither is brain surgery, once you've got the knack of it.

"One of my favourite daily reads is Briggs. His link is over there on the right."
?? I did a "Find" for "briggs" on the whole page - nothing.

Re your advice to "g":
"Now, go back to the top, and s-l-o-w-l-y scroll down the page."
No: start at the bottom of the page, and slowly scroll up. I think that'll help.

OregonGuy said...

ZZ--

Thanks for the note. I had assumed that my long-term readership, posting and linking had included a link for William M. Briggs, Statistician & Consultant.

In terms of g...

He has a nice smile.
.

g said...

Stuck...

Son of a...