Introduction

When statistics are valuable

1 Can only give answers if the data collection and the data collected allow such answers

2 User is aware that statistics is just another strategy for finding patterns in the data

3 Statistics are based on certain assumptions. If those assumptions are not true the technique can still be applied but significance tests must be treated with caution

4 User is aware that techniques are mathematical models. Reality in all its complexity cannot be modeled in a useful way. Complex models may imitate reality but they will be equally complex and therefore not useful. Summarizing data in a complex way is not a step forward.

5 Data exploration needs to be done before any higher level modeling

Introduction to statistics in geography

statistics is difficult to define, but one attempt is: the study of numerical data; as such it is a branch of applied mathematics

it has only been around since the end of the 19th century

Aim: statistics is concerned with the analysis, organization and simplification of information about real world phenomena as an aid to their description, interpretation and prediction

teaches you how to apply a set of techniques to data acquired either directly through field work or from secondary sources such as published materials

i.e. - primary - stream discharge, temperature

secondary - census data, economic data

consequence - statistical analysis develops a more critical approach to research or real world phenomena; you think more rigorously and precisely about the phenomena and are less likely to make unsubstantiated vague generalizations

if there are computer packages why learn to do any of this by hand

1) dramatically increases one’s understanding of statistics

2) aids in interpreting and implementing solutions

difference between mathematics and statistics is that math is deductive

Toronto > London

London > Sarnia

∴ Toronto > Sarnia

statistics is inductive

inductive arguments give rise to conclusions that often exceed the content of the information on which they are based

e.g. take a survey of 1000 Canadians (it is not possible to ask all Canadians) about the next election; 550 respond Liberal (this is the info)

we infer or conclude that 55% of all Canadians will vote Liberal

however qualifications are added like:

it is probably true that

almost certainly

plus or minus

because we took a sample of all voters
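The "plus or minus" qualification in the survey example can be made concrete. The sketch below is only an illustration: it assumes a simple random sample, the normal approximation for a proportion, and a 95% confidence level (z = 1.96), none of which are stated in the notes.

```python
import math

n = 1000       # sample size from the survey example
liberal = 550  # respondents who answered Liberal

p_hat = liberal / n                           # sample proportion
std_err = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of a proportion
z = 1.96                                      # assumed: 95% confidence level
margin = z * std_err                          # the "plus or minus" part

print(f"estimate: {p_hat:.1%} plus or minus {margin:.1%}")
```

Under these assumptions the margin comes out at roughly 3 percentage points, which is why the inference is stated as "probably" rather than as a certainty.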

descriptive statistics

there is a tremendous amount of info available that is increasing at an accelerating rate

how can we deal with it all? one approach is to use descriptive statistics both for ordinary data and spatial data (data that relates to 2D surfaces or 3D space)

descriptive statistics help to summarize this info

(no inferences are made) i.e. summarizes info about

1) places - what is average distance between stores

2) patterns - what is the arrangement of tall trees vs small trees

3) areas - what is the average income in London

4) temporal trends - is the average price of oil (corrected) increasing

we will first look at methods that help summarize and describe large sets of data

data when summarized has properties like means and variance which will be useful when we move on to inferential stats

inferential statistics

provides formal methods for calculating the limits of probability, certainty

do this by

1) estimating the representativeness of samples

2) estimating the degree to which data support a hypothesis

i.e. was a representative group of Canadians interviewed

hypothesis: most Canadians will vote liberal (true or false)

significance

one of the most powerful uses of statistics is in helping to decide whether an observed difference or relationship between 2 sets of sample data is significant

statistical significance

is concerned with whether an observed difference truly exists

significance relies heavily on the concept of probability

statistics allows us to make more informed judgments

prediction

the 4th major use of statistics

completely accurate prediction is only possible in a completely deterministic system

very few geographic processes are deterministic

if the process is not completely random it may be possible to predict the outcome of a particular combination of circumstances

types of statistical approaches

1) confirmatory statistics - parametric and some nonparametric

2) exploratory data analysis - nonparametric

we'll do some of both!

misuse of statistics

statistics is not a method by which you can prove anything you want; it has a set of clearly defined rules so that interpretations don't exceed the data

statistics is not a substitute for abstract theoretical reasoning or examination of exceptional cases

it is a complementary tool

common mistakes

(a) a nonrandom sample is drawn, e.g. 'volunteers', people who dial in to 900 numbers; even if the sample is random, those who answer may not be, since some may be more motivated to respond

(b) untruthful answers, e.g. about age, income

(c) ecological fallacy - happens when comparing statistics across scales or translating results across disparate environments

i.e. using events observed at metropolitan level to predict individual behaviour

so stats is not just a collection of facts or a recipe book

variables

variable is a quantity being measured that can take on a set of measured values

2 types

(a) - discrete - only certain fixed whole number values

(b) - continuous - theoretically can take on any value between the 2 end points

variates are the single values observed

e.g. variable=area of country measured on scale = mi²

observation #1 3,831,033 Canada

observation #2 3,678,896 USA

#3 145,709 Japan

#4 .2 Vatican City

the number of variates is usually denoted by 'n'

observation - a value assigned to one item of a variable

the symbol 'Σ' = sigma, it means add up the 'n' observations of variable X in sequence i=1...n

e.g. variable = pH

obs=1 x₁=7.6

obs=2 x₂=5.9

obs=3 x₃=6.0

obs=4 x₄=7.0
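The sigma notation applied to the four pH observations can be sketched in a few lines. The variable names (x, n, total) are chosen here for illustration only.

```python
# pH observations x1..x4 from the example above
x = [7.6, 5.9, 6.0, 7.0]

n = len(x)  # number of variates

# Sigma: add up x_i for i = 1..n (Python indexes from 0, so x[i - 1] is x_i)
total = 0.0
for i in range(1, n + 1):
    total += x[i - 1]

print(n, round(total, 1))
```

The loop is written out to mirror the i=1...n sequence; in practice the built-in sum(x) does the same job.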

constant - a single derived value

data

a body of information in numerical form

a set of data in tabular form is referred to as a data matrix i.e. a spreadsheet

data are the 'raw materials' of descriptive and inferential statistics and model building

'garbage in garbage out' (GIGO)

need to be concerned with the methods of collection so that data obtained can be used with confidence

quality of data

1) valid - measurements actually measure what we think they measure

concept → operational definition → variable

concepts are not directly measurable or observable, they are often abstract concepts such as social space, attitudes

we often use surrogates - that is a variable that stands in for the concept we are trying to measure

statistical modeling cannot proceed unless some means is found of expressing the concept on some form of measurement scale; we need some fairly 'pure' measurement of the concept

the operational definition - specifies the measurement process

i.e. how will it actually be measured?

variable(s) may be representative of the concept

e.g. social class = f(income, education, status, occupation)

2) reliability

measurements should be free of substantial bias i.e. loaded questions, biased samples

3) precision and accuracy - precision is the degree of refinement (repeatability) of a measurement, accuracy is how close the measurement is to the true value; we might measure precisely and still be off, e.g. a badly calibrated thermometer might measure precisely but not be accurate
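The badly calibrated thermometer can be illustrated with a few numbers. The readings and the true temperature below are hypothetical, invented only for this sketch; the spread of the readings stands in for precision and the offset from the true value stands in for accuracy.

```python
import statistics

true_temp = 20.0  # hypothetical true temperature, degrees C
readings = [22.1, 22.0, 22.1, 22.0, 22.1]  # hypothetical readings from a badly calibrated thermometer

spread = statistics.pstdev(readings)          # small spread: the readings agree closely (precise)
bias = statistics.mean(readings) - true_temp  # large offset: the readings are off (not accurate)

print(f"spread = {spread:.2f}, bias = {bias:.2f}")
```

With these made-up values the readings agree with each other to within about a tenth of a degree, yet all sit about 2 degrees above the true temperature.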

measurement scales

4 levels of measurement

1) nominal scale - the simplest scale - allows simple classification or categorization. there is no ordering of the items

categories are mutually exclusive (no item in > 1 category)

exhaustive (include all known cases)

any 2 items in a category are considered equal or the same

2) ordinal scale - where enough information is available to place categories in rank order

(a) weak ordinal - where we can assign a 'more than' or 'less than' quality to 2 or more categories, e.g. high order vs. low order central places, agree/neutral/disagree on an attitudinal scale

(b) strong ordinal - stronger ranking - but where exact quantitative difference between pairs of ranked items is unknown

i.e. preference scale [1/2/3/4/5] of preferred cities in terms of residential desirability

3) interval - measured and positioned on a continuous scale and can assume any value within its range, the starting point is arbitrary e.g. temperature

4) ratio - has the additional feature that the ratio of any 2 values on a ratio scale is independent of the unit of measurement, has an absolute nonarbitrary zero point e.g. population

i.e. 1lb/2lb is same ratio as 453 grams/906 grams
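The difference between the interval and ratio scales can be checked numerically. In this sketch the pound-to-gram factor is the exact modern definition rather than the rounded 453 used above, and the temperatures of 10 and 20 degrees are illustrative values that do not come from the notes.

```python
GRAMS_PER_POUND = 453.59237  # exact definition of the avoirdupois pound

def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32

# Ratio scale: the ratio survives a change of units
ratio_lb = 1 / 2
ratio_g = (1 * GRAMS_PER_POUND) / (2 * GRAMS_PER_POUND)

# Interval scale: the zero point is arbitrary, so the ratio does not survive
ratio_c = 10 / 20
ratio_f = c_to_f(10) / c_to_f(20)

print(ratio_lb, ratio_g)
print(ratio_c, round(ratio_f, 3))
```

The weight ratio is the same in pounds and grams, while the temperature ratio changes between Celsius and Fahrenheit, so "twice as hot" has no meaning on an interval scale.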

you can go down the scale of measurement but not up

good rule of thumb is to collect as much info as you can

e.g. rainfall is at least interval; it can be reduced to ranks or to categories

rainfall   rank
5 cm       3
15 cm      1
4 cm       4
6 cm       2

categories: low, medium, high
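Going down the scale of measurement with the rainfall values can be sketched as follows. The ranking reproduces the ranks listed above; the cutoffs for low/medium/high are hypothetical, since the notes do not say where the category boundaries lie.

```python
rainfall_cm = [5, 15, 4, 6]  # values from the example above

# Measured values -> strong ordinal: rank 1 is the wettest observation
order = sorted(rainfall_cm, reverse=True)
ranks = [order.index(value) + 1 for value in rainfall_cm]

# Measured values -> weak ordinal: cutoffs are hypothetical, chosen only for illustration
def category(cm, low_below=5, medium_below=10):
    if cm < low_below:
        return "low"
    if cm < medium_below:
        return "medium"
    return "high"

categories = [category(value) for value in rainfall_cm]

print(ranks)
print(categories)
```

Neither step can be reversed: the original centimetre values cannot be recovered from the ranks or the categories, which is why it pays to collect at the highest level available.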

errors and precision

answers can be given to any level of precision but all measurements are subject to error

beware of spurious precision

in general you can only be as precise as your least precise number
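A small sketch of spurious precision, reusing the pH readings from the variables example, which were recorded to one decimal place. The choice to report the mean to the same number of decimals as the inputs is one common convention, assumed here rather than taken from the notes.

```python
# pH readings recorded to one decimal place (from the variables example)
readings = [7.6, 5.9, 6.0, 7.0]

mean = sum(readings) / len(readings)

spurious = f"{mean:.3f}"  # three decimals: more than the measurements support
reported = f"{mean:.1f}"  # one decimal: matches the precision of the inputs

print(spurious, reported)
```

The three-decimal figure suggests the pH is known to the thousandth, which the one-decimal readings cannot justify.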