Lab One: Graphs, Histograms, Choropleth Maps, and Euclidean Distance

 

Introduction

This lab presents some of the most commonly used techniques for summarizing large data sets so that the researcher can visualize basic patterns within the data and make concise generalizations about their characteristics.

Graph: A graph is a diagram (as a series of one or more points, lines, curves, or areas) that represents the changes in a variable when compared to one or more other variables. 

Histogram: A histogram resembles a bar graph. It is used to show the distribution of values in a set of data. Often the data are grouped into classes (a category of values such as 0-9 years of age, 30-39 cm of rainfall, or $50,000-54,999 annual income). The bar of each class represents the number of cases in that class. This technique is very useful for identifying the number of occurrences in each class of a data set (frequency), and the distribution from the highest to lowest values in the data set (frequency distribution).

Histograms and bar graphs look very similar but they're different.

Histograms can be thought of as "sorting bins." You have one variable, and you sort data by this variable by placing them into "bins." Then you count how many pieces of data are in each bin. The height of the rectangle you draw on top of each bin is proportional to the number of pieces in that bin.

In bar graphs you have several measurements of different items, and you compare them. The main question a histogram answers is: "How many measurements are there in each of the classes of measurements?" The main question a bar graph answers is: "What is the measurement for each item?" 

Choropleth Map: A choropleth map displays geographic data using shades of tone or colors to represent regional values of a data set.

Euclidean Distance: Distances measured on the Earth's surface using Euclid's idea that the Earth can be measured using a series of straight and parallel lines and angles, thus treating the Earth as a flat plane rather than a sphere.

Computational Formulae:

a. Straight Line Distance: the distance between two points $A(x_1, y_1)$ and $B(x_2, y_2)$ is:

$D_{AB} = \left[ (x_1 - x_2)^2 + (y_1 - y_2)^2 \right]^{1/2}$,

where x and y are physical distances.
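The straight-line formula can be sketched in Python as follows. This is only an illustration and not part of the original lab; the function name `straight_line_distance` and the sample points are assumptions chosen for the example.

```python
import math

def straight_line_distance(a, b):
    """Return the straight-line distance between a=(x1, y1) and b=(x2, y2)."""
    x1, y1 = a
    x2, y2 = b
    # Square the difference in each co-ordinate, add, then take the square root.
    return math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)

# Illustrative points only (a 3-4-5 right triangle).
print(straight_line_distance((0, 0), (3, 4)))  # 5.0
```

The result is in the same units as the x and y values supplied.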

 

b. Windy Road Distance: the distance between two points $A(x_1, y_1)$ and $B(x_2, y_2)$ separated by a non-linear transportation corridor is:

$D_{AB} = \left[ |x_1 - x_2|^p + |y_1 - y_2|^p \right]^{1/p}$,

in which p is a specified power, such as 1.2, which typifies a super highway, or 0.8, which typifies a windy road.
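A minimal Python sketch of the Windy Road calculation is given below, assuming the formula takes the absolute difference in each co-ordinate before raising it to the power p. The function name `windy_road_distance` and the sample offsets are hypothetical and are not taken from the lab data.

```python
def windy_road_distance(a, b, p):
    """Distance between a=(x1, y1) and b=(x2, y2) using the power p (p > 0)."""
    x1, y1 = a
    x2, y2 = b
    # Raise each absolute difference to the power p, add, then take the 1/p root.
    return (abs(x1 - x2) ** p + abs(y1 - y2) ** p) ** (1 / p)

# Illustrative offsets only: 3 units east-west and 4 units north-south.
for p in (2, 1.2, 0.8):
    print(p, round(windy_road_distance((0, 0), (3, 4), p), 2))
```

With p = 2 the function reproduces the straight-line distance. As p gets smaller the computed distance gets longer, which is why a low value such as 0.8 stands in for a winding route.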

c. Great Circle Distance: Since the Earth is not flat, its curvature becomes an important consideration when large distances are calculated. In these cases one should use the Haversine Formula (from R.W. Sinnott, "Virtues of the Haversine", Sky and Telescope, vol. 68, no. 2, 1984, p. 159):

dlon = lon2 - lon1
dlat = lat2 - lat1
a = (sin(dlat/2))^2 + cos(lat1) * cos(lat2) * (sin(dlon/2))^2
c = 2 * arcsin(min(1,sqrt(a)))
d = R * c

where lon1 and lat1 are the longitude and latitude of the first point, lon2 and lat2 are the longitude and latitude of the second point, and R is the radius of the Earth.

This formula gives mathematically and computationally exact results for a spherical Earth. The intermediate result c is the great circle distance in radians. The great circle distance d will be in the same units as R. For further discussion see

http://www.census.gov/cgi-bin/geo/gisfaq?Q5.1
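The Haversine steps above translate almost line for line into Python. The sketch below is an illustration only: it assumes latitudes and longitudes are supplied in decimal degrees, and it assumes a mean Earth radius of 6371 km for R. The function name `great_circle_distance` is hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius; substitute R in any unit you prefer

def great_circle_distance(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Haversine distance between two points given in decimal degrees."""
    # The trigonometric functions in the math module expect radians.
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    c = 2 * math.asin(min(1.0, math.sqrt(a)))  # min() guards against rounding slightly above 1
    return radius * c

# A quarter of the way around the equator, so the result should be radius * pi / 2.
print(great_circle_distance(0, 0, 0, 90))
```

Passing a different `radius` (for example in miles) returns the distance in that unit, as noted above for R.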

Mode: The most frequently occurring observation in the data set. 

Median: The middle observation of a ranked data set. 

Mean: $\bar{x} = \sum x / n$

where x is an individual observation, $\sum x$ is the sum of every individual observation, and n is the total number of observations in the data set.

Range: The highest observation in the data set minus the lowest observation.

Variance: $\sigma^2 = \sum (x - \bar{x})^2 / n$, where $\bar{x}$ is the mean

Standard Deviation: $\sigma$ = the square root of the variance
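These descriptive measures can be checked with Python's standard `statistics` module. The sketch below is illustrative only; the helper name `describe` and the sample observations are made up. It uses `pvariance` and `pstdev`, which divide by n as in the variance formula above, and `multimode`, which returns a list because a data set can have more than one mode.

```python
import statistics

def describe(values):
    """Return the descriptive measures defined above for a list of numbers."""
    return {
        "mode": statistics.multimode(values),      # list of the most frequent value(s)
        "median": statistics.median(values),       # middle of the ranked data
        "mean": statistics.mean(values),
        "range": max(values) - min(values),
        "variance": statistics.pvariance(values),  # sum of squared deviations divided by n
        "std_dev": statistics.pstdev(values),      # square root of the variance
    }

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up observations
for name, value in describe(sample).items():
    print(name, value)
```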

 

GRAPHING:

Introduction

A graph displays information as a diagram of one variable as it changes relative to other variables. Typically, graphs have two axes: the X axis on the horizontal and the Y axis on the vertical. The Y axis usually represents the dependent variable, whose value changes according to the value of the independent variable X.

Graphs are a tool used to visualize data to discern patterns within the data that one might not readily see by just looking at a series of numbers. As such, these tools need to be calibrated properly so that you, the observer, can recognize and understand the information being presented. The axes act as rulers, marking equal intervals into which the data may be placed for visualization. The proper labeling of the axes is important, for without titles, units, numbers, and legends, the graph is not easily understood.

HISTOGRAMS

Introduction 

Put very simply, histograms are visual representations of data: bar graphs that use only one variable. Rather than placing the data on a graph as a series of points, a histogram places the data into categories so that frequencies and distributions can be analyzed. Histograms are also graphs, and as such they follow the same principles about axes and labeling as regular graphs.

Because histograms display data in groups, how the groups are defined is important. There are no "natural classes" in nature, so any grouping must be defined by the person using the data. One simple rule of thumb for the number of groups is that there should be no more than 5 times the log of the number of observations { 5(log10 n) }, where n is the number of observations in the data set. Too few groups will hide interesting details, and too many groups defeat the purpose of grouping data in the first place.

The intervals covered by the groups should always be the same width, e.g., groups of ten, twenty, thousands, etc. Each group should be independent of the other groups, i.e., no value can belong to more than one group. To determine the interval for your groups, take the range (defined above) and divide it by the number of groups. Try to choose intervals that are easy to work with and understand, e.g., 10s, 20s, 100s, 1000s, rather than the raw (range/groups) value, which may be an odd number like 821.3.

An important part of grouping data is determining the frequencies of the groups. Frequency has already been defined above (the number of occurrences in the group), but cumulative frequencies are somewhat different. Cumulative frequencies show the number of observations in the data that do not exceed a certain value.
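The grouping rules above can be sketched in Python. This is an illustration under stated assumptions, not a required method: the function names `max_classes` and `frequency_table` are hypothetical, the sample values are made up, and each class is treated as running from its lower bound up to, but not including, the next lower bound so that no value falls into two classes.

```python
import math

def max_classes(n):
    """Upper limit on the number of classes from the 5 * log10(n) rule of thumb."""
    return math.floor(5 * math.log10(n))

def frequency_table(values, start, width, classes):
    """Return (lower, upper, frequency, cumulative frequency) for equal-width classes.

    Each class covers lower <= value < upper. Values outside all classes are ignored.
    """
    counts = [0] * classes
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < classes:
            counts[index] += 1
    table, running = [], 0
    for i, count in enumerate(counts):
        running += count  # cumulative frequency: observations up to the end of this class
        lower = start + i * width
        table.append((lower, lower + width, count, running))
    return table

sample = [12, 15, 21, 27, 28, 33, 35, 36, 39, 44]  # made-up observations
print(max_classes(len(sample)))
for row in frequency_table(sample, start=10, width=10, classes=4):
    print(row)
```

The last cumulative frequency should equal the number of observations; if it does not, some values fell outside the classes and the starting value or class count needs adjusting.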

PART I (40 marks)

Instructions: All questions in Part I are to be answered using the following data sets. Remember to follow the rules of graphing and of histograms to complete the questions below. 
DATASET #1: Marks Distribution by Age for Geography 999B.
Age Grade Age Grade Age Grade Age Grade
19 94 20 85 22 61 19 80
22 76 21 58 19 81 34 81
22 70 19 75 20 56 21 68
19 55 19 70 38 64 18 72
45 77 24 76 19 73 24 78
23 82 28 78 24 73 30 76
23 63 20 75 23 61 19 72
19 74 20 72 20 68 22 72
20 86 19 63 21 84 20 78
19 76 33 68 19 74 25 72
DATASET #2: Number of Patrons by Distance to London Galleria Mall (km)
Distance Patrons Distance Patrons Distance Patrons Distance Patrons
0.25 18425 2.75 4700 5.25 2700 7.75 1700
0.50 12650 3.00 4300 5.50 2550 8.00 1620
0.75 9125 3.25 4000 5.75 2400 8.25 1550
1.00 7850 3.50 3750 6.00 2300 8.50 1500
1.25 6950 3.75 3600 6.25 2200 8.75 1430
1.50 6250 4.00 3450 6.50 2150 9.00 1380
1.75 5750 4.25 3400 6.75 2075 9.25 1300
2.00 5400 4.50 3250 7.00 2005 9.50 1200
2.25 5150 4.75 3100 7.25 1900 9.75 1150
2.50 4950 5.00 2950 7.50 1780 10.00 1000

Questions:

1) Draw graphs for the two data sets, and plot the data as points on the graphs (i.e., do NOT connect the points). 

2) The graphs show two very different patterns. Describe the difference between the two graphs. Does the graph from data set 2 resemble any geographic pattern? Explain.

USE DATA SET #1 FOR THE FOLLOWING:

3) Draw a histogram showing the frequency counts of the "grade" data from data set 1. Remember to follow the guidelines above, and establish an appropriate number of groups. Prepare a table showing your groups, their frequencies, and the cumulative frequencies. 

4) Label on the histogram where the mean, the mode, and one standard deviation above and below the mean are located. How many of the data observations are within one standard deviation of the mean?

5) Explain why it is important that groups be independent of each other. Why are classes not defined as 40-50, 50-60, 60-70?

6)  What effect can changing the size of the classes have on interpretation of histograms?

 

CHOROPLETH MAPS

Introduction 

Choropleth maps display spatial (geographic) data in shades of one color, or in various colors to represent data.

Rather than using a graph with X and Y axes to plot your data, choropleth maps allow you to visualize your data on a map of the region from which the data were taken. These maps have a distinct advantage over graphs in that the user gets to see differences in the data as they change from state to state, province to province, or country to country.

As with all techniques, certain rules apply. For choropleth maps, the choice of tones or colors is very important. For example, take a look at any USA Today weather map. The hot areas are in warm colors such as yellows, oranges and reds. Cool areas are colored in blues and violets. Would it not be confusing if the colors were randomly chosen for each temperature level? You, too, must choose tones and colors carefully so that the viewer of your choropleth map can easily tell what the changes are in the data, even before that person reads the legend. When given a data set, the 5 times the log rule {5(log10 n), where n is the number of observations in the data set} applies to determine the number of groups, and the range divided by the number of groups gives the size of each group (see the Histograms section for a refresher).
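One way to keep tones consistent is to fix an ordered list of shades, from light for low values to dark for high values, and let each region's class decide its shade. The Python sketch below illustrates that idea only; the function name `assign_shades`, the region names, the class breaks and the shade labels are all hypothetical.

```python
import bisect

def assign_shades(values_by_region, breaks, shades):
    """Give each region the shade of the class its value falls in.

    breaks holds the ascending lower bounds of the second, third, ... classes,
    so there must be exactly one more shade than break values.
    """
    if len(shades) != len(breaks) + 1:
        raise ValueError("need exactly one more shade than break values")
    return {
        # bisect_right counts how many breaks are <= value, which is the class index.
        region: shades[bisect.bisect_right(breaks, value)]
        for region, value in values_by_region.items()
    }

# Hypothetical regions and values; shades run from light (low) to dark (high).
shades = ["very light", "light", "medium", "dark"]
breaks = [100, 200, 300]
sample = {"Region A": 40, "Region B": 150, "Region C": 200, "Region D": 910}
print(assign_shades(sample, breaks, shades))
```

Because the shades are ordered, a viewer can read "darker means more" from the map before consulting the legend.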

 

EUCLIDEAN DISTANCE

Introduction 

Euclidean distances are distances measured on the Earth's surface using Euclid's idea that the Earth can be measured using a series of straight and parallel lines and angles, thus treating the Earth as a flat plane rather than a sphere. When the Earth is treated as a flat surface, all the places on Earth can be given X and Y co-ordinates, just as if they were part of a graph, with each place a point on the graph. So how does one calculate the distance between points on a graph? The distance is determined using a variation of Pythagoras' theorem, $a^2 = b^2 + c^2$. The calculation is given in the Computational Formulae section above.

Simply, Euclidean distance is determined by taking two points; let's call them i and j. Points i and j have X and Y co-ordinates because they are part of a graph (a flat plane). Take the point furthest to the right (j, the one with the highest X value) and subtract from its X value the X value of the other point (i, the one to the left). Take that value and square it. The calculation you have just done is the first part of the distance equation from above. For the next half of the equation, take the Y value of the highest point (the one furthest from the X axis) and subtract from it the lower Y value (the one closest to the X axis). Take that value and square it. Because each difference is squared, the order of subtraction does not change the result. Now you have both parts of the distance equation. Add these two values together. Take the square root of this sum. NOW, you have the distance between the points i and j!
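When several points are visited in order, the same calculation is simply repeated for each consecutive pair. A minimal Python sketch follows, assuming Python 3.8 or later for the standard library function `math.dist`, which takes the square root of the summed squared differences. The function name `consecutive_distances` and the co-ordinates are made up and are not the lab data.

```python
import math

def consecutive_distances(points):
    """Euclidean distance from each point to the next one in the list."""
    # zip pairs point 1 with point 2, point 2 with point 3, and so on.
    return [math.dist(p, q) for p, q in zip(points, points[1:])]

# Made-up co-ordinates for illustration.
path = [(0, 0), (3, 4), (6, 8), (6, 0)]
for number, distance in enumerate(consecutive_distances(path), start=1):
    print(f"point {number} to point {number + 1}: {distance}")
```

A list of n points yields n - 1 distances, which fits naturally into a table with one row per pair.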

PART II (40 marks)

Instructions: All questions in Part II are to be answered with the data sets provided below. This section can be completed with the use of a computer for the questions related to Euclidean Distances. Try to complete the calculations manually. 

Questions

DATA: Question 7: Number of Establishments per State.

State           Establishments  State           Establishments
Alabama         86537           Montana         25028
Alaska          14773           Nebraska        43749
Arizona         86469           Nevada          29932
Arkansas        53409           New Hampshire   33249
California      745686          New Jersey      214076
Colorado        98826           New Mexico      35700
Connecticut     92816           New York        463762
Delaware        18781           North Carolina  165076
Florida         361330          North Dakota    18979
Georgia         157667          Ohio            248694
Hawaii          29313           Oklahoma        74663
Idaho           26513           Oregon          81077
Illinois        272738          Pennsylvania    279595
Indiana         128311          Rhode Island    27726
Iowa            73130           South Carolina  79743
Kansas          65858           South Dakota    20492
Kentucky        79006           Tennessee       113292
Louisiana       88290           Texas           394482
Maine           34840           Utah            26585
Maryland        114874          Vermont         19639
Massachusetts   158329          Virginia        149695
Michigan        210303          Washington      131919
Minnesota       112167          West Virginia   37687
Mississippi     52888           Wisconsin       122142
Missouri        130284          Wyoming         14630

Source: Bureau of the Census (1993), County Business Patterns 1990, U.S. Dept. of Commerce

7) Draw a choropleth map for the above data following the rules of groups and colors/tones. 

Euclidean Co-ordinates for Selected Canadian Cities
point City x y
1 Victoria 0 28
2 Vancouver 2 30
3 Whitehorse 3 67
4 Yellowknife 28 58
5 Edmonton 21 35
6 Calgary 18 28
7 Saskatoon 31 28
8 Regina 34 23
9 Thompson 46 36
10 Winnipeg 46 20

8) Graph the data points using the X and Y co-ordinates provided above. Calculate the Euclidean distances from point 1 to 2, 2 to 3, 3 to 4 etc. for the entire data set. You may wish to present your answers in a table. 

9) Euclidean distances are good measures over short distances and over flat, even terrain. Explain some of the shortcomings of Euclidean distance and how they affect measures of distance.

10) For each of the following city pairs, you are asked to compute, by hand, the Straight Line, and Windy Road distances that separate them. Remember to show all of your work!!! All of the necessary information that will enable you to calculate these distances is provided as follows:

 

First Set: London to Victoria is 3170 km due West and 620 km due North.

Windy Road p = 0.8

Second Set: St. John's to Valdez is 5220 km due West and 1495 km due North.

Windy Road p = 0.75

Third Set: L.A. to London, UK is 8475 km due East and 1955 km due North.

Windy Road p = 0.85

Part III (20 marks)

 

Now it's time to do some work using your own data. The reality of data analysis is often more complex and difficult than it appears in class examples or laboratory exercises. It is important that you get a feel for what is involved in real-life situations. Therefore, go and collect some data applicable to your area of interest in geography. The data should be such that they can be graphed as a histogram or a map or both. If you want to get some extra credit (see below), be sure your data have locational information (coordinates).

For these data, you should explain where they come from, what the likely sources of error are, and their significance to geographic study. Then graphically analyze your data and explain what you find.

Extra credit: (10 marks) Present a hypothesis where the location of your data points is important. Explain how the distances between the locations are or are not of importance.

 

You may want to try the link below for more information about histograms:

 

http://davidmlane.com/hyperstat/desc_univ.html