Tuesday, April 21, 2015

Extra Credit Opportunity for Thursday, April 23, 2015

Good evening,

For anyone who attends the IC Stand-up comedy event on Thursday at 7 p.m. in Emerson Suites, +3 extra credit points will be awarded.

The show features our own Chris Ptak opening up for Ben Weger (also a TV-R major).

Thanks.

Jack


Wednesday, April 1, 2015

MMRM Exam #2 Review Sheet

Surveys
Two major types of surveys (descriptive and analytical)
Advantages and disadvantages
Data collection techniques in surveys (know one advantage and disadvantage of each)
How can we (perhaps) increase response rate?
Obstacles (some) of survey research
CATI system
Placement order of questions (general to specific, sensitive issues at the end, demographic info typically near the end etc.)
Double-barreled questions, filter questions

Content Analysis
Definition of content analysis
Characteristics (objective, systematic, empirical, quantitative)
Manifest vs. latent content
Importance of inter-coder reliability
Codebook and code sheets
Composite week
Purposes of content analysis
Unit of analysis in content analysis
Can we make conclusions about media effects based on content analysis?

Experiments

Advantages and disadvantages
Typical steps that a laboratory experimenter takes
Problem of confounding variables
Importance of randomization
Experimental designs—pretest-posttest-control group design
Solomon four-group design (pretest-treatment-posttest; pretest-posttest; treatment-posttest; posttest only)
Validity and reliability in experiments
Double-blind experiments


Qualitative Research 
Four criteria used to evaluate qualitative research (article posted on blog):
naturalistic observation
contextualization
maximized comparisons
sensitized concepts

Positivist Paradigm vs. Interpretive Paradigm. Which is associated with Quantitative Techniques? Which is associated with Qualitative Techniques?

Major types of qualitative data collection techniques:

In-depth interviews
Focus Groups
Participant Observation
Case Studies

Understanding "Sense-Making"

Putting together the qualitative report (what are the steps?)

Make sure you know the following:

NOM IV + NOM DV = chi-square
NOM IV + I/R DV = t-test/ANOVA
I/R IV + I/R DV = correlation


Statistics
Definition
Central tendency vs. dispersion
Mean, mode, median
Frequencies
Type I vs. Type II error and the “null hypothesis”

Test-statistics—

Know when to use, how to solve, and how to interpret chi-square

Know when to use, how to solve, and how to interpret cross-tabulation

Know when to use, how to solve, and how to interpret t-test 

Know when to use and how to interpret correlation

Degrees of freedom

The exam will feature at least one chi-square problem, one cross-tabulation problem, one t-test problem, and one correlation interpretation problem.

There will also be a few questions about data interpretation. Specifically, you'll have to determine whether a hypothesis is supported or not supported based on p < .05.

Practice problems:


Practice Statistical questions:
1. Chi-square. 

Ithaca School Year ------------ Observed Freq. ----------- Expected Freq.
Freshmen ------------------------15---------------------------- 27
Sophomores ---------------------20---------------------------- 35
Juniors ---------------------------10----------------------------- 20
Seniors ----------------------------15---------------------------- 25
Where o = observed frequency; e = expected frequency.

The table above provides the expected and observed frequencies of IC students who drop out of school during any given year. The admissions department would like to know if their retention efforts are making a difference.

Using the chi-square test, please tell me if there is a significant difference between the observed and expected frequencies (at the .05 level).

Are the retention efforts working? Why or why not?


2. Cross-tabulation.
I’m testing the following hypothesis:
Men are more likely than women to prefer TV sitcoms to TV dramas.
After collecting my data, I’m left with the following cross-tab:
-------------------- Comedy TV ------- Drama TV ------- Total
Male --------------- 40 ( ) ------------- 42 ( ) ----------- 82
Female ------------ 29 ( ) ------------- 57 ( ) ----------- 86
Total: -------------- 69 ----------------- 99 --------------- 168
Using chi-square, tell me whether or not the data support my hypothesis (at the .05 level). What use is this data to the ad agency representing Schick Quattro for Men (shaving products—face razors)?


T-Test
  1. The following are data on TV use per two weeks by gender. Using a t-test (independent samples), determine whether the difference between the two groups is statistically significant at the .05 level. Are these groups statistically different or not? Why?
where the denominator is the difference between the standard errors of the mean for the two groups, and X is the mean/average for each group.
Gender ----------------------- Male ----------------- Female
Mean ------------------------ 41 hours ------------- 56 hours
Participants ----------------- 10 --------------------- 20
Standard error of mean ---- 2.01 ------------------ 0.58
What is the t-value? What conclusions can you make?



Correlation

OK, so here’s the deal. I’m a TV news investigative reporter for Newswatch 16 and I’ve got a tip that a local grocery store is knowingly selling kid yogurt that contains unsafe levels of bacteria. The thing is, I’m, uh, “allergic” to numbers and I can’t make heads or tails out of this information. The tipster, a food safety scientist from Cornell, gave me the following info from his random survey of children who consumed the tainted yogurt, but I have no idea what it means. Can you help me? Do I have a story here? What do all these numbers mean? (5 points) 

Correlation table of unsafe levels of bacteria in yogurt to intestinal illness among children 0-14:
Children ages 0 thru 2                        .38
Children ages 3 thru 5                        .17
Children ages 6 thru 8                        .66
Children ages 9 thru 11                      .22*
Children ages 12 thru 14                    .04*

Understanding Correlation

Correlation Overview

So far, we've talked about Margin of Error, Standard Deviation, z-Score, t-Test, and Chi-square. 


Remember that, depending on the type of measurement for the IV and DV, we use certain tests.

Specifically:

If the IV is nominal and the DV is nominal, we use chi-square.
If the IV is nominal and the DV is interval/ratio, we use t-test.
If the IV is interval/ratio and the DV is interval/ratio, we use correlation.
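These pairings can be captured in a small Python lookup (a minimal sketch; the names are illustrative, not part of any course materials):

```python
# Map (IV level, DV level) pairs to the appropriate statistical test.
# "nom" = nominal; "ir" = interval/ratio.
TEST_FOR = {
    ("nom", "nom"): "chi-square",
    ("nom", "ir"): "t-test/ANOVA",
    ("ir", "ir"): "correlation",
}

def pick_test(iv_level: str, dv_level: str) -> str:
    """Return the test that matches the measurement levels of the IV and DV."""
    return TEST_FOR[(iv_level, dv_level)]

print(pick_test("nom", "ir"))  # t-test/ANOVA
```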

Correlation is the single most common statistical test in mass media research. 


Correlation is a statistical technique that can show whether, and how strongly, pairs of variables are related. For example, height and weight are related; taller people tend to be heavier than shorter people. The relationship isn't perfect. People of the same height vary in weight, and you can easily think of two people you know where the shorter one is heavier than the taller one. Nonetheless, the average weight of people 5'5'' is less than the average weight of people 5'6'', and their average weight is less than that of people 5'7'', etc. Correlation can tell you just how much of the variation in people's weights is related to their heights. 


Although this correlation is fairly obvious, your data may contain unsuspected correlations. You may also suspect there are correlations, but not know which are the strongest. An intelligent correlation analysis can lead to a greater understanding of your data.


Like all statistical techniques, correlation is only appropriate for certain kinds of data. 


Correlation works for quantifiable data in which numbers are meaningful, usually quantities of some sort. It cannot be used for purely categorical data, such as gender, brands purchased, or favorite color.



Rating Scales
Rating scales are a controversial middle case. The numbers in rating scales have meaning, but that meaning isn't very precise. They are not like quantities. With a quantity (such as dollars), the difference between 1 and 2 is exactly the same as between 2 and 3. With a rating scale, that isn't really the case. You can be sure that your respondents think a rating of 2 is between a rating of 1 and a rating of 3, but you cannot be sure they think it is exactly halfway between. This is especially true if you labeled the mid-points of your scale (you cannot assume "good" is exactly halfway between "excellent" and "fair").

Most statisticians say you cannot use correlations with rating scales, because the mathematics of the technique assume the differences between numbers are exactly equal. Nevertheless, many survey researchers do use correlations with rating scales, because the results usually reflect the real world. Our own position is that you can use correlations with rating scales, but you should do so with care. When working with quantities, correlations provide precise measurements. When working with rating scales, correlations provide general indications.

The main result of a correlation is called the correlation coefficient (or "r"). It ranges from -1.0 to +1.0. The closer r is to +1 or -1, the more closely the two variables are related.

While correlation coefficients are normally reported as r = (a value between -1 and +1), squaring them makes them easier to understand. The square of the coefficient (or r squared) is equal to the percent of the variation in one variable that is related to the variation in the other. After squaring r, ignore the decimal point. An r of .5 means 25% of the variation is related (.5 squared = .25). An r value of .7 means 49% of the variance is related (.7 squared = .49).
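As an illustration, r and r squared can be computed by hand in Python from the standard Pearson formula (the height/weight numbers below are made up for the example):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired interval/ratio data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mean_x) ** 2 for x in xs) *
                    sum((y - mean_y) ** 2 for y in ys))
    return num / den

heights = [60, 62, 65, 68, 70]       # inches (hypothetical)
weights = [115, 120, 135, 150, 160]  # pounds (hypothetical)

r = pearson_r(heights, weights)
print(round(r, 3))       # 0.996 -> very strong positive correlation
print(round(r ** 2, 3))  # 0.993 -> about 99% of the variation is related
```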
A correlation report can also show a second result of each test - statistical significance. In this case, the significance level will tell you how likely it is that the correlations reported may be due to chance in the form of random sampling error. If you are working with small sample sizes, choose a report format that includes the significance level. This format also reports the sample size.
A key thing to remember when working with correlations is never to assume a correlation means that a change in one variable causes a change in another. Sales of personal computers and athletic shoes have both risen strongly in the last several years and there is a high correlation between them, but you cannot assume that buying computers causes people to buy athletic shoes (or vice versa).
The second caveat is that the Pearson correlation technique works best with linear relationships: as one variable gets larger, the other gets larger (or smaller) in direct proportion. It does not work well with curvilinear relationships (in which the relationship does not follow a straight line). An example of a curvilinear relationship is age and health care. They are related, but the relationship doesn't follow a straight line. Young children and older people both tend to use much more health care than teenagers or young adults.


If r is close to 0, it means there is no relationship between the variables. If r is positive, it means that as one variable gets larger the other gets larger. If r is negative it means that as one gets larger, the other gets smaller (often called an "inverse" correlation).




Info provided by Survey System.

Monday, March 23, 2015

Link IPEI

https://www.facebook.com/tompkinstrustcompany?sk=app_403834839671843&brandloc=DISABLE&app_data=chk-550c9fe484598&pnref=story

in-class experiment link

https://www.surveymonkey.com/r/YFJM5FS

T-test overview

Often in social science situations, we want to see if there is a statistical difference between two groups.

That is, are any recorded differences due to "chance" (not significant), or are they NOT due to chance (significant)?

To determine if the differences are significant, we use a simple inferential statistical test called the t-test.

Here's how we solve for t:

t = (X1 - X2) / (Sm1 - Sm2)

where X1 = mean (average) of the first group and X2 = mean (average) of the second group, and Sm = standard error of the mean.

So, for example, let's say we have two groups of students in an experiment where we're trying to test whether or not kids can learn their multiplication facts better via a TV show than in school.

Let's say we bring in 20 kids and we randomly assign them to two groups. The first group of 10 learns their multiplication facts via TV show, and the second group of 10 learns via the traditional classroom approach.

So we have something like this:

# of TV kids = 10
# of class kids = 10
Overall N = 20

We examine their scores and we see that the TV kids averaged a 2/10 on a post-test quiz measuring multiplication facts and the traditional class kids averaged a 6/10 on the same test.

Normally, you'd have to solve for the Sm, but-- since this is an 11-day online course-- I'm just going to provide it to you to simplify the process (plus, it's simple to solve and most statistical packages provide it as a matter of course, so almost never do you solve for it by hand). Let's say that the Sm = 1.05.

Ok, so here's what we do--

We solve for t like this:

t = (2 - 6) / 1.05

t = -4 / 1.05

t = -3.81, and we always take the absolute value, so t = 3.81

This value, in and of itself, tells us nothing.

Just like chi-square, however, we have to be concerned about degrees of freedom (df).

The df for a t-test is simple-- you take the N for the group and subtract one. So, for the first group, the df is 9 (10-1), and for the second group it's also 9 (10-1).

9 + 9 = 18, so the df = 18.

So, now armed with this info, we can check out the t-value chart (either the one in the book, or one we find easily online-- like here for example-- t-test table), and we see that in order to be significant with 18 df (at the .05 level), the t-value needs to be greater than 2.10.

Since our t-value of 3.81 is higher than 2.10, we say that there is a significant difference between the groups.

We then look back at our original data and we see that the traditional kids scored, on average, much better than the TV kids, so we conclude that it's better to use the traditional method.

(Typically, in experimental settings, we never make such a claim without several iterations of the experiment).
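The arithmetic above can be sketched in Python (the Sm of 1.05 and the critical value of 2.10 are taken from the post; a real analysis would compute the standard error from the raw scores):

```python
def t_value(mean1, mean2, sm):
    """Absolute t-value given two group means and the standard error term."""
    return abs((mean1 - mean2) / sm)

n_tv, n_class = 10, 10
t = t_value(2, 6, 1.05)          # TV kids averaged 2/10; class kids 6/10
df = (n_tv - 1) + (n_class - 1)  # 9 + 9 = 18

critical = 2.10  # t-table value at the .05 level with 18 df
print(round(t, 2))   # 3.81
print(t > critical)  # True -> significant difference between the groups
```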

Get it?

Chi-square overview

When we talk about inferential statistics, we're simply determining whether or not the results we obtained were due to chance.

(Inferential means that, if our sample is representative, we can INFER from our sample that the results are indicative of the entire population).

If they are not due to chance, we suggest a relationship between variables. If they are due to chance, we cannot make such a claim.

Think of inferential stats as a light switch-- it's either on or off. In the social sciences, if the significance is .05 or lower (that is, we allow for 95% confidence), then we say the switch is "on" and the results are "significant"-- meaning that we are 95% sure that the results are NOT due to chance.

If p (the probability that the results would show up like this by chance) is HIGHER than .05, then we say there is NO significance-- which means we can't argue that the variables are related.
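The light-switch decision rule can be written as a one-line Python check (a sketch; ".05 or lower" counts as significant, per the paragraph above):

```python
ALPHA = 0.05  # conventional significance level in the social sciences

def is_significant(p, alpha=ALPHA):
    """True if the probability the results are due to chance is .05 or lower."""
    return p <= alpha

print(is_significant(0.03))  # True  -> the switch is "on"
print(is_significant(0.20))  # False -> can't claim the variables are related
```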

Well, we already know that if we have a nominal IV and a nominal DV, then we test using chi-square.

Chi-square is a simple statistical test.

Put simply, chi-square is the sum, over all cells, of the observed frequency minus the expected frequency, squared, divided by the expected frequency: chi-square = sum of (o - e)^2 / e.

The observed frequency is simply the number reported. The expected frequency is what you'd expect if it were completely by chance.

It's best explained with an example.

Suppose we had the following data:

We asked 97 people about their political affiliations (let's assume it's a random sample) and we got this:

Gender-------Republican------ Democrat----- Row Total

Male:-----------  23---------------- 17------------ 40 

Female:--------- 20---------------- 37------------ 57

Column Total:---43----------------54------------ 97


Our hypothesis is:

H1: Women are more likely to be affiliated with the Democratic party than men.

By looking at the raw data, it's difficult to say, with certainty, that this is the case, so we test the hypothesis using chi square.

Our first order of business is to find the "expected" frequency.

The "expected" frequency is R x C / N (where R is the ROW total, C is the COLUMN total, and N is the overall number).

So the ROW total for men is 40.
The ROW total for women is 57.

The Column total for Republicans is 43.
The Column total for Democrats is 54.

The overall N is 97.

The expected frequency for males who should be Republicans based on chance is 40 x 43 = 1,720, and 1,720 / 97 = 17.73.

Ok, so now we know that the observed frequency for men who are Republicans is 23, and the expected frequency is 17.73. This gives us a difference of 5.27. We square this value and get 27.77.

Once we have that value, we divide by the expected value and get this-- 27.77/17.73 = 1.57 (we always round to the nearest hundredth).

Remember, though, chi-square is the SUM OF, so we have to compute it for each cell.

So, we repeat the process for each "cell" and then add up the totals.

Once we have the sum of the chi-squares, we check with a chi-square chart to see if it's significant at the .05 level. You can check here-- chi-square chart.

You'll note something called "degrees of freedom," or "df." The df helps us determine which line to look at on the chart. The easiest way to remember df is this: it's (R-1) x (C-1), where R is the number of rows and C is the number of columns. In this case, we have 2 rows and 2 columns, which gives us a df of 1, because (R-1) = (2-1), (C-1) = (2-1), and 1 x 1 = 1.
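The cell-by-cell procedure can be automated in Python (a sketch using the observed counts from the table above; run it to check your own hand calculation):

```python
observed = [
    [23, 17],  # male:   Republican, Democrat
    [20, 37],  # female: Republican, Democrat
]

row_totals = [sum(row) for row in observed]        # [40, 57]
col_totals = [sum(col) for col in zip(*observed)]  # [43, 54]
n = sum(row_totals)                                # 97

chi_square = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n  # expected frequency: R x C / N
        chi_square += (o - e) ** 2 / e         # (observed - expected)^2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)  # (R-1) x (C-1) = 1
# The male/Republican cell alone contributes about 1.57, matching the worked
# step above; chi_square now holds the sum over all four cells.
```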

Also, remember that we use the .05 level of significance.

Go ahead and solve for the sum of chi-square and post what you get. Tell me if it's significant and whether the hypothesis is supported.

Jack

Intro to Statistics (Overview)

Statistics are mathematical methods to collect, organize, summarize, and analyze data. Statistics provide valid and reliable results only when the data collection and research methods follow established scientific procedures. With the development of the computer, the science of statistics has changed dramatically.

Basic statistical procedures include:

descriptive statistics,
sample distribution, and
data transformation.

In descriptive statistics, the chapter presents the concepts of data distribution, frequency distribution, cumulative frequency, histograms, bar charts, frequency polygons, the normal curve, and skewness.

Summary statistics make data more manageable by measuring two basic tendencies of distributions: 1) central tendency; and 2) dispersion (variability). These statistics make it easier for researchers to understand data.

Central tendency statistics provide information about the grouping of numbers in a distribution by giving a single number that characterizes the entire distribution. Using the mode, median, and mean, researchers can figure out a typical score of a distribution.

In addition, dispersion measures describe the way scores are spread out about a central point. Using range, variance, and standard deviation allows researchers to understand the characteristics of the data.
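Python's standard library computes all of these measures directly; here is a quick sketch with made-up quiz scores:

```python
import statistics as stats

scores = [2, 4, 4, 5, 7, 9, 11]  # hypothetical quiz scores

# Central tendency: single numbers that characterize the distribution
print(stats.mean(scores))    # 6
print(stats.median(scores))  # 5
print(stats.mode(scores))    # 4

# Dispersion: how spread out the scores are around that center
print(max(scores) - min(scores))          # range = 9
print(round(stats.pvariance(scores), 2))  # population variance = 8.57
print(round(stats.pstdev(scores), 2))     # population std. deviation = 2.93
```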

The term sample distribution refers to the distribution of some characteristic measured on individuals or other units of analysis that were part of a sample. Additionally, it's important to understand the notion of a sampling distribution: a theoretical probability distribution of all values of a variable for a given sample size.

Most statistical procedures are based on the assumption that the data are normally distributed. When anomalies arise, researchers can attempt to transform the data to achieve normality. Data can be transformed by multiplying or dividing each score by a constant, or by taking the square root or log of the scores.

Hypothesis Testing Overview

Hypothesis development in scientific research is important because the process refines and focuses research by excluding extraneous variables and permitting variables to be quantified.

Scientists rarely begin a research study without a problem or a question to test. Without research questions or hypotheses, research proves to be a waste of time.

Researchers develop studies based on existing theory and are thus able to make predictions about the outcome of their work. Therefore, hypothesis development is usually the culmination of a rigorous literature review (we don't have the time to conduct a full-scale literature review in this class, so our theories will be based on our own experiences).

Researchers should use hypotheses in scientific research to:

1) provide direction for a study;
2) eliminate trial-and-error research;
3) rule out intervening and confounding variables; and,
4) allow for quantification of variables.

In addition, hypotheses should be:

1) compatible with current knowledge in the area;
2) logically consistent;
3) stated concisely; and,
4) testable.


In hypothesis testing, a researcher either rejects or fails to reject the null hypothesis that the statistical differences being analyzed are due to chance or random error.

To determine the statistical significance of a research study, the researcher must set a probability level (significance level) against which the null hypothesis is tested. If the results of the study indicate a probability lower than this level, the researcher can reject the null hypothesis. If the research outcome has a high probability, the researcher fails to reject the null hypothesis. It is common practice in mass media research studies to set the probability level at .05, which means that five times out of 100, significant results of the study would occur because of random error or chance. Another way to think of this is to say that "we are 95% confident that our results are not due to chance."

All research contains error. Typically, two types of error (Type I error: the rejection of a null hypothesis that should not be rejected, and Type II error: the acceptance of a null hypothesis that should be rejected) are relevant to hypothesis testing.

There is always the possibility of making an error in rejecting or failing to reject a null hypothesis. It is not easy for researchers to balance these two error types, but one procedure, power analysis, helps researchers deal with the problem. Power (the probability of rejecting the null hypothesis when it is false) indicates the probability that a statistical test will detect the phenomenon under study when it actually exists; with adequate power, if there is a difference, researchers are able to detect it.

Sunday, March 22, 2015

Exam #2 Date: April 6, 2015

The date of the second exam is April 6, 2015.

It will be an open-book, open-resource exam.

Jack


Big Data Show that Nepotism Rules in America

There is a very real chance that the presidential election in 2016 will pit Jeb Bush against Hillary Clinton. According to oddsmakers, this is the likeliest outcome.
Many Americans are uncomfortable with the idea that two families could dominate the presidency that way. Whether or not you like one of the candidates, it just doesn’t feel right, in part because a second Bush-Clinton election makes a mockery of our self-identification as a democratic meritocracy.
How bad is America’s nepotism problem? Can data science help us gauge its depth? It can — and what the data shows is that something has gone haywire.
I studied the probability of male baby boomers’ reaching the same level of success as their fathers. I had to limit myself to fathers and sons because this was a highly sexist period in which women held few powerful political positions.
Let’s start with the presidency. Thirteen sons of presidents were born during America’s baby boom. One of the 13 became president himself, of course, and Jeb would make a second. Of the roughly 37 million boomer males who weren’t born to a president, two won the White House. Maybe it’s an anomaly that George W. Bush became president in 2001, but his advent means that in our era a son of a president was roughly 1.4 million times more likely to become president than his supposed peers.
The presidency is obviously a small sample. But the same calculations can be done for other political positions. Take governors.
Because it is difficult to be sure that you have counted all the sons of governors, let’s assume that governors reproduce at average rates. This would mean there were about 250 baby boomer males born to governors. Five of them became governors themselves, about one in 50. This is 6,000 times the rate of the average American. The same methodology suggests that sons of senators had an 8,500 times higher chance of becoming a senator than an average American male boomer.
There is some evidence that the parental advantage in politics is actually getting bigger. George W. Bush ended a 171-year drought for presidential sons. From 2003 to 2006, the Senate had the highest percentage of senators’ children — six — in its history.

Thanks, Dad!

Thirteen sons of presidents were born during America's baby boom. One, George W. Bush, also became president. Below are the odds that a boomer man matched his father's achievement — compared to the odds for the average male boomer.
Field ------------------ Followed dad's footsteps ------ Average boomer men
Billionaires ----------- 1 in 9 ---------------------------- 1 in 258,141 (Ross Perot, Ross Perot Jr.)
Presidents ------------ 1 in 13 --------------------------- 1 in 18,715,250 (George W. Bush, George Bush)
Senators -------------- 1 in 47 --------------------------- 1 in 398,197 (Al Gore Sr., Al Gore Jr.)
Governors ------------ 1 in 51 --------------------------- 1 in 306,807 (Mitt Romney, George Romney)
M.L.B. players -------- 1 in 73 --------------------------- 1 in 14,966 (Barry Bonds, Bobby Bonds)
N.F.L. players -------- 1 in 113 -------------------------- 1 in 7,220 (Dave Shula, Don Shula)
Is this electoral edge unusual? Successful parents, whatever their occupation, pass on their genes and plenty of other stuff to their kids. Do different fields have similar familial patterns?
In just about every field I looked at, having a successful parent makes you way more likely to be a big success, but the advantage is much smaller than it is at the top of politics.
Using the same methodology, I estimate that the son of an N.B.A. player has about a one in 45 chance of becoming an N.B.A. player. Since there are far more N.B.A. slots than Senate slots, this is only about an 800-fold edge.
Think about the N.B.A. further. The skills necessary to be a basketball player, especially height, are highly hereditary. But the N.B.A. is a meritocracy, with your performance easy to evaluate. If you do not play well, you will be cut, even if the team is the New York Knicks and your name is Patrick Ewing Jr. Father-son correlation in the N.B.A. is only one-eleventh as high as it is in the Senate.
The parental edge in football and baseball is much lower than it is in basketball, probably because there is less reliance on height.
I went through a wide range of fields and found a consistent pattern: greater success for the sons, but nothing like the edge a winning politician provides.
Here is the estimated parental edge for other big American prizes and positions. An American male is 4,582 times more likely to become an Army general if his father was one; 1,895 times more likely to become a famous C.E.O.; 1,639 times more likely to win a Pulitzer Prize; 1,497 times more likely to win a Grammy; and 1,361 times more likely to win an Academy Award. Those are pretty decent odds, but they do not come close to the 8,500 times more likely a senator’s son is to find himself chatting with John McCain or Dianne Feinstein in the Senate cloakroom.
THE Bush story is also telling, when we compare it to familial success in other fields.
Has any modern family dominated a meritocracy the way that the Bushes dominate politics? I could not find one. The Mannings, in football, probably come closest. But while Archie Manning, the father of two Super Bowl-winning quarterbacks, Peyton and Eli, was a solid N.F.L. player, he was hardly the football equivalent of a president.
Internationally, the greatest father-son, merit-based, same-field accomplishment is probably Niels Bohr’s son Aage matching his father’s Nobel Prize in Physics. But neither the Bohrs nor the Mannings dominated physics or football the way the Bush family dominates American politics.
Regression to the mean limits family dominance in any meritocratic field. If you have a well-above-average dose of a trait, you can expect your child to be closer to average.
Regression to the mean is so powerful that once-in-a-generation talent basically never sires once-in-a-generation talent. It explains why Michael Jordan’s sons were middling college basketball players and Jakob Dylan wrote two good songs. It is why there are no American parent-child pairs among Hall of Fame players in any major professional sports league.
The Bush family’s dominance would be the basketball equivalent of Michael Jordan being the father of LeBron James and Kevin Durant — and of Michael Jordan’s father being Walt Frazier.
In other words, it is virtually impossible, statistically speaking, that Bushes are consistently the most talented people to lead our country. Same for Chelsea Clinton or any other member of a political dynasty thought to be possible presidential timber.
Politics is not the absolute worst field in giving an advantage to certain families. In my research, I found two fields with a bigger family edge.
First is billionaires. According to my calculation, you have about a 28,000 times higher chance of being a billionaire if your father was a billionaire. And billionaires like the Waltons or the Rockefellers before them probably dominate American wealth more than the Bushes dominate American politics.
These billionaires, of course, have inherited their status, not earned it. Call me jaded, but it seems to me that most heirs to billionaires don’t do much more than marry nice-looking people and take sports franchises that I support and run them into the ground.
The second group is reality TV stars. You have about a 9,300 times edge in becoming a reality television star if your father is one. But this is precisely because some of these shows star famous people’s families.
We should not take this criticism too far. In 2008, the United States chose the mixed-race son of a Kenyan and a Kansan to be president. More than 90 percent of senators had parents who weren’t top politicians. And political campaigns can be unpredictable. For all we know, the 2016 election could be fought between Senator Elizabeth Warren, daughter of a janitor, and Gov. Scott Walker, son of a minister.
Unless of course the Democratic candidate is Andrew M. Cuomo, son of a governor, and the Republican candidate is Rand Paul, son of a congressman.
There are plenty of countries that are worse. Over the past 50 years, being the son of a leader of North Korea increased your probability of being a leader of North Korea by a factor of infinity. An infinite advantage to having a powerful father has been common in human history.
But Big Data allows us more than vague comparisons to other countries or time periods. We can see precisely how much families dominate in many different spheres, and we can see what true meritocracies look like. The data shows conclusively that we have a nepotism problem. So now the question is: Why does the modern United States tolerate this level of privilege for political name brands?