Friday, August 1, 2014

A half-baked unit idea for hypothesis testing

The first topic that broke me as a teacher was hypothesis testing, back when I was a student teacher.  It broke me so badly that I totally changed my stats course to better engage students in the statistical process.  It was the catalyst that fixed so many other things in my class, and it did improve how I teach hypothesis testing, but kids still struggle with it more than anything else.

At TMC14 last week, one huge theme in many of the sessions and "My Favorites" was getting students to talk about math.  The example that stood out to me was Chris Luzniak (@PIspeak)'s first session on debating in math class.  He focused on structured arguments.  When students raised their hand to share, he asked them to make a claim (statement) and then give a warrant (reason).  He builds a culture of listening, clearly stating a view, and backing up that view with sound logic.  Chris uses mostly informal debate (more like a classroom discussion), but for the purposes of this new unit I think I want it to be more of a prepared debate.

I am trying to envision a unit where hypothesis testing is taught in multiple ways and students, as their unit project, work in teams to defend their preferred method for making decisions with data.  I see two big questions that should be very approachable by my students.  The first is whether to use simulation or a probability model to obtain a p-value:

  1. The AP Statistics curriculum, along with nearly every other stats course, uses a probability model (the normal curve) as the basis for inference.  Because of the Central Limit Theorem, even the craziest of distributions have a very normal looking sampling distribution if the sample size is large enough (often >30).  This method for doing inference is clean and analytical (could be worked out by hand).  The downside is that, like all models, there are some assumptions that must be made that are not always valid, especially with smaller samples.
  2. The preferred method of many programmers (including me) is to create a very simple model in software and repeat it thousands of times to create a sampling distribution.  This method is called bootstrapping.  Even though programming may not be accessible to most students, tools with clean interfaces such as StatKey make it easy to perform the calculations.  This method for inference is very easy to understand and explain (drawing values from a hat over and over), and since computers are fast, repeating it many thousands of times is cheap.  It also does not require any assumptions about the size or shape of the sample's distribution.
  3. If you understand Bayesian hypothesis testing, you could have kids argue for the need to use likelihood ratios based on meaningful prior probabilities.  Despite spending half of the day researching it, I really don't understand it well enough to explain it to anyone else, so this option is out for me.  If you want to learn, one awesome option is a free book one of my college computing professors wrote called "Think Bayes".
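To make option 2 concrete, here is a minimal Python sketch of building a bootstrapped sampling distribution by hand.  The fifteen measurements are invented for illustration, and this is just the resampling idea itself, not any particular tool's implementation:

```python
import random
import statistics

def bootstrap_means(sample, n_boot=5000, seed=0):
    """Draw values 'from a hat' (with replacement) over and over,
    recording the mean of each resample to build a sampling distribution."""
    rng = random.Random(seed)
    return [statistics.mean(rng.choices(sample, k=len(sample)))
            for _ in range(n_boot)]

# Hypothetical sample of 15 measurements
sample = [11.6, 11.8, 12.1, 11.9, 11.7, 12.0, 11.8, 11.5,
          11.9, 12.2, 11.7, 11.8, 12.0, 11.6, 11.9]
dist = bootstrap_means(sample)
```

Notice that no normal curve appears anywhere: the shape of `dist` comes entirely from the data, which is the whole appeal of the approach.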
The second question is why we need p-values at all:
  1. The traditional hypothesis test starts with a default claim and seeks evidence suggesting it is unlikely.  The statistician starts with a threshold level, such as 5%, and tries to show that the probability of finding data like theirs, assuming the null hypothesis is true, falls below that threshold.  The actual probability, the p-value, is often reported along with the decision to reject or fail to reject the null.  This is seen in nearly all science publications.
  2. Another viewpoint held by a minority of stats folks (including me) is to not use p-values at all.  Instead, a correctly computed confidence interval could be compared to a null hypothesis to see if the null mean is captured by the interval or not.  For example, a soda pop can of Coke (covered all my regions there...) claims to have 12oz of liquid inside.  Suppose I am trying to show that it holds less than 12oz in a one-sided test at the threshold of 5%.  Instead of finding a p-value, I could create a confidence interval around my data.  Since I want each tail of the interval to have 5%, the interval would be the middle 90%.  If my 90% confidence interval were 11.5oz to 11.9oz, the entire interval would fall below 12oz, and I could conclude that I was being ripped off.  The reason I like this approach, despite its apparent complexity, is that it doesn't just tell you that you are getting ripped off -- it tells you how much pop you can expect to get in the can.
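The soda-can reasoning in option 2 can be sketched end to end.  Everything here is hypothetical: the twenty can measurements are invented, and the percentile method shown is only one of several ways to build a bootstrap interval:

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical measurements (oz) from 20 cans, invented for illustration
cans = [11.7, 11.8, 11.6, 11.9, 11.7, 11.5, 11.8, 11.6, 11.9, 11.7,
        11.8, 11.6, 11.7, 11.9, 11.5, 11.8, 11.7, 11.6, 11.8, 11.7]

# Bootstrap the sampling distribution of the mean
boot = sorted(statistics.mean(rng.choices(cans, k=len(cans)))
              for _ in range(5000))

# Middle 90% (5% in each tail) matches a one-sided test at the 5% level
lo = boot[int(0.05 * len(boot))]
hi = boot[int(0.95 * len(boot))]

# If the whole interval sits below 12oz, the 12oz claim looks wrong,
# and (lo, hi) also tells you how much pop to actually expect
ripped_off = hi < 12.0
```

The payoff is the last line: the same interval that settles the test also estimates the amount in the can.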
The point is not for the teacher to ride their high horse about their answers to each question.  The fact is that intelligent people support each of these answers, so they could all be defended in front of the class.  In order to make a good defense, each group needs to understand their opponents' responses well enough to counter their arguments.  The normal-curve based model will be fairly non-intuitive to my students since I plan to introduce them to inference using bootstrapped confidence intervals in a unit on infographics.  They will need to understand enough about the Central Limit Theorem to argue why the normal curve is a good model and know the assumptions that are made to use it.  They will want to demonstrate how easy calculation can be with a TI-83 and how easy it is to explain with a sketch of the normal curve.  The simulator groups will need to understand why every simulation yields a slightly different answer and why increasing the number of bootstrapped samples converges the answer to a more precise value.  They need to understand why they don't need to check any assumptions (other than having an SRS) like the other groups do.  There are a similarly large number of key ideas about the second question that need to be well understood to properly debate it as well.

The big graded tasks in the unit would probably be giving and receiving feedback in mini-debates between paired teams, a final debate between groups in front of the entire class, and a write-up after the debates picking a personal stance and communicating all that they learned in the process.

The unit would be kicked off with the big question "How can you use data to make decisions?"  From there, we would generate a list of topics we needed to know more about in order to get ready for the debate.  Some of these big ideas I would lecture on for the class, while less conceptual ideas would be left for students to watch in my videos.  I would also give short quizzes to make sure students were grasping the basics of hypothesis testing and the different approaches so I could target struggling students and their groups for quicker intervention.

So...I need help.  I'm sure there are a ton of holes in this concept or things I did not clarify that really matter.  I would love to not only turn this into an awesome unit for my classes, but for everyone who would want to use it, so please poke holes as ruthlessly as if it were your own curriculum.  Thanks!


  1. So much going on in this post. I would suggest you post it to the AP Stats message board as well, and I am confident you will receive many measured responses.

    The "big idea" for students in Stats is to reach a defensible conclusion through data. As you note, this can be done in many ways. As a reader, I see how many students have been taught to approach testing. A majority use a "canned" hypothesis test approach: memorized lines, hokey acronyms, without clear understanding of what is happening. But the fun in reading papers is seeing breakthroughs in pedagogy evident on the paper; well-constructed responses which demonstrate infusion of stats into the context. I continue to use the traditional p-value in my classroom, but have increased the expectation for clear communication of the meaning of the p-value, and have also worked to keep CIs and p-value approaches from feeling like different worlds.

    I agree with the confidence interval approach, as it provides information on an estimate, as well as some basis for a hypothesis test conclusion. The problem here is that we would then need to teach P-values as new when we get to scenarios which don't lend themselves to CI's, like Chi-Squared. Ruth Carver has done a lot of work with simulation tests and swears by them. Not sure how much she has available online, but try a Google search for her.

  2. Thanks Bob. Great idea to post to AP list -- I will have to figure out how to do that, but I can't imagine it's too hard. I also looked up Ruth -- she actually teaches it the same way I did last year, so that shouldn't be hard to incorporate :).

    As for having to teach p-values anew later, I agree. I guess I should clarify that I would have students not only learn one method, but that they would learn all of them and, in a debate setting, need to formally defend their method as the best method. I will have to figure out more about how the AP readers decide what makes a well constructed (but unusual) response, but I would want to assess their ability to present a conclusion in each format with all of them checking out as valid.

  3. Great post! I have some useful comments that I'm too rushed (and hungry) to make right now, but I gotta say THANKS A MILLION!!! for the link to Think Bayes! The first few pages tell me I have to read the rest ASAP.

  4. For some reason I do testing—using randomization—before doing the bootstrap. But I actually get to teach stats again in the Spring and I will think about your order. As to the bootstrap itself, and whether it "works," I've done a little experimenting. Very little. It's not much. But I posted it here:

    The upshot is that with small samples and the three simple source distributions I looked at, you can't make good probability statements about whether a bootstrap interval includes the population statistic. That doesn't actually bother me, but it might bother you! Larger samples (N > 50, say) no problem.

    Also, a terminology thing that may not be an issue: In computing and simulation, where we do things a jillion times, the word "bootstrap" is broad, covering all sorts of uses of randomness and Monte Carlo techniques. In stats, I think we use "bootstrap" more narrowly to refer to the procedure where you sample n times with replacement from your sample in order to get a sample statistic (usually the mean) that you collect into a sampling distribution, with the goal of getting an interval estimate—as you describe—and NOT a P-value. A hypothesis-testing procedure (e.g., scrambling group assignments and looking at difference of means, and doing that a jillion times) where you get P-values isn't called a "bootstrap," though: that's a randomization (or sometimes "permutation") test.
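    A randomization test of that sort (scrambling group assignments a jillion times) fits in a few lines of Python. The heart-rate numbers below are invented purely to show the label-scrambling mechanic:

```python
import random
import statistics

def permutation_test(group_a, group_b, n_perm=5000, seed=1):
    """Randomization (permutation) test: scramble the group labels many
    times and count how often the scrambled difference of means is at
    least as extreme as the observed one. Returns a two-sided p-value."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # scramble which values land in which group
        diff = abs(statistics.mean(pooled[:n_a]) -
                   statistics.mean(pooled[n_a:]))
        if diff >= observed:
            count += 1
    return count / n_perm

# Hypothetical resting vs. post-exercise heart rates (beats per minute)
before = [68, 72, 70, 65, 74, 69, 71, 67]
after_ex = [88, 95, 90, 84, 99, 91, 93, 86]
p = permutation_test(before, after_ex)
```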

    1. Your blog post evaluating bootstrap intervals is awesome -- I just commented with some more thoughts. Specifically relating to this, I struggle with how the randomization test for a single mean actually works. From playing with StatKey, my best guess was that it created a sampling distribution using resampling with replacement, and then shifted that distribution so the mean lined up with the null hypothesis. I hope I'm wrong because I don't get why this is a good thing to do (other than the fact that we do it in a similar way with traditional inference). If you can explain why I'm wrong or why being this way is okay, that would help me a lot. All of the two-sample randomization tests make sense, but the one-sample stuff is just weird to me.
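      To make my guess concrete, here is what I imagine the shift-then-resample procedure looks like. This is only my reading of it, not StatKey's documented algorithm, and the can data is invented:

```python
import random
import statistics

def one_sample_randomization_p(sample, null_mean, n_rep=5000, seed=2):
    """Shift the sample so its mean equals the null mean, then resample
    with replacement from the shifted data; the p-value is the fraction
    of resampled means at least as far below the null as the observed
    mean (one-sided, low side)."""
    rng = random.Random(seed)
    observed = statistics.mean(sample)
    shift = null_mean - observed
    shifted = [x + shift for x in sample]  # now centered on the null mean
    count = 0
    for _ in range(n_rep):
        m = statistics.mean(rng.choices(shifted, k=len(shifted)))
        if m <= observed:  # as extreme as the actual data
            count += 1
    return count / n_rep

# Hypothetical measurements (oz) from 10 cans
cans = [11.2, 12.3, 11.6, 12.0, 11.4, 11.9, 11.7, 12.1, 11.5, 11.8]
p = one_sample_randomization_p(cans, null_mean=12.0)
```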

  5. I just found your blog through the TMC14 website (I went to TMC13 but haven't been too active recently in the MTBoS) and I feel like I could not have found it at a better time! I’m currently in the process of revamping my stats course in a similar direction. I went that direction a couple of years ago using StatKey and Lock5, but the execution was flawed enough that I went back to the old way; now I think I’m more ready for it. So if you want to bounce ideas around on those topics, hit me up!

    Anyway, I really like the idea of a debate, but one of my concerns would be that many of the general high level arguments for and against the different methods may not be accessible. For example, the examination of bootstrap CIs that bestcase linked to above would be a nontrivial coding exercise for many kids I think. Regardless, I think it is an EXCELLENT idea to have students defend not just their conclusion, but their choice of method. For example, when comparing class heart rates before and after running up and down the stairs (always a popular activity), I used to have them construct a CI for difference in means, but I’m sure the activity would be improved by giving them a choice of CI vs HT, simulation vs not, and defending that choice.

    Anyway, Chris probably already showed this to you, but if not, I think it’s helpful to see even a short video of his debate protocols in the classroom. I went to a PD he ran in NYC and the classroom footage made the ideas much more concrete.

    1. Thanks David. As you look at your stats curriculum for this year, feel free to look at how I organize the course or my actual class resources. I hope they can spark some ideas even if the actual content isn't helpful for you.

      As for the accessibility curve for students getting into any deep comparison or analysis, I agree with you. Instead of being a barrier, it sounds like a late-night coding project -- if I could convert the key components of Tim's program into JavaScript and get it online so students could play with parameters in the browser (and hopefully not crash their computers), they could benefit from 80% of the learning with 20% of the time and skills.

      Thank you also for the link to Chris's video -- I'm not sure if I would want a stand-up informal debate like the video demonstrated, a more formal debate activity, or just a written justification comparing and contrasting methods.