## Monday, July 18, 2016

### Interleaving Study Is Not Interleaving Learning

The study I briefly discuss here is closely related to one I wrote up on interleaving just a few months ago (no surprise, since the two studies share an author). You can find a free link to the article referenced in that post here.

In the latest research, the authors found that a blocked schedule (presenting examples from one category at a time) outperformed an interleaved schedule (interspersing examples from all the categories) for category learning when the examples to be classified were more highly discriminable. This result was consistent across the two experiments in the study (p = 0.055 and p = 0.04). Importantly, however, although interleaving was a better strategy for learning categories of lower discriminability, the effects across the experiments were much weaker (p = 0.2 and p = 0.08). Blocking had either a significant or close to significant effect, whereas interleaving didn't get nearly as close (if you like p-values, anyway).

The Study

The participants in the first experiment of the study (we'll only focus on that one here) were quite a bit older than I'm used to reading about in education studies: between 19 and 57 years old, with a mean age of 30. (This is similar to the previous study.) Subjects were divided into two groups, one of which was presented with images like these:

These images represent the four presented categories: long, steep; long, flat; short, steep; and short, flat. One subset of participants in this group was exposed to 64 of these images in a blocked schedule (16 from each category at a time) while the other was presented with the images in an interleaved schedule. Each example was appropriately labeled with a category letter (A, B, C, or D). After this initial exposure, subjects were then given a test on the same set of 64 images, randomly ordered, requiring them to assign the image to the appropriate category.

The other group of participants was presented with similar images (line segments) and in blocked or interleaved subsets. However, for this group the images were rotated 45 degrees. According to the researchers, this created a distinction between the groups in which the first was learning verbalizable, highly discriminable categories ("long and flat," etc.) whereas the second was learning categories difficult to express in words—categories of low discriminability.

Discussion, Questions, Connections

As mentioned above, the blocked arrangement of the examples produced a learning benefit for categories of higher discriminability when compared with interleaving. The same cannot be said for interleaving examples in the low-discriminability sequences, although the benefits for interleaving in these sequences were headed in the positive direction. So we are left to wonder about the positive effects of blocking here.

The authors suggest an answer by mentioning some data they are collecting in a separate pilot study: blocking makes it easier for learners to disregard irrelevant information.

We compared learning under a third study schedule (n = 26) in which the relevant dimensions were interleaved, but the irrelevant ones were blocked (i.e., this schedule was blocked-by-irrelevant-dimensions, as opposed to blocked-by-category), which was designed to draw learners’ attention to noticing what dimensions were relevant or irrelevant. On both the classification test and a test in which participants had to identify the relevant and irrelevant dimensions, this new blocked-by-irrelevant-dimensions condition yielded performance at a level comparable to the blocked condition and marginally better compared to the interleaved condition.

Therefore, although we initially hypothesized that participants, when studying one category at a time, are better able to compare exemplars from the same category and to generate and test their hypotheses as to the dimensions [that] define category membership (and this may still be true, particularly for Experiment 1), these pilot data suggest that with the addition of irrelevant dimensions . . . the blocking benefit is perhaps more likely driven by the fact that it allows participants to more easily identify and disregard the irrelevant dimensions.

This strikes me as making a good deal of sense. And it points to something that I may have previously confused: interleaving study examples is different from interleaving initial learning examples. When students are first learning something, blocking may be better; after acquisition, interleaving may benefit learners more.

We have a tendency, in my opinion, to ignore acquisition in learning. I'm not sure where this comes from. Perhaps it is believed that if we are justified in rejecting tabula rasa, we are safe to assume there are absolutely no rasas on any kid's tabula. At any rate, it's worth being clear about where in the learning process interleaving is beneficial—and where it may not be.

Postscript: It’s not unusual to believe, about a child’s cognitive subjectivity, that it is like a large glop of amorphous dough and that instruction or experience acts like a cookie cutter, shaping the child’s mind according to pre-made patterns and discarding the bits that don't fit.

But these results could suggest something different—that prior to learning, the world is a hundred trillion things that must be connected together, not a stew of sensation that must be partitioned into socially valuable units.

This may be why blocking could work well for newly learned categories and for so-called highly discriminable categories: because what is new to us is highly discriminable—separate, without precedent, meaningless.

Image credit: Danny Nicholson

Noh, S., Yan, V., Bjork, R., & Maddox, W. (2016). Optimal sequencing during category learning: Testing a dual-learning systems perspective. Cognition, 155, 23-29. DOI: 10.1016/j.cognition.2016.06.007

## Thursday, June 30, 2016

### The Problem of Stipulation

I think it would surprise people to read Engelmann and Carnine's Theory of Instruction. (The previous link is to the 2016 edition of the book on Amazon, but you can find the same book for free here.)

Tangled noticeably within its characteristic atomism and obsession with the naïve learner are a number of ideas that seem downright progressive—a label never ever attached to Engelmann or his work. One of these ideas in particular is worth mentioning—what the authors call the problem of stipulation.

The diagram at left shows a teaching sequence described in the book, in all its banal, robotic glory.

To be fair, it's much harder to roll your eyes at it (or at least it should be) when you consider the audience for whom it is intended—usually special education students.

Anyway, the sequence features a table and a chalkboard eraser in various positions relative to the table. And the intention is to teach the concept of suspended, allowing learners to infer the meaning of the concept while simultaneously preventing them from learning misrules.

Stipulation occurs when the learner is repeatedly shown a limited range of positive variation. If the presentation shows suspended only with respect to an eraser and table, the learner may conclude that the concept suspended is limited to the eraser and the table.

You may know about the problem of stipulation in mathematics education as the (very real) problem of math zombies (or maybe Einstellung)—which is, to my mind anyway, the sine qua non of anti-explanation pedagogical reform efforts.

But of course prescribing doses of teacher silence is only one way to deal with the symptoms of stipulation. Engelmann and Carnine have another.

To counteract stipulation, additional examples must follow the initial-teaching sequence. Following the learner's successful performance with the sequence that teaches slanted, for instance, the learner would be shown that slanted applies to hills, streets, floors, walls, and other objects. Following the presentation of suspended, the learner would be systematically exposed to a variety of suspended objects.

Stipulation of the type that occurs in initial-teaching sequences is not serious if additional examples are presented immediately after the initial teaching sequence has been presented. The longer the learner deals only with the examples shown in the original setup, the greater the probability that the learner will learn the stipulation.

It's a shame, I think, that more educators are not exposed to this work in college. It's a shame, too, that Engelmann's own commercial work has come to represent the theory in full flower—unjustly, I believe. Theory of Instruction could be much more than what it is sold to be, to antagonists and protagonists alike.

Update (07.13): It's worth mentioning that, insofar as the problem of stipulation can be characterized as a student's dependence on teacher input for thinking, complexity and confusion can be even more successful at creating cognitive dependence than monotony and hyper-simplicity. When students—or adults—are in a constant state of confusion, they may learn, incidentally, that the world is unpredictable and incommensurate with their own attempts at understanding. In such cases, even the unfounded opinions of authority figures will provide relief.


## Sunday, June 19, 2016

### Growing Up Too Fast or Too Slowly?

If someone were to ask me where this post came from, I would be inclined to answer "everywhere." But I can offer two or three of the connections explicitly here. My thinking of late was simple enough: could variable levels of patience with children's developmental timelines influence one's traditionalist or progressivist orientation to education?

First, I was remembering an interaction I had with a former colleague last year¹. We were talking about some math activities we were designing for third graders. What I recall from the exchange is that I was interested in being a little more helpful with the information we provided, suggesting that we had time to fade away our help (the program we were working on was serving children in Grades 3–8). She, on the other hand, argued that "students can't wait forever" to be expected to reason about mathematics. (I've been on her side of the argument many many times too, though never with a time-is-running-out rationale.)

Second, I came across, quite unexpectedly, some related, illuminating thoughts in a book called Algorithms to Live By: The Computer Science of Human Decisions. The authors attempt to provide some real-world relevance to the tradeoff between exploration and exploitation—a longstanding tension in computer science between "gathering information and . . . using the information you have to get a known good result."

One of the curious things about human beings, which any developmental psychologist aspires to understand and explain, is that we take years to become competent and autonomous. Caribou and gazelles must be prepared to run from predators the day they're born, but humans take more than a year to make their first steps.

Alison Gopnik, professor of developmental psychology at UC Berkeley . . . has an explanation for why human beings have such an extended period of dependence: "it gives you a developmental way of solving the exploration/exploitation tradeoff . . . Childhood gives you a period in which you can just explore possibilities, and you don't have to worry about payoffs because payoffs are being taken care of by the mamas and the papas and the grandmas and the babysitters . . . .

If you look at the history of the way that people have thought about children, they have typically argued that children are cognitively deficient in various ways—because if you look at their exploit capacities, they look terrible. They can't tie their shoes, they're not good at long-term planning, they're not good at focused attention."

[But] our intuitions about rationality are too often informed by exploitation rather than exploration.

Third and finally—and again, entirely and bizarrely unexpectedly—was this, which rounds out our trip from left of center to neutral to right of center (admittedly leaving the left of center unfairly underrepresented):

Treat children like children, treat grown-ups like grown-ups. An 11-year old doesn't need to teach himself, and shouldn't. A 22-year old does need to teach himself and must. And the best way to become a self-teaching 22-year old is to have teachers and parents who directly teach you when you're 11. People have known this for hundreds of years--thousands of years--and yet our public schools have somehow forgotten.

Although I'm more inclined to favor early and protracted exploration (input, learning) followed by later exploitation (performance, output) at virtually every scale in education (something that I have undoubtedly been unable to conceal even in this post), I don't intend in this writing to pass judgment one way or another²; only to suggest that where one finds oneself on the exploration-exploitation continuum is likely predictive of where one finds oneself on the traditionalist-progressivist continuum. Respectively.

Image credit: Ken Munson Photography

1. As far as education goes, I would consider myself right of center, and I would say my colleague was left of left of center. But that might be just what I would say. To her, she may have been either neutral or left of center, and I was right of right of center. That's how this game often works when you're not thinking hard about it: you're always neutral, and the other guy is the extremist.

2. I won't even mention that, given certain assumptions, the optimum balance (according to something called the Gittins Index) between exploration and exploitation is not 50-50; more like 70-30. : )

## Tuesday, May 31, 2016

In this post, we looked at the perceptron, a machine-learning algorithm that is able to "learn" the distinction between two linearly separable classes (categories of objects whose data points can be separated by a line—or, with multi-dimensional categories, by a hyperplane).

The data shown below resemble the apples and oranges data we used last time. There are two classes, or categories—in this case, the two categories are the setosa and versicolor species of the Iris flower. And each species has two dimensions (or features): sepal length and petal length. In our hypothetical apple-orange data, the two dimensions were weight and number of seeds.

Using just two dimensions allows us to plot each instance in the training data on a coordinate plane as above and draw some of the prediction lines as the program cycles through the data. For the above data, a solution is found after about 5–6 "epochs" (cycles of $$\mathtt{n}$$ runs through the data). This solution is represented by the blue dashed line.

This process is a bit clunky. The coefficients, or weights $$\mathtt{w}$$, are updated using the learning rate $$\mathtt{\eta}$$ by $$\mathtt{\Delta w_{1,2} = \pm 2\eta x_{1,2}}$$ and $$\mathtt{\Delta w_0 = \pm 2\eta}$$. This process, though it always converges to a solution so long as there is one, tends to jolt the prediction line back and forth a bit abruptly.

With gradient descent, we can gradually reduce the error in the prediction. Sometimes. We do this by making use of the sum of squared errors function—a quadratic function (parabola) that has a minimum: $\mathtt{\frac{1}{2}\sum(y - wx)^2}$

This formula shows the target vector $$\mathtt{y}$$ (the collection of target values of 1 and -1) minus the input vector, which holds the linear combination of weights and dimensions for each object in the data, $$\mathtt{w^{T}x}$$. The components of the difference vector are squared and summed, and the result is divided by 2, which gives us a scalar value that places us somewhere on the parabola. We use this "cost" value only to check whether it decreases over cycles.

Okay, so we don't know what side of the parabola we're on. In that case, we look at the opposite of the gradient of the curve with respect to the weights, or the opposite of the partial derivative of the curve with respect to the weights: $\mathtt{-\frac{\partial}{\partial w}(y - wx)^2 = -\frac{\partial}{\partial u}(u^2)\frac{\partial}{\partial w}(y - wx) = -2u \cdot -x = -2(y - wx)(-x)}$

Multiply this result by $$\mathtt{\frac{1}{2}}$$, plug that summation back in, and we get the opposite of the gradient, $$\mathtt{\sum (y - wx)(x)}$$. Finally, multiply by the learning rate $$\mathtt{\eta}$$ to get the change to each weight: $\mathtt{\eta\sum (y - wx)(x)}$
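As a rough illustration of this update rule, here is a minimal Python sketch of repeated batch gradient-descent passes. The function names `net_input` and `gradient_descent_epoch`, the four toy data points, and the learning rate of 0.01 are all assumptions made up for this example; none of them come from the Iris data or from any particular library.

```python
def net_input(weights, row):
    """Bias weight plus the weighted sum of one object's features."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], row))


def gradient_descent_epoch(weights, rows, targets, eta):
    """One pass over the data: return updated weights and the cost before the update."""
    errors = [y - net_input(weights, row) for row, y in zip(rows, targets)]
    cost = 0.5 * sum(e * e for e in errors)  # half the sum of squared errors
    updated = [weights[0] + eta * sum(errors)]  # bias weight: eta * sum(y - wx)
    for j in range(len(rows[0])):
        # feature weight j: eta * sum((y - wx) * x_j)
        change = eta * sum(e * row[j] for e, row in zip(errors, rows))
        updated.append(weights[j + 1] + change)
    return updated, cost


# Hypothetical one-feature data with targets of -1 and 1.
rows = [[1.0], [2.0], [3.0], [4.0]]
targets = [-1, -1, 1, 1]

weights = [0.0, 0.0]
costs = []
for _ in range(20):
    weights, cost = gradient_descent_epoch(weights, rows, targets, eta=0.01)
    costs.append(cost)

print(costs[0], costs[-1])  # the cost should shrink from one pass to the next
```

With a learning rate this small, the recorded cost falls on every pass. A much larger rate can overshoot the minimum, which is one reason for the "Sometimes" above.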

An Example

Let's take a look at a small example. I'll use data for just 10 flowers in the Iris data set. All of these belong to the setosa species of the Iris flower. I'll use a learning rate of $$\mathtt{\eta = 0.0001}$$.

| Sepal Length (cm) | Petal Length (cm) |
| --- | --- |
| 5.1 | 1.4 |
| 4.9 | 1.4 |
| 4.7 | 1.3 |
| 4.6 | 1.5 |
| 5.0 | 1.4 |
| 5.4 | 1.7 |
| 4.6 | 1.4 |
| 5.0 | 1.5 |
| 4.4 | 1.4 |
| 4.9 | 1.5 |

In that case, all of these instances have target values of $$\mathtt{-1}$$, which don't change as the data cycle through. Our $$\mathtt{y}$$ vector, then, is [$$\mathtt{-1, -1, -1, -1, -1, -1, -1, -1, -1, -1}$$], and our starting weights are [0, 0, 0]. The first weight here is the "intercept" weight, or bias weight, which gets updated differently from the others.

Our input vector, $$\mathtt{w^{T}x}$$, is the combination sepal length × weight 1 + petal length × weight 2 for each object in the data. At the start, then, our input vector is a zero vector: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]. The difference vector, $$\mathtt{y - w^{T}x}$$, is, in this case, equal to the y vector: just a collection of ten negative 1s.

The bias weight, $$\mathtt{w_0}$$, is updated by learning rate × the sum of the components of the difference vector, or $$\mathtt{\eta\sum(y - w^{T}x)}$$. This gives us $$\mathtt{w_0 = w_0 + 0.0001 \times -10 = -0.001}$$.

The $$\mathtt{w_1}$$ weight is updated as 0.0001(5.1 × -1 + 4.9 × -1 + 4.7 × -1 + 4.6 × -1 + 5 × -1 + 5.4 × -1 + 4.6 × -1 + 5 × -1 + 4.4 × -1 + 4.9 × -1) and the $$\mathtt{w_2}$$ weight is updated as 0.0001(1.4 × -1 + 1.4 × -1 + 1.3 × -1 + 1.5 × -1 + 1.4 × -1 + 1.7 × -1 + 1.4 × -1 + 1.5 × -1 + 1.4 × -1 + 1.5 × -1).

Through all that gobbledygook, our new weights are $$\mathtt{-0.00486}$$ and $$\mathtt{-0.00145}$$ with a bias weight of $$\mathtt{-0.001}$$. You can see below how different the gradient descent process is from the perceptron model. This model in particular doesn't have to stop when it finds a line that separates the categories, since the minimum may not be reached even when the program can classify the categories precisely.
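For anyone who wants to check the arithmetic, the first update can be reproduced with a few lines of Python. This is only a sketch of the single step described above: the variable names are my own, and including the bias weight in the net input is an assumption (it makes no difference here because every weight starts at zero).

```python
# The ten setosa flowers from the table above.
sepal = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9]
petal = [1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5]
targets = [-1] * 10          # every flower is setosa
eta = 0.0001                 # learning rate
w0, w1, w2 = 0.0, 0.0, 0.0   # bias weight and the two feature weights

# Difference vector: target minus net input for each flower.
errors = [y - (w0 + w1 * s + w2 * p) for y, s, p in zip(targets, sepal, petal)]

new_w0 = w0 + eta * sum(errors)
new_w1 = w1 + eta * sum(e * s for e, s in zip(errors, sepal))
new_w2 = w2 + eta * sum(e * p for e, p in zip(errors, petal))

# Rounded to five places these match the values in the text:
# -0.001, -0.00486, and -0.00145.
print(round(new_w0, 5), round(new_w1, 5), round(new_w2, 5))
```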

## Friday, May 27, 2016

### Enhance the Salience of Relevant Variables

This study has some connections to things I've written up here, in particular Hallowell et al. on how the salience of characteristics of solids affects the responses of young children to certain visuo-spatial tasks. When these salient characteristics are irrelevant to the task, inhibitory processes can be measured, as mentioned in studies here and again here.

The study we will look at in this post explored whether increasing the salience of the relevant variable in tasks with both relevant and irrelevant variables of competing significance would improve students' performance. In particular, researchers looked at whether increasing the salience of perimeter would help students with responses to tasks where area was irrelevant but salient.

Three groups of shape-pair categories were used, along with two levels of complexity:

In the Congruent condition, the shape with the greater area also has the greater perimeter. In the Incongruent Inverse condition, the shape with the smaller area has the greater perimeter. And in the Incongruent Equal condition, the areas differ, but the perimeters are equal.

The Experiment and Results

Participants (58 fifth- and sixth-graders) were divided into two groups. One group of students were tested first on the shapes as you see them above. The second group, however, were tested first on what researchers called the discrete mode of presentation. In this mode, the perimeters of the shapes are made more salient by drawing them using equal-sized matchsticks instead of as continuous lines:

Each group was tested on the other mode of presentation 10 days later. So, students who began with the discrete mode were tested in the continuous mode later, and students tested first in the continuous mode were tested 10 days later in the discrete mode.

The results are pretty staggering. And keep in mind that students in both groups took both tests. The order of the tests is what is primarily responsible for the dramatically different results.

Previous research in mathematics and science education has shown that specific modes of presentation may improve students’ performance (i.e., Clement 1993; Martin and Schwartz 2005; Stavy and Berkovitz 1980; Tirosh and Tsamir 1996). Our results corroborate these findings and indeed show that success rate in the discrete mode of presentation test is significantly higher than in the continuous mode of presentation test. This significant difference is evident in all conditions (congruent, incongruent inverse, and incongruent equal) and in all levels of complexity (simple and complex). The visual information depicted in the discrete mode of presentation strongly enhances the salience of the relevant variable, perimeter, and somewhat decreases the salience of the irrelevant variable, area. This visual information also emphasizes the perimeter’s segments that should be mentally manipulated when solving the task. The discrete mode of presentation, therefore, enhances the use of strategies that are regularly used when solving this task. In the continuous mode of presentation, however, no hint of such possibility of mentally breaking the solid line into relevant segments is given.

The researchers note that a similar study using discrete segments for perimeter found no effect for the order of presentation. The authors surmise that this was because, in that study, researchers did not use equal-sized discrete units, so students could not manipulate these units to determine perimeter. That is, what mattered was not discreteness itself but that the discrete units could be manipulated successfully to produce correct responses to perimeter questions. The key, still, is that insights gained from working first with these discrete, equal-sized units transferred to the continuous mode of presentation.

The positive effect of a previous presentation observed in the current study can be seen as “teaching by analogy.” In teaching by analogy students are first presented with an “anchoring task” that elicits a correct response due to the way it is presented and hence supports appropriate solution strategies. Later on, students are presented with a similar “target task” known to elicit incorrect responses. The anchoring task probably encourages appropriate solution strategies, and such a sequence of instruction was effective in helping students overcome difficulties (e.g., Clement 1993; Stavy 1991; Tsamir 2003). . . .

Performing the discrete mode of presentation test strongly enhances the salience of the relevant variable, perimeter, and somewhat decreases that of area. This enhancement supports appropriate solution strategies that lead to improved performance. This effect is robust and transfers to continuous mode of presentation for at least 10 days. In line with this conclusion, a student who performed the continuous test after the discrete one commented that, “It [continuous] was harder this time but I used the previous shapes, because I could do tricks with the matchsticks.”

Babai, R., Nattiv, L., & Stavy, R. (2016). Comparison of perimeters: improving students’ performance by increasing the salience of the relevant variable. ZDM, 48 (3), 367-378. DOI: 10.1007/s11858-016-0766-z

## Thursday, May 26, 2016

### The Perceptron

I'd like to write about something called the perceptron in this post—partly because I'm just learning about it, and writing helps me wrap my head around new-to-me things, and partly because, well, it's interesting.

It works like this. Take two categories which are, in some quantifiable way, completely different from each other. In this example, let's talk about apples and oranges as the categories. And we'll give each of these categories just two dimensions—weight in grams and number of seeds. (I'll fudge these numbers a bit to help with the example.)

Studying this information alone, you're already prepared to know with 100% confidence whether an object is an apple or an orange given its weight and number of seeds. In fact, the number of seeds is unnecessary. If you have just the two objects to choose from, and you are given only an object's weight, you have all the information you need to assign it to the apple category or the orange category.

But, you have to play along here. We want to train the computer to make the inference we just made above, given only a set of data with the weights and number of seeds of apples and oranges. Crucially, what the perceptron does is find a line between the two categories. You can easily see how to draw a line between the categories at the left (where I have plotted 100 random apples and oranges in their given ranges of weights and seeds), but the perceptron program allows a computer to learn where to draw a line between the categories, given only a set of data about the categories.

Training the Perceptron

The way we teach the computer how to draw this line is that we ask it to essentially draw a prediction line. Then we make it look at each instance (apple or orange, one at a time) and see whether the line correctly predicts the category of the object. If it does, we make no change to the line; if it doesn't we adjust the line.

The prediction line takes the standard form $$\mathtt{Ax + By + C = 0}$$. If $$\mathtt{Ax + By + C > 0}$$, we predict orange. If $$\mathtt{Ax + By + C \leq 0}$$, we predict apple. We can put the "or equal to" on either one of these inequalities, and we need the "or equal to" on one of them for this to be a truly binary decision.

Okay, so let's draw a prediction line across the data above at $$\mathtt{y = 119}$$, say. In that case, $$\mathtt{A = 0, B = 1,}$$ and $$\mathtt{C = -119}$$. When we come across apple data points, they will all be categorized correctly (as below the line), but some of the orange data points will be categorized correctly and some incorrectly. On correct categorizations, we don't want the line to update at all, and on incorrect categorizations, we want to change the line. Here's how both of these things are accomplished:

Let's take the point at (10, 109). This should be classified as orange, which we'll quantify as $$\mathtt{1}$$. The prediction line, however, gives us $$\mathtt{0(10) + 1(109) + -119}$$, which is $$\mathtt{-10 \leq 0}$$. So, the perceptron mistakenly predicts this point to be an apple, which we'll quantify as $$\mathtt{-1}$$. We want to adjust the line.

Subtract the prediction ($$\mathtt{-1}$$) from the actual (1) to get $$\mathtt{2}$$. Multiply this by a small learning rate of 0.1, so $$\mathtt{0.1 \times 2 = 0.2}$$. Finally, change $$\mathtt{A}$$, $$\mathtt{B}$$, and $$\mathtt{C}$$ like this:

$$\mathtt{A = A + 0.2 \times 10}$$

$$\mathtt{B = B + 0.2 \times 109}$$

$$\mathtt{C = C + 0.2}$$

So, in our example, $$\mathtt{A}$$ becomes $$\mathtt{0 + 0.2 \times 10}$$, or 2, $$\mathtt{B}$$ becomes $$\mathtt{1 + 0.2 \times 109}$$, or 22.8, and $$\mathtt{C}$$ becomes $$\mathtt{-119 + 0.2}$$, or $$\mathtt{-118.8}$$.

We have a new prediction line, which is $$\mathtt{2x + 22.8y + -118.8 = 0}$$. This line now makes the correct prediction for the orange at (10, 109), as you can see at the right with the blue line.
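The single update just described can be sketched in Python. The function names `predict` and `update` and the tuple layout `(A, B, C)` are my own choices for illustration, not part of any library.

```python
def predict(weights, point):
    """Return 1 (orange) if Ax + By + C > 0, otherwise -1 (apple)."""
    a, b, c = weights
    x, y = point
    return 1 if a * x + b * y + c > 0 else -1


def update(weights, point, actual, learning_rate=0.1):
    """Apply one perceptron update; a correct prediction leaves the weights alone."""
    a, b, c = weights
    x, y = point
    step = learning_rate * (actual - predict(weights, point))  # 0.2, -0.2, or 0
    return (a + step * x, b + step * y, c + step)


weights = (0.0, 1.0, -119.0)  # the starting line y = 119
new_weights = update(weights, (10, 109), actual=1)
print(new_weights)  # approximately (2.0, 22.8, -118.8)
print(predict(new_weights, (10, 109)))  # the orange is now classified as 1
```

Because the step is zero whenever the prediction matches the actual category, the same `update` function covers both the "change the line" and "leave the line alone" cases.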

But this point is only used to adjust A, B, and C (called the "weights"), and then it's on to the next point to see if the new prediction line succeeds in making a correct prediction or not. The weights change with each incorrect prediction, and these changing weights move the prediction line up (left) and down (right) and alter its slope as well.

Many Iterations

Suppose we next encounter a point at (10, 90), which should be an apple. Our perceptron, however, will make the prediction $$\mathtt{2(10) + 22.8(90) - 118.8 = 1953.2 > 0}$$, which is a prediction of orange, or 1. Subtract the prediction (1) from the actual ($$\mathtt{-1}$$) to get $$\mathtt{-2}$$, and multiply by the learning rate to get $$\mathtt{-0.2}$$. Our weights are adjusted as follows:

$$\mathtt{A = 2 + -0.2 \times 10 = 0}$$

$$\mathtt{B = 22.8 + -0.2 \times 90 = 4.8}$$

$$\mathtt{C = -118.8 + -0.2 = -119}$$

If you graph this new line, $$\mathtt{0x + 4.8y + -119 = 0}$$, you'll notice that it's between the other two, but it would still make the incorrect categorization for the apple at (10, 90). This is why the perceptron must cycle through the data several times in order to "train" itself into determining the correct line. If you can make sense of it, there is a proof that this algorithm always converges in finite time for data that can be separated by a line (with a sufficiently small learning rate).
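To see the cycling in action, here is a small Python sketch of the training loop. It uses only the two points discussed above (the orange at (10, 109) and the apple at (10, 90)) rather than the full plotted data set, and the `train` function, its `max_epochs` cap, and its return values are assumptions for illustration.

```python
def predict(weights, point):
    """Return 1 (orange) if Ax + By + C > 0, otherwise -1 (apple)."""
    a, b, c = weights
    x, y = point
    return 1 if a * x + b * y + c > 0 else -1


def train(points, labels, weights, learning_rate=0.1, max_epochs=1000):
    """Cycle through the data until a full pass makes no mistakes.

    Returns the final weights and the number of the first error-free pass,
    or None in place of that number if max_epochs runs out first.
    """
    for epoch in range(1, max_epochs + 1):
        mistakes = 0
        for point, actual in zip(points, labels):
            step = learning_rate * (actual - predict(weights, point))
            if step != 0:  # only incorrect predictions move the line
                a, b, c = weights
                x, y = point
                weights = (a + step * x, b + step * y, c + step)
                mistakes += 1
        if mistakes == 0:
            return weights, epoch
    return weights, None


points = [(10, 109), (10, 90)]  # (number of seeds, weight in grams)
labels = [1, -1]                # orange, apple
weights, epochs = train(points, labels, weights=(0.0, 1.0, -119.0))
print(weights, epochs)
```

Even with just two points, the line swings back and forth for a number of passes before settling, which is the jolting behavior that motivates gentler update rules.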

Other Notes

It's worth mentioning that the perceptron finds a line between the categories, but there are an infinite number of lines available. Also, in this example, we have two dimensions, or features, but the perceptron works for as many dimensions as you please.

So, to get schmancy, for each object in the data, $$\mathtt{z}$$, our prediction function takes a linear combination of weights (coefficients + that C intercept weight) and coordinates (features, or dimensions) $$\mathtt{z = w_{0}x_{0} + w_{1}x_{1} + \ldots + w_{m}x_{m} = w^{T}x}$$ and outputs $\theta(z) = \left\{\begin{array}{ll} \color{white}{-}1, & \quad z > 0 \\ -1, & \quad z \leq 0 \end{array} \right.$

Even with multi-dimensional vectors, the weights are updated just as above, perhaps with a different learning rate, chosen at the beginning.
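As a sketch of how this generalizes, the same update can be written for any number of features in Python. Here $$\mathtt{x_0}$$ is treated as a constant 1 so that $$\mathtt{w_0}$$ plays the role of the intercept weight; the function names and the third feature in the example (a made-up diameter measurement) are hypothetical.

```python
def net_input(weights, features):
    """z = w0 * 1 + w1 * x1 + ... + wm * xm."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))


def theta(z):
    """Threshold function: 1 if z > 0, otherwise -1."""
    return 1 if z > 0 else -1


def update(weights, features, actual, learning_rate):
    """One perceptron update for an object with any number of features."""
    step = learning_rate * (actual - theta(net_input(weights, features)))
    bias = weights[0] + step
    return [bias] + [w + step * x for w, x in zip(weights[1:], features)]


# Hypothetical orange with three features: seeds, weight, diameter.
weights = [0.0, 0.0, 0.0, 0.0]
orange = [10, 109, 7.5]
weights = update(weights, orange, actual=1, learning_rate=0.1)
print(weights)  # approximately [0.2, 2.0, 21.8, 1.5]
```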

Below is a Python implementation of the Perceptron from Sebastian Raschka, using a classic data set about irises (the flowers).

## Tuesday, May 24, 2016

### Is Area All About Rectangles?

A very common way of teaching students the reasoning behind the formula for the area of a non-rectangular parallelogram is to demonstrate how to take apart such a parallelogram and turn it into a rectangle. Some U.S. states' elementary and middle-school mathematics standards used to be fairly explicit about this "way" of teaching. Here are two examples—the first from California, the second from Florida:

Derive and use the formula for the area of a triangle and of a parallelogram by comparing it with the formula for the area of a rectangle (i.e., two of the same triangles make a parallelogram with twice the area; a parallelogram is compared with a rectangle of the same area by cutting and pasting a right triangle on the parallelogram).

Derive and apply formulas for areas of parallelograms, triangles, and trapezoids from the area of a rectangle.

The NCTM's 2000 publication Principles and Standards for School Mathematics (PSSM) also endorsed this method as consistent with stimulating students' understanding of, and investigation into, area:

One particularly accessible and rich domain for such investigation is areas of parallelograms, triangles, and trapezoids. Students can develop formulas for these shapes using what they have previously learned about how to find the area of a rectangle, along with an understanding that decomposing a shape and rearranging its component parts without overlapping does not affect the area of the shape.

And, from the third edition of this well-known book by Van de Walle, we have the following:

Parallelograms that are not rectangles can be transformed into rectangles having the same area. The new rectangle has the same height and two sides the same as the original parallelogram. Students should explore these relationships on grid paper, on geoboards, or by cutting paper models, and should be quite convinced that the areas are the same and that such reassembly can always be done with any parallelogram. As a result, the area of a parallelogram is base times height, just as for rectangles.

I should note that, of the four snippets I present above, only Van de Walle's even makes an attempt at a consistent distinction between non-rectangular parallelograms ("parallelograms that are not rectangles") and rectangles. But then even in the Van de Walle snippet, this distinction breaks down in the last sentence.

Consistent with the standards and the specific observations and suggestions in the widely referenced publications listed above, textbook lessons typically introduce students to the idea that the area of a non-rectangular parallelogram can be described by the same formula as that used to describe the area of a rectangle with the same base and height, using one or more illustrations like the one at right, accompanied by written instruction to help students understand the illustration.

Generally, the purpose of this written instruction is to clarify for students that (a) a part of the non-rectangular parallelogram was simply removed and then re-attached somewhere else on the figure, a transposition that does not change the area of the figure, and (b) the resulting figure is a rectangle whose height and base are the same as those of the non-rectangular parallelogram.

But this progression has always struck me as a little backward, because rectangles and squares are special kinds of parallelograms. Deriving or developing a formula to describe the area of certain non-rectangular parallelograms based on what is taught about the area of rectangles (as is often done in this instruction) is a bit like arguing that the numbers 3, 1, 11, and 45 are all integers because you can count up from them or count down from them and reach the integer 9.

It turns out in both cases that what is stated is true—non-rectangular parallelograms and rectangles share the same area formula, and the numbers above, along with the number 9, are all integers. And it so happens that in both of these cases, the reasons are difficult to argue with as well—transforming a non-rectangular parallelogram into a rectangle without gaining or losing area is certainly a convincing demonstration, and so long as one construes "counting" as being restricted to integers, the reason given above for the "integer-ness" of 3, 1, 11, and 45 is satisfactory, albeit sloppy.

Still, the "integer-ness" of integers has nothing to do with the number 9, and the reason the area of a parallelogram can be described by the formula $$\mathtt{A = bh}$$ or $$\mathtt{A = lw}$$ may have nothing in general to do with rectangles and everything to do with the fact that parallel lines are at all corresponding points the same distance apart.

Is it not the case that the shaded figures shown have the same area (if their opposite sides are parallel)? Cavalieri's Principle, no? But this goes back to Euclid, too. Also, there's a connection to bivectors in higher-level math.
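Here is one rough way to sketch that Cavalieri-style argument. The coordinate setup is my own assumption, not something taken from the standards or books quoted above. Place the parallelogram so that one side of length $$\mathtt{b}$$ lies along a horizontal line and the opposite side lies on a parallel line at height $$\mathtt{h}$$. Because the two remaining sides are parallel to each other, every horizontal slice between those lines has the same width $$\mathtt{b}$$, no matter how far the slant shifts it sideways. Adding up the slices gives

$$\mathtt{A = \int_{0}^{h} b \, dy = bh}$$

and the rectangle is simply the case in which the slices are not shifted at all.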

This idea (perhaps a slightly more general idea about area) awaits actual fleshing out in lessons, where the other principles (clarity, order, and cohesion), along with precision again, must be considered. If, though, we are to keep this topic (parallelogram area) where it typically falls in a mathematics curriculum, then we would need to move up discussions about parallel lines from where they typically fall, and such discussions could no longer be considered in isolation.

## Saturday, May 7, 2016

### Making "Connections"

I'm nearing the end of my read of James Lang's terrific book Small Teaching, and I've wanted for the last 100 pages or so to recommend it highly here.

While I do that, however, I'd also like to mention a confusion that piqued my interest near the middle of the book—a very common false distinction, I think, between 'making connections' and knowing things. Lang sets it up this way:

When we are tackling a new author in my British literature survey course, I might begin class by pointing out some salient feature of the author's life or work and asking students to tell me the name of a previous author (whose work we have read) who shares that same feature. "This is a Scottish author," I will say. "And who was the last Scottish author we read?" Blank stares. Perhaps just a bit of gaping bewilderment. Instead of seeing the broad sweep of British literary history, with its many plots, subplots, and characters, my students see Author A and then Author B and then Author C and so on. They can analyze and remember the main works and features of each author, but they run into trouble when asked to forge connections among writers.

What immediately follows this paragraph is what one would expect from a writer who has done his homework on the research: Lang reminds himself that his students are novices and he an expert; his students' knowledge of British literature and history is "sparse and superficial."

But then, suddenly, comes the false distinction: 'knowledge' takes on a different meaning, becoming synonymous with "sparse and superficial," and his students have it again:

In short, they have knowledge, in the sense that they can produce individual pieces of information in specific contexts; what they lack is understanding or comprehension.

And they lack comprehension, even more shortly, because they lack connections.

Nope, Still Knowledge

As we saw here, with the Wason Selection Task, reasoning ability itself is dependent on knowledge. Participants who were given abstract rules had tremendous difficulties with modus tollens reasoning in particular, yet when these rules were set in concrete contexts, the difficulties all but vanished.

One might say, indeed, that in concrete contexts, the connections are known, not inferred. Thus, if you want students to make connections among various authors, it might help to tell them that they are connected, and how.

## Monday, May 2, 2016

### The Wason Selection Task, Part II

Before we sink our teeth even deeper into the Wason Selection Task, we should look briefly at conditional reasoning arguments.

Conditional reasoning generally starts with the statement "if P is true, then Q is true" or "if P, then Q" (P → Q). For example, the statement "If this tree is a spruce (P), then it has needles (Q)" is a conditional statement.

There are four types of conditional reasoning arguments that apply to the Wason Selection Task—two of them valid and two of them logically invalid. Each of these introduces a different second statement, after the "if P, then Q" formulation. That is, each of the following four types of conditional reasoning arguments can be identified according to the statement that comes right after the "if P, then Q" statement.

• Modus Ponens (P is true): This argument proceeds as follows: If P is true, then Q is true. P is indeed true. Therefore, Q is true. This is a valid form of reasoning. Example: If this tree is a spruce (P), then it has needles (Q). This tree is indeed a spruce (P). Therefore, this tree has needles (Q).
• Denying the Antecedent (P is not true): This is a fallacy and proceeds as follows: If P is true, then Q is true. P is not true. Therefore, Q is not true. Example: If this tree is a spruce (P), then it has needles (Q). This tree is not a spruce (not P). Therefore, this tree does not have needles (not Q).
• Affirming the Consequent (Q is true): This is also a fallacy and proceeds as follows: If P is true, then Q is true. Q is indeed true. Therefore, P is true. Example: If this tree is a spruce (P), then it has needles (Q). This tree indeed has needles (Q). Therefore, this tree is a spruce (P).
• Modus Tollens (Q is not true): This argument proceeds as follows: If P is true, then Q is true. Q is not true. Therefore, P is not true. Example: If this tree is a spruce (P), then it has needles (Q). This tree does not have needles (not Q). Therefore, this tree is not a spruce (not P).

It is important to note that the terms valid and invalid used to describe these arguments tell us nothing about the correctness of their conclusions. For example, each of these lines of reasoning is logically invalid . . .

• Affirming the Consequent: If today is June 1, then tomorrow is June 2. Tomorrow is indeed June 2. Therefore, today is June 1.
• Denying the Antecedent: If today is June 1, then tomorrow is June 2. Today is not June 1. Therefore, tomorrow is not June 2.

. . . even though they are undeniably "correct," as far as that goes. Formal logic does not concern itself necessarily with the contents of arguments, only their form.
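To make the point about form concrete, here is a minimal sketch that checks the four argument types by brute force over truth values. It assumes the usual truth-table reading of "if P, then Q" (false only when P is true and Q is false); the names `implies`, `FORMS`, and `is_valid` are mine, chosen just for illustration.

```python
from itertools import product


def implies(p, q):
    """'If p, then q' is false only when p is true and q is false."""
    return (not p) or q


# Each form pairs a second premise with a conclusion, both written as
# functions of the truth values (p, q). The first premise is always
# "if P, then Q".
FORMS = {
    "modus ponens": (lambda p, q: p, lambda p, q: q),
    "denying the antecedent": (lambda p, q: not p, lambda p, q: not q),
    "affirming the consequent": (lambda p, q: q, lambda p, q: p),
    "modus tollens": (lambda p, q: not q, lambda p, q: not p),
}


def is_valid(second_premise, conclusion):
    """Valid: no assignment makes both premises true and the conclusion false."""
    for p, q in product([True, False], repeat=2):
        premises_hold = implies(p, q) and second_premise(p, q)
        if premises_hold and not conclusion(p, q):
            return False
    return True


validity = {name: is_valid(*form) for name, form in FORMS.items()}
for name, valid in validity.items():
    print(f"{name}: {'valid' if valid else 'invalid'}")
```

Nothing in the check knows anything about spruces or dates in June. It looks only at the shape of the argument, which is why the two "June" arguments above come out invalid even though their conclusions happen to be true.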

Applying the Rules to the Selection Task

The rule given in any Wason Selection Task is considered to be a statement of the form "if P, then Q." So, the rule "every person that has an alcoholic drink is of legal age" that I included in my previous post might be recast as "if alcoholic drink (P), then legal age (Q)." Similarly, in the more formal version, the rule "every card that has a D on one side has a 3 on the other" might be recast as "if D (P), then 3 (Q)."

Accordingly, each of the four answer choices in a Wason Selection Task is seen as the second statement in a conditional reasoning argument—either a statement about P (i.e., P is true [P] or P is not true [∼P]) or a statement about Q (i.e., Q is true [Q] or Q is not true [∼Q]).

In the "drinking" version of the selection task, for example, statements about drink type are the Ps and statements about age are the Qs:

| soda | 18 | vodka | 29 |
|:---:|:---:|:---:|:---:|
| Not Alcoholic (∼P) | Not Legal Age (∼Q) | Alcoholic (P) | Legal Age (Q) |

The Ps and Qs for the more formal task would be assigned this way:

| D | K | 3 | 7 |
|:---:|:---:|:---:|:---:|
| D (P) | Not D (∼P) | 3 (Q) | Not 3 (∼Q) |

In each case, the statements that serve as the second premise of a valid argument (i.e., P for modus ponens and ∼Q for modus tollens) indicate those cards (or people) that must be checked to determine whether the if-then rule has been violated.
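The same selection can be sketched in a few lines of code. This is only an illustration under my own encoding: each card is recorded as the side it shows ("P" or "Q") and whether that statement is true, following the P and Q assignments above. The names `can_violate` and `cards_to_check` are hypothetical.

```python
def can_violate(side, value):
    """Return True if some hidden face could make P true and Q false."""
    for hidden in (True, False):
        # Pair the visible truth value with each possible hidden one.
        p, q = (value, hidden) if side == "P" else (hidden, value)
        if p and not q:
            return True
    return False


def cards_to_check(cards):
    """List the labels of the cards that could reveal a violation of 'if P, then Q'."""
    return [label for label, (side, value) in cards.items()
            if can_violate(side, value)]


# Visible face -> (which statement it concerns, whether that statement is true)
formal_cards = {
    "D": ("P", True),
    "K": ("P", False),
    "3": ("Q", True),
    "7": ("Q", False),
}
drinking_cards = {
    "soda": ("P", False),
    "18": ("Q", False),
    "vodka": ("P", True),
    "29": ("Q", True),
}

print(cards_to_check(formal_cards))    # the P card and the not-Q card
print(cards_to_check(drinking_cards))  # likewise for the drinking version
```

Only a card showing P or ∼Q can hide a violation, so the sketch picks D and 7 in the formal task and vodka and 18 in the drinking task. The K and 3 cards (and soda and 29) cannot break the rule whatever is on their other side.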

### The Wason Selection Task, Part I

The Pope, a nun, Kermit the Frog, and Bruce Lee are all sitting at a bar. Well, actually it's just four people, represented by the cards below.

| soda | 18 | vodka | 29 |
|:---:|:---:|:---:|:---:|

Each person has an age and a drink type, but you can see only one of these for each person. Here is a rule: "every person that has an alcoholic drink is of legal age." Your task is to select all those people, but only those people, that you would have to check in order to discover whether or not the rule has been violated.

Most people have little trouble picking the correct answer above. But, "across a wide range of published literature only around 10% of the general population" finds the correct answer to the infamous Wason selection task shown below:

| D | K | 3 | 7 |
|:---:|:---:|:---:|:---:|

Each card has a letter on one side and a number on the other, but you can see only one of these for each card. Here is a rule: "every card that has a D on one side has a 3 on the other." Your task is to select all those cards, but only those cards, which you would have to turn over in order to discover whether or not the rule has been violated.

In fact, Matthew Inglis and Adrian Simpson (2004) found that mathematics undergraduates as well as mathematics academic staff, though performing significantly better than history undergraduates, performed unexpectedly poorly on the task, with only 29% of math undergrads and a shocking 43% of staff finding the correct answer.

In a chapter from The Cambridge Handbook of Expertise and Expert Performance, Paul Feltovich, Michael Prietula, and K. Anders Ericsson indicate the one factor that explains these differential results: knowledge.

Some studies showed reasoning itself to be dependent on knowledge. Wason and Johnson-Laird (1972) presented evidence that individuals perform poorly in testing the implications of logical inference rules (e.g., if p then q) when the rules are stated abstractly. Performance greatly improves for concrete instances of the same rules (e.g., 'every time I go to Manchester, I go by train'). Rumelhart (1979), in an extension of this work, found that nearly five times as many participants were able to test correctly the implications of a simple, single-conditional logical expression when it was stated in terms of a realistic setting (e.g., a work setting: 'every purchase over thirty dollars must be approved by the regional manager') versus when the expression was stated in an understandable but less meaningful form (e.g., 'every card with a vowel on the front must have an integer on the back').

Reference: Inglis, M., & Simpson, A. (2004). Mathematicians and the Selection Task. Proceedings of the 28th Conference of the International Group for the Psychology of Mathematics Education, Vol. 3, pp. 89-96.