## Sunday, June 19, 2016

### Growing Up Too Fast or Too Slowly?

If someone were to ask me where this post came from, I would be inclined to answer "everywhere." But I can offer two or three of the connections explicitly here. My thinking of late was simple enough: could variable levels of patience with children's developmental timelines influence one's traditionalist or progressivist orientation to education?

First, I was remembering an interaction I had with a former colleague last year.¹ We were talking about some math activities we were designing for third graders. What I recall from the exchange is that I wanted to be a little more helpful with the information we provided, suggesting that we had time to fade away our help (the program we were working on served children in Grades 3–8). She, on the other hand, argued that "students can't wait forever" to be expected to reason about mathematics. (I've been on her side of the argument many, many times too, though never with a time-is-running-out rationale.)

Second, I came across, quite unexpectedly, some related, illuminating thoughts in a book called Algorithms to Live By: The Computer Science of Human Decisions. The authors attempt to provide some real-world relevance to the tradeoff between exploration and exploitation—a longstanding tension in computer science between "gathering information and . . . using the information you have to get a known good result."

One of the curious things about human beings, which any developmental psychologist aspires to understand and explain, is that we take years to become competent and autonomous. Caribou and gazelles must be prepared to run from predators the day they're born, but humans need more than a year to take their first steps.

Alison Gopnik, professor of developmental psychology at UC Berkeley . . . has an explanation for why human beings have such an extended period of dependence: "it gives you a developmental way of solving the exploration/exploitation tradeoff . . . Childhood gives you a period in which you can just explore possibilities, and you don't have to worry about payoffs because payoffs are being taken care of by the mamas and the papas and the grandmas and the babysitters . . . .

If you look at the history of the way that people have thought about children, they have typically argued that children are cognitively deficient in various ways—because if you look at their exploit capacities, they look terrible. They can't tie their shoes, they're not good at long-term planning, they're not good at focused attention."

[But] our intuitions about rationality are too often informed by exploitation rather than exploration.

Third and finally—and again, entirely and bizarrely unexpectedly—was this, which rounds out our trip from left of center to neutral to right of center (admittedly leaving the left of center unfairly underrepresented):

Treat children like children, treat grown-ups like grown-ups. An 11-year-old doesn't need to teach himself, and shouldn't. A 22-year-old does need to teach himself and must. And the best way to become a self-teaching 22-year-old is to have teachers and parents who directly teach you when you're 11. People have known this for hundreds of years--thousands of years--and yet our public schools have somehow forgotten.

Although I'm more inclined to favor early and protracted exploration (input, learning) followed by later exploitation (performance, output) at virtually every scale in education—something that I have undoubtedly been unable to conceal even in this post—I don't intend in this writing to pass judgment one way or another²; only to suggest that where one finds oneself on the exploration-exploitation continuum is likely predictive of where one finds oneself on the traditionalist-progressivist continuum. Respectively.

Image credit: Ken Munson Photography

1. As far as education goes, I would consider myself right of center, and I would say my colleague was left of left of center. But that might be just what I would say. In her view, she may have been either neutral or left of center, and I right of right of center. That's how this game often works when you're not thinking hard about it: you're always neutral, and the other guy is the extremist.

2. I won't even mention that, given certain assumptions, the optimum balance (according to something called the Gittins Index) between exploration and exploitation is not 50-50; more like 70-30. : )

## Tuesday, May 31, 2016

In a previous post, we looked at the perceptron, a machine-learning algorithm that is able to "learn" the distinction between two linearly separable classes (categories of objects whose data points can be separated by a line—or, with higher-dimensional data, by a hyperplane).

The data shown below resemble the apples and oranges data we used last time. There are two classes, or categories—in this case, the setosa and versicolor species of the Iris flower. And each instance has two dimensions (or features): sepal length and petal length. In our hypothetical apple-orange data, the two dimensions were weight and number of seeds.

Using just two dimensions allows us to plot each instance in the training data on a coordinate plane as above and draw some of the prediction lines as the program cycles through the data. For the above data, a solution is found after about 5–6 "epochs" (complete passes through all $$\mathtt{n}$$ items in the data). This solution is represented by the blue dashed line.

This process is a bit clunky. The coefficients, or weights, $$\mathtt{w}$$, are updated using the learning rate $$\mathtt{\eta}$$ by $$\mathtt{\Delta w_{1,2} = \pm 2\eta x_{1,2}}$$ and $$\mathtt{\Delta w_0 = \pm 2\eta}$$. This process, though it always converges to a solution so long as one exists, tends to jolt the prediction line back and forth a bit abruptly.

With gradient descent, we can gradually reduce the error in the prediction. Sometimes. We do this by making use of the sum of squared errors function—a quadratic function (parabola) that has a minimum: $\mathtt{\frac{1}{2}\sum(y - wx)^2}$

This formula shows the target vector $$\mathtt{y}$$ (the collection of target values of 1 and -1) minus the input vector—the linear combination of weights and dimensions, $$\mathtt{w^{T}x}$$, for each object in the data. The components of the difference vector are squared and summed, and the result is divided by 2, which gives us a scalar value that places us somewhere on the parabola. We don't use this "cost" value in the updates themselves; we only track it to see whether it decreases over cycles.

Okay, so we don't know what side of the parabola we're on. In that case, we look at the opposite of the gradient of the curve with respect to the weights—that is, the opposite of the partial derivative of the curve with respect to the weights. Substituting $$\mathtt{u = y - wx}$$ and applying the chain rule: $\mathtt{-\frac{\partial}{\partial w}(y - wx)^2 = -\frac{\partial}{\partial u}(u^2)\frac{\partial}{\partial w}(y - wx) = -2u \cdot -x = -2(y - wx)(-x)}$

Multiply this result by $$\mathtt{\frac{1}{2}}$$ (from the cost function), restore the summation, and we get $$\mathtt{\sum (y - wx)(x)}$$—the opposite of the gradient. Finally, multiply by the learning rate $$\mathtt{\eta}$$ to get the change to each weight: $\mathtt{\eta\sum (y - wx)(x)}$
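A minimal sketch of one such full-batch update (assuming NumPy; `gradient_descent_step` is my name for it, not a library function):

```python
import numpy as np

def gradient_descent_step(X, y, w, eta):
    """One full-batch update for the sum-of-squared-errors cost.

    X has one row per object, y holds the targets (1 or -1), and
    w = [w_0, w_1, ..., w_m] with w_0 the bias ("intercept") weight.
    """
    net_input = X @ w[1:] + w[0]       # the input vector w^T x for each object
    errors = y - net_input             # the difference vector y - wx
    w = w.copy()
    w[1:] += eta * (X.T @ errors)      # eta * sum((y - wx) * x) for each weight
    w[0] += eta * errors.sum()         # bias: eta * sum(y - wx)
    cost = 0.5 * (errors ** 2).sum()   # tracked only to see whether it shrinks
    return w, cost
```

Calling this repeatedly with a small $$\mathtt{\eta}$$ should show the returned cost shrinking from one cycle to the next.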

#### An Example

Let's take a look at a small example. I'll use data for just 10 flowers in the Iris data set. All of these belong to the setosa species of the Iris flower. I'll use a learning rate of $$\mathtt{\eta = 0.0001}$$.

| Sepal Length (cm) | Petal Length (cm) |
| --- | --- |
| 5.1 | 1.4 |
| 4.9 | 1.4 |
| 4.7 | 1.3 |
| 4.6 | 1.5 |
| 5 | 1.4 |
| 5.4 | 1.7 |
| 4.6 | 1.4 |
| 5 | 1.5 |
| 4.4 | 1.4 |
| 4.9 | 1.5 |

In that case, all of these instances have target values of $$\mathtt{-1}$$, which don't change as the data cycle through. Our $$\mathtt{y}$$ vector, then, is [$$\mathtt{-1, -1, -1, -1, -1, -1, -1, -1, -1, -1}$$], and our starting weights are [0, 0, 0]. The first weight here is the "intercept" weight, or bias weight, which gets updated differently from the others.

Our input vector, $$\mathtt{w^{T}x}$$, is the combination sepal length × weight 1 + petal length × weight 2 for each object in the data. At the start, then, our input vector is a zero vector: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]. The difference vector, $$\mathtt{y - w^{T}x}$$, is, in this case, equal to the y vector: just a collection of ten negative 1s.

The bias weight, $$\mathtt{w_0}$$, is updated by learning rate × the sum of the components of the difference vector, or $$\mathtt{\eta\sum(y - w^{T}x)}$$. This gives us $$\mathtt{w_0 = w_0 + 0.0001 \times -10 = -0.001}$$.

The $$\mathtt{w_1}$$ weight is updated as 0.0001(5.1 × -1 + 4.9 × -1 + 4.7 × -1 + 4.6 × -1 + 5 × -1 + 5.4 × -1 + 4.6 × -1 + 5 × -1 + 4.4 × -1 + 4.9 × -1) and the $$\mathtt{w_2}$$ weight is updated as 0.0001(1.4 × -1 + 1.4 × -1 + 1.3 × -1 + 1.5 × -1 + 1.4 × -1 + 1.7 × -1 + 1.4 × -1 + 1.5 × -1 + 1.4 × -1 + 1.5 × -1).

Through all that gobbledygook, our new weights are $$\mathtt{-0.00486}$$ and $$\mathtt{-0.00145}$$ with a bias weight of $$\mathtt{-0.001}$$. You can see below how different the gradient descent process is from the perceptron model. This model in particular doesn't have to stop when it finds a line that separates the categories, since the minimum may not be reached even when the program can classify the categories precisely.
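All of that gobbledygook can be checked in a few lines of Python (a sketch; the numbers come from the table above):

```python
sepal = [5.1, 4.9, 4.7, 4.6, 5, 5.4, 4.6, 5, 4.4, 4.9]
petal = [1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5]
eta = 0.0001
y = [-1] * 10                       # all ten flowers are setosa
w0 = w1 = w2 = 0.0                  # starting weights

# Input vector w^T x: all zeros at the start, so the difference
# vector y - w^T x equals y itself (ten negative 1s).
net = [w0 + w1 * s + w2 * p for s, p in zip(sepal, petal)]
errors = [yi - ni for yi, ni in zip(y, net)]

w0 += eta * sum(errors)                              # bias weight update
w1 += eta * sum(e * s for e, s in zip(errors, sepal))
w2 += eta * sum(e * p for e, p in zip(errors, petal))

print(round(w0, 6), round(w1, 6), round(w2, 6))  # -0.001 -0.00486 -0.00145
```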

## Friday, May 27, 2016

### Enhance the Salience of Relevant Variables

This study has some connections to things I've written up here—in particular, Hallowell et al. on how the salience of characteristics of solids affects the responses of young children to certain visuo-spatial tasks. When these salient characteristics are irrelevant to the task, inhibitory processes can be measured, as mentioned in studies here and again here.

The study we will look at in this post explored whether increasing the salience of the relevant variable in tasks with both relevant and irrelevant variables of competing significance would improve students' performance. In particular, researchers looked at whether increasing the salience of perimeter would help students with responses to tasks where area was irrelevant but salient.

Three groups of shape-pair categories were used, along with two levels of complexity:

In the Congruent condition, the shape with the greater area also has the greater perimeter. In the Incongruent Inverse condition, the shape with the smaller area has the greater perimeter. And in the Incongruent Equal condition, the areas differ, but the perimeters are equal.

#### The Experiment and Results

Participants (58 fifth- and sixth-graders) were divided into two groups. One group of students was tested first on the shapes as you see them above. The second group, however, was tested first on what researchers called the discrete mode of presentation. In this mode, the perimeters of the shapes are made more salient by drawing them with equal-sized matchsticks instead of as continuous lines:

Each group was tested on the other mode of presentation 10 days later. So, students who began with the discrete mode were tested in the continuous mode later, and students tested first in the continuous mode were tested 10 days later in the discrete mode.

The results are pretty staggering. And keep in mind that students in both groups took both tests. The order of the tests is what is primarily responsible for the dramatically different results.

Previous research in mathematics and science education has shown that specific modes of presentation may improve students’ performance (i.e., Clement 1993; Martin and Schwartz 2005; Stavy and Berkovitz 1980; Tirosh and Tsamir 1996). Our results corroborate these findings and indeed show that success rate in the discrete mode of presentation test is significantly higher than in the continuous mode of presentation test. This significant difference is evident in all conditions (congruent, incongruent inverse, and incongruent equal) and in all levels of complexity (simple and complex). The visual information depicted in the discrete mode of presentation strongly enhances the salience of the relevant variable, perimeter, and somewhat decreases the salience of the irrelevant variable, area. This visual information also emphasizes the perimeter’s segments that should be mentally manipulated when solving the task. The discrete mode of presentation, therefore, enhances the use of strategies that are regularly used when solving this task. In the continuous mode of presentation, however, no hint of such possibility of mentally breaking the solid line into relevant segments is given.

The researchers note that a similar study using discrete segments for perimeter found no effect for the order of presentation. The authors surmise that this was due to the fact that, in that study, researchers did not use equal-sized discrete units, so students could not manipulate these units to determine perimeter. That is, discreteness is not what mattered, but that the discrete units could be manipulated successfully to produce correct responses to perimeter questions. The key, still, is that insights gained from working first with these discrete, equal-sized units transferred to the continuous mode of presentation.

The positive effect of a previous presentation observed in the current study can be seen as “teaching by analogy.” In teaching by analogy students are first presented with an “anchoring task” that elicits a correct response due to the way it is presented and hence supports appropriate solution strategies. Later on, students are presented with a similar “target task” known to elicit incorrect responses. The anchoring task probably encourages appropriate solution strategies, and such a sequence of instruction was effective in helping students overcome difficulties (e.g., Clement 1993; Stavy 1991; Tsamir 2003). . . .

Performing the discrete mode of presentation test strongly enhances the salience of the relevant variable, perimeter, and somewhat decreases that of area. This enhancement supports appropriate solution strategies that lead to improved performance. This effect is robust and transfers to continuous mode of presentation for at least 10 days. In line with this conclusion, a student who performed the continuous test after the discrete one commented that, “It [continuous] was harder this time but I used the previous shapes, because I could do tricks with the matchsticks.”

Babai, R., Nattiv, L., & Stavy, R. (2016). Comparison of perimeters: Improving students' performance by increasing the salience of the relevant variable. ZDM, 48(3), 367–378. DOI: 10.1007/s11858-016-0766-z

## Thursday, May 26, 2016

### The Perceptron

I'd like to write about something called the perceptron in this post—partly because I'm just learning about it, and writing helps me wrap my head around new-to-me things, and partly because, well, it's interesting.

It works like this. Take two categories which are, in some quantifiable way, completely different from each other. In this example, let's talk about apples and oranges as the categories. And we'll give each of these categories just two dimensions—weight in grams and number of seeds. (I'll fudge these numbers a bit to help with the example.)

Studying this information alone, you're already prepared to know with 100% confidence whether an object is an apple or an orange given its weight and number of seeds. In fact, the number of seeds is unnecessary. If you have just the two objects to choose from, and you are given only an object's weight, you have all the information you need to assign it to the apple category or the orange category.

But, you have to play along here. We want to train the computer to make the inference we just made above, given only a set of data with the weights and number of seeds of apples and oranges. Crucially, what the perceptron does is find a line between the two categories. You can easily see how to draw a line between the categories at the left (where I have plotted 100 random apples and oranges in their given ranges of weights and seeds), but the perceptron program allows a computer to learn where to draw a line between the categories, given only a set of data about the categories.

#### Training the Perceptron

The way we teach the computer how to draw this line is that we ask it to draw a prediction line, essentially. Then we make it look at each instance (apple or orange, one at a time) and see whether the line correctly predicts the category of the object. If it does, we make no change to the line; if it doesn't, we adjust the line.

The prediction line takes the standard form $$\mathtt{Ax + By + C = 0}$$. If $$\mathtt{Ax + By + C > 0}$$, we predict orange. If $$\mathtt{Ax + By + C \leq 0}$$, we predict apple. We can put the "or equal to" on either one of these inequalities, and we need the "or equal to" on one of them for this to be a truly binary decision.

Okay, so let's draw a prediction line across the data above at $$\mathtt{y = 119}$$, say. In that case, $$\mathtt{A = 0, B = 1,}$$ and $$\mathtt{C = -119}$$. When we come across apple data points, they will all be categorized correctly (as below the line), but some of the orange data points will be categorized correctly and some incorrectly. On correct categorizations, we don't want the line to update at all, and on incorrect categorizations, we want to change the line. Here's how both of these things are accomplished:

Let's take the point at (10, 109). This should be classified as orange, which we'll quantify as $$\mathtt{1}$$. The prediction line, however, gives us $$\mathtt{0(10) + 1(109) + -119}$$, which is $$\mathtt{-10 \leq 0}$$. So, the perceptron mistakenly predicts this point to be an apple, which we'll quantify as $$\mathtt{-1}$$. We want to adjust the line.

Subtract the prediction ($$\mathtt{-1}$$) from the actual (1) to get $$\mathtt{2}$$. Multiply this by a small learning rate of 0.1, so $$\mathtt{0.1 \times 2 = 0.2}$$. Finally, change $$\mathtt{A}$$, $$\mathtt{B}$$, and $$\mathtt{C}$$ like this:

$$\mathtt{A = A + 0.2 \times 10}$$

$$\mathtt{B = B + 0.2 \times 109}$$

$$\mathtt{C = C + 0.2}$$

So, in our example, $$\mathtt{A}$$ becomes $$\mathtt{0 + 0.2 \times 10}$$, or 2, $$\mathtt{B}$$ becomes $$\mathtt{1 + 0.2 \times 109}$$, or 22.8, and $$\mathtt{C}$$ becomes $$\mathtt{-119 + 0.2}$$, or $$\mathtt{-118.8}$$.
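These numbers fall out of a few lines of code (a sketch; `perceptron_update` is a hypothetical helper name, not from any library):

```python
def perceptron_update(A, B, C, x, y, target, eta=0.1):
    """Update the line Ax + By + C = 0 for one point (x, y).

    target is 1 (orange) or -1 (apple); we predict 1 when Ax + By + C > 0.
    Returns the (possibly unchanged) weights.
    """
    prediction = 1 if A * x + B * y + C > 0 else -1
    delta = eta * (target - prediction)   # zero when the prediction is correct
    return A + delta * x, B + delta * y, C + delta

# The orange at (10, 109), mispredicted by the line 0x + 1y + -119 = 0:
A, B, C = perceptron_update(0, 1, -119, 10, 109, target=1)
print(round(A, 1), round(B, 1), round(C, 1))  # 2.0 22.8 -118.8
```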

We have a new prediction line, which is $$\mathtt{2x + 22.8y + -118.8 = 0}$$. This line now makes the correct prediction for the orange at (10, 109), as you can see at the right with the blue line.

But this point is only used to adjust A, B, and C (called the "weights"), and then it's on to the next point to see if the new prediction line succeeds in making a correct prediction or not. The weights change with each incorrect prediction, and these changing weights move the prediction line up (left) and down (right) and alter its slope as well.

#### Many Iterations

Suppose we next encounter a point at (10, 90), which should be an apple. Our perceptron, however, will make the prediction $$\mathtt{2(10) + 22.8(90) - 118.8 = 1953.2 > 0}$$, which is a prediction of orange, or 1. Subtract the prediction (1) from the actual ($$\mathtt{-1}$$) to get $$\mathtt{-2}$$, and multiply by the learning rate to get $$\mathtt{-0.2}$$. Our weights are adjusted as follows:

$$\mathtt{A = 2 + -0.2 \times 10 = 0}$$

$$\mathtt{B = 22.8 + -0.2 \times 90 = 4.8}$$

$$\mathtt{C = -118.8 + -0.2 = -119}$$

If you graph this new line, $$\mathtt{0x + 4.8y + -119 = 0}$$, you'll notice that it's between the other two, but it would still make the incorrect categorization for the apple at (10, 90). This is why the perceptron must cycle through the data several times in order to "train" itself into determining the correct line. If you can make sense of it, there is a proof that this algorithm always converges in finite time for data that can be separated by a line (with a sufficiently small learning rate).

#### Other Notes

It's worth mentioning that the perceptron finds a line between the categories, but there are an infinite number of lines available. Also, in this example, we have two dimensions, or features, but the perceptron works for as many dimensions as you please.

So, to get schmancy, for each object in the data, our prediction function takes a linear combination of weights (coefficients plus that C intercept weight) and coordinates (features, or dimensions), $$\mathtt{z = w_{0}x_{0} + w_{1}x_{1} + \ldots + w_{m}x_{m} = w^{T}x}$$, and outputs $\theta(z) = \left\{\begin{array}{ll} \phantom{-}1, & \quad z > 0 \\ -1, & \quad z \leq 0 \end{array}\right.$

Even with multi-dimensional vectors, the weights are updated just as above, perhaps with a different learning rate, chosen at the beginning.

Sebastian Raschka has published a Python implementation of the perceptron, using a classic data set about irises (the flowers).
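A minimal sketch in the same spirit (my own, not Raschka's code; assuming NumPy, with the bias stored as `w[0]` in place of the C intercept weight):

```python
import numpy as np

class Perceptron:
    """Minimal perceptron; w[0] is the bias (the C intercept weight)."""

    def __init__(self, eta=0.1, n_epochs=10):
        self.eta = eta              # learning rate
        self.n_epochs = n_epochs    # passes through the data

    def predict(self, X):
        z = X @ self.w[1:] + self.w[0]   # z = w^T x for each row of X
        return np.where(z > 0, 1, -1)    # theta(z): 1 or -1

    def fit(self, X, y):
        self.w = np.zeros(1 + X.shape[1])
        for _ in range(self.n_epochs):
            for xi, target in zip(X, y):
                # update = eta * (actual - prediction): zero when correct
                update = self.eta * (target - self.predict(xi[np.newaxis, :])[0])
                self.w[1:] += update * xi
                self.w[0] += update
        return self

# Two clearly separable clusters (stand-ins for the apples and oranges)
X = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0],
              [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y = np.array([-1, -1, -1, 1, 1, 1])
model = Perceptron(eta=0.1, n_epochs=10).fit(X, y)
print((model.predict(X) == y).all())  # True
```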

## Tuesday, May 24, 2016

### Is Area All About Rectangles?

A very common way of teaching students the reasoning behind the formula for the area of a non-rectangular parallelogram is to demonstrate how to take apart such a parallelogram and turn it into a rectangle. Some U.S. states' elementary and middle-school mathematics standards used to be fairly explicit about this "way" of teaching. Here are two examples—the first from California, the second from Florida:

Derive and use the formula for the area of a triangle and of a parallelogram by comparing it with the formula for the area of a rectangle (i.e., two of the same triangles make a parallelogram with twice the area; a parallelogram is compared with a rectangle of the same area by cutting and pasting a right triangle on the parallelogram).

Derive and apply formulas for areas of parallelograms, triangles, and trapezoids from the area of a rectangle.

The NCTM's 2000 publication Principles and Standards for School Mathematics (PSSM) also endorsed this method as consistent with stimulating students' understanding of, and investigation into, area:

One particularly accessible and rich domain for such investigation is areas of parallelograms, triangles, and trapezoids. Students can develop formulas for these shapes using what they have previously learned about how to find the area of a rectangle, along with an understanding that decomposing a shape and rearranging its component parts without overlapping does not affect the area of the shape.

And, from the third edition of this well-known book by Van de Walle, we have the following:

Parallelograms that are not rectangles can be transformed into rectangles having the same area. The new rectangle has the same height and two sides the same as the original parallelogram. Students should explore these relationships on grid paper, on geoboards, or by cutting paper models, and should be quite convinced that the areas are the same and that such reassembly can always be done with any parallelogram. As a result, the area of a parallelogram is base times height, just as for rectangles.

I should note that, of the four snippets I present above, only Van de Walle's even makes an attempt at a consistent distinction between non-rectangular parallelograms ("parallelograms that are not rectangles") and rectangles. But then even in the Van de Walle snippet, this distinction breaks down in the last sentence.

Consistent with the standards and the specific observations and suggestions in the widely referenced publications listed above, textbook lessons typically introduce students to the idea that the area of a non-rectangular parallelogram can be described by the same formula as that used to describe the area of a rectangle with the same base and height, using one or more illustrations like the one at right, accompanied by written instruction to help students understand the illustration.

Generally, the purpose of this written instruction is to clarify for students that (a) a part of the non-rectangular parallelogram was simply removed and then re-attached somewhere else on the figure--a transposition that does not change the area of the figure, and (b) the resulting figure is a rectangle whose height and base are the same as that of the non-rectangular parallelogram.

But, this progression has always struck me as a little backward, because rectangles and squares are special kinds of parallelograms. Deriving or developing a formula to describe the area of certain non-rectangular parallelograms based on what is taught about the area of rectangles--as is often done in this instruction--is a bit like arguing that the numbers 3, 1, 11, and 45 are all integers because you can count up from them or count down from them and reach the integer 9.

It turns out in both cases that what is stated is true—non-rectangular parallelograms and rectangles share the same area formula, and the numbers above, along with the number 9, are all integers. And it so happens that in both of these cases, the reasons are difficult to argue with as well—transforming a non-rectangular parallelogram into a rectangle without gaining or losing area is certainly a convincing demonstration, and so long as one construes "counting" as being restricted to integers, the reason given above for the "integer-ness" of 3, 1, 11, and 45 is satisfactory, albeit sloppy.

Still, the "integer-ness" of integers has nothing to do with the number 9, and the reason the area of a parallelogram can be described by the formula $$\mathtt{A = bh}$$ or $$\mathtt{A = lw}$$ may have nothing in general to do with rectangles and everything to do with the fact that parallel lines are at all corresponding points the same distance apart.

Is it not the case that the shaded figures shown have the same area (if their opposite sides are parallel)? Cavalieri's Principle, no? But this goes back to Euclid, too. Also, there's a connection to bivectors in higher-level math.

This idea--perhaps a slightly more general idea about area--awaits actual fleshing out in lessons, where the other principles (clarity, order, and cohesion), along with precision again, must be considered. If, though, we are to keep this topic (parallelogram area) where it typically falls in a mathematics curriculum, then we would need to move up discussions about parallel lines from where they typically fall, and such discussions could no longer be considered in isolation.

## Saturday, May 7, 2016

### Making "Connections"

I'm nearing the end of my read of James Lang's terrific book Small Teaching, and I've wanted for the last 100 pages or so to recommend it highly here.

While I do that, however, I'd also like to mention a confusion that piqued my interest near the middle of the book—a very common false distinction, I think, between 'making connections' and knowing things. Lang sets it up this way:

When we are tackling a new author in my British literature survey course, I might begin class by pointing out some salient feature of the author's life or work and asking students to tell me the name of a previous author (whose work we have read) who shares that same feature. "This is a Scottish author," I will say. "And who was the last Scottish author we read?" Blank stares. Perhaps just a bit of gaping bewilderment. Instead of seeing the broad sweep of British literary history, with its many plots, subplots, and characters, my students see Author A and then Author B and then Author C and so on. They can analyze and remember the main works and features of each author, but they run into trouble when asked to forge connections among writers.

What immediately follows this paragraph is what one would expect from a writer who has done his homework on the research: Lang reminds himself that his students are novices and he an expert; his students' knowledge of British literature and history is "sparse and superficial."

But then, suddenly, the false distinction, where 'knowledge' takes on a different meaning, becoming synonymous with "sparse and superficial," and his students have it again:

In short, they have knowledge, in the sense that they can produce individual pieces of information in specific contexts; what they lack is understanding or comprehension.

And, stated even more shortly, they lack comprehension because they lack connections.

#### Nope, Still Knowledge

As we saw here, with the Wason Selection Task, reasoning ability itself is dependent on knowledge. Participants who were given abstract rules had tremendous difficulties with modus tollens reasoning in particular, yet when these rules were set in concrete contexts, the difficulties all but vanished.

One might say, indeed, that in concrete contexts, the connections are known, not inferred. Thus, if you want students to make connections among various authors, it might help to tell them that they are connected, and how.

## Monday, May 2, 2016

### The Wason Selection Task, Part II

Before we sink our teeth even deeper into the Wason Selection Task, we should look briefly at conditional reasoning arguments.

Conditional reasoning generally starts with the statement "if P is true, then Q is true" or "if P, then Q" (P → Q). For example, the statement "If this tree is a spruce (P), then it has needles (Q)" is a conditional statement.

There are four types of conditional reasoning arguments that apply to the Wason Selection Task—two of them valid and two of them logically invalid. Each type is identified by the second statement it introduces after the "if P, then Q" premise.

• Modus Ponens (P is true): This argument proceeds as follows: If P is true, then Q is true. P is indeed true. Therefore, Q is true. This is a valid form of reasoning. Example: If this tree is a spruce (P), then it has needles (Q). This tree is indeed a spruce (P). Therefore, this tree has needles (Q).
• Denying the Antecedent (P is not true): This is a fallacy and proceeds as follows: If P is true, then Q is true. P is not true. Therefore, Q is not true. Example: If this tree is a spruce (P), then it has needles (Q). This tree is not a spruce (not P). Therefore, this tree does not have needles (not Q).
• Affirming the Consequent (Q is true): This is also a fallacy and proceeds as follows: If P is true, then Q is true. Q is indeed true. Therefore, P is true. Example: If this tree is a spruce (P), then it has needles (Q). This tree indeed has needles (Q). Therefore, this tree is a spruce (P).
• Modus Tollens (Q is not true): This argument proceeds as follows: If P is true, then Q is true. Q is not true. Therefore, P is not true. Example: If this tree is a spruce (P), then it has needles (Q). This tree does not have needles (not Q). Therefore, this tree is not a spruce (not P).
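These validity claims can be verified mechanically by enumerating the four truth-value assignments for P and Q (a brute-force sketch; the variable names are mine):

```python
from itertools import product

def implies(p, q):
    # "if P, then Q" is false only when P is true and Q is false
    return (not p) or q

def valid(second_premise, conclusion):
    # An argument form is valid iff no truth assignment makes both premises
    # true while the conclusion is false. Premise 1 is always P -> Q.
    return all(conclusion(p, q)
               for p, q in product([True, False], repeat=2)
               if implies(p, q) and second_premise(p, q))

modus_ponens    = valid(lambda p, q: p,     lambda p, q: q)
modus_tollens   = valid(lambda p, q: not q, lambda p, q: not p)
affirm_conseq   = valid(lambda p, q: q,     lambda p, q: p)
deny_antecedent = valid(lambda p, q: not p, lambda p, q: not q)
print(modus_ponens, modus_tollens, affirm_conseq, deny_antecedent)
# True True False False
```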

It is important to note that the terms valid and invalid used to describe these arguments tell us nothing about the correctness of their conclusions. For example, each of these lines of reasoning is logically invalid . . .

• Affirming the Consequent: If today is June 1, then tomorrow is June 2. Tomorrow is indeed June 2. Therefore, today is June 1.
• Denying the Antecedent: If today is June 1, then tomorrow is June 2. Today is not June 1. Therefore, tomorrow is not June 2.

. . . even though they are undeniably "correct," as far as that goes. Formal logic does not concern itself necessarily with the contents of arguments, only their form.

#### Applying the Rules to the Selection Task

The rule given in any Wason Selection Task is considered to be a statement of the form "if P, then Q." So, the rule "every person that has an alcoholic drink is of legal age" that I included in my previous post might be recast as "if alcoholic drink (P), then legal age (Q)." Similarly, in the more formal version, the rule "every card that has a D on one side has a 3 on the other" might be recast as "if D (P), then 3 (Q)."

Accordingly, each of the four answer choices in a Wason Selection Task is seen as the second statement in a conditional reasoning argument—either a statement about P (i.e., P is true [P] or P is not true [∼P]) or a statement about Q (i.e., Q is true [Q] or Q is not true [∼Q]).

In the "drinking" version of the selection task, for example, statements about drink type are the Ps and statements about age are the Qs:

| soda | 18 | vodka | 29 |
| :-: | :-: | :-: | :-: |
| Not Alcoholic (∼P) | Not Legal Age (∼Q) | Alcoholic (P) | Legal Age (Q) |

The Ps and Qs for the more formal task would be assigned this way:

| D | K | 3 | 7 |
| :-: | :-: | :-: | :-: |
| D (P) | Not D (∼P) | 3 (Q) | Not 3 (∼Q) |

In each case, the statements that form a valid argument (i.e., P and ∼Q) indicate those cards (or people) that must be checked to determine whether the if-then rule has been violated.
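This selection logic fits in a few lines of code. In this sketch (labels and function name are my own illustration), each visible face is classified as P, ∼P, Q, or ∼Q, and only the P faces (via modus ponens) and the ∼Q faces (via modus tollens) can expose a violation:

```python
def cards_to_check(cards):
    """cards maps a visible face to its classification: 'P', '~P', 'Q', or '~Q'.
    Only P and ~Q faces can falsify the rule 'if P, then Q'."""
    return [label for label, kind in cards.items() if kind in ('P', '~Q')]

drinking = {'soda': '~P', '18': '~Q', 'vodka': 'P', '29': 'Q'}
abstract = {'D': 'P', 'K': '~P', '3': 'Q', '7': '~Q'}

print(cards_to_check(drinking))  # ['18', 'vodka']
print(cards_to_check(abstract))  # ['D', '7']
```

Note that the procedure is identical for the two versions; only the surface content differs, which is exactly the point of the knowledge explanation discussed below.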

### The Wason Selection Task, Part I

The Pope, a nun, Kermit the Frog, and Bruce Lee are all sitting at a bar. Well, actually it's just four people, represented by the cards below.

soda
18
vodka
29

Each person has an age and a drink type, but you can see only one of these for each person. Here is a rule: "every person that has an alcoholic drink is of legal age." Your task is to select all those people, but only those people, that you would have to check in order to discover whether or not the rule has been violated.

Most people have little trouble picking the correct answer above. But, "across a wide range of published literature only around 10% of the general population" finds the correct answer to the infamous Wason selection task shown below:

D
K
3
7

Each card has a letter on one side and a number on the other, but you can see only one of these for each card. Here is a rule: "every card that has a D on one side has a 3 on the other." Your task is to select all those cards, but only those cards, which you would have to turn over in order to discover whether or not the rule has been violated.

In fact, Matthew Inglis and Adrian Simpson (2004) found that mathematics undergraduates as well as mathematics academic staff, though performing significantly better than history undergraduates, performed unexpectedly poorly on the task, with only 29% of math undergrads and a shocking 43% of staff finding the correct answer.

In a chapter from The Cambridge Handbook of Expertise and Expert Performance, Paul Feltovich, Michael Prietula, and K. Anders Ericsson indicate the one factor that explains these differential results: knowledge.

Some studies showed reasoning itself to be dependent on knowledge. Wason and Johnson-Laird (1972) presented evidence that individuals perform poorly in testing the implications of logical inference rules (e.g., if p then q) when the rules are stated abstractly. Performance greatly improves for concrete instances of the same rules (e.g., 'every time I go to Manchester, I go by train'). Rumelhart (1979), in an extension of this work, found that nearly five times as many participants were able to test correctly the implications of a simple, single-conditional logical expression when it was stated in terms of a realistic setting (e.g., a work setting: 'every purchase over thirty dollars must be approved by the regional manager') versus when the expression was stated in an understandable but less meaningful form (e.g., 'every card with a vowel on the front must have an integer on the back').

Reference: Inglis, M., & Simpson, A. (2004). Mathematicians and the selection task. Proceedings of the 28th Conference of the International Group for the Psychology of Mathematics Education, 3, 89–96.

## Sunday, May 1, 2016

After reading this piece about mathematician Terence Tao, I was turned on to a book about mathematical problem-solving Tao wrote as a 15-year-old, so I decided to check it out. The book contains a nice little nugget about symmetry in the first chapter, the importance of which sails by the author (and thus the reader) a bit, I think. Here's the problem that occupies all of the discussion in Chapter 1:

A triangle has its lengths in an arithmetic progression, with difference $$\mathtt{d}$$. The area of the triangle is $$\mathtt{t}$$. Find the lengths and angles of the triangle.

And here's the nice move with regard to symmetry that Tao makes, almost in passing. It doesn't seem to be a "code-cracking" idea in the context of the problem, but I think it highlights a key process in mathematical thinking that we can work to make more explicit for students. It might be helpful to tinker with the problem a little to put this move in some context:

We can use the data to simplify the notation: we know that the sides are in arithmetic progression, so instead of $$\mathtt{a}$$, $$\mathtt{b}$$, and $$\mathtt{c}$$, we can have $$\mathtt{a}$$, $$\mathtt{a + d}$$, and $$\mathtt{a + 2d}$$ instead. But the notation can be even better if we make it more symmetrical, by making the side lengths $$\mathtt{b - d}$$, $$\mathtt{b}$$, and $$\mathtt{b + d}$$.

I can easily imagine, given what I know about most current curricula, a student writing down the side lengths of the triangle as $$\mathtt{a}$$, $$\mathtt{a + d}$$, and $$\mathtt{a + 2d}$$, because this is how we teach arithmetic progressions. But we don't often make explicit the fairly simple observation that any term of an arithmetic progression after the first can be written as $$\mathtt{a_n}$$, with the previous term written as $$\mathtt{a_n - d}$$ and the next term as $$\mathtt{a_n + d}$$. We could surface this observation by thinking a bit more about symmetry when we craft our explanations and instruction.
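The symmetric parameterization pays off concretely in Heron's formula: with sides $$\mathtt{b - d}$$, $$\mathtt{b}$$, and $$\mathtt{b + d}$$, the semiperimeter is just $$\mathtt{3b/2}$$, and the area collapses to $$\mathtt{\frac{1}{4}\sqrt{3b^2(b^2 - 4d^2)}}$$. A quick numeric sketch (function names are mine) confirms the two forms agree, using the 4-5-6 triangle, i.e., $$\mathtt{b = 5}$$, $$\mathtt{d = 1}$$:

```python
import math

def heron(a, b, c):
    """Area of a triangle from its three side lengths (Heron's formula)."""
    s = (a + b + c) / 2
    return math.sqrt(s * (s - a) * (s - b) * (s - c))

def area_symmetric(b, d):
    """Same area for sides b - d, b, b + d: the semiperimeter is 3b/2,
    and Heron's formula simplifies to sqrt(3 b^2 (b^2 - 4d^2)) / 4."""
    return math.sqrt(3 * b * b * (b * b - 4 * d * d)) / 4

print(heron(4, 5, 6), area_symmetric(5, 1))  # both ≈ 9.9216
```

Starting from $$\mathtt{a}$$, $$\mathtt{a + d}$$, $$\mathtt{a + 2d}$$ instead, the same computation works but never collapses as neatly; the simplification falls out of the symmetry of the notation itself.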

Another Observation

Working to orient oneself to the symmetries available in mathematical situations seems like one appropriate remedy to what I've called "left-to-rightism," or "cinemathematics"—a syndrome that makes us teach concepts like the equals sign (unwittingly) in a left-to-right way, such that students take away (unwittingly) the misconception that the equals sign indicates that some answer is to follow, rather than that two expressions are equal. Some recent research points to the benefits of thinking about symmetry when teaching negative numbers as well.

Tsang, J., Blair, K., Bofferding, L., & Schwartz, D. (2015). Learning to "see" less than nothing: Putting perceptual skills to work for learning numerical structure. Cognition and Instruction, 33(2), 154–197. DOI: 10.1080/07370008.2015.1038539

## Saturday, April 23, 2016

### Evident Even to an Ass

Suppose there is a straight line segment $$\overline{\small\mathtt{AB}}$$ between you, at point $$\small\mathtt{A}$$, and some point $$\small\mathtt{B}$$ you want to reach. Is there a shorter way to point $$\small\mathtt{B}$$ that uses 2 line segment paths?

If you think the answer is yes, or if you think that the answer is no but it's not obviously no, the Epicureans have some pretty mean things to say about you:

It was the habit of the Epicureans, says Proclus, to ridicule this theorem as being evident even to an ass and requiring no proof, and their allegation that the theorem was "known" even to an ass was based on the fact that, if fodder is placed at one angular point and the ass at another, he does not, in order to get to his food, traverse the two sides of the triangle but only the one side separating them (an argument which makes Savile exclaim that its authors were "digni ipsi, qui cum Asino foenum essent"). Proclus replies truly that a mere perception of the truth of the theorem is a different thing from a scientific proof of it and a knowledge of the reason why it is true.

The theorem mentioned is the Triangle Inequality Theorem: the sum of the lengths of any two sides of a triangle is always greater than the length of the remaining side. And it is fascinating to me how seldom I have seen this theorem in textbooks phrased as a "shortest straight-line distance" statement. Most often, I see investigations about what side lengths can make up a triangle, which turns the ridiculously obvious into a complicated issue. But let's discuss that below. For now, a proof!

It's Okay To Be Both Obvious and Require Proof

We want to show that $$\small\mathtt{BA + AC}$$ is greater than $$\small\mathtt{BC}$$, that $$\small\mathtt{AB + BC}$$ is greater than $$\small\mathtt{AC}$$, and that $$\small\mathtt{BC + CA}$$ is greater than $$\small\mathtt{AB}$$.

Euclid starts by extending $$\overline{\small\mathtt{BA}}$$ to a point $$\small\mathtt{D}$$ such that $$\small\mathtt{DA = AC}$$ as I've shown in the diagram. This means that $$\small\Delta\mathtt{ADC}$$ is an isosceles triangle, and $$\small\measuredangle\mathtt{ACD}$$ and $$\small\measuredangle\mathtt{ADC}$$ are congruent. And since "the whole is greater than the part," m$$\small\measuredangle\mathtt{BCD}$$ is greater than m$$\small\measuredangle\mathtt{ACD}$$, which also makes m$$\small\measuredangle\mathtt{BCD}$$ greater than m$$\small\measuredangle\mathtt{ADC}$$.

Dizzy yet? Just remember the last rung of the ladder we got to: m$$\small\measuredangle\mathtt{BCD}$$ is greater than m$$\small\measuredangle\mathtt{ADC}$$. Therefore, we can say something about the line segments that those two angles "catch." Specifically, we can say that $$\overline{\small\mathtt{DB}}$$ is longer than $$\overline{\small\mathtt{BC}}$$, because of Proposition 19 (the side opposite the greater angle is greater). This is the same as saying that $$\small\mathtt{DA + AB > BC}$$. And since $$\small\mathtt{DA = AC}$$, we can make a substitution to get $$\small\mathtt{AC + AB > BC}$$, which is the first of the three statements above that we wanted to prove: $$\small\mathtt{BA + AC}$$ is greater than $$\small\mathtt{BC}$$. Euclid tells us that we can prove the other two statements with a similar method, and I believe him.
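The "shortest straight-line distance" reading of the theorem is also easy to check numerically: no detour from $$\small\mathtt{A}$$ through a third point $$\small\mathtt{C}$$ to $$\small\mathtt{B}$$ is ever shorter than the segment $$\overline{\small\mathtt{AB}}$$. A quick sketch over random points (not a proof, just the ass's-eye view):

```python
import math
import random

def dist(p, q):
    """Euclidean distance between two points in the plane."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

random.seed(0)
for _ in range(1000):
    A, B, C = [(random.uniform(-10, 10), random.uniform(-10, 10))
               for _ in range(3)]
    # Two-segment path through C is never shorter than the straight
    # segment (tiny tolerance for floating-point rounding).
    assert dist(A, C) + dist(C, B) >= dist(A, B) - 1e-9

print("no shorter two-segment path found")
```

Of course, as Proclus says, a thousand confirmations are still "a different thing from a scientific proof."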

Internalizing the Idea That Mathematics Is Complex and Intuition-Free

In the link is an example of what we often put students through to investigate the Triangle Inequality Theorem. This is consistent with what I have seen in a lot of lesson plans.

Were the authors of this document aware that there is a very intuitive way of looking at this theorem? It might be enough to suppose that they weren't and that they, like all of us, sometimes just repeat what we see or hear elsewhere.

But it's also worth entertaining the possibility that they did know and pressed on anyway. What good reasons might they have for doing so? I think the best answer is simply Proclus's reply above: "a mere perception of the truth of the theorem is a different thing from a scientific proof of it and a knowledge of the reason why it is true." To which I would respond, Indeed, a different thing. Not necessarily a better thing.

If we can give students the "mere perception of the truth" of a theorem (or of any mathematical idea), we should do so, even if it doesn't make us feel smart, leaves a lot of class time to fill, or runs counter to a set of standards. I would argue that students still have to prove those theorems and justify those ideas. But they can then do so correctly oriented to the reality of what they are doing: drawing on their own perceptions and knowledge to make their ideas plain to others.