## Saturday, August 20, 2016

### Are Teaching and Learning Coevolved?

Just a few pages into David Didau and Nick Rose's new book What Every Teacher Needs to Know About Psychology, and I've already come across what is, for me, a new thought:

> Strauss, Ziv, and Stein (2002) . . . point to the fact that the ability to teach arises spontaneously at an early age without any apparent instruction and that it is common to all human cultures as evidence that it is an innate ability. Essentially, they suggest that despite its complexity, teaching is a natural cognition that evolved alongside our ability to learn.

Or perhaps this is, even for me, an old thought, but just unpopular enough—and for long enough—to seem like a brand new thought. Perhaps after years of exposure to the characterization of teaching as an anti-natural object—a smoky, rusty gearbox of torture techniques designed to break students' wills and control their behavior—I have simply come to accept that it is true, and have forgotten that I had done so.

Strauss et al., however, provide some evidence in their research that it is not true. Very young children engage in teaching behavior before formal schooling by relying on a naturally developing ability to understand the minds of others, known as theory of mind (ToM).

> Kruger and Tomasello (1996) postulated that defining teaching in terms of its intention—to cause learning—suggests that teaching is linked to theory of mind, i.e., that teaching relies on the human ability to understand the other's mind. Olson and Bruner (1996) also identified theoretical links between theory of mind and teaching. They suggested that teaching is possible only when a lack of knowledge can be recognized and that the goal of teaching then is to enhance the learner's knowledge. Thus, a theory of mind definition of teaching should refer to both the intentionality involved in teaching and the knowledge component, as follows: teaching is an intentional activity that is pursued in order to increase the knowledge (or understanding) of another who lacks knowledge, has partial knowledge or possesses a false belief.

#### The Experiment

One hundred children were separated into 50 pairs—25 pairs with a mean age of 3.5 and 25 with a mean age of 5.5. Twenty-five of the 50 children in each age group served as test subjects (teachers); the other 25 were learners. The teachers completed three groups of tasks before teaching, the first of which (1) involved two classic false-belief tasks. If you are not familiar with these kinds of tasks, the video at right should serve as a delightfully creepy précis—from what appears to be the late 70s, when every single instructional video on Earth was made. The second and third groups of tasks probed participants' understanding that (2) a knowledge gap between teacher and learner must exist for "teaching" to occur and (3) a false belief about this knowledge gap is possible.

Finally, children participated in the teaching task by teaching the learners how to play a board game. The teacher-children were, naturally, taught how to play the game prior to their own teaching, and they were allowed to play the game with the experimenter until they demonstrated some proficiency. The teacher-learner pair was then left alone, "with no further encouragement or instructions."

#### The Results

Consistent with the results from prior false-belief studies, there were significant differences between the 3- and 5-year-olds in Tasks (1) and (3) above, both of which relied on false-belief mechanisms. In Task (3), when participants were told, for example, that a teacher thought a child knew how to read when in fact he didn't, 3-year-olds were much more likely to say that the teacher would still teach the child. Five-year-olds, on the other hand, were more likely to recognize the teacher's false belief and say that he or she would not teach the child.

Intriguingly, however, the development of a theory of mind does not seem necessary either to recognizing the need for a special type of discourse called "teaching" or to teaching ability itself—only to a refinement of teaching strategies. Task (2), in which participants were asked, for instance, whether a teacher would teach someone who knew something or someone who didn't, showed no significant differences between 3- and 5-year-olds in the study. But the groups were significantly different in the strategies they employed during teaching.

> Three-year-olds have some understanding of teaching. They understand that in order to determine the need for teaching as well as the target learner, there is a need to recognize a difference in knowledge between (at least) two people . . . Recognition of the learner's lack of knowledge seems to be a necessary prerequisite for any attempt to teach. Thus, 3-year-olds who identify a peer who doesn't know [how] to play a game will attempt to teach the peer. However, they will differ from 5-year-olds in their teaching strategies, reflecting the further change in ToM and understanding of teaching that occurs between the ages of 3 and 5 years.

#### Coevolution of Teaching and Learning

The study here dealt with the innateness of teaching ability and sensibilities but not with the coevolution of teaching and learning, which it mentions at the beginning and then leaves behind.

It is an interesting question, however. Discussions in education are increasingly focused on "how students learn," and it seems to be widely accepted that teaching should adjust itself to what we discover about this. But if teaching is as natural a human faculty as learning, then this may be only half the story. How students (naturally) learn might be caused, in part, by how teachers (naturally) teach, and vice versa. And learners perhaps should be asked to adjust to what we learn about how we teach as much as the other way around.

Those seem like new thoughts to me. But they're probably not.

Strauss, S., Ziv, M., & Stein, A. (2002). Teaching as a natural cognition and its relations to preschoolers’ developing theory of mind. Cognitive Development, 17(3–4), 1473–1487. DOI: 10.1016/S0885-2014(02)00128-4

## Saturday, July 30, 2016

### Problem Solving, Instruction: Chicken, Egg

We've looked before at research which evaluated the merits of different instructional sequences.

In this post, for example, I summarized a research review by Rittle-Johnson that revealed no support for the widespread belief that conceptual instruction must precede procedural instruction in mathematics. The authors of that review went so far as to call the belief (one held and endorsed by the National Council of Teachers of Mathematics) a myth. And another study, summarized in this post, finds little evidence for another very popular notion about instruction—that cognitive conflict of some kind is a necessary prerequisite to learning.

The review we will discuss in this post looks at studies which compared two types of teaching sequences: problem solving followed by instruction (PS-I) and instruction followed by problem solving (I-PS). As far as horserace comparisons go, the main takeaway is shown in the table below. Each positive (+) is a study result in which PS-I outperformed an I-PS control, each equals sign (=) a result where the two conditions performed the same, and each negative (–) a result where I-PS outperformed PS-I.

| | Procedural | Conceptual | Transfer |
| --- | --- | --- | --- |
| PS-I better (+) | + | + + + + | + + + + |
| No difference (=) | = = = = = = = = = | = = = | = = = = |
| I-PS better (–) | – – | – – | |

Summary of learning outcomes for PS-I vs. I-PS.

Importantly, 8 of the results reviewed are not represented in the table above. In these results, the review authors suggest, participants in the PS-I conditions were given better learning resources than those in the I-PS conditions. This difference confounded those outcomes (see Greg's post on this) and, unsurprisingly, added 15 plusses, 7 equals, and just 1 negative to the overall picture of PS-I.

Needless to say, when research has more fairly compared PS-I with I-PS, it has concluded that, in general, the sequence doesn't matter all that much, though there are some positive trends on conceptual and transfer assessments for PS-I. Even if we ignore the Procedural column, roughly 55% of the results are equal or negative for PS-I.

#### Contrasting Cases and Building on Student Solutions

Horserace aside (sort of), an intriguing discussion in this review centers around two of the confounds identified above—those extra benefits provided in some studies to learners in the PS-I conditions. They were (1) using contrasting cases during problem solving and (2) building on student solutions during instruction. Here the authors describe contrasting cases (I've included their example from the paper):

> Contrasting cases consist of small sets of data, examples, or strategies presented side-by-side (e.g., Schwartz and Martin 2004; Schwartz and Bransford 1998). These minimal pairs differ in one deep feature at a time ceteris paribus [other things being equal], thereby highlighting the target features. In the example provided in the right column of Table 2, the datasets of player A and player B differ with regard to the range, while other features (e.g., mean, number of data points) are held constant. The next pair of datasets addresses another feature: player B and C have the same mean and range but different distribution of the data points.

There's something funny about this, I have to admit, given the soaring rhetoric one encounters in education about the benefits of "rich" problems and the awkwardness of textbook problems. Although they are confounds in these studies, contrasting cases manage to be helpful to learning in PS-I as sets of (a) small, (b) artificial problems which (c) vary one idea at a time. "Rich" problems, in contrast, do not show the same positive effects.

And here is some more detail about building on student solutions during instruction. The only note I have is that it seems worthwhile to point out the obvious: this confound, which also improves learning in PS-I, has almost everything to do with the I and very little to do with the PS:

> Another way of highlighting deep features in PS-I is to compare non-canonical student solutions to each other and to the canonical solution during subsequent instruction (e.g., Kapur 2012; Loibl and Rummel 2014a). Explaining why erroneous solutions are incorrect has been found to be beneficial for learning, in some cases even more than explaining correct solutions (Booth et al. 2013). Furthermore, the comparison supports students in detecting differences between their own prior ideas and the canonical solution (Loibl and Rummel 2014a). More precisely, through comparing them to other students’ solution and to the canonical solution, students experience how their solution approaches fail to address one or more important aspects of the problem (e.g., diSessa et al. 1991). This process guides students’ attention to the deep features addressed by the canonical solution (cf. Durkin and Rittle-Johnson 2012).

Loibl, K., Roll, I., & Rummel, N. (2016). Towards a theory of when and how problem solving followed by instruction supports learning. Educational Psychology Review. DOI: 10.1007/s10648-016-9379-x

## Thursday, July 28, 2016

### The Appeal to Common Practice

At any point in a child's life or schooling, he or she presents with a number of things he or she can do and a number—which could be 0—of things he or she knows. We can refer to these collectively as the "knowns." And, of course, the "unknowns" are all those things a child does not know or cannot do at any of the same points. The problem of teaching from the known to the unknown involves making some kind of connection from a student's knowns to a very restricted set of unknowns, which, taken together at any point, form a kind of immediate curriculum.

Now, of course, it is impossible to teach without going from the known to the unknown in some way. On the one hand, a student can't learn anything if s/he has absolutely no knowledge or skills (because then s/he wouldn't exist), and on the other hand, nothing can be described purely in terms of itself. The inevitable connection from known to unknown itself is not at issue. What is at issue is the way this connection is made. What knowns are connected to what unknowns?

#### The Best "Known"

Over a wide variety of topics, educators will often argue about the quality of the knowns to be connected to specific unknowns. The ongoing debate about whether to teach fractions first or decimals first is an area where this argument pops up, with some making the case that place value is the better "known" to be connected to the unknown of rational numbers (decimals first) while others argue that equal shares is the better known (fractions first). Similarly, one can argue that, for the unknown of improper fractions, proper fractions serve as the best "known," whereas another can argue that, because improper and proper fractions are used in such diverse situations (e.g., "no one says that they have eight fifths dollars"), we must scrap the use of proper fractions as the "known" in introducing improper fractions and come back to the connection later.

While there are certainly substantive reasons that serve as foundations for these arguments, there are also problems that seem almost impossible to duck. One of those is called the appeal to common practice.

#### Appeal to Common Practice

This is a fallacy. And it works like this: Such and such an action is justified because it is what everyone else is doing or what we've always done. Now, it is pretty rare to see an adult actually commit this fallacy so nakedly. But it does creep up somewhat, um, "un-nakedly." Here's Mark falling into the fallacy with repeated multiplication:

> Try to give me a simple definition of exponentiation, which is understandable by a fifth or sixth grader, which doesn't at least start by talking about repeated multiplication. Find me a beginners textbook or teachers class plans that explains exponentiation to kids without at least starting with something like "$$\mathtt{5^2 = 5 \times 5}$$, $$\mathtt{5^3 = 5 \times 5 \times 5}$$."

The second of those sentences is pretty clearly the fallacy of appealing to common practice, to the extent that it is used in any way to justify or excuse the teaching of exponentiation as repeated multiplication. But notice what is said in the first sentence: "Try to give me a simple definition of exponentiation, which is understandable by a fifth or sixth grader, which doesn't at least start by talking about repeated multiplication." This, too, is an appeal to common practice, but the practice in this case is not necessarily the teaching of exponentiation as repeated multiplication to fifth or sixth graders but, rather, the teaching of everything before that. The argument is that repeated multiplication is the best "known" because currently the 8 to 10 years of schooling prior to teaching the "unknown" of exponentiation don't prepare students for learning exponentiation any other way (or any better way).

But these circumstances do not make repeated multiplication the best "known," just the most expedient "known." The same goes for repeated addition as a "known" connected to the unknown of multiplication.

All that aside, though, the more general argument is more important: Expedience is not a proper basis for determining quality teaching. Yet, it happens all the time without our noticing it—the appeal to common practice makes it devilishly difficult to discern between expedience and quality.

## Monday, July 18, 2016

### Interleaving Study Is Not Interleaving Learning

The study I briefly discuss here is closely related to one I wrote up on interleaving just a few months ago (no surprise, since the two studies share an author). You can find a free link to the article referenced in that post here.

In the latest research, the authors found that a blocked schedule (presenting examples from one category at a time) outperformed an interleaved schedule (interspersing examples from all the categories) for category learning when the examples to be classified were more highly discriminable. This result was consistent across the two experiments in the study (p = 0.055 and p = 0.04). Importantly, however, although interleaving was a better strategy for learning categories of lower discriminability, the effects across the experiments were much weaker (p = 0.2 and p = 0.08). Blocking had either a significant or close-to-significant effect, whereas interleaving didn't get nearly as close (if you like p-values, anyway).

#### The Study

The participants in the first experiment of the study (we'll only focus on that one here) were quite a bit older than I'm used to reading about in education studies: between 19 and 57 years old, with a mean age of 30. (This is similar to the previous study.) Subjects were divided into two groups, one of which was presented with images like these:

These images represent the four presented categories: long, steep; long, flat; short, steep; and short, flat. One subset of participants in this group was exposed to 64 of these images in a blocked schedule (16 from each category at a time) while the other was presented with the images in an interleaved schedule. Each example was appropriately labeled with a category letter (A, B, C, or D). After this initial exposure, subjects were then given a test on the same set of 64 images, randomly ordered, requiring them to assign the image to the appropriate category.

The other group of participants was presented with similar images (line segments) and in blocked or interleaved subsets. However, for this group the images were rotated 45 degrees. According to the researchers, this created a distinction between the groups in which the first was learning verbalizable, highly discriminable categories ("long and flat," etc.) whereas the second was learning categories difficult to express in words—categories of low discriminability.

#### Discussion, Questions, Connections

As mentioned above, the blocked arrangement of the examples produced a learning benefit for categories of higher discriminability when compared with interleaving. The same cannot be said for interleaving examples in the low-discriminability sequences, although the benefits for interleaving in these sequences were headed in the positive direction. So we are left to wonder about the positive effects of blocking here.

The authors suggest an answer by mentioning some data they are collecting in a separate pilot study: blocking makes it easier for learners to disregard irrelevant information.

> We compared learning under a third study schedule (n = 26) in which the relevant dimensions were interleaved, but the irrelevant ones were blocked (i.e., this schedule was blocked-by-irrelevant-dimensions, as opposed to blocked-by-category), which was designed to draw learners’ attention to noticing what dimensions were relevant or irrelevant. On both the classification test and a test in which participants had to identify the relevant and irrelevant dimensions, this new blocked-by-irrelevant-dimensions condition yielded performance at a level comparable to the blocked condition and marginally better compared to the interleaved condition.
>
> Therefore, although we initially hypothesized that participants, when studying one category at a time, are better able to compare exemplars from the same category and to generate and test their hypotheses as to the dimensions [that] define category membership (and this may still be true, particularly for Experiment 1), these pilot data suggest that with the addition of irrelevant dimensions . . . the blocking benefit is perhaps more likely driven by the fact that it allows participants to more easily identify and disregard the irrelevant dimensions.

This strikes me as making a good deal of sense. And it points to something that I may have previously confused: interleaving study examples is different from interleaving initial learning examples. When students are first learning something, blocking may be better; after acquisition, interleaving may benefit learners more.

We have a tendency, in my opinion, to ignore acquisition in learning. I'm not sure where this comes from. Perhaps it is believed that if we are justified in rejecting tabula rasa, we are safe to assume there are absolutely no rasas on any kid's tabula. At any rate, it's worth being clear about where in the learning process interleaving is beneficial—and where it may not be.

Postscript: It’s not unusual to believe, about a child’s cognitive subjectivity, that it is like a large glop of amorphous dough and that instruction or experience acts like a cookie cutter, shaping the child’s mind according to pre-made patterns and discarding the bits that don't fit.

But these results could suggest something different—that prior to learning, the world is a hundred trillion things that must be connected together, not a stew of sensation that must be partitioned into socially valuable units.

This may be why blocking could work well for newly learned categories and for so-called highly discriminable categories: because what is new to us is highly discriminable—separate, without precedent, meaningless.

Image credit: Danny Nicholson

Noh, S., Yan, V., Bjork, R., & Maddox, W. (2016). Optimal sequencing during category learning: Testing a dual-learning systems perspective. Cognition, 155, 23–29. DOI: 10.1016/j.cognition.2016.06.007

## Thursday, June 30, 2016

### The Problem of Stipulation

I think it would surprise people to read Engelmann and Carnine's Theory of Instruction. (The previous link is to the 2016 edition of the book on Amazon, but you can find the same book for free here.)

Tangled noticeably within its characteristic atomism and obsession with the naïve learner are a number of ideas that seem downright progressive—a label never ever attached to Engelmann or his work. One of these ideas in particular is worth mentioning—what the authors call the problem of stipulation.

The diagram at left shows a teaching sequence described in the book, in all its banal, robotic glory.

To be fair, it's much harder to roll your eyes at it (or at least it should be) when you consider the audience for whom it is intended—usually special education students.

Anyway, the sequence features a table and a chalkboard eraser in various positions relative to the table. And the intention is to teach the concept of suspended, allowing learners to infer the meaning of the concept while simultaneously preventing them from learning misrules.

> Stipulation occurs when the learner is repeatedly shown a limited range of positive variation. If the presentation shows suspended only with respect to an eraser and table, the learner may conclude that the concept suspended is limited to the eraser and the table.

You may know about the problem of stipulation in mathematics education as the (very real) problem of math zombies (or maybe Einstellung)—which is, to my mind anyway, the sine qua non of anti-explanation pedagogical reform efforts.

But of course prescribing doses of teacher silence is only one way to deal with the symptoms of stipulation. Engelmann and Carnine have another.

> To counteract stipulation, additional examples must follow the initial-teaching sequence. Following the learner's successful performance with the sequence that teaches slanted, for instance, the learner would be shown that slanted applies to hills, streets, floors, walls, and other objects. Following the presentation of suspended, the learner would be systematically exposed to a variety of suspended objects.
>
> Stipulation of the type that occurs in initial-teaching sequences is not serious if additional examples are presented immediately after the initial teaching sequence has been presented. The longer the learner deals only with the examples shown in the original setup, the greater the probability that the learner will learn the stipulation.

It's a shame, I think, that more educators are not exposed to this work in college. It's a shame, too, that Engelmann's own commercial work has come to represent the theory in full flower—unjustly, I believe. Theory of Instruction could be much more than what it is sold to be, to antagonists and protagonists alike.

Update (07.13): It's worth mentioning that, insofar as the problem of stipulation can be characterized as a student's dependence on teacher input for thinking, complexity and confusion can be even more successful at creating cognitive dependence than monotony and hyper-simplicity. When students—or adults—are in a constant state of confusion, they may learn, incidentally, that the world is unpredictable and incommensurate with their own attempts at understanding. In such cases, even the unfounded opinions of authority figures will provide relief.


## Sunday, June 19, 2016

### Growing Up Too Fast or Too Slowly?

If someone were to ask me where this post came from, I would be inclined to answer "everywhere." But I can offer two or three of the connections explicitly here. My thinking of late was simple enough: could variable levels of patience with children's developmental timelines influence one's traditionalist or progressivist orientation to education?

First, I was remembering an interaction I had with a former colleague last year.¹ We were talking about some math activities we were designing for third graders. What I recall from the exchange is that I was interested in being a little more helpful with the information we provided, suggesting that we had time to fade away our help (the program we were working on was serving children in Grades 3–8). She, on the other hand, argued that "students can't wait forever" to be expected to reason about mathematics. (I've been on her side of the argument many, many times too, though never with a time-is-running-out rationale.)

Second, I came across, quite unexpectedly, some related, illuminating thoughts in a book called Algorithms to Live By: The Computer Science of Human Decisions. The authors attempt to provide some real-world relevance to the tradeoff between exploration and exploitation—a longstanding tension in computer science between "gathering information and . . . using the information you have to get a known good result."

> One of the curious things about human beings, which any developmental psychologist aspires to understand and explain, is that we take years to become competent and autonomous. Caribou and gazelles must be prepared to run from predators the day they're born, but humans take more than a year to make their first steps.
>
> Alison Gopnik, professor of developmental psychology at UC Berkeley . . . has an explanation for why human beings have such an extended period of dependence: "it gives you a developmental way of solving the exploration/exploitation tradeoff . . . Childhood gives you a period in which you can just explore possibilities, and you don't have to worry about payoffs because payoffs are being taken care of by the mamas and the papas and the grandmas and the babysitters . . . .
>
> If you look at the history of the way that people have thought about children, they have typically argued that children are cognitively deficient in various ways—because if you look at their exploit capacities, they look terrible. They can't tie their shoes, they're not good at long-term planning, they're not good at focused attention."
>
> [But] our intuitions about rationality are too often informed by exploitation rather than exploration.

Third and finally—and again, entirely and bizarrely unexpectedly—was this, which rounds out our trip from left of center to neutral to right of center (admittedly leaving the left of center unfairly underrepresented):

> Treat children like children, treat grown-ups like grown-ups. An 11-year old doesn't need to teach himself, and shouldn't. A 22-year old does need to teach himself and must. And the best way to become a self-teaching 22-year old is to have teachers and parents who directly teach you when you're 11. People have known this for hundreds of years--thousands of years--and yet our public schools have somehow forgotten.

Although I'm more inclined to favor early and protracted exploration (input, learning) followed by later exploitation (performance, output) at virtually every scale in education—something that I have undoubtedly been unable to conceal even in this post—I don't intend in this writing to pass judgment one way or another²; only to suggest that where one finds oneself on the exploration-exploitation continuum is likely predictive of where one finds oneself on the traditionalist-progressivist continuum. Respectively.

Image credit: Ken Munson Photography

1. As far as education goes, I would consider myself right of center, and I would say my colleague was left of left of center. But that might be just what I would say. To her, she may have been either neutral or left of center, and I was right of right of center. That's how this game often works when you're not thinking hard about it: you're always neutral, and the other guy is the extremist.

2. I won't even mention that, given certain assumptions, the optimum balance (according to something called the Gittins Index) between exploration and exploitation is not 50-50; more like 70-30. : )

## Tuesday, May 31, 2016

In this post, we looked at the perceptron, a machine-learning algorithm that is able to "learn" the distinction between two linearly separable classes (categories of objects whose data points can be separated by a line—or, with multi-dimensional categories, by a hyperplane).

The data shown below resemble the apples and oranges data we used last time. There are two classes, or categories—in this case, the two categories are the setosa and versicolor species of the Iris flower. And each species has two dimensions (or features): sepal length and petal length. In our hypothetical apple-orange data, the two dimensions were weight and number of seeds.

Using just two dimensions allows us to plot each instance in the training data on a coordinate plane as above and draw some of the prediction lines as the program cycles through the data. For the above data, a solution is found after about 5–6 "epochs" (cycles of $$\mathtt{n}$$ runs through the data). This solution is represented by the blue dashed line.

This process is a bit clunky. The coefficients, or weights, $$\mathtt{w}$$, are updated using the learning rate $$\mathtt{\eta}$$ by $$\mathtt{\Delta w_{1,2} = \pm 2\eta x_{1,2}}$$ and $$\mathtt{\Delta w_0 = \pm 2\eta}$$, where $$\mathtt{x_{1,2}}$$ are the coordinates of the misclassified point. This process, though it always converges to a solution so long as there is one, tends to jolt the prediction line back and forth a bit abruptly.

With gradient descent, we can gradually reduce the error in the prediction. Sometimes. We do this by making use of the sum of squared errors function—a quadratic function (parabola) that has a minimum: $\mathtt{\frac{1}{2}\sum(y - wx)^2}$

This formula shows the target vector $$\mathtt{y}$$ (the collection of target values of 1 and -1) minus the input vector—the linear combination of weights and dimensions, $$\mathtt{w^{T}x}$$, for each object in the data. The components of the difference vector are squared and summed, and the result is divided by 2, which gives us a scalar value that places us somewhere on the parabola. We don't use this "cost" value in the updates themselves; we only track it to see whether it decreases over cycles.

Okay, so we don't know which side of the parabola we're on. In that case, we look at the opposite of the gradient of the curve with respect to the weights, or the opposite of the partial derivative of the curve with respect to the weights. Setting $$\mathtt{u = y - wx}$$ and applying the chain rule: $\mathtt{-\frac{\partial}{\partial w}(y - wx)^2 = -\frac{d}{du}(u^2) \cdot \frac{\partial}{\partial w}(y - wx) = -2u \cdot (-x) = 2(y - wx)(x)}$

Multiply this result by the $$\mathtt{\frac{1}{2}}$$ from the cost function, restore the summation, and we get a gradient step of $$\mathtt{\sum (y - wx)(x)}$$. Finally, multiply by the learning rate $$\mathtt{\eta}$$ to get the change to each weight: $\mathtt{\eta\sum (y - wx)(x)}$

#### An Example

Let's take a look at a small example. I'll use data for just 10 flowers in the Iris data set. All of these belong to the setosa species of the Iris flower. I'll use a learning rate of $$\mathtt{\eta = 0.0001}$$.

| Sepal Length (cm) | Petal Length (cm) |
| --- | --- |
| 5.1 | 1.4 |
| 4.9 | 1.4 |
| 4.7 | 1.3 |
| 4.6 | 1.5 |
| 5.0 | 1.4 |
| 5.4 | 1.7 |
| 4.6 | 1.4 |
| 5.0 | 1.5 |
| 4.4 | 1.4 |
| 4.9 | 1.5 |

Since all of these instances are setosa, they all have target values of $$\mathtt{-1}$$, which don't change as the data cycle through. Our $$\mathtt{y}$$ vector, then, is [$$\mathtt{-1, -1, -1, -1, -1, -1, -1, -1, -1, -1}$$], and our starting weights are [0, 0, 0]. The first weight here is the "intercept" weight, or bias weight, which gets updated differently from the others.

Our input vector, $$\mathtt{w^{T}x}$$, is the combination sepal length × weight 1 + petal length × weight 2 for each object in the data. At the start, then, our input vector is a zero vector: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]. The difference vector, $$\mathtt{y - w^{T}x}$$, is, in this case, equal to the y vector: just a collection of ten negative 1s.

The bias weight, $$\mathtt{w_0}$$, is updated by learning rate × the sum of the components of the difference vector, or $$\mathtt{\eta\sum(y - w^{T}x)}$$. This gives us $$\mathtt{w_0 = w_0 + 0.0001 \times -10 = -0.001}$$.

The $$\mathtt{w_1}$$ weight is updated as 0.0001(5.1 × -1 + 4.9 × -1 + 4.7 × -1 + 4.6 × -1 + 5 × -1 + 5.4 × -1 + 4.6 × -1 + 5 × -1 + 4.4 × -1 + 4.9 × -1) and the $$\mathtt{w_2}$$ weight is updated as 0.0001(1.4 × -1 + 1.4 × -1 + 1.3 × -1 + 1.5 × -1 + 1.4 × -1 + 1.7 × -1 + 1.4 × -1 + 1.5 × -1 + 1.4 × -1 + 1.5 × -1).

Through all that gobbledygook, our new weights are $$\mathtt{-0.00486}$$ and $$\mathtt{-0.00145}$$ with a bias weight of $$\mathtt{-0.001}$$. You can see below how different the gradient descent process is from the perceptron model. This model in particular doesn't have to stop when it finds a line that separates the categories, since the minimum may not be reached even when the program can classify the categories precisely.
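If it helps to see those steps executed, here is that first batch update as a quick Python sketch. This is my own code, not something from the post or the study; the variable names (`sepal`, `petal`, `eta`) are mine.

```python
# A minimal sketch (my code, not the post's) of the first batch
# gradient-descent update over the ten setosa flowers above.
sepal = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9]
petal = [1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5]
y = [-1] * 10          # target values: every flower is setosa
w = [0.0, 0.0, 0.0]    # [bias, sepal weight, petal weight]
eta = 0.0001           # learning rate

# The input vector w^T x for each flower (all zeros on the first pass).
net = [w[0] + w[1] * s + w[2] * p for s, p in zip(sepal, petal)]
errors = [yi - ni for yi, ni in zip(y, net)]   # y - w^T x

# The sum-of-squared-errors "cost," tracked but not used in the update.
cost = 0.5 * sum(e * e for e in errors)        # 5.0 on this first pass

# Each weight changes by eta * sum((y - w^T x) * x);
# the bias changes by eta * sum(y - w^T x).
w[0] += eta * sum(errors)
w[1] += eta * sum(e * s for e, s in zip(errors, sepal))
w[2] += eta * sum(e * p for e, p in zip(errors, petal))

print(cost, w)  # 5.0 [-0.001, -0.00486..., -0.00145...]
```

Running this reproduces the weights computed above: a bias weight of $$\mathtt{-0.001}$$ and weights of $$\mathtt{-0.00486}$$ and $$\mathtt{-0.00145}$$.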

## Friday, May 27, 2016

### Enhance the Salience of Relevant Variables

This study has some connections to things I've written up here—in particular, Hallowell et al. on how the salience of characteristics of solids affects the responses of young children to certain visuo-spatial tasks. When these salient characteristics are irrelevant to the task, inhibitory processes can be measured, as discussed in studies here and again here.

The study we will look at in this post explored whether increasing the salience of the relevant variable in tasks with both relevant and irrelevant variables of competing significance would improve students' performance. In particular, researchers looked at whether increasing the salience of perimeter would help students with responses to tasks where area was irrelevant but salient.

Three groups of shape-pair categories were used, along with two levels of complexity:

In the Congruent condition, the shape with the greater area also has the greater perimeter. In the Incongruent Inverse condition, the shape with the smaller area has the greater perimeter. And in the Incongruent Equal condition, the areas differ, but the perimeters are equal.

#### The Experiment and Results

Participants (58 fifth- and sixth-graders) were divided into two groups. One group of students was tested first on the shapes as you see them above. The second group, however, was tested first on what researchers called the discrete mode of presentation. In this mode, the perimeters of the shapes are made more salient by drawing them using equal-sized matchsticks instead of as continuous lines:

Each group was tested on the other mode of presentation 10 days later. So, students who began with the discrete mode were tested in the continuous mode later, and students tested first in the continuous mode were tested 10 days later in the discrete mode.

The results are pretty staggering. And keep in mind that students in both groups took both tests. The order of the tests is what is primarily responsible for the dramatically different results.

> Previous research in mathematics and science education has shown that specific modes of presentation may improve students’ performance (i.e., Clement 1993; Martin and Schwartz 2005; Stavy and Berkovitz 1980; Tirosh and Tsamir 1996). Our results corroborate these findings and indeed show that success rate in the discrete mode of presentation test is significantly higher than in the continuous mode of presentation test. This significant difference is evident in all conditions (congruent, incongruent inverse, and incongruent equal) and in all levels of complexity (simple and complex). The visual information depicted in the discrete mode of presentation strongly enhances the salience of the relevant variable, perimeter, and somewhat decreases the salience of the irrelevant variable, area. This visual information also emphasizes the perimeter’s segments that should be mentally manipulated when solving the task. The discrete mode of presentation, therefore, enhances the use of strategies that are regularly used when solving this task. In the continuous mode of presentation, however, no hint of such possibility of mentally breaking the solid line into relevant segments is given.

The researchers note that a similar study using discrete segments for perimeter found no effect for the order of presentation. The authors surmise that this was due to the fact that, in that study, researchers did not use equal-sized discrete units, so students could not manipulate these units to determine perimeter. That is, discreteness is not what mattered, but that the discrete units could be manipulated successfully to produce correct responses to perimeter questions. The key, still, is that insights gained from working first with these discrete, equal-sized units transferred to the continuous mode of presentation.

> The positive effect of a previous presentation observed in the current study can be seen as “teaching by analogy.” In teaching by analogy students are first presented with an “anchoring task” that elicits a correct response due to the way it is presented and hence supports appropriate solution strategies. Later on, students are presented with a similar “target task” known to elicit incorrect responses. The anchoring task probably encourages appropriate solution strategies, and such a sequence of instruction was effective in helping students overcome difficulties (e.g., Clement 1993; Stavy 1991; Tsamir 2003). . . .
>
> Performing the discrete mode of presentation test strongly enhances the salience of the relevant variable, perimeter, and somewhat decreases that of area. This enhancement supports appropriate solution strategies that lead to improved performance. This effect is robust and transfers to continuous mode of presentation for at least 10 days. In line with this conclusion, a student who performed the continuous test after the discrete one commented that, “It [continuous] was harder this time but I used the previous shapes, because I could do tricks with the matchsticks.”

Babai, R., Nattiv, L., & Stavy, R. (2016). Comparison of perimeters: Improving students’ performance by increasing the salience of the relevant variable. ZDM, 48(3), 367–378. DOI: 10.1007/s11858-016-0766-z

## Thursday, May 26, 2016

### The Perceptron

I'd like to write about something called the perceptron in this post—partly because I'm just learning about it, and writing helps me wrap my head around new-to-me things, and partly because, well, it's interesting.

It works like this. Take two categories which are, in some quantifiable way, completely different from each other. In this example, let's talk about apples and oranges as the categories. And we'll give each of these categories just two dimensions—weight in grams and number of seeds. (I'll fudge these numbers a bit to help with the example.)

Studying this information alone, you're already prepared to know with 100% confidence whether an object is an apple or an orange given its weight and number of seeds. In fact, the number of seeds is unnecessary. If you have just the two objects to choose from, and you are given only an object's weight, you have all the information you need to assign it to the apple category or the orange category.

But, you have to play along here. We want to train the computer to make the inference we just made above, given only a set of data with the weights and number of seeds of apples and oranges. Crucially, what the perceptron does is find a line between the two categories. You can easily see how to draw a line between the categories at the left (where I have plotted 100 random apples and oranges in their given ranges of weights and seeds), but the perceptron program allows a computer to learn where to draw a line between the categories, given only a set of data about the categories.

#### Training the Perceptron

The way we teach the computer how to draw this line is that we ask it to essentially draw a prediction line. Then we make it look at each instance (apple or orange, one at a time) and see whether the line correctly predicts the category of the object. If it does, we make no change to the line; if it doesn't, we adjust the line.

The prediction line takes the standard form $$\mathtt{Ax + By + C = 0}$$. If $$\mathtt{Ax + By + C > 0}$$, we predict orange. If $$\mathtt{Ax + By + C \leq 0}$$, we predict apple. We can put the "or equal to" on either one of these inequalities, and we need the "or equal to" on one of them for this to be a truly binary decision.

Okay, so let's draw a prediction line across the data above at $$\mathtt{y = 119}$$, say. In that case, $$\mathtt{A = 0, B = 1,}$$ and $$\mathtt{C = -119}$$. When we come across apple data points, they will all be categorized correctly (as below the line), but some of the orange data points will be categorized correctly and some incorrectly. On correct categorizations, we don't want the line to update at all, and on incorrect categorizations, we want to change the line. Here's how both of these things are accomplished:

Let's take the point at (10, 109). This should be classified as orange, which we'll quantify as $$\mathtt{1}$$. The prediction line, however, gives us $$\mathtt{0(10) + 1(109) + -119}$$, which is $$\mathtt{-10 \leq 0}$$. So, the perceptron mistakenly predicts this point to be an apple, which we'll quantify as $$\mathtt{-1}$$. We want to adjust the line.

Subtract the prediction ($$\mathtt{-1}$$) from the actual (1) to get $$\mathtt{2}$$. Multiply this by a small learning rate of 0.1, so $$\mathtt{0.1 \times 2 = 0.2}$$. Finally, change $$\mathtt{A}$$, $$\mathtt{B}$$, and $$\mathtt{C}$$ like this:

$$\mathtt{A = A + 0.2 \times 10}$$

$$\mathtt{B = B + 0.2 \times 109}$$

$$\mathtt{C = C + 0.2}$$

So, in our example, $$\mathtt{A}$$ becomes $$\mathtt{0 + 0.2 \times 10}$$, or 2, $$\mathtt{B}$$ becomes $$\mathtt{1 + 0.2 \times 109}$$, or 22.8, and $$\mathtt{C}$$ becomes $$\mathtt{-119 + 0.2}$$, or $$\mathtt{-118.8}$$.

We have a new prediction line, which is $$\mathtt{2x + 22.8y + -118.8 = 0}$$. This line now makes the correct prediction for the orange at (10, 109), as you can see at the right with the blue line.

But this point is only used to adjust A, B, and C (called the "weights"), and then it's on to the next point to see if the new prediction line succeeds in making a correct prediction or not. The weights change with each incorrect prediction, and these changing weights move the prediction line up (left) and down (right) and alter its slope as well.
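In code, that single-point update looks something like this. It's a quick Python sketch of my own (the function names are hypothetical), and it verifies the numbers above:

```python
# A small sketch (my own code) of the single-point update just described.
def predict(w, x):
    # w = [C, A, B]; x = (x1, x2). Returns 1 (orange) or -1 (apple).
    z = w[1] * x[0] + w[2] * x[1] + w[0]
    return 1 if z > 0 else -1

def update(w, x, target, eta=0.1):
    # (target - prediction) is 0 when correct, 2 or -2 when wrong,
    # so the weights only move on an incorrect prediction.
    delta = eta * (target - predict(w, x))
    return [w[0] + delta, w[1] + delta * x[0], w[2] + delta * x[1]]

w = [-119, 0, 1]                 # the starting line, y = 119
w = update(w, (10, 109), 1)      # the orange at (10, 109)
print(w)                         # [-118.8, 2.0, 22.8], as in the text
```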

#### Many Iterations

Suppose we next encounter a point at (10, 90), which should be an apple. Our perceptron, however, will make the prediction $$\mathtt{2(10) + 22.8(90) - 118.8 = 1953.2 > 0}$$, which is a prediction of orange, or 1. Subtract the prediction (1) from the actual ($$\mathtt{-1}$$) to get $$\mathtt{-2}$$, and multiply by the learning rate to get $$\mathtt{-0.2}$$. Our weights are adjusted as follows:

$$\mathtt{A = 2 + -0.2 \times 10 = 0}$$

$$\mathtt{B = 22.8 + -0.2 \times 90 = 4.8}$$

$$\mathtt{C = -118.8 + -0.2 = -119}$$

If you graph this new line, $$\mathtt{0x + 4.8y + -119 = 0}$$, you'll notice that it's between the other two, but it would still make the incorrect categorization for the apple at (10, 90). This is why the perceptron must cycle through the data several times in order to "train" itself into determining the correct line. If you can make sense of it, there is a proof that this algorithm always converges in finite time for data that can be separated by a line (with a sufficiently small learning rate).

#### Other Notes

It's worth mentioning that the perceptron finds a line between the categories, but there are an infinite number of lines available. Also, in this example, we have two dimensions, or features, but the perceptron works for as many dimensions as you please.

So, to get schmancy, for each object in the data, $$\mathtt{z}$$, our prediction function takes a linear combination of weights (coefficients + that C intercept weight) and coordinates (features, or dimensions) $$\mathtt{z = w_{0}x_{0} + w_{1}x_{1} + \ldots + w_{m}x_{m} = w^{T}x}$$ and outputs $\theta(z) = \begin{cases} \phantom{-}1, & \quad z > 0 \\ -1, & \quad z \leq 0 \end{cases}$

Even with multi-dimensional vectors, the weights are updated just as above, perhaps with a different learning rate, chosen at the beginning.

Below is a Python implementation of the Perceptron from Sebastian Raschka, using a classic data set about irises (the flowers).
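For reference, here is a minimal perceptron sketch along those lines. This is my own code, not Raschka's implementation; the class and parameter names are mine.

```python
import numpy as np

class Perceptron:
    """A minimal perceptron sketch (my own code, not Raschka's)."""

    def __init__(self, eta=0.1, n_epochs=10):
        self.eta = eta            # learning rate
        self.n_epochs = n_epochs  # passes ("epochs") through the data

    def fit(self, X, y):
        # One weight per feature, plus the intercept weight C in w[0].
        self.w = np.zeros(1 + X.shape[1])
        for _ in range(self.n_epochs):
            for xi, target in zip(X, y):
                # (target - prediction) is 0 when correct, +/-2 when wrong.
                delta = self.eta * (target - self.predict(xi))
                self.w[1:] += delta * xi
                self.w[0] += delta
        return self

    def predict(self, xi):
        z = np.dot(xi, self.w[1:]) + self.w[0]
        return np.where(z > 0, 1, -1)

# Hypothetical usage with two features (e.g., weight, number of seeds):
# X = np.array([[140.0, 10], [170.0, 7]]); y = np.array([1, -1])
# model = Perceptron().fit(X, y)
```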

## Tuesday, May 24, 2016

### Is Area All About Rectangles?

A very common way of teaching students the reasoning behind the formula for the area of a non-rectangular parallelogram is to demonstrate how to take apart such a parallelogram and turn it into a rectangle. Some U.S. states' elementary and middle-school mathematics standards used to be fairly explicit about this "way" of teaching. Here are two examples—the first from California, the second from Florida:

> Derive and use the formula for the area of a triangle and of a parallelogram by comparing it with the formula for the area of a rectangle (i.e., two of the same triangles make a parallelogram with twice the area; a parallelogram is compared with a rectangle of the same area by cutting and pasting a right triangle on the parallelogram).

> Derive and apply formulas for areas of parallelograms, triangles, and trapezoids from the area of a rectangle.

The NCTM's 2000 publication Principles and Standards for School Mathematics (PSSM) also endorsed this method as consistent with stimulating students' understanding of, and investigation into, area:

> One particularly accessible and rich domain for such investigation is areas of parallelograms, triangles, and trapezoids. Students can develop formulas for these shapes using what they have previously learned about how to find the area of a rectangle, along with an understanding that decomposing a shape and rearranging its component parts without overlapping does not affect the area of the shape.

And, from the third edition of this well-known book by Van de Walle, we have the following:

> Parallelograms that are not rectangles can be transformed into rectangles having the same area. The new rectangle has the same height and two sides the same as the original parallelogram. Students should explore these relationships on grid paper, on geoboards, or by cutting paper models, and should be quite convinced that the areas are the same and that such reassembly can always be done with any parallelogram. As a result, the area of a parallelogram is base times height, just as for rectangles.

I should note that, of the four snippets I present above, only Van de Walle's even makes an attempt at a consistent distinction between non-rectangular parallelograms ("parallelograms that are not rectangles") and rectangles. But then even in the Van de Walle snippet, this distinction breaks down in the last sentence.

Consistent with the standards and the specific observations and suggestions in the widely referenced publications listed above, textbook lessons typically introduce students to the idea that the area of a non-rectangular parallelogram can be described by the same formula as that used to describe the area of a rectangle with the same base and height, using one or more illustrations like the one at right, accompanied by written instruction to help students understand the illustration.

Generally, the purpose of this written instruction is to clarify for students that (a) a part of the non-rectangular parallelogram was simply removed and then re-attached somewhere else on the figure—a transposition that does not change the area of the figure, and (b) the resulting figure is a rectangle whose height and base are the same as that of the non-rectangular parallelogram.

Still, just as the "integer-ness" of integers has nothing to do with the number 9, the reason the area of a parallelogram can be described by the formula $$\mathtt{A = bh}$$ or $$\mathtt{A = lw}$$ may have nothing in general to do with rectangles and everything to do with the fact that parallel lines are at all corresponding points the same distance apart.
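To put that last thought in symbols (my sketch of the equidistance argument, not something from the sources quoted above): if the two parallel sides of length $$\mathtt{b}$$ stay a constant distance $$\mathtt{h}$$ apart, then every cross-section of the parallelogram taken parallel to the base is a segment of length $$\mathtt{b}$$, and accumulating those segments across the height gives $\mathtt{A = \int_0^h b \,\, dy = bh}$ No rectangle is required; the same slicing argument applies to rectangles and non-rectangular parallelograms alike.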