This was written by Cathy N. Davidson, a Duke University professor and author of Now You See It: How the Brain Science of Attention Will Transform the Way We Live, Work, and Learn.
Last spring, when Google’s Project Oxygen revealed the results obtained from number-crunching its entire stock of personnel records — hiring, firing, merit raises, promotions — no one was more surprised than Google to find that the famously data-driven company had actually been promoting managers for their squishy, soft, Management 101 people skills. Google prides itself on managers who have technical chops, but technical expertise didn’t even make the Big Eight of esteemed management qualities. Fortunately, Google used a text-mining system flexible and open enough to see what was actually there, not just what it expected, and to make its own contradictions visible. The company is now re-examining its own management “rules” and its deepest assumptions about who and what make a good manager.
But what if Google had instead asked how well its employees fulfilled the company’s stated data-driven values: “How many of our managers have the technical expertise to be good managers?” The outcome of its own data-crunching could have been a disaster. Google (the 2012 “top company” on the Fortune list, by the way) might have failed its own “empirical,” “objective,” and “standardized” test. If Google had been a public school, it might have been slated for closure in 2014 because of that failure.
Of course I’m overstating the case for effect. But the point here is that all the data in the world doesn’t matter if you ask the wrong question of the data or if the method of testing isn’t flexible enough to yield real, true data about success and failure.
All the data in the world doesn’t matter if you are collecting one kind of information while the real problem or virtue lies elsewhere. I believe this is the conundrum we are now in with the multiple-choice, end-of-grade testing that the United States (and the world) now uses as its gold standard. Not only is it an outmoded form of testing; teaching aimed at bubble-test success does not ensure real learning. Nor does it ensure that students will retain what they have learned and be able to apply it to their next learning challenges, in the classroom or beyond.
How do we measure learning innovation?
This issue came up pointedly at last week’s Harvard Innovations in Learning and Teaching (HILT) symposium, where I was delighted to be one of the plenary speakers. The symposium was designed to help us all use the best research on learning to rethink the traditional classroom. One of the closing speakers said that he would be crunching the data to make sure the results of learning experiments were “rigorous.” “The best innovators are often not the best evaluators,” he said.
I thought of Project Oxygen, and of the dismal state of No Child Left Behind school evaluation methods, and responded, “True enough! But the best evaluators are often not the best innovators. We also have to make sure that our metrics are expansive enough to ‘count’ values that may not be testable by current measures.”
Fortunately, Henry “Roddy” Roediger was also a plenary speaker. Roediger’s research shows the limitations of item-response testing that is divorced from that which is being learned. His work in the Memory Lab at Washington University also shows that lecturing is the least effective learning method. If you want people to retain and to be able to master and apply what they learn, they have to be “tested” over and over as they are learning, and with feedback that helps them to learn better.
Roediger’s testing methods include a variety of challenges, including teaching others what you learn, working with someone who has a different answer than you to explain and correct your thinking, writing up your conclusions for a public audience that will challenge you, and other interactive forms of challenge-based testing.
Harvard physicist Eric Mazur also demonstrated his interactive testing-learning methods at HILT. He posed a basic physics problem to the crowd, we clicked our answers, and then he had us each try to convince someone with a different answer to change their mind.
In my interactions, a problem occurred. Perhaps because I was a plenary speaker, the stranger I chose as my partner, a very smart and lovely person who knew the right answer, didn’t prevail on me forcefully enough to change my answer. I was wavering, convince-able, but the learning transfer didn’t happen in our exchange. (It did, however, when it turned out he was right and I was wrong; I will probably never forget that physics lesson, which proves Mazur’s point, in long form).
But let’s back up. If my partner in this physics audience had been a Web developer, and we were doing a Web-building project together, my wrong answer might well have been the one we went with, and our common project would have failed.
Because so much code is written collaboratively, with strangers, where outcomes matter to the success of the project, to future jobs, and to future collaborations, coders have developed a complex yet easy-to-use (and difficult to “game”) system of awarding one another badges for successful, innovative collaboration. They don’t need a multiple-choice test to prove they are good coders. In fact, unlike doctors, accountants, beauticians, or financial advisers, programmers don’t even have a formal certification or credentialing system.
Millions of Web programmers worldwide have learned to innovate at a far faster pace than most of us and to evaluate one another rigorously through peer assessment. Really. That is so counter-intuitive that I’m going to repeat it: “Millions of Web programmers worldwide have learned to innovate at a far faster pace than most of us and to evaluate one another rigorously through peer assessment.”
How is this possible? How can peers really evaluate one another? They can and do in the Web world, by awarding badges as peer-given contribution and reputation points. Badges are the visible symbol of a complex system of rigorous peer evaluation of all the complex skills (the kind Project Oxygen turned up at Google) as well as all the innovative programming that Web coders contribute to one another.
I believe we can learn much from what and how they do what they do.
Badges, innovation, and evaluation: The example of Stack Exchange
To understand more about the world of badges, I interviewed Jeff Atwood, cofounder of Stack Exchange, a network of question-and-answer websites that includes Stack Overflow for programmers, Server Fault for system administrators, and more than 70 others ranging from photography to productivity. He also writes the popular blog Coding Horror.
Stack Overflow serves a worldwide community of 12 to 13 million programmers and draws roughly 16 million page views per month. Atwood likes to say that one of Stack Exchange’s chief contributions is “making platforms that make it easy for people to contribute their knowledge to one another.” Members pose questions and other members answer them, and, if the answer is good, you award points to your coding colleague.
If you are heading in a wrong direction (as I was in Eric Mazur’s session at HILT) and someone is able to steer you in the right direction, you award points to that person for their teaching abilities. The points add up, and the results appear on each programmer’s personal page, where badges are proudly displayed. Click on a programmer’s glowing gold badge and you find a detailed assessment of everything that contributed to the high scores, including the awarders’ comments about why.
I’m not talking about resume-speak. If I award points, you can read the actual details and reasons why Captain Coder over in Beijing earned points for her C++ programming chops, or why Mr. Algorithmic in Sydney was awarded top points for being “precognitive,” someone who follows the development of new ideas and communities during the earliest stages. Cruncher from Cambridge might earn points for being a “self-learner,” or a “teacher,” or for being “tenacious,” “outspoken,” or “disciplined,” different assets in the community based on algorithms and contributions to the site. (You can see the Stack Overflow badges and points here.)
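The core idea behind this kind of system — peers award points with a stated reason, points accumulate into a public reputation, and crossing a threshold unlocks a named badge — can be sketched in a few lines. The class, member names, badge labels, and thresholds below are invented for illustration; they are not Stack Exchange’s actual rules or API:

```python
from collections import defaultdict

# Hypothetical badge thresholds -- illustrative only, not Stack Exchange's real rules.
BADGE_THRESHOLDS = {"Teacher": 15, "Self-Learner": 50, "Tenacious": 200}

class ReputationBoard:
    """Minimal peer-reputation ledger: peers award points with a reason,
    and badges unlock when a member's total crosses a threshold."""

    def __init__(self):
        self.points = defaultdict(int)
        self.comments = defaultdict(list)  # the open "why" trail behind each score

    def award(self, from_member, to_member, points, reason):
        if from_member == to_member:
            raise ValueError("members cannot award points to themselves")
        self.points[to_member] += points
        self.comments[to_member].append((from_member, points, reason))

    def badges(self, member):
        """Return every badge whose threshold this member's total has reached."""
        total = self.points[member]
        return [name for name, needed in BADGE_THRESHOLDS.items() if total >= needed]

board = ReputationBoard()
board.award("mr_algorithmic", "captain_coder", 25, "clean C++ refactor of the parser")
board.award("cruncher", "captain_coder", 30, "patient explanation of move semantics")
print(board.points["captain_coder"])   # 55
print(board.badges("captain_coder"))   # ['Teacher', 'Self-Learner']
```

The point of the sketch is the comment trail: because every award stores who gave it and why, reputation stays auditable rather than opaque, which is exactly what makes the system hard to “game.”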
These qualities merge programming skills with teaching and learning skills — collaborative skills — because, to deliver code on time, you need all of those (as Project Oxygen also found with its personnel-record data-mining). Atwood calls them a reputational “breadcrumb trail on the Internet.” But he’s being modest.
Another part of Stack Exchange is Careers 2.0, a job-posting and connecting service, a kind of Match.com for jobs. Reputation based on badges and points is the currency of the realm — and it is a leading service for employers looking to hire managers, programmers, and just about anyone else in the world’s mobile, distributed programmer workforce. It should come as no surprise that many of the best tech companies, including Google, use Careers 2.0 for their recruiting.
But one more word about badging. It’s not just about jobs. As Atwood says, the badges on Stack Exchange don’t just record participation, they incentivize it. They also allow you to match the qualities you value against the complex range of qualities that peers have recognized and rewarded. You do a good job, others give you credit. And if I, as an employer, want to find out why someone has earned a badge, all I have to do is click on it, read the details and comments, and then decide how much I do or don’t trust the reputation. It’s open, so I can see where Mr. Algorithmic is getting his points.
That is the thing about non-standardized open content: others can comment on it, emend it, challenge it. And, if you want to crunch such loosey-goosey evaluation, well, we now have text-mining software that allows that, with remarkable complexity, as we saw from Project Oxygen.
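“Crunching” such free-text evaluations need not be exotic. Even a bare-bones term-frequency pass — far simpler than whatever Google’s analysts actually ran, and purely illustrative here, with invented comments — can surface which qualities a community praises most often:

```python
from collections import Counter
import re

# Hypothetical peer-evaluation comments -- invented for illustration.
comments = [
    "Patient teacher, explained the algorithm step by step",
    "Tenacious debugger and a generous teacher",
    "Disciplined reviewer, tenacious about edge cases",
]

# Words too common to tell us anything about the person being evaluated.
STOPWORDS = {"the", "a", "and", "about", "by", "step"}

def quality_counts(texts):
    """Count how often each non-stopword term appears across evaluations."""
    words = []
    for text in texts:
        words += [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    return Counter(words)

print(quality_counts(comments).most_common(3))
# [('teacher', 2), ('tenacious', 2), ('patient', 1)]
```

Real text-mining systems go much further (synonyms, phrases, sentiment), but the principle is the same: open, unstandardized comments are still countable data.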
We no longer need to use the A, B, C, D, or None of the Above multiple-choice test invented in 1914 and patterned after the state-of-the-art mass production of its time, Henry Ford’s assembly line. When you think about it, it’s pretty hard to believe that the state-of-the-art evaluation system the world currently uses for something as complex as learning dates back to the Model T.
Better evaluation systems exist now
We have computers now, everyone. Imagine that! But we are still using testing methods designed for the era of the Model T, a form of testing that measures only the narrow range of “lower order thinking” that Best Available Answer questions can capture. We know, from the best data-based research, that this form of testing is a disincentive to learning, especially for kids who don’t believe they have any chance of using good test scores to get into college. In other words, the tests “incentivize,” to use Jeff Atwood’s word, only those aspiring to an end: college, a certificate, a credential. The tests do not incentivize contribution, participation, collaboration, and learning, which is what Stack Exchange strives for.
Think about that. We have a system of tests designed for citizens of the Industrial Age, based on the assembly line, that is extremely costly, doesn’t measure much content, and doesn’t motivate learning. Meanwhile, millions of programmers have found a way that works so well they don’t even need formal credentials and accreditation systems. What they do works, and it works because peers evaluate contribution (they don’t even have a system of “failing”: they reward what works and what is good, setting the bar for reputation at its highest, not its lowest).
We not only can use far more interactive, complex, humane, interesting, challenging, and innovative forms of assessment for real learning, real teaching, real collaboration — the tech community is already doing that. Teachers, researchers, experimenters, and evaluators all need to think about these systems and learn from them. Project Oxygen revealed patterns even Google didn’t suspect. Stack Exchange is doing that daily, with millions of people.
The badging systems I’m interested in exploring have to be offered by nonprofit learning organizations, to avoid further commercialization and exploitation of our educational system. They have to be less expensive to administer than the current cumbersome systems of Human Resource (HR) evaluation, end-of-grade tests, and teacher “standards,” evaluation, and merit reviews. They have to include peer components. They have to cover a range of skills, content, subject matter, mastery, application, theory and practice, competencies, and collaborative or character qualities. And, most important, they have to be tied to the learning process itself and incentivize and motivate, not just document, real, long-term, engaged, interactive learning.
Badges for lifelong learning
Since September, the nonprofit learning network I cofounded, HASTAC (“haystack”), has been working with the John D. and Catherine T. MacArthur Foundation and the Mozilla Foundation to run competitions on Badges for Lifelong Learning, as part of our annual Digital Media and Learning Competition.
It turns out that many institutions join us in thinking our Model T form of testing is archaic and a disincentive to real learning and real learning innovation, in schools, in informal learning settings, and in the workplace. Nearly 340 different institutions — from NASA to Intel, from small local schools to the Department of Education — have offered challenges. We’ve just announced winners of the first phase of a separate Teacher Mastery Competition too. And we’re now challenging developers to apply to work with institutions to co-create badging systems that fit the values and learning goals of those institutions.
In the end, we will have a rich portfolio of active projects, all developing badging and reputation systems online, funded for a year so that they can learn and so that we — the public — can learn from an open competition, an open year of co-developing, and an open year of evaluating, recommending, refining, improving, and creating together. That is what learning is about. We can all learn to do this together, in the way that the Open Web has developed for the 21st century but that has yet to penetrate into our institutions of formal learning and into many of our business institutions as well.
You can’t build the next generation of the Web with an assembly line
At the HILT conference at Harvard, we talked a lot about how real metrics, real data, real experiment can serve real learning innovation. If we don’t also think about innovative metrics, data, and experimental methods, we will replicate old standards and values but with some relatively insignificant new tweaks. If we want true innovation in learning, we must strive for true innovation in the methods we use for deciding what counts and how we count. I’m hopeful that we are at a tipping point. I believe we are on the verge of using the successful methods already being used by the developers of the Internet to find the best ways of learning for the Internet Age. I believe we will soon be finding new ways to measure contribution and to motivate learning not for the era of the Model T but for the 21st century.
Follow The Answer Sheet every day by bookmarking http://www.washingtonpost.com/blogs/answer-sheet.