Newsletter #165 Home

Data Mining is Easy

Seven Quantitative Insights into Active Management—Part 5

by Ronald N. Kahn

Why is it that so many strategies look great in backtests and disappoint upon implementation? Backtesters always have 95% confidence in their results, so why are investors disappointed far more than 5% of the time? It turns out to be surprisingly easy to search through historical data and find patterns that don't really exist.

To understand why data mining is easy, we must first understand the statistics of coincidence. Let's begin with some non-investment examples. Then we will move on to investment research.

The statistics of coincidence

Several years ago Evelyn Adams won the New Jersey state lottery twice in four months. Newspapers put the odds of that happening at 17 trillion to 1, an incredibly improbable event. A few months later, two Harvard statisticians, Persi Diaconis and Frederick Mosteller, showed that a double win in the lottery is not a particularly improbable event. They estimated the odds at 30 to 1. What explains the enormous discrepancy between these two estimates?

It turns out that the odds of Evelyn Adams winning the lottery twice are in fact 17 trillion to 1. But that result is presumably of interest only to her immediate family. The odds of someone, somewhere, winning two lotteries—given the millions of people entering lotteries every day—are only 30 to 1. If it wasn't Evelyn Adams, it could have been someone else.

Coincidences appear improbable only when viewed from a narrow perspective. When viewed from the correct (broad) perspective, coincidences are no longer so improbable. Let's consider another non-investment example: Norman Bloom, arguably the world's greatest data miner.

Norman died a few years ago in the midst of his quest to prove the existence of God through baseball statistics and the Dow Jones average. He argued that "BOTH INSTRUMENTS are in effect GREAT LABORATORY EXPERIMENTS wherein GREAT AMOUNTS OF RECORDED DATA ARE COLLECTED, AND PUBLISHED" (capitalization Bloom's). As but one example among his thousands of analyses of baseball, he argued that the fact that George Brett, the Kansas City third baseman, hit his third home run in the third game of the playoffs, to tie the score 3-3, could not be a coincidence: it must prove the existence of God. In the investment arena, he argued that the Dow's 13 crossings of the 1,000 line in 1976 mirrored the 13 colonies which united in 1776, which also could not be a coincidence. (He pointed out, too, that the 12th crossing occurred on his birthday, deftly combining message and messenger.) He never took into account the enormous volume of data he searched through to find these coincidences: in fact, an entire New York Public Library's worth. His focus was narrow, not broad.

With Norman's passing, the title of world's greatest living data miner has been left open. Recently, however, Michael Drosnin, author of The Bible Code, seems to have filled it. (For details, see the book review.)

The importance of perspective to understanding the statistics of coincidence was perhaps best summarized by, of all people, Marcel Proust—who often showed keen mathematical intuition:

    The number of pawns on the human chessboard being less than the number of combinations that they are capable of forming, in a theater from which all the people we know and might have expected to find are absent, there turns up one whom we never imagined that we should see again and who appears so opportunely that the coincidence seems to us providential, although, no doubt, some other coincidence would have occurred in its stead had we not been in that place but in some other, where other desires would have been born and another old acquaintance forthcoming to help us satisfy them. (The Guermantes Way, Cities of the Plain, Volume 2 of the translation of Marcel Proust's Remembrance of Things Past [New York: Vintage Books, 1982], p. 178.)

Investment research

Investment research involves exactly the same statistics and the same issues of perspective. The typical investment data mining example involves t-statistics gathered from backtesting strategies. The narrow perspective says: "After 19 false starts, this 20th investment strategy finally works. It has a t-statistic of 2."

But the broad perspective on this situation is quite different. In fact, given 20 informationless strategies, the probability of finding at least one with a t-statistic of 2 is 64%. This follows if the tests are independent and each has a 5% chance of clearing that threshold by luck: 1 - 0.95^20 is approximately 0.64. The narrow perspective substantially inflates our confidence in the results. When viewed from the proper perspective, confidence in the results drops accordingly.
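The 64% figure can be checked with a few lines of Python. This is a minimal sketch, not the author's calculation: it assumes the 20 strategies are independent and that each informationless strategy has a 5% chance of reaching a t-statistic of 2 by luck (the complement of the 95% confidence level mentioned earlier). The function names and parameters are illustrative.

```python
import random


def prob_at_least_one(n_strategies, p_false_positive=0.05):
    """Probability that at least one of n independent, informationless
    strategies clears the significance threshold by luck alone."""
    return 1.0 - (1.0 - p_false_positive) ** n_strategies


def simulate(n_strategies, p_false_positive=0.05, n_trials=100_000, seed=0):
    """Monte Carlo check: the fraction of research projects in which
    at least one of the tested strategies looks significant."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # Each strategy independently "succeeds" with probability p.
        if any(rng.random() < p_false_positive for _ in range(n_strategies)):
            hits += 1
    return hits / n_trials


if __name__ == "__main__":
    for n in (1, 5, 20, 100):
        print(n, round(prob_at_least_one(n), 3))
```

Under these assumptions the probability climbs quickly with the number of strategies tried, which is why the count of false starts matters as much as the t-statistic of the eventual winner.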

Four guidelines for backtesting integrity

Given that data mining is easy, how can we safeguard against it? Here are four guidelines for backtesting integrity:

  • Intuition
  • Restraint
  • Sensibility
  • Out-of-sample testing

The intuition guideline demands that researchers investigate only those strategies with some ex ante expectation of success. Investment research should never involve free-ranging searches for patterns without regard for intuition.

The restraint guideline attempts to minimize the number of strategies investigated, i.e., to keep the broad and narrow perspectives similar. In the best case, researchers decide ex ante exactly which strategies and variants they will investigate, run their tests, and look at the answers. They do not go back and continually refine their investigations.

The sensibility guideline discards results that seem improbably successful. Observed t-statistics that are too large may signal database errors or an improper methodology rather than a new strategy.

The fourth guideline, out-of-sample testing, is the statistician's answer to the curse of data mining. Coincidences observed over one data set are quite unlikely to recur in another, independent data set.
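A small simulation illustrates why out-of-sample testing works. The sketch below is an assumption-laden toy, not a procedure from this article: each strategy's periodic returns are drawn as independent standard normal noise (so every strategy is informationless by construction), the strategy with the best in-sample t-statistic is selected, and its t-statistic is then recomputed on fresh, independent data. The sample sizes and function names are hypothetical.

```python
import math
import random
import statistics


def t_stat(returns):
    """t-statistic of the mean return against zero."""
    n = len(returns)
    return statistics.mean(returns) / (statistics.stdev(returns) / math.sqrt(n))


def average_best_t(n_trials=300, n_strategies=20, n_periods=60, seed=0):
    """Average in-sample and out-of-sample t-statistics of the strategy
    that looked best in-sample, over many simulated research projects."""
    rng = random.Random(seed)
    in_total = 0.0
    out_total = 0.0
    for _ in range(n_trials):
        # Pure-noise returns: no strategy has any real information.
        in_sample = [[rng.gauss(0.0, 1.0) for _ in range(n_periods)]
                     for _ in range(n_strategies)]
        out_sample = [[rng.gauss(0.0, 1.0) for _ in range(n_periods)]
                      for _ in range(n_strategies)]
        # Data mining step: keep only the best-looking backtest.
        best = max(range(n_strategies), key=lambda i: t_stat(in_sample[i]))
        in_total += t_stat(in_sample[best])
        out_total += t_stat(out_sample[best])
    return in_total / n_trials, out_total / n_trials


if __name__ == "__main__":
    in_t, out_t = average_best_t()
    print("average in-sample t of the winner:    ", round(in_t, 2))
    print("average out-of-sample t of the winner:", round(out_t, 2))
```

In this toy setting the selected strategy looks impressive in-sample purely because it was selected, while its out-of-sample t-statistic averages close to zero.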

Conclusions

Many backtesting results are not foolproof demonstrations of strategy value but merely coincidence. Four backtesting guidelines can help avoid data mining.





    © 1995-1999 BARRA, Inc. All rights reserved. Terms of Use.