
Data Mining is Easy
Seven Quantitative Insights into Active Management
Part 5

Why is it that so many strategies look great in backtests and disappoint upon implementation? Backtesters always have 95% confidence in their results, so why are investors disappointed far more than 5% of the time? It turns out to be surprisingly easy to search through historical data and find patterns that don't really exist. To understand why data mining is easy, we must first understand the statistics of coincidence. Let's begin with some non-investment examples. Then we will move on to investment research.

The statistics of coincidence

Several years ago Evelyn Adams won the New Jersey state lottery twice in four months. Newspapers put the odds of that happening at 17 trillion to 1, an incredibly improbable event. A few months later, two Harvard statisticians, Persi Diaconis and Frederick Mosteller, showed that a double win in the lottery is not a particularly improbable event. They estimated the odds at 30 to 1. What explains the enormous discrepancy in these two probabilities?

It turns out that the odds of Evelyn Adams winning the lottery twice are in fact 17 trillion to 1. But that result is presumably of interest only to her immediate family. The odds of someone, somewhere, winning two lotteries, given the millions of people entering lotteries every day, are only 30 to 1. If it wasn't Evelyn Adams, it could have been someone else. Coincidences appear improbable only when viewed from a narrow perspective. When viewed from the correct (broad) perspective, coincidences are no longer so improbable.

Let's consider another non-investment example: Norman Bloom, arguably the world's greatest data miner. Norman died a few years ago in the midst of his quest to prove the existence of God through baseball statistics and the Dow Jones average.
He argued that "BOTH INSTRUMENTS are in effect GREAT LABORATORY EXPERIMENTS wherein GREAT AMOUNTS OF RECORDED DATA ARE COLLECTED, AND PUBLISHED" (capitalization Bloom's). As but one example of thousands of his analyses of baseball, he argued that the fact that George Brett, the Kansas City third baseman, hit his third home run in the third game of the playoffs, to tie the score 3-3, could not be a coincidence; it must prove the existence of God. In the investment arena, he argued that the Dow's 13 crossings of the 1,000 line in 1976 mirrored the 13 colonies which united in 1776, which also could not be a coincidence. (He pointed out, too, that the 12th crossing occurred on his birthday, deftly combining message and messenger.) He never took into account the enormous volume of data (in fact, an entire New York Public Library's worth) he searched through to find these coincidences. His focus was narrow, not broad.

With Norman's passing, the title of world's greatest living data miner has been left open. Recently, however, Michael Drosnin, author of The Bible Code, seems to have filled it. (For details, see the book review.)
The importance of perspective to understanding the statistics of coincidence was perhaps best summarized by, of all people, Marcel Proust, who often showed keen mathematical intuition:
Investment research

Investment research involves exactly the same statistics and the same issues of perspective. The typical investment data mining example involves t-statistics gathered from backtesting strategies. The narrow perspective says: "After 19 false starts, this 20th investment strategy finally works. It has a t-statistic of 2." But the broad perspective on this situation is quite different. In fact, given 20 informationless strategies, the probability of finding at least one with a t-statistic of 2 is 64%. (Taking a t-statistic of 2 as the 95% confidence threshold and treating the 20 strategies as independent, each has a 5% chance of passing by luck, so the chance that at least one passes is 1 - 0.95^20, or about 0.64.) The narrow perspective substantially inflates our confidence in the results. When viewed from the proper perspective, confidence in the results lowers accordingly.

Four guidelines for backtesting integrity

Given that data mining is easy, how can we safeguard against it? Here are four guidelines for data mining integrity:
The intuition guideline demands that researchers investigate only those strategies with some ex ante expectation of success. Investment research should never involve free-ranging searches for patterns without regard for intuition.

The restraint guideline attempts to minimize the number of strategies investigated, i.e., to keep the broad and narrow focus similar. In the best case, researchers decide ex ante exactly which strategies and variants they will investigate, run their tests, and look at the answers. They do not go back and continually refine their investigations.

The sensibility guideline deletes results that seem improbably successful. Observed t-statistics that are too large may signal database errors or an improper methodology rather than a new strategy.

The fourth guideline, out-of-sample testing, is the statistician's answer to the curse of data mining. Coincidences observed over one data set are quite unlikely to reoccur in another independent data set.

Conclusions

Many backtesting results are not foolproof demonstrations of strategy value but merely coincidence. Four backtesting guidelines can help avoid data mining.
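
As a rough check on the 64% figure quoted under "Investment research", the short Python sketch below computes the probability analytically and then approximates it by simulation. It is an illustration under stated assumptions, not the article's own method: the 20 strategies are taken to be independent, "a t-statistic of 2" is read as a two-sided test (|t| >= 2) with roughly a 5% false-positive rate, and each informationless strategy is modeled as 60 periods of standard normal noise. The function names and the choice of 60 periods are hypothetical.

```python
import math
import random


def prob_at_least_one(n_strategies, false_positive_rate=0.05):
    """Chance that at least one of n independent, informationless
    strategies clears the significance threshold purely by luck."""
    return 1.0 - (1.0 - false_positive_rate) ** n_strategies


def t_statistic(returns):
    """t-statistic of the mean return against a true mean of zero."""
    n = len(returns)
    mean = sum(returns) / n
    variance = sum((r - mean) ** 2 for r in returns) / (n - 1)
    return mean / math.sqrt(variance / n)


def simulate_best_of(n_strategies=20, n_periods=60, n_trials=1000,
                     threshold=2.0, seed=0):
    """Fraction of trials in which the best of n pure-noise strategies
    shows |t| >= threshold.

    Assumption: each strategy's returns are independent N(0, 1) draws,
    so no strategy has any real information.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        best = max(
            abs(t_statistic([rng.gauss(0.0, 1.0) for _ in range(n_periods)]))
            for _ in range(n_strategies)
        )
        if best >= threshold:
            hits += 1
    return hits / n_trials


if __name__ == "__main__":
    print(f"analytic:  {prob_at_least_one(20):.3f}")
    print(f"simulated: {simulate_best_of():.3f}")
```

The analytic value for 20 strategies is about 0.64. The simulated fraction should land in the same neighborhood, though it will vary with the seed, the number of trials, and the number of periods assumed per backtest. Raising `n_strategies` shows how quickly the chance of a spurious "discovery" grows, which is the motivation for the restraint guideline.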