Exploratory data analysis (EDA) is an approach to analyzing data for the purpose of formulating hypotheses worth testing, complementing the tools of conventional statistics for testing hypotheses. It was so named by John Tukey.
EDA development
Tukey held that too much emphasis in statistics was placed on statistical hypothesis testing (confirmatory data analysis); more emphasis needed to be placed on using data to suggest hypotheses to test. In particular, he held that confusing the two types of analyses and employing them on the same set of data can lead to systematic bias owing to the issues inherent in testing hypotheses suggested by the data.
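The bias that arises from testing hypotheses suggested by the data can be illustrated with a small simulation. The sketch below is not from Tukey's work; it assumes pure-noise data, a hypothetical "pick the most correlated predictor" exploratory step, and a held-out half of the rows as one possible safeguard. The function names and parameter values are illustrative only.

```python
import random
import statistics


def pearson_r(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate(n_rows=40, n_predictors=50, n_trials=200, seed=0):
    """Mean |r| of the 'best' predictor on the rows that suggested it
    versus on held-out rows.

    Every variable is independent noise (an assumption of this sketch),
    so any apparent relationship is spurious.
    """
    rng = random.Random(seed)
    same, fresh = [], []
    half = n_rows // 2
    for _ in range(n_trials):
        y = [rng.gauss(0, 1) for _ in range(n_rows)]
        xs = [[rng.gauss(0, 1) for _ in range(n_rows)]
              for _ in range(n_predictors)]
        # Exploratory step: pick the predictor that looks best on the first half.
        best = max(xs, key=lambda x: abs(pearson_r(x[:half], y[:half])))
        # "Confirm" on the same rows that suggested the hypothesis...
        same.append(abs(pearson_r(best[:half], y[:half])))
        # ...and on rows that played no part in choosing it.
        fresh.append(abs(pearson_r(best[half:], y[half:])))
    return statistics.fmean(same), statistics.fmean(fresh)


same_data, held_out = simulate()
print(f"mean |r| on exploratory rows: {same_data:.2f}")
print(f"mean |r| on held-out rows:    {held_out:.2f}")
```

Because the predictor is selected for looking strong on the exploratory rows, its correlation there is inflated, while the held-out rows show only the level expected from noise. Splitting the data is just one way to keep exploration and confirmation apart.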
The objectives of EDA are to:
Tukey's books were notoriously opaque, and so several attempts were made to popularise his EDA ideas. Prominent among these was the Statistics in Society (MDST242) course of The Open University.
Many EDA techniques have been adopted into data mining and are being taught to young students as a way to introduce them to statistical thinking.
Techniques
There are a number of tools that are useful for EDA, but EDA is characterized more by the attitude taken than by particular techniques.
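As an illustration of the kind of simple tool involved, the five-number summary (a summary commonly associated with Tukey's EDA) can be sketched in a few lines. This is a minimal sketch under stated assumptions: the hinge convention used is one of several in circulation, and the function name and sample values are made up for the example.

```python
import statistics


def five_number_summary(values):
    """Return (min, lower hinge, median, upper hinge, max).

    The hinges are the medians of the lower and upper halves of the
    sorted data. When the count is odd, the overall median is included
    in both halves (one common convention; other quartile definitions
    give slightly different values).
    """
    data = sorted(values)
    if not data:
        raise ValueError("need at least one value")
    n = len(data)
    half = (n + 1) // 2  # size of each half; the middle value is shared if n is odd
    lower, upper = data[:half], data[n - half:]
    return (data[0], statistics.median(lower), statistics.median(data),
            statistics.median(upper), data[-1])


sample = [2, 4, 4, 5, 7, 9, 10, 12, 15]  # made-up numbers for illustration
summary = five_number_summary(sample)
print(summary)  # (2, 4, 7, 10, 15)
```

The summary relies only on order statistics, so a single extreme value shifts at most the minimum or maximum, which is one reason such resistant summaries suit a first look at unfamiliar data.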
The principal graphical techniques used in EDA are:
The principal quantitative techniques are:
Graphical and quantitative techniques are:
History
Many EDA ideas can be traced back to earlier authors, for example:
The Open University course Statistics in Society (MDST242) took the above ideas and merged them with Gottfried Noether's work, which introduced statistical inference via coin-tossing and the median test.
For details of the above, see John Bibby's book HOTS: History of Teaching Statistics.
Bibliography
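- Hoaglin, D C; Mosteller, F & Tukey, John Wilder (Eds) (1985). Exploring Data Tables, Trends and Shapes. ISBN 0-471-09776-4.
- Hoaglin, D C; Mosteller, F & Tukey, John Wilder (Eds) (1983). Understanding Robust and Exploratory Data Analysis. ISBN 0-471-09777-2.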
- Tukey, John Wilder (1977). Exploratory Data Analysis. Addison-Wesley. ISBN 0-201-07616-0.
- Velleman, P F & Hoaglin, D C (1981). Applications, Basics and Computing of Exploratory Data Analysis. ISBN 0-87150-409-X.
References
- Leinhardt, G. & Leinhardt, S. (1980). Exploratory Data Analysis: New Tools for the Analysis of Empirical Data. Review of Research in Education, Vol. 8, pp. 85-157.
External links
- Visalix (free interactive web application for EDA)
- DataDesk (free-to-try commercial EDA software for Mac and PC)
- GGobi (free interactive multivariate visualization software linked to R)
- MANET (free Mac-only interactive EDA software)
- Miner3D (EDA and visualization software)
- Mondrian (free interactive software for EDA)
- Orange (free component-based software for interactive EDA and machine learning)
- ViSta (free interactive software based on Xlisp-Stat for EDA)
- VisuMap (EDA software for high dimensional non-linear data)
- Visulab (free interactive software for high dimensional non-spatial / non-temporal data with interactive EDA and visualization)
- XLisp-Stat (free software and Lisp based EDA development framework for Mac, PC and X Window)
- Experimental Data Analyst (Mathematica application package for EDA)
- FactoMineR (free exploratory multivariate data analysis software linked to R)