Data Mining Methods and Applications
In this chapter, we review the knowledge discovery process, including data handling, data mining methods and software, and current research activities. The introduction defines data mining and knowledge discovery in databases and provides general background, followed by an outline of the entire process in the second part. The third part presents data handling issues, including databases and the preparation of data for analysis. The fourth part, the core of the chapter, describes popular data mining methods, divided into supervised and unsupervised learning. Supervised learning methods are described in the context of both regression and classification, beginning with the simplest case of linear models, then presenting more complex modeling with trees, neural networks, and support vector machines, and concluding with methods that apply only to classification. Unsupervised learning methods are described under two categories: association rules and clustering. The fifth part presents past and current research projects involving both industrial and business applications. Finally, the last part provides a brief discussion of remaining problems and future trends.
Author information
Authors and Affiliations
- Grado Department of Industrial and Systems Engineering, Virginia Polytechnic Institute and State University, Blacksburg, VA, USA: Kwok-Leung Tsui
- Department of Industrial, Manufacturing, & Systems Engineering, University of Texas at Arlington, Arlington, TX, USA: Victoria Chen & Chen Kan
- Antai College of Economics and Management, Shanghai Jiao Tong University, Shanghai, China: Wei Jiang
- School of Intelligent Systems Engineering, Sun Yat-sen University, Guangdong, China: Fangfang Yang