Data Mining Methods and Applications

In this chapter, we review the knowledge discovery process, covering data handling, data mining methods and software, and current research activities. The introduction defines data mining and knowledge discovery in databases and provides general background, followed by an outline of the entire process in the second part. The third part presents data handling issues, including databases and the preparation of data for analysis. The fourth part, the core of the chapter, describes popular data mining methods, divided into supervised and unsupervised learning. Supervised learning methods are described in the context of both regression and classification, beginning with the simplest case of linear models, then proceeding to more complex modeling with trees, neural networks, and support vector machines, and concluding with methods intended solely for classification. Unsupervised learning methods are described under two categories: association rules and clustering. The fifth part presents past and current research projects involving both industrial and business applications. Finally, the last part briefly discusses remaining problems and future trends.
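To make the supervised/unsupervised distinction concrete, the following is a minimal illustrative sketch, not taken from the chapter itself: it uses Python with scikit-learn and its bundled iris data (library, dataset, and all parameter choices are our own assumptions for illustration). It fits a classification tree and a linear regression as supervised examples, where labeled responses guide the fit, and k-means clustering as an unsupervised example, where no labels are used.

```python
# Minimal sketch contrasting supervised and unsupervised learning
# (illustrative only; scikit-learn and the iris data are assumptions,
# not tools or data used in the chapter).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Supervised classification: a tree fit to labeled training data.
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)
print("tree accuracy:", clf.score(X_test, y_test))

# Supervised regression with a linear model (the simplest case the
# chapter starts from): predict the fourth measurement from the
# other three.
reg = LinearRegression()
reg.fit(X_train[:, :3], X_train[:, 3])
print("linear R^2:", reg.score(X_test[:, :3], X_test[:, 3]))

# Unsupervised learning: k-means groups observations by similarity
# without using the labels at all.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)
print("cluster sizes:", [int((labels == k).sum()) for k in range(3)])
```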

Author information

Authors and Affiliations

  1. Grado Department of Industrial and Systems Engineering, Virginia Polytechnic Institute and State University, Blacksburg, VA, USA: Kwok-Leung Tsui
  2. Department of Industrial, Manufacturing, and Systems Engineering, University of Texas at Arlington, Arlington, TX, USA: Victoria Chen and Chen Kan
  3. Antai College of Economics and Management, Shanghai Jiao Tong University, Shanghai, China: Wei Jiang
  4. School of Intelligent Systems Engineering, Sun Yat-sen University, Guangdong, China: Fangfang Yang