Data Mining Methods and Applications

In this chapter, we review the knowledge discovery process, covering data handling, data mining methods and software, and current research activities. The introduction defines data mining and knowledge discovery in databases and provides general background, followed by an outline of the entire process in the second part. The third part presents data handling issues, including databases and the preparation of data for analysis. The fourth part, the core of the chapter, describes popular data mining methods, divided into supervised and unsupervised learning. Supervised learning methods are described in the context of both regression and classification, beginning with the simplest case of linear models, then proceeding to more complex modeling with trees, neural networks, and support vector machines, and concluding with methods intended solely for classification. Unsupervised learning methods are described under two categories: association rules and clustering. The fifth part presents past and current research projects involving both industrial and business applications. Finally, the last part briefly discusses remaining problems and future trends.
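To make the supervised/unsupervised distinction concrete, the following is a minimal illustrative sketch, not taken from the chapter itself: it uses Python with scikit-learn and its bundled iris data (library, dataset, and all parameter choices are our own assumptions for illustration). It fits a classification tree and a linear regression as supervised examples, where labeled responses guide the fit, and k-means clustering as an unsupervised example, where no labels are used.

```python
# Minimal sketch contrasting supervised and unsupervised learning
# (illustrative only; scikit-learn and the iris data are assumptions,
# not tools or data used in the chapter).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Supervised classification: a tree fit to labeled training data.
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)
print("tree accuracy:", clf.score(X_test, y_test))

# Supervised regression with a linear model (the simplest case the
# chapter starts from): predict the fourth measurement from the
# other three.
reg = LinearRegression()
reg.fit(X_train[:, :3], X_train[:, 3])
print("linear R^2:", reg.score(X_test[:, :3], X_test[:, 3]))

# Unsupervised learning: k-means groups observations by similarity
# without using the labels at all.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)
print("cluster sizes:", [int((labels == k).sum()) for k in range(3)])
```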

Author information

Authors and Affiliations

  1. Grado Department of Industrial and Systems Engineering, Virginia Polytechnic Institute and State University, Blacksburg, VA, USA: Kwok-Leung Tsui
  2. Department of Industrial, Manufacturing, and Systems Engineering, University of Texas at Arlington, Arlington, TX, USA: Victoria Chen and Chen Kan
  3. Antai College of Economics and Management, Shanghai Jiao Tong University, Shanghai, China: Wei Jiang
  4. School of Intelligent Systems Engineering, Sun Yat-sen University, Guangdong, China: Fangfang Yang