In last week’s blog, I briefly discussed the history of data mining (DM), which goes back over one hundred years. Its history in education is much shorter, beginning in the 1990s. I will write about that history when I discuss educational data mining in next week’s blog. This week, I want to focus on how data mining works. We will work through the DM process and see how it fits into a larger process called knowledge discovery in databases (KDD).
How Data Mining Works
As discussed in the history of DM, DM is a subprocess of KDD. Therefore, to understand how DM works, we first need a description of KDD to see where DM falls within that process. This description outlines how data is targeted, processed, and analyzed. Then, I will briefly describe the functionalities of DM along with some of the popular techniques used by practitioners. Finally, I will briefly discuss the software programs used to perform DM along with its general applications.
Video 1: Reviews the background of DM and describes many of its applications.
Data Mining within the KDD Process
The KDD process has nine steps from beginning to end. I will briefly discuss each step and then show where DM is incorporated into the process.
The first three steps of the KDD process involve finding a target set of data, selecting a subset of variables or data samples, and cleaning the data by organizing the subsets the user wants to analyze (Fayyad, Piatetsky-Shapiro, & Smyth, 1996).
Next, data reduction and projection occur, which reduce the number of variables within a data set and allow the data to be stored; figures and data representations can make up the data sets at this point of the KDD process.
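To make the preparation steps described so far more concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: the student records, the field names, and the helper functions (select_variables, clean, project) are all hypothetical and do not come from Fayyad et al. (1996). It also treats cleaning as simply dropping incomplete records, which is just one of many possible cleaning choices.

```python
# Hypothetical target data set: every name and value below is made up.
target_data = [
    {"student": "A", "hours_studied": 2.0, "absences": 1, "quiz_score": 55.0},
    {"student": "B", "hours_studied": 4.0, "absences": 0, "quiz_score": None},
    {"student": "C", "hours_studied": 6.0, "absences": 3, "quiz_score": 82.0},
    {"student": "D", "hours_studied": 8.0, "absences": 0, "quiz_score": 91.0},
]


def select_variables(records, variables):
    """Selection: keep only the subset of variables we care about."""
    return [{name: record[name] for name in variables} for record in records]


def clean(records):
    """Cleaning (assumed rule): drop any record that has a missing value."""
    return [
        record for record in records
        if all(value is not None for value in record.values())
    ]


def project(records, variables):
    """Reduction and projection: keep fewer variables, stored as plain tuples."""
    return [tuple(record[name] for name in variables) for record in records]


selected = select_variables(target_data, ["hours_studied", "absences", "quiz_score"])
cleaned = clean(selected)
prepared = project(cleaned, ["hours_studied", "quiz_score"])

print(prepared)
```

In this sketch, student B is removed during cleaning because the quiz score is missing, and the absences variable is dropped during reduction, so only the variables of interest reach the mining step.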
Now that the data has been narrowed down to the sets you want to mine for a specific pattern, DM enters the KDD process with the selection of a DM technique (Fayyad et al., 1996). A variety of techniques can be used to manipulate and compute data, including clustering, regression, classification, summarization, outlier detection, relationship mining, social network analysis, process mining, text mining, distillation of data for human judgment, and discovery with models (Coenen, 2011; Fayyad et al., 1996; Li, 2007).
Once the user chooses a DM technique, they must select an algorithm that implements that technique and will search the dataset(s) being mined for patterns (Fayyad et al., 1996). Then the computation begins: the algorithm runs on the data set and searches for patterns of interest “in a particular representational form or a set of such representations which may include classification rules or decision trees, regression, and clustering” (Fayyad et al., 1996, p. 42).
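As a rough illustration of the difference between a technique and an algorithm, the sketch below picks clustering as the technique and k-means as one algorithm that carries it out. This is my own example, not one taken from the sources cited here: the quiz scores, the starting centers, and the function name k_means_1d are all hypothetical, and the code handles only one-dimensional data to stay short.

```python
def k_means_1d(values, starting_centers, max_iterations=100):
    """A small k-means sketch for one-dimensional data.

    Repeats two steps until the centers stop moving:
    1. assign each value to its nearest center,
    2. move each center to the mean of the values assigned to it.
    """
    centers = list(starting_centers)
    clusters = [[] for _ in centers]
    for _ in range(max_iterations):
        clusters = [[] for _ in centers]
        for value in values:
            nearest = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
            clusters[nearest].append(value)
        # An empty cluster keeps its old center so we never divide by zero.
        new_centers = [
            sum(cluster) / len(cluster) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


# Hypothetical quiz scores and hand-picked starting centers (both assumptions).
quiz_scores = [52, 55, 58, 85, 88, 91]
centers, clusters = k_means_1d(quiz_scores, starting_centers=[52, 91])

print(centers)
print(clusters)
```

Here the pattern the algorithm finds is two groups of scores, one lower and one higher. A different clustering algorithm, or different starting centers, could group the same data differently, which is why the choice of algorithm is its own step.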
After the data has been computed, the last steps of KDD include interpreting the mined patterns and, where needed, returning to earlier steps of the process to mine the data set again with a different DM technique and/or algorithm (Coenen, 2011; Fayyad et al., 1996; Li, 2007).
Finally, per Fayyad et al. (1996), once knowledge is discovered through the KDD process, it can be applied to the problem the user is trying to solve, or it can be documented and reported to the appropriate personnel.
In-Depth: The Data-Mining Step of the KDD Process. Per Fayyad et al. (1996) and Coenen (2011), DM within the KDD process can be broken down into two differing approaches: the statistical approach and the logical approach. The statistical approach is the primary approach to DM, which “tends to be most widely used basis for practical data mining applications given the typical presence of uncertainty in real-world data generating processes” (Fayyad et al., 1996, p. 43). Thus, the statistical approach allows the various DM techniques to be used and computed with the vast array of algorithms available to run on one or more data sets.
Video 2: Describes the KDD process.
Data Mining Techniques
Within the discipline of DM, a variety of techniques can be used to manipulate and compute data. These DM techniques include clustering, regression, classification, summarization, outlier detection, relationship mining, social network analysis, process mining, text mining, the distillation of data for human judgment, and discovery with models. Each technique has a particular focus and use because each one manipulates data sets differently. Therefore, the output you get depends on which DM technique you choose for the problem at hand.
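To show how two techniques can give different outputs from the same data, here is a small Python sketch that applies regression and outlier detection to one made-up data set. Everything in it is an assumption for illustration: the hours and scores are hypothetical, the regression is an ordinary least-squares line fitted by hand, and the outlier rule (flag a score more than 1.5 standard deviations from the mean) is an arbitrary threshold I picked, not a standard from the literature cited here.

```python
import statistics

# Hypothetical data: hours studied and quiz scores for five students.
hours = [1, 2, 3, 4, 5]
scores = [60, 62, 64, 66, 98]


def fit_line(x_values, y_values):
    """Regression: fit y = slope * x + intercept by ordinary least squares."""
    x_mean = statistics.mean(x_values)
    y_mean = statistics.mean(y_values)
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(x_values, y_values))
    x_spread = sum((x - x_mean) ** 2 for x in x_values)
    slope = covariance / x_spread
    intercept = y_mean - slope * x_mean
    return slope, intercept


def find_outliers(values, threshold=1.5):
    """Outlier detection (assumed rule): flag values far from the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    return [value for value in values if abs(value - mean) / spread > threshold]


slope, intercept = fit_line(hours, scores)
outliers = find_outliers(scores)

print(slope, intercept)  # regression output: a trend line
print(outliers)          # outlier detection output: a list of unusual scores
```

The regression summarizes the data as a line relating hours to scores, while the outlier detection returns only the scores that look unusual. Same data set, two very different kinds of output.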
DM is a major part of the KDD process. One easy way to understand the entire process is to visualize how raw data is taken, organized, and then manipulated to help solve a problem. I did not cover each DM technique this week because I want to discuss the specific techniques that apply to problems facing educators in next week’s blog. There, I will provide background information on educational data mining and the DM techniques that can help educators solve many of the problems they face on a daily basis.
Coenen, F. (2011). Data mining: Past, present, and future. The Knowledge Engineering Review, 26, 25-29. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.185.1202&rep=rep1&type=pdf
Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI Magazine, 17, 37-51.