History of Data Mining
When I ask my colleagues, friends, or family members, most of them do not know how data mining works or the history of the science. Therefore, I wanted to use this blog post to familiarize you all with the history of data mining. Next week, I will discuss the basic processes behind how data mining works.
Source: Li, R. (2016). History of data mining. Retrieved from https://www.linkedin.com/pulse/history-data-mining-gregory-piatetsky-shapiro
Data mining began with the advent of statistical analysis: Thomas Bayes’ theorem in the 18th century and the regression analysis of Adrien-Marie Legendre and Carl Gauss in the early 19th century (Li, 2016). Bayes’ theorem uses conditional probability to update the estimated probability of an event as new evidence becomes available, which helps in reasoning about otherwise complex probability problems. Regression analysis estimates the relationship between variables, which is one of the major building blocks of modern-day data mining (Li, 2016). Following these advancements in statistics, Alan Turing developed the idea of the “Universal Machine,” a machine capable of carrying out the kinds of computations that would later underpin modern-day computers (Li, 2016). During the 1960s and 1970s, datasets grew in size and complexity, allowing for terabytes of data to be stored. As a result, the statistical technique of decision trees was used for the first time to help predict outcomes using massive data sets (Li, 2016).
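To make the two statistical building blocks above a little more concrete, here is a minimal Python sketch. It is only an illustration: the function names (bayes_posterior, fit_line) and all of the numbers are made up for this post and do not come from any of the sources cited here.

```python
def bayes_posterior(prior, likelihood, likelihood_if_not):
    """Bayes' theorem: P(H | E) = P(E | H) * P(H) / P(E)."""
    # Total probability of seeing the evidence, whether or not H is true.
    evidence = likelihood * prior + likelihood_if_not * (1 - prior)
    return likelihood * prior / evidence


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical example: 20% of emails are spam, a certain word appears in
# 50% of spam emails and in 5% of non-spam emails.
p_spam_given_word = bayes_posterior(prior=0.2, likelihood=0.5, likelihood_if_not=0.05)

# Hypothetical example: hours studied versus test score.
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 61, 63, 70]
slope, intercept = fit_line(hours, scores)

print(round(p_spam_given_word, 3), slope, intercept)
```

The first function updates a starting belief (the prior) once new evidence is seen, and the second finds the straight line that best describes how one variable changes with another.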
Once primitive computers had been developed in the 1960s and 1970s, John Henry Holland wrote Adaptation in Natural and Artificial Systems in 1975, outlining genetic algorithms. These algorithms laid the groundwork for data mining as we know it, which developed in the late 1980s. Genetic algorithms evolve a population of candidate solutions over time: the best traits within the population are selected and carried forward over successive generations, so that better and better solutions emerge (Li, 2016).
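The idea of a population improving over successive generations can be sketched in a few lines of Python. This is a toy illustration under my own assumptions, not Holland's original formulation: the goal (a list made up entirely of 1s), the function name evolve, and every parameter value are hypothetical choices made for this post.

```python
import random


def evolve(bits=20, pop_size=30, generations=40, mutation_rate=0.02, seed=0):
    """Toy genetic algorithm: evolve a list of 0s and 1s toward all 1s."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    fitness = sum              # fitness = number of 1s in the individual

    # Start from a completely random population.
    population = [[rng.randint(0, 1) for _ in range(bits)] for _ in range(pop_size)]

    for _ in range(generations):
        # Selection: keep the fitter half of the population as parents.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]

        children = []
        while len(parents) + len(children) < pop_size:
            mom, dad = rng.sample(parents, 2)
            # Crossover: the child takes the start of one parent and the end of the other.
            cut = rng.randrange(1, bits)
            child = mom[:cut] + dad[cut:]
            # Mutation: occasionally flip a bit to keep some variety.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)

        population = parents + children

    return max(population, key=fitness)


best = evolve()
print(sum(best), "of", len(best), "bits are 1")
```

Because the fittest individuals are always kept as parents, the best solution in the population can never get worse from one generation to the next.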
In the late 1980s, data mining became an “established discipline within the scope of computer science” (Coenen, 2011, p. 27). Then, by the early 1990s, data mining became recognized as “a sub-process within a larger process called Knowledge Discovery in Databases (KDD)” (Coenen, 2011, p. 27). KDD concerns the discovery of hidden information within a wide range of data and covers a variety of processes that include “data preparation (warehousing, data cleaning, preprocessing data, etc) and the analysis/visualization of results” (Coenen, 2011, p. 27). However, per Coenen, computer scientists and practitioners who use data mining in their industries today treat KDD and data mining as synonymous terms for practical purposes. Technically speaking, though, data mining is a sub-process of KDD.
Source: Public data and data mining competitions: What are the lessons? (2017). Slideshare.net. Retrieved 13 November 2017, from https://www.slideshare.net/gpiatetskyshapiro/bpdm-2013datacompetitionslessons
In 2003, Moneyball by Michael Lewis became one of the first widely read books to bring data-driven statistical analysis into the mainstream. In Moneyball, Lewis describes how the Oakland Athletics, a Major League Baseball team, used a data-driven statistical approach to identify qualities in baseball players that were undervalued and therefore cheaper to obtain. Ultimately, the Oakland Athletics were able to make the playoffs with the team they assembled on one-third the payroll of most Major League Baseball teams (Lewis, 2004). As a result of Moneyball and the advent of more powerful computers in the subsequent years, data mining became more popular and more widely used within business, science, engineering, and healthcare (Li, 2016).
By 2015, data mining had expanded rapidly across many industries. Tech companies such as Google, Facebook, Tesla, and IBM lead the way in using data mining to innovate in industries from healthcare to education (Li, 2016). The explosion of data mining across all industries prompted the White House in 2015 to name a Chief Data Scientist. The Obama administration selected Dr. DJ Patil as the first U.S. Chief Data Scientist to oversee policy on opening government data to the public, securing personal data, and using data to spur innovation and entrepreneurship (Li, 2016).
Data mining has an interesting history spanning more than two centuries! As our world continues to rely more and more on computers, you can bet data mining is being utilized for many applications; you may not even know it’s being used behind the scenes of what you are doing on your mobile phone or computer. Every time you log onto a social media website, mobile app, or search engine, data mining is being used; it’s everywhere! Understanding data mining’s history and where it stands today will allow us to dive in next week and see how it works!
If you would like to learn about the history of data to help you understand the history of data mining, take a look at the video I provided below.
Li, R. (2016). History of data mining. Retrieved from https://www.linkedin.com/pulse/history-data-mining-gregory-piatetsky-shapiro
Coenen, F. (2011). Data mining: Past, present, and future. The Knowledge Engineering Review, 26, 25-29. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.185.1202&rep=rep1&type=pdf
Lewis, M. (2004). Moneyball: The art of winning an unfair game (1st ed.). New York, NY: W.W. Norton & Company.