What is Data Mining? Steps in Data Mining
What is Data Mining?
Basically, data mining is the whole process by which we extract the required information from data using various techniques such as machine learning and statistics.
The process goes through several steps, which are as follows:
1. Establishing Data Mining Goals
The first step in data mining requires you to set up goals for the exercise. We have to identify the key questions that need to be answered, and we also have to think about the costs and benefits of the exercise. Furthermore, we have to determine in advance the expected level of accuracy and usefulness of the results obtained from data mining. If money were no object, you could throw as many funds as necessary at the problem to get the desired answers. In practice, however, the cost-benefit trade-off is always a major factor in determining the key goals of data mining. The expected level of accuracy also affects the cost, and vice versa. Furthermore, beyond a certain level, more accuracy does not change the result much. The cost-benefit trade-offs for the desired level of accuracy are therefore important considerations when setting data mining goals.
2. Selecting Data
The output of the data mining exercise largely depends on the quality of the data being used. At times, data is readily available for processing. For instance, retailers often possess large databases of customer purchases and demographics. On the other hand, data may not be readily available for processing. In that case, we have to make effective use of the data we have and collect more through activities such as surveys, which adds to the cost of the data mining exercise. So the type of data, its size, and the frequency of data collection have a direct bearing on the cost of the exercise. Therefore, identifying the type of data that can answer the questions at a reasonable cost is critical.
3. Preprocessing Data
Preprocessing is an important part of data mining. In this step, we examine the data to find errors and irrelevant attributes. Sometimes information is missing even in the relevant data. So in the preprocessing stage, you remove the irrelevant data and flag erroneous data as such. Some of the errors are human errors. The data should be subject to checks to ensure integrity. Lastly, we have to develop a formal method of dealing with missing data and determine whether the data are missing randomly or systematically.
If the data are missing randomly, a simple set of solutions will suffice. However, if the data are missing in a systematic way, we have to identify how that affects the results.
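One rough way to check whether values are missing randomly or systematically is to compare the share of missing values across groups. The sketch below is only an illustration: the `age_band` and `income` fields, the sample records, and the helper name `missing_rate_by_group` are all made up for this example. A large gap between groups hints that the data may be missing systematically, but it is not a formal test.

```python
from collections import defaultdict


def missing_rate_by_group(records, group_key, value_key):
    """Return the share of records with a missing value, per group."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in records:
        group = row[group_key]
        totals[group] += 1
        if row.get(value_key) is None:
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}


# Hypothetical survey records; None marks a missing income value.
records = [
    {"age_band": "18-34", "income": 42000},
    {"age_band": "18-34", "income": None},
    {"age_band": "18-34", "income": 39000},
    {"age_band": "18-34", "income": 51000},
    {"age_band": "65+", "income": None},
    {"age_band": "65+", "income": None},
    {"age_band": "65+", "income": 28000},
    {"age_band": "65+", "income": None},
]

rates = missing_rate_by_group(records, "age_band", "income")
for group, rate in sorted(rates.items()):
    print(f"{group}: {rate:.0%} missing")
```

In this made-up sample, income is missing far more often in one age band than in the other, which is the kind of pattern that would need a closer look before choosing how to handle the gaps.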
4. Transforming Data
After the relevant attributes of the data are retained, the next step is to determine the appropriate format in which the data must be stored. Here we focus on reducing the number of attributes needed to explain the phenomena. This may require transforming the data. Data reduction algorithms, such as Principal Component Analysis, can reduce the number of attributes without a significant loss of information. As a simpler example of reducing attributes, we can group a family's income from all sources into a single attribute, aggregate family income.
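The aggregate family income example can be sketched in a few lines of Python. The column names (`wages`, `rental_income`, `investment_income`) and the helper `aggregate_family_income` are assumptions made for illustration, not part of any real data set. Note that this shows simple aggregation, not Principal Component Analysis.

```python
# Hypothetical income attributes to be collapsed into one.
INCOME_COLUMNS = ("wages", "rental_income", "investment_income")


def aggregate_family_income(family, income_columns=INCOME_COLUMNS):
    """Replace several income attributes with one aggregate attribute."""
    # Keep every attribute that is not an income source.
    reduced = {k: v for k, v in family.items() if k not in income_columns}
    # Sum the income sources; a source that is absent counts as 0.
    reduced["aggregate_family_income"] = sum(
        family.get(column, 0) for column in income_columns
    )
    return reduced


family = {
    "family_id": 1,
    "wages": 52000,
    "rental_income": 6000,
    "investment_income": 1500,
}
reduced = aggregate_family_income(family)
print(reduced)
```

Three attributes become one, so later steps have fewer variables to store and search.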
Often you need to transform variables from one type to another. It may be prudent to transform the continuous variable for income into a categorical variable where each record in the database is identified as a low-, medium-, or high-income individual. This could help capture the non-linearities in the underlying behaviors.
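A minimal sketch of turning continuous income into a low/medium/high category is shown below. The cutoffs of 30000 and 80000 are arbitrary assumptions chosen for the example; in a real exercise they would come from the data or from domain knowledge.

```python
def income_category(income, low_cutoff=30000, high_cutoff=80000):
    """Map a continuous income value to a low/medium/high label.

    The cutoffs are illustrative assumptions, not recommended values.
    """
    if income < low_cutoff:
        return "low"
    if income < high_cutoff:
        return "medium"
    return "high"


incomes = [18000, 30000, 45000, 80000, 120000]
categories = [income_category(value) for value in incomes]
for value, label in zip(incomes, categories):
    print(value, label)
```

Values exactly on a cutoff fall into the higher category here. That is a design choice and should be stated wherever the categories are used.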
5. Storing Data
The transformed data must be stored in a format that is conducive to data mining. The format should give the data scientists immediate and unrestricted read and write access to the data. New variables should be easy to write, and the search algorithms used in data mining should not have to search unnecessarily across different servers. Privacy, security, and safety are also important points to focus on in the data storage step.
6. Mining Data
The data mining step covers data analysis methods, including parametric and non-parametric methods and machine-learning algorithms. A good starting point for data mining is data visualization. Multidimensional views of the data, produced with the advanced graphing capabilities of data mining software, are very helpful in developing a preliminary understanding of the trends hidden in the data set.
7. Evaluating Mining Results
After results have been extracted from data mining, we do a formal evaluation of them. Formal evaluation could include testing the predictive capabilities of the models on observed data to see how effective and efficient the algorithms have been in reproducing the data. This is known as an in-sample forecast. The results are then shared with the key stakeholders for feedback, which feeds into later iterations of data mining to improve the process.
Data mining and evaluating the results thus become an iterative process in which the analysts use better and improved algorithms to improve the quality of the results in light of the feedback received from the key stakeholders.
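An in-sample forecast can be illustrated with a very small model. The sketch below fits a straight line by ordinary least squares and then measures how well the fitted line reproduces the same observations it was fitted on. The data points and the choice of mean absolute error as the measure are assumptions made for this example. Keep in mind that in-sample error tends to look better than error on data the model has not seen.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_absolute_error(actual, predicted):
    """Average absolute gap between observed and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical observed data used both to fit and to evaluate the model.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
fitted = [slope * x + intercept for x in xs]  # in-sample predictions
in_sample_mae = mean_absolute_error(ys, fitted)

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"in-sample mean absolute error={in_sample_mae:.3f}")
```

In a later iteration, the same error measure could be recomputed after changing the model, which gives the analysts and stakeholders a number to compare between rounds.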
Reference: IBM data science professional course at Coursera