Tutorial 67: MSAS - Introduction to Data Mining
Data mining is the process of probing a set of information for descriptive and predictive purposes. The aim is to identify trends and patterns that indicate where effort should be directed to achieve desired outcomes. SQL Server 2000 Analysis Services has powerful built-in data mining capabilities, including algorithms for clustering and for decision trees.

Before actually studying the data mining capabilities of Analysis Services, let us briefly look at some terminology generally used while discussing data mining.

Understanding Terms used in Data Mining

A case is the term used for the facts being studied. The data used to study these facts is called a case set. Each data mining case has a unique identifier called a key. Descriptive pieces of information are called attributes or measures. A case may draw its information from a single table or from multiple tables. When the data is derived from multiple tables, the case is described as a case with nested tables. The hierarchical attributes of a case that can be conveniently grouped are called the dimensions of the case.
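To make the terminology concrete, here is a minimal Python sketch of a case set with nested tables. The `Case` class, the table names, and the sample rows are all hypothetical and are not part of Analysis Services; they only illustrate how a key, attributes, and a nested table relate to each other.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Case:
    key: int                                   # unique identifier of the case
    attributes: Dict[str, Any]                 # descriptive attributes
    nested: Dict[str, List[Dict[str, Any]]] = field(default_factory=dict)  # nested tables


# Hypothetical source tables: one row per customer, many rows per purchase.
customers = [
    {"customer_id": 1, "gender": "F", "income": "30K-50K"},
    {"customer_id": 2, "gender": "M", "income": "50K-70K"},
]
purchases = [
    {"customer_id": 1, "product": "Milk", "quantity": 2},
    {"customer_id": 1, "product": "Bread", "quantity": 1},
    {"customer_id": 2, "product": "Coffee", "quantity": 3},
]


def build_case_set(customers, purchases):
    """Join the two tables on customer_id and return one Case per customer."""
    by_customer = {}
    for row in purchases:
        detail = {k: v for k, v in row.items() if k != "customer_id"}
        by_customer.setdefault(row["customer_id"], []).append(detail)
    return [
        Case(
            key=c["customer_id"],
            attributes={k: v for k, v in c.items() if k != "customer_id"},
            nested={"purchases": by_customer.get(c["customer_id"], [])},
        )
        for c in customers
    ]


case_set = build_case_set(customers, purchases)
```

Each element of `case_set` is one case: the customer row supplies the key and attributes, and the purchase rows become the nested table.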

Clustering breaks large volumes of data into more manageable groups by identifying similar traits. Each cluster provides a description of the attributes of its members. Clustering is often the first technique used in a project, and its output serves as a source for later mining efforts because it highlights promising areas to investigate. Microsoft SQL Server 2000 Analysis Services uses a Scalable Expectation Maximization algorithm to create clusters based on population density. The advantage of this approach is that it requires only a single pass over the data: the algorithm creates clusters as it goes and adjusts their centers as more data is processed. It provides reasonable results at any point during its computation and works with a minimum amount of memory.
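The single-pass idea can be illustrated with a much simpler technique. The sketch below is not the Scalable Expectation Maximization algorithm used by Analysis Services; it is a sequential k-means variant, chosen here only because it shows the same properties in a few lines: each point is read once, centers are adjusted as data arrives, and only the centers and counts are kept in memory. The function name and sample points are hypothetical.

```python
def online_cluster(points, k):
    """Cluster points in a single pass; the first k points seed the centers."""
    centers = []   # current center of each cluster
    counts = []    # number of points assigned to each cluster so far
    for p in points:
        if len(centers) < k:
            centers.append(list(p))
            counts.append(1)
            continue
        # Find the nearest center by squared Euclidean distance.
        nearest = min(
            range(k),
            key=lambda i: sum((c - x) ** 2 for c, x in zip(centers[i], p)),
        )
        counts[nearest] += 1
        # Move the center toward the new point (running mean update).
        centers[nearest] = [
            c + (x - c) / counts[nearest] for c, x in zip(centers[nearest], p)
        ]
    return centers, counts


centers, counts = online_cluster([(0, 0), (10, 10), (0, 2), (10, 12)], k=2)
print(centers, counts)
```

Because the centers are valid after every point, the loop can be stopped at any time and still return a usable, if rough, result.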

The strengths of clustering are:

1. It is undirected
2. It is not limited to any type of analysis
3. It handles large case sets
4. It uses minimum memory

Its weaknesses are:

1. Measurements need to be carefully chosen
2. Results may be difficult to interpret
3. It has no guaranteed value

Another technique used in data mining is the decision tree, which is used to solve predictive problems. For instance, it may be used to predict whether a customer will purchase a particular brand of a product at a particular store in a given geographical location. The algorithm identifies the most relevant characteristics and defines a set of rules that give the probability, as a percentage, that new cases will follow the identified pattern.

The decision tree is created using a technique called recursive partitioning. The algorithm identifies the most relevant attribute and splits the population of data on the basis of that attribute. Each partition is called a node. The process is repeated for each subgroup until a good stopping point is found. This may be the point at which every node meets the stopping criteria, or the point at which no remaining node meets the criteria for a further split. The last group of cases in a branch of the decision tree is called a leaf node. When every leaf node in a decision tree contains only one value, the model is said to be overfitted. An overfitted model risks providing unrealistic predictions.
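Recursive partitioning can be sketched in a few lines of Python. The text does not say how Analysis Services measures which attribute is "most relevant", so this sketch assumes information gain (entropy reduction) as the measure. The `min_cases` stopping rule, the function names, and the sample cases are likewise hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a list of target values."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def best_attribute(cases, attributes, target):
    """Pick the attribute whose split reduces entropy the most (assumed measure)."""
    base = entropy([c[target] for c in cases])

    def gain(attr):
        remainder = 0.0
        for value in {c[attr] for c in cases}:
            subset = [c[target] for c in cases if c[attr] == value]
            remainder += len(subset) / len(cases) * entropy(subset)
        return base - remainder

    return max(attributes, key=gain)


def build_tree(cases, attributes, target, min_cases=2):
    """Split recursively until a node is pure, too small, or out of attributes."""
    labels = [c[target] for c in cases]
    if len(set(labels)) == 1 or not attributes or len(cases) < min_cases:
        return {"leaf": True, "distribution": dict(Counter(labels))}
    attr = best_attribute(cases, attributes, target)
    remaining = [a for a in attributes if a != attr]
    children = {}
    for value in {c[attr] for c in cases}:
        subset = [c for c in cases if c[attr] == value]
        children[value] = build_tree(subset, remaining, target, min_cases)
    return {"leaf": False, "attribute": attr, "children": children}


cases = [
    {"income": "high", "gender": "F", "buys": "yes"},
    {"income": "high", "gender": "M", "buys": "yes"},
    {"income": "low", "gender": "F", "buys": "no"},
    {"income": "low", "gender": "M", "buys": "no"},
]
tree = build_tree(cases, ["income", "gender"], "buys")
print(tree["attribute"])
```

In this toy data every leaf holds a single value, which is exactly the overfitting condition described above. Raising `min_cases` is one simple way to stop splitting earlier.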

The decision tree model has the following strengths:

1. It provides visual results
2. It is built on understandable rules
3. It is predictive
4. It performs predictions efficiently
5. It shows what is important

On the other hand, the weaknesses of this model are:

1. It can get spread too thin
2. Training is very expensive because the cases in a node are stored multiple times

A data mining analysis uses three distinct types of data sets. The initial set of data that is processed and saved for future use is called the training set. This data is used to 'teach' the model about the population of data being analyzed. The second set of data consists of test cases. This set is used to build confidence in the hypothesis: some known attributes are omitted so that we can confirm the model correctly predicts the missing values. The third type of data set is the evaluation set, which is the focus of the investigation. It is the final set of cases, containing the situational data to be processed. Here the model predicts behavior based on what the mining activities have learned, and those predictions are used to drive business strategy.
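The roles of the three data sets can be sketched as follows. This is a minimal illustration under stated assumptions: the 70/30 split, the helper names, and the sample cases are hypothetical, and the "model" is a trivial majority-value predictor standing in for a real mining model.

```python
import random
from collections import Counter


def split_cases(cases, train_fraction=0.7, seed=0):
    """Shuffle the known cases and split them into training and test sets."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def train_majority(training_set, target):
    """Stand-in model: always predicts the most common target value seen in training."""
    majority = Counter(c[target] for c in training_set).most_common(1)[0][0]
    return lambda case: majority


def accuracy(predict, test_cases, target):
    """Hide the known target value, predict it, and compare with the truth."""
    hits = 0
    for case in test_cases:
        masked = {k: v for k, v in case.items() if k != target}
        if predict(masked) == case[target]:
            hits += 1
    return hits / len(test_cases)


# Hypothetical known cases: customers 3, 6 and 9 have no club card.
known_cases = [
    {"customer_id": i, "has_card": "yes" if i % 3 else "no"} for i in range(1, 11)
]
training_set, test_cases = split_cases(known_cases)
model = train_majority(training_set, "has_card")
print(accuracy(model, test_cases, "has_card"))

# Evaluation set: new cases whose target value is genuinely unknown.
evaluation_set = [{"customer_id": 11}, {"customer_id": 12}]
predictions = [model(case) for case in evaluation_set]
```

The test cases are only useful because their true values are known and deliberately hidden. The evaluation set has no such values, so its predictions cannot be checked in the same way.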

In the sections that follow we will build an OLAP clustering data mining model using the FoodMart 2000 database. We will assume that FoodMart wants to study customer loyalty: how many customers use FoodMart, and how many of them use the FoodMart Club Card. Let us begin by categorizing customer behaviour, using clustering to define the characteristics of people who have the FoodMart Club Card.

A data mining model can use either an OLAP or a relational data store. The former gives the user the advantage of predetermined aggregations and dimensional information from the source cube. The latter requires the user to specify what describes a complete case; if there are multiple tables, the table joins have to be specified.



 