Data mining knowledge representation


    Task relevant data: where and how to retrieve the data to be used for mining

    Background knowledge: Concept hierarchies

    Interestingness measures: informal and formal selection techniques to be applied to the output knowledge

    Representing input data and output knowledge: the structures used to represent the input and the output of the data mining techniques

    Visualization techniques: needed to best view and document the results of the whole process

 

2           Task relevant data

 

    Database or data warehouse name: where to find the data

    Database tables or data warehouse cubes

    Condition for data selection, relevant attributes or dimensions and data grouping criteria: all this is used in the SQL query to retrieve the data
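
The items above map fairly directly onto the clauses of a SQL query. The following is a minimal sketch using Python's built-in sqlite3 module. The in-memory database, the `sales` table, its columns, and the helper `task_relevant_data` are all hypothetical and only illustrate where each item ends up in the query.

```python
import sqlite3

# Hypothetical in-memory database standing in for the real data source.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (region TEXT, product TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("north", "tea", 2020, 10.0),
        ("north", "tea", 2021, 15.0),
        ("north", "coffee", 2021, 20.0),
        ("south", "tea", 2021, 5.0),
        ("south", "coffee", 2020, 8.0),
    ],
)


def task_relevant_data(connection, year):
    """Retrieve the relevant attributes, filtered and grouped."""
    query = """
        SELECT region, product, SUM(amount) AS total  -- relevant attributes
        FROM sales                                     -- database table
        WHERE year = ?                                 -- selection condition
        GROUP BY region, product                       -- grouping criteria
        ORDER BY region, product
    """
    return connection.execute(query, (year,)).fetchall()


rows = task_relevant_data(conn, 2021)
for row in rows:
    print(row)
```

Passing the year as a query parameter rather than formatting it into the string keeps the selection condition separate from the query text.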


 

3           Background knowledge: Concept hierarchies

 

Concept hierarchies are induced by a partial order over the values of a given attribute. Depending on the type of the ordering relation, we distinguish several types of concept hierarchies.

 

             Schema hierarchy

    Relates concepts by generality: the ordering reflects the generality of the attribute values.

 

             Set-grouping hierarchy

    The ordering relation is the subset relation. Applies to set values.

Operation-derived hierarchy

Produced by applying an operation (encoding, decoding, information extraction). 


Rule-based hierarchy

Uses rules to define the partial order. For example, the rule "if antecedent then consequent" defines the order antecedent < consequent.
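
Two of these orderings are easy to sketch in Python. The example below is illustrative only: the age groups, the beverage rules, and the function names `set_leq` and `rule_less_than` are assumptions made for the sketch, and the rule-based order is taken to be the transitive closure of the individual rules.

```python
# Set-grouping hierarchy: values are sets, ordered by the subset relation.
young = frozenset({"teen", "twenties"})
adult = frozenset({"teen", "twenties", "thirties", "forties"})


def set_leq(a, b):
    """a <= b in a set-grouping hierarchy means a is a subset of b."""
    return a <= b


# Rule-based hierarchy: each rule "if antecedent then consequent"
# contributes the pair antecedent < consequent.
rules = [("espresso", "coffee"), ("coffee", "beverage"), ("tea", "beverage")]


def rule_less_than(a, b, rules):
    """Return True if a < b follows from the rules by transitivity."""
    frontier, seen = [a], set()
    while frontier:
        current = frontier.pop()
        for antecedent, consequent in rules:
            if antecedent == current and consequent not in seen:
                if consequent == b:
                    return True
                seen.add(consequent)
                frontier.append(consequent)
    return False


print(set_leq(young, adult))
print(rule_less_than("espresso", "beverage", rules))
print(rule_less_than("tea", "coffee", rules))
```

Values that are not related in either direction, such as "tea" and "coffee" here, are simply incomparable, which is what makes the order partial rather than total.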

 

4           Interestingness measures

 Criteria to evaluate hypotheses (knowledge extracted from data when applying data mining techniques). 

             

5           Representing input data and output knowledge

 

             Concepts (classes, categories, hypotheses): things to be mined/learned

    Classification mining/learning: predicting a discrete class. A kind of supervised learning; success is measured on new data for which class labels are known (test data).

    Association mining/learning: detecting associations between attributes. It can be used to predict any attribute value, and more than one attribute value at a time, so many more rules can be generated; constraints (minimum support and minimum confidence) are therefore needed.

    Clustering: grouping similar instances into clusters. A kind of unsupervised learning; success is measured subjectively or by objective functions.

    Numeric prediction: predicting a numeric quantity. A kind of supervised learning; success is measured on test data.

    Concept description: output of the learning scheme
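
The minimum support and minimum confidence constraints mentioned for association mining can be made concrete with a small sketch. It uses the usual definitions (support as the fraction of transactions containing an itemset, confidence as the support of the whole rule divided by the support of its antecedent). The transactions, thresholds, and function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / base


def passes(transactions, antecedent, consequent, min_support, min_confidence):
    """Keep a rule only if it meets both constraints."""
    rule_items = set(antecedent) | set(consequent)
    return (
        support(transactions, rule_items) >= min_support
        and confidence(transactions, antecedent, consequent) >= min_confidence
    )


# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support(transactions, {"bread", "milk"}))
print(confidence(transactions, {"bread"}, {"milk"}))
print(passes(transactions, {"bread"}, {"milk"}, 0.4, 0.6))
```

Raising either threshold prunes more candidate rules, which is how the constraints keep the number of generated rules manageable.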


             Instances (examples, tuples, transactions)

    Things to be classified, associated, or clustered.

    Individual, independent examples of the concept to be learned (target concept).

    Described by a predetermined set of attributes.

    Input to the learning scheme: set of instances (dataset), represented as a single relation (table).

    Independence assumption: no relationships between attributes.

    Positive and negative examples for a concept. Under the Closed World Assumption (CWA), {negative} = {all} \ {positive}.

    Relational (First Order Logic) descriptions:


6           Visualization techniques

    Identifying problems:

    Histograms for nominal attributes: is the distribution consistent with background knowledge?

    Graphs for numeric values: detecting outliers.

    Visualizations show dependencies

    Consulting domain experts

    If there is too much data, take a sample
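
A few of these checks can be approximated without a plotting library. The sketch below uses only the Python standard library. The text histogram stands in for a real chart, and the interquartile-range (IQR) rule is an assumed numeric substitute for spotting outliers on a graph, not something these notes prescribe. All function names and data are hypothetical.

```python
import random
import statistics
from collections import Counter


def text_histogram(values):
    """One line per nominal value, with one '#' per occurrence."""
    counts = Counter(values)
    return [f"{value:<8} {'#' * count}" for value, count in sorted(counts.items())]


def iqr_outliers(numbers, k=1.5):
    """Values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(numbers, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [x for x in numbers if x < low or x > high]


def take_sample(values, k, seed=0):
    """Random sample without replacement, for data too large to view."""
    rng = random.Random(seed)
    values = list(values)
    return rng.sample(values, min(k, len(values)))


colors = ["red", "blue", "red", "green", "red"]
readings = [10, 11, 12, 11, 10, 12, 11, 95]

for line in text_histogram(colors):
    print(line)
print(iqr_outliers(readings))
print(take_sample(range(1000), 5))
```

Whatever the check flags still needs to be compared against background knowledge or shown to a domain expert, since an unusual value is not necessarily an error.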

