Data mining knowledge representation
• Task relevant data: where and how to retrieve the data to be used for mining
• Background knowledge: concept hierarchies
• Interestingness measures: informal and formal selection techniques to be applied to the output knowledge
• Representing input data and output knowledge: the structures used to represent the input and the output of the data mining techniques
• Visualization techniques: needed to best view and document the results of the whole process
2 Task relevant data
• Database or data warehouse name: where to find the data
• Database tables or data warehouse cubes
• Condition for data selection, relevant attributes or dimensions, and data grouping criteria: all this is used in the SQL query to retrieve the data
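As a minimal sketch of how these three pieces end up in one SQL query, the Python example below uses the standard `sqlite3` module with an in-memory database. The `sales` table, its columns, and the function name `task_relevant_data` are hypothetical, chosen only for illustration.

```python
import sqlite3

# Hypothetical in-memory database standing in for the real data source.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (region TEXT, product TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("East", "tea", 2001, 10.0),
        ("East", "coffee", 2001, 20.0),
        ("West", "tea", 2001, 5.0),
        ("West", "tea", 2000, 7.0),
    ],
)


def task_relevant_data(conn, year):
    """Select relevant attributes, filter by a condition, and group the rows."""
    query = (
        "SELECT region, SUM(amount) AS total "  # relevant attributes
        "FROM sales "                           # database table
        "WHERE year = ? "                       # condition for data selection
        "GROUP BY region "                      # data grouping criterion
        "ORDER BY region"
    )
    return conn.execute(query, (year,)).fetchall()


rows = task_relevant_data(conn, 2001)
print(rows)
```

The condition is passed as a query parameter rather than pasted into the SQL string, which keeps the query reusable for other selection values.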
3 Background knowledge: Concept hierarchies
The concept hierarchies are induced by a partial order over the values of a given attribute. Depending on the type of the ordering relation, we distinguish several types of concept hierarchies.
Schema hierarchy
• Relating concept generality. The ordering reflects the generality of the attribute values.
Set-grouping hierarchy
• The ordering relation is the subset relation. Applies to set values.
Operation-derived hierarchy
• Produced by applying an operation (encoding, decoding, information extraction).
Rule-based hierarchy
• Rules define the partial order. For example, the rule "if antecedent then consequent" defines the order antecedent < consequent.
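Two of the hierarchy types above can be sketched in a few lines of Python. The example is only an illustration: the attribute values, the rules, and the function names (`comparable`, `below`) are all hypothetical. The set-grouping part relies on Python's built-in subset operators for sets; the rule-based part derives the order by chaining rules.

```python
# Set-grouping hierarchy: concepts are sets, the order is the subset relation.
young = frozenset({"teen", "twenties"})
adult = frozenset({"twenties", "thirties", "forties"})
anyone = frozenset({"teen", "twenties", "thirties", "forties"})


def comparable(a, b):
    """True if the two concepts are ordered one way or the other."""
    return a <= b or b <= a


# Rule-based hierarchy: each rule (antecedent, consequent) states
# antecedent < consequent. These rules are made up for illustration.
rules = [("sparrow", "bird"), ("bird", "animal"), ("trout", "fish")]


def below(a, b, rules):
    """True if a < b follows from the rules by chaining (transitivity)."""
    frontier, seen = [a], set()
    while frontier:
        current = frontier.pop()
        for antecedent, consequent in rules:
            if antecedent == current and consequent not in seen:
                if consequent == b:
                    return True
                seen.add(consequent)
                frontier.append(consequent)
    return False


print(young < anyone)                     # proper subset: young is more specific
print(comparable(young, adult))           # neither set contains the other
print(below("sparrow", "animal", rules))  # follows from two chained rules
print(below("trout", "animal", rules))    # no rule chain connects them
```

The order is partial rather than total: `young` and `adult` are both below `anyone`, but they are not comparable with each other.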
4 Interestingness measures
5 Representing input data and output knowledge
Concepts (classes, categories, hypotheses): things to be mined/learned
• Classification mining/learning: predicting a discrete class, a kind of supervised learning; success is measured on new data for which class labels are known (test data).
• Association mining/learning: detecting associations between attributes. It can be used to predict any attribute value, or more than one attribute value at once, so many more rules can be generated; therefore we need constraints (minimum support and minimum confidence).
• Clustering: grouping similar instances into clusters, a kind of unsupervised learning; success is measured subjectively or by objective functions.
• Numeric prediction: predicting a numeric quantity, a kind of supervised learning; success is measured on test data.
• Concept description: output of the learning scheme
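The minimum support and minimum confidence constraints mentioned for association mining can be illustrated with a short Python sketch. It assumes the usual definitions, which these notes do not spell out: the support of an itemset is the fraction of transactions containing it, and the confidence of a rule A → B is support(A ∪ B) / support(A). The transactions and thresholds are made up for illustration.

```python
# Hypothetical market-basket transactions, one set of items per instance.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(A and B together) / support(A); A must occur at least once."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def passes(antecedent, consequent, transactions, min_support, min_confidence):
    """Keep a rule only if it meets both constraints."""
    both = set(antecedent) | set(consequent)
    return (
        support(both, transactions) >= min_support
        and confidence(antecedent, consequent, transactions) >= min_confidence
    )


print(support({"bread", "milk"}, transactions))
print(confidence({"bread"}, {"milk"}, transactions))
print(passes({"bread"}, {"milk"}, transactions, 0.5, 0.6))
```

Raising either threshold prunes more candidate rules, which is how the constraints keep the number of generated rules manageable.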
Instances (examples, tuples, transactions)
• Things to be classified, associated, or clustered.
• Individual, independent examples of the concept to be learned (target concept).
• Described by a predetermined set of attributes.
• Input to the learning scheme: set of instances (dataset), represented as a single relation (table).
• Independence assumption: no relationships between attributes.
• Positive and negative examples for a concept, Closed World Assumption (CWA): {negative} = {all} \ {positive}.
• Relational (First Order Logic) descriptions:
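Returning to the Closed World Assumption in the list above, the formula {negative} = {all} \ {positive} is a plain set difference. The Python sketch below uses hypothetical instance identifiers and a hypothetical function name.

```python
def negatives_under_cwa(all_instances, positive):
    """Under the CWA, every instance not listed as positive is negative."""
    return set(all_instances) - set(positive)


# Hypothetical instance identifiers.
all_instances = {"e1", "e2", "e3", "e4"}
positive = {"e1", "e3"}

negative = negatives_under_cwa(all_instances, positive)
print(sorted(negative))
```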
6 Visualization techniques
• Identifying problems:
– Histograms for nominal attributes: is the distribution consistent with background knowledge?
– Graphs for numeric values: detecting outliers.
• Visualization shows dependencies
• Consulting domain experts
• If there is too much data, take a sample
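A few of the checks above can be approximated without a plotting library. The Python sketch below prints a text histogram for a nominal attribute, flags numeric outliers, and draws a random sample. The data and function names are hypothetical, and the 1.5 × IQR rule used for outliers is an assumption made here for illustration; the notes only say that graphs are used to detect outliers.

```python
import random
import statistics
from collections import Counter

# Hypothetical attribute values.
outlook = ["sunny", "rainy", "sunny", "overcast", "sunny"]
temperature = [10, 12, 11, 13, 12, 11, 95]


def text_histogram(values):
    """One line per nominal value, with a bar as long as its count."""
    counts = Counter(values)
    return [f"{value:10s} {'#' * n}" for value, n in counts.most_common()]


def iqr_outliers(values):
    """Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (an assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


def take_sample(values, k, seed=0):
    """Random sample of k values; the seed makes the sample repeatable."""
    return random.Random(seed).sample(values, k)


for line in text_histogram(outlook):
    print(line)

outliers = iqr_outliers(temperature)
print(outliers)

sample = take_sample(temperature, 3)
print(sample)
```

Whether a flagged value is a real error or a legitimate extreme is still a question for background knowledge or a domain expert.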