Data Mining Task Primitives (DMT)

 

Data Mining Task Primitives

Data Mining Task Primitives

1. Each task is given as a data mining Query i.e, I/P to a data mining system
2. Data mining query is a task primitive means it allows user to communicate with DM system and 
3. The primitives used for defining DM task are:
a) Task - relavant primitive
b) Type of knowledge extracted 
c) Background knowledge
d) Pattern analysis measures or threshold
d)Visualization primitive
a) Task - relavant primitive:
-Specifies the part of the database
  that is required by the users
- It continues all into of database such
  db name, db dable, data grauping 
  standard data selection cretiria
b) Type of knowledge extracted : KDD
- specifies various types of functionalities
  of a mining system such as data charactarizatio, 
  data classificatio, association, clustering,
  other or evolution analysis.
KDD: 1.Data cleaning
2.Data integration
3. Data selection
4. Data Transformation
5. Data Mining
6. Pattern evaluation
7. Knowledge representation
c) Background knowledge:
-Used for guiding discovery process
- helps in discovery process
- helps in directing KDD process while evaluating patterns.
d) Pattern analysis measures or threshold:
- used to evaluate the pattern discovered after 
  parformin discovery process
- involves different measures for pattern evaluation suc as simpllicity 
  utility, certinity ...etc,
d)Visualization primitive:
- specifies different ways in which discovered patterns can be represented
- some ways include,
  @ decision trees
  @ OLAP operations - pie charts, graphs




 

    A data mining task can be specified in the form of a data mining query, which is input to the data mining system. A data mining query is defined in terms of data mining task primitives. These primitives allow the user to interactively communicate with the data mining system during discovery to direct the mining process or examine the findings from different angles or depths. The data mining primitives specify the following,

  1. Set of task-relevant data to be mined.
  2. Kind of knowledge to be mined.
  3. Background knowledge to be used in the discovery process.
  4. Interestingness measures and thresholds for pattern evaluation.
  5. Representation for visualizing the discovered patterns.

A data mining query language can be designed to incorporate these primitives, allowing users to interact with data mining systems flexibly. Having a data mining query language provides a foundation on which user-friendly graphical interfaces can be built.

Designing a comprehensive data mining language is challenging because data mining covers a wide spectrum of tasks, from data characterization to evolution analysis. Each task has different requirements. The design of an effective data mining query language requires a deep understanding of the power, limitation, and underlying mechanisms of the various kinds of data mining tasks. This facilitates a data mining system's communication with other information systems and integrates with the overall information processing environment.

List of Data Mining Task Primitives

A data mining query is defined in terms of the following primitives, such as:

1. The set of task-relevant data to be mined

This specifies the portions of the database or the set of data in which the user is interested. This includes the database attributes or data warehouse dimensions of interest (the relevant attributes or dimensions).

In a relational database, the set of task-relevant data can be collected via a relational query involving operations like selection, projection, join, and aggregation.

The data collection process results in a new data relational called the initial data relation. The initial data relation can be ordered or grouped according to the conditions specified in the query. This data retrieval can be thought of as a subtask of the data mining task.

This initial relation may or may not correspond to physical relation in the database. Since virtual relations are called Views in the field of databases, the set of task-relevant data for data mining is called a minable view.

2. The kind of knowledge to be mined

This specifies the data mining functions to be performed, such as characterization, discrimination, association or correlation analysis, classification, prediction, clustering, outlier analysis, or evolution analysis.

3. The background knowledge to be used in the discovery process

This knowledge about the domain to be mined is useful for guiding the knowledge discovery process and evaluating the patterns found. Concept hierarchies are a popular form of background knowledge, which allows data to be mined at multiple levels of abstraction.

Concept hierarchy defines a sequence of mappings from low-level concepts to higher-level, more general concepts.

  • Rolling Up - Generalization of data: Allow to view data at more meaningful and explicit abstractions and makes it easier to understand. It compresses the data, and it would require fewer input/output operations.
  • Drilling Down - Specialization of data: Concept values replaced by lower-level concepts. Based on different user viewpoints, there may be more than one concept hierarchy for a given attribute or dimension.

An example of a concept hierarchy for the attribute (or dimension) age is shown below. User beliefs regarding relationships in the data are another form of background knowledge.

4. The interestingness measures and thresholds for pattern evaluation

Different kinds of knowledge may have different interesting measures. They may be used to guide the mining process or, after discovery, to evaluate the discovered patterns. For example, interesting measures for association rules include support and confidence. Rules whose support and confidence values are below user-specified thresholds are considered uninteresting.

  • Simplicity: A factor contributing to the interestingness of a pattern is the pattern's overall simplicity for human comprehension. For example, the more complex the structure of a rule is, the more difficult it is to interpret, and hence, the less interesting it is likely to be. Objective measures of pattern simplicity can be viewed as functions of the pattern structure, defined in terms of the pattern size in bits or the number of attributes or operators appearing in the pattern.
  • Certainty (Confidence): Each discovered pattern should have a measure of certainty associated with it that assesses the validity or "trustworthiness" of the pattern. A certainty measure for association rules of the form "A =>B" where A and B are sets of items is confidence. Confidence is a certainty measure. Given a set of task-relevant data tuples, the confidence of "A => B" is defined as
    Confidence (A=>B) = # tuples containing both A and B /# tuples containing A
  • Utility (Support): The potential usefulness of a pattern is a factor defining its interestingness. It can be estimated by a utility function, such as support. The support of an association pattern refers to the percentage of task-relevant data tuples (or transactions) for which the pattern is true.
    Utility (support): usefulness of a pattern
    Support (A=>B) = # tuples containing both A and B / total #of tuples
  • Novelty: Novel patterns are those that contribute new information or increased performance to the given pattern set. For example -> A data exception. Another strategy for detecting novelty is to remove redundant patterns.

5. The expected representation for visualizing the discovered patterns

This refers to the form in which discovered patterns are to be displayed, which may include rules, tables, cross tabs, charts, graphs, decision trees, cubes, or other visual representations.

Users must be able to specify the forms of presentation to be used for displaying the discovered patterns. Some representation forms may be better suited than others for particular kinds of knowledge.



No comments:

Post a Comment