Data Mining Primitives, Hierarchies, Architecture and Coupling

List and describe the five primitives for specifying a DATA MINING task.

task-relevant data: the database and tables (or DATA WAREHOUSE and the data cubes), which contain the data to be mined, are specified by the user. This also includes conditions for selecting/grouping this data, and the attributes/dimensions that are under consideration whilst mining.

kind of knowledge to be mined: specifies the data mining function to be performed, which may include characterization, discrimination, association, classification, clustering or evolution analysis.

background knowledge: the user may specify background knowledge, that is, knowledge about the domain to be mined. This knowledge aids the knowledge discovery process and helps evaluate any patterns found. Background knowledge comes in several varieties. One example is concept hierarchies: these let data be mined at “multiple levels of abstraction.” Another example is user beliefs regarding relationships in the data: these allow discovered patterns to be evaluated by their “degree of unexpectedness” (patterns that contradict a belief are considered interesting) or “expectedness” (patterns that confirm a belief or hypothesis are considered interesting).

interestingness measures: functions that separate uninteresting patterns from knowledge. They may guide the mining process or, after discovery, evaluate the patterns that were found. Different kinds of knowledge call for different interestingness measures. The textbook makes frequent mention of interestingness measures for association rules such as support (the percentage of task-relevant data tuples in which the rule pattern appears) and confidence (an estimate of the strength of the implication of the rule). Rules whose support and confidence values fall below user-defined thresholds are considered uninteresting.

presentation/visualization of discovered patterns: how are discovered patterns to be displayed? Users can choose from a variety of formats including rules, tables, charts, graphs, decision trees, and cubes.
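As a rough illustration, the five primitives can be gathered into one task description. The sketch below is hypothetical: the class name MiningTask, its field names and the threshold keys are assumptions made for this example and do not come from the textbook or from any DATA MINING system.

```python
from dataclasses import dataclass, field


@dataclass
class MiningTask:
    """Hypothetical container for the five primitives of one mining task."""
    relevant_data: dict                                # tables, conditions, attributes
    knowledge_kind: str                                # e.g. "association"
    background: dict = field(default_factory=dict)     # e.g. concept hierarchies
    thresholds: dict = field(default_factory=dict)     # interestingness measures
    presentation: str = "rules"                        # output format

    def is_interesting(self, support: float, confidence: float) -> bool:
        """Keep a rule only if it meets both user-defined thresholds."""
        return (support >= self.thresholds.get("min_support", 0.0)
                and confidence >= self.thresholds.get("min_confidence", 0.0))


# Illustrative values only; the thresholds are invented for this example.
task = MiningTask(
    relevant_data={"table": "student", "attributes": ["major", "status"]},
    knowledge_kind="association",
    thresholds={"min_support": 0.05, "min_confidence": 0.50},
    presentation="table",
)
print(task.is_interesting(support=0.17, confidence=0.80))
print(task.is_interesting(support=0.02, confidence=0.90))
```

Keeping the thresholds inside the task description reflects the idea that interestingness measures can be applied either during mining or afterwards, using the same user-supplied values.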

Describe and give an example of the four major types of concept hierarchies.

1. schema hierarchies: “a total or partial order among attributes in the database schema.” Schema hierarchies can formally express schema relationships between attributes. Schema hierarchies usually specify DATA WAREHOUSE dimensions.

Ex: Blockbuster (the video chain) may have a schema in which the relation titles contains the attributes catalog_no, title, genre and format.

catalog_no < title < genre < format

The lowest conceptual level is catalog_no, which is a completely unique value, followed by title (there may be two identical titles such as Planet of the Apes which is the title for the old film as well as the remake), followed by genre (sci-fi, horror, drama, comedy, etc) and, lastly, format (DVD or VHS).

Schema hierarchies provide metadata information.

2. set-grouping hierarchies: organizes values for a given attribute or dimension into groups of constants or range values. Total or partial order may be defined amongst groups. Set grouping hierarchies may allow refinement or enrichment of schema-defined hierarchies, when the two types of hierarchies are combined. They are often used to define small sets of object relationships.

Ex: A set-grouping hierarchy for the attribute price of video titles for sale at Blockbuster can be specified in terms of ranges:

{bargain, mid_level, expensive} ⊂ all (price)
{1…5} ⊂ bargain
{6…20} ⊂ mid_level
{21…100} ⊂ expensive
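The price ranges above can be turned into a small lookup. This is a minimal sketch that assumes whole-dollar prices and inclusive bounds; the names PRICE_GROUPS, price_group and roll_up are made up for this example.

```python
# Ranges copied from the example above; both bounds are treated as inclusive.
PRICE_GROUPS = {
    "bargain": (1, 5),
    "mid_level": (6, 20),
    "expensive": (21, 100),
}


def price_group(price):
    """Return the group a whole-dollar price belongs to, or None if out of range."""
    for group, (low, high) in PRICE_GROUPS.items():
        if low <= price <= high:
            return group
    return None


def roll_up(price):
    """Return the path from the raw value up to the root of the hierarchy."""
    group = price_group(price)
    if group is None:
        return [price, "all(price)"]
    return [price, group, "all(price)"]


print(roll_up(12))
print(price_group(150))
```

A price outside every range returns None rather than being forced into a group, since the example hierarchy only covers 1 to 100.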

3. operation-derived hierarchies: based on operations specified by users, experts or the DATA MINING system. These operations may include decoding information strings, information extraction from complex data objects, and data clustering.

Ex: Histogram analysis can be used for discretization. Consider a histogram showing the data distribution of the attribute price for a given data set. If we are measuring the price customers are willing to pay for an item, such as jewelry, the histogram can be drawn with the Y-axis measuring count (the number of times items within a certain price range have been sold). By looking at the distribution, one can see the most frequent range (see pg 133 of text; the range in this example is $300 - $325). So $300 - $325 is the top-most level of the hierarchy.
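A minimal sketch of equal-width histogram binning follows. The sale prices are invented for illustration (they are not the data behind the textbook figure), and the $25 bucket width is an assumption chosen to match the width of the range quoted above.

```python
from collections import Counter


def equal_width_histogram(values, width):
    """Count values per half-open bucket [start, start + width)."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))


# Hypothetical sale prices in dollars, invented for this example.
prices = [120, 290, 305, 310, 318, 324, 330, 410, 455]

hist = equal_width_histogram(prices, width=25)
start, count = max(hist.items(), key=lambda item: item[1])
print(f"most frequent range: ${start} - ${start + 25} ({count} sales)")
```

The bucket boundaries produced this way become the lowest grouping level of an operation-derived hierarchy; coarser levels could be obtained by repeating the binning with a larger width.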

4. rule-based hierarchies: occur when either a whole concept hierarchy or a portion of it is defined by a set of rules and is evaluated dynamically based upon the current database data and the rule definition.

Ex: Say an auto insurance consulting firm is trying to establish projected rates auto owners must pay. The evaluating factor is, for the sake of argument, vehicle age (based on model year):

low_rate(X) ⇐ vehicle_age_now(X, A1) ∧ vehicle_age_in5yrs(X, A2) ∧ ((A2 – A1) ≥ 20yrs)

med_rate(X) ⇐ vehicle_age_now(X, A1) ∧ vehicle_age_in5yrs(X, A2) ∧ ((A2 – A1) < 20yrs) ∧ ((A2 – A1) ≥ 6yrs)

high_rate(X) ⇐ vehicle_age_now(X, A1) ∧ vehicle_age_in5yrs(X, A2) ∧ ((A2 – A1) < 6yrs)

Consider the association rule below, which was mined from the student database at Big-University:

major(X, “science”) ⇒ status(X, “undergrad”)

Suppose that the number of students at the university (the number of task-relevant tuples) is 5000, that 56% of undergrads at the university major in science, that 64% of the students are registered in programs leading to an undergrad degree, and that 70% of the students are majoring in science.

(a) Compute the confidence and support of the above Rule.

confidence(A ⇒ B) = (#_tuples_containing_both_A_and_B) / (#_tuples_containing_A)
support(A ⇒ B) = (#_tuples_containing_both_A_and_B) / (total_#_of_tuples)

#_tuples_containing_both_A_and_B = 5000 x 0.56 x 0.64 = 1792
#_tuples_containing_A = 5000 x 0.70 = 3500

confidence = 1792 / 3500 = 51.2%
support = 1792 / 5000 = 35.84%
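The arithmetic for part (a) can be checked with a few lines of Python. This is only a sketch that plugs the percentages from the problem statement into the two formulas; the variable names are chosen for this example.

```python
total = 5000                        # task-relevant tuples (students)
p_undergrad = 0.64                  # share of students in undergrad programs (B)
p_science_given_undergrad = 0.56    # share of undergrads majoring in science
p_science = 0.70                    # share of students majoring in science (A)

# Tuples containing both A (science major) and B (undergrad).
both = total * p_undergrad * p_science_given_undergrad
# Tuples containing A.
science = total * p_science

confidence = both / science
support = both / total

print(f"tuples with A and B: {both:.0f}")
print(f"tuples with A:       {science:.0f}")
print(f"confidence = {confidence:.1%}")
print(f"support    = {support:.2%}")
```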

(b) Consider the rule below:

major(X, “biology”) ⇒ status(X, “undergrad”) [17%, 80%]

Suppose that 30% of science students are majoring in biology. Would you consider the Rule above to be novel w.r.t. the Rule given in (a)? Explain.

#_tuples_containing_both_A_and_B = 5000 x 0.56 x 0.64 x 0.30 = 537.6
#_tuples_containing_A = 5000 x 0.70 x 0.30 = 1050

Expected confidence = 537.6 / 1050 = 51.2%
Expected support = 537.6 / 5000 = 10.75%

If biology majors behaved like science majors in general, the biology rule would have a support of about 30% of 35.84% = 10.75% and a confidence of 51.2%. The observed values are 17% and 80%. The observed support for biology undergrads is higher than the support expected from the science rule, and the observed confidence (80%) is well above the expected confidence (51.2%). The rule therefore cannot be derived from the rule in (a): it is not redundant and can be considered NOVEL.
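The comparison in part (b) can be sketched as follows. The helper names and the tolerance of one percentage point are assumptions made for this example; the textbook does not prescribe a specific cut-off for deciding that two values are close enough to be redundant.

```python
def expected_measures(parent_support, parent_confidence, fraction):
    """Expected (support, confidence) of a sub-rule that behaves like its parent.

    Assumes the sub-group is spread evenly over the parent rule's tuples.
    """
    return parent_support * fraction, parent_confidence


def is_close(observed, expected, tolerance=0.01):
    """Hypothetical redundancy test: values within one percentage point."""
    return abs(observed - expected) <= tolerance


# Rule (a): major = science => status = undergrad.
parent_support = 0.64 * 0.56
parent_confidence = parent_support / 0.70

# Rule (b): 30% of science students major in biology.
exp_sup, exp_conf = expected_measures(parent_support, parent_confidence, 0.30)
obs_sup, obs_conf = 0.17, 0.80

redundant = is_close(obs_sup, exp_sup) and is_close(obs_conf, exp_conf)
novel = not redundant

print(f"expected support = {exp_sup:.2%}, observed = {obs_sup:.0%}")
print(f"expected confidence = {exp_conf:.1%}, observed = {obs_conf:.0%}")
print("novel" if novel else "redundant")
```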

Describe the differences between the following architectures for the integration of DATA MINING systems with a RELATIONAL DATABASE or DATA WAREHOUSE system (state which architecture may be the most popular and why).

No coupling: the DATA MINING system does not use any function of a RELATIONAL DATABASE/DATA WAREHOUSE system. It may get data from some other source (like a file system), process the data using DATA MINING algorithms, and save the results in another file. This is not a favorable architecture. It lacks the flexibility and efficiency in storing, organizing, accessing and processing data that a RELATIONAL DATABASE/DATA WAREHOUSE system provides. Such a system usually provides organized, indexed, cleaned, integrated and consolidated data, which makes finding task-relevant data easy. It also implements tested, scalable algorithms and data structures. Finally, most data is stored in RELATIONAL DATABASE/DATA WAREHOUSE systems, so without coupling a DATA MINING system needs special tools to extract the data, which is a difficult task.

Loose coupling: the DATA MINING system uses SOME facilities of a RELATIONAL DATABASE/DATA WAREHOUSE system, perhaps merely for fetching data. The DATA MINING system may then store mined results in a file or in a “designated” place in a RELATIONAL DATABASE or DATA WAREHOUSE. This approach is better than NO coupling as it provides greater functionality. However, many loosely coupled DATA MINING systems are “main-memory based.” Because the mining itself does not take advantage of the data structures and query optimization methods provided by RELATIONAL DATABASE/DATA WAREHOUSE systems, it is hard to achieve high scalability and high performance with large data sets.

Semi-tight coupling: in addition to linking the DATA MINING system to a RELATIONAL DATABASE/DATA WAREHOUSE system, “efficient implementations of a few essential DATA MINING primitives can be provided” in the RELATIONAL DATABASE/DATA WAREHOUSE system. The primitives may include sorting, indexing, aggregation, histogram analysis, multi-way join, and pre-computation of various statistical measures.

Tight coupling: the DATA MINING system is “smoothly integrated” into the RELATIONAL DATABASE/DATA WAREHOUSE system. The DATA MINING subsystem is treated as “one functional component” of an information system. DATA MINING queries and functions are optimized based on mining query analysis, data structures, indexing schemes, and the query processing methods of the RELATIONAL DATABASE/DATA WAREHOUSE system. This provides a “uniform info processing environment.” This is the most desirable approach due to its high performance.
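The difference between fetching rows and pushing a primitive into the database can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for a RELATIONAL DATABASE system, and the student table and its four rows are invented for this example.

```python
import sqlite3
from collections import Counter

# In-memory database standing in for a relational database system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (major TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?)",
    [("science", "undergrad"), ("science", "grad"),
     ("arts", "undergrad"), ("science", "undergrad")],
)

# Loose coupling: the database only fetches rows; counting happens in main memory.
rows = conn.execute("SELECT major, status FROM student").fetchall()
loose_counts = Counter(rows)

# In the spirit of semi-tight coupling: the aggregation primitive runs inside
# the database, so only the summarized counts reach the mining code.
pushed_counts = {
    (major, status): n
    for major, status, n in conn.execute(
        "SELECT major, status, COUNT(*) FROM student GROUP BY major, status"
    )
}
conn.close()

print(dict(loose_counts) == pushed_counts)
```

Both paths yield the same counts; the difference is where the work happens. With large data sets, the first path must hold every row in main memory, which is the scalability concern raised above for loosely coupled systems.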

Which is most popular? I have no background in DATA MINING or DATA WAREHOUSE/RELATIONAL DATABASE so my answer is speculation based on information I have come across in books or on the Web. Semi-tight coupling may be the most popular approach. Tight coupling is highly desirable, but “its implementation is nontrivial and more research is needed….” It is also likely to be quite expensive. Semi-tight coupling seems to be a good trade-off between loose and tight coupling and may be the most cost-effective solution.
