List and describe the five primitives for specifying a DATA MINING task.
Describe and give an example of the four major types of concept hierarchies.

1. schema hierarchies: “a total or partial order among attributes in the database schema.” Schema hierarchies can formally express relationships between attributes, and they usually specify DATA WAREHOUSE dimensions.
Ex: Blockbuster (the video chain) may have a schema in which the relation titles contains the attributes catalog_no, title, genre and format:
catalog_no < title < genre < format
The lowest conceptual level is catalog_no, which is a completely unique value, followed by title (there may be two identical titles, such as Planet of the Apes, which is the title of both the old film and the remake), followed by genre (sci-fi, horror, drama, comedy, etc.) and, lastly, format (DVD or VHS). Schema hierarchies provide metadata information.

2. set-grouping hierarchies: organize values for a given attribute or dimension into groups of constants or range values. A total or partial order may be defined among the groups. Set-grouping hierarchies may be used to refine or enrich schema-defined hierarchies when the two types of hierarchies are combined. They are often used to define small sets of object relationships.
Ex: A set-grouping hierarchy for the attribute price for video titles for sale at Blockbuster can be specified in terms of ranges:
{bargain, mid_level, expensive} ⊂ all(price)

3. operation-derived hierarchies: based on operations specified by users, experts or the DATA MINING system. These operations may include decoding information strings, information extraction from complex data objects, and data clustering.
Ex: Histogram analysis can be used for discretization. For example, a histogram can show the data distribution of the attribute price for a given data set.
If we are measuring the price customers are willing to pay for an item, such as jewelry, a histogram can be generated with the Y-axis measuring count (the number of times items within a certain price range have been sold). By looking at the distribution, one can literally see the most frequent range (see pg 133 of text; the range in this example is $300 - $325). Price ranges read off the histogram in this way can then serve as groups in a concept hierarchy for price.

4. rule-based hierarchies: occur when either a whole concept hierarchy or a portion of it is defined by a set of rules and is evaluated dynamically based upon the current database data and the rule definition.
Ex: Say an auto insurance consulting firm is trying to establish projected rates auto owners must pay. The evaluating factor is, for the sake of argument, vehicle age (based on model year):
low_rate(X) ⇐ vehicle_age_now(X, A1) ∧ vehicle_age_in5yrs(X, A2) ∧ ((A2 – A1) ≥ 20yrs)
med_rate(X) ⇐ vehicle_age_now(X, A1) ∧ vehicle_age_in5yrs(X, A2) ∧ ((A2 – A1) < 20yrs) ∧ ((A2 – A1) ≥ 6yrs)
high_rate(X) ⇐ vehicle_age_now(X, A1) ∧ vehicle_age_in5yrs(X, A2) ∧ ((A2 – A1) < 6yrs)
Note that, as written, A2 – A1 always equals 5 years, so the thresholds should be read as illustrating the form of the rules rather than as a workable rate table.

Consider the association rule below, which was mined from the student database at Big-University:
major(X, “science”) ⇒ status(X, “undergrad”)
Suppose that the number of students at the university (the number of task-relevant tuples) is 5000, that 56% of undergrads at the university major in science, that 64% of the students are registered in programs leading to an undergrad degree, and that 70% of the students are majoring in science.

(a) Compute the confidence and support of the above rule.
support(A ⇒ B) = (#_tuples_containing_both_A_and_B) / (total_#_of_tuples)
confidence(A ⇒ B) = (#_tuples_containing_both_A_and_B) / (#_tuples_containing_A)
#_tuples_containing_both_A_and_B = 5000 x 0.56 x 0.64 = 1792
#_tuples_containing_A = 5000 x 0.70 = 3500
support = 1792 / 5000 = 35.84%
confidence = 1792 / 3500 = 51.2%

(b) Consider the rule below:
major(X, “biology”) ⇒ status(X, “undergrad”) [17%, 80%]
Suppose that 30% of science students are majoring in biology. Would you consider the rule above to be novel w.r.t. the rule given in (a)? Explain.
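As a cross-check on the arithmetic in (a), and on the expected figures that the answer to (b) relies on, the numbers can be reproduced with a few lines of Python. This is only a sketch: the variable names are made up for illustration, and taking the "expected" support of the biology rule to be 30% of the science rule's support assumes biology majors are split between undergrad and other students in the same proportion as science majors overall.

```python
from fractions import Fraction

# Figures quoted in the exercise (variable names are illustrative only).
total_students = 5000
p_undergrad = Fraction(64, 100)                # share of students who are undergrads
p_science = Fraction(70, 100)                  # share of students majoring in science
p_science_given_undergrad = Fraction(56, 100)  # share of undergrads majoring in science
p_biology_given_science = Fraction(30, 100)    # share of science majors in biology

# (a) support = P(science AND undergrad); confidence = P(undergrad | science)
support = p_science_given_undergrad * p_undergrad
confidence = support / p_science
both_count = support * total_students

# (b) values one would expect if the biology rule were merely a
# redundant specialization of the science rule (assumption stated above)
expected_support = support * p_biology_given_science
expected_confidence = confidence

print(f"tuples containing both: {int(both_count)}")
print(f"support:                {float(support):.2%}")
print(f"confidence:             {float(confidence):.2%}")
print(f"expected support (b):   {float(expected_support):.2%}")
print(f"expected confidence (b): {float(expected_confidence):.2%}")
```

Exact fractions are used instead of floats so that the percentages are not affected by rounding along the way.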
If the biology rule were redundant w.r.t. the science rule, its support and confidence would be close to the values expected from the science rule:
Expected #_tuples_containing_both_A_and_B = 5000 x 0.56 x 0.64 x 0.30 = 537.6
Expected support = 30% of 35.84% (the support of the rule in (a), 1792 / 5000) = 10.75%
Expected confidence = 51.2%
The actual values [17%, 80%] differ substantially from the expected ones, so this shows a NOVEL pattern due to the lack of redundancy. That is, the actual support for registered biology undergrads (17%) is higher than the support expected from the science rule (10.75%). Similarly, the actual confidence for registered biology undergrads (80%) vs. the confidence for registered science undergrads (51.2%) shows lack of redundancy (hence, novelty).

Describe the differences between the following architectures for the integration of DATA MINING systems with a RELATIONAL DATABASE or DATA WAREHOUSE system (state which architecture may be the most popular and why).

No coupling: a DATA MINING system will not use any function of a RELATIONAL DATABASE/DATA WAREHOUSE system. It may get data from various sources (like a file system), process the data using DATA MINING algorithms, and save the results in another file. This is not a favorable architecture. It lacks the flexibility and efficiency in storing, organizing, accessing and processing data that a RELATIONAL DATABASE/DATA WAREHOUSE system provides. The RELATIONAL DATABASE/DATA WAREHOUSE usually provides organized, indexed, cleaned, integrated and consolidated data, which makes finding task-relevant data easy. Also, a RELATIONAL DATABASE/DATA WAREHOUSE system implements well-tested, scalable algorithms and data structures. Finally, most data is stored in RELATIONAL DATABASE/DATA WAREHOUSE systems. Without coupling, a DATA MINING system will need to use special tools to extract data, which is a difficult task.

Loose coupling: the DATA MINING system will use SOME facilities of a RELATIONAL DATABASE/DATA WAREHOUSE system, perhaps merely for fetching data. Then, the DATA MINING system may store mined results in a file or a “designated” place in a RELATIONAL DATABASE or DATA WAREHOUSE. This approach is better than NO coupling as it provides greater functionality.
However, many loosely coupled DATA MINING systems are “main-memory based.” Because the mining itself does not exploit the data structures and query optimization methods provided by RELATIONAL DATABASE/DATA WAREHOUSE systems, it is hard to achieve high scalability and high performance with large data sets.

Semi-tight coupling: in addition to linking the DATA MINING system to a RELATIONAL DATABASE/DATA WAREHOUSE system, “efficient implementations of a few essential DATA MINING primitives can be provided” in the RELATIONAL DATABASE/DATA WAREHOUSE system. The primitives may include: sorting, indexing, aggregation, histogram analysis, multi-way join, and pre-computation of various statistical measures.

Tight coupling: DATA MINING is “smoothly integrated” into the RELATIONAL DATABASE/DATA WAREHOUSE system. The DATA MINING subsystem is treated as “one functional component” of an information system. DATA MINING queries/functions are optimized based on mining query analysis, data structures, indexing schemes, and the query processing methods of the RELATIONAL DATABASE/DATA WAREHOUSE system, which provides a “uniform info processing environment.” This is the most desirable approach due to its high performance.

Which is most popular? I have no background in DATA MINING or RELATIONAL DATABASE/DATA WAREHOUSE systems, so my answer is speculation based on information I have come across in books or on the Web. Semi-tight coupling may be the most popular approach. Tight coupling is highly desirable, but “its implementation is nontrivial and more research is needed….” It is also likely to be quite expensive. Semi-tight coupling seems to be a good trade-off between loose and tight coupling and may be the most cost-effective solution.
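The practical difference between loose and semi-tight coupling can be sketched with Python's built-in sqlite3 module. This is an illustration under assumptions, not how any particular DATA MINING product works: the student table, its rows, and the function names are all hypothetical, and a SQL GROUP BY stands in for the "aggregation" primitive that a semi-tightly coupled system would have the database compute.

```python
import sqlite3
from collections import Counter


def build_demo_db():
    """Create a tiny in-memory database (hypothetical schema and rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE student (id INTEGER PRIMARY KEY, major TEXT, status TEXT)"
    )
    rows = [
        ("science", "undergrad"),
        ("science", "undergrad"),
        ("science", "grad"),
        ("arts", "undergrad"),
        ("arts", "grad"),
    ]
    conn.executemany("INSERT INTO student (major, status) VALUES (?, ?)", rows)
    conn.commit()
    return conn


def counts_loose(conn):
    """Loose coupling: the database only fetches rows; counting is done in memory."""
    rows = conn.execute("SELECT major, status FROM student").fetchall()
    return dict(Counter(rows))


def counts_semi_tight(conn):
    """Semi-tight coupling: the aggregation runs inside the database engine."""
    cursor = conn.execute(
        "SELECT major, status, COUNT(*) FROM student GROUP BY major, status"
    )
    return {(major, status): n for major, status, n in cursor}


conn = build_demo_db()
loose = counts_loose(conn)
semi = counts_semi_tight(conn)
print(loose == semi)  # both approaches produce the same counts
```

Both functions return the same counts. What changes is where the work happens: the loose version pulls every row into main memory first, which is the scalability concern raised above, while the semi-tight version lets the database use its own aggregation and only returns the summarized result.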