Characterization: data characterization is a method used to derive descriptions of data classes or concepts. Basically, characterization is a summary of the general features (characteristics) of a target class (the data of the class being studied). OLAP roll-ups performed on data cubes are useful for this kind of summarization. Ex: a data mining system can be used by city planners to create descriptions summarizing residents who moved from urban to suburban areas over a ten-year period. The result could be a general profile of those residents, such as their age, income, etc.

Discrimination: data discrimination is likewise a method for deriving descriptions of data classes or concepts. Basically, discrimination compares the general features of target-class data objects with the general features of objects from one or more contrasting classes. The target and contrasting classes are user-specified; the corresponding data objects are retrieved through database queries. The methods are similar to characterization. Ex: sales of brand A cars went up 20% in a year while sales of brand B went down 30% -- what are the general features of these cars? Could one (or more) of these features have contributed to the difference in sales?

Association: a type of analysis that involves the discovery of association rules. These rules show attribute-value conditions that frequently occur together in a given set of data. Association analysis is often used for "market basket" or transaction data analysis. Association rules take the form of implications X → Y, read as "database tuples that satisfy the conditions in X are also likely to satisfy the conditions in Y". Ex: for the Cars.com e-business, an analysis of this type might produce:

age(X, "18…30") ∧ college_education(X, "Associates…PhD") → buys(X, "sports car")
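As a rough illustration of how a candidate rule like this could be scored, here is a minimal sketch (not from the text) that computes the support and confidence of one rule over a hypothetical list of transaction records; the attribute names, values, and records are all invented.

```python
# Minimal sketch: scoring one candidate association rule over hypothetical data.
# Records, attribute names, and categories are invented for illustration only.
records = [
    {"age": "18-30", "education": "BS",  "buys": "sports car"},
    {"age": "18-30", "education": "PhD", "buys": "sports car"},
    {"age": "31-45", "education": "BS",  "buys": "minivan"},
    {"age": "18-30", "education": "HS",  "buys": "sports car"},
    {"age": "18-30", "education": "MS",  "buys": "sedan"},
]

college = {"Associates", "BS", "MS", "PhD"}

# Antecedent X: age 18-30 AND college-educated; consequent Y: buys a sports car.
def satisfies_x(r):
    return r["age"] == "18-30" and r["education"] in college

def satisfies_y(r):
    return r["buys"] == "sports car"

n = len(records)
x_count = sum(1 for r in records if satisfies_x(r))
xy_count = sum(1 for r in records if satisfies_x(r) and satisfies_y(r))

support = xy_count / n            # fraction of all tuples matching both X and Y
confidence = xy_count / x_count   # fraction of X-tuples that also match Y
print(f"support={support:.2f}, confidence={confidence:.2f}")
```

A real miner (e.g., Apriori) would search over many candidate rules and keep only those exceeding minimum support and confidence thresholds; this sketch just scores one rule by hand.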
Prediction: goes beyond classification in that it is used to predict unavailable or missing values; specifically, these values are numerical. Prediction also allows for the identification of distribution trends based on available data. Ex (classification): using the Cars.com example again, let's try to classify types of vehicles based on how much commission (the classes) Cars.com makes on each type of sale: high commission, medium commission, low commission. A model can be built from descriptive features such as price, make, country of manufacture, color, etc. The resulting classification should "maximally distinguish" one class from another (i.e., one type of commission from another). The "master classification" that is created may take the form of a decision tree -- this might indicate that domestic vehicles generate the highest commission, followed by features such as price, color, etc. Therefore, the ads at Cars.com might promote more domestically made vehicles. For prediction, actual numerical commission values ($) would be generated and, hence, trends established (a small decision-tree sketch of this example appears below).

Clustering: this technique analyzes data objects without consulting a known class label. Usually, class labels are absent from the training data because they are not known to begin with; hence, clustering can be used to generate class labels. Basically, clusters of objects are formed so that objects within a cluster have high similarity to one another but are dissimilar to objects in other clusters. Each cluster formed can be regarded as a class of objects from which rules can be derived. Ex: weather data from satellite imagery can point to areas of the Earth that exhibit certain climatic conditions, which may be represented graphically. "Precipitation objects" representing rainfall in jungles such as the Amazon and the Congo may indicate a "Tropical" class (see the clustering sketch below).

Evolution analysis: describes and models regularities or trends for objects whose behavior changes over time. This can include characterization, discrimination, association, classification, or clustering of time-related data. However, distinct features of evolution analysis include time-series data analysis, sequence/periodicity pattern matching, and similarity-based analysis. Ex: the latest US Census data uses evolution analysis to describe changes over time, such as populations expanding into suburban areas over a 10-year period.

Discrimination/Classification: Differences: discrimination compares the general features of target-class data objects against one or a set of contrasting classes -- output is in the form of charts, cubes, graphs, etc.; classification finds a set of models (functions) that describe/distinguish data classes or concepts for the purpose of using the model to predict the class of objects whose class label is unknown -- the model may be presented as a decision tree. Similarities: both functionalities specify what kinds of patterns may be found when mining data, and both deal with data classes and their features.

Characterization/Clustering: Differences: clustering analyzes data objects without consulting known class labels -- rather, it is used to generate class labels; characterization summarizes the general characteristics or features of a target class. So the difference is that in characterization the classes are already known, while in clustering they are not known (but are generated later on). Similarities: both functionalities specify kinds of patterns that can be found in mining tasks, and both deal with data classes and their features.
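To make the Cars.com commission example concrete, here is a minimal decision-tree sketch using scikit-learn. The features, commission labels, and training records are all invented for illustration; nothing here comes from actual Cars.com data.

```python
# Minimal sketch: classifying hypothetical car listings into commission classes.
# All data and feature names are invented; a real model needs real sales data.
from sklearn.tree import DecisionTreeClassifier, export_text

# Features per listing: [price_in_thousands, is_domestic (1/0)]
X = [
    [45, 1], [38, 1], [52, 1],   # domestic, higher-priced
    [22, 0], [18, 0], [25, 1],   # lower-priced
    [60, 0], [33, 0], [41, 1],
]
y = [
    "high", "high", "high",
    "low", "low", "medium",
    "medium", "medium", "high",
]

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# The induced tree plays the role of the "master classification" in the notes:
# its top splits show which features (e.g., domestic origin, price) matter most.
print(export_text(clf, feature_names=["price_k", "is_domestic"]))

# Predict the commission class for a new, unlabeled listing.
print(clf.predict([[30, 1]]))
```

For prediction in the numeric sense described above, a regressor (e.g., DecisionTreeRegressor) trained on actual dollar commissions would be used instead of class labels.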
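Similarly, the clustering paragraph can be illustrated with a short k-means sketch; k-means is just one common clustering algorithm, chosen here for brevity, and the "precipitation object" coordinates are fabricated.

```python
# Minimal sketch: grouping hypothetical "precipitation objects" with no class labels.
# Each point is (latitude, annual_rainfall_mm); all values are made up.
from sklearn.cluster import KMeans

points = [
    [-3.0, 2300], [-1.5, 2500], [0.5, 2100],   # equatorial, heavy rainfall
    [40.0, 600], [45.0, 550], [38.0, 700],     # temperate, drier
]

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
print(labels)  # e.g. [0 0 0 1 1 1]; each cluster can then be treated as a class
```

The heavy-rainfall cluster here would correspond to the "Tropical" class in the example; note that no class labels were supplied, only discovered.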
Classification/Prediction: Differences: classification is the process of finding a set of models (functions) that can be used to predict the class label of data objects; prediction, rather, is used to predict data values and is more specific. Further, prediction is also used to identify distribution trends from available data. Similarities: both functionalities perform inference on current data in order to make predictions.

Three challenges to data mining methodology and user-interaction issues:

1. Handling noisy or incomplete data: data stored in a database may contain noise, exceptional cases, or incomplete data objects. When mining data regularities, these unworthy objects may hinder the process, causing the knowledge created to "overfit the data." This may result in poor accuracy of the discovered patterns. Hence, data cleaning may be required beforehand (a small cleaning sketch follows this list).

2. Mining different kinds of knowledge in databases: different kinds of knowledge may be of interest to different users. This requires the use of multiple techniques such as characterization, discrimination, association, etc., and these activities might use the same database in a variety of ways. Supporting all of them requires the development of several mining techniques -- a complicated process.

3. Pattern evaluation / interestingness: of all the patterns a DM system discovers, which are truly interesting -- or relevant? The challenge is in creating techniques that assess the interestingness of discovered patterns, especially with respect to "subjective measures" that estimate the value of patterns for a given class of users, based on those users' beliefs/expectations.
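As a rough idea of what the cleaning in challenge 1 can look like, here is a minimal pandas sketch that drops incomplete records and clips extreme outliers; the column names, values, and 5%/95% cutoffs are all assumptions for illustration, not something the notes prescribe.

```python
# Minimal sketch: basic cleaning before mining -- drop incomplete rows, clip outliers.
# Column names, values, and the percentile cutoffs are invented for illustration.
import pandas as pd

df = pd.DataFrame({
    "price": [22000, 38000, None, 45000, 990000],  # None = incomplete, 990000 = noise
    "color": ["red", "blue", "red", None, "black"],
})

df = df.dropna()  # remove incomplete data objects

# Clip exceptional numeric values into the 5th-95th percentile range so they
# do not dominate (overfit) the patterns the miner later discovers.
lo, hi = df["price"].quantile(0.05), df["price"].quantile(0.95)
df["price"] = df["price"].clip(lower=lo, upper=hi)

print(df)
```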