Data quality can be assessed in terms of accuracy, completeness, and consistency. Propose two other dimensions of data quality.

[a] Level of redundancy: how much of the data is repeated across the various sources being mined? Redundant data can slow down or confuse the knowledge discovery process. Data reduction and cleaning methods, carefully employed, can help remove duplicated data before it is used.

In real-world data, tuples with missing values for some attributes are common occurrences. Describe various methods for handling this problem.

[a] Ignore the tuple(s): usually done when the class label is missing. This is not a very effective method unless the tuple has several attributes with missing values. A minimal sketch of this approach is given below.
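A minimal Python sketch of the "ignore the tuple" approach, assuming a hypothetical record layout; the attribute names ("age", "income", "class") are illustrative, not from the exercise:

```python
# Hypothetical tuples; the attribute names are assumed for illustration only.
records = [
    {"age": 25, "income": 40000, "class": "low_risk"},
    {"age": 33, "income": None,  "class": "high_risk"},  # missing attribute, kept
    {"age": 45, "income": 72000, "class": None},          # missing class label, dropped
]

# Ignore (drop) tuples whose class label is missing; keep everything else.
kept = [r for r in records if r["class"] is not None]
print(kept)   # the third tuple is discarded
```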
Suppose that the data for analysis include the attribute age. The age values for the data tuples are (in increasing order): 13, 15, 16, 16, 19, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.

[a] Use smoothing by bin means to smooth the above data, using a bin depth of 3. Illustrate your steps. Comment on the effect of this technique for the given data. Each bin value is replaced by the mean value of its bin. The bin depth (number of values in each bin), as per the requirements, is 3. Example of the calculation for Bin 1: (13 + 15 + 16) / 3 = 14.67; the other bins use the same method. Bin 1: 14.67, 14.67, 14.67. (A code sketch covering all bins appears at the end of this section.)

[b] How might you determine outliers in the data? Looking at the age values, one might cluster the values into "decade groups", such as the teens cluster (13, 15, 16, 19). Values that lie far from every such cluster (for example, 70) can then be flagged as outliers.

[c] What other methods are there for data smoothing?
[a] Binning by partition (equi-depth bins of 3). Ex.: Bin 1: 13, 15, 16.
[b] Binning: smoothing by bin boundaries. Find the min and max values of each bin, which define its boundaries, and replace each value with the closer boundary. Ex. (equi-depth of 3): Bin 1: 13, 16, 16.
[c] Use regression. Linear regression involves finding the "best" line to fit two variables so that one variable can be used to predict the other. Multiple linear regression is an extension of linear regression in which more than two variables are involved and the data are fit to a multidimensional surface. Regression entails the use of mathematical and statistical techniques. (A regression sketch appears at the end of this section.)

Discuss issues to consider during data integration.

Data integration -- "the merging of data from multiple data sources" -- is a requirement of data mining. This boils down to combining data from many sources into one "coherent data store", such as a data warehouse. Issues:

[a] Schema integration: the task of matching up equivalent real-world entities from many data sources. The textbook calls this the "entity identification problem." Ex.: does product_id in one database and product_number in another refer to the same entity? Often, metadata found in the data warehouse can circumvent problems with schema integration.

[b] Redundancy: an attribute is redundant if it can be derived from another table. Ex.: someone's age can be derived from their birth_date. Inconsistencies in attribute or dimension naming can also create redundancies in the resulting data set. A statistical technique known as correlation analysis can be used to detect redundancies. Ex.: given two attributes A and B, how strongly one attribute implies the other, based on the available data, can be measured with this technique. The formula for correlation analysis can be found on pg 113 of the textbook. (A correlation sketch appears at the end of this section.)

[c] Detection and resolution of data value conflicts: for a given real-world entity, attribute values may differ depending on the source. This can be due to differences in representation, scaling, or encoding. Ex.: for a length attribute, a US data source may store measurements in English units (inches, feet) while a Canadian data source stores them in metric units (centimeters, meters, etc.).
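As referenced above, a minimal Python sketch of smoothing by bin means and by bin boundaries on the age data, using equi-depth bins of 3. This is an illustrative implementation, not taken from the textbook; note that with 26 values the last bin holds only two:

```python
ages = [13, 15, 16, 16, 19, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

depth = 3
# Partition into equi-depth bins of 3 consecutive values.
bins = [ages[i:i + depth] for i in range(0, len(ages), depth)]

# Smoothing by bin means: every value is replaced by its bin's mean.
by_means = [[round(sum(b) / len(b), 2)] * len(b) for b in bins]

# Smoothing by bin boundaries: every value is replaced by the closer of the bin's min/max.
by_boundaries = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(by_means[0])        # [14.67, 14.67, 14.67]
print(by_boundaries[0])   # [13, 16, 16]
```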
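A sketch of smoothing by simple linear regression, fitting a least-squares line y = a + b*x so that one variable predicts the other; the (x, y) pairs here are made up purely for illustration:

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares slope b and intercept a.
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

# The fitted values serve as the smoothed version of the noisy observations.
smoothed = [a + b * x for x in xs]
print(round(a, 3), round(b, 3), [round(v, 2) for v in smoothed])
```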
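A sketch of correlation analysis for redundancy detection between two numeric attributes A and B, using Pearson's correlation coefficient; the sample values are illustrative only:

```python
A = [23, 45, 12, 67, 34, 89, 21]
B = [46, 91, 25, 133, 70, 180, 40]   # roughly 2 * A, so strongly correlated

n = len(A)
mean_a = sum(A) / n
mean_b = sum(B) / n
std_a = (sum((x - mean_a) ** 2 for x in A) / n) ** 0.5
std_b = (sum((x - mean_b) ** 2 for x in B) / n) ** 0.5

# Pearson correlation r_{A,B}: values near +1 or -1 suggest one attribute
# can be derived from the other, i.e. a redundancy candidate.
r = sum((a - mean_a) * (b - mean_b) for a, b in zip(A, B)) / (n * std_a * std_b)
print(round(r, 3))
```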