Engineering Notes 41

Sub Topic: 
Data Preprocessing
Chapter Name: 
Data Warehousing
Description: 
Topics Covered: Why preprocessing, Data cleaning, Data transformation, Data reduction, Discretization and generating concept hierarchies
Content: 
Data Preprocessing

Why preprocessing?

1. Real-world data are generally:
   - Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data.
   - Noisy: containing errors or outliers.
   - Inconsistent: containing discrepancies in codes or names.
2. Tasks in data preprocessing:
   - Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
   - Data integration: combining multiple databases, data cubes, or files.
   - Data transformation: normalization and aggregation.
   - Data reduction: reducing the volume of data while producing the same or similar analytical results.
   - Data discretization: part of data reduction; replaces numerical attributes with nominal ones.

Data cleaning

1. Fill in missing values (attribute or class value):
   - Ignore the tuple: usually done when the class label is missing.
   - Use the attribute mean (or majority nominal value) to fill in the missing value.
   - Use the attribute mean (or majority nominal value) of all samples belonging to the same class.
   - Predict the missing value with a learning algorithm: treat the attribute with the missing value as the dependent (class) variable and run a learning algorithm (usually Bayes or a decision tree) to predict it.
2. Identify outliers and smooth out noisy data (imputation and bin-means smoothing are sketched after this list):
   - Binning: sort the attribute values and partition them into bins (see "Unsupervised discretization" below), then smooth by bin means, bin medians, or bin boundaries.
   - Clustering: group values into clusters, then detect and remove outliers (automatically or manually).
   - Regression: smooth by fitting the data to a regression function.
3. Correct inconsistent data: use domain knowledge or an expert decision.
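A minimal pandas sketch of two of the cleaning steps above: filling missing values with the attribute mean (numeric) or majority value (nominal), and smoothing by bin means. The DataFrame and its columns are hypothetical.

    import pandas as pd

    df = pd.DataFrame({
        "age":  [23, None, 35, 41, None, 29],
        "city": ["Pune", "Delhi", None, "Delhi", "Delhi", "Pune"],
    })

    # Fill missing values: attribute mean for numeric, majority value for nominal.
    df["age"] = df["age"].fillna(df["age"].mean())
    df["city"] = df["city"].fillna(df["city"].mode()[0])

    # Smooth by bin means: partition the values into equal-frequency bins,
    # then replace each value with the mean of its bin.
    bins = pd.qcut(df["age"], q=3)
    df["age_smoothed"] = df["age"].groupby(bins).transform("mean")
    print(df)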
Data transformation

1. Normalization (both rescalings are sketched, together with principal component analysis, after the Data reduction list below):
   - Scaling attribute values to fall within a specified range. Example: to transform V in [min, max] to V' in [0, 1], apply V' = (V - min) / (max - min).
   - Scaling by mean and standard deviation (useful when min and max are unknown or when there are outliers): V' = (V - mean) / stdev.
2. Aggregation: moving up the concept hierarchy on numeric attributes.
3. Generalization: moving up the concept hierarchy on nominal attributes.
4. Attribute construction: replacing or adding attributes inferred from existing attributes.

Data reduction

1. Reducing the number of attributes:
   - Data cube aggregation: applying roll-up, slice, or dice operations.
   - Removing irrelevant attributes: attribute selection (filter and wrapper methods), searching the attribute space (see Lecture 5: Attribute-oriented analysis).
   - Principal component analysis (numeric attributes only): searching for a lower-dimensional space that best represents the data.
2. Reducing the number of attribute values:
   - Binning (histograms): reducing the number of attribute values by grouping them into intervals (bins).
   - Clustering: grouping values into clusters.
   - Aggregation or generalization.
3. Reducing the number of tuples:
   - Sampling.
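The two rescalings and a PCA-style reduction fit in a few lines of NumPy. A minimal sketch: the data matrix is made up for illustration, and PCA is computed here via singular value decomposition of the z-scored data.

    import numpy as np

    X = np.array([[170.0, 65.0, 30.0],
                  [160.0, 58.0, 42.0],
                  [180.0, 80.0, 25.0],
                  [175.0, 72.0, 36.0]])  # hypothetical numeric tuples

    # Min-max normalization to [0, 1]: V' = (V - min) / (max - min)
    X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

    # Z-score normalization: V' = (V - mean) / stdev
    X_z = (X - X.mean(axis=0)) / X.std(axis=0)

    # Principal component analysis: project the (already centered) z-scored
    # data onto the k directions that best represent it.
    _, _, Vt = np.linalg.svd(X_z, full_matrices=False)
    k = 2
    X_reduced = X_z @ Vt[:k].T
    print(X_reduced.shape)  # (4, 2): same tuples, fewer attributes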
Discretization and generating concept hierarchies

1. Unsupervised discretization (the class variable is not used):
   - Equal-interval (equiwidth) binning: split the whole range of values into intervals of equal size.
   - Equal-frequency (equidepth) binning: use intervals containing an equal number of values.
2. Supervised discretization (uses the values of the class variable):
   - Using class boundaries. Three steps:
     - Sort the values.
     - Place breakpoints between values belonging to different classes.
     - If there are too many intervals, merge intervals with equal or similar class distributions.
   - Entropy (information)-based discretization. Example (the sketch after this list reproduces these numbers):
     - Information in a class distribution:
       - Denote a set of five values occurring in tuples belonging to two classes (+ and -) as [+,+,+,-,-]; that is, the first three values belong to "+" tuples and the last two to "-" tuples.
       - Then Info([+,+,+,-,-]) = -(3/5)*log(3/5) - (2/5)*log(2/5), where the logs are base 2 and 3/5 and 2/5 are the relative frequencies (probabilities) of the classes.
       - Ignoring the order of the values, we can write this as [3,2], meaning three values from one class and two from the other. Then, Info([3,2]) = -(3/5)*log(3/5) - (2/5)*log(2/5).
     - Information in a split (2/5 and 3/5 are the weight coefficients):
       - Info([+,+],[+,-,-]) = (2/5)*Info([+,+]) + (3/5)*Info([+,-,-])
       - Or, Info([2,0],[1,2]) = (2/5)*Info([2,0]) + (3/5)*Info([1,2])
     - Method:
       - Sort the values.
       - Calculate the information of all possible splits.
       - Choose the split that minimizes information.
       - Do not place breakpoints between values belonging to the same class (this would increase information).
       - Apply the same procedure to the resulting intervals until a stopping criterion is satisfied.
3. Generating concept hierarchies: recursively apply partitioning or discretization methods.
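A minimal Python sketch of the entropy calculation above, reproducing the worked example; the helper names are made up:

    from math import log2

    def info(counts):
        """Information (entropy) of a class distribution, e.g. [3, 2]."""
        total = sum(counts)
        return -sum(c / total * log2(c / total) for c in counts if c > 0)

    def split_info(*parts):
        """Weighted information of a split, e.g. ([2, 0], [1, 2])."""
        total = sum(sum(p) for p in parts)
        return sum(sum(p) / total * info(p) for p in parts)

    print(round(info([3, 2]), 3))                # 0.971 = Info([3,2])
    print(round(split_info([2, 0], [1, 2]), 3))  # 0.551 = Info([2,0],[1,2])

The split [2,0],[1,2] scores lower than the unsplit distribution [3,2] (0.551 vs. 0.971), which is why the method keeps the breakpoint that minimizes information.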