Engineering Notes 41
Sub Topic:
Data Preprocessing
Chapter Name:
Data Warehousing
Description:
Topics Covered: Why preprocessing, Data cleaning, Data transformation, Data reduction, Discretization and generating concept hierarchies
Content:
<h1>
Data preprocessing</h1>
<h2>
Why preprocessing?</h2>
<ol>
<li>
Real-world data are generally
<ul style="list-style-type:circle;">
<li>
Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data</li>
<li>
Noisy: containing errors or outliers</li>
<li>
Inconsistent: containing discrepancies in codes or names</li>
</ul>
</li>
<li>
Tasks in data preprocessing
<ul style="list-style-type:circle;">
<li>
Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.</li>
<li>
Data integration: combining data from multiple databases, data cubes, or files.</li>
<li>
Data transformation: normalization and aggregation.</li>
<li>
Data reduction: reducing the volume but producing the same or similar analytical results.</li>
<li>
Data discretization: part of data reduction, replacing numerical attributes with nominal ones.</li>
</ul>
</li>
</ol>
<h2>
Data cleaning</h2>
<ol>
<li>
Fill in missing values (attribute or class value):
<ul style="list-style-type:circle;">
<li>
Ignore the tuple: usually done when the class label is missing.</li>
<li>
Use the attribute mean (or the majority nominal value) to fill in the missing value (see the sketch at the end of this section).</li>
<li>
Use the attribute mean (or majority nominal value) for all samples belonging to the same class.</li>
<li>
Predict the missing value by using a learning algorithm: treat the attribute with the missing value as a dependent (class) variable and run a learning algorithm (usually Bayes or a decision tree) to predict it.</li>
</ul>
</li>
<li>
Identify outliers and smooth out noisy data:
<ul style="list-style-type:circle;">
<li>
Binning
<ul>
<li>
Sort the attribute values and partition them into bins (see "Unsupervised discretization" below);</li>
<li>
Then smooth by bin means, bin median, or bin boundaries.</li>
</ul>
</li>
<li>
Clustering: group values into clusters, then detect and remove outliers (automatically or manually).</li>
<li>
Regression: smooth by fitting the data to regression functions.</li>
</ul>
</li>
<li>
Correct inconsistent data: use domain knowledge or expert decision.</li>
</ol>
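<p>
A minimal sketch (plain Python, with made-up attribute values) of two of the cleaning steps above: filling in missing values with the attribute mean, and smoothing noisy values by bin means.</p>
<pre>
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def smooth_by_bin_means(values, n_bins):
    """Sort the values, split them into equal-frequency bins, and
    replace every value by the mean of its bin."""
    ordered = sorted(values)
    size = len(ordered) // n_bins
    smoothed = []
    for i in range(n_bins):
        chunk = ordered[i * size:] if i == n_bins - 1 else ordered[i * size:(i + 1) * size]
        mean = sum(chunk) / len(chunk)
        smoothed.extend([mean] * len(chunk))
    return smoothed

ages = [23, None, 31, 45, None, 52]
print(fill_missing_with_mean(ages))   # None entries replaced by the mean 37.75
print(smooth_by_bin_means([4, 8, 9, 15, 21, 21, 24, 25, 26], 3))
# [7.0, 7.0, 7.0, 19.0, 19.0, 19.0, 25.0, 25.0, 25.0]
</pre>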
<h2>
Data transformation</h2>
<ol>
<li>
Normalization:
<ul style="list-style-type:circle;">
<li>
Scaling attribute values to fall within a specified range.
<ul>
<li>
Example: to transform <tt>V in [Min, Max]</tt> to <tt>V' in [0, 1]</tt>, apply <tt>V' = (V - Min)/(Max - Min)</tt></li>
</ul>
</li>
<li>
Scaling by using the mean and standard deviation (useful when min and max are unknown or when there are outliers): <tt>V' = (V - Mean)/StDev</tt> (see the sketch at the end of this section)</li>
</ul>
</li>
<li>
Aggregation: moving up in the concept hierarchy on numeric attributes.</li>
<li>
Generalization: moving up in the concept hierarchy on nominal attributes.</li>
<li>
Attribute construction: replacing or adding new attributes inferred from existing attributes.</li>
</ol>
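<p>
A minimal sketch (plain Python, made-up salary values) of the two normalizations above: min-max scaling into [0, 1] and scaling by the mean and standard deviation.</p>
<pre>
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale values from [min, max] into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

def z_score_normalize(values):
    """Scale values using the mean and standard deviation: V' = (V - Mean) / StDev."""
    n = len(values)
    mean = sum(values) / n
    stdev = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / stdev for v in values]

salaries = [30000, 45000, 60000, 90000]
print(min_max_normalize(salaries))   # [0.0, 0.25, 0.5, 1.0]
print(z_score_normalize(salaries))   # roughly [-1.18, -0.51, 0.17, 1.52]
</pre>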
<h2>
Data reduction</h2>
<ol>
<li>
Reducing the number of attributes
<ul style="list-style-type:circle;">
<li>
Data cube aggregation: applying roll-up, slice or dice operations.</li>
<li>
Removing irrelevant attributes: attribute selection (filter and wrapper methods), searching the attribute space (see Lecture 5: Attribute-oriented analysis).</li>
<li>
Principal component analysis (numeric attributes only): searching for a lower-dimensional space that can best represent the data (see the sketch at the end of this section).</li>
</ul>
</li>
<li>
Reducing the number of attribute values
<ul style="list-style-type:circle;">
<li>
Binning (histograms): reducing the number of attribute values by grouping them into intervals (bins).</li>
<li>
Clustering: grouping values in clusters.</li>
<li>
Aggregation or generalization.</li>
</ul>
</li>
<li>
Reducing the number of tuples
<ul style="list-style-type:circle;">
<li>
Sampling</li>
</ul>
</li>
</ol>
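<p>
A minimal sketch (NumPy, random made-up data) of two of the reduction ideas above: principal component analysis to reduce the number of numeric attributes, and simple random sampling to reduce the number of tuples.</p>
<pre>
import numpy as np

def pca_reduce(X, k):
    """Project the rows of X onto the k principal components with the largest variance."""
    Xc = X - X.mean(axis=0)                    # center each attribute
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                       # coordinates in the reduced space

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                  # 100 tuples, 5 numeric attributes
X_reduced = pca_reduce(X, 2)                   # keep only 2 derived attributes
sample = X[rng.choice(len(X), size=10, replace=False)]   # random sample of 10 tuples
print(X_reduced.shape, sample.shape)           # (100, 2) (10, 5)
</pre>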
<h2>
Discretization and generating concept hierarchies</h2>
<ol>
<li>
Unsupervised discretization - class variable is not used.
<ul style="list-style-type:circle;">
<li>
Equal-interval (equiwidth) binning: split the whole range of values into intervals of equal size.</li>
<li>
Equal-frequency (equidepth) binning: use intervals containing an equal number of values.</li>
</ul>
</li>
<li>
Supervised discretization - uses the values of the class variable.
<ul style="list-style-type:circle;">
<li>
Using class boundaries. Three steps:
<ul>
<li>
Sort values.</li>
<li>
Place breakpoints between values belonging to different classes.</li>
<li>
If there are too many intervals, merge adjacent intervals with equal or similar class distributions.</li>
</ul>
</li>
<li>
Entropy (information)-based discretization (see the sketch at the end of this section). Example:
<ul>
<li>
Information in a class distribution:
<ul>
<li>
Denote a set of five values occurring in tuples belonging to two classes (+ and -) as <tt>[+,+,+,-,-]</tt></li>
<li>
That is, the first 3 values belong to "+" tuples and the last 2 to "-" tuples.</li>
<li>
Then, <tt>Info([+,+,+,-,-]) = -(3/5)*log(3/5)-(2/5)*log(2/5)</tt> (logs are base 2)</li>
<li>
3/5 and 2/5 are relative frequencies (probabilities)</li>
<li>
Ignoring the order of the values, we can use the following notation: <tt>[3,2]</tt>, meaning 3 values from one class and 2 from the other.</li>
<li>
Then, <tt>Info([3,2]) = -(3/5)*log(3/5)-(2/5)*log(2/5)</tt></li>
</ul>
</li>
<li>
Information in a split (2/5 and 3/5 are weight coefficients):
<ul>
<li>
<tt>Info([+,+],[+,-,-]) = (2/5)*Info([+,+]) + (3/5)*Info([+,-,-])</tt></li>
<li>
Or, <tt>Info([2,0],[1,2]) = (2/5)*Info([2,0]) + (3/5)*Info([1,2])</tt></li>
</ul>
</li>
<li>
Method:
<ul>
<li>
Sort the values;</li>
<li>
Calculate information in all possible splits;</li>
<li>
Choose the split that minimizes information;</li>
<li>
Do not consider breakpoints between values belonging to the same class (such splits never have lower information than splits at class boundaries);</li>
<li>
Apply the same to the resulting intervals until some stopping criterion is satisfied.</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
<li>
Generating concept hierarchies: recursively applying partitioning or discretization methods.</li>
</ol>
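<p>
A minimal sketch (plain Python) of the discretization ideas above: equal-width and equal-frequency breakpoints, and the information measure from the worked example, so that <tt>Info([3,2])</tt> and <tt>Info([2,0],[1,2])</tt> can be checked numerically.</p>
<pre>
from math import log2

def equal_width_breakpoints(values, n_bins):
    """Split the range [min, max] into n_bins intervals of equal size."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]

def equal_frequency_breakpoints(values, n_bins):
    """Choose breakpoints so that each bin holds roughly the same number of values."""
    ordered = sorted(values)
    size = len(ordered) // n_bins
    return [ordered[i * size] for i in range(1, n_bins)]

def info(counts):
    """Information (entropy) of a class distribution, e.g. info([3, 2]) for [+,+,+,-,-]."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def split_info(left, right):
    """Weighted information of a two-way split, e.g. info([2,0],[1,2])."""
    n = sum(left) + sum(right)
    return (sum(left) / n) * info(left) + (sum(right) / n) * info(right)

print(round(info([3, 2]), 3))                # 0.971 = Info([+,+,+,-,-])
print(round(split_info([2, 0], [1, 2]), 3))  # 0.551 = Info([+,+],[+,-,-])
</pre>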