Introduction to the Knowledge Discovery Process
* Technological progress
* Too much data
* Traditional data analysis has to evolve; automation is needed
KDD (process)
= Knowledge Discovery in Databases is a (semi-)interactive and iterative, nontrivial process of identifying valid, novel, potentially useful and understandable patterns in data. Data is turned into information (and eventually knowledge) which can be used in the decision process.
* Step 1: Analysis of the problem
* Step 2: Data collection
* Step 3: Cleaning and preprocessing
* Step 4: Transformation and reduction
* Step 5: Selection of the DM application, task and technique
E.g. decide whether a customer is creditworthy or not.
* Step 6: Evaluation of the output
* Step 7: Consolidation and deployment

* New insights and a competitive advantage
* But: close interaction between the data mining expert and the domain expert is needed
= The thing that is to be learned. An operational object of the real world: a thing, a transaction, a relation, a group. (Data mining view: a discrete class, an association between features/concepts, a grouping, a score)
= A special type of example:
* Thing to be classified, associated or clustered
* Individual and independent example of target concept
* Characterized by predetermined attributes
* Input to learning scheme
* Restricted form of input
= Fixed predefined set of features which describe each instance
Attribute types: Nominal, Ordinal, Interval and Ratio.
Nominal quantities (categorical, enumerated, discrete)
* Distinct symbols
* Values that serve as label or name
* Special case: dichotomy (Boolean)
* Only equality tests
* E.g. eye color: brown, blue …
Ordinal quantities (often loosely called numeric or continuous)
* Ordering
* E.g. temperature = cool, mild, hot
* Transformation: Ordinal attribute with n values to n-1 boolean attributes
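The transformation of an ordinal attribute with n values into n-1 boolean attributes can be sketched as follows. This is a minimal illustration, assuming that the k-th boolean answers "is the value higher than the k-th level?"; the function name `ordinal_to_boolean` and the list `TEMPERATURE` are hypothetical names chosen for this example.

```python
def ordinal_to_boolean(value, ordered_values):
    """Encode an ordinal value as n-1 booleans.

    The k-th flag is True when the value ranks above ordered_values[k],
    so the ordering of the original attribute is preserved.
    """
    rank = ordered_values.index(value)
    return [rank > k for k in range(len(ordered_values) - 1)]


# Assumed ordering of the temperature example: cool < mild < hot
TEMPERATURE = ["cool", "mild", "hot"]

for level in TEMPERATURE:
    print(level, ordinal_to_boolean(level, TEMPERATURE))
```

With three levels, two flags are enough: "cool" sets neither, "mild" sets only the first, and "hot" sets both.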
Interval quantities
* Ordering and Distance measure (measured in fixed and equal units: degrees Celsius, years)
* Difference of two values makes sense
Ratio quantities
* Ordering, Distance measure and zero point
* Real numbers (all mathematical operations allowed)
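The difference between interval and ratio quantities can be checked with a small calculation. The sketch below uses degrees Celsius (interval, arbitrary zero) and Kelvin (ratio, true zero point); the Kelvin scale is not part of these notes and is brought in only as an illustration.

```python
# Two temperatures measured on an interval scale (degrees Celsius)
celsius_a, celsius_b = 10.0, 20.0

# The same temperatures on a ratio scale (Kelvin has a true zero point)
kelvin_a, kelvin_b = celsius_a + 273.15, celsius_b + 273.15

# Differences are meaningful on both scales and agree with each other
diff_celsius = celsius_b - celsius_a
diff_kelvin = kelvin_b - kelvin_a

# Ratios only make sense when a zero point exists
ratio_celsius = celsius_b / celsius_a  # 2.0, but "twice as hot" is meaningless
ratio_kelvin = kelvin_b / kelvin_a     # about 1.035, the physically meaningful ratio

print(diff_celsius, round(diff_kelvin, 2))
print(ratio_celsius, round(ratio_kelvin, 3))
```

The difference of 10 degrees survives the change of scale, while the ratio of 2 does not, which is why ratios are reserved for ratio quantities.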
File flattening (Denormalization)
= Joining several relations together to make one.
Problem: Spurious regularities that reflect structure of the database (supplier predicts supplier address)
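A minimal sketch of file flattening, and of how it produces the spurious regularity mentioned above. The tables `suppliers` and `orders` and all their values are made up for illustration, as are the helper names `flatten` and `is_determined_by`.

```python
# Hypothetical normalized relations
suppliers = {
    "S1": {"supplier_address": "Ghent"},
    "S2": {"supplier_address": "Leuven"},
}
orders = [
    {"order_id": 1, "supplier": "S1", "product": "bolt"},
    {"order_id": 2, "supplier": "S2", "product": "nut"},
    {"order_id": 3, "supplier": "S1", "product": "nut"},
]


def flatten(orders, suppliers):
    """Join each order with its supplier record into one flat row."""
    return [{**order, **suppliers[order["supplier"]]} for order in orders]


def is_determined_by(rows, source, target):
    """Return True if every value of `source` maps to exactly one `target` value."""
    seen = {}
    for row in rows:
        if seen.setdefault(row[source], row[target]) != row[target]:
            return False
    return True


flat = flatten(orders, suppliers)
print(is_determined_by(flat, "supplier", "supplier_address"))  # structural, not knowledge
print(is_determined_by(flat, "supplier", "product"))
```

A learning scheme run on `flat` would find that supplier perfectly predicts supplier address, but that rule only reflects how the database was joined.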
Real world data problems:
Each data analysis technique requires its own specific input format and works best with as little noise as possible
Bad input = bad output
Too much data
* Corrupt data (noise, anomaly)
* Too many features
* Continuous measuring
* Very big datasets
Too little data
* Missing features
* Missing values
* Very small data sets
Incoherent data
* Multiple data sources
* Granularity
Parameter setting
Machine learning techniques
= Algorithms for acquiring structural descriptions from examples
Data (represents the population) -> transformed by DM methods with specific parameters -> model (a representation of the theory)
* Language bias
* Search bias
Greedy search vs Beam search
General-to-specific vs Specific-to-general (pruning of a decision tree)
* Overfitting-avoidance bias
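The greedy versus beam search contrast listed under search bias can be sketched on a toy problem. Greedy search is treated here as beam search with a beam width of 1. The scoring table `SCORES` is entirely hypothetical; in a real learner the states would be candidate descriptions (for example partial rules) and the score an evaluation measure.

```python
# Hypothetical scores for partial solutions built from the choices "a" and "b"
SCORES = {
    ("a",): 3, ("b",): 2,
    ("a", "a"): 4, ("a", "b"): 5,
    ("b", "a"): 9, ("b", "b"): 1,
}


def beam_search(scores, choices, depth, width):
    """Keep the `width` best partial solutions at every step."""
    beam = [()]
    for _ in range(depth):
        # Extend every state in the beam by every possible choice
        candidates = [state + (c,) for state in beam for c in choices]
        candidates.sort(key=scores.get, reverse=True)
        beam = candidates[:width]
    return beam[0]


greedy_result = beam_search(SCORES, "ab", depth=2, width=1)
beam_result = beam_search(SCORES, "ab", depth=2, width=2)
print(greedy_result, SCORES[greedy_result])
print(beam_result, SCORES[beam_result])
```

Greedy search commits to the locally best first step and ends with a score of 5, while a beam of width 2 keeps the weaker first step alive and reaches the score of 9.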

= Given a set of pre-classified examples, build a model (classifier) to classify new cases (supervised learning)
= 1 attribute (1-level decision tree): a set of rules that all test the same single attribute.
* Missing value treated as separate category
* Minimize the error rate
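The one-attribute rule learner described above matches the scheme usually known as 1R (OneR); that name does not appear in these notes and is an assumption here. A minimal sketch, using a made-up dataset and the hypothetical function name `one_r`: for each attribute, every value predicts its most frequent class, a missing value (`None`) is treated as its own category, and the attribute with the fewest training errors is kept.

```python
from collections import Counter, defaultdict


def one_r(instances, attributes, target):
    """Return (attribute, rules, errors) for the single best attribute."""
    best = None
    for attr in attributes:
        # Count class frequencies per attribute value
        counts = defaultdict(Counter)
        for row in instances:
            value = row.get(attr)
            key = "missing" if value is None else value  # separate category
            counts[key][row[target]] += 1
        # Each value predicts its majority class
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(sum(c.values()) - c.most_common(1)[0][1]
                     for c in counts.values())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best


# Made-up training data, for illustration only
data = [
    {"outlook": "sunny", "windy": False, "play": "no"},
    {"outlook": "sunny", "windy": True, "play": "no"},
    {"outlook": "overcast", "windy": False, "play": "yes"},
    {"outlook": "rainy", "windy": False, "play": "yes"},
    {"outlook": "rainy", "windy": True, "play": "no"},
    {"outlook": "rainy", "windy": False, "play": "yes"},
    {"outlook": None, "windy": True, "play": "yes"},
]

attribute, rules, errors = one_r(data, ["outlook", "windy"], "play")
print(attribute, rules, errors)
```

On this data the rules on `outlook` misclassify one training instance and the rules on `windy` misclassify two, so `outlook` is selected.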
