SlideShare a Scribd company logo
1 of 29
Data Extraction, Cleanup
& Transformation
Tools/Data Processing
Tools
By
M.Dhilsath Fathima
Define-Data Preprocessing
• Data preprocessing is a data mining technique
that involves transforming raw data into an
understandable format.
• Data pre-processing is an important step in
the data mining process.
• The product of data pre-processing is the
final training set.
Why Data Preprocessing?
• Data in the real world is dirty.
noisy: containing errors or outliers.
Incomplete: Missing Values, Lacking attribute
values.
Inconsistent Data
• No quality data, no quality mining results!
– Quality decisions must be based on quality data
– Data warehouse needs consistent integration of quality
data
Major Tasks in Data
Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data transformation
– Normalization and aggregation
• Data reduction
– Obtains reduced representation in volume but produces the same or
similar analytical results
Forms of data
preprocessing
Data Cleaning
• Data cleaning tasks
– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data
Missing Data
• Data is not always available
– E.g., many tuples have no recorded value for several attributes,
such as customer income in sales data
• Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the time of entry
– not register history or changes of the data
How to Handle Missing Data?
• Ignore the tuple: usually done when class label is missing
 (Can be applicable for large data set)
• Fill in the missing value manually: tedious + infeasible for large
database?
• Use a global constant to fill in the missing value
• Use the attribute mean to fill in the missing value
• Use the most probable value to fill in the missing value:
inference-based such as Bayesian formula or decision tree
Noisy Data/Outlier
• Noise: random error or variance in a measured
variable
• Incorrect attribute values may due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– inconsistency in naming convention
– duplicate records
– incomplete data
– inconsistent data
OUTLIER
• A Data object or observations that do not
comply with the general behavior or model of
the data. Such data objects, which are grossly
different from or inconsistent with the
remaining set of data, are called outliers.
• A data object that deviates significantly from
the normal objects as if it were generated by a
different mechanism.
• Ex:
How to Handle Noisy Data?
(Not Now)
• Binning method
• Clustering
• Combined computer and human
inspection
• Regression
Data integration and transformation
Data Integration
• Data integration:
– combines data from multiple sources into a coherent store
Three Problems involved in data integration
Schema integration
Detecting and resolving data value conflicts.
Redundant data occur often when integration of multiple
databases
Data Transformation
• Smoothing: remove noise from data
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scaled to fall within a small, specified range
– min-max normalization
– z-score normalization
– normalization by decimal scaling
DATA REDUCTION
Data Reduction Strategies
• Warehouse may store terabytes of data: Complex data
analysis/mining may take a very long time to run on the
complete data set
• Data reduction
– Obtains a reduced representation of the data set that is much
smaller in volume but yet produces the same (or almost the
same) analytical results
• Data reduction strategies
– Data cube aggregation
– Numerosity reduction
– concept hierarchy generation
Data Cube Aggregation
• The lowest level of a data cube
– the aggregated data for an individual entity of interest
– e.g., a customer in a phone calling data warehouse.
• Multiple levels of aggregation in data cubes
– Further reduce the size of data to deal with
• Reference appropriate levels
– Use the smallest representation which is enough to solve the
task
Dimensionality Reduction
• Feature selection (i.e., attribute subset selection):
– Select a minimum set of features such that the probability
distribution of different classes given the values for those
features is as close as possible to the original distribution
given the values of all features
– reduce # of patterns in the patterns, easier to understand
Example of Decision Tree Induction
Initial attribute set:
{A1, A2, A3, A4, A5, A6}
A4 ?
A1? A6?
Class 1 Class 2 Class 1 Class 2
> Reduced attribute set: {A1, A4, A6}
Regression and Log-Linear Models
• Linear regression: Data are modeled to fit a straight line
– Often uses the least-square method to fit the line
• Multiple regression: allows a response variable Y to be
modeled as a linear function of multidimensional feature
vector
• Log-linear model: approximates discrete multidimensional
probability distributions
• Linear regression: Y = α + β X
– Two parameters , α and β specify the line and are to be
estimated by using the data at hand.
– using the least squares criterion to the known values of Y1,
Y2, …, X1, X2, ….
• Multiple regression: Y = b0 + b1 X1 + b2 X2.
– Many nonlinear functions can be transformed into the
above.
• Log-linear models:
– The multi-way table of joint probabilities is approximated
by a product of lower-order tables.
– Probability: p(a, b, c, d) = αab βacχad δbcd
Regress Analysis and Log-
Linear Models
Histograms
• A popular data reduction
technique
• Divide data into buckets
and store average (sum) for
each bucket
• Can be constructed
optimally in one dimension
using dynamic
programming
• Related to quantization
problems.
0
5
10
15
20
25
30
35
40
10000 30000 50000 70000 90000
Clustering
• Partition data set into clusters, and one can store cluster
representation only
• Can be very effective if data is clustered but not if data is
“smeared”
• Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
• There are many choices of clustering definitions and
clustering algorithms, further detailed in Chapter 8
Sampling
• Allow a mining algorithm to run in complexity that is
potentially sub-linear to the size of the data
• Choose a representative subset of the data
– Simple random sampling may have very poor performance
in the presence of skew
• Develop adaptive sampling methods
– Stratified sampling:
• Approximate the percentage of each class (or
subpopulation of interest) in the overall database
• Used in conjunction with skewed data
Sampling
SRSWOR
(simple random
sample without
replacement)
SRSWR
Raw Data
Concept hierarchy
• Arrangement of concepts such as time , location.
– reduce the data by collecting and replacing low
level concepts (such as numeric values for the
attribute age) by higher level concepts (such as
young, middle-aged, or senior).
EXTRACT
• Process of extract the relevant data from the
operational database before bringing into the
data warehouse.
• Many commercial tools are available to help
with the extraction process.
• Data Junction is one of the commercial
products.
LOADING
• Loading often implies physical movement of the data
from the computer(s) storing the source database(s)
to that which will store the data warehouse
database.
• This takes place immediately after the extraction
phase. The most common channel for data
movement is a high-speed communication link.
• Ex: Oracle Warehouse Builder is the API from Oracle,
which provides the features to perform the ETL task
on Oracle Data Warehouse
ETL tools
• A large number of commercial tools support
the ETL process for data warehouses in a
comprehensive way.
 COPYMANAGER (InformationBuilders)
 DATASTAGE (Informix/Ardent)
 POWERMART (Informatica)
 DECISIONBASE (CA/Platinum)
 DATATRANSFORMATIONSERVICE (Microsoft)
 METASUITE (Minerva/Carleton)
 SAGENTSOLUTIONPLATFORM (Sagent)
 WAREHOUSEADMINISTRATOR (SAS)

More Related Content

What's hot

Dimensional Modeling
Dimensional ModelingDimensional Modeling
Dimensional ModelingSunita Sahu
 
Data Mining: Concepts and Techniques (3rd ed.) — Chapter _04 olap
Data Mining:  Concepts and Techniques (3rd ed.)— Chapter _04 olapData Mining:  Concepts and Techniques (3rd ed.)— Chapter _04 olap
Data Mining: Concepts and Techniques (3rd ed.) — Chapter _04 olapSalah Amean
 
Data preprocessing in Data Mining
Data preprocessing in Data MiningData preprocessing in Data Mining
Data preprocessing in Data MiningDHIVYADEVAKI
 
Association rule mining.pptx
Association rule mining.pptxAssociation rule mining.pptx
Association rule mining.pptxmaha797959
 
Data mining concepts and work
Data mining concepts and workData mining concepts and work
Data mining concepts and workAmr Abd El Latief
 
Chapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han & Kambererror007
 
Data Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture NotesData Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture NotesFellowBuddy.com
 
Data preprocessing using Machine Learning
Data  preprocessing using Machine Learning Data  preprocessing using Machine Learning
Data preprocessing using Machine Learning Gopal Sakarkar
 
Data preprocessing PPT
Data preprocessing PPTData preprocessing PPT
Data preprocessing PPTANUSUYA T K
 
Schemas for multidimensional databases
Schemas for multidimensional databasesSchemas for multidimensional databases
Schemas for multidimensional databasesyazad dumasia
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingankur bhalla
 
Application areas of data mining
Application areas of data miningApplication areas of data mining
Application areas of data miningpriya jain
 
Data Integration and Transformation in Data mining
Data Integration and Transformation in Data miningData Integration and Transformation in Data mining
Data Integration and Transformation in Data miningkavitha muneeshwaran
 
Dbms relational model
Dbms relational modelDbms relational model
Dbms relational modelChirag vasava
 

What's hot (20)

Dimensional Modeling
Dimensional ModelingDimensional Modeling
Dimensional Modeling
 
Data Mining: Concepts and Techniques (3rd ed.) — Chapter _04 olap
Data Mining:  Concepts and Techniques (3rd ed.)— Chapter _04 olapData Mining:  Concepts and Techniques (3rd ed.)— Chapter _04 olap
Data Mining: Concepts and Techniques (3rd ed.) — Chapter _04 olap
 
Data preprocessing in Data Mining
Data preprocessing in Data MiningData preprocessing in Data Mining
Data preprocessing in Data Mining
 
Datawarehouse and OLAP
Datawarehouse and OLAPDatawarehouse and OLAP
Datawarehouse and OLAP
 
Data Preprocessing
Data PreprocessingData Preprocessing
Data Preprocessing
 
Classification of data
Classification of dataClassification of data
Classification of data
 
Association rule mining.pptx
Association rule mining.pptxAssociation rule mining.pptx
Association rule mining.pptx
 
Data mining concepts and work
Data mining concepts and workData mining concepts and work
Data mining concepts and work
 
Chapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
 
Data Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture NotesData Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture Notes
 
Data preprocessing using Machine Learning
Data  preprocessing using Machine Learning Data  preprocessing using Machine Learning
Data preprocessing using Machine Learning
 
Data preprocessing PPT
Data preprocessing PPTData preprocessing PPT
Data preprocessing PPT
 
Schemas for multidimensional databases
Schemas for multidimensional databasesSchemas for multidimensional databases
Schemas for multidimensional databases
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Application areas of data mining
Application areas of data miningApplication areas of data mining
Application areas of data mining
 
Data Integration and Transformation in Data mining
Data Integration and Transformation in Data miningData Integration and Transformation in Data mining
Data Integration and Transformation in Data mining
 
Dbms relational model
Dbms relational modelDbms relational model
Dbms relational model
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
 
Data models
Data modelsData models
Data models
 
Data warehousing
Data warehousingData warehousing
Data warehousing
 

Similar to Data extraction, cleanup & transformation tools 29.1.16

Similar to Data extraction, cleanup & transformation tools 29.1.16 (20)

Data preprocessing ppt1
Data preprocessing ppt1Data preprocessing ppt1
Data preprocessing ppt1
 
Preprocessing.ppt
Preprocessing.pptPreprocessing.ppt
Preprocessing.ppt
 
Preprocessing.ppt
Preprocessing.pptPreprocessing.ppt
Preprocessing.ppt
 
Preprocessing.ppt
Preprocessing.pptPreprocessing.ppt
Preprocessing.ppt
 
Preprocessing.ppt
Preprocessing.pptPreprocessing.ppt
Preprocessing.ppt
 
Preprocessing.ppt
Preprocessing.pptPreprocessing.ppt
Preprocessing.ppt
 
data clean.ppt
data clean.pptdata clean.ppt
data clean.ppt
 
Preprocessing
PreprocessingPreprocessing
Preprocessing
 
Data preprocessing 2
Data preprocessing 2Data preprocessing 2
Data preprocessing 2
 
Unit 3 part ii Data mining
Unit 3 part ii Data miningUnit 3 part ii Data mining
Unit 3 part ii Data mining
 
Pre-Processing and Data Preparation
Pre-Processing and Data PreparationPre-Processing and Data Preparation
Pre-Processing and Data Preparation
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocess
Data preprocessData preprocess
Data preprocess
 
Data1
Data1Data1
Data1
 
Data1
Data1Data1
Data1
 
Data Preprocessing&tools
Data Preprocessing&toolsData Preprocessing&tools
Data Preprocessing&tools
 
Dmblog
DmblogDmblog
Dmblog
 
Unit 3 part i Data mining
Unit 3 part i Data miningUnit 3 part i Data mining
Unit 3 part i Data mining
 
Data pre processing
Data pre processingData pre processing
Data pre processing
 
Data preprocessing in Machine learning
Data preprocessing in Machine learning Data preprocessing in Machine learning
Data preprocessing in Machine learning
 

More from Dhilsath Fathima

engineer's are responsible for safety
engineer's are responsible for safetyengineer's are responsible for safety
engineer's are responsible for safetyDhilsath Fathima
 
Dwdm unit 1-2016-Data ingarehousing
Dwdm unit 1-2016-Data ingarehousingDwdm unit 1-2016-Data ingarehousing
Dwdm unit 1-2016-Data ingarehousingDhilsath Fathima
 
business analysis-Data warehousing
business analysis-Data warehousingbusiness analysis-Data warehousing
business analysis-Data warehousingDhilsath Fathima
 
Profession & professionalism
Profession & professionalismProfession & professionalism
Profession & professionalismDhilsath Fathima
 
Engineering as social experimentation
Engineering as social experimentation Engineering as social experimentation
Engineering as social experimentation Dhilsath Fathima
 
Moral autonomy & consensus &controversy
Moral autonomy & consensus &controversyMoral autonomy & consensus &controversy
Moral autonomy & consensus &controversyDhilsath Fathima
 

More from Dhilsath Fathima (10)

Information Security
Information SecurityInformation Security
Information Security
 
Sdlc model
Sdlc modelSdlc model
Sdlc model
 
engineer's are responsible for safety
engineer's are responsible for safetyengineer's are responsible for safety
engineer's are responsible for safety
 
Dwdm unit 1-2016-Data ingarehousing
Dwdm unit 1-2016-Data ingarehousingDwdm unit 1-2016-Data ingarehousing
Dwdm unit 1-2016-Data ingarehousing
 
business analysis-Data warehousing
business analysis-Data warehousingbusiness analysis-Data warehousing
business analysis-Data warehousing
 
Profession & professionalism
Profession & professionalismProfession & professionalism
Profession & professionalism
 
Engineering as social experimentation
Engineering as social experimentation Engineering as social experimentation
Engineering as social experimentation
 
Moral autonomy & consensus &controversy
Moral autonomy & consensus &controversyMoral autonomy & consensus &controversy
Moral autonomy & consensus &controversy
 
Virtues
VirtuesVirtues
Virtues
 
Business analysis
Business analysisBusiness analysis
Business analysis
 

Recently uploaded

Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfSumit Tiwari
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Celine George
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...M56BOOKSTORE PRODUCT/SERVICE
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17Celine George
 
Class 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdfClass 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdfakmcokerachita
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 

Recently uploaded (20)

Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
Staff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSDStaff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSD
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17
 
Class 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdfClass 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdf
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 

Data extraction, cleanup & transformation tools 29.1.16

  • 1. Data Extraction, Cleanup & Transformation Tools/Data Processing Tools By M.Dhilsath Fathima
  • 2. Define-Data Preprocessing • Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. • Data pre-processing is an important step in the data mining process. • The product of data pre-processing is the final training set.
  • 3. Why Data Preprocessing? • Data in the real world is dirty. noisy: containing errors or outliers. Incomplete: Missing Values, Lacking attribute values. Inconsistent Data • No quality data, no quality mining results! – Quality decisions must be based on quality data – Data warehouse needs consistent integration of quality data
  • 4. Major Tasks in Data Preprocessing • Data cleaning – Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies • Data integration – Integration of multiple databases, data cubes, or files • Data transformation – Normalization and aggregation • Data reduction – Obtains reduced representation in volume but produces the same or similar analytical results
  • 6. Data Cleaning • Data cleaning tasks – Fill in missing values – Identify outliers and smooth out noisy data – Correct inconsistent data
  • 7. Missing Data • Data is not always available – E.g., many tuples have no recorded value for several attributes, such as customer income in sales data • Missing data may be due to – equipment malfunction – inconsistent with other recorded data and thus deleted – data not entered due to misunderstanding – certain data may not be considered important at the time of entry – not register history or changes of the data
  • 8. How to Handle Missing Data? • Ignore the tuple: usually done when class label is missing  (Can be applicable for large data set) • Fill in the missing value manually: tedious + infeasible for large database? • Use a global constant to fill in the missing value • Use the attribute mean to fill in the missing value • Use the most probable value to fill in the missing value: inference-based such as Bayesian formula or decision tree
  • 9. Noisy Data/Outlier • Noise: random error or variance in a measured variable • Incorrect attribute values may due to – faulty data collection instruments – data entry problems – data transmission problems – inconsistency in naming convention – duplicate records – incomplete data – inconsistent data
  • 10. OUTLIER • A Data object or observations that do not comply with the general behavior or model of the data. Such data objects, which are grossly different from or inconsistent with the remaining set of data, are called outliers. • A data object that deviates significantly from the normal objects as if it were generated by a different mechanism. • Ex:
  • 11. How to Handle Noisy Data? (Not Now) • Binning method • Clustering • Combined computer and human inspection • Regression
  • 12. Data integration and transformation
  • 13. Data Integration • Data integration: – combines data from multiple sources into a coherent store Three Problems involved in data integration Schema integration Detecting and resolving data value conflicts. Redundant data occur often when integration of multiple databases
  • 14. Data Transformation • Smoothing: remove noise from data • Aggregation: summarization, data cube construction • Generalization: concept hierarchy climbing • Normalization: scaled to fall within a small, specified range – min-max normalization – z-score normalization – normalization by decimal scaling
  • 16. Data Reduction Strategies • Warehouse may store terabytes of data: Complex data analysis/mining may take a very long time to run on the complete data set • Data reduction – Obtains a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results • Data reduction strategies – Data cube aggregation – Numerosity reduction – concept hierarchy generation
  • 17. Data Cube Aggregation • The lowest level of a data cube – the aggregated data for an individual entity of interest – e.g., a customer in a phone calling data warehouse. • Multiple levels of aggregation in data cubes – Further reduce the size of data to deal with • Reference appropriate levels – Use the smallest representation which is enough to solve the task
  • 18. Dimensionality Reduction • Feature selection (i.e., attribute subset selection): – Select a minimum set of features such that the probability distribution of different classes given the values for those features is as close as possible to the original distribution given the values of all features – reduce # of patterns in the patterns, easier to understand
  • 19. Example of Decision Tree Induction Initial attribute set: {A1, A2, A3, A4, A5, A6} A4 ? A1? A6? Class 1 Class 2 Class 1 Class 2 > Reduced attribute set: {A1, A4, A6}
  • 20. Regression and Log-Linear Models • Linear regression: Data are modeled to fit a straight line – Often uses the least-square method to fit the line • Multiple regression: allows a response variable Y to be modeled as a linear function of multidimensional feature vector • Log-linear model: approximates discrete multidimensional probability distributions
  • 21. • Linear regression: Y = α + β X – Two parameters , α and β specify the line and are to be estimated by using the data at hand. – using the least squares criterion to the known values of Y1, Y2, …, X1, X2, …. • Multiple regression: Y = b0 + b1 X1 + b2 X2. – Many nonlinear functions can be transformed into the above. • Log-linear models: – The multi-way table of joint probabilities is approximated by a product of lower-order tables. – Probability: p(a, b, c, d) = αab βacχad δbcd Regress Analysis and Log- Linear Models
  • 22. Histograms • A popular data reduction technique • Divide data into buckets and store average (sum) for each bucket • Can be constructed optimally in one dimension using dynamic programming • Related to quantization problems. 0 5 10 15 20 25 30 35 40 10000 30000 50000 70000 90000
  • 23. Clustering • Partition data set into clusters, and one can store cluster representation only • Can be very effective if data is clustered but not if data is “smeared” • Can have hierarchical clustering and be stored in multi- dimensional index tree structures • There are many choices of clustering definitions and clustering algorithms, further detailed in Chapter 8
  • 24. Sampling • Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data • Choose a representative subset of the data – Simple random sampling may have very poor performance in the presence of skew • Develop adaptive sampling methods – Stratified sampling: • Approximate the percentage of each class (or subpopulation of interest) in the overall database • Used in conjunction with skewed data
  • 26. Concept hierarchy • Arrangement of concepts such as time , location. – reduce the data by collecting and replacing low level concepts (such as numeric values for the attribute age) by higher level concepts (such as young, middle-aged, or senior).
  • 27. EXTRACT • Process of extract the relevant data from the operational database before bringing into the data warehouse. • Many commercial tools are available to help with the extraction process. • Data Junction is one of the commercial products.
  • 28. LOADING • Loading often implies physical movement of the data from the computer(s) storing the source database(s) to that which will store the data warehouse database. • This takes place immediately after the extraction phase. The most common channel for data movement is a high-speed communication link. • Ex: Oracle Warehouse Builder is the API from Oracle, which provides the features to perform the ETL task on Oracle Data Warehouse
  • 29. ETL tools • A large number of commercial tools support the ETL process for data warehouses in a comprehensive way.  COPYMANAGER (InformationBuilders)  DATASTAGE (Informix/Ardent)  POWERMART (Informatica)  DECISIONBASE (CA/Platinum)  DATATRANSFORMATIONSERVICE (Microsoft)  METASUITE (Minerva/Carleton)  SAGENTSOLUTIONPLATFORM (Sagent)  WAREHOUSEADMINISTRATOR (SAS)