The document discusses data mining and data preprocessing. It notes that vast amounts of data are collected daily and that analyzing this data is an important need. One emerging data repository is the data warehouse, which integrates multiple heterogeneous data sources. Data mining transforms large collections of data into useful knowledge. Effective data preprocessing is essential for improving data quality and mining accuracy. Techniques such as data cleaning, integration, reduction, and transformation handle issues like missing values, noise, and inconsistencies, and improve the overall quality of the data.
Topics covered: Data Mining, the KDD Process, data mining functionalities (characterization, discrimination, association, classification, prediction, clustering, outlier analysis), and Data Cleaning as a Process.
Reference: Jiawei Han, Micheline Kamber, and Jian Pei, Data Mining: Concepts and Techniques, 3rd ed., The Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann Publishers, July 2011. ISBN 978-0123814791.
Data Mining and Data Preprocessing (11/9/2019)
1. We live in a world where vast amounts of data are collected daily.
Analyzing such data is an important need.
One emerging data repository architecture is the data warehouse, a repository of multiple heterogeneous data sources.
Data mining turns a large collection of data into knowledge.
A search engine (e.g., Google) receives hundreds of millions of queries every day.
Each query can be viewed as a transaction in which the user describes his or her information need.
Data Mining Introduction
3. Data Collection and Database Creation
(1960s and earlier)
Primitive file processing
4. Database Management Systems
(1970s to early 1980s)
Hierarchical and network database systems
Relational database systems
Data modeling: entity-relationship models, etc.
Indexing and access methods
5. Query languages: SQL, etc.
User interfaces, forms, and reports
Query processing and optimization
Transactions, concurrency control, and recovery
Online transaction processing (OLTP)
6. Advanced Database Systems (mid-1980s to present)
Advanced data models: extended-relational, object-relational, deductive, etc.
Managing complex data: spatial, temporal, multimedia, sequence and structured, scientific, engineering, moving objects, etc.
7. Data streams and cyber-physical data systems
Web-based databases (XML, semantic web)
Managing uncertain data and data cleaning
Integration of heterogeneous sources
Text database systems and integration with information retrieval
8. Extremely large data management
Database system tuning and adaptive systems
Advanced queries: ranking, skyline, etc.
Cloud computing and parallel data processing
Issues of data privacy and security
9. Advanced Data Analysis (late 1980s to present)
Data warehouse and OLAP
Data mining and knowledge discovery: classification, clustering, outlier analysis, association and correlation, comparative summary, discrimination analysis, pattern discovery, trend and deviation analysis, etc.
Mining complex types of data: streams, sequence, text, spatial, temporal, multimedia, Web, networks, etc.
10. Data mining applications: business, society, retail, banking, telecommunications, science and engineering, blogs, daily life, etc.
Data mining and society: invisible data mining, privacy-preserving data mining, mining social and information networks, recommender systems, etc.
11. Data Mining
The process that finds a small set of precious nuggets in a great deal of raw material.
Also known as knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging.
The process of knowledge discovery.
Knowledge Discovery from Data, or KDD
12. 1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations)
13. 5. Data mining (an essential process where intelligent methods are applied to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge)
7. Knowledge presentation (where visualization and knowledge representation techniques are used to present mined knowledge to users)
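The seven KDD steps above can be sketched as a small pipeline. This is a minimal illustration; the function names and toy records are invented for the example, not part of any library.

```python
# Minimal sketch of the KDD process as a pipeline of functions.
# All function names and the toy data are illustrative.

def clean(records):
    # Step 1: data cleaning - drop records with missing values.
    return [r for r in records if None not in r.values()]

def integrate(*sources):
    # Step 2: data integration - combine multiple sources.
    merged = []
    for s in sources:
        merged.extend(s)
    return merged

def select(records, fields):
    # Step 3: data selection - keep only task-relevant attributes.
    return [{f: r[f] for f in fields} for r in records]

def transform(records, field):
    # Step 4: data transformation - aggregate (here, sum) a field.
    return sum(r[field] for r in records)

source_a = [{"item": "A", "price": 4}, {"item": "B", "price": None}]
source_b = [{"item": "C", "price": 8}]

data = clean(integrate(source_a, source_b))
total = transform(select(data, ["price"]), "price")
print(total)  # 12
```

Steps 5-7 (mining, pattern evaluation, presentation) would consume the transformed data downstream.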
14. Data mining is the process of discovering interesting patterns and knowledge from large amounts of data.
The data sources can include databases, data warehouses, the Web, other information repositories, or data that are streamed into the system dynamically.
15. Figure 1.1: Data mining as a step in the process of knowledge discovery.
17. Today's real-world databases are highly susceptible to noisy, missing, and inconsistent data due to their typically huge size.
They originate from multiple, heterogeneous sources.
Low-quality data will lead to low-quality mining results.
The data must be preprocessed to help improve the quality of the data and, consequently, of the mining results, and to improve the efficiency and ease of the mining process.
Data Preprocessing
18. There are several data preprocessing techniques.
Data cleaning can be applied to remove noise and correct inconsistencies in the data.
Data integration merges data from multiple sources into a coherent data store, such as a data warehouse.
Data reduction can reduce data size by, for instance, aggregating, eliminating redundant features, or clustering.
Data transformations (e.g., normalization) may be applied, where data are scaled to fall within a smaller range, such as 0.0 to 1.0. This can improve the accuracy and efficiency of mining algorithms.
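The normalization mentioned above — scaling values into a range like 0.0 to 1.0 — is min-max normalization, which can be sketched in a few lines. The price values here are invented for the example.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale values linearly into [new_min, new_max] (min-max normalization)."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min  # assumes the values are not all equal
    return [new_min + (v - old_min) * (new_max - new_min) / span for v in values]

prices = [200, 300, 400, 600, 1000]
print(min_max_normalize(prices))  # [0.0, 0.125, 0.25, 0.5, 1.0]
```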
19. Data Quality
Data have quality if they satisfy the requirements of the intended use.
Many factors comprise data quality:
1. Accuracy
2. Completeness
3. Consistency
4. Timeliness
5. Believability and Interpretability
20. Inaccurate Data:
The data collection instruments used may be faulty.
There may have been human or computer errors occurring at data entry.
Users may purposely submit incorrect data values for mandatory fields when they do not wish to submit personal information.
21. Incomplete Data:
Incomplete data can occur for a number of reasons.
Attributes of interest may not always be available, such as customer information for sales transaction data.
Other data may not be included simply because they were not considered important at the time of entry.
Relevant data may not be recorded due to a misunderstanding or because of equipment malfunctions.
22. Inconsistent Data:
Incorrect data may also result from inconsistencies in naming conventions or data codes, or inconsistent formats for input fields (e.g., date).
Timeliness:
Month-end data that are not updated in a timely fashion have a negative impact on data quality.
23. Believability:
Reflects how much the data are trusted by users.
Interpretability:
Reflects how easily the data are understood.
Suppose that a database, at one point, had several errors, all of which have since been corrected. Even though the database is now accurate, complete, consistent, and timely, sales department users may regard it as of low quality due to poor believability and interpretability.
24. Data cleaning routines fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.
Missing Values
Ignore the tuple: This is usually done when the class label is missing.
This method is not very effective unless the tuple contains several attributes with missing values.
Data Cleaning
25. Fill in the missing value manually: Time-consuming and may not be feasible for a large data set with many missing values.
Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like "Unknown".
Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value.
Use the attribute mean or median for all samples belonging to the same class as the given tuple.
Use the most probable value to fill in the missing value.
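The central-tendency strategy above can be sketched with the standard library; the `fill_missing` helper and the toy income values are invented for this example.

```python
from statistics import mean, median

def fill_missing(values, strategy="mean"):
    """Fill None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]

income = [30, None, 50, 40, None]
# For this data both strategies fill the gaps with 40.
print(fill_missing(income, "mean"))
print(fill_missing(income, "median"))
```

A class-conditional variant would compute the fill value per class label rather than over the whole attribute.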
26. Noisy Data:
Noise is random error or variance in a measured variable.
Data smoothing techniques:
Binning:
Binning methods smooth a sorted data value by consulting its "neighborhood," that is, the values around it.
The sorted values are distributed into a number of "buckets," or bins.
Because binning methods consult the neighborhood of values, they perform local smoothing.
27. Binning methods for data smoothing:
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:
  Bin 1: 4, 8, 15
  Bin 2: 21, 21, 24
  Bin 3: 25, 28, 34
Smoothing by bin means:
  Bin 1: 9, 9, 9
  Bin 2: 22, 22, 22
  Bin 3: 29, 29, 29
Smoothing by bin boundaries:
  Bin 1: 4, 4, 15
  Bin 2: 21, 21, 24
  Bin 3: 25, 25, 34
28. In smoothing by bin medians, each bin value is replaced by the bin median.
In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value.
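The worked price example above can be reproduced in code; the helper names are invented for this sketch.

```python
from statistics import mean

def equal_frequency_bins(sorted_values, n_bins):
    """Split already-sorted values into n_bins buckets of equal size."""
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    # Replace every value in a bin with the bin mean.
    return [[round(mean(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # Replace each value with the nearer of the bin's min or max.
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
print(smooth_by_means(bins))       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```

The outputs match the slide: bin means 9, 22, 29, and boundary smoothing pulling 8 down to 4 and 28 down to 25.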
29. Regression:
A technique that conforms data values to a function.
Linear regression involves finding the "best" line to fit two attributes, so that one attribute can be used to predict the other.
Multiple linear regression is an extension of linear regression in which more than two attributes are involved and the data are fit to a multidimensional surface.
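Finding the "best" line for two attributes means a least-squares fit, which follows directly from the definitions; the data here are invented for illustration.

```python
def linear_fit(xs, ys):
    """Least-squares fit of y = a + b*x for two attributes."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope: covariance of x and y divided by variance of x.
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx  # intercept so the line passes through the means
    return a, b

years = [1, 2, 3, 4]
salary = [30, 35, 40, 45]
a, b = linear_fit(years, salary)
print(a, b)       # 25.0 5.0
print(a + b * 5)  # predicted value for x = 5: 50.0
```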
30. Outlier analysis:
Outliers may be detected by clustering, where similar values are organized into groups, or "clusters."
Values that fall outside of the set of clusters may be considered outliers.
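Clustering-based outlier detection can be sketched as follows, assuming cluster centroids have already been found (e.g., by k-means); the points, centroids, and distance threshold are all illustrative.

```python
def cluster_outliers(points, centroids, max_dist):
    """Flag 2-D points farther than max_dist from every cluster centroid."""
    def dist(p, c):
        return ((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2) ** 0.5
    return [p for p in points if min(dist(p, c) for c in centroids) > max_dist]

centroids = [(1.0, 1.0), (8.0, 8.0)]  # assumed output of a prior clustering step
points = [(1.2, 0.9), (0.8, 1.1), (7.9, 8.2), (4.5, 4.5)]
print(cluster_outliers(points, centroids, max_dist=2.0))  # [(4.5, 4.5)]
```

The point (4.5, 4.5) lies far from both clusters, so it is flagged, mirroring the idea in Figure 1.2.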
31. Figure 1.2: A 2-D customer data plot with respect to customer locations in a city, showing three data clusters. Outliers may be detected as values that fall outside of the cluster sets.
32. Data Cleaning as a Process:
Discrepancies can be caused by several factors, including poorly designed data entry forms that have many optional fields, human error in data entry, deliberate errors, and data decay.
Discrepancies may also arise from inconsistent data representations and inconsistent use of codes.
Other sources of discrepancies include errors in instrumentation devices that record data, and system errors.
There may also be inconsistencies due to data integration.
33. Discrepancy Detection Methods:
Field overloading is an error source that typically results when developers squeeze new attribute definitions into unused (bit) portions of already defined attributes.
A unique rule says that each value of the given attribute must be different from all other values for that attribute.
A consecutive rule says that there can be no missing values between the lowest and highest values for the attribute, and that all values must also be unique (e.g., as in check numbers).
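The unique and consecutive rules can be checked mechanically. This sketch uses check numbers as in the example above; the helper names and values are invented.

```python
from collections import Counter

def unique_rule_violations(values):
    """Unique rule: every value must be distinct. Return duplicated values."""
    return sorted(v for v, n in Counter(values).items() if n > 1)

def consecutive_rule_violations(values):
    """Consecutive rule: no gaps between the lowest and highest value.
    Return the missing values in the range."""
    present = set(values)
    return sorted(v for v in range(min(values), max(values) + 1)
                  if v not in present)

checks = [101, 102, 104, 105, 105]
print(unique_rule_violations(checks))       # [105]
print(consecutive_rule_violations(checks))  # [103]
```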
34. Discrepancy Detection Methods:
A null rule specifies the use of blanks, question marks, special characters, or other strings that may indicate the null condition.
Data scrubbing tools use simple domain knowledge (e.g., knowledge of postal addresses and spell-checking) to detect errors and make corrections in the data.
Data auditing tools find discrepancies by analyzing the data to discover rules and relationships, and detecting data that violate such conditions. They are variants of data mining tools; for example, they may employ statistical analysis to find correlations, or clustering to identify outliers.
35. Commercial tools can assist in the data transformation step.
Data migration tools allow simple transformations to be specified, such as replacing the string "gender" by "sex."
ETL (extraction/transformation/loading) tools allow users to specify transforms through a graphical user interface (GUI).