Outline
o Big Data: Challenges and Opportunities
• What is Big Data?
• Harnessing Big Data
o Efforts in Data Science for Big Data applications
• The paradigm shift
• Data Management
o Challenges being faced
• Data Growth
• Factors that influence growth
o Evolution of Data Science
o What is Data Mining?
• View of Data Mining: Web Mining
o Views of Data Mining
o Value in analyzing Big Data
o Applications in Big Data Analytics
o Summary
Big Data: Biomedical Data Growth
 “The discovery, development and application of powerful
computational tools to extract knowledge from complex
biological data.”
 The Human Genome Project completed a full genetic map of
humans in 2003.
Data in Post-genomic Era
“Now this is not the end. It is not even the beginning
of the end. But it is, perhaps, the end of the
beginning”
- Spoken in 1942 after 3 years of war
• 4^3,000,000,000 possible human DNA sequences.
• 10^350 possible proteins. Only 10^79 atoms in the universe.
• GenBank is now over 23.9 billion
bp and doubling every 18
months.
• Millions of entries in
protein databases.
• Microarray data is being collected at
Big Data: IOT Growth
The Information Age: Faster, Better,
Cheaper!
Moore’s Law (1965): The number of transistors on a chip doubles every 18-24
months
• Pentium 4, released in 2000, had 42 million transistors.
• Intel 62-core Xeon Phi, released in 2012, had 5,000 million (5 billion) transistors.
• Intel Broadwell-U (2015) has 1.9 billion transistors.
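As a rough check on the doubling claim, the sketch below projects the Pentium 4 count forward to 2012 under both ends of the 18-24 month range. The function name and the fixed-doubling-period model are illustrative assumptions; real chips vary in die size and design, so this is only a back-of-the-envelope comparison.

```python
def projected_transistors(start_count, start_year, target_year, doubling_months):
    """Project a transistor count forward, assuming a fixed doubling period."""
    elapsed_months = (target_year - start_year) * 12
    return start_count * 2 ** (elapsed_months / doubling_months)


PENTIUM4_2000 = 42_000_000  # figure quoted on the slide

for months in (18, 24):
    estimate = projected_transistors(PENTIUM4_2000, 2000, 2012, months)
    print(f"doubling every {months} months: {estimate / 1e9:.2f} billion by 2012")
```

Under these assumptions the 2012 projection falls between roughly 2.7 and 10.8 billion transistors, which brackets the 5 billion quoted for the Xeon Phi.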
Moore (1998): "If the automobile industry advanced as rapidly as the
semiconductor industry, a Rolls Royce would get half a million miles per gallon, and
it would be cheaper to throw it away than to park it."
Invention of the Transistor:
Development of Semiconductor Technology
Factors Influencing Data Growth
 Data collection and data availability
o Automated data collection tools, database systems,
Web, computerized society
 Major sources of abundant data
o The Internet: Social networking, social media….
o Business: Web, e-commerce, transactions, stocks, …
o Science: Remote sensing, bioinformatics, scientific simulation,
…
o Society and everyone: news, digital cameras, YouTube
Data wealth assessment
 5 exabytes* of unique information produced per year
o 37,000 times the size of the Library of Congress book collections (136 TB).
o 92% of the new information was stored on magnetic
media, mostly in hard disks
• Film represents 7% of the total, paper 0.01%, and optical
media 0.002%.
o 800MB each for every man, woman and child.
• It would take about 30 feet of books to store the equivalent
of 800 MB of information on paper.
o 30% annual data growth expected.
“How much information?” Peter Lyman and Hal Varian, Cal-Berkeley, Info Mgmt and Systems (2003)
- *1 exabyte=1000000000000000000 (10^18) bytes
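The two headline ratios on this slide can be reproduced with simple arithmetic. In the sketch below, the world population of 6.3 billion is an assumption (an approximate figure for the early 2000s, not stated in the slide); the other numbers come from the slide.

```python
EXABYTE = 10**18
TERABYTE = 10**12
MEGABYTE = 10**6

new_info_bytes = 5 * EXABYTE                 # unique information per year (slide)
library_of_congress_bytes = 136 * TERABYTE   # book collections (slide)
world_population = 6.3e9                     # assumption: approximate early-2000s value

loc_multiples = new_info_bytes / library_of_congress_bytes
mb_per_person = new_info_bytes / world_population / MEGABYTE

print(f"Library of Congress equivalents: {loc_multiples:,.0f}")
print(f"MB per person: {mb_per_person:.0f}")
```

Both results land close to the rounded values quoted above (about 37,000 collections and about 800 MB per person).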
Harnessing Big Data
 OLTP: Online Transaction Processing (DBMSs)
 OLAP: Online Analytical Processing (Data Warehousing)
 RTAP: Real-Time Analytics Processing (Big Data Architecture &
technology)
What is Big Data?
 No single standard definition…
“Big Data” is data whose scale, diversity, and
complexity require new architecture, techniques,
algorithms, and analytics to manage it and extract
value and hidden knowledge from it…
Efforts in Data Science for Big Data
Applications
Big Data is characterized by
• Volume: Terabytes / Petabytes of data per day
• Variety: Semi-structured, Unstructured data
• Velocity: Data growth
• Data Analytics: Real-time Business Intelligence, Data mining, Data Integration
http://www.forbes.com/sites/christopherfrank/2012/03/25/improving-decision-making-in-the-world-of-big-data/
Characteristics of Big Data: Volume
Exponential increase in collected/generated data
 Data Volume
o Expected 44x increase between years 2009 - 20
o From 0.8 zettabytes to 35 zettabytes
 Data volume is increasing exponentially
Scale labels: Terabytes, Petabytes, Exabytes, Zettabytes
Characteristics of Big Data: Variety
 Various formats, types, and
structures
 Text, numerical, images, audio,
video, sequences, time series,
social media data, multi-dim
arrays, etc…
 Static data vs. streaming data
 A single application can be
generating/collecting many types
of data
To extract knowledge all these types
of data need to linked together
13
Characteristics of Big Data: Velocity
 Data is being generated fast and needs to be processed fast
 Online Data Analytics
 Late decisions → missed opportunities
 Examples
o E-Promotions: based on your current location, your purchase history, and what you like → send promotions right now for the store next to you
o Healthcare monitoring: sensors monitoring your activities and body → any abnormal measurements require immediate reaction
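The healthcare monitoring example can be illustrated with a minimal streaming check that flags a reading as soon as it arrives, rather than after a batch job. This is only a sketch under stated assumptions: the rolling window of 5 readings, the 3-standard-deviation threshold, and the heart-rate values are all hypothetical, and a real monitoring system would use clinically validated rules.

```python
from collections import deque
from statistics import mean, stdev


def detect_abnormal(readings, window=5, threshold=3.0):
    """Yield (index, value) for readings far from the recent rolling mean."""
    recent = deque(maxlen=window)
    for i, value in enumerate(readings):
        if len(recent) == window:
            mu, sigma = mean(recent), stdev(recent)
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield i, value
                continue  # keep the abnormal reading out of the baseline
        recent.append(value)


# Hypothetical heart-rate stream (beats per minute)
heart_rate = [72, 74, 73, 75, 74, 73, 140, 74, 72]
alerts = list(detect_abnormal(heart_rate))
print(alerts)
```

Because the function is a generator, each alert is produced the moment the offending reading is seen, which is the point of the velocity requirement.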
Some Make it 4Vs
What’s driving Big Data Analytics
- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More of a real-time
A Paradigm Shift
Massively Parallel Processing database technology
• Hadoop (open source)
• Map-reduce
Figure labels: Data, Information Integration, Action, Decision Making
• “Big Data” NoSQL: Real-Time Capture, Read & Update
• Traditional Databases: RDBMS
• “Big Data” HADOOP: Store and Analyze
• Mobility Trends, Sensor Data, Click Streams, Event Data, Web Logs
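The Map-reduce model named on this slide can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle, and reduce phases, not Hadoop's API; the function names and the click-stream records are hypothetical. In a real cluster each phase would run in parallel across many nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map: emit one (event, 1) pair per event name in a log record."""
    return [(event.lower(), 1) for event in record.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


# Hypothetical click-stream records
records = ["click view click", "view purchase", "click"]
mapped = chain.from_iterable(map_phase(r) for r in records)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

The map and reduce functions never share state, which is what lets a framework distribute them over a shared-nothing cluster.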
(Big data is not just) Data Management
Data management is the organization, administration and
governance of large volumes of both structured and
unstructured data.
Challenges in Handling Big Data
 The Bottleneck is in technology
o New architecture, algorithms, techniques are needed
 Also in technical skills
o Experts in using the new technology and dealing with big
data
Evolution of Sciences: New Data Science Era
 1990-now: Data science
o The flood of data from new scientific instruments and simulations
o The ability to economically store and manage petabytes of data
online
o The Internet and computing Grid that makes all these
archives universally accessible
o Scientific info. management, acquisition, organization, query,
and visualization tasks scale almost linearly with data
volumes
o Data mining is a major new challenge!
Source: Han and Kamber, Data Mining: Concepts and Techniques, 2nd ed.
Evolution of “Data Science”
1960s
• Data collection
• Database creation
• DBMS
1970s
• Relational data model
• Relational DBMS implementation
1980s
• Advanced Data Models
• Object Oriented Databases
1990s
• Data Mining
• Data Warehousing
• Multimedia Databases
• Web Databases
2000s
• Streaming data management and mining
• Mobile computing data management
• Global Information Systems
• e-Commerce
2010s
• BIG data management
• Distributed cloud management
• Social Networking
• Data Analytics
Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm. ACM, 45(11): 50-54, Nov. 2002
What Is Data Mining?
 Data mining (knowledge discovery from data)
o Extraction of interesting (non-trivial, implicit, previously unknown
and potentially useful) patterns or knowledge from huge amount
of data
 Alternative names
o Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data
dredging, information harvesting, business intelligence, etc.
Data Mining
Descriptive Data Mining, Predictive Data Mining
Data Mining Core Components
Association, Trend Analysis, Outlier Analysis, Classification, Characterization, Discrimination, Clustering
Techniques of Data Mining
 Anomaly detection
 Predictive modelling, regression
 Supervised and unsupervised learning
 Simulation
 Time series analysis
 Visualisation
 Association rule learning
 Classification
 Cluster analysis
 Data fusion and integration
 Ensemble learning
 Genetic algorithms
 Machine learning
 Neural networks
 Pattern recognition
Humanization of the Web
Web 1.0 (1990-2000)
• Mostly read-only web
• Approximately 250,000 sites
• Approximately 45 million global users
• Approximately 1/20th user generated content
Web 2.0 (2000-2010)
• Widely read-write web
• Approximately 80,000,000 sites
• Approximately 1 billion plus global users
• Approximately 1/4th user generated content
Web 3.0 (2010-2020)
• Widely read-write web
• Approximately 800,000,000 sites
• Approximately 8 billion plus global users
• Approximately ½ user generated content
Catalyst
Collective Intelligence
Views of Data Mining: Eg. Web Mining
Knowledge Discovery (KDD) Process
 This is a view from typical database systems and data
warehousing communities
 Data mining plays an essential role in the knowledge discovery
process
Example: A Web Mining Framework
Web mining usually involves
• Data cleaning
• Data integration from multiple sources
• Warehousing the data
• Data cube construction
• Data selection for data mining
• Data mining
• Presentation of the mining results
• Patterns and knowledge to be used or stored into knowledge-base
Web Mining
• Web usage mining
• Web content mining
• Web structure mining
KDD : The BI perspective
Increasing potential to support business decisions (bottom to top)
Roles, top to bottom: End User, Business Analyst, Data Analyst, DBA
• Decision Making
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: Statistical Summary, Querying, and Reporting
• Data Preprocessing/Integration, Data Warehouses
• Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
KDD: Simplified in Data Mining
This is a view from typical machine learning and statistics communities
Input Data → Data Pre-Processing → Data Mining → Post-Processing → Pattern, Information and Knowledge
Data Pre-Processing
• Data integration
• Normalization
• Feature selection
• Dimension reduction
Data Mining
• Pattern discovery
• Association & correlation
• Classification
• Clustering
• Outlier analysis
Post-Processing
• Pattern evaluation
• Pattern selection
• Pattern interpretation
• Pattern visualization
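The three stages of this simplified view can be sketched as three small functions chained together. The choices below are illustrative assumptions, not the method of the cited source: min-max scaling stands in for normalization, dropping constant columns stands in for feature selection, distance from the centroid stands in for outlier analysis, and keeping the top-ranked result stands in for pattern selection. All function names and data values are hypothetical.

```python
def preprocess(rows):
    """Pre-processing: min-max normalize each column; drop constant columns."""
    scaled_columns = []
    for column in zip(*rows):
        low, high = min(column), max(column)
        if high == low:
            continue  # feature selection: a constant column carries no signal
        scaled_columns.append([(v - low) / (high - low) for v in column])
    return [list(row) for row in zip(*scaled_columns)]


def mine_outlier_scores(rows):
    """Data mining: score each row by Euclidean distance from the centroid."""
    centroid = [sum(column) / len(rows) for column in zip(*rows)]
    return [
        sum((v - c) ** 2 for v, c in zip(row, centroid)) ** 0.5 for row in rows
    ]


def postprocess(scores, top_k=1):
    """Post-processing: rank the scores and keep the top_k row indices."""
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return [index for index, _ in ranked[:top_k]]


# Hypothetical input data: the third column is constant, the last row is unusual
raw = [[1.0, 10.0, 5.0], [1.2, 11.0, 5.0], [0.9, 9.5, 5.0], [5.0, 30.0, 5.0]]
features = preprocess(raw)
scores = mine_outlier_scores(features)
top = postprocess(scores)
print(top)
```

Keeping each stage as a separate function mirrors the diagram: any stage can be swapped (for example, clustering instead of outlier scoring) without touching the others.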
Which View Do You Prefer?
 Which view do you prefer?
o KDD, simplified view, or Business Intelligence
 Data Mining vs. Data Exploration
o Business intelligence view
• Warehouse, data cube, reporting but not much mining
o Business objects vs. data mining tools
o Supply chain example: mining vs. OLAP vs. presentation
tools
o Data presentation vs. data exploration
Multi-Dimensional View of Data Mining
Dimensions: Data, Knowledge, Techniques, Application
A. Data to be mined
• Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and web, multi-media, graphs & social and information networks
B. Knowledge to be mined (or: Data mining functions)
• Characterization, discrimination, association, classification,
clustering, trend/deviation, outlier analysis, etc.
• Descriptive vs. predictive data mining
• Multiple/integrated functions and mining at multiple levels
C. Techniques utilized
• Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc.
D. Applications adapted
• Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
Data Mining: On What Kinds of Data?
 Structured Data: Database-oriented data sets and applications
o Relational database, data warehouse, transactional database
 Unstructured Data: Advanced data sets and advanced applications
o Data streams and sensor data
o Time-series data, temporal data, sequence data (incl. bio-sequences)
o Structure data, graphs, social networks and multi-linked data
o Object-relational databases
o Heterogeneous databases and legacy databases
o Spatial data and spatiotemporal data
o Multimedia database
o Text databases
o The World-Wide Web
Data Mining is Confluence of Multiple Disciplines
Data Mining draws on:
• Machine Learning
• Statistics
• Visualization
• HPC
• Database technology
• Algorithms
• Applications
• Pattern Recognition
Why Confluence of Multiple Disciplines?
Tremendous amount of data
• Algorithms must be highly scalable to handle data volumes such as terabytes
High-dimensionality of data
• Data may have tens of thousands of dimensions
High complexity of data
• Data streams and sensor data
• Time-series data, temporal data, sequence data
• Structure data, graphs, social networks and multi-linked data
• Heterogeneous databases and legacy databases
• Spatial, spatiotemporal, multimedia, text and Web data
• Software programs, scientific simulations
New and sophisticated applications
The functions of Data Mining
• Generalization
• Association and Correlation Analysis
• Classification
• Cluster Analysis
• Outlier Analysis
• Evolution / Trend Analysis
• Structure / Network Analysis
Current Issues in Data Mining
 Mining Methodology
o Mining various and new kinds of knowledge
o Mining knowledge in multi-dimensional space
o Data mining: An interdisciplinary effort
o Boosting the power of discovery in a networked
environment
o Handling noise, uncertainty, and incompleteness of data
o Pattern evaluation and pattern- or constraint-guided
mining
 User Interaction
o Interactive mining
o Incorporation of background knowledge
o Presentation and visualization of data mining results
Current Issues in Data Mining
 Efficiency and Scalability
o Efficiency and scalability of data mining algorithms
o Parallel, distributed, stream, and incremental mining
methods
 Diversity of data types
o Handling complex types of data
o Mining dynamic, networked, and global data repositories
 Data mining and society
o Social impacts of data mining
o Privacy-preserving data mining
o Invisible data mining
Evaluation of Knowledge
 Is all mined knowledge interesting?
o One can mine a tremendous amount of “patterns” and knowledge
o Some may fit only a certain dimension space (time, location, …)
o Some may not be representative, may be transient, …
 Evaluation of mined knowledge → directly mine only interesting knowledge?
o Descriptive vs. predictive
o Coverage
o Typicality vs. novelty
o Accuracy
o Timeliness
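One common way to make "interesting" measurable for association patterns is to compute support and confidence and keep only rules above chosen thresholds. Treating support as a rough proxy for coverage and confidence as a rough proxy for accuracy is an assumption made here for illustration; the thresholds, function names, and transactions below are hypothetical.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Among transactions containing the antecedent, the share with the consequent."""
    with_antecedent = [t for t in transactions if antecedent <= t]
    if not with_antecedent:
        return 0.0
    return sum(consequent <= t for t in with_antecedent) / len(with_antecedent)


def is_interesting(antecedent, consequent, transactions,
                   min_support=0.4, min_confidence=0.7):
    """Keep a rule only if it is both frequent enough and reliable enough."""
    return (
        support(antecedent | consequent, transactions) >= min_support
        and confidence(antecedent, consequent, transactions) >= min_confidence
    )


# Hypothetical market-basket transactions
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

print(is_interesting({"bread"}, {"milk"}, transactions))
print(is_interesting({"diapers"}, {"beer"}, transactions))
```

With these thresholds the rule bread → milk passes (support 0.6, confidence 0.75) while diapers → beer is filtered out because its confidence is only about 0.67.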
Value in analyzing Big Data
 Big data is more real-time in
nature than traditional DW
applications
 Traditional DW architectures (e.g. Exadata, Teradata) are not well-suited for big data apps
 Shared-nothing, massively parallel processing, scale-out architectures are well-suited for big data apps
Big Data Market Growth
Big Data Projected Overall Revenue by 2017
Big Data Analytics
 Important insights and answers that were once considered unattainable are now available because of this data.
o The ability to extract insights from big data
 Enables better, faster, and more reliable hypothesis testing and decision making.
Big Data Tools, Platforms and Services
Applications: Cybersecurity (or the lack thereof)
• Event Management
• Threat Analysis
• Surveillance
• Cyber Security
• Data Forensics
• Incidence Management
• Community Intelligence
Healthcare IT
Figure labels: Patient, Researcher, Clinician
Source: http://visual.ly/latest-trends-healthcare-it-solutions
Big Data and Clinical Informatics
Figure labels: Customer Service, Consumerism & Health Exchanges, Compliance, Sales, Marketing, Medical, Claims, Dental, Medicaid, Provider Data Management, Portals, Medicare Advantage, Planning, Workforce, Learn
META DATA
• Various aspects of business with varied data needs
• Direct links between databases and business are confusing and frustrating
• Diverse databases and data sources
• Understand, refine, and plan for needs
• Planning, learning, workforce development
Michael Hehenberger; The Integration of Phenotypic and Genotypic
Data: Clinical Genomics; NIH BECON/BISTIC Symposium,
Bethesda, MD, June 21, 2004
IOT Smart World: an example
Big Data and Recommendation Engines
Big Data and Social Network Analytics
Visual representation of the evolution of social
networks.
Source:
http://datamining.typepad.com/data_mining/2006/10/novel_graph_vis.html
Summary
 Big Data Analytics: discovering interesting patterns and knowledge from massive amounts of data
 A natural evolution of science and information technology, in great demand, with wide applications
 Data analytics can be performed on a variety of data
 Data mining functionalities: characterization, discrimination, association, classification, clustering, trend and outlier analysis, etc.
 Major issues in data mining
 A variety of applications exist in healthcare, business analytics, and social networks
Thank you!
http://dmrl.latech.edu