Big Data Analytics: From SQL to Machine Learning and Graph Analysis

  1. © 2017 IBM Corporation. Big Data Analytics: From SQL to Machine Learning and Graph Analysis. Yuanyuan Tian, IBM Research - Almaden. Keynote for KDD bigdas 2017.
  2. A bit about me • I am a computer scientist who builds data management and analytics systems • My talk is from the perspective of a big data analytics system builder • I have some exposure to healthcare domain data and analytics problems through collaboration with experts in the IBM Watson Health division
  3. What is big data? • Gartner's 3Vs definition: "Big Data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation." • Extra Vs: Variability, Veracity, Visualization, Value • How big is big data? It is all relative; it is always a moving definition; it is not all about the size. My answer: when conventional data management and analytics tools are inadequate = big data. (Figure: Big Data 3Vs, from https://www.linguamatics.com/blog/big-data-real-world-data-where-does-text-analytics-fit)
  4. Why is big data important for health care? • Large volumes of data: eHealth, mHealth, sensor & wearable technologies, genome sequencing • New applications: personalized medicine, clinical risk intervention, predictive analytics
  5. Big data analytics • Big data analytics comes in different forms!
  6. Two dimensions of big data analytics • Data type: structured data (records in relational database tables); semi-structured data (JSON and XML); unstructured data, including text data, graph data (social and interaction data), and multi-media data (images and videos) • Complexity of analytics: data entry and retrieval (look up a patient's EHR at check-in); descriptive summaries (compute the number of outbreaks across different geo regions); pattern discovery, i.e. data mining (identify unusual patterns of medical claims by clinics, physicians, labs, etc.); predictive analytics, i.e. machine learning (predict a patient's readmission to the hospital)
  7. Big data analytics landscape (rows: analytics complexity; columns: data type, i.e. structured, semi-structured, graph, text, multi-media). Data entry and retrieval: OLTP (online transactional processing) for structured data; key-value/document stores for semi-structured data; graph databases for graph data; keyword search for text. Descriptive summaries: SQL-on-Hadoop* (OLAP, online analytical processing) for structured/semi-structured data; degree distribution and clustering coefficient distribution for graphs; word clouds for text. Pattern discovery (data mining): DM on big data (frequent pattern mining, anomaly detection, clustering); graph processing (graph clustering, influence analysis); topic modeling and sentiment analysis for text. Predictive analytics (machine learning): ML on big data (regression, classification, recommendation, link prediction) across data types.
  8. Big data analytics landscape (same matrix as slide 7).
  9. Big data analytics landscape (same matrix as slide 7).
  10. Background on traditional SQL processing • OLTP (online transactional processing) vs OLAP (online analytical processing) • Specialized OLTP and OLAP systems connected by the ETL (extract, transform, load) process. OLTP: purpose is data entry and retrieval; queries are simple reads, inserts, updates, and deletes; speed is real-time (low latency and high throughput). OLAP: purpose is BI (business intelligence) or reporting; queries are more complex analytical and ad hoc queries (mostly optimized for read); speed is interactive. (Figure: Transactions -> OLTP System -> ETL/Replication -> EDW (enterprise data warehouse) OLAP System -> Analytic Queries)
  11. Why SQL-on-Hadoop? • SQL (Structured Query Language) is the de facto language for transactional and decision support systems and BI tools • Healthcare analysts and hospital IT experts are very familiar with SQL • SQL-on-Hadoop eases the transition to big data: little or no change to existing BI tools and applications • SQL-on-Hadoop overcomes some shortcomings of conventional EDWs: scalability & fault tolerance; better support for semi-structured data; directly work on raw data (query in situ) by avoiding ETL
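
To make "query in situ" concrete, here is a minimal PySpark sketch that queries raw JSON files directly with SQL, with no ETL step; the file path and column names are hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("query-in-situ").getOrCreate()

    # Read semi-structured data as-is; Spark infers the schema from the JSON.
    visits = spark.read.json("hdfs:///raw/clinical_visits.json")
    visits.createOrReplaceTempView("visits")

    # Analysts and existing BI tools can keep using plain SQL.
    spark.sql("""
        SELECT Reason, COUNT(*) AS num_visits
        FROM visits
        GROUP BY Reason
        ORDER BY num_visits DESC
    """).show()
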
  12. SQL-on-Hadoop Landscape (figure grouping systems into camps): open data formats, either via a SQL layer on an existing platform (e.g. Spark SQL) or via an MPP query engine (e.g. Impala, Big SQL); proprietary data formats (e.g. Vortex); and remote query from existing EDWs (e.g. PolyBase, SQL-H, dashDB).
  13. Technical Challenge • How to distribute data and computation in a large cluster of machines for performance • Bottleneck: transferring large volumes of data across the network • Example: join (combining columns from multiple tables). Clinical Visits table (PID, VisitDate, Reason): (1, 2016-03-15, Fever), (2, 2016-10-20, Headache), (1, 2017-02-08, Fever), (3, 2017-06-18, Cold). Patient Info table (PID, Name, DOB, Sex): (1, Jim Green, 1980-04-15, M), (2, Alice Lee, 1965-11-11, F), (3, Rose Darcy, 2001-07-21, F). Join result: each visit row extended with the matching patient's Name, DOB, and Sex.
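
As a hedged illustration, the same join can be written in a few lines of PySpark over the toy tables from this slide:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("join-example").getOrCreate()

    visits = spark.createDataFrame(
        [(1, "2016-03-15", "Fever"), (2, "2016-10-20", "Headache"),
         (1, "2017-02-08", "Fever"), (3, "2017-06-18", "Cold")],
        ["PID", "VisitDate", "Reason"])

    patients = spark.createDataFrame(
        [(1, "Jim Green", "1980-04-15", "M"), (2, "Alice Lee", "1965-11-11", "F"),
         (3, "Rose Darcy", "2001-07-21", "F")],
        ["PID", "Name", "DOB", "Sex"])

    # In a distributed setting, rows with the same PID must be brought to the
    # same machine, which is where the network cost comes from.
    visits.join(patients, on="PID").show()
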
  14. SQL-on-Hadoop Strategies (1/2) • Storing data in formats that are easy for query processing: columnar data formats (Parquet, ORCFile) • Pushing analytics close to the data: intelligent data readers (apply predicates and projections while reading the data) • Carefully choosing the algorithm and what data to transfer for each analytics operation, e.g. how to choose among different join algorithms based on data characteristics. (Figure: broadcast the smaller table, network cost 2|G|, vs. repartition both tables, network cost 2/3|B| + 2/3|G|, for a blue table B and a green table G.)
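
In Spark SQL the optimizer normally makes this choice automatically, but a broadcast hint makes it explicit; a hedged sketch, continuing the `visits`/`patients` DataFrames from the previous example:

    from pyspark.sql.functions import broadcast

    # Broadcast join: ship the small table to every node; `visits` stays put.
    broadcast_join = visits.join(broadcast(patients), on="PID")

    # Shuffle (repartition) join: without the hint, two large tables are hash-
    # partitioned on PID and matching partitions meet on the same machine.
    shuffle_join = visits.join(patients, on="PID")
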
  15. SQL-on-Hadoop Strategies (2/2) • Pre-process data into a better organization for queries: hash or range-based data partitioning and bucketing • Auxiliary data structures for eliminating unnecessary data access: indexing and synopses • Better data placement for related data, e.g. collocating related data together on HDFS (Hadoop distributed file system). (Figure: co-partitioning brings the network cost down to |G|; co-partitioning plus co-location brings it to 0.)
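
One way to get co-partitioned tables in practice is Spark's bucketing; a hedged sketch (the bucket count and table names are made up, and `saveAsTable` assumes a configured warehouse):

    # Hash-bucket both tables on the join key once, at write time.
    visits.write.mode("overwrite").bucketBy(16, "PID").sortBy("PID") \
          .saveAsTable("visits_bucketed")
    patients.write.mode("overwrite").bucketBy(16, "PID").sortBy("PID") \
            .saveAsTable("patients_bucketed")

    # A later join can match bucket to bucket instead of reshuffling both sides.
    joined = spark.table("visits_bucketed").join(
        spark.table("patients_bucketed"), on="PID")
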
  16. Big data analytics landscape (same matrix as slide 7).
  17. Machine learning on big data • SQL analytics tools are not enough to capture the full value of big data • Big data's impact on ML (machine learning): Opportunities: more training data leads to better predictions; we can train a model with billions of parameters, because we have sufficiently big data; making deep learning possible! Challenges: scalability and distributed computing; a big learning curve for data scientists
  18. Big ML systems landscape (figure: one camp of general Machine Learning systems and one camp of Deep Learning systems).
  19. Different levels of abstraction for big ML systems • ML libraries (e.g. Spark MLlib, H2O, IBM Watson): provide a list of parameterized ML algorithms • Declarative ML (e.g. SystemML, Mahout): expose an R- or MATLAB-like language to users; primitives are linear algebra and math operations; a cost-based optimizer compiles execution plans; also provide a library of ML algorithms • AutoML (e.g. H2O): automate the process of training a large selection of candidate models. (Figure: SystemML architecture, language -> compiler -> runtime, running on an in-memory single node (scale-up) or a Hadoop/Spark cluster (scale-out).)
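
At the ML-library level of abstraction, an algorithm is a black box steered only by its parameters; a hedged Spark MLlib sketch (the readmission dataset and its columns are hypothetical):

    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    # Reuses the `spark` session from the earlier sketches.
    data = spark.read.parquet("hdfs:///ehr/readmission_features.parquet")
    assembler = VectorAssembler(
        inputCols=["age", "num_prior_visits", "length_of_stay"],
        outputCol="features")

    # The algorithm's behavior is controlled only through its parameters.
    lr = LogisticRegression(labelCol="readmitted", maxIter=20, regParam=0.01)
    model = lr.fit(assembler.transform(data))
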
  20. Big data analytics landscape (same matrix as slide 7).
  21. Graph analytics on big data • Graphs provide a powerful primitive for modeling real-world objects and the relationships between them: patient-patient/doctor-patient interactions, biological pathways, protein interaction networks, ontologies, knowledge graphs, etc. • Two types: graph databases, which focus on real-time graph analytics, and graph processing systems, which focus on batch processing of graphs
  22. Graph databases • Real-time graph analytics: updates, simple node and edge retrieval • Pattern matching queries: given a graph pattern, find subgraphs in the database graphs that (exactly or approximately) match the query • Example: find out what biological processes are affected by a disease, by querying a disease pathway against a database of known pathways. (Figures: graph database systems; SAGA querying against a database of pathways.)
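
The deck does not name a specific query API, but as a hedged illustration, exact pattern matching can be sketched with the GraphFrames package for Spark (the tiny pathway graph below is made up; approximate matching as in SAGA needs specialized support):

    from graphframes import GraphFrame  # external Spark package

    # Reuses the `spark` session from the earlier sketches.
    vertices = spark.createDataFrame(
        [("p53", "protein"), ("MDM2", "protein"), ("apoptosis", "process")],
        ["id", "kind"])
    edges = spark.createDataFrame(
        [("p53", "MDM2", "binds"), ("p53", "apoptosis", "regulates")],
        ["src", "dst", "rel"])
    g = GraphFrame(vertices, edges)

    # Pattern: a node that both binds something and regulates something.
    matches = g.find("(a)-[e1]->(b); (a)-[e2]->(c)") \
               .filter("e1.rel = 'binds' AND e2.rel = 'regulates'")
    matches.show()
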
  23. Graph processing systems • Batch graph analytics: long-running (usually iterative) analysis on the entire graph, e.g. the PageRank algorithm to identify key influencers of a disease propagation network • Performance bottleneck: network overhead • Better graph partitioning and absorbing messages within a partition • Combining messages (when messages can be aggregated). (Figures: graph processing systems, including Microsoft Graph Engine.)
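
As a hedged sketch of this iterative, message-passing style, here are a few PageRank iterations over a toy edge list using plain PySpark RDDs; `reduceByKey` plays the role of the message combiner, pre-aggregating within a partition:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pagerank-sketch").getOrCreate()
    edges = spark.sparkContext.parallelize(
        [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")])

    links = edges.groupByKey().cache()   # adjacency lists, partitioned once
    ranks = links.mapValues(lambda _: 1.0)

    for _ in range(10):
        # Each node sends its rank, split evenly, to its neighbors.
        contribs = links.join(ranks).flatMap(
            lambda kv: [(dst, kv[1][1] / len(kv[1][0])) for dst in kv[1][0]])
        # Messages are combined per partition before crossing the network.
        ranks = contribs.reduceByKey(lambda a, b: a + b) \
                        .mapValues(lambda s: 0.15 + 0.85 * s)

    print(ranks.collect())
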
  24. Big data analytics landscape (same matrix as slide 7).
  25. Integrated analytics • An application often requires different types of analytics together, e.g. SQL is often used to prepare the data for ML • An example: the Medtronic & IBM Watson Health partnership "gathers a patient’s readings from Medtronic insulin pumps and glucose monitors, and combines them with information taken from the individual’s activity trackers and diet. The system uses pattern recognition gleaned through IBM’s Watson to provide feedback on how a patient can manage their diabetes"; "Medtronic's insulin pumps using Watson artificial intelligence (AI) could warn patients of abnormally low blood sugar levels up to three hours in advance". Reference: https://www.meddeviceonline.com/doc/ibm-watson-to-power-medtronic-s-diabetes-app-under-armour-s-fitness-app-0001
  26. Solutions for integrated analytics • Integrating existing analytics systems: data transformation (translate the data format between different systems) and data transfer (move the output of one system to another) • Building a single system for various types of analytics, e.g. Spark, Wildfire (IBM Project EventStore). (Figures: Spark spanning OLAP, ML, streaming, batch, and graph analytics; Wildfire combining OLTP, real-time and graph analytics over shared storage.)
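
A hedged sketch of the single-system route in Spark: SQL prepares the training data and the result flows straight into MLlib, with no cross-system transfer or format conversion (all tables and columns are made up):

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("sql-to-ml").getOrCreate()
    spark.read.parquet("hdfs:///ehr/patients.parquet") \
         .createOrReplaceTempView("patients")
    spark.read.parquet("hdfs:///ehr/visits.parquet") \
         .createOrReplaceTempView("visits")

    # Step 1: SQL does the data preparation (join + aggregation).
    features = spark.sql("""
        SELECT p.age, COUNT(v.VisitDate) AS num_visits,
               MAX(p.readmitted) AS readmitted
        FROM patients p JOIN visits v ON p.PID = v.PID
        GROUP BY p.PID, p.age
    """)

    # Step 2: the same DataFrame feeds directly into machine learning.
    vectorized = VectorAssembler(inputCols=["age", "num_visits"],
                                 outputCol="features").transform(features)
    model = LogisticRegression(labelCol="readmitted").fit(vectorized)
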
  27. Conclusion • Big data analytics comes in different forms: What types of data do you have? What level of complexity does the analytics require? What is the latency requirement? • An application often requires different types of analytics together: What types of analytics do you need to integrate? What is your performance requirement? Do you need to integrate existing analytics pipelines, or can you start with a single system that supports all analytics?

Editor's Notes

  • I will try to provide a roadmap in this talk to help you navigate through the big data analytics landscape.

    I'm not a healthcare domain expert; however, I have some exposure to healthcare domain data and analytics problems and have been collaborating with experts in the IBM Watson Health division to formulate this talk.
  • The first question, before we talk about big data analytics, is: what is big data? The most popular definition is the 3V definition from Gartner.

    And over the years, others have extended the definition of big data with more Vs.

    The next question people usually ask is: how do you know you have big data? How big is big data?
    Well, it is all relative, and with the technology advancements in storage and data processing, it is always a moving definition. Ten years ago, people thought 1 petabyte of data was huge; nowadays it is becoming very common, and people are starting to talk about exabytes and even zettabytes. And as we have seen with the 3V definition, it is not all about size. So, how big is big data? There is no agreed-upon answer. My answer is that you know you are dealing with big data when conventional data management and analytics tools are not enough.


    Volume - The quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can actually be considered big data or not.

    Variety - The type and nature of the data. This helps people who analyze it to effectively use the resulting insight.

    Velocity - In this context, the speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development.

    Variability - Inconsistency of the data set can hamper processes to handle and manage it.

    Veracity - The quality of captured data can vary greatly, affecting accurate analysis.
  • Large volumes of data are being accumulated in the healthcare domain, due to eHealth, mobile health, the wide use of sensor and wearable technologies, and the advancement of genome sequencing. In addition, a number of new healthcare applications have emerged because of big data, such as personalized medicine, clinical risk intervention, and predictive analytics.
  • Big data analytics can mean different things to different people. For some people it is machine learning; for others it may be SQL analytics. That is not surprising, because big data analytics comes in different forms. In this talk, I will categorize big data analytics.
  • I will categorize big data analytics along two dimensions.

    The other dimension is the complexity of analytics, from simple to more complex. The simplest type of analytics does updates and data retrieval, for example, retrieving a patient's EHR record when she checks in at a hospital. The next type is creating descriptive summaries, which groups data and computes statistics. The next level goes beyond computing simple statistics to discovering patterns using data mining techniques, for example, for fraud detection, identifying unusual patterns of medical claims by clinics, physicians, labs, and so on. The last level is predictive analytics using machine learning techniques, for example, predicting whether a patient will be readmitted to the hospital based on historical data.
  • Now here is the big data analytics landscape along the two dimensions. The horizontal dimension is data type, and the vertical dimension is analytic complexity. I am not familiar with image or video processing, so I am going to leave multi-media out.

    For structured data, data entry and retrieval is basically OLTP, and for semi-structured data, people use key-value/document stores, such as Cassandra or MongoDB, for data entry and retrieval. For graph data entry and retrieval, people use graph databases, like Neo4j and JanusGraph. For text data, people use search systems for keyword search.

    For both structured and semi-structured data, people use SQL-on-Hadoop systems for descriptive summaries; they basically do OLAP. Notice the star next to Hadoop? Here "Hadoop" is abused to represent big data; many SQL-on-Hadoop systems are not really using Hadoop underneath. For graphs, people basically compute statistics such as degree and clustering coefficient distributions, and for text, the word cloud is the most widely used method for descriptive summaries. For structured and semi-structured data, people use data mining on big data for pattern discovery (frequent pattern mining, anomaly detection, clustering); for graphs, people use graph processing systems for graph clustering, influence analysis, etc. Examples of pattern discovery on text are topic modeling and sentiment analysis. For predictive analytics, big ML systems are used for all the different types of data, but depending on the actual data type, you may need to do some data transformation to be able to use ML.
  • Over the years, I have worked on a number of types of big data analytics. I will cover these types in this talk.
  • Before I talk about SQL-on-Hadoop, I will briefly provide some background on traditional SQL processing. In traditional SQL processing, there are two types: OLTP and OLAP, and the differences between them are listed in this table. Because OLTP and OLAP systems have very different characteristics, the database field has evolved into having specialized OLTP systems and OLAP systems, with an ETL process used to consolidate and transform transactional data from OLTP systems into OLAP systems.

    Name any application in use at a hospital or in a physician's office, and the chances are good that it runs on an OLTP database.

    EHRs, lab systems, financial systems, patient satisfaction systems, patient identification, billing and payment processing, etc.
  • SQL (Structured Query Language) is the de facto language for transactional and decision support systems and BI tools to access and query a variety of data sources

    Transitioning to big data otherwise requires a steep learning curve, which SQL-on-Hadoop helps avoid.
  • SQL-on-Hadoop systems support data warehousing functionality on big data, i.e., they focus on OLAP queries. There are many SQL-on-Hadoop systems today, and they can be categorized into several camps. The first camp supports querying existing data in open formats, so there is no lock-in. This camp can be further divided into two sub-groups: the first just builds a SQL layer on existing data platforms, like Hive building on MapReduce and Spark SQL building on Spark, whereas the second builds an MPP query engine from scratch; the second sub-group typically has better performance. The second camp controls the storage layer and uses proprietary formats. The last camp extends existing EDWs to work with big data.

    Querying existing data in open formats vs. a controlled storage layer with proprietary formats?
    A SQL layer on top of existing big data systems (like MapReduce or Spark) or an MPP query engine architected from the ground up?
    Directly querying big data vs. going through an existing database?
  • The major technical challenge for SQL-on-Hadoop systems is how to distribute data and computation in a large cluster of machines.
    Quite often the major bottleneck is transferring large volumes of data across the network. Let's use the database join operator to illustrate this challenge. Join is a database operator that combines columns from multiple tables together. For example, one can join the clinical visits and patient info tables on the patient ID; the join will bring the records with the same PID together. In the big data setting, the two tables are partitioned and distributed across the cluster, so the join processing needs to transfer data across the network to actually perform the query.
  • Here are some strategies applied in many SQL-on-Hadoop systems to address these challenges.

    For example, in the past I have worked on comparing different join algorithms for big data and providing guidelines on how to choose among them for a particular query based on data characteristics. One join strategy for joining a big table with a small table is broadcasting the smaller table to all machines in the cluster, then performing local joins on each machine. In this figure, I have two tables, a blue table and a green table, both distributed across the machines in the cluster; in this particular case, the green table is the smaller one, so I ship all the partitions of the green table to every node. The red arrows represent the network communication. This algorithm in total sends 2 times the size of the green table across the network. There is another join strategy that is good for joining two large tables: it repartitions both tables and sends the corresponding partitions from both tables to one of the machines for processing, ending up sending 2/3|B| + 2/3|G| across the network in total. As you can see, depending on the sizes of the two tables, one algorithm may be preferred for a particular join operation.
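
    As a hedged back-of-the-envelope, the slide's two cost figures fall out of a toy cost model (n = 3 machines; sizes in arbitrary units; this is just the accounting above, not a real optimizer):

        def broadcast_cost(small, n):
            # Each of the n partitions of the small table goes to n-1 other nodes.
            return (n - 1) * small                 # n = 3 gives 2|G|

        def repartition_cost(big, small, n):
            # Hash repartitioning moves roughly (n-1)/n of every tuple.
            return (n - 1) / n * (big + small)     # n = 3 gives 2/3|B| + 2/3|G|

        def pick_join(big, small, n=3):
            if broadcast_cost(small, n) <= repartition_cost(big, small, n):
                return "broadcast"
            return "repartition"

        print(pick_join(big=900, small=30))   # a small green table favors broadcast
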
  • Data partitioning is partitioning data based on some values, instead of randomly. When two tables are partitioned the same way on the join key, you only need to bring the corresponding partitions together for join processing, which often reduces the processing and network overhead.

    Finally, better data placement can often bring a significant performance boost. For example, in one of my works, I extended HDFS to support collocation of related data in a best-effort approach, and using this technique can significantly reduce the network overhead. In this example, not only are the two tables co-partitioned, but the corresponding partitions are also collocated, so when joining the two tables together, no network cost is incurred.
  • We have talked a lot about SQL-on-Hadoop; let's now move on to machine learning on big data.
  • Here is where machine learning comes in to help. Machine learning is not a new field, but big data has had a huge impact on it. First of all, it revived the whole machine learning field, because more training data usually leads to better predictions, and now we have enough data to train models with billions of parameters; big data essentially enabled deep learning. At the same time, big data brings a lot of challenges to machine learning as well, such as scalability and distributed computation. More importantly, it imposes a big learning curve on data scientists, because they not only need to worry about the particular ML algorithm, but also about how to distribute the data and computation on the big data platform.
  • To help reduce the learning curve, many big ML systems have emerged. They are usually categorized into two camps: one for general machine learning, the other specialized in deep learning. But the trend now is that the two camps are starting to converge, with general ML systems starting to support deep learning, and the deep learning camp also starting to support general ML algorithms. Personally, I haven't worked much on deep learning, so I will focus on the general ML camp.
  • The big ML systems help data scientists by masking the details of implementing ML algorithms for big data. There are different levels of abstraction that big ML systems provide. One group of big machine learning systems provides users with a library of machine learning algorithms. The behavior of each algorithm can be controlled through its parameters, but that's it: the algorithms are pretty much black boxes to the users, and there is no way to change their internals. This problem is addressed by declarative ML systems, like SystemML and Mahout. These systems usually expose an R- or MATLAB-like language to data scientists, with linear algebra and math operations as the primitives, and the system employs a cost-based optimizer to compile the algorithm into efficient execution plans on the target platform. Finally, H2O has recently proposed a new concept called AutoML. For a particular application, a data scientist usually tries a large number of candidate models and selects the best; AutoML basically automates this process.
  • Next, I will briefly talk about graph databases and graph processing together.
  • Popular graph databases include Neo4j, JanusGraph, IBM Graph, etc. They focus on real-time graph analytics. Besides updates and simple node and edge retrieval, most graph databases support graph pattern matching queries: basically, given a graph pattern, they find subgraphs in the database that match the query. Most graph databases only support exact matching, but sometimes approximate matching is necessary when graph data is noisy. As part of my PhD work, I built a system called SAGA for approximate graph matching, which can support querying a disease pathway against a database of known pathways to find out what biological processes are affected by a disease.
  • The second type of graph analytics system is the graph processing system. These focus on batch graph analytics: long-running, often iterative, analyses over the entire graph.
    Also, a new trend in these types of systems is to deal …
  • The first solution is to integrate existing analytics systems together. The two challenges here are data transformation and data transfer.
  • The take-away message of my talk is that big data analytics comes in different forms.
