Big Data Analytics: From SQL to Machine Learning and Graph Analysis

Dr. Yuanyuan Tian's keynote speech at the BigDas Workshop, SIGKDD 2017, August 2017.


  1. Big Data Analytics: From SQL to Machine Learning and Graph Analysis
      Yuanyuan Tian, IBM Research - Almaden
      Keynote for KDD BigDas 2017
      © 2017 IBM Corporation
  2. A bit about me
      • I am a computer scientist who builds data management and analytics systems
      • My talk is from the perspective of a big data analytics system builder
      • I have some exposure to healthcare data and analytics problems through collaboration with experts in the IBM Watson Health division
  3. What is big data?
      • Gartner’s 3Vs definition:
          • “Big Data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.”
      • Extra Vs:
          • Variability, Veracity, Visualization, Value
      • How big is big data?
          • It is all relative
          • It is always a moving definition
          • It is not all about the size
      • My answer: data is "big" when conventional data management and analytics tools are inadequate for it
      • Figure: Big Data 3Vs, from https://www.linguamatics.com/blog/big-data-real-world-data-where-does-text-analytics-fit
  4. Why is big data important for health care?
      • Large volumes of data
          • eHealth
          • mHealth
          • Sensor & wearable technologies
          • Genome sequencing
      • New applications
          • Personalized medicine
          • Clinical risk intervention
          • Predictive analytics
  5. Big data analytics
      • Big data analytics comes in different forms!
  6. Two dimensions of big data analytics
      • Data type
          • Structured data: records in relational database tables
          • Semi-structured data: JSON and XML
          • Unstructured data: text data
          • Graph data: social and interaction data
          • Multi-media data: images and videos
      • Complexity of analytics
          • Data entry and retrieval: look up a patient’s EHR at check-in
          • Descriptive summaries: compute the number of outbreaks across different geographic regions
          • Pattern discovery (data mining): identify unusual patterns of medical claims by clinics, physicians, labs, etc.
          • Predictive analytics (machine learning): predict a patient’s readmission to the hospital
  7. Big data analytics landscape
      • Table axes: data type (structured, semi-structured, graph, text, multi-media) by analytics complexity; cells are listed left to right for each row
      • Data entry and retrieval: OLTP (online transactional processing); key-value/document stores; graph databases; keyword search; …
      • Descriptive summaries: SQL-on-Hadoop*: OLAP (online analytical processing); degree distribution, clustering coefficient distribution; word cloud; …
      • Pattern discovery (data mining): DM on big data: frequent pattern mining, anomaly detection, clustering; graph processing: graph clustering, influence analysis; topic modeling, sentiment analysis; …
      • Predictive analytics (machine learning): ML on big data: regression, classification, recommendation, link prediction
  10. Background on traditional SQL processing
      • OLTP (online transactional processing) vs OLAP (online analytical processing)
          • OLTP
              • Purpose: data entry and retrieval
              • Queries: simple read, insert, update and delete
              • Speed: real-time (low latency and high throughput)
          • OLAP
              • Purpose: BI (business intelligence) or reporting
              • Queries: more complex analytical and ad hoc queries (mostly optimized for reads)
              • Speed: interactive
      • Specialized OLTP and OLAP systems connected by the ETL (extract, transform, load) process
      • Figure labels: Transactions, OLTP System, ETL / Replication, OLAP System, EDW (enterprise data warehouse), Analytic Queries
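To make the OLTP/OLAP contrast concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `visits` table and its column names are hypothetical, the sample rows are borrowed from the Clinical Visits example later in the deck, and a single in-memory database stands in for what the slide describes as two specialized systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (pid INTEGER, visit_date TEXT, reason TEXT)")

# OLTP-style work: small writes and point lookups on individual records.
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [
        (1, "2016-03-15", "Fever"),
        (2, "2016-10-20", "Headache"),
        (1, "2017-02-08", "Fever"),
        (3, "2017-06-18", "Cold"),
    ],
)
point_lookup = conn.execute(
    "SELECT visit_date, reason FROM visits WHERE pid = ? ORDER BY visit_date",
    (1,),
).fetchall()

# OLAP-style work: a read-only aggregate that scans the whole table.
visits_per_reason = conn.execute(
    "SELECT reason, COUNT(*) FROM visits GROUP BY reason ORDER BY reason"
).fetchall()

print(point_lookup)
print(visits_per_reason)
```

The lookup touches one patient's rows, while the aggregate reads every row, which is roughly why the two workloads are often served by differently optimized systems.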
  11. Why SQL-on-Hadoop?
      • SQL (Structured Query Language) is the de facto language for transactional and decision support systems and BI tools
          • Healthcare analysts and hospital IT experts are very familiar with SQL
      • SQL-on-Hadoop eases the transition to big data
          • Little or no change to existing BI tools and applications
      • SQL-on-Hadoop overcomes some shortcomings of conventional EDWs
          • Scalability & fault tolerance
          • Better support for semi-structured data
          • Works directly on raw data (query in situ), avoiding ETL
  12. SQL-on-Hadoop Landscape
      • Figure labels: Open Data, Proprietary Data, SQL Layer, Remote Query, MPP Query Engine
      • Systems shown: Impala, Big SQL, PolyBase, Vortex, SQL-H, Spark SQL, dashDB
  13. Technical Challenge
      • How to distribute data and computation in a large cluster of machines for performance
      • Bottleneck: transferring large volumes of data across the network
      • Example: join (combining columns from multiple tables)

      Clinical Visits
      PID  VisitDate   Reason
      1    2016-03-15  Fever
      2    2016-10-20  Headache
      1    2017-02-08  Fever
      3    2017-06-18  Cold

      Patient Info
      PID  Name        DOB         Sex
      1    Jim Green   1980-04-15  M
      2    Alice Lee   1965-11-11  F
      3    Rose Darcy  2001-07-21  F

      Join result
      PID  VisitDate   Reason    Name        DOB         Sex
      1    2016-03-15  Fever     Jim Green   1980-04-15  M
      2    2016-10-20  Headache  Alice Lee   1965-11-11  F
      1    2017-02-08  Fever     Jim Green   1980-04-15  M
      3    2017-06-18  Cold      Rose Darcy  2001-07-21  F
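The join on this slide can be sketched on a single machine as a hash join. The slide does not name a join algorithm, so the build-and-probe approach and the function name `hash_join` are assumptions for illustration; the rows are copied from the two tables, with the third Patient Info column read as a birth date.

```python
# Sample rows copied from the slide's two tables.
clinical_visits = [
    (1, "2016-03-15", "Fever"),
    (2, "2016-10-20", "Headache"),
    (1, "2017-02-08", "Fever"),
    (3, "2017-06-18", "Cold"),
]
patient_info = [
    (1, "Jim Green", "1980-04-15", "M"),
    (2, "Alice Lee", "1965-11-11", "F"),
    (3, "Rose Darcy", "2001-07-21", "F"),
]


def hash_join(visits, patients):
    """Join on PID: build a hash table on one input, then probe it with the other."""
    patients_by_pid = {row[0]: row[1:] for row in patients}  # build phase
    joined = []
    for pid, visit_date, reason in visits:  # probe phase
        match = patients_by_pid.get(pid)
        if match is not None:
            joined.append((pid, visit_date, reason) + match)
    return joined


joined_rows = hash_join(clinical_visits, patient_info)
for row in joined_rows:
    print(row)
```

In a cluster, the probe only works if rows with the same PID sit on the same machine, and moving them there is the network transfer the slide calls the bottleneck.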
  14. SQL-on-Hadoop Strategies (1/2)
      • Storing data in formats that are easy for query processing
          • Columnar data formats (Parquet, ORCFile)
      • Pushing analytics close to the data
          • Intelligent data readers (apply predicates and projections while reading the data)
      • Carefully choosing the algorithm and what data to transfer for each analytics operation
          • E.g. how to choose among different join algorithms based on data characteristics
      • Figure: joining a blue table (B) with a green table (G)
          • Broadcast the smaller table: network cost 2|G|
          • Repartition both tables: network cost 2/3|B| + 2/3|G|
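The two network costs in the figure are consistent with a three-node cluster: a broadcast sends the smaller table to the 2 other nodes, and a repartition moves about 2/3 of each table. Generalizing to n nodes gives the rough cost model sketched below. The three-node reading, the uniform-hashing assumption, and the function names are mine, not the slide's.

```python
def broadcast_cost(size_small, num_nodes):
    # Every other node receives a full copy of the smaller table.
    return (num_nodes - 1) * size_small


def repartition_cost(size_b, size_g, num_nodes):
    # Assuming uniform hashing, a fraction (n - 1) / n of each table's rows
    # hash to a different node and must cross the network.
    return (num_nodes - 1) * (size_b + size_g) / num_nodes


def choose_join(size_b, size_g, num_nodes):
    """Pick the strategy with the lower estimated network cost."""
    b_cost = broadcast_cost(min(size_b, size_g), num_nodes)
    r_cost = repartition_cost(size_b, size_g, num_nodes)
    if b_cost <= r_cost:
        return ("broadcast", b_cost)
    return ("repartition", r_cost)


# Hypothetical table sizes (in any consistent unit, e.g. MB).
print(choose_join(900, 30, 3))   # one table is tiny
print(choose_join(900, 600, 3))  # tables of comparable size
```

Under this model, broadcasting wins when one table is much smaller than the other, and repartitioning wins once the smaller table grows or the cluster gets large.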
  15. SQL-on-Hadoop Strategies (2/2)
      • Pre-process data into better organization for queries
          • Hash or range-based data partitioning and bucketing
      • Auxiliary data structures for eliminating unnecessary data access
          • Indexing and synopses
      • Better data placement for related data
          • E.g. collocating related data on HDFS (Hadoop Distributed File System)
      • Figure:
          • Co-partition: network cost |G|
          • Co-partition and co-location: network cost 0
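A minimal sketch of co-partitioning, assuming a hypothetical three-partition layout and a simple modulo hash on the join key. When both tables are partitioned with the same function, each partition can be joined without consulting any other partition.

```python
NUM_PARTITIONS = 3  # hypothetical number of partitions


def partition_of(pid):
    # Deterministic hash partitioning on the join key (PIDs are small ints here).
    return pid % NUM_PARTITIONS


def hash_partition(rows):
    partitions = {p: [] for p in range(NUM_PARTITIONS)}
    for row in rows:
        partitions[partition_of(row[0])].append(row)
    return partitions


visits = [(1, "Fever"), (2, "Headache"), (1, "Fever"), (3, "Cold")]
patients = [(1, "Jim Green"), (2, "Alice Lee"), (3, "Rose Darcy")]

visit_parts = hash_partition(visits)
patient_parts = hash_partition(patients)

# Both tables share the partitioning function, so matching PIDs always
# land in the same partition number and each partition joins on its own.
local_joins = {}
for p in range(NUM_PARTITIONS):
    names = dict(patient_parts[p])
    local_joins[p] = [(pid, reason, names[pid]) for pid, reason in visit_parts[p]]

print(local_joins)
```

If partition p of both tables is also stored on the same machine (co-location), the per-partition joins need no network transfer, which matches the cost of 0 in the figure. With co-partitioning alone, one table's partitions still have to be shipped to where the other's live, hence the cost of |G|.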
  17. Machine learning on big data
      • SQL analytics tools are not enough to capture the full value of big data
      • Big data impact on ML (machine learning):
          • Opportunities:
              • More training data leads to better predictions
              • We can train a model with billions of parameters, because we have sufficiently big data
              • Making deep learning possible!
          • Challenges:
              • Scalability and distributed computing
              • A steep learning curve for data scientists
  18. Big ML systems landscape
      • Figure: systems grouped under Machine Learning and Deep Learning
  19. Different levels of abstraction for big ML systems
      • ML libraries
          • E.g. Spark MLlib, H2O, IBM Watson
          • Provide a list of parameterized ML algorithms
      • Declarative ML
          • E.g. SystemML, Mahout
          • Expose an R- or Matlab-like language to users
          • Primitives: linear algebra and math operations
          • Cost-based optimizer to compile execution plans
          • Also provide a library of ML algorithms
      • AutoML
          • E.g. H2O
          • Automate the process of training a large selection of candidate models
      • Figure: SystemML stack (Language, Compiler, Runtime) running on an In-Memory Single Node (scale-up) or a Hadoop or Spark Cluster (scale-out)
  21. Graph analytics on big data
      • Graphs provide a powerful primitive for modeling real-world objects and the relationships between them
          • Patient-patient/doctor-patient interactions, biological pathways, protein interaction networks, ontologies, knowledge graphs, etc.
      • Two types of systems:
          • Graph databases: focus on real-time graph analytics
          • Graph processing systems: focus on batch processing of graphs
  22. Graph databases
      • Real-time graph analytics
          • Updates, simple node and edge retrieval
          • Pattern matching queries
              • Given a graph pattern, find subgraphs in the database graphs that (exactly or approximately) match the query
              • Example: find out what biological processes are affected by a disease
                  • Querying a disease pathway against a database of known pathways
      • Figure labels: Graph Databases; SAGA (query against a database of pathways)
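A pattern matching query of the exact kind can be sketched as a small backtracking search. This is a brute-force illustration only, not how SAGA or any particular graph database works, and it does not cover approximate matching. The node names, labels, and the function `match_pattern` are hypothetical.

```python
def match_pattern(pattern_edges, pattern_labels, graph_edges, graph_labels):
    """Return every assignment of pattern nodes to distinct graph nodes that
    preserves node labels and directed edges (exact matching only)."""
    graph_edge_set = set(graph_edges)
    pattern_nodes = sorted(pattern_labels)
    matches = []

    def extend(assignment):
        if len(assignment) == len(pattern_nodes):
            matches.append(dict(assignment))
            return
        p_node = pattern_nodes[len(assignment)]
        for g_node in sorted(graph_labels):
            if g_node in assignment.values():
                continue  # each graph node is used at most once
            if graph_labels[g_node] != pattern_labels[p_node]:
                continue  # labels must agree
            assignment[p_node] = g_node
            # Check every pattern edge whose endpoints are both assigned.
            consistent = all(
                (assignment[u], assignment[v]) in graph_edge_set
                for u, v in pattern_edges
                if u in assignment and v in assignment
            )
            if consistent:
                extend(assignment)
            del assignment[p_node]

    extend({})
    return matches


# Hypothetical pathway graph: gene -> enzyme -> metabolite.
graph_labels = {"g1": "gene", "e1": "enzyme", "m1": "metabolite",
                "g2": "gene", "e2": "enzyme"}
graph_edges = [("g1", "e1"), ("e1", "m1"), ("g2", "e2")]

pattern_labels = {"x": "gene", "y": "enzyme", "z": "metabolite"}
pattern_edges = [("x", "y"), ("y", "z")]

found = match_pattern(pattern_edges, pattern_labels, graph_edges, graph_labels)
print(found)
```

Only the chain through `g1`, `e1`, and `m1` matches, because `e2` has no edge to a metabolite. Real systems avoid this exhaustive search with indexes and pruning.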
  23. Graph processing systems
      • Batch graph analytics
          • Long-running (usually iterative) analysis on the entire graph
          • E.g. PageRank algorithm to identify key influencers in a disease propagation network
      • Performance bottleneck: network overhead
          • Better graph partitioning and absorbing messages within a partition
          • Combining messages (when messages can be aggregated)
      • Figure labels: Graph Processing; Microsoft Graph Engine
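The idea of combining messages can be sketched with a single-process PageRank in which rank contributions bound for the same destination are summed as they are produced. In a distributed system that sum would be computed before sending, so one number per destination crosses the network instead of one per edge. The tiny edge list, the damping factor of 0.85, and the iteration count are illustrative assumptions.

```python
from collections import defaultdict


def pagerank(edges, num_iterations=50, damping=0.85):
    nodes = sorted({n for edge in edges for n in edge})
    out_neighbors = defaultdict(list)
    for src, dst in edges:
        out_neighbors[src].append(dst)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(num_iterations):
        # Combiner: messages to the same destination are aggregated (summed)
        # rather than kept as one message per edge.
        combined = defaultdict(float)
        for src in nodes:
            # A node with no out-edges spreads its rank over all nodes.
            targets = out_neighbors[src] or nodes
            share = rank[src] / len(targets)
            for dst in targets:
                combined[dst] += share
        rank = {
            n: (1 - damping) / len(nodes) + damping * combined[n] for n in nodes
        }
    return rank


# Hypothetical propagation network: an edge (u, v) means u passes to v.
edges = [("a", "c"), ("b", "c"), ("c", "d"), ("d", "c")]
ranks = pagerank(edges)
top_node = max(ranks, key=ranks.get)
print(top_node, ranks)
```

Summation works as a combiner here because PageRank only needs the total incoming contribution per node, which is the "when messages can be aggregated" condition on the slide.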
  25. Integrated analytics
      • An application often requires different types of analytics together
          • E.g. SQL is often used to prepare the data for ML
      • An example: Medtronic & IBM Watson Health Partnership
          • “gathers a patient’s readings from Medtronic insulin pumps and glucose monitors, and combines them with information taken from the individual’s activity trackers and diet. The system uses pattern recognition gleaned through IBM’s Watson to provide feedback on how a patient can manage their diabetes”
          • “Medtronic's insulin pumps using Watson artificial intelligence (AI) could warn patients of abnormally low blood sugar levels up to three hours in advance”
      • Reference: https://www.meddeviceonline.com/doc/ibm-watson-to-power-medtronic-s-diabetes-app-under-armour-s-fitness-app-0001
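A minimal sketch of SQL preparing data for ML, using Python's built-in `sqlite3` module. Everything here is made up for illustration: the tables, the rows, the `readmitted` label, and the choice of a one-feature decision stump as the model. It is unrelated to the Medtronic system quoted above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (pid INTEGER, visit_date TEXT)")
conn.execute("CREATE TABLE outcomes (pid INTEGER, readmitted INTEGER)")
conn.executemany("INSERT INTO visits VALUES (?, ?)", [
    (1, "2017-01-05"), (1, "2017-02-11"), (1, "2017-03-02"),
    (2, "2017-01-20"),
    (3, "2017-02-14"), (3, "2017-04-01"), (3, "2017-04-19"), (3, "2017-05-30"),
    (4, "2017-03-08"), (4, "2017-06-21"),
])
conn.executemany("INSERT INTO outcomes VALUES (?, ?)",
                 [(1, 1), (2, 0), (3, 1), (4, 0)])

# Step 1 (SQL): turn raw visit rows into one feature row per patient.
training_rows = conn.execute(
    """
    SELECT v.pid, COUNT(*) AS num_visits, o.readmitted
    FROM visits v JOIN outcomes o ON o.pid = v.pid
    GROUP BY v.pid, o.readmitted
    ORDER BY v.pid
    """
).fetchall()


# Step 2 (ML): fit a decision stump, i.e. pick the visit-count threshold
# that makes the fewest errors on the training rows.
def fit_stump(rows):
    best_threshold, best_errors = None, len(rows) + 1
    for threshold in sorted({num_visits for _, num_visits, _ in rows}):
        errors = sum(
            (num_visits >= threshold) != bool(label)
            for _, num_visits, label in rows
        )
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


threshold = fit_stump(training_rows)
print(training_rows, threshold)
```

The join and aggregation are the SQL half of the pipeline and the threshold search is the ML half, so the hand-off between them is the list of feature rows.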
  26. Solutions for integrated analytics
      • Integrating existing analytics systems
          • Data transformation: transform the data format between different systems
          • Data transfer: transfer the output of one system to another system
      • Building a single system for various types of analytics
          • E.g. Spark, Wildfire (IBM Project EventStore)
      • Figure labels: Spark, Wildfire, OLAP, OLTP, ML, Stream, Batch, GA, Real Time GA, Shared Storage
  27. Conclusion
      • Big data analytics comes in different forms
          • What types of data do you have?
          • What level of complexity does the analytics require?
          • What is the latency requirement?
      • An application often requires different types of analytics together
          • What types of analytics do you need to integrate?
          • What is your performance requirement?
          • Do you need to integrate existing analytics pipelines, or can you start with a single system that supports all analytics?