
Big Data for Managers: From Hadoop to Streaming and Beyond


Published in: Technology


  1. 1. Big Data for Managers: From Hadoop to Streaming and Beyond
      Dr. Vladimir Bacvanski, vladimir.bacvanski@scispike.com, @OnSoftware
  2. 2. Dr. Vladimir Bacvanski
      www.scispike.com, Copyright © SciSpike 2016
      • Founder of SciSpike, a development, consulting, and training firm
      • Passionate about software and data
      • PhD in computer science, RWTH Aachen, Germany
      • Architect, consultant, mentor
      • Custom development: scalable Web and IoT systems
      • Training and mentoring in Big Data, Scala, node.js, software architecture
      @OnSoftware, https://www.linkedin.com/in/vladimirbacvanski
  3. 3. Problems with Relational Stores
      • Data that does not naturally fit into tables → impedance mismatch
      • Development time is often too long
      • Dealing with unstructured data
      • Performance problems
      • Difficult to run on clusters
      • Cost
  4. 4. Structured and Unstructured Data Sources
      Structured data sources:
      • Existing databases
      • ERP/CRM/BI systems
      • Inventory
      • Supply chain
      Unstructured data sources:
      • Server logs
      • Search engine logs
      • Browsing logs
      • E-commerce records
      • Social media
      • Voice
      • Video
      • Sensor data
  5. 5. NoSQL Impact
      [Figure: cost / performance against data volume (1M, 1B, 1T, 1Q, … HUGE, each step x1000), with disks and processors, for Relational Database and for Big Data + NoSQL]
      • Tomorrow: volume is out of reach
      • Today: doable, but expensive and slow
      • Stabilize cost and increase performance
      • Enable unlimited volume growth
  6. 6. Scale Up vs. Scale Out
      [Figure: two plots with axes Capability and Cost, one for Scale Up and one for Scale Out]
  7. 7. A Common Pattern for Processing Large Data
      • Load a large set of records onto a set of machines
      • Extract something interesting from each record ("Map")
      • Shuffle and sort intermediate results (key/value pairs)
      • Aggregate intermediate results ("Reduce")
      • Store end result
  8. 8. Two Key Aspects of Hadoop
      • MapReduce framework
        – How Hadoop understands and assigns work to the nodes (machines)
      • Hadoop Distributed File System (HDFS)
        – Where Hadoop stores data
        – A file system that spans all the nodes in a Hadoop cluster
        – It links together the file systems on many local nodes to make them into one big file system
  9. 9. MapReduce Example: Word Count
      • WordCount is the "Hello World" of Big Data
        – You will see various technologies implementing it
        – A good first step to compare the expressiveness of Big Data tools
      [Figure: word count data flow]
      • Input (pets.txt): dog cat bird dog cat bird dog dog cat
      • Map: dog, 1 | cat, 1 | bird, 1 | dog, 1 | cat, 1 | bird, 1 | dog, 1 | dog, 1 | cat, 1
      • Shuffle: dog, 1 (x4) | cat, 1 (x3) | bird, 1 (x2)
      • Reduce: dog, 4 | cat, 3 | bird, 2
      • Output (pet_freq.txt): dog, 4 | cat, 3 | bird, 2
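
The word count flow on this slide can be sketched in a few lines of plain Python. This is a single-process illustration of the three phases, not Hadoop code; the function names `map_phase`, `shuffle_phase`, and `reduce_phase` are made up for the example.

```python
from collections import defaultdict

def map_phase(text):
    """Emit a (word, 1) pair for every word in the input."""
    return [(word, 1) for word in text.split()]

def shuffle_phase(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}

pets = "dog cat bird dog cat bird dog dog cat"  # contents of pets.txt on the slide
pet_freq = reduce_phase(shuffle_phase(map_phase(pets)))
print(pet_freq)  # {'dog': 4, 'cat': 3, 'bird': 2}
```

On a real cluster the map and reduce functions run on many machines at once; only the two small functions at the top and bottom would be written by the programmer.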
  10. 10. The MapReduce Programming Model
      • "Map" step:
        – Input split into pieces
        – Worker nodes process individual pieces in parallel (under global control of the Job Tracker node)
        – Each worker node stores its result in its local file system, where a reducer is able to access it
      • "Reduce" step:
        – Data is aggregated ("reduced" from the map steps) by worker nodes (under control of the Job Tracker)
        – Multiple reduce tasks can parallelize the aggregation
  11. 11. Separation of Work
      Programmers:
      • Map
      • Reduce
      Framework:
      • Deals with fault tolerance
      • Assigns workers to map and reduce tasks
      • Moves processes to data
      • Shuffles and sorts intermediate data
      • Deals with errors
  12. 12. How To Create MapReduce Jobs
      • Java API
        – Low level, very flexible
        – Time-consuming development
      • Streaming API
        – A simple, productive model for Python and Ruby
      • Hive
        – Open source language / Apache sub-project
        – Provides a SQL-like interface to Hadoop
      • Pig
        – Data flow language / Apache sub-project
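
To give a feel for the Streaming API option, here is a rough Python sketch of a word count mapper and reducer in the streaming style: the mapper emits tab-separated key/value lines and the reducer receives them sorted by key. In an actual Hadoop Streaming job the two would be separate scripts reading standard input and writing standard output; the local driver at the bottom, where `sorted()` stands in for the cluster's shuffle and sort, is an assumption made so the example runs on one machine.

```python
from itertools import groupby

def mapper(lines):
    """Mapper: emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Reducer: input arrives sorted by key, so equal keys are adjacent."""
    parsed = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local stand-in for the cluster: sorted() plays the role of shuffle and sort.
sample = ["dog cat bird", "dog cat bird", "dog dog cat"]
result = list(reducer(sorted(mapper(sample))))
print(result)  # ['bird\t2', 'cat\t3', 'dog\t4']
```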
  13. 13. The Big Picture: NoSQL + Hadoop in Applications
      [Figure: applications backed by several stores]
      • Columnar: price updates, logs
      • Document: product info
      • Graph: customer/agent relationships
      • RDB: XA data
      • Hadoop: oper. analytics, price analytics
      • Key/Value: session data
  14. 14. Streaming: A New Paradigm
      • Conventional processing: static data
        – [Diagram: queries are applied to stored data to produce results]
      • Real-time processing: streaming data
        – [Diagram: data flows through queries to produce results]
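
The difference between the two paradigms can be illustrated with a small Python sketch. The sensor readings and both function names are hypothetical; the point is only that the batch query runs once over data at rest, while the streaming query keeps updating its answer as each value arrives.

```python
def batch_average(stored_readings):
    """Conventional processing: the query runs once over data at rest."""
    return sum(stored_readings) / len(stored_readings)

def streaming_average(reading_stream):
    """Streaming: a standing query updates its result as each reading arrives."""
    count, total = 0, 0.0
    for reading in reading_stream:
        count += 1
        total += reading
        yield total / count

readings = [20.0, 22.0, 21.0, 25.0]  # hypothetical sensor values
print(batch_average(readings))                  # 22.0
print(list(streaming_average(iter(readings))))  # [20.0, 21.0, 21.0, 22.0]
```

The streaming version never needs the full data set in memory, which is why the same idea scales to unbounded event feeds.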
  15. 15. Common Streaming Applications
      • Personalization
      • Search
      • Revenue optimization
      • User events
      • Content feeds
      • Log processing
      • Monitoring
      • Recommendations
      • Ads
      • Notable users:
        – Twitter
        – Yahoo
        – Spotify
        – Cisco
        – Flickr
        – Weather Channel
  16. 16. Beyond Hadoop: Spark & Flink
      [Figure: MapReduce, Tez, Spark, Flink]
  17. 17. Apache Spark
      • Important features:
        – In-memory data
        – Resilient Distributed Datasets (RDDs)
          • Datasets can rebuild themselves if failure occurs
        – Rich set of operators
      • Efficient:
        – 10x (on disk) to 100x (in memory) faster than Hadoop MR
        – 2 to 5 times less code (rich APIs in Scala/Java/Python)
  18. 18. Spark Architecture
      • A powerful set of tools
      • Beyond traditional Hadoop
      Source: http://spark.apache.org
  19. 19. Data Sharing in Apache Spark
      [Figure: HDFS, Iteration 1, Iteration 2, Query 1, Query 2; Result 1 and Result 2 are each held in cluster memory]
  20. 20. Apache Flink
      • Execution:
        – Programs compiled into an execution plan
        – Plan is optimized
        – Executed
      • Design goals:
        – High performance
        – Hybrid batch and streaming runtime
        – Simplicity for the developer
        – Rich libraries
        – Integration with many systems
  21. 21. Apache Flink Components
      • Integration with Hadoop YARN, MapReduce, HBase, Cassandra, Kafka, …
      • Execution engine for Apache Beam (Google Dataflow)
  22. 22. Flink Optimization and Execution
      • Optimizer selects an execution plan
      • Similar to what we have in relational databases
      • Optimal plan depends on the size of the input files
      • Runs standalone or on top of Hadoop
      • Integration with many Hadoop technologies
  23. 23. Flink & Spark: The Advantages and Outlook
      • Less IO overhead than conventional Hadoop
      • Caching
      • Iterative algorithms
      • Unifying batch and stream computing
      • Scala as a natural, expressive language for Big Data
        – Other languages: Python, Java, R
      • Beware of less mature components
  24. 24. Typical NoSQL Systems
      • Non-relational
      • Distributed
      • Horizontally scalable
      • No need for a fixed schema
      • Several established players
      • Systems are specialized
  25. 25. NoSQL Stores and Their Categories
      • Choose the store that best matches your application
      • It is fine to use several different stores
        – "Polyglot persistence"
      [Figure: four categories of store: Key-Value, Column-Family, Document-Oriented, Graph DB]
  26. 26. NoSQL Stores: Scale vs. Complexity of Data
      [Figure: Key-Value, Column-Family, Document-Oriented, and Graph DB stores placed on axes of complexity and scalability, with a region marked "needs of most applications"]
  27. 27. Key-Value Stores
      • Key → Value mapping
      • Large, persistent Map ("hashtable")
        – Values could be lists and hashes
      • Easy to use
      • Scale very well
      • Data model may be too simple for most applications
      • Systems:
        – Redis, Riak, Memcached, Amazon DynamoDB, Aerospike, FoundationDB
      • Use when the data model is very simple and scalability is essential
  28. 28. Typical Use Cases
      • The data model is very simple!
        – Actual data can be JSON
      • Session data
      • User preferences and profiles
      • Shopping cart
      • If another NoSQL store is good enough, you may want to skip this and let a column or document store handle it
  29. 29. Column-Family
      • "Column-family": similar to a table
        – Table is sparse
      • Key → (Column:Value)*
      • Columns have names
      • Can be indexed
      • Can store complex data
        – Denormalize!
      • Systems:
        – Google BigTable, HBase, Cassandra, Amazon SimpleDB, Hypertable
      • Use when scalability is essential
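
The Key → (Column:Value)* model can be pictured as a map of maps. The toy class below is an in-memory illustration only; the `ColumnFamily` name, its methods, and the sample rows are all invented for the example and do not correspond to the API of any of the systems listed.

```python
from collections import defaultdict

class ColumnFamily:
    """Toy in-memory column family: row key -> {column name: value}."""

    def __init__(self):
        self.rows = defaultdict(dict)

    def put(self, row_key, column, value):
        self.rows[row_key][column] = value

    def get(self, row_key, column, default=None):
        # Missing rows or columns simply return the default: the table is sparse.
        return self.rows.get(row_key, {}).get(column, default)

    def columns(self, row_key):
        return sorted(self.rows.get(row_key, {}))

prices = ColumnFamily()
prices.put("sku-1", "price", 9.99)
prices.put("sku-1", "currency", "USD")
prices.put("sku-2", "price", 4.50)  # sku-2 has no "currency" column

print(prices.columns("sku-1"))           # ['currency', 'price']
print(prices.get("sku-2", "currency"))   # None
```

Each row carries only the columns it actually uses, which is what makes the table sparse and lets rows differ in shape.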
  30. 30. Typical Use Cases
      • High insert volume: logging
      • Real-time updates
      • Content management
      • Expiring content
      • Cross-datacenter replication
      • MapReduce analytics over stored data
      • You don't need conventional (ACID) transactions
  31. 31. Document Stores
      • JSON, BSON, XML
      • No schema
      • Indexes improve performance
      • Easy transition from RDBMS
      • Systems:
        – MongoDB, CouchDB, CouchBase
      • Use when data is in semi-structured form
      • Often seen in new Web applications
  32. 32. Typical Use Cases
      • Logging
        – Especially with variable content
      • Product information
      • Customer information
      • Content management
      • Data whose format varies over time
        – Flexible schema
      • Web analytics
  33. 33. Graph Databases
      • Nodes with properties
      • Nodes connected through relationships
      • Can model very complex graph data
        – Social networks
      • Systems:
        – Neo4J, Infinite Graph, TitanDB, OrientDB
      • Use when data is a (complex) graph
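
As a rough picture of the graph model, the sketch below keeps nodes with properties in one dictionary and relationships in another, then answers a typical social-network question: who are a person's friends of friends? The people, the `relate` helper, and the `friends_of_friends` function are hypothetical; a graph database would express the same traversal in its own query language.

```python
from collections import defaultdict

# Nodes with properties (hypothetical people).
nodes = {
    "ann": {"city": "Oslo"},
    "bob": {"city": "Bonn"},
    "cid": {"city": "Rome"},
    "dee": {"city": "Oslo"},
}

# Relationships, stored as an undirected adjacency map.
friends = defaultdict(set)

def relate(a, b):
    """Connect two nodes with a symmetric 'friend' relationship."""
    friends[a].add(b)
    friends[b].add(a)

def friends_of_friends(person):
    """People two hops away who are not already direct friends."""
    direct = friends.get(person, set())
    two_hops = set()
    for friend in direct:
        two_hops |= friends.get(friend, set())
    return two_hops - direct - {person}

relate("ann", "bob")
relate("bob", "cid")
relate("cid", "dee")

print(friends_of_friends("ann"))  # {'cid'}
print(friends_of_friends("bob"))  # {'dee'}
```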
  34. 34. Typical Use Cases
      • Highly interconnected data
      • Social graphs
      • Party relationships in an enterprise
      • Location-based services
      • Purchasing analytics and recommendations
      • Often combined with other systems to store the bulk of data
        – Graph database can focus on relationships
  35. 35. Integrating Relational, Streams, and Hadoop
      [Figure labels, in extraction order: Streams; Data + Big Data; Traditional Warehouse; In-Motion Analytics; Data analytics; Results; Database & Warehouse; At-rest data analytics; Results; Ultra Low Latency Results; Traditional / Relational Data Sources; Non-Traditional / Non-Relational Data Sources; Varied data formats; Semi-structured, unstructured...; Event System; NoSQL]
  36. 36. Lambda Architecture
      [Figure labels: Incoming Data; Event (Speed) Layer; Real Time Data; Real Time Update; Rolling Values; Batch Layer; Master Dataset; Batch Update; Serving Layer; Batch View; Queries; Merge Results]
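
One way to read the Lambda Architecture diagram is as three cooperating pieces: a master dataset that keeps every event, a batch view recomputed from it, and a speed layer holding rolling values for events that arrived since the last batch run. The Python sketch below is a single-process simplification under that reading; the `LambdaCounter` class and its method names are invented for illustration, and it ignores events that arrive while a batch job is running.

```python
from collections import Counter

class LambdaCounter:
    """Toy event counter with a batch view and a real-time (speed) view."""

    def __init__(self):
        self.master_dataset = []        # append-only record of all incoming data
        self.batch_view = Counter()     # recomputed from the master dataset
        self.realtime_view = Counter()  # rolling values since the last batch run

    def ingest(self, event):
        """Incoming data goes to both the batch layer and the speed layer."""
        self.master_dataset.append(event)
        self.realtime_view[event] += 1

    def run_batch(self):
        """Batch update: rebuild the batch view, then reset the rolling values."""
        self.batch_view = Counter(self.master_dataset)
        self.realtime_view.clear()

    def query(self, event):
        """Merge results from the batch view and the real-time view."""
        return self.batch_view[event] + self.realtime_view[event]

clicks = LambdaCounter()
for page in ["page_a", "page_a", "page_b"]:
    clicks.ingest(page)
clicks.run_batch()
clicks.ingest("page_a")          # arrives after the batch run

print(clicks.query("page_a"))    # 3 (2 from the batch view + 1 rolling)
```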
  37. 37. Master Data Management and Governance
      • Big Data and NoSQL stores can easily become a bigger mess than relational stores
      • Introduce a practical plan
        – Avoid lengthy and cumbersome governance
        – Actual use should be the driving force
        – Start slow
      • Be ready for change
        – The technologies change rapidly
      • Focus on business outcomes
  38. 38. Succeeding with Big Data and NoSQL
      1. Actively look for solutions where the right store can ease the pain
      2. Make sure you deliver tangible value to clients
      3. After you get your first apps to work, create a Big Data introduction and governance plan
      4. Prioritize: do the most useful thing for the business first
      5. Integrate with existing IT
      6. Make sure you hire or grow your Big Data champions
      7. The field is immature: look out for new tools and techniques
  39. 39. Conclusions
      • Hadoop and NoSQL address the weak points of relational systems:
        – Scale
        – Performance
        – Unstructured and semistructured data
      • Streaming addresses the processing of data in real time
      • Integrate with conventional technologies!
      • Spark and Flink: the next-generation Big Data systems
  40. 40. Questions?
