Liferay and Big Data

My presentation at Liferay NAS 2014 talking about Liferay and Big Data

Published in: Data & Analytics

  1. 1. Liferay & Big Data: Getting value from your data. Miguel Ángel Pastor Olivar, Senior Software Engineer
  2. 2. About me: Who am I?
     • Miguel Ángel Pastor Olivar
     • Member of the Liferay core infrastructure team
     • Worked in analytics for a long time
       – Disclaimer: not a computer scientist
     • Email: miguel.pastor@liferay.com
     • @miguelinlas3
     #LRNAS2014
  3. 3. Synopsis: What are we going to talk about?
     • Big Data: what is this about?
     • What lies ahead for big data
     • Connecting Liferay with this “new” world
     • Simple architecture proposal
     • Use cases
     • Questions (and hopefully answers)
  4. 4. Big Data?
  5. 5. Definitions: Big Data
     • It is just a buzzword
     • Data is so big that regular solutions are:
       – Extremely slow
       – Too small
       – Really expensive
     • How we use all the data we already own
     Notes: It is no more than a buzzword, but we generally associate it with the problem that datasets have become so big that traditional relational databases can no longer work with them. Note that the NoSQL movement has emerged in recent years and aims to handle this new semistructured data, new ways of scaling, … in a better way.
  6. 6. Definitions: More formally …
     • Volume – Transactions, data streaming from social media, …
     • Velocity – Torrents of data in real time
     • Variety – Numerical data, text, email, video, audio, …
     Notes:
     1. Many factors have contributed to the increase in data volumes: transaction-based data stored over the years, social media, …
     2. Data streaming is a reality: IoT, smart cities, RFID sensors, … We have to deal with these streams as fast as we can.
     3. There are tons of different formats that we need to deal with and interconnect to extract useful information.
  7. 7. Trending: What is trending?
     • Data volumes will keep increasing … rapidly
     • Less emphasis on formal schemas
     • Data driven applications
     Notes:
     Data volumes: Facebook has over 800PB of data stored in Hadoop clusters.
     Formal schemas: data schemas and sources change rapidly, and we need to integrate so many disparate sources of data that we must evolve and adapt quickly to the changes.
     Data driven applications: self-driving cars, smart cities, … Generic algorithms and data structures represent the world using data instead of encoding a model of the world within the software itself (some engineering is required, though).
  8. 8. What do you want?
  9. 9. Business goals: You already own tons of different data
     • Get value from it!
     • Analyse it so you can …
       – Make faster decisions
       – Make better decisions
       – Improve your users' experience
     • Make more money!
  10. 10. Business goals: Popular applications
     • Recommender systems
       – Amazon store: you may also like …
     • Predicting the future
       – Netflix does autoscaling based on past network traffic data
     • Churn models
       – Big telco companies build social networks to reduce churn
       – Some big banks have tried to do the same
  11. 11. Business goals: Popular applications
     • Sentiment analysis
       – Are people talking about you on the Internet?
       – Is it good or bad?
     • Real Time Bidding
       – Optimise advertising
     • Health care
       – Improve patients' health while reducing costs
       – Improve the quality of life of multiple sclerosis patients
  12. 12. Terminology
  13. 13. Terminology: Concepts
     • Storage models
       – Where and how we store our relevant information
     • Computation models
       – How we process and transform all the previous information
     • Analytics
       – How we can take actions based on the previous steps
  14. 14. Big Data architectures: A quick tour of some of today's popular architectures, mainly Hadoop/HDFS and the libraries built on top of the Hadoop API
  15. 15. Storing data
  16. 16. Data storage: HDFS. Hadoop Distributed File System (HDFS)
     • Java based file system
     • Scalable, fault-tolerant, distributed storage
     • Designed to run on commodity hardware
     • Closely related to MapReduce
     Notes: This is the most popular alternative. It allows you to store your data in a distributed filesystem and execute MapReduce algorithms on top of it. We will see other alternatives to Hadoop which can do much more than MapReduce algorithms.
  17. 17. Data storage: HDFS (diagram). Source: http://hortonworks.com/hadoop/hdfs/
     Notes: An HDFS cluster is comprised of a NameNode, which manages the cluster metadata, and DataNodes, which store the data. Files and directories are represented on the NameNode by inodes. Inodes record attributes like permissions, modification and access times, or namespace and disk space quotas.
  18. 18. Data storage: NoSQL. NoSQL Movement
     • Semistructured data
     • Focused on
       – Horizontal scalability
       – Availability
     • Different trade-offs: CAP, BASE, …
     • Many alternatives: Cassandra, Riak, HBase, …
     Notes: This “new” movement tries to deal with the huge increase in data (and its variety) by focusing on topics different from those addressed by traditional relational databases: horizontal scalability, availability, unstructured data models, … There are plenty of alternatives: memory based, disk based, key-value, key-document, graph databases, … and the usage of these new databases is increasing in Big Data systems. Some other databases have brought horizontal scalability and availability to the new …
  19. 19. Data storage: Apache Cassandra. An example: Apache Cassandra
     • P2P architecture, no single point of failure
     • Linear scalability
     • Larger than memory datasets
     • Fully durable
     • Tuneable consistency
     • Integrated caching
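The "tuneable consistency" bullet can be made concrete with the usual quorum arithmetic. The sketch below is not Cassandra code and uses no Cassandra API. It only assumes the standard rule that, with `n` replicas, a read that contacts `r` replicas is guaranteed to overlap a write acknowledged by `w` replicas when `r + w > n`. The function names are made up for illustration.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of replicas, e.g. 2 out of 3."""
    return replication_factor // 2 + 1


def read_overlaps_write(n: int, w: int, r: int) -> bool:
    """True when every read set of size r shares a replica with every write set of size w."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("w and r must be between 1 and n")
    return r + w > n


# Replication factor 3: quorum writes plus quorum reads overlap,
# while single-replica writes plus single-replica reads may miss each other.
strong = read_overlaps_write(3, quorum(3), quorum(3))
weak = read_overlaps_write(3, 1, 1)
```

Lowering `w` or `r` trades that overlap guarantee for lower latency and higher availability, which is the trade-off the slide refers to.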
  20. 20. Data storage: NewSQL. NewSQL Movement
     • Modern relational databases
     • Same scalable performance as NoSQL for OLTP
     • Maintain ACID guarantees
     • A few alternatives: VoltDB, Google Spanner, FoundationDB, …
     Notes: New designs for traditional databases (quite different from one option to another). Google Spanner uses GPS-based clocks, VoltDB optimises for each specific app by compiling the schema, and so on.
  21. 21. Computation and Analytics
  22. 22. Computation: Apache Hadoop. Apache Hadoop Map Reduce
     • Framework:
       – Distributed processing
       – Large datasets
       – Clusters of computers
     • Simple programming model
     • Coarse grained
     • Verbose and hard to use API
  23. 23. Computation: Map Reduce (word count diagram)
     Input: Liferay project is the best Open Source project
     Map input records: (index, “…”) (index, “…”) (index, “…”) (index, “…”) (index, “…”)
     Sort and shuffle: (best, [1]) (is, [1]) (Liferay, [1]) (Open, [1]) (project, [1,1]) (Source, [1]) (the, [1])
     Output: best: 1, is: 1, Liferay: 1, Open: 1, project: 2, Source: 1, the: 1
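The word count walkthrough above can be sketched in a few lines of plain Python. This is not the Hadoop API: the three functions only mimic the map, sort-and-shuffle and reduce phases on a single machine. The function names and the split of the sentence into two input lines are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(index: int, line: str) -> Iterator[Tuple[str, int]]:
    """Map: turn one (index, line) record into (word, 1) pairs."""
    for word in line.split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Sort and shuffle: group all values by key, keys in sorted order."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items(), key=lambda kv: kv[0].lower()))


def reduce_phase(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the list of ones for a word into its total."""
    return word, sum(counts)


def word_count(lines: List[str]) -> Dict[str, int]:
    pairs = [pair for i, line in enumerate(lines) for pair in map_phase(i, line)]
    return dict(reduce_phase(word, ones) for word, ones in shuffle(pairs).items())


counts = word_count(["Liferay project is the best", "Open Source project"])
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data across the network. Here everything runs in one process to keep the data flow visible.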
  64. 64. Computation: Apache Hadoop. Apache Hadoop Map Reduce
     • Batch model data crunching
     • Not so good at event stream processing
     • But …
       – Many algorithms are hard to implement using MapReduce
       – Again, the API is hard to use
     • Cascading, Scalding, Cascalog, Impala, …
  65. 65. Computation: Apache Storm
     • Distributed realtime computation system
     • Easy to reliably process unbounded streams of data
     • Multi language support
     • Realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, …
  66. 66. Computation: Apache Storm (diagram: Spout, Spout, Bolt, Bolt, Bolt)
     Notes: Spouts are data sources and bolts are the event processors. There are facilities to support reliable message handling, various sources encapsulated in spouts and various output targets. Distributed processing is baked in from the start.
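The spout and bolt roles can be imitated with plain Python generators. This sketch does not use the Storm API, and nothing in it is distributed or fault tolerant. It only shows the shape of a topology, with one source feeding a chain of processors. All function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def sentence_spout(sentences: Iterable[str]) -> Iterator[str]:
    """Spout: the data source. In Storm this stream would be unbounded."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream: Iterable[str]) -> Iterator[str]:
    """Bolt: processes each incoming event and emits zero or more new events."""
    for sentence in stream:
        for word in sentence.split():
            yield word


def count_bolt(stream: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Bolt: keeps running state and emits an updated count per event."""
    counts: Counter = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


# Wire the topology: spout -> split bolt -> count bolt.
events = list(count_bolt(split_bolt(sentence_spout(["a b", "b c"]))))
```

Each event flows through the whole chain as soon as it is produced, which is the main difference from the batch model of MapReduce.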
  67. 67. Computation: Apache Spark
     • Fast and general-purpose cluster computing system
       – Developed by the Berkeley AMPLab
     • High level APIs (not MapReduce)
     • Optimised engine: supports general execution graphs
     • Higher-level tools:
       – Spark SQL, MLlib, Spark Streaming, GraphX (we will go deeper later on)
  68. 68. Computation: Apache Mahout
     • Scalable machine learning library
     • Built on top of Hadoop
     • Some algorithms don’t require Hadoop at all
  69. 69. Computation: R language
     • Focused on:
       – Data visualisation
       – Statistical computations
       – Analysis of data
     • Tons of built-in packages
     • Connects to Hadoop through Hadoop Streaming
     • Not a fast language (compared to proprietary alternatives like SAS)
  70. 70. Reference architecture
  71. 71. Reference Architecture: How do we proceed?
     • Plenty of alternatives
     • No silver bullet
     • Problems to solve:
       – Data integration
       – Real time
       – Batch processing
  72. 72. Reference Architecture (diagram): Relational Database, Event System, Hadoop, User Tracking, NoSQL Storage, System Events, Search Data, Logs, Monitoring, Data Warehouse, Streaming, Social Graph
  96. 96. Data sources
  99. 99. Reference Architecture: Liferay
     • Tons of data available within the platform
     • System events
     • User tracking (client side)
       – Clicks, navigation, activities, …
     • Monitoring (transactions, page load times, …)
     • Models (message boards, blogs, wiki, …)
     • Custom developments …
  100. 100. Event system
  103. 103. Reference Architecture: Unified Log Service. Data integration. Source: http://en.wikipedia.org/wiki/Maslow's_hierarchy_of_needs
     Notes: Effective use of data follows a kind of Maslow's hierarchy of needs.
     1. The base of the pyramid involves capturing all the relevant data.
     2. This data needs to be modelled in a uniform way to make it easy to read and process.
     3. Work on infrastructure to process this data in various ways: MapReduce, real-time query systems, etc.
  104. 104. Reference Architecture: Unified Log Service. Log structured data flow
     • Natural data structure for data flow
     Diagram: a log with entries 0 to 9; the data source writes to it, and System A and System B read from it.
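The log structured data flow can be sketched as an append-only list plus one read position per consuming system. This is a single-process toy, not Kafka. The class and method names are assumptions chosen for illustration.

```python
from typing import Any, List


class Log:
    """Append-only sequence of records, each identified by its offset."""

    def __init__(self) -> None:
        self._entries: List[Any] = []

    def append(self, record: Any) -> int:
        """Write a record at the end of the log and return its offset."""
        self._entries.append(record)
        return len(self._entries) - 1

    def read(self, offset: int, max_records: int = 10) -> List[Any]:
        """Return up to max_records records starting at offset."""
        return self._entries[offset:offset + max_records]


class Reader:
    """A consuming system that remembers how far it has read."""

    def __init__(self, log: Log) -> None:
        self.log = log
        self.offset = 0

    def poll(self, max_records: int = 10) -> List[Any]:
        records = self.log.read(self.offset, max_records)
        self.offset += len(records)
        return records


log = Log()
for n in range(10):
    log.append(f"event-{n}")

system_a = Reader(log)
system_b = Reader(log)
a_batch = system_a.poll(7)  # System A has consumed offsets 0..6
b_batch = system_b.poll(3)  # System B is further behind, at offsets 0..2
```

Because readers only track an offset, a slow consumer never blocks the writer or the other consumers, and every system sees the records in the same order.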
  105. 105. Distributed log: Apache Kafka
     • Publish-subscribe as distributed commit log
     • Fast
     • Scalable
     • Durable
     • Distributed by design
     Notes:
     Fast: hundreds of megabytes of reads and writes per second from thousands of clients.
     Scalable: elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of co-ordinated consumers.
     Durable: messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.
     Distributed by design: cluster-centric design that offers strong durability and fault-tolerance guarantees.
  106. 106. Distributed log: Apache Kafka. 1,000-foot architecture (diagram: Producer, Broker A, Broker B, Broker C, Consumer, ZooKeeper)
  107. 107. Computation and Analytics
  110. 110. Analytics: What are we looking for?
     • A few different data sources
     • Unified log service in place
     • Tons of info ready to be processed:
       – Batch processing
       – Real time processing
       – Machine learning algorithms
       – Graph analysis
     • Unified programming model?
  111. 111. Analytics: Apache Spark
     • Fast and general engine for large-scale data processing
     • Write your apps in Java, Scala or Python
     • Integrated with Hadoop
     • Runs on the YARN cluster manager
     • Can read any existing Hadoop data (HDFS)
     • In memory or on disk
  112. 112. Analytics: Apache Spark main components (diagram): Apache Spark, Spark SQL, Spark Streaming, MLlib, GraphX
  118. 118. Spark Core
  120. 120. Analytics: Apache Spark
     • A driver program runs the main function and executes various parallel operations on a cluster
     • Main abstraction: Resilient Distributed Datasets (RDD), created from
       – HDFS (or any Hadoop file system)
       – A Scala collection
     • Second abstraction: shared variables
     Notes: An RDD
     * is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel
     * is created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or a Scala collection in the driver program, and transforming it
     * automatically recovers from node failures
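To make the RDD idea more tangible, here is a toy model in plain Python. It is not Spark and does not use the Spark API. `ToyRDD` and its methods are hypothetical names. The sketch assumes that an RDD can be pictured as a set of partitions plus the chain of transformations (the lineage) needed to rebuild any of them.

```python
from typing import Any, Callable, Iterable, List, Tuple


class ToyRDD:
    """Partitioned collection that records transformations instead of running them."""

    def __init__(self, partitions: List[List[Any]],
                 lineage: Tuple[Callable[[Any], Any], ...] = ()) -> None:
        self._partitions = partitions
        self._lineage = lineage

    @classmethod
    def parallelize(cls, data: Iterable[Any], num_partitions: int) -> "ToyRDD":
        """Spread a driver-side collection over num_partitions partitions."""
        items = list(data)
        return cls([items[i::num_partitions] for i in range(num_partitions)])

    def map(self, fn: Callable[[Any], Any]) -> "ToyRDD":
        """Lazy transformation: only extends the lineage."""
        return ToyRDD(self._partitions, self._lineage + (fn,))

    def compute_partition(self, index: int) -> List[Any]:
        """Rebuild one partition from its source data and the lineage."""
        values = self._partitions[index]
        for fn in self._lineage:
            values = [fn(value) for value in values]
        return values

    def collect(self) -> List[Any]:
        """Action: compute every partition and gather the results."""
        return [value for index in range(len(self._partitions))
                for value in self.compute_partition(index)]


squares = ToyRDD.parallelize(range(6), num_partitions=3).map(lambda x: x * x)
recomputed = squares.compute_partition(1)  # what a node would redo after a failure
```

Since each partition can be recomputed independently from the lineage, losing a node only means redoing the partitions it held, which is one way to read the "automatically recovers from node failures" note.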
  121. 121. Spark SQL
  123. 123. Analytics: Spark SQL
     • Mix SQL queries with Spark programs
     • Unified Data Access
     • Hive compatibility
     • Standard JDBC or ODBC connectivity
     • Same engine for both interactive and long running queries
  124. 124. Spark Streaming
  126. 126. Analytics: Spark Streaming
     • Build your apps using high-level operators
     • Fault tolerance: exactly-once semantics out of the box
     • Combine streaming with batch and interactive queries
     • Can read from HDFS, Flume, Kafka, Twitter and ZeroMQ
     • Define your own custom data sources
  127. 127. MLlib
  129. 129. Analytics: MLlib
     • Scalable machine learning library
     • Basic statistics
       – Summary statistics
       – Correlations
       – …
     • Classification and regression
       – Linear models
       – Decision trees
       – Naive Bayes
  130. 130. Analytics: MLlib
     • Clustering
       – K-Means
     • Collaborative filtering
       – Alternating least squares
     • Dimensionality reduction
       – Singular value decomposition
       – Principal component analysis
  131. 131. GraphX
  133. 133. Analytics: GraphX
     • API for graphs and graph-parallel computation
     • Growing scale and importance
       – From social networks to language modelling
     • Directed multigraph with properties attached to each vertex and edge
     • Growing collection of graph algorithms and builders
  134. 134. Use cases and examples
  136. 136. Connecting Liferay and Kafka
  137. 137. Examples: Kafka and Liferay. Connecting Liferay and Kafka
     • Easy to use
     • “Transparent” for the developer
     • Runtime pluggable
     • Common API: use it through our Message Bus
     • You can take a look at Kafka Bridge
  138. 138. Examples: Kafka and Liferay (diagram): Liferay App, Message Bus API, Liferay Core, Kafka Bridge, Apache Kafka; a message carries a Kafka Topic and a Message Payload
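The idea of reaching Kafka through a common message bus can be sketched as follows. Everything here is hypothetical: `MessageBus`, `KafkaBridge` and the destination name are illustrative stand-ins, not Liferay's Message Bus API and not the real Kafka Bridge code. The Kafka producer is replaced by a plain function so the example runs without a broker, and mapping a bus destination one-to-one to a Kafka topic is an assumption.

```python
import json
from collections import defaultdict
from typing import Any, Callable, Dict, List, Tuple

Listener = Callable[[str, Dict[str, Any]], None]


class MessageBus:
    """Hypothetical in-process bus: apps send to named destinations."""

    def __init__(self) -> None:
        self._listeners: Dict[str, List[Listener]] = defaultdict(list)

    def register(self, destination: str, listener: Listener) -> None:
        self._listeners[destination].append(listener)

    def send(self, destination: str, payload: Dict[str, Any]) -> None:
        for listener in self._listeners[destination]:
            listener(destination, payload)


class KafkaBridge:
    """Hypothetical bridge: forwards bus messages to a producer-like callable."""

    def __init__(self, producer_send: Callable[[str, bytes], None]) -> None:
        self._producer_send = producer_send

    def __call__(self, destination: str, payload: Dict[str, Any]) -> None:
        # Assumption: the bus destination name doubles as the Kafka topic name.
        self._producer_send(destination, json.dumps(payload).encode("utf-8"))


sent: List[Tuple[str, bytes]] = []  # stands in for a real Kafka producer
bus = MessageBus()
bus.register("blogs.ratings", KafkaBridge(lambda topic, value: sent.append((topic, value))))

# The application only talks to the bus and never sees Kafka.
bus.send("blogs.ratings", {"userId": 1, "blogEntryId": 42, "rating": 5})
```

Registering or removing the bridge changes where messages end up without touching application code, which is how the "transparent" and "runtime pluggable" bullets could work in practice.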
  150. 150. Recommendation engine
  151. 151. Examples: Recommender’s goals. You might want to read …
     • Blog posts
     • Ratings for previous blog posts
     • Recommend to the user some entries for future reading
  152. 152. Examples: Recommender storage (diagram): Blog Rating save/update and Blog Entry save/update, Apache Kafka, HDFS
     Record formats:
     UserID::BlogEntryID::Rating::Timestamp
     BlogEntryID::Title::CategoryNames
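The two record formats above are simple enough to parse with the standard library. The sketch below is one possible reading of them. It assumes numeric IDs, an integer timestamp and comma-separated category names, none of which the slides specify. The class and function names are made up for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Rating:
    user_id: int
    blog_entry_id: int
    rating: float
    timestamp: int


@dataclass(frozen=True)
class BlogEntry:
    blog_entry_id: int
    title: str
    category_names: Tuple[str, ...]


def parse_rating(line: str) -> Rating:
    """Parse a UserID::BlogEntryID::Rating::Timestamp record."""
    user_id, entry_id, rating, timestamp = line.strip().split("::")
    return Rating(int(user_id), int(entry_id), float(rating), int(timestamp))


def parse_blog_entry(line: str) -> BlogEntry:
    """Parse a BlogEntryID::Title::CategoryNames record."""
    entry_id, title, categories = line.strip().split("::", 2)
    # Assumption: categories are comma separated; the slides do not say.
    names = tuple(name for name in categories.split(",") if name)
    return BlogEntry(int(entry_id), title, names)


rating = parse_rating("7::42::4.5::1412121600\n")
entry = parse_blog_entry("42::Liferay and Big Data::analytics,kafka")
```

A title that itself contains `::` would break this parser, so a real pipeline would need an escaping rule or a different serialization format.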
  160. 160. Examples: Recommender’s analysis. Collaborative filtering
     • Commonly used in recommender systems
     • Tries to fill in the missing entries of the association matrix
     • MLlib includes the Alternating Least Squares algorithm (ALS)
  161. 161. Examples: Recommender’s analysis #LRNAS2014
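To show what "filling in missing entries" means, here is a deliberately small alternating least squares sketch in plain Python. It is not MLlib's implementation. It assumes a rank-1 model in which each rating is approximated by one user factor times one item factor, with a fixed regularisation term. The function names and the sample ratings are invented.

```python
from typing import Dict, Tuple

Ratings = Dict[Tuple[str, str], float]


def squared_error(ratings: Ratings, user_f: Dict[str, float], item_f: Dict[str, float]) -> float:
    """Sum of squared differences over the observed ratings only."""
    return sum((r - user_f[u] * item_f[i]) ** 2 for (u, i), r in ratings.items())


def als_rank1(ratings: Ratings, iterations: int = 20, reg: float = 0.1):
    """Alternate between solving for user factors and item factors."""
    users = sorted({u for u, _ in ratings})
    items = sorted({i for _, i in ratings})
    user_f = {u: 1.0 for u in users}
    item_f = {i: 1.0 for i in items}
    for _ in range(iterations):
        # Fix the item factors and solve the 1-D least squares problem per user.
        for u in users:
            seen = [(i, r) for (uu, i), r in ratings.items() if uu == u]
            user_f[u] = sum(r * item_f[i] for i, r in seen) / (reg + sum(item_f[i] ** 2 for i, _ in seen))
        # Fix the user factors and solve per item.
        for i in items:
            seen = [(u, r) for (u, ii), r in ratings.items() if ii == i]
            item_f[i] = sum(r * user_f[u] for u, r in seen) / (reg + sum(user_f[u] ** 2 for u, _ in seen))
    return user_f, item_f


ratings: Ratings = {
    ("ana", "post1"): 5.0, ("ana", "post2"): 1.0,
    ("bob", "post1"): 4.0, ("bob", "post3"): 5.0,
    ("eve", "post2"): 2.0, ("eve", "post3"): 4.0,
}
user_f, item_f = als_rank1(ratings)
# Predicted rating for an entry that ana has not rated yet.
guess = user_f["ana"] * item_f["post3"]
```

A production recommender would use more than one latent factor and would train on the ratings stored in HDFS. The alternating structure of the loop is the part that carries over.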
  162. 162. Takeaways
  163. 163. Takeaways: What I would like you to have learned today
     • It is not about data size, it’s about how you use it
     • You already own tons of data, you just need to get value from it
     • There is no silver bullet: you have plenty of alternatives
     • JVM-based big data technologies are usually a great choice
     • Try it yourself!!
  164. 164. References
  165. 165. References
     • Apache Kafka
     • Apache Spark
     • Apache Storm
     • Apache Hadoop
     • Big Data definition at Wikipedia
     • What every software engineer should know about a log
  166. 166. Thank you!
  167. 167. Questions (and hopefully answers)
