
Analysis of Major Trends in Big Data Analytics


  1. Hadoop Summit, San Jose, California, June 28th 2016. Analysis of Major Trends in Big Data Analytics. Slim Baltagi, Director of Enterprise Architecture, Capital One Financial Corporation
  2. Welcome! About me: • I am currently Director of Enterprise Architecture at Capital One, a top-10 US financial corporation based in McLean, VA. • I have over 20 years of IT experience. • I have over 7 years of Big Data experience: engineer, architect, evangelist, blogger, thought leader, speaker, organizer of Apache Flink meetups in many countries, and creator and maintainer of the Big Data Knowledge Base http://SparkBigData.com, with over 7,000 categorized web resources about Hadoop, Spark, Flink, … Thanks: This talk won the community vote of the ‘Future of Apache Hadoop’ track. Thanks to all of you who voted for this talk, are attending it now, or are reading these slides. Disclaimer: This is a vendor-independent talk that expresses my own opinions. I am not endorsing or promoting any product or vendor mentioned in this talk.
  3. Agenda: 1. Portability between Big Data Execution Engines 2. Emergence of stream analytics 3. In-Memory analytics 4. Rapid Application Development of Big Data applications 5. Open sourcing Machine Learning systems by tech giants 6. Hybrid Cloud Computing
  4. What is a typical Big Data Analytics Stack: Hadoop, Spark, Flink, …?
  5. 1. Portability between Big Data Execution Engines. If you have an existing Big Data application based on MapReduce and you want to benefit from a different execution engine such as Tez, Spark or Flink, you might need to: • Reuse some of your existing code, such as mapper and reducer functions. • Leverage a ‘compatibility layer’ to run your existing Big Data application on the new engine. Example: the Hadoop Compatibility Layer from Flink. • Switch to a different engine if the tool you use supports it. Examples: Hive/Pig on Tez, Hive/Pig on Spark, Sqoop on Spark, Cascading on Flink. • Rewrite your Big Data application!
  6. 1. Portability between Big Data Execution Engines. Apache Beam (unified batch and stream processing) is a new Apache incubator project based on years of experience developing Big Data infrastructure (MapReduce, FlumeJava, MillWheel) within Google: http://beam.incubator.apache.org/ Apache Beam provides a unified API for batch and stream processing, along with multiple runners. Beam programs thus become portable across multiple runtime environments, both proprietary (e.g., Google Cloud Dataflow) and open source (e.g., Flink, Spark). Apache Beam web resources: http://sparkbigdata.com/component/tags/tag/67
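To make the portability idea concrete, here is a small pure-Python sketch of the pattern Beam enables. This is not the real Apache Beam API; the `Pipeline` and `DirectRunner` classes below are hypothetical stand-ins used only to illustrate the separation between pipeline definition and execution.

```python
# Conceptual sketch (not the real Apache Beam API): a pipeline describes its
# transforms once, and interchangeable "runners" decide how to execute it.
class Pipeline:
    def __init__(self):
        self.transforms = []          # ordered list of (name, fn) pairs

    def apply(self, name, fn):
        self.transforms.append((name, fn))
        return self

class DirectRunner:
    """Executes the pipeline locally, analogous to a local/direct runner."""
    def run(self, pipeline, data):
        for _, fn in pipeline.transforms:
            data = fn(data)
        return data

# The same pipeline definition could be handed to a hypothetical
# FlinkRunner or SparkRunner without touching the application code.
p = (Pipeline()
     .apply("ExtractWords", lambda lines: [w for l in lines for w in l.split()])
     .apply("Lowercase",    lambda words: [w.lower() for w in words]))

result = DirectRunner().run(p, ["Big Data", "big STREAMS"])
print(result)  # ['big', 'data', 'big', 'streams']
```

The point of the design is that the application only ever talks to the pipeline abstraction; swapping the runner changes where and how the work executes, not the business logic.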
  7. Agenda: 1. Portability between Big Data Execution Engines 2. Emergence of stream analytics 3. In-Memory analytics 4. Rapid Application Development of Big Data applications 5. Open sourcing Machine Learning systems by tech giants 6. Hybrid Cloud Computing
  8. 2. Emergence of stream analytics. Stonebraker et al. predicted in 2005 that stream processing was going to become increasingly important and attributed this to the ‘sensorization’ of the real world: everything of material significance on the planet gets ‘sensor-tagged’ and reports its state or location in real time. http://cs.brown.edu/~ugur/8rulesSigRec.pdf I think stream processing is becoming important not only because of this sensorization of the real world but also because of the following factors: 1. Data streams 2. Technology 3. Business 4. Consumers
  9. 2. Emergence of stream analytics [diagram]: four drivers of the emergence of stream analytics: (1) Data Streams, (2) Technology, (3) Business, (4) Consumers
  10. 2. Emergence of stream analytics. 1 Data Streams: Real-world data is available as a series of events that are continuously produced by a variety of applications and disparate systems inside and outside the enterprise. Examples: • Sensor networks data • Web logs • Database transactions • System logs • Tweets and social media data • Click streams • Mobile apps data
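The kind of computation these event streams call for can be illustrated with a stdlib-only sketch: counting click-stream events in fixed (tumbling) time windows. The event data and window size below are made up for illustration.

```python
from collections import Counter

# Hypothetical click-stream events: (timestamp_in_seconds, url) pairs.
events = [(1, "/home"), (2, "/home"), (4, "/cart"),
          (6, "/home"), (7, "/cart"), (11, "/home")]

def tumbling_window_counts(events, size):
    """Count events per URL within fixed, non-overlapping time windows."""
    windows = {}
    for ts, url in events:
        window_start = (ts // size) * size   # which window the event falls into
        windows.setdefault(window_start, Counter())[url] += 1
    return windows

counts = tumbling_window_counts(events, size=5)
print(counts)
# window [0,5): /home x2, /cart x1; window [5,10): /home x1, /cart x1; window [10,15): /home x1
```

Real stream processors (Flink, Kafka Streams, Spark Streaming) add the hard parts this sketch ignores: out-of-order events, distributed state, and fault tolerance.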
  11. 2. Emergence of stream analytics. 2 Technology: Simplified data architecture with Apache Kafka as a major innovation and the backbone of stream architectures. Rapidly maturing open source stream analytics tools: Apache Flink, Apache Apex, Spark Streaming, Kafka Streams, Apache Samza, Apache Storm, Apache Gearpump, Heron, … Cloud services for stream processing: Google Cloud Dataflow, Microsoft Azure Stream Analytics, Amazon Kinesis Streams, IBM InfoSphere Streams, … Vendors innovating in this space: Confluent, Data Artisans, Databricks, MapR, Hortonworks, StreamSets, … More mobile devices than human beings!
  12. 2. Emergence of stream analytics. 3 Business challenges: Lag between data creation and actionable insights. Infrastructure that is idle most of the time. Web and mobile application growth, and new types and sources of data. The need for organizations to shift from a reactive to a more proactive approach in their interactions with customers, suppliers and employees.
  13. 2. Emergence of stream analytics. 3 Business opportunities: Embracing stream analytics helps organizations achieve faster time to insight, competitive advantages and operational efficiency in a wide range of verticals. With stream analytics, new startups are (or will be) challenging established companies. Example: Pay-As-You-Go or Usage-Based Auto Insurance. Speed is said to have become the new currency of business.
  14. 2. Emergence of stream analytics. 4 Consumers: Consumers expect everything to be online and immediately accessible through mobile applications. Mobile, always-on consumers increasingly demand instant responses from enterprise applications, just as they are used to in mobile applications from social networks such as Twitter, Facebook, LinkedIn, … A younger generation that grew up with video gaming and is accustomed to real-time interaction is now itself a growing class of consumers.
  15. 2. Emergence of stream analytics. Verticals: Financial services, Telecommunications, Online gaming systems, Security & Intelligence, Advertisement serving, Sensor Networks, Social Media, Healthcare, Oil & Gas, Retail & eCommerce, Transportation and logistics
  16. 2. Emergence of stream analytics. End-to-end stream analytics solution architecture [diagram]: Sourcing & Integration (apps, sensors, devices and other sources feeding an event collector & broker), Analytics & Processing (stream processor, data lake, advanced analytics & machine learning), Serving & Consuming (real-time notifications, real-time decisions, dashboards, business applications such as an enterprise command center, personal mobile applications, business system backend).
  17. Agenda: 1. Portability between Big Data Execution Engines 2. Emergence of stream analytics 3. In-Memory analytics 4. Rapid Application Development of Big Data applications 5. Open sourcing Machine Learning systems by tech giants 6. Hybrid Cloud Computing
  18. 3. In-Memory Analytics. While in-memory analytics is not new, it is the focus of renewed attention thanks to: • the availability of memory large enough to easily fit most active data sets • maturing or newly available in-memory open source tools in many categories, such as:  memory-centric distributed file systems  columnar data formats  key-value data stores  IMDGs (In-Memory Data Grids)  distributed caches  very large hashmaps. In the next couple of slides, I will share a few examples.
  19. 3. In-Memory Analytics. Alluxio http://alluxio.org (formerly known as Tachyon) is an open source, memory-speed virtual distributed storage system. Examples of its usage patterns: • Accelerating Big Data analytics workloads by prefetching views and creating caches on demand. • Sharing data between applications by writing to Alluxio’s in-memory data store and reading it back at far greater speed.  RocksDB https://github.com/facebook/rocksdb/ is an open source library from Facebook that provides an embeddable, persistent key-value store. It is suited for fast storage of data in RAM and on flash drives. It is used as a state backend by Samza, Flink, Kafka Streams, …
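The embedded key-value pattern that RocksDB serves can be sketched with Python's stdlib `dbm` module standing in for RocksDB. This is an illustration only, not the RocksDB API: the point is that state is written by key and survives a reopen, which is exactly what a stream processor needs from a state backend.

```python
import dbm
import os
import tempfile

# dbm stands in for an embedded, persistent key-value store such as
# RocksDB; keys and values are raw bytes, and the store lives in a file.
path = os.path.join(tempfile.mkdtemp(), "state")

with dbm.open(path, "c") as db:        # "c": create the store if missing
    db[b"user:42:count"] = b"7"        # write some operator state

with dbm.open(path, "r") as db:        # the state survives reopening
    value = int(db[b"user:42:count"])

print(value)  # 7
```

Unlike this toy, RocksDB adds log-structured merge trees, compaction, and tuning for flash and RAM, which is why the stream engines listed above embed it rather than a simple dbm-style store.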
  20. 3. In-Memory Analytics. Apache Arrow (http://arrow.apache.org/) for columnar in-memory analytics. • Apache Arrow enables execution engines to take advantage of the latest SIMD (Single Instruction, Multiple Data) operations included in modern processors for native vectorized optimization of analytical data processing. • The columnar layout of data also allows for better use of CPU caches by placing all data relevant to a column operation in as compact a format as possible. • A key advantage of Apache Arrow is that systems utilizing it as a common memory format have no overhead for cross-system data communication and can also share functionality.
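The row-versus-column layout idea behind Arrow can be shown with only the stdlib `array` module (the data here is invented for illustration): a columnar store keeps each field in one contiguous, typed buffer instead of scattering it across row objects.

```python
from array import array

# Row-oriented representation: one Python dict per record.
rows = [{"id": 1, "price": 10.0},
        {"id": 2, "price": 20.5},
        {"id": 3, "price": 4.5}]

# Column-oriented representation: one typed, contiguous buffer per field.
columns = {
    "id":    array("q", (r["id"] for r in rows)),      # 64-bit signed ints
    "price": array("d", (r["price"] for r in rows)),   # 64-bit floats
}

# An aggregation now scans one compact buffer instead of touching every
# row object; this contiguity is what lets real engines apply SIMD and
# stay friendly to CPU caches.
total = sum(columns["price"])
print(total)  # 35.0
```

Arrow standardizes exactly such typed, contiguous column buffers so that different systems can hand them to each other without serialization.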
  21. Agenda: 1. Portability between Big Data Execution Engines 2. Emergence of stream analytics frameworks 3. In-Memory analytics 4. Rapid Application Development of Big Data applications 5. Open sourcing Machine Learning systems by tech giants 6. Deployment of Big Data applications in a hybrid model: on-premise and on the cloud
  22. 4. Rapid Application Development of Big Data applications [diagram]: (1) APIs, (2) Shells or Notebooks, (3) GUIs, (4) Microservices.
  23. 4. Rapid Application Development of Big Data applications. 1 APIs:  Apache Spark and Apache Flink provide high-level, easy-to-use APIs compared to Hadoop MapReduce.  Apache Beam is a new open source project from Google that attempts to unify data processing frameworks with a core API, allowing easy portability between execution engines.  Use the Apache Beam unified API for batch and streaming, then run on a local runner, Apache Spark, Apache Flink, …  The biggest advantages are developer productivity and ease of migration between processing engines.
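The productivity gap between a low-level MapReduce-style program and a high-level API can be shown in a few lines of stdlib Python: the same word count is written first with explicit map and reduce phases, then as a single expression of the kind the Spark/Flink/Beam APIs make routine.

```python
from collections import Counter
from itertools import chain

lines = ["big data", "fast data", "big streams"]

# MapReduce style: explicit map and reduce phases, manual plumbing.
mapped = [(word, 1) for line in lines for word in line.split()]
reduced = {}
for word, one in mapped:
    reduced[word] = reduced.get(word, 0) + one

# High-level style: the same result in one expressive line.
counts = Counter(chain.from_iterable(line.split() for line in lines))

print(reduced == dict(counts))  # True
```

The code the framework user no longer writes (shuffling, grouping, aggregation plumbing) is precisely what the high-level engines take over, which is the productivity argument this slide makes.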
  24. 4. Rapid Application Development of Big Data applications. 2 Shells or Notebooks. Shells: • REPL (Read-Evaluate-Print Loop) interpreter • Interactive queries • Explore data quickly • Sketch out your ideas in the shell to make sure your code is right before deploying it to a cluster. Notebooks: • Web-based interactive computation environment • Collaborative data analytics and visualization tool • Combines rich text, executable code, plots and rich media • Exploratory data science • Saving and replaying of written code
  25. 4. Rapid Application Development of Big Data applications. 2 Shells or Notebooks: Apache Zeppelin
  26. 4. Rapid Application Development of Big Data applications. 3 GUIs:  Apache NiFi
  27. 4. Rapid Application Development of Big Data applications. 4 Microservices:  Microservices are an important trend in building larger systems by decomposing their functions into relatively simple, single-purpose services that communicate asynchronously via Apache Kafka, a message-passing technology that avoids unwanted dependencies between these services.  This streaming architectural style provides agility, as microservices can be built and maintained by small, cross-functional teams.
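A minimal sketch of this decoupling, with a stdlib `queue.Queue` standing in for a Kafka topic (the service names and event fields are invented for illustration): the two services never call each other directly, they only exchange events through the broker.

```python
import queue

# A queue.Queue stands in for an Apache Kafka topic in this sketch.
orders_topic = queue.Queue()

def order_service(order_id):
    """Publishes an event; knows nothing about downstream consumers."""
    orders_topic.put({"order_id": order_id, "status": "created"})

def notification_service():
    """Consumes events independently; could be replaced or scaled alone."""
    handled = []
    while not orders_topic.empty():
        event = orders_topic.get()
        handled.append(f"notify: order {event['order_id']} {event['status']}")
    return handled

order_service(1)
order_service(2)
messages = notification_service()
print(messages)  # ['notify: order 1 created', 'notify: order 2 created']
```

Because the producer and consumer share only the topic and the event schema, either side can be rewritten, redeployed or scaled without touching the other, which is the agility argument the slide makes.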
  28. Agenda: 1. Portability between Big Data Execution Engines 2. Emergence of stream analytics frameworks 3. In-Memory analytics 4. Rapid Application Development of Big Data applications 5. Open sourcing Machine Learning systems by tech giants 6. Hybrid Cloud Computing
  29. 5. Open sourcing Machine Learning systems by tech giants [diagram]: (1) Facebook Torch, (2) IBM SystemML, (3) Google TensorFlow, (4) Microsoft DMTK, (5) Yahoo CaffeOnSpark, (6) Amazon DSSTNE.
  30. 5. Open sourcing Machine Learning systems by tech giants. 1 Torch http://torch.ch/ is an open source machine learning library which provides a wide range of deep learning algorithms. Facebook donated its optimized deep learning modules to the Torch project on January 16, 2015. 2 Apache SystemML http://systemml.apache.org/ is a distributed, declarative machine learning platform. It was created in 2010 by IBM and donated as an open source Apache project on November 2nd, 2015. 3 TensorFlow https://www.tensorflow.org is an open source machine learning library created by Google. It was released under the Apache 2.0 open source license on November 9th, 2015.
  31. 5. Open sourcing Machine Learning systems by tech giants. 4 DMTK (Distributed Machine Learning Toolkit) http://www.dmtk.io/ allows models to be trained on multiple nodes at once. DMTK was open sourced by Microsoft on November 12, 2015. 5 CaffeOnSpark https://github.com/yahoo/CaffeOnSpark is an open source machine learning library created by Yahoo. It was open sourced on February 24th, 2016. 6 DSSTNE (Deep Scalable Sparse Tensor Network Engine, “Destiny”) https://github.com/amznlabs/amazon-dsstne is an Amazon-developed library for building deep learning (DL) machine learning (ML) models. It was open sourced on May 11th, 2016.
  32. 5. Open sourcing Machine Learning systems by tech giants. Expect to see wider adoption of machine learning tools by companies beyond these tech giants, in the same way that MapReduce and Hadoop helped make “Big Data” part of just about every company’s strategy! These tech giants are not keeping their machine learning systems for internal use only; they are racing to open source them, attract users and committers, and advance the entire industry. This, combined with deployment on commodity clusters, will accelerate such adoption, and as a result we will see new machine learning use cases, especially ones building on deep learning, that will transform multiple industries.
  33. Agenda: 1. Portability between Big Data Execution Engines 2. Emergence of stream analytics frameworks 3. In-Memory analytics 4. Rapid Application Development of Big Data applications 5. Open sourcing Machine Learning systems by tech giants 6. Hybrid Cloud Computing
  34. 6. Hybrid Cloud Computing. Cloud is becoming mainstream and the software stack is adapting. Big Data applications will eventually all move to the cloud to benefit from agility, elasticity and on-demand computing! Meanwhile, companies need to advance their strategy for hybrid integration between cloud and on-premise deployments: deployment of Big Data applications in a hybrid model, on-premise and on the cloud.
  35. 6. Hybrid Cloud Computing. The following are a few patterns for such hybrid integration: 1. Replicating data from SaaS apps to existing on-premise databases to be used by other on-premise applications, such as analytics ones. 2. Integrating the SaaS applications themselves with on-premise applications. 3. Hybrid data warehousing with the cloud: moving data from an on-premise data warehouse to the cloud. 4. Real-time analytics on streaming data: depending on your use case, you might keep your stream analytics infrastructure directly accessible on-premise for low latency.
  36. Key Takeaways: 1. Adopt Apache Beam for easier development and portability between Big Data execution engines. 2. Adopt stream analytics for faster time to insight, competitive advantages and operational efficiency. 3. Accelerate your Big Data applications with in-memory open source tools. 4. Adopt rapid application development of Big Data applications: APIs, notebooks, GUIs, microservices, … 5. Make machine learning part of your strategy, or passively watch your industry be completely transformed! 6. Advance your strategy for hybrid integration between cloud and on-premise deployments.
  37. Thanks! To all of you for attending! Any questions? Let’s keep in touch! • sbaltagi@gmail.com • @SlimBaltagi • https://www.linkedin.com/in/slimbaltagi
