
Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics


In this one-hour webinar, you will be introduced to Spark, the data engineering that supports it, and the data science advances it has spurred. You’ll discover the story of its academic origins and then get an overview of the organizations that are using the technology. After being briefed on some impressive Spark case studies, you’ll learn about the next-generation Spark 2.0 (to be released in just a few months). We will also tell you about the impact that learning Spark can have on your current salary, and the best ways to get trained in this ground-breaking technology.

Published in: Data & Analytics


  1. 1. Big Data 2.0 HOW SPARK TECHNOLOGIES ARE RESHAPING THE WORLD OF BIG DATA ANALYTICS Presented By: Lillian Pierson, P.E.
  2. 2. Today’s webinar Apache Spark: Journey from “Hadoop ecosystem component” to “Big Data platform” The story of how Spark began Is Spark a data engineering or data science platform? Who is using Spark and for what? Got Spark skills? Here’s why you should
  3. 3. Apache Spark JOURNEY FROM “HADOOP ECOSYSTEM COMPONENT” TO “BIG DATA PLATFORM”
  4. 4. What is Spark?
  5. 5. Why in-memory applications? “In-memory computing appliances are … faster than the traditional Hadoop system because in-memory appliances don’t use MapReduce… By storing data in memory, in-memory appliances are able to bypass the time-consuming disk accesses that are required as part of the map and reduce operations that comprise the MapReduce process. In-memory data storage, processing, and analysis is fast enough to generate data analytics in real time, derived from streaming data sources.” – Excerpt from my book: Big Data/Hadoop for Dummies
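The quoted point, chaining stages through disk versus through memory, can be illustrated with a toy word count in plain Python. This is a conceptual sketch only (neither real MapReduce nor Spark); the input lines are made up for the example.

```python
import json
import os
import tempfile

data = ["spark hadoop", "spark streaming", "hadoop yarn"]

def tokenize(lines):
    return [word for line in lines for word in line.split()]

def count(words):
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return counts

def chained_on_disk(lines):
    # MapReduce-style chaining: stage 1 writes its result to disk,
    # and stage 2 reads it back before continuing.
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump(tokenize(lines), f)
        path = f.name
    with open(path) as f:
        words = json.load(f)
    os.remove(path)
    return count(words)

def chained_in_memory(lines):
    # Spark-style chaining: the intermediate word list never leaves RAM.
    return count(tokenize(lines))

assert chained_on_disk(data) == chained_in_memory(data)
```

Both pipelines compute the same answer; the in-memory one simply skips the serialize-to-disk round trip that the quote identifies as the bottleneck.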
  6. 6. From Hadoop ecosystem component… HDFS MapReduce 2.0 YARN
  7. 7. From Hadoop ecosystem component… HDFS Spark MapReduce 2.0 YARN
  8. 8. To big data platform HDFS MapReduce 2.0 Spark YARN
  9. 9. To big data platform Spark-as-a-Service
  10. 10. Spark’s 4 submodules Spark SQL MLlib GraphX Streaming
  11. 11. Spark SQL module DataFrames Spark SQL ◦ SQL Hive ◦ HiveQL ◦ Spark Processing Engine
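The Spark SQL module lets you run SQL (and HiveQL) queries over structured, in-memory data via DataFrames. Spark itself is not shown here; as a miniature stand-in for the idea, this sketch uses Python’s built-in sqlite3 with an in-memory table playing the role of a registered DataFrame. Table and column names are invented for the example.

```python
import sqlite3

# An in-memory table standing in for a DataFrame registered with Spark SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("alice", 3), ("bob", 5), ("alice", 2)])

# The same kind of declarative query you would submit to Spark SQL.
rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()
print(rows)  # [('alice', 5), ('bob', 5)]
```

The point of the module is that this declarative style scales out: Spark plans and executes the same query across a cluster instead of a single process.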
  12. 12. Mllib module Data analysis Statistics Machine learning
  13. 13. GraphX module Graph data storage and processing Graphx ◦ In-memory graph data processing HDFS ◦ Graph data storage
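GraphX keeps graph data in memory and runs whole-graph computations over it; connected components is one of its canonical algorithms. As a rough, non-Spark sketch of that kind of computation, here it is in plain Python on a tiny made-up edge list.

```python
from collections import defaultdict

# A small undirected graph as an edge list (invented for the example).
edges = [("a", "b"), ("b", "c"), ("d", "e")]

adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

def components(adj):
    # Depth-first search from each unvisited vertex; each search
    # discovers one connected component.
    seen, comps = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        comps.append(sorted(comp))
    return comps

print(components(adj))  # [['a', 'b', 'c'], ['d', 'e']]
```

GraphX does the equivalent with the graph partitioned across a cluster, which is what makes it viable for big graph data rather than toy examples like this one.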
  14. 14. Streaming module Continuously streaming data Discretized Streams (DStreams) Micro-batch processing
  15. 15. DStreams and micro-batch architecture Source: http://www.slideshare.net/skpabba/hadoop-and-spark RDD @ time 1 RDD @ time 2 RDD @ time 3
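A DStream discretizes a continuous stream into a sequence of small batches (the RDDs at time 1, 2, 3 on the slide), each processed by the same logic. A toy, non-Spark sketch of that micro-batching in plain Python, with an invented event list standing in for the stream:

```python
import itertools

def micro_batches(stream, batch_size):
    # Cut a (potentially unbounded) stream into fixed-size batches,
    # the way a DStream cuts a stream into per-interval RDDs.
    it = iter(stream)
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            return
        yield batch

events = [1, 2, 3, 4, 5, 6, 7]
# Apply the same computation to every micro-batch as it arrives.
totals = [sum(batch) for batch in micro_batches(events, 3)]
print(totals)  # [6, 15, 7]
```

In real Spark Streaming the batch boundary is a time interval rather than a count, but the processing model, one small batch at a time through unchanged logic, is the same.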
  16. 16. Basic Spark Architecture Spark Core libraries (single abstraction layer) with Spark SQL, MLlib, GraphX, and Streaming processing on top ◦ Resource Manager (YARN) ◦ Data Storage Layer (HDFS) ◦ Physical Hardware
  17. 17. Changes with Spark 2.0 Spark 1.0: RDD API ◦ Spark 1.3: RDD API, DataFrame API ◦ Spark 1.6: RDD API, DataFrame API, Dataset API ◦ Spark 2.0: RDD API, DataFrame API, Dataset API
  18. 18. Changes with Spark 2.0 Spark 1.0: RDD API as the core abstraction ◦ Spark 2.0: Dataset and DataFrame APIs layered over the RDD API
  19. 19. Changes with Spark 2.0 Structured stream processing built on the DataFrame and Dataset APIs
  20. 20. The story of how Spark began
  21. 21. Taking things from the beginning… 2009 Mesos UC Berkeley Interactive, iterative parallel processing (in-memory) ◦ Machine learning requirements Integrates with Hadoop ecosystem Dr. Ion Stoica Computer Science Professor UC Berkeley
  22. 22. Databricks… the cutting edge of Spark Delivers Apache Spark-as-a-Service Most popular solution for deploying Spark on the cloud Dr. Ion Stoica Executive Chairman, Databricks
  23. 23. Databricks… the cutting edge of Spark Spark on an as-needed basis Automates ◦ Cluster building and configuration ◦ Security ◦ Process monitoring ◦ Resource monitoring Notebooks ◦ For data analysis and machine learning using Python, R, and Scala Data visualization capabilities ◦ Data visualization and dashboard design options
  24. 24. Is Spark a data engineering or data science platform? DATA ENGINEERING COMPONENTS AND TECHNOLOGIES DATA SCIENCE COMPONENTS AND TECHNOLOGIES
  25. 25. Spark’s data engineering elements Automate cluster sizing and configuration requirements Data Storage: HDFS Resource Management: ◦ Spark Standalone ◦ Apache Mesos ◦ Hadoop YARN
  26. 26. Spark’s data engineering elements Spark Streaming Submodule – Reuse the same code you use for batch processing, but get real-time results! ◦ Integrates with big data sources, like: ◦ HDFS ◦ Flume ◦ Kafka ◦ Twitter ◦ ZeroMQ
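The reuse claim above can be sketched in plain Python (this is not the actual Spark Streaming API): define the transformation once, then feed it a historical batch or each incoming micro-batch unchanged. Both data sets are invented for the example.

```python
def top_word(lines):
    # One transformation, written once, used for batch and streaming alike.
    counts = {}
    for line in lines:
        for w in line.split():
            counts[w] = counts.get(w, 0) + 1
    # Break count ties alphabetically so the result is deterministic.
    return max(counts, key=lambda w: (counts[w], w))

history = ["spark spark hadoop", "spark yarn"]       # batch input
live = [["hadoop hadoop"], ["hadoop spark hadoop"]]  # stream of micro-batches

print(top_word(history))                    # 'spark'  (batch result)
print([top_word(b) for b in live])          # ['hadoop', 'hadoop']  (per-batch results)
```

That is the whole appeal: the batch job and the streaming job share one code path, so real-time results do not require a second implementation.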
  27. 27. Doing data science with Spark Useful for machine learning and analysis of big data Build big data analytics products Programmable in Python, R, Scala, and SQL Submodules: ◦ SQL and DataFrames ◦ MLlib for machine learning ◦ GraphX for in-memory big (graph) data computations
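As a miniature, non-distributed illustration of the kind of statistics MLlib computes at scale, here is an ordinary least-squares line fit in pure Python (the data points are invented; MLlib would run the equivalent regression across a cluster on far larger data).

```python
# Made-up observations of y against x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.0, 6.2, 7.9]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n

# Closed-form simple linear regression: slope = cov(x, y) / var(x).
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

print(round(slope, 2), round(intercept, 2))
```

The modeling step itself is small; what Spark adds is running the sums over partitioned data in parallel, which is exactly the shape of computation MLlib distributes.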
  28. 28. Doing data science with Spark Spark integrates with the following data sources and formats: ◦ Hive, Avro, Parquet, CSV, JSON, JDBC, and HBase ◦ BI Tools: Tableau, QLIK, ZoomData, etc. (through JDBC)
  29. 29. Who is using Spark and for what? AUTOMATIC LABS ◦ LENDUP ◦ SELLPOINTS ◦ FINDIFY
  30. 30. Automatic Labs on Databricks Making cars smarter with real-time analytics Connect to, and make smart use of, your car’s data
  31. 31. Automatic Labs on Databricks Automatic apps do things like: ◦ Decoding engine problems ◦ Locating parked cars ◦ Crash detection and response ◦ Low fuel warnings, etc. Automatic is using Spark to make cars smarter with real-time analytics During product development, Automatic needs to query, explore, and visualize large amounts of data, QUICKLY. By moving this work over to Spark, Automatic was able to: ◦ Validate products in days, not weeks ◦ Complete complex queries in minutes ◦ Free up 1 full-time data scientist ◦ Save $10K/month on infrastructure costs
  32. 32. LendUp on Databricks Improving the lending process and experience “Moving up the LendUp Ladder means earning access to more money, at better rates, for longer periods of time” - LendUp
  33. 33. LendUp on Databricks LendUp uses Spark for: ◦ Feature engineering at scale ◦ Fast model building and testing By using Spark to do this work, LendUp is able to: ◦ Build more accurate models, faster ◦ Offer more lines of credit ◦ Develop new products more quickly ◦ Increase in-house productivity of data science team
  34. 34. sellpoints on Databricks Increasing ROI on ad spend
  35. 35. sellpoints on Databricks Increasing ROI on ad spend sellpoints offers services in: ◦ Identifying qualified shoppers ◦ Driving traffic ◦ Increasing sales conversion By moving to Databricks, sellpoints was able to: ◦ Productize a new predictive analytics offering, improving the ad spend ROI by threefold compared to competitive offerings. ◦ Reduce the time and effort required to deliver actionable insights to the business team while lowering costs. ◦ Improve productivity of the engineering and data science team by eliminating the time spent on DevOps and maintaining open source software.
  36. 36. Findify on Databricks Improving shopping experience for ecommerce customers Uses machine learning to continually improve search accuracy
  37. 37. Findify on Databricks Improving shopping experience for ecommerce customers By moving to Databricks, Findify was able to: ◦ Focus on development instead of infrastructure – Allowing them to complete their feature development projects faster and reduce customer frustration in delayed analytics ◦ Focus on building innovative features - because the managed Spark platform eliminated time spent on DevOps and infrastructure issues. Uses machine learning to continually improve search accuracy
  38. 38. Got Spark skills? Here’s why you should IMPACT ON SALARY TRAINING ISSUES AND OPPORTUNITIES
  39. 39. How much do Spark skills pay? 2015 Data Science Salary Survey, by O’Reilly (annual salary increase by skill, from the chart): Spark Skills $11,000 ◦ Scala Programming $4,000 ◦ Basic Exploratory Analysis (>4 hr/wk) $4,600 ◦ D3.js Skills $8,000
  40. 40. Getting training and experience in Spark $149.50 Sale Until March 30 Only Discount Code: ‘SPRING50’
  41. 41. Getting training and experience in Spark Get hands-on training in the following areas: ◦ Using RDDs ◦ Writing applications using Scala ◦ Spark SQL ◦ Spark Streaming ◦ Machine Learning in Spark (MLlib) ◦ Spark GraphX ◦ Spark Project Implementation
  42. 42. Getting training and experience in Spark $149.50 Sale Until March 30 Only Discount Code: ‘SPRING50’
  43. 43. Download these slides
  44. 44. Why Data Science From Simplilearn Key Features 40 hours of real life industry project experience 25 hours of High Quality e-learning Visualize and optimize data effectively using the built-in tools in R, SAS and Excel 48 hours of Live Instructor Led Online sessions Get proficient in using R, SAS and Excel to model data and predict solutions to business problems Master the concepts of statistical analysis like linear & logistic regression, cluster analysis & forecasting
  45. 45. OUR JOURNEY SO FAR Project Management Digital Marketing Big Data & Analytics Business Productivity Tools Quality Management Virtualization and Cloud Computing IT Security Financial Management CompTIA Certification IT Hardware and N/W ERP IT Services and Architecture Agile and Scrum Certification OS and Database Web and App Programming Simplilearn: World’s Largest Certification Training Destination One of the largest collections of accredited certification training in the world. YEAR 2010 to YEAR 2016
