Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15


Sparking Data in the Cloud: Data isn't useful until it's used to drive decision-making. Companies like Pinterest are using machine learning to build data-driven recommendation engines and perform advanced cluster analysis. This talk covers best practices for running Spark in the cloud and common challenges in iterative design and interactive analysis.


  1. September 18, 2015. Jason Huang, Senior Solutions Architect, Qubole Inc.
  2. Company Founding
     Qubole's founders built the Facebook data platform. The Facebook model changed the role of data in an enterprise: the data assets had to be turned into a "utility" to make a viable business.
     – Collaborative: over 30% of employees use the data directly.
     – Accessible: developers, analysts, business analysts, and business users all run queries, which has made the company more data-driven and more agile in its use of data.
     – Scalable: exabytes of data, moving fast.
     It took the founders a team of over 30 people to create this infrastructure, and the team managing it today has more than 100 people. The work at Facebook inspired the founding of Qubole.
     [Diagram: roles served by the data infrastructure - Operations Analyst, Marketing Ops Analyst, Data Architect, Business Users, Product Support, Customer Support, Developer, Sales Ops, Product Managers]
  3. Impediments for an Aspiring Data-Driven Enterprise
     Where Big Data falls short:
     • 6-18 month implementation time
     • Only 27% of Big Data initiatives were classified as "successful" in 2014
     • Only 13% of organizations achieve full-scale production
     • 57% of organizations cite the skills gap as a major inhibitor
     Underlying causes: rigid and inflexible infrastructure, non-adaptive software services, and highly specialized systems that are difficult to build and operate.
  4. State of the Big Data Industry (n=417)
     [Bar chart: adoption (0-80%) of Hadoop MapReduce, Pig, Spark, Storm, Presto, Cassandra, HBase, and Hive]
  5. Apache Spark
     • Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing.
     • Analytic libraries:
       – Spark Streaming (streaming data)
       – Spark SQL (data processing)
       – MLlib (machine learning)
       – GraphX (graph processing)
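     To make this concrete, here is a minimal Spark program in Scala against the 1.x API that was current at the time of this talk. The application name and the S3 input path are placeholders, not something from the slides.

        import org.apache.spark.{SparkConf, SparkContext}

        object WordCount {
          def main(args: Array[String]): Unit = {
            val sc = new SparkContext(new SparkConf().setAppName("WordCount"))

            // Read text from any Hadoop-compatible store (path is a placeholder)
            val lines = sc.textFile("s3n://my-bucket/input/")

            // Classic word count: split into words, map to (word, 1), reduce by key
            val counts = lines
              .flatMap(_.split("\\s+"))
              .map(word => (word, 1))
              .reduceByKey(_ + _)

            counts.take(20).foreach(println)
            sc.stop()
          }
        }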
  6. Common Spark Use Cases
     • Streaming Data
       – Process streaming data with Spark's built-in functions (sketch below)
       – Applications such as fraud detection and log processing
       – ETL via data ingestion
     • Machine Learning
       – Helps users run repeated queries and machine learning algorithms over data sets
       – MLlib covers areas such as clustering, classification, and dimensionality reduction
       – Used for very common big data tasks: predictive intelligence, customer segmentation, and sentiment analysis
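     For the streaming case, a minimal Spark Streaming sketch (Scala, 1.x API) that counts error lines in a log stream per micro-batch. The socket source, host, port, and 10-second batch interval are assumptions for illustration only.

        import org.apache.spark.SparkConf
        import org.apache.spark.streaming.{Seconds, StreamingContext}

        object LogStreamExample {
          def main(args: Array[String]): Unit = {
            val conf = new SparkConf().setAppName("LogStreamExample")
            // 10-second micro-batches
            val ssc = new StreamingContext(conf, Seconds(10))

            // Read log lines from a socket source (hostname and port are placeholders)
            val lines = ssc.socketTextStream("log-host", 9999)

            // Count ERROR lines in each batch: a toy stand-in for fraud/log processing
            val errorCounts = lines.filter(_.contains("ERROR")).count()
            errorCounts.print()

            ssc.start()
            ssc.awaitTermination()
          }
        }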
  7. Common Spark Use Cases
     • Interactive Analysis
       – MapReduce was built to handle batch processing
       – SQL-on-Hadoop engines such as Hive, and scripting layers such as Pig, can be too slow for interactive analysis
       – Spark is fast enough to perform exploratory queries without sampling (sketch below)
       – Provides multiple language-specific APIs, including R, Python, Scala, and Java
     • Fog Computing
       – The Internet of Things: objects and devices with tiny embedded sensors that communicate with each other and with users, creating a fully interconnected world
       – Decentralize data processing and storage, and use Spark for streaming analytics and interactive real-time queries
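     For interactive analysis, a sketch of an exploratory Spark SQL query run directly over raw data in S3 (Scala, Spark 1.x SQLContext). The path and the events/country schema are hypothetical.

        import org.apache.spark.{SparkConf, SparkContext}
        import org.apache.spark.sql.SQLContext

        object InteractiveQueryExample {
          def main(args: Array[String]): Unit = {
            val sc = new SparkContext(new SparkConf().setAppName("InteractiveQueryExample"))
            val sqlContext = new SQLContext(sc)

            // Load a JSON data set straight from S3 (bucket and path are placeholders)
            val events = sqlContext.read.json("s3n://my-bucket/events/")
            events.registerTempTable("events")

            // Exploratory query over the full data set, no sampling needed
            sqlContext.sql(
              """SELECT country, COUNT(*) AS sessions
                |FROM events
                |GROUP BY country
                |ORDER BY sessions DESC
                |LIMIT 10""".stripMargin
            ).show()

            sc.stop()
          }
        }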
  8. Why Spark MLlib?
     Use Spark for distributed computation:
     • Combine Spark SQL and GraphX with MLlib in the same Spark program (sketch below)
     • Ability to use the language of your choice: Python, Scala, R, or Java
     • Extensive set of algorithms (http://spark.apache.org/docs/latest/mllib-guide.html)
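     A sketch of what "the same Spark program" means in practice: a Spark SQL step feeding an MLlib clustering step, with no hand-off between systems. The table layout, column names, and S3 path are assumptions, and the selected columns are taken to be doubles.

        import org.apache.spark.{SparkConf, SparkContext}
        import org.apache.spark.sql.SQLContext
        import org.apache.spark.mllib.clustering.KMeans
        import org.apache.spark.mllib.linalg.Vectors

        object SqlPlusMLlibExample {
          def main(args: Array[String]): Unit = {
            val sc = new SparkContext(new SparkConf().setAppName("SqlPlusMLlibExample"))
            val sqlContext = new SQLContext(sc)

            // SQL step: filter and project the raw data (table and columns are placeholders)
            val users = sqlContext.read.parquet("s3n://my-bucket/users/")
            users.registerTempTable("users")
            val features = sqlContext.sql(
              "SELECT age, sessions_per_week, avg_basket FROM users WHERE active = true")

            // MLlib step: cluster the same data in the same program
            val vectors = features.map(row =>
              Vectors.dense(row.getDouble(0), row.getDouble(1), row.getDouble(2))).cache()
            val model = KMeans.train(vectors, k = 5, maxIterations = 20)
            model.clusterCenters.foreach(println)

            sc.stop()
          }
        }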
  9. Algorithms
     • Classification and regression: logistic regression, linear regression, linear support vector machine (SVM), naive Bayes, decision trees
     • Collaborative filtering: alternating least squares (ALS)
     • Clustering: k-means, Gaussian mixture
     • Dimensionality reduction: singular value decomposition (SVD), principal component analysis (PCA)
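     As one example from this list, a sketch of training and evaluating an MLlib logistic regression classifier (Scala, Spark 1.x RDD-based API). The LIBSVM input path, the 70/30 split, and the seed are placeholders.

        import org.apache.spark.{SparkConf, SparkContext}
        import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
        import org.apache.spark.mllib.regression.LabeledPoint
        import org.apache.spark.mllib.util.MLUtils

        object LogisticRegressionExample {
          def main(args: Array[String]): Unit = {
            val sc = new SparkContext(new SparkConf().setAppName("LogisticRegressionExample"))

            // Load labeled training data in LIBSVM format (path is a placeholder)
            val data = MLUtils.loadLibSVMFile(sc, "s3n://my-bucket/sample_libsvm_data.txt")

            // Split into training (70%) and test (30%) sets
            val Array(training, test) = data.randomSplit(Array(0.7, 0.3), seed = 42L)

            // Train a binary logistic regression model
            val model = new LogisticRegressionWithLBFGS()
              .setNumClasses(2)
              .run(training)

            // Evaluate accuracy on the held-out set
            val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
              (model.predict(features), label)
            }
            val accuracy =
              predictionAndLabels.filter { case (p, l) => p == l }.count().toDouble / test.count()
            println(s"Test accuracy = $accuracy")

            sc.stop()
          }
        }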
  10. How about R? Use SparkR!
      • Spark: fast, scalable, and flexible
      • R: statistics, packages, and plots
      SparkR combines both, which is very powerful: use the SparkR API to take advantage of Spark, bring the data back into R, and then do machine learning, data visualization, etc.
  11. What about the cloud?
      • Central governance & security
      • Internet scale
      • Instant deployment
      • Isolated multi-tenancy
      • Elastic
      • Object store underpinnings
  12. Spark in the Cloud
      • Zero configuration: Spark, SparkR, MLlib, GraphX, etc. all pre-installed on every cluster node
        – e.g. submit SparkR programs via a client-side API to an on-demand compute cluster
      • ETL (data cleansing, transformations, table joins, etc.) is required prior to any ML modeling and analysis
        – e.g. use other Big Data tools to prepare the data: Hive, Hadoop, Cascading, Pig, ...
  13. Spark in the Cloud
      • Use the AWS S3 object store to decouple compute and storage; scale processing power and storage capacity independently (sketch below)
      • S3 is highly available, reliable, scalable, and cost-effective
      • Elastic compute provides unlimited scale on demand: calculations may require 10, 100, or 1,000+ compute nodes
      • Ability to run multiple clusters: distinguish between teams, workloads, and production versus non-production R&D/test
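      A sketch of what the compute/storage decoupling looks like from the application side (Scala, Spark 1.x, reading and writing S3 through the Hadoop s3n:// filesystem). Bucket names and paths are placeholders, and S3 credentials are assumed to be configured on the cluster.

         import org.apache.spark.{SparkConf, SparkContext}

         object S3RoundTrip {
           def main(args: Array[String]): Unit = {
             val sc = new SparkContext(new SparkConf().setAppName("S3RoundTrip"))

             // Read the raw data set directly from S3 (no HDFS copy step)
             val raw = sc.textFile("s3n://my-bucket/raw/events/")

             // Do the computation on the transient cluster...
             val cleaned = raw.filter(_.nonEmpty).map(_.toLowerCase)

             // ...and write the result straight back to S3; the cluster can then be
             // resized or terminated without losing any data
             cleaned.saveAsTextFile("s3n://my-bucket/cleaned/events/")
             sc.stop()
           }
         }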
  14. Cloud object store for data sets, e.g. AWS S3
  15. Spark in the Cloud
      • Flexible compute resource options
        – High memory: AWS EC2 r3.* instances for high-memory workloads that cache and manipulate large Spark RDDs (sketch below)
        – High CPU: AWS EC2 c3.* instances for CPU-intensive workloads
      • Automatic cluster termination when idle
      • Periodically check for bad instances and remove them
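      A small sketch of the caching pattern that motivates the high-memory r3.* nodes: parse a large data set once, persist it across the cluster, and run repeated queries against the cached RDD. The path and the MEMORY_AND_DISK_SER storage level are illustrative choices.

         import org.apache.spark.{SparkConf, SparkContext}
         import org.apache.spark.storage.StorageLevel

         object CachedRdd {
           def main(args: Array[String]): Unit = {
             val sc = new SparkContext(new SparkConf().setAppName("CachedRdd"))

             // Parse a large data set once and keep it in executor memory,
             // spilling serialized blocks to disk if a node runs short of RAM
             val events = sc.textFile("s3n://my-bucket/events/")
               .map(_.split(","))
               .persist(StorageLevel.MEMORY_AND_DISK_SER)

             // Repeated queries over the cached RDD avoid re-reading from S3
             println(events.count())
             println(events.filter(_.length > 3).count())

             sc.stop()
           }
         }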
  16. DIY - Getting Started on the Cloud
      • Install Spark on EC2 (and HDFS if required)
      • Choose a Spark backend cluster mode and configure it (sketch below):
        – Standalone
        – YARN
        – Mesos
      • Spin up a cluster of instances
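      A sketch of what choosing the backend cluster mode amounts to in application code: the master URL selects standalone, YARN, or Mesos. Hostnames and ports are placeholders, and in practice the master is usually passed to spark-submit rather than hard-coded as shown here.

         import org.apache.spark.{SparkConf, SparkContext}

         object ClusterModeExample {
           def main(args: Array[String]): Unit = {
             // Pick exactly one master URL for the backend cluster manager
             val conf = new SparkConf()
               .setAppName("ClusterModeExample")
               .setMaster("spark://master-host:7077")       // Standalone
               // .setMaster("yarn-client")                  // YARN (client mode)
               // .setMaster("mesos://master-host:5050")     // Mesos

             val sc = new SparkContext(conf)
             println(s"Running with ${sc.defaultParallelism} default partitions")
             sc.stop()
           }
         }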
  17. DIY - Getting Started on the Cloud
      EC2 scripts can help: http://spark.apache.org/docs/latest/ec2-scripts.html
      • Help spin up named clusters
      • Create a security group and come pre-baked with Spark installed
  18. Another (very short) Demo
  19. Qubole Case Study
      Ease of use for analysts:
      • Dozens of Data Scientist and Analyst users
      • Produces double-digit TBs of data per day
      • Does not have dedicated staff to set up and manage clusters and Hadoop distributions
      [Diagram: user roles around the platform - Operations Analyst, Marketing Ops Analyst, Data Architect, Business Users, Product Support, Customer Support, Developer, Sales Ops, Product Managers]
  20. Qubole Case Study
      Why Spark?
      [Diagram: data pipeline from Producers (CDN, Real Time Bidding, Retargeting Platform, Customer Data) through Continuous Processing (Kinesis, Streaming, ETL) and Storage (S3, Redshift) to Analytics (Machine Learning)]
      "Qubole put our cluster management, auto-scaling and ad-hoc queries on autopilot. Its higher performance for Big Data queries translates directly into faster and more actionable marketing intelligence for our customers."
      Yekesa Kosuru, VP, Technology
