SlideShare a Scribd company logo
September 18, 2015
Jason Huang
Senior Solutions Architect, Qubole Inc.
Company Founding
Qubole founders built the Facebook data platform.
The Facebook model changed the role for data
in an enterprise.
• Needed to turn the data assets into a “utility” to make a viable
business.
– Collaborative: over 30% of employees use the
data directly.
– Accessible: developers, analysts, business analysts or
business users all running queries. Has made the
company more data driven and agile with data
use.
– Scalable: Exabyte's of data moving fast
It took the founders a team of over 30 people to create
this infrastructure and currently the team managing this
infrastructure has more than 100 people.
Work at Facebook inspired the founding of Qubole
Operations
Analyst
Marketing Ops
Analyst
Data
Architect
Business
Users
Product
Support
Customer
Support
Developer
Sales Ops
Product
Managers
Data
Infrastructure
Impediments for an Aspiring Data Driven Enterprise
Where Big
Data falls
short:
• 6-18 month implementation time
• Only 27% of Big Data initiatives are
classified as “Successful” in 2014
Rigid and
inflexible
infrastructure
Non adaptive
software
services
Highly
specialized
systems
Difficult to
build and
operate
• Only 13% of organizations achieve full-scale production
• 57% of organizations cite skills gap as a major inhibitor
State of the Big Data Industry (n=417)
0%
10%
20%
30%
40%
50%
60%
70%
80%
Hadoop MapReduce Pig Spark Storm Presto Cassandra HBase Hive
• Apache Spark is a fast and general engine for big data processing,
with built-in modules for streaming, SQL, machine learning and
graph processing.
Analytic Libraries:
• Spark Streaming (Streaming Data)
• Spark SQL (Data Processing)
• MLlib (Machine Learning)
• GraphX (Graph Processing)
Apache Spark
• Streaming Data
– Process streaming data with Spark built-in functions
– Applications such as fraud detection and log processing
– ETL via data ingestion
• Machine Learning
– Helps users run repeated queries and machine learning algorithms on
data sets
– MLlib can work in areas such as clustering, classification, and
dimensionality reduction
– Used for very common big data functions - predictive intelligence,
customer segmentation, and sentiment analysis
Common Spark Use Cases
• Interactive Analysis
– MapReduce was built to handle batch processing
– SQL-on-Hadoop engines such as Hive or Pig can be too slow for interactive
analysis
– Spark is fast enough to perform exploratory queries without sampling
– Provides multiple language-specific APIs including R, Python, Scala and Java.
• Fog Computing
– The Internet of Things - objects and devices with tiny embedded sensors that
communicate with each other and users, creating a fully interconnected world
– Decentralize data processing and storage and use Spark streaming analytics
and interactive real time queries
Common Spark Use Cases
Use Spark for distributed computation:
- Combine SparkSQL, GraphX along
with MLlib in the same Spark
program
- Ability to use language of choice -
python/scala/R/java
- Extensive algorithms
(http://spark.apache.org/docs/latest/
mllib-guide.html)
Why Spark MLlib?
• Classification and Regression: logistic regression, linear regression,
linear support vector machine (SVM), naive Bayes, decision trees
• Collaborative Filtering: alternating least squares (ALS)
• Clustering: k-means, Gaussian mixture
• Dimensionality Reduction: singular value decomposition (SVD),
principal component analysis (PCA)
Algorithms
• Spark : Fast, Scalable and Flexible
• R : Statistics, Packages and Plots
SparkR combines both - very powerful
Use SparkR API to take advantage of Spark, bring the data back into
R - and do some machine learning, data visualization, etc.
How about R? Use SparkR!
What about the cloud?
Central
Governance &
Security
Internet
Scale
Instant
Deployment
Isolated Multi-
tenancy
Elastic
Object Store
Underpinnings
• Zero configuration – Spark, SparkR, MLlib, GraphX, etc. all pre-
installed on all cluster nodes
– e.g. submit SparkR programs via a client-side API to an on-
demand compute cluster
• ETL (data cleansing, transformations, table joins, etc.) required prior to
any ML modeling and analysis
– e.g. Use other Big Data tools in order to prepare data –
hive/hadoop/cascading/pig…
Spark in the Cloud
• Use AWS S3 object store to decouple compute and storage; scale
processing power and storage capacity independently
• S3 is highly available, reliable, scalable and cost effective
• Elastic compute provides unlimited scale on-demand: calculations
may require 10, 100 or 1,000+ compute nodes.
• Ability to have multiple clusters – distinguish between teams,
workloads, production, non-production R&D/test
Spark in the Cloud
Cloud object store for data sets:
e.g. AWS S3:
• Flexible compute resource options
– High memory instances
• AWS EC2 r3.* for high memory workloads to cache and
manipulate large Spark RDDs
– High CPU
• AWS EC2 c3.* for CPU intensive workloads
• Automatic cluster termination when idle
• Periodically check for bad instances and remove them
Spark in the Cloud
• Install Spark on EC2 (HDFS if required)
• Choose Spark backend cluster mode and configure it
– Standalone
– Yarn
– Mesos
• Spin up a cluster of instances
CONFIDENTIAL. SUBJECT TO NDA PROVISIONS.
DIY - Getting Started on the Cloud
EC2 scripts can help:
http://spark.apache.org/docs/latest/ec2-scripts.html
- Helps spin up named clusters
- Creates a security group, comes pre-baked with Spark
installed
CONFIDENTIAL. SUBJECT TO NDA PROVISIONS.
DIY - Getting Started on the Cloud
Another (very short) Demo
Qubole Case Study
Qubole Case Study
Operations
Analyst
Marketing
Ops
Analyst
Data
Architect
Busines
s
Users
Product
Support
Customer
Support
Developer
Sales Ops
Product
Managers
Ease of use
for analysts
• Dozens of Data
Scientist and
Analyst users
• Produces double-
digit TBs of data
per day
• Does not have
dedicated staff
to setup and
manage clusters
and Hadoop
Distributions
0101
1010
1010
Qubole Case Study
Qubole Case Study
Producers Continuous Processing Storage Analytics
CDN
Real Time
Bidding
Retargeting
Platform
ETL
Kinesis S3 Redshift
Machine LearningStreaming
Customer Data
Why Spark?
0101
1010
1010
0101
1010
1010
0101
1010
1010
“Qubole put our cluster
management, auto-scaling
and ad-hoc queries on
autopilot. Its higher
performance for Big Data
queries translates directly
into faster and more
actionable marketing
intelligence for our
customers.”
Yekesa Kosuru
VP, Technology

More Related Content

What's hot

Qubole hadoop-summit-2013-europe
Qubole hadoop-summit-2013-europeQubole hadoop-summit-2013-europe
Qubole hadoop-summit-2013-europe
Joydeep Sen Sarma
 
MapR-DB Elasticsearch Integration
MapR-DB Elasticsearch IntegrationMapR-DB Elasticsearch Integration
MapR-DB Elasticsearch Integration
MapR Technologies
 

What's hot (20)

Hadoop Ecosystem at a Glance
Hadoop Ecosystem at a GlanceHadoop Ecosystem at a Glance
Hadoop Ecosystem at a Glance
 
Hadoop in the Cloud: Common Architectural Patterns
Hadoop in the Cloud: Common Architectural PatternsHadoop in the Cloud: Common Architectural Patterns
Hadoop in the Cloud: Common Architectural Patterns
 
Hadoop and HBase @eBay
Hadoop and HBase @eBayHadoop and HBase @eBay
Hadoop and HBase @eBay
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
 
Ignite Your Big Data With a Spark!
Ignite Your Big Data With a Spark!Ignite Your Big Data With a Spark!
Ignite Your Big Data With a Spark!
 
Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...
Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...
Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...
 
Azure Databricks—Apache Spark as a Service with Sascha Dittmann
Azure Databricks—Apache Spark as a Service with Sascha DittmannAzure Databricks—Apache Spark as a Service with Sascha Dittmann
Azure Databricks—Apache Spark as a Service with Sascha Dittmann
 
Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...
Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...
Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...
 
Spark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with SparkSpark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with Spark
 
Qubole hadoop-summit-2013-europe
Qubole hadoop-summit-2013-europeQubole hadoop-summit-2013-europe
Qubole hadoop-summit-2013-europe
 
MapR-DB Elasticsearch Integration
MapR-DB Elasticsearch IntegrationMapR-DB Elasticsearch Integration
MapR-DB Elasticsearch Integration
 
Hadoop at Ebay
Hadoop at EbayHadoop at Ebay
Hadoop at Ebay
 
Database Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, Blazegraph
Database Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, BlazegraphDatabase Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, Blazegraph
Database Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, Blazegraph
 
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
 
Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016
 
The Holy Grail of Data Analytics
The Holy Grail of Data AnalyticsThe Holy Grail of Data Analytics
The Holy Grail of Data Analytics
 
Big Data and Hadoop - key drivers, ecosystem and use cases
Big Data and Hadoop - key drivers, ecosystem and use casesBig Data and Hadoop - key drivers, ecosystem and use cases
Big Data and Hadoop - key drivers, ecosystem and use cases
 
Data Engineer's Lunch #55: Get Started in Data Engineering
Data Engineer's Lunch #55: Get Started in Data EngineeringData Engineer's Lunch #55: Get Started in Data Engineering
Data Engineer's Lunch #55: Get Started in Data Engineering
 
HBaseCon 2015: Industrial Internet Case Study using HBase and TSDB
HBaseCon 2015: Industrial Internet Case Study using HBase and TSDBHBaseCon 2015: Industrial Internet Case Study using HBase and TSDB
HBaseCon 2015: Industrial Internet Case Study using HBase and TSDB
 
The Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightThe Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsight
 

Viewers also liked (8)

Qubole State of the Big Data Industry
Qubole State of the Big Data IndustryQubole State of the Big Data Industry
Qubole State of the Big Data Industry
 
Running Spark on Cloud
Running Spark on CloudRunning Spark on Cloud
Running Spark on Cloud
 
State of Big Data Adoption
State of Big Data AdoptionState of Big Data Adoption
State of Big Data Adoption
 
Spark on Yarn
Spark on YarnSpark on Yarn
Spark on Yarn
 
Big Data Platform at Pinterest
Big Data Platform at PinterestBig Data Platform at Pinterest
Big Data Platform at Pinterest
 
7 Big Data Challenges and How to Overcome Them
7 Big Data Challenges and How to Overcome Them7 Big Data Challenges and How to Overcome Them
7 Big Data Challenges and How to Overcome Them
 
5 Factors Impacting Your Big Data Project's Performance
5 Factors Impacting Your Big Data Project's Performance 5 Factors Impacting Your Big Data Project's Performance
5 Factors Impacting Your Big Data Project's Performance
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 

Similar to Atlanta MLConf

Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
Pentaho
 

Similar to Atlanta MLConf (20)

Apache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsApache Spark in Scientific Applications
Apache Spark in Scientific Applications
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific Applciations
 
5 Comparing Microsoft Big Data Technologies for Analytics
5 Comparing Microsoft Big Data Technologies for Analytics5 Comparing Microsoft Big Data Technologies for Analytics
5 Comparing Microsoft Big Data Technologies for Analytics
 
Big data.ppt
Big data.pptBig data.ppt
Big data.ppt
 
Lecture1
Lecture1Lecture1
Lecture1
 
Big Data_Architecture.pptx
Big Data_Architecture.pptxBig Data_Architecture.pptx
Big Data_Architecture.pptx
 
CC -Unit4.pptx
CC -Unit4.pptxCC -Unit4.pptx
CC -Unit4.pptx
 
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
 
DoneDeal - AWS Data Analytics Platform
DoneDeal - AWS Data Analytics PlatformDoneDeal - AWS Data Analytics Platform
DoneDeal - AWS Data Analytics Platform
 
Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014
Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014
Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014
 
Simple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform ConceptSimple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform Concept
 
Solution Brief: Big Data Lab Accelerator
Solution Brief: Big Data Lab AcceleratorSolution Brief: Big Data Lab Accelerator
Solution Brief: Big Data Lab Accelerator
 
Lecture1 BIG DATA and Types of data in details
Lecture1 BIG DATA and Types of data in detailsLecture1 BIG DATA and Types of data in details
Lecture1 BIG DATA and Types of data in details
 
Fundamentals of big data analytics and Hadoop
Fundamentals of big data analytics and HadoopFundamentals of big data analytics and Hadoop
Fundamentals of big data analytics and Hadoop
 
20160331 sa introduction to big data pipelining berlin meetup 0.3
20160331 sa introduction to big data pipelining berlin meetup   0.320160331 sa introduction to big data pipelining berlin meetup   0.3
20160331 sa introduction to big data pipelining berlin meetup 0.3
 
5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer
 
Agile data warehousing
Agile data warehousingAgile data warehousing
Agile data warehousing
 
Tarun poladi resume
Tarun poladi resumeTarun poladi resume
Tarun poladi resume
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
Architecting Agile Data Applications for Scale
Architecting Agile Data Applications for ScaleArchitecting Agile Data Applications for Scale
Architecting Agile Data Applications for Scale
 

More from Qubole

Getting to 1.5M Ads/sec: How DataXu manages Big Data
Getting to 1.5M Ads/sec: How DataXu manages Big DataGetting to 1.5M Ads/sec: How DataXu manages Big Data
Getting to 1.5M Ads/sec: How DataXu manages Big Data
Qubole
 
Big dataproposal
Big dataproposalBig dataproposal
Big dataproposal
Qubole
 
Presto in the cloud
Presto in the cloudPresto in the cloud
Presto in the cloud
Qubole
 

More from Qubole (8)

Qubole presentation for the Cleveland Big Data and Hadoop Meetup
Qubole presentation for the Cleveland Big Data and Hadoop Meetup   Qubole presentation for the Cleveland Big Data and Hadoop Meetup
Qubole presentation for the Cleveland Big Data and Hadoop Meetup
 
BIPD Tech Tuesday Presentation - Qubole
BIPD Tech Tuesday Presentation - QuboleBIPD Tech Tuesday Presentation - Qubole
BIPD Tech Tuesday Presentation - Qubole
 
Getting to 1.5M Ads/sec: How DataXu manages Big Data
Getting to 1.5M Ads/sec: How DataXu manages Big DataGetting to 1.5M Ads/sec: How DataXu manages Big Data
Getting to 1.5M Ads/sec: How DataXu manages Big Data
 
Expert Big Data Tips
Expert Big Data TipsExpert Big Data Tips
Expert Big Data Tips
 
Big dataproposal
Big dataproposalBig dataproposal
Big dataproposal
 
Presto in the cloud
Presto in the cloudPresto in the cloud
Presto in the cloud
 
Basic Sentiment Analysis using Hive
Basic Sentiment Analysis using HiveBasic Sentiment Analysis using Hive
Basic Sentiment Analysis using Hive
 
Effective Hive Queries
Effective Hive QueriesEffective Hive Queries
Effective Hive Queries
 

Recently uploaded

一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
enxupq
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
yhkoc
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
ukgaet
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
StarCompliance.io
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
nscud
 
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
Domenico Conte
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
ewymefz
 

Recently uploaded (20)

一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
 
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPsWebinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
 
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
Slip-and-fall Injuries: Top Workers' Comp Claims
Slip-and-fall Injuries: Top Workers' Comp ClaimsSlip-and-fall Injuries: Top Workers' Comp Claims
Slip-and-fall Injuries: Top Workers' Comp Claims
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
 

Atlanta MLConf

  • 1. September 18, 2015 Jason Huang Senior Solutions Architect, Qubole Inc.
  • 2. Company Founding Qubole founders built the Facebook data platform. The Facebook model changed the role for data in an enterprise. • Needed to turn the data assets into a “utility” to make a viable business. – Collaborative: over 30% of employees use the data directly. – Accessible: developers, analysts, business analysts or business users all running queries. Has made the company more data driven and agile with data use. – Scalable: Exabyte's of data moving fast It took the founders a team of over 30 people to create this infrastructure and currently the team managing this infrastructure has more than 100 people. Work at Facebook inspired the founding of Qubole Operations Analyst Marketing Ops Analyst Data Architect Business Users Product Support Customer Support Developer Sales Ops Product Managers Data Infrastructure
  • 3. Impediments for an Aspiring Data Driven Enterprise Where Big Data falls short: • 6-18 month implementation time • Only 27% of Big Data initiatives are classified as “Successful” in 2014 Rigid and inflexible infrastructure Non adaptive software services Highly specialized systems Difficult to build and operate • Only 13% of organizations achieve full-scale production • 57% of organizations cite skills gap as a major inhibitor
  • 4. State of the Big Data Industry (n=417) 0% 10% 20% 30% 40% 50% 60% 70% 80% Hadoop MapReduce Pig Spark Storm Presto Cassandra HBase Hive
  • 5. • Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing. Analytic Libraries: • Spark Streaming (Streaming Data) • Spark SQL (Data Processing) • MLlib (Machine Learning) • GraphX (Graph Processing) Apache Spark
  • 6. • Streaming Data – Process streaming data with Spark built-in functions – Applications such as fraud detection and log processing – ETL via data ingestion • Machine Learning – Helps users run repeated queries and machine learning algorithms on data sets – MLlib can work in areas such as clustering, classification, and dimensionality reduction – Used for very common big data functions - predictive intelligence, customer segmentation, and sentiment analysis Common Spark Use Cases
  • 7. • Interactive Analysis – MapReduce was built to handle batch processing – SQL-on-Hadoop engines such as Hive or Pig can be too slow for interactive analysis – Spark is fast enough to perform exploratory queries without sampling – Provides multiple language-specific APIs including R, Python, Scala and Java. • Fog Computing – The Internet of Things - objects and devices with tiny embedded sensors that communicate with each other and users, creating a fully interconnected world – Decentralize data processing and storage and use Spark streaming analytics and interactive real time queries Common Spark Use Cases
  • 8. Use Spark for distributed computation: - Combine SparkSQL, GraphX along with MLlib in the same Spark program - Ability to use language of choice - python/scala/R/java - Extensive algorithms (http://spark.apache.org/docs/latest/ mllib-guide.html) Why Spark MLlib?
  • 9. • Classification and Regression: logistic regression, linear regression, linear support vector machine (SVM), naive Bayes, decision trees • Collaborative Filtering: alternating least squares (ALS) • Clustering: k-means, Gaussian mixture • Dimensionality Reduction: singular value decomposition (SVD), principal component analysis (PCA) Algorithms
  • 10. • Spark : Fast, Scalable and Flexible • R : Statistics, Packages and Plots SparkR combines both - very powerful Use SparkR API to take advantage of Spark, bring the data back into R - and do some machine learning, data visualization, etc. How about R? Use SparkR!
  • 11. What about the cloud? Central Governance & Security Internet Scale Instant Deployment Isolated Multi- tenancy Elastic Object Store Underpinnings
  • 12. • Zero configuration – Spark, SparkR, MLlib, GraphX, etc. all pre- installed on all cluster nodes – e.g. submit SparkR programs via a client-side API to an on- demand compute cluster • ETL (data cleansing, transformations, table joins, etc.) required prior to any ML modeling and analysis – e.g. Use other Big Data tools in order to prepare data – hive/hadoop/cascading/pig… Spark in the Cloud
  • 13. • Use AWS S3 object store to decouple compute and storage; scale processing power and storage capacity independently • S3 is highly available, reliable, scalable and cost effective • Elastic compute provides unlimited scale on-demand: calculations may require 10, 100 or 1,000+ compute nodes. • Ability to have multiple clusters – distinguish between teams, workloads, production, non-production R&D/test Spark in the Cloud
  • 14. Cloud object store for data sets: e.g. AWS S3:
  • 15. • Flexible compute resource options – High memory instances • AWS EC2 r3.* for high memory workloads to cache and manipulate large Spark RDDs – High CPU • AWS EC2 c3.* for CPU intensive workloads • Automatic cluster termination when idle • Periodically check for bad instances and remove them Spark in the Cloud
  • 16. • Install Spark on EC2 (HDFS if required) • Choose Spark backend cluster mode and configure it – Standalone – Yarn – Mesos • Spin up a cluster of instances CONFIDENTIAL. SUBJECT TO NDA PROVISIONS. DIY - Getting Started on the Cloud
  • 17. EC2 scripts can help: http://spark.apache.org/docs/latest/ec2-scripts.html - Helps spin up named clusters - Creates a security group, comes pre-baked with Spark installed CONFIDENTIAL. SUBJECT TO NDA PROVISIONS. DIY - Getting Started on the Cloud
  • 19. Qubole Case Study Qubole Case Study Operations Analyst Marketing Ops Analyst Data Architect Busines s Users Product Support Customer Support Developer Sales Ops Product Managers Ease of use for analysts • Dozens of Data Scientist and Analyst users • Produces double- digit TBs of data per day • Does not have dedicated staff to setup and manage clusters and Hadoop Distributions
  • 20. 0101 1010 1010 Qubole Case Study Qubole Case Study Producers Continuous Processing Storage Analytics CDN Real Time Bidding Retargeting Platform ETL Kinesis S3 Redshift Machine LearningStreaming Customer Data Why Spark? 0101 1010 1010 0101 1010 1010 0101 1010 1010 “Qubole put our cluster management, auto-scaling and ad-hoc queries on autopilot. Its higher performance for Big Data queries translates directly into faster and more actionable marketing intelligence for our customers.” Yekesa Kosuru VP, Technology