© Cloudera, Inc. All rights reserved.
Large-Scale Data Science on Hadoop
Uri Laserson | Data Scientist | @laserson
About the speaker
• Data Scientist at Cloudera
• PhD in BME at MIT/Harvard
• Committer on ADAM, impyla
• Co-author on Advanced Analytics with Spark
• laserson@cloudera.com
What is a data scientist?
What is data science?
Phase 1. Collect Data
Phase 2. Data Science?
Phase 3. Profit!
Some things you might do as a data scientist
• Data quality issues
• Data formats/versions
• Data source integration
• Exploration/visualization
• Building/deploying models
[Diagram: Plumbing, Exploratory, Operational]
1. Data science is data plumbing
Example:
• Sells deep analysis of huge satellite images
• Easy: C++ to analyze images
• Hard: continuously, reliably ingesting and transforming
• Expensive: storing, computing
• Hadoop as the data science plumber
2. Data science is investigative analytics
Example: large UK retailer
• Customer Churn
  • SAS, Hive
• Path Analysis
  • Giraph, MapReduce
• Customer Segmentation
  • SAS, Spotfire, Impala
• Hadoop as one hub for investigative tools
  • Avoid buying, training for N new tools
3. Data science is operational analytics
Example:
• Real-time Search, ML over Patient Data
• MapReduce for indexing, learning
• HBase for storage and fast access
• Storm for incremental update
• RDBMS for recent derived data
• API façade for input and querying learning
Engineering
Machine Learning
Factors to consider when choosing your tools
• Single-node performance
• Scalability
• Language and tooling familiarity
• Integration with Hadoop
• Libraries / functions / richness of ecosystem
• Integration with data prep / ETL workflows
[Tool logos: Pattern, JPMML]
Plumbing in a nutshell
• Apache Kafka
• Apache Pig
• Apache Crunch
Serialization/RPC frameworks
• Specify schemas/services in user-friendly IDLs
• Code generation to multiple languages (wire-compatible/portable)
• Compact, binary formats
• Natural support for schema evolution
• Multiple implementations:
  • Apache Thrift, Apache Avro, Google’s Protocol Buffers
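// Thrift IDL example. Location and TweetSearchResult are not shown on the
// slide and are assumed to be defined elsewhere in the same file.
struct Tweet {
    1: required i32 userId;
    2: required string userName;
    3: required string text;
    4: optional Location loc;                   // nested struct, may be absent
    16: optional string language = "english";   // optional field with a default
}

service Twitter {
    void ping();
    bool postTweet(1: Tweet tweet);
    TweetSearchResult searchTweets(1: string query);
}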
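The schema-evolution point can be illustrated with a toy tagged binary encoding. This is a minimal sketch in Python, not Thrift's actual wire format: the functions `encode` and `decode`, the one-byte tag, and the length-prefixed layout are all assumptions made for illustration. It shows why numbered fields let an older reader skip a field it does not know, and let a newer reader fall back to a default when a field is absent.

```python
import struct

# Toy wire format (an assumption for illustration, NOT Thrift's real encoding):
# each field is  tag (1 byte) | payload length (4 bytes, big-endian) | payload


def encode(fields):
    """Encode a {tag: bytes} mapping into one binary record."""
    out = bytearray()
    for tag, payload in sorted(fields.items()):
        out += struct.pack(">BI", tag, len(payload))
        out += payload
    return bytes(out)


def decode(data, schema, defaults=None):
    """Decode a record using schema {tag: field_name}; unknown tags are skipped."""
    record = dict(defaults or {})
    pos = 0
    while pos < len(data):
        tag, length = struct.unpack_from(">BI", data, pos)
        pos += 5  # size of the tag + length header
        payload = data[pos:pos + length]
        pos += length
        if tag in schema:  # known field: keep it
            record[schema[tag]] = payload.decode("utf-8")
        # unknown tag: written by a newer schema, silently ignored
    return record


# A newer writer knows about field 16 (language); an older reader does not.
new_record = encode({2: b"laserson", 3: b"hello", 16: b"english"})
old_schema = {2: "userName", 3: "text"}
print(decode(new_record, old_schema))

# An older writer omits field 16; a newer reader falls back to the default.
old_record = encode({2: b"laserson", 3: b"hello"})
new_schema = {2: "userName", 3: "text", 16: "language"}
print(decode(old_record, new_schema, defaults={"language": "english"}))
```

Real frameworks add typed payloads and more compact integer encodings, but the tag-then-skip idea is the part that makes old and new code wire-compatible.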
Log- and service-oriented architecture
http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
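One way to picture the log-centric design described in the linked article is an append-only sequence that producers write to and that each consumer reads at its own pace. The sketch below is a toy in-memory model, not Kafka's API; the `Log` class, its `append` and `read` methods, and the consumer names are hypothetical.

```python
class Log:
    """Toy append-only log: records get increasing offsets and are never modified."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Append a record and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records records starting at the given offset."""
        return self._records[offset:offset + max_records]


log = Log()
for event in ["signup:alice", "click:alice", "signup:bob"]:
    log.append(event)

# Each consumer tracks its own offset, so a slow consumer never blocks a fast one.
offsets = {"search_indexer": 0, "hadoop_loader": 0}

batch = log.read(offsets["search_indexer"], max_records=2)
offsets["search_indexer"] += len(batch)

everything = log.read(offsets["hadoop_loader"])
offsets["hadoop_loader"] += len(everything)

print(offsets)
```

Because the log itself never changes, a new downstream service can be added later and replay from offset 0 without touching the producers.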
Factory (operational) vs. Laboratory (exploratory)
Factory (operational) | Laboratory (exploratory)
Programming languages | Statistical environments, BI tools
Systems languages | High-level languages
Latency, throughput | Accuracy
Huge data | Medium-sized data
Online problems | Offline work
Automated | Ad-hoc
Developers, Engineers | Statisticians, Analysts
Exploratory analytics
• Offline
• Statistical Environment
• Discovery-phase
• Model Building and Tuning
• Accuracy Important
• Medium-scale
• Visualizations
Exploratory: BI/visualization
• Nothing Hadoop-specific
• Take your pick of any third-party tool
• Typically connects to Hadoop via a SQL interface with Impala
Exploratory: SAS
• Connects to Hadoop data stores
• Can push down some computation to cluster, but requires data movement
• Mature and widely used; large algorithm library
• Ongoing collaborative engineering effort with Cloudera
Exploratory: Python
• Python and JVM don’t play nice
• Hadoop Streaming / mrjob / scikit-learn
• Impyla: Python UDFs on Impala
• PySpark: Spark API in Python
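As a minimal sketch of the Hadoop Streaming route: Streaming runs any executable as the mapper or reducer, feeding it lines on stdin and reading tab-separated key/value lines from stdout, with reducer input sorted by key. The word-count task and the function names `mapper` and `reducer` are illustrative assumptions. In a real job each function would live in its own script looping over `sys.stdin`; here they take iterables so the logic can be run locally.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a Streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum counts per key; Streaming delivers reducer input sorted by key."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local simulation of map -> shuffle/sort -> reduce.
text = ["Hadoop as the plumber", "hadoop as one hub"]
mapped = sorted(mapper(text))
result = list(reducer(mapped))
print(result)
```

Libraries such as mrjob wrap this same stdin/stdout contract in a class-based API, which is why they appear on the same bullet.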
Operational analytics
• Online
• Real-Time
• Cluster Environment
• Model Serving, Update
• QPS, Latency Important
• Large Scale
[Tool logos: Pattern, JPMML]
Operational: MLlib (Spark)
• Model building on Spark
• Fast (distributed in-memory)
• Basic algorithms only
  • LR, SVM, decision tree
  • PCA, SVD
  • K-means
  • ALS
• Easy integration with Spark-as-ETL
GROUPBY integration with Hadoop
Read Hadoop data vs. Requires data movement
GROUPBY integration with Hadoop
YARN-managed vs. Outside
GROUPBY open source
Open source vs. Closed source
GROUPBY active community
Active community vs. Not
Languages
Java, Python, R, Scala
Apache Spark
• Next-generation general processing engine for Hadoop
• APIs in Python, Java, Scala (and early R)
• DAG execution / in-memory
• Interactive REPL
• Batch or streaming
• MLlib, GraphX
• Active community
• Scala-like API
Large scale or real-time?
Large-Scale, Offline, Batch
vs.
Real-Time, Online, Streaming
Why Don’t We Have Both?
λ!
Lambda architecture
• Tackle in 3 layers
• Batch layer: offline, big model build
• Speed layer: near-real-time, approximate update
• Serving layer: real-time model query / scoring
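The three layers can be sketched with a toy counting model. This is an illustrative assumption, not how Oryx or any particular system implements the idea: the batch layer rebuilds a complete model from all stored events, the speed layer keeps a cheap incremental delta for events that arrived since the last rebuild, and the serving layer merges both to answer queries. All function names are hypothetical.

```python
from collections import Counter


def batch_build(all_events):
    """Batch layer: offline, full rebuild of the model from the complete history."""
    return Counter(all_events)


def speed_update(delta, event):
    """Speed layer: near-real-time incremental update for one new event."""
    delta[event] += 1


def serve(batch_model, delta, key):
    """Serving layer: answer a query by combining the batch model and recent delta."""
    return batch_model[key] + delta[key]


history = ["fleet foxes", "the national", "fleet foxes"]
batch_model = batch_build(history)      # e.g. rebuilt on a schedule

delta = Counter()                       # reset after each batch rebuild
speed_update(delta, "fleet foxes")      # events arriving between rebuilds
speed_update(delta, "shearwater")

print(serve(batch_model, delta, "fleet foxes"))
print(serve(batch_model, delta, "shearwater"))
```

The design choice is that the speed layer only has to be approximately right for a short window, because the next batch rebuild absorbs its events and replaces the delta.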
PMML
• Predictive Modeling Markup Language
• XML-based format for predictive models
• Standardized by Data Mining Group (www.dmg.org)
• Wide tool support
<PMML xmlns="http://www.dmg.org/PMML-4_1" version="4.1">
  <Header copyright="www.dmg.org"/>
  <DataDictionary numberOfFields="5">
    <DataField name="temperature" optype="continuous" dataType="double"/>
    …
  </DataDictionary>
  <TreeModel modelName="golfing" functionName="classification">
    <MiningSchema>
      <MiningField name="temperature"/>
      …
    </MiningSchema>
    <Node score="will play">
      <Node score="will play">
        <SimplePredicate field="outlook" operator="equal" value="sunny"/>
        …
      </Node>
    </Node>
  </TreeModel>
</PMML>
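Because PMML is plain XML, a model file can be inspected with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` on a small, complete document written for this example (the slide's own document is abbreviated, so the second field and the `<True/>` predicate here are assumptions). It lists the data fields and collects the tree's predicates; it only reads the structure and is not a PMML scoring engine.

```python
import xml.etree.ElementTree as ET

# Small hypothetical PMML document in the style of the slide's example.
PMML_DOC = """<PMML xmlns="http://www.dmg.org/PMML-4_1" version="4.1">
  <Header copyright="www.dmg.org"/>
  <DataDictionary numberOfFields="2">
    <DataField name="temperature" optype="continuous" dataType="double"/>
    <DataField name="outlook" optype="categorical" dataType="string"/>
  </DataDictionary>
  <TreeModel modelName="golfing" functionName="classification">
    <MiningSchema>
      <MiningField name="temperature"/>
      <MiningField name="outlook"/>
    </MiningSchema>
    <Node score="will play">
      <True/>
      <Node score="will play">
        <SimplePredicate field="outlook" operator="equal" value="sunny"/>
      </Node>
    </Node>
  </TreeModel>
</PMML>"""

PMML_NS = "http://www.dmg.org/PMML-4_1"
NS = {"pmml": PMML_NS}
root = ET.fromstring(PMML_DOC)

# Field names and types declared in the data dictionary.
fields = {f.get("name"): f.get("dataType")
          for f in root.findall("pmml:DataDictionary/pmml:DataField", NS)}

# Every simple predicate in the tree, as (field, operator, value) triples.
predicates = [(p.get("field"), p.get("operator"), p.get("value"))
              for p in root.iter(f"{{{PMML_NS}}}SimplePredicate")]

model_name = root.find("pmml:TreeModel", NS).get("modelName")
print(model_name, fields, predicates)
```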
Lambda implementation: Oryx 2.x
• Generic lambda-architecture platform
• With ML specializations
• Hyperparameter selection
• Built on Spark Streaming, Kafka
• With Intel
• 2.x: pre-alpha
github.com/OryxProject/oryx
HTTP REST API
• Convention for RPC-like request / response
• HTTP verbs, transport
  • GET : query
  • POST : add input
• Easy from browser, CLI, Java, Python, Scala, etc.
GET /recommend/jwills
HTTP/1.1 200 OK
Content-Type: text/plain
"Ray LaMontagne",0.951
"Fleet Foxes",0.7905
"The National",0.688
"Shearwater",0.3017
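The response body in the recommendation example is plain comma-separated text, so a client needs very little code to consume it. A minimal sketch using Python's standard `csv` module follows; the function name `parse_recommendations` is hypothetical, and the body is hard-coded from the slide rather than fetched from a live server.

```python
import csv
import io


def parse_recommendations(body):
    """Parse lines of '"item",score' into a list of (item, float score) tuples."""
    rows = csv.reader(io.StringIO(body))
    return [(item, float(score)) for item, score in rows]


# Response body copied from the slide's example.
body = '''"Ray LaMontagne",0.951
"Fleet Foxes",0.7905
"The National",0.688
"Shearwater",0.3017
'''

recs = parse_recommendations(body)
top_item, top_score = recs[0]
print(top_item, top_score)
```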
Thank you
laserson@cloudera.com

More Related Content

What's hot

Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkDataWorks Summit
 
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...DataWorks Summit
 
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemWhy Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemCloudera, Inc.
 
Real-time Hadoop: The Ideal Messaging System for Hadoop
Real-time Hadoop: The Ideal Messaging System for Hadoop Real-time Hadoop: The Ideal Messaging System for Hadoop
Real-time Hadoop: The Ideal Messaging System for Hadoop DataWorks Summit/Hadoop Summit
 
Genome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAMGenome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAMAllen Day, PhD
 
Deep Learning with Spark and GPUs
Deep Learning with Spark and GPUsDeep Learning with Spark and GPUs
Deep Learning with Spark and GPUsDataWorks Summit
 
Applied Deep Learning with Spark and Deeplearning4j
Applied Deep Learning with Spark and Deeplearning4jApplied Deep Learning with Spark and Deeplearning4j
Applied Deep Learning with Spark and Deeplearning4jDataWorks Summit
 
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and SparkFiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and SparkEvan Chan
 
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on HiveFaster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on HiveDataWorks Summit/Hadoop Summit
 
Large Scale Graph Analytics with JanusGraph
Large Scale Graph Analytics with JanusGraphLarge Scale Graph Analytics with JanusGraph
Large Scale Graph Analytics with JanusGraphP. Taylor Goetz
 
Bringing complex event processing to Spark streaming
Bringing complex event processing to Spark streamingBringing complex event processing to Spark streaming
Bringing complex event processing to Spark streamingDataWorks Summit
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowDataWorks Summit
 
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...Spark Summit
 
Functional Programming and Big Data
Functional Programming and Big DataFunctional Programming and Big Data
Functional Programming and Big DataDataWorks Summit
 

What's hot (20)

LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
 
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
 
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemWhy Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
 
Real-time Hadoop: The Ideal Messaging System for Hadoop
Real-time Hadoop: The Ideal Messaging System for Hadoop Real-time Hadoop: The Ideal Messaging System for Hadoop
Real-time Hadoop: The Ideal Messaging System for Hadoop
 
Genome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAMGenome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAM
 
Time-oriented event search. A new level of scale
Time-oriented event search. A new level of scale Time-oriented event search. A new level of scale
Time-oriented event search. A new level of scale
 
Deep Learning with Spark and GPUs
Deep Learning with Spark and GPUsDeep Learning with Spark and GPUs
Deep Learning with Spark and GPUs
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
Apache Spark Briefing
Apache Spark BriefingApache Spark Briefing
Apache Spark Briefing
 
Applied Deep Learning with Spark and Deeplearning4j
Applied Deep Learning with Spark and Deeplearning4jApplied Deep Learning with Spark and Deeplearning4j
Applied Deep Learning with Spark and Deeplearning4j
 
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and SparkFiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
 
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on HiveFaster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
 
Large Scale Graph Analytics with JanusGraph
Large Scale Graph Analytics with JanusGraphLarge Scale Graph Analytics with JanusGraph
Large Scale Graph Analytics with JanusGraph
 
Bringing complex event processing to Spark streaming
Bringing complex event processing to Spark streamingBringing complex event processing to Spark streaming
Bringing complex event processing to Spark streaming
 
Hadoop 3 in a Nutshell
Hadoop 3 in a NutshellHadoop 3 in a Nutshell
Hadoop 3 in a Nutshell
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
 
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
 
Running Spark in Production
Running Spark in ProductionRunning Spark in Production
Running Spark in Production
 
Functional Programming and Big Data
Functional Programming and Big DataFunctional Programming and Big Data
Functional Programming and Big Data
 

Viewers also liked

Transform Banking with Big Data and Automated Machine Learning 9.12.17
Transform Banking with Big Data and Automated Machine Learning 9.12.17Transform Banking with Big Data and Automated Machine Learning 9.12.17
Transform Banking with Big Data and Automated Machine Learning 9.12.17Cloudera, Inc.
 
Put Alternative Data to Use in Capital Markets

Put Alternative Data to Use in Capital Markets
Put Alternative Data to Use in Capital Markets

Put Alternative Data to Use in Capital Markets
Cloudera, Inc.
 
Non-geek's big data playbook - Hadoop & EDW - SAS Best Practices
Non-geek's big data playbook - Hadoop & EDW - SAS Best PracticesNon-geek's big data playbook - Hadoop & EDW - SAS Best Practices
Non-geek's big data playbook - Hadoop & EDW - SAS Best PracticesJyrki Määttä
 
The Big Picture: Real-time Data is Defining Intelligent Offers
The Big Picture: Real-time Data is Defining Intelligent OffersThe Big Picture: Real-time Data is Defining Intelligent Offers
The Big Picture: Real-time Data is Defining Intelligent OffersCloudera, Inc.
 
Cloudera Customer Success Story
Cloudera Customer Success StoryCloudera Customer Success Story
Cloudera Customer Success StoryXpand IT
 
IoT - Data Management Trends, Best Practices, & Use Cases
IoT - Data Management Trends, Best Practices, & Use CasesIoT - Data Management Trends, Best Practices, & Use Cases
IoT - Data Management Trends, Best Practices, & Use CasesCloudera, Inc.
 
How to Build Multi-disciplinary Analytics Applications on a Shared Data Platform
How to Build Multi-disciplinary Analytics Applications on a Shared Data PlatformHow to Build Multi-disciplinary Analytics Applications on a Shared Data Platform
How to Build Multi-disciplinary Analytics Applications on a Shared Data PlatformCloudera, Inc.
 
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Cloudera, Inc.
 
Webinar - Sehr empfehlenswert: wie man aus Daten durch maschinelles Lernen We...
Webinar - Sehr empfehlenswert: wie man aus Daten durch maschinelles Lernen We...Webinar - Sehr empfehlenswert: wie man aus Daten durch maschinelles Lernen We...
Webinar - Sehr empfehlenswert: wie man aus Daten durch maschinelles Lernen We...Cloudera, Inc.
 

Viewers also liked (9)

Transform Banking with Big Data and Automated Machine Learning 9.12.17
Transform Banking with Big Data and Automated Machine Learning 9.12.17Transform Banking with Big Data and Automated Machine Learning 9.12.17
Transform Banking with Big Data and Automated Machine Learning 9.12.17
 
Put Alternative Data to Use in Capital Markets

Put Alternative Data to Use in Capital Markets
Put Alternative Data to Use in Capital Markets

Put Alternative Data to Use in Capital Markets

 
Non-geek's big data playbook - Hadoop & EDW - SAS Best Practices
Non-geek's big data playbook - Hadoop & EDW - SAS Best PracticesNon-geek's big data playbook - Hadoop & EDW - SAS Best Practices
Non-geek's big data playbook - Hadoop & EDW - SAS Best Practices
 
The Big Picture: Real-time Data is Defining Intelligent Offers
The Big Picture: Real-time Data is Defining Intelligent OffersThe Big Picture: Real-time Data is Defining Intelligent Offers
The Big Picture: Real-time Data is Defining Intelligent Offers
 
Cloudera Customer Success Story
Cloudera Customer Success StoryCloudera Customer Success Story
Cloudera Customer Success Story
 
IoT - Data Management Trends, Best Practices, & Use Cases
IoT - Data Management Trends, Best Practices, & Use CasesIoT - Data Management Trends, Best Practices, & Use Cases
IoT - Data Management Trends, Best Practices, & Use Cases
 
How to Build Multi-disciplinary Analytics Applications on a Shared Data Platform
How to Build Multi-disciplinary Analytics Applications on a Shared Data PlatformHow to Build Multi-disciplinary Analytics Applications on a Shared Data Platform
How to Build Multi-disciplinary Analytics Applications on a Shared Data Platform
 
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
 
Webinar - Sehr empfehlenswert: wie man aus Daten durch maschinelles Lernen We...
Webinar - Sehr empfehlenswert: wie man aus Daten durch maschinelles Lernen We...Webinar - Sehr empfehlenswert: wie man aus Daten durch maschinelles Lernen We...
Webinar - Sehr empfehlenswert: wie man aus Daten durch maschinelles Lernen We...
 

Similar to Large-Scale Data Science on Hadoop Using Lambda Architecture

Data Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopData Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopCloudera, Inc.
 
PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015Cloudera, Inc.
 
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud Stefan Lipp
 
PyData: The Next Generation
PyData: The Next GenerationPyData: The Next Generation
PyData: The Next GenerationWes McKinney
 
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesHadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesCloudera, Inc.
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaSwiss Big Data User Group
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impalahuguk
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Cloudera, Inc.
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform WebinarCloudera, Inc.
 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 
Cloudera, Inc.
 
Consolidate your data marts for fast, flexible analytics 5.24.18
Consolidate your data marts for fast, flexible analytics 5.24.18Consolidate your data marts for fast, flexible analytics 5.24.18
Consolidate your data marts for fast, flexible analytics 5.24.18Cloudera, Inc.
 
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinneyIbis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinneyHakka Labs
 
Part 3: Models in Production: A Look From Beginning to End
Part 3: Models in Production: A Look From Beginning to EndPart 3: Models in Production: A Look From Beginning to End
Part 3: Models in Production: A Look From Beginning to EndCloudera, Inc.
 
Cloudera Big Data Integration Speedpitch at TDWI Munich June 2017
Cloudera Big Data Integration Speedpitch at TDWI Munich June 2017Cloudera Big Data Integration Speedpitch at TDWI Munich June 2017
Cloudera Big Data Integration Speedpitch at TDWI Munich June 2017Stefan Lipp
 
Data Science and Machine Learning for the Enterprise
Data Science and Machine Learning for the EnterpriseData Science and Machine Learning for the Enterprise
Data Science and Machine Learning for the EnterpriseCloudera, Inc.
 
End to End Streaming Architectures
End to End Streaming ArchitecturesEnd to End Streaming Architectures
End to End Streaming ArchitecturesCloudera, Inc.
 
Simplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache KuduSimplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache KuduCloudera, Inc.
 
Data Science and CDSW
Data Science and CDSWData Science and CDSW
Data Science and CDSWJason Hubbard
 
An Incomplete Data Tools Landscape for Hackers in 2015
An Incomplete Data Tools Landscape for Hackers in 2015An Incomplete Data Tools Landscape for Hackers in 2015
An Incomplete Data Tools Landscape for Hackers in 2015Wes McKinney
 
DataFrames: The Extended Cut
DataFrames: The Extended CutDataFrames: The Extended Cut
DataFrames: The Extended CutWes McKinney
 

Similar to Large-Scale Data Science on Hadoop Using Lambda Architecture (20)

Data Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopData Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache Hadoop
 
PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015
 
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
 
PyData: The Next Generation
PyData: The Next GenerationPyData: The Next Generation
PyData: The Next Generation
 
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesHadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform Webinar
 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 

 
Consolidate your data marts for fast, flexible analytics 5.24.18
Consolidate your data marts for fast, flexible analytics 5.24.18Consolidate your data marts for fast, flexible analytics 5.24.18
Consolidate your data marts for fast, flexible analytics 5.24.18
 
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinneyIbis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
 
Part 3: Models in Production: A Look From Beginning to End
Part 3: Models in Production: A Look From Beginning to EndPart 3: Models in Production: A Look From Beginning to End
Part 3: Models in Production: A Look From Beginning to End
 
Cloudera Big Data Integration Speedpitch at TDWI Munich June 2017
Cloudera Big Data Integration Speedpitch at TDWI Munich June 2017Cloudera Big Data Integration Speedpitch at TDWI Munich June 2017
Cloudera Big Data Integration Speedpitch at TDWI Munich June 2017
 
Data Science and Machine Learning for the Enterprise
Data Science and Machine Learning for the EnterpriseData Science and Machine Learning for the Enterprise
Data Science and Machine Learning for the Enterprise
 
End to End Streaming Architectures
End to End Streaming ArchitecturesEnd to End Streaming Architectures
End to End Streaming Architectures
 
Simplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache KuduSimplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache Kudu
 
Data Science and CDSW
Data Science and CDSWData Science and CDSW
Data Science and CDSW
 
An Incomplete Data Tools Landscape for Hackers in 2015
An Incomplete Data Tools Landscape for Hackers in 2015An Incomplete Data Tools Landscape for Hackers in 2015
An Incomplete Data Tools Landscape for Hackers in 2015
 
DataFrames: The Extended Cut
DataFrames: The Extended CutDataFrames: The Extended Cut
DataFrames: The Extended Cut
 

More from Uri Laserson

Petascale Genomics (Strata Singapore 20151203)
Petascale Genomics (Strata Singapore 20151203)Petascale Genomics (Strata Singapore 20151203)
Petascale Genomics (Strata Singapore 20151203)Uri Laserson
 
Genomics Is Not Special: Towards Data Intensive Biology
Genomics Is Not Special: Towards Data Intensive BiologyGenomics Is Not Special: Towards Data Intensive Biology
Genomics Is Not Special: Towards Data Intensive BiologyUri Laserson
 
APIs and Synthetic Biology
APIs and Synthetic BiologyAPIs and Synthetic Biology
APIs and Synthetic BiologyUri Laserson
 
Numba-compiled Python UDFs for Impala (Impala Meetup 5/20/14)
Numba-compiled Python UDFs for Impala (Impala Meetup 5/20/14)Numba-compiled Python UDFs for Impala (Impala Meetup 5/20/14)
Numba-compiled Python UDFs for Impala (Impala Meetup 5/20/14)Uri Laserson
 
Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)Uri Laserson
 
Hadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreHadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreUri Laserson
 
Hadoop ecosystem for health/life sciences
Hadoop ecosystem for health/life sciencesHadoop ecosystem for health/life sciences
Hadoop ecosystem for health/life sciencesUri Laserson
 

More from Uri Laserson (7)

Petascale Genomics (Strata Singapore 20151203)
Petascale Genomics (Strata Singapore 20151203)Petascale Genomics (Strata Singapore 20151203)
Petascale Genomics (Strata Singapore 20151203)
 
Genomics Is Not Special: Towards Data Intensive Biology
Genomics Is Not Special: Towards Data Intensive BiologyGenomics Is Not Special: Towards Data Intensive Biology
Genomics Is Not Special: Towards Data Intensive Biology
 
APIs and Synthetic Biology
APIs and Synthetic BiologyAPIs and Synthetic Biology
APIs and Synthetic Biology
 
Numba-compiled Python UDFs for Impala (Impala Meetup 5/20/14)
Numba-compiled Python UDFs for Impala (Impala Meetup 5/20/14)Numba-compiled Python UDFs for Impala (Impala Meetup 5/20/14)
Numba-compiled Python UDFs for Impala (Impala Meetup 5/20/14)
 
Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)
 
Hadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreHadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant Store
 
Hadoop ecosystem for health/life sciences
Hadoop ecosystem for health/life sciencesHadoop ecosystem for health/life sciences
Hadoop ecosystem for health/life sciences
 


Large-Scale Data Science on Hadoop Using Lambda Architecture

  • 1. Large-Scale Data Science on Hadoop
      Uri Laserson | Data Scientist | @laserson
  • 2. About the speaker
      • Data Scientist at Cloudera
      • PhD in BME at MIT/Harvard
      • Committer on ADAM, impyla
      • Co-author of Advanced Analytics with Spark
      • laserson@cloudera.com
  • 3. What is a data scientist?
  • 4. What is a data scientist?
  • 5. What is a data scientist?
  • 6. What is data science?
  • 7. What is data science?
      • Phase 1. Collect Data
      • Phase 2. Data Science?
      • Phase 3. Profit!
  • 8. Some things you might do as a data scientist
      • Data quality issues
      • Data formats/versions
      • Data source integration
      • Exploration/visualization
      • Building/deploying models
  • 9. Plumbing | Exploratory | Operational
  • 10. 1. Data science is data plumbing
  • 11. Example:
      • Sells deep analysis of huge satellite images
      • Easy: C++ to analyze images
      • Hard: continuously, reliably ingesting and transforming
      • Expensive: storing, computing
      • Hadoop as the data science plumber
  • 12. 2. Data science is investigative analytics
  • 13. Example: large UK retailer
      • Customer Churn: SAS, Hive
      • Path Analysis: Giraph, MapReduce
      • Customer Segmentation: SAS, Spotfire, Impala
      • Hadoop as one hub for investigative tools
      • Avoid buying and training for N new tools
  • 14. 3. Data science is operational analytics
  • 15. Example:
      • Real-time Search, ML over Patient Data
      • MapReduce for indexing, learning
      • HBase for storage and fast access
      • Storm for incremental update
      • RDBMS for recent derived data
      • API façade for input and querying
      [Diagram labels: learning, Engineering, Machine Learning]
  • 16. Plumbing | Exploratory | Operational
  • 17. Factors to consider when choosing your tools
      • Single-node performance
      • Scalability
      • Language and tooling familiarity
      • Integration with Hadoop
      • Libraries / functions / richness of ecosystem
      • Integration with data prep / ETL workflows
      [Slide labels: Pattern, JPMML]
  • 18. Plumbing in a nutshell
      [Slide labels: Plumbing, Apache Kafka, Apache Pig, Apache Crunch]
  • 19. Serialization/RPC frameworks
      • Specify schemas/services in user-friendly IDLs
      • Code-generation to multiple languages (wire-compatible/portable)
      • Compact, binary formats
      • Natural support for schema evolution
      • Multiple implementations: Apache Thrift, Apache Avro, Google’s Protocol Buffers

      service Twitter {
        void ping();
        bool postTweet(1:Tweet tweet);
        TweetSearchResult searchTweets(1:string query);
      }

      struct Tweet {
        1: required i32 userId;
        2: required string userName;
        3: required string text;
        4: optional Location loc;
        16: optional string language = "english"
      }
  • 20. Log and service oriented architecture
      http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
  • 21. Log and service oriented architecture
      http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
  • 22. Factory (operational) vs. Laboratory (exploratory)
      • Factory (operational): programming languages; systems languages; latency, throughput; huge data; online problems; automated; developers, engineers
      • Laboratory (exploratory): statistical environments, BI tools; high-level languages; accuracy; medium-sized data; offline work; ad-hoc; statisticians, analysts
  • 23. Exploratory analytics
      • Offline
      • Statistical Environment
      • Discovery-phase
      • Model Building and Tuning
      • Accuracy Important
      • Medium-scale
      • Visualizations
      [Slide label: Exploratory]
  • 24. Exploratory: BI/visualization
      • Nothing Hadoop-specific
      • Take your pick of any 3rd party tool
      • Typically connects to Hadoop via SQL interface with Impala
  • 25. Exploratory: SAS
      • Connects to Hadoop data stores
      • Can push down some computation to cluster, but requires data movement
      • Mature and widely used; large algo library
      • Ongoing collaborative engineering effort with Cloudera
  • 26. Exploratory: Python
      • Python and JVM don’t play nice
      • Hadoop Streaming / mrjob / scikit-learn
      • Impyla: Python UDFs on Impala
      • PySpark: Spark API in Python
  • 27. Operational analytics
      • Online
      • Real-Time
      • Cluster Environment
      • Model Serving, Update
      • QPS, Latency Important
      • Large Scale
      [Slide labels: Operational, Pattern, JPMML]
  • 28. Operational: MLlib (Spark)
      • Model building on Spark
      • Fast (distributed in-memory)
      • Basic algorithms only
          • LR, SVM, decision tree
          • PCA, SVD
          • K-means
          • ALS
      • Easy integration with Spark-as-ETL
  • 29. GROUPBY integration with Hadoop
      • Read Hadoop data
      • Requires data movement
  • 30. GROUPBY integration with Hadoop
      • YARN-managed
      • Outside
  • 31. GROUPBY open source
      • Open source
      • Closed source
  • 32. GROUPBY active community
      • Active community
      • Not
  • 33. Languages
      • Java
      • Python
      • R
      • Scala
  • 34.
      • Next-generation general processing engine for Hadoop
      • APIs in Python, Java, Scala (and early R)
      • DAG execution / in-memory
      • Interactive REPL
      • Batch or streaming
      • MLlib, GraphX
      • Active community
      • Scala-like API
  • 35. Large scale or real-time?
      • Large-Scale, Offline, Batch
      • vs.
      • Real-Time, Online, Streaming
  • 36. Large scale or real-time?
      • Large-Scale, Offline, Batch
      • vs.
      • Real-Time, Online, Streaming
      • Why Don’t We Have Both? λ!
  • 37. Lambda architecture
      • Tackle in 3 Layers
      • Batch Layer: offline, big model build
      • Speed Layer: near-real-time, approximate update
      • Serving Layer: real-time model query / scoring
  • 38. PMML
      • Predictive Modeling Markup Language
      • XML-based format for predictive models
      • Standardized by Data Mining Group (www.dmg.org)
      • Wide tool support

      <PMML xmlns="http://www.dmg.org/PMML-4_1" version="4.1">
        <Header copyright="www.dmg.org"/>
        <DataDictionary numberOfFields="5">
          <DataField name="temperature" optype="continuous" dataType="double"/>
          …
        </DataDictionary>
        <TreeModel modelName="golfing" functionName="classification">
          <MiningSchema>
            <MiningField name="temperature"/>
            …
          </MiningSchema>
          <Node score="will play">
            <Node score="will play">
              <SimplePredicate field="outlook" operator="equal" value="sunny"/>
              …
            </Node>
          </Node>
        </TreeModel>
      </PMML>
  • 39. Lambda implementation: Oryx 2.x
      • Generic lambda-architecture platform
      • With ML specializations
      • hyperparam selection
      • Built on Spark Streaming, Kafka
      • With Intel
      • 2.x: pre-alpha
      github.com/OryxProject/oryx
  • 40. Lambda implementation: Oryx 2.x
      github.com/OryxProject/oryx
  • 41. HTTP REST API
      • Convention for RPC-like request / response
      • HTTP verbs, transport
      • GET : query
      • POST : add input
      • Easy from browser, CLI, Java, Python, Scala, etc.

      GET /recommend/jwills

      HTTP/1.1 200 OK
      Content-Type: text/plain

      "Ray LaMontagne",0.951
      "Fleet Foxes",0.7905
      "The National",0.688
      "Shearwater",0.3017
  • 42. Thank you
      laserson@cloudera.com
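
The three layers named on slide 37 can be sketched in a few lines of plain Python. This is only an illustrative toy, not how Oryx or any real system implements them: the "model" is assumed to be a simple event counter, and the names `build_batch_model`, `SpeedLayer`, and `serve` are hypothetical.

```python
from collections import Counter


def build_batch_model(history):
    """Batch layer: offline, full rebuild of the model from all past events."""
    return Counter(history)


class SpeedLayer:
    """Speed layer: cheap incremental update for events since the last batch build."""

    def __init__(self):
        self.delta = Counter()

    def update(self, event):
        self.delta[event] += 1

    def reset(self):
        # Called once a fresh batch model has absorbed the recent events.
        self.delta.clear()


def serve(batch_model, speed_layer, key):
    """Serving layer: answer a query by merging the batch model with the recent delta."""
    return batch_model[key] + speed_layer.delta[key]


# Example: two historical events for "a", then one more arrives in real time.
batch_model = build_batch_model(["a", "b", "a"])
speed = SpeedLayer()
speed.update("a")
speed.update("c")
print(serve(batch_model, speed, "a"))
```

The design point is that queries never wait for a batch rebuild: the serving step combines a complete but stale model with a small, recent correction.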

Editor's Notes

  1. What makes data science special on Hadoop.
  2. Background as a scientist. Do genomics/life sciences especially. Shameless plug for our new book.
  3. Or instead, what is data science?
  4. SCARES ME the most when I show up at clients.
  5. Difficult to define, but…
  6. One way to organize these things.
  7. TF-IDF model. From simple theory to complicated practical implementation.
  8. Any given operation on an image is not difficult. Reliably integrating satellite data with complex/custom pipelines is difficult. Must coordinate many tasks.
  9. Most similar to research/science/statistics. You don’t really know what you’re doing. Exploratory. Lots of tools to do this: Python, R, SAS, etc. BI tools (Tableau).
  10. Doing it at scale more difficult. Hadoop centralizes. No need to copy data for each application. Bioinformatics spends lots of time mucking with different file formats in different systems. Many orgs are very siloed.
  11. Most unique to Hadoop/big data. Don’t want to train a model once. Given model, want to deploy it. Update it.
  12. If this is landscape of what data science is, what are some tools/recs? ~10 min mark
  13. ETL tools. Traditional Hadoop. Don’t want to say much except….
  14. Most common thing: instrumentation and schemas. Need culture of data/telemetry. Best stuff when you join data sets. Requires de-siloization. Requires centralized schemas.
  15. Also Kafka
  16. Ad Hoc Focus on Accuracy, Visualization Traditional stats tools like R, Python, SAS
  17. Ad Hoc Focus on Accuracy, Visualization Traditional stats tools like R, Python, SAS
  18. Thunder as framework on Spark.
  19. Mahout is deprecated.
  20. Another way to think about the tools is based on different features…
  21. Probably detect a theme here.
  22. Lots of tools in Hadoop have a dichotomy between online and offline. Do we have to choose?
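
Note 7 mentions a TF-IDF model as an example of simple theory. As a rough illustration of the "simple theory" half, here is a minimal single-machine sketch. It assumes one common variant (term frequency normalized by document length, unsmoothed idf = log(N / df)); the function name `tf_idf` is hypothetical, and the talk does not say which variant or implementation was meant.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = [["hadoop", "spark"], ["hadoop", "impala"]]
w = tf_idf(docs)
# "hadoop" occurs in every document, so its weight is zero;
# "spark" occurs in only one, so it gets a positive weight.
print(w[0])
```

The "complicated practical implementation" half is what this sketch leaves out: tokenization, distributing the document-frequency count across a cluster, and keeping the weights current as new documents arrive.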