SlideShare a Scribd company logo

Hadoop for the Data Scientist: Spark in Cloudera 5.5

Inefficient data workloads are all too common across enterprises - causing costly delays, breakages, hard-to-maintain complexity, and ultimately lost productivity. For a typical enterprise with multiple data warehouses, thousands of reports, and hundreds of thousands of ETL jobs being executed every day, this loss of productivity is a real problem. Add to all of this the complex handwritten SQL queries, and there can be nearly a million queries executed every month that desperately need to be optimized, especially to take advantage of the benefits of Apache Hadoop. How can enterprises dig through their workloads and inefficiencies to easily see which are the best fit for Hadoop and what’s the fastest path to get there? Cloudera Navigator Optimizer is the solution - analyzing existing SQL workloads to provide instant insights into your workloads and turns that into an intelligent optimization strategy so you can unlock peak performance and efficiency with Hadoop. As the newest addition to Cloudera’s enterprise Hadoop platform, and now available in limited beta, Navigator Optimizer has helped customers profile over 1.5 million queries and ultimately save millions by optimizing for Hadoop.

1 of 46
1© Cloudera, Inc. All rights reserved.
Hadoop for the Data Scientist:
Spark in Cloudera 5.5
Anand Iyer | Senior Product Manager | Cloudera
Sandy Ryza | Senior Data Scientist | Cloudera
2© Cloudera, Inc. All rights reserved.
Agenda
• Apache Spark Overview
• Machine Learning with Hadoop and Spark
• Machine Learning Use Cases
• What’s Next
3© Cloudera, Inc. All rights reserved.
Cloudera Enterprise
Making Hadoop Fast, Easy, and Secure
A new kind of data
platform:
• One place for unlimited data
• Unified, multi-framework data
access
Cloudera makes it:
• Fast for business
• Easy to manage
• Secure without compromise
OPERATIONS
DATA
MANAGEMENT
STRUCTURED UNSTRUCTURED
PROCESS, ANALYZE, SERVE
UNIFIED SERVICES
RESOURCE MANAGEMENT SECURITY
FILESYSTEM RELATIONAL NoSQL
STORE
INTEGRATE
BATCH STREAM SQL SEARCH SDK
4© Cloudera, Inc. All rights reserved.
One Platform, Many Workloads
Batch, Interactive,
and Real-Time.
Leading performance and
usability in one platform.
• End-to-end analytic workflows
• Access more data
• Work with data in new ways
• Enable new users
OPERATIONS
Cloudera Manager
Cloudera Director
DATA
MANAGEMENT
Cloudera Navigator
Encrypt and KeyTrustee
Optimizer
STRUCTURED
Sqoop
UNSTRUCTURED
Kafka, Flume
PROCESS, ANALYZE, SERVE
UNIFIED SERVICES
RESOURCE MANAGEMENT
YARN
SECURITY
Sentry, RecordService
FILESYSTEM
HDFS
RELATIONAL
Kudu
NoSQL
HBase
STORE
INTEGRATE
BATCH
Spark, Hive, Pig
MapReduce
STREAM
Spark
SQL
Impala
SEARCH
Solr
SDK
Kite
5© Cloudera, Inc. All rights reserved.
Apache Spark
Flexible, in-memory data processing for Hadoop
Easy
Development
Flexible Extensible
API
Fast Batch & Stream
Processing
• Rich APIs for Scala,
Java, and Python
• Interactive shell
• APIs for different
types of workloads:
• Batch
• Streaming
• Machine Learning
• Graph
• In-Memory
processing and
caching
6© Cloudera, Inc. All rights reserved.
The Spark Ecosystem & Hadoop
STRUCTURED
Sqoop
UNSTRUCTURED
Kafka, Flume
UNIFIED SERVICES
RESOURCE MANAGEMENT
YARN
SECURITY
Sentry, RecordService
FILESYSTEM
HDFS
RELATIONAL
Kudu
NoSQL
HBase
STORE
INTEGRATE
SQL
Impala
SEARCH
Solr
SDK
Kite
BATCH & STREAM
Spark
Spark
Streaming Spark SQL DataFrames MLlib …

Recommended

Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for productionFaster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for productionCloudera, Inc.
 
Data Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopData Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopCloudera, Inc.
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform WebinarCloudera, Inc.
 
Cloudera Cares + DataKind | 7 May 2015 | London, UK
Cloudera Cares + DataKind | 7 May 2015 | London, UKCloudera Cares + DataKind | 7 May 2015 | London, UK
Cloudera Cares + DataKind | 7 May 2015 | London, UKCloudera, Inc.
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Cloudera, Inc.
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014cdmaxime
 
Unlock Hadoop Success with Cloudera Navigator Optimizer
Unlock Hadoop Success with Cloudera Navigator OptimizerUnlock Hadoop Success with Cloudera Navigator Optimizer
Unlock Hadoop Success with Cloudera Navigator OptimizerCloudera, Inc.
 

More Related Content

What's hot

Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...
Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...
Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...Cloudera, Inc.
 
Building Efficient Pipelines in Apache Spark
Building Efficient Pipelines in Apache SparkBuilding Efficient Pipelines in Apache Spark
Building Efficient Pipelines in Apache SparkJeremy Beard
 
One Hadoop, Multiple Clouds - NYC Big Data Meetup
One Hadoop, Multiple Clouds - NYC Big Data MeetupOne Hadoop, Multiple Clouds - NYC Big Data Meetup
One Hadoop, Multiple Clouds - NYC Big Data MeetupAndrei Savu
 
A Closer Look at Apache Kudu
A Closer Look at Apache KuduA Closer Look at Apache Kudu
A Closer Look at Apache KuduAndriy Zabavskyy
 
Hadoop on Cloud: Why and How?
Hadoop on Cloud: Why and How?Hadoop on Cloud: Why and How?
Hadoop on Cloud: Why and How?Cloudera, Inc.
 
Successes, Challenges, and Pitfalls Migrating a SAAS business to Hadoop
Successes, Challenges, and Pitfalls Migrating a SAAS business to HadoopSuccesses, Challenges, and Pitfalls Migrating a SAAS business to Hadoop
Successes, Challenges, and Pitfalls Migrating a SAAS business to HadoopDataWorks Summit/Hadoop Summit
 
Machine Learning Loves Hadoop
Machine Learning Loves HadoopMachine Learning Loves Hadoop
Machine Learning Loves HadoopCloudera, Inc.
 
Low latency high throughput streaming using Apache Apex and Apache Kudu
Low latency high throughput streaming using Apache Apex and Apache KuduLow latency high throughput streaming using Apache Apex and Apache Kudu
Low latency high throughput streaming using Apache Apex and Apache KuduDataWorks Summit
 
Running Hadoop as Service in AltiScale Platform
Running Hadoop as Service in AltiScale PlatformRunning Hadoop as Service in AltiScale Platform
Running Hadoop as Service in AltiScale PlatformInMobi Technology
 
Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...
Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...
Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...DataWorks Summit
 
Apache Flink & Kudu: a connector to develop Kappa architectures
Apache Flink & Kudu: a connector to develop Kappa architecturesApache Flink & Kudu: a connector to develop Kappa architectures
Apache Flink & Kudu: a connector to develop Kappa architecturesNacho García Fernández
 
Impala use case @ Zoosk
Impala use case @ ZooskImpala use case @ Zoosk
Impala use case @ ZooskCloudera, Inc.
 
Enabling the Active Data Warehouse with Apache Kudu
Enabling the Active Data Warehouse with Apache KuduEnabling the Active Data Warehouse with Apache Kudu
Enabling the Active Data Warehouse with Apache KuduGrant Henke
 
Provisioning Big Data Platform using Cloudbreak & Ambari
Provisioning Big Data Platform using Cloudbreak & AmbariProvisioning Big Data Platform using Cloudbreak & Ambari
Provisioning Big Data Platform using Cloudbreak & AmbariDataWorks Summit/Hadoop Summit
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaCloudera, Inc.
 
Solr consistency and recovery internals
Solr consistency and recovery internalsSolr consistency and recovery internals
Solr consistency and recovery internalsCloudera, Inc.
 
Intel and Cloudera: Accelerating Enterprise Big Data Success
Intel and Cloudera: Accelerating Enterprise Big Data SuccessIntel and Cloudera: Accelerating Enterprise Big Data Success
Intel and Cloudera: Accelerating Enterprise Big Data SuccessCloudera, Inc.
 
Hadoop AWS infrastructure cost evaluation
Hadoop AWS infrastructure cost evaluationHadoop AWS infrastructure cost evaluation
Hadoop AWS infrastructure cost evaluationmattlieber
 

What's hot (20)

Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...
Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...
Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...
 
Building Efficient Pipelines in Apache Spark
Building Efficient Pipelines in Apache SparkBuilding Efficient Pipelines in Apache Spark
Building Efficient Pipelines in Apache Spark
 
One Hadoop, Multiple Clouds - NYC Big Data Meetup
One Hadoop, Multiple Clouds - NYC Big Data MeetupOne Hadoop, Multiple Clouds - NYC Big Data Meetup
One Hadoop, Multiple Clouds - NYC Big Data Meetup
 
Cloudbreak - Technical Deep Dive
Cloudbreak - Technical Deep DiveCloudbreak - Technical Deep Dive
Cloudbreak - Technical Deep Dive
 
A Closer Look at Apache Kudu
A Closer Look at Apache KuduA Closer Look at Apache Kudu
A Closer Look at Apache Kudu
 
Hadoop on Cloud: Why and How?
Hadoop on Cloud: Why and How?Hadoop on Cloud: Why and How?
Hadoop on Cloud: Why and How?
 
Successes, Challenges, and Pitfalls Migrating a SAAS business to Hadoop
Successes, Challenges, and Pitfalls Migrating a SAAS business to HadoopSuccesses, Challenges, and Pitfalls Migrating a SAAS business to Hadoop
Successes, Challenges, and Pitfalls Migrating a SAAS business to Hadoop
 
Machine Learning Loves Hadoop
Machine Learning Loves HadoopMachine Learning Loves Hadoop
Machine Learning Loves Hadoop
 
Low latency high throughput streaming using Apache Apex and Apache Kudu
Low latency high throughput streaming using Apache Apex and Apache KuduLow latency high throughput streaming using Apache Apex and Apache Kudu
Low latency high throughput streaming using Apache Apex and Apache Kudu
 
Running Hadoop as Service in AltiScale Platform
Running Hadoop as Service in AltiScale PlatformRunning Hadoop as Service in AltiScale Platform
Running Hadoop as Service in AltiScale Platform
 
Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...
Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...
Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...
 
Apache Flink & Kudu: a connector to develop Kappa architectures
Apache Flink & Kudu: a connector to develop Kappa architecturesApache Flink & Kudu: a connector to develop Kappa architectures
Apache Flink & Kudu: a connector to develop Kappa architectures
 
Impala use case @ Zoosk
Impala use case @ ZooskImpala use case @ Zoosk
Impala use case @ Zoosk
 
Enabling the Active Data Warehouse with Apache Kudu
Enabling the Active Data Warehouse with Apache KuduEnabling the Active Data Warehouse with Apache Kudu
Enabling the Active Data Warehouse with Apache Kudu
 
Provisioning Big Data Platform using Cloudbreak & Ambari
Provisioning Big Data Platform using Cloudbreak & AmbariProvisioning Big Data Platform using Cloudbreak & Ambari
Provisioning Big Data Platform using Cloudbreak & Ambari
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 
Solr consistency and recovery internals
Solr consistency and recovery internalsSolr consistency and recovery internals
Solr consistency and recovery internals
 
Intel and Cloudera: Accelerating Enterprise Big Data Success
Intel and Cloudera: Accelerating Enterprise Big Data SuccessIntel and Cloudera: Accelerating Enterprise Big Data Success
Intel and Cloudera: Accelerating Enterprise Big Data Success
 
Hybrid is the New Normal
Hybrid is the New NormalHybrid is the New Normal
Hybrid is the New Normal
 
Hadoop AWS infrastructure cost evaluation
Hadoop AWS infrastructure cost evaluationHadoop AWS infrastructure cost evaluation
Hadoop AWS infrastructure cost evaluation
 

Similar to Hadoop for the Data Scientist: Spark in Cloudera 5.5

Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Uri Laserson
 
MySQL Enterprise Portfolio
MySQL Enterprise PortfolioMySQL Enterprise Portfolio
MySQL Enterprise PortfolioAbel Flórez
 
Consolidate your data marts for fast, flexible analytics 5.24.18
Consolidate your data marts for fast, flexible analytics 5.24.18Consolidate your data marts for fast, flexible analytics 5.24.18
Consolidate your data marts for fast, flexible analytics 5.24.18Cloudera, Inc.
 
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...Cloudera, Inc.
 
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...Data Con LA
 
Using MySQL in the Cloud
Using MySQL in the CloudUsing MySQL in the Cloud
Using MySQL in the CloudMatt Lord
 
MySQL London Tech Tour March 2015 - MySQL Fabric
MySQL London Tech Tour March 2015 - MySQL FabricMySQL London Tech Tour March 2015 - MySQL Fabric
MySQL London Tech Tour March 2015 - MySQL FabricMark Swarbrick
 
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin MotgiWhither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin MotgiFelicia Haggarty
 
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Cloudera, Inc.
 
Fighting Fraud with Apache Spark
Fighting Fraud with Apache SparkFighting Fraud with Apache Spark
Fighting Fraud with Apache SparkMiklos Christine
 
MySQL in oracle_public_cloud
MySQL in oracle_public_cloudMySQL in oracle_public_cloud
MySQL in oracle_public_cloudOracleMySQL
 
Amazon Redshift with Full 360 Inc.
Amazon Redshift with Full 360 Inc.Amazon Redshift with Full 360 Inc.
Amazon Redshift with Full 360 Inc.Amazon Web Services
 
MySQL in oracle public cloud
MySQL in oracle public cloudMySQL in oracle public cloud
MySQL in oracle public cloudMandy Ang
 
Choosing the Right Business Intelligence Tools for Your Data and Architectura...
Choosing the Right Business Intelligence Tools for Your Data and Architectura...Choosing the Right Business Intelligence Tools for Your Data and Architectura...
Choosing the Right Business Intelligence Tools for Your Data and Architectura...Victor Holman
 
Oracle super cluster for oracle e business suite
Oracle super cluster for oracle e business suiteOracle super cluster for oracle e business suite
Oracle super cluster for oracle e business suiteOTN Systems Hub
 
Latest Innovations in Database as a Service Enabled by Oracle Enterprise Manager
Latest Innovations in Database as a Service Enabled by Oracle Enterprise ManagerLatest Innovations in Database as a Service Enabled by Oracle Enterprise Manager
Latest Innovations in Database as a Service Enabled by Oracle Enterprise ManagerHari Srinivasan
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...DataStax
 
Adobe Spark Meetup - 9/19/2018 - San Jose, CA
Adobe Spark Meetup - 9/19/2018 - San Jose, CAAdobe Spark Meetup - 9/19/2018 - San Jose, CA
Adobe Spark Meetup - 9/19/2018 - San Jose, CAJaemi Bremner
 
Stream Analytics in the Enterprise
Stream Analytics in the EnterpriseStream Analytics in the Enterprise
Stream Analytics in the EnterpriseJesus Rodriguez
 

Similar to Hadoop for the Data Scientist: Spark in Cloudera 5.5 (20)

Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)
 
MySQL Enterprise Portfolio
MySQL Enterprise PortfolioMySQL Enterprise Portfolio
MySQL Enterprise Portfolio
 
Consolidate your data marts for fast, flexible analytics 5.24.18
Consolidate your data marts for fast, flexible analytics 5.24.18Consolidate your data marts for fast, flexible analytics 5.24.18
Consolidate your data marts for fast, flexible analytics 5.24.18
 
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
 
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
 
Using MySQL in the Cloud
Using MySQL in the CloudUsing MySQL in the Cloud
Using MySQL in the Cloud
 
MySQL London Tech Tour March 2015 - MySQL Fabric
MySQL London Tech Tour March 2015 - MySQL FabricMySQL London Tech Tour March 2015 - MySQL Fabric
MySQL London Tech Tour March 2015 - MySQL Fabric
 
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin MotgiWhither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
 
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
 
Fighting Fraud with Apache Spark
Fighting Fraud with Apache SparkFighting Fraud with Apache Spark
Fighting Fraud with Apache Spark
 
MySQL Fabric
MySQL FabricMySQL Fabric
MySQL Fabric
 
MySQL in oracle_public_cloud
MySQL in oracle_public_cloudMySQL in oracle_public_cloud
MySQL in oracle_public_cloud
 
Amazon Redshift with Full 360 Inc.
Amazon Redshift with Full 360 Inc.Amazon Redshift with Full 360 Inc.
Amazon Redshift with Full 360 Inc.
 
MySQL in oracle public cloud
MySQL in oracle public cloudMySQL in oracle public cloud
MySQL in oracle public cloud
 
Choosing the Right Business Intelligence Tools for Your Data and Architectura...
Choosing the Right Business Intelligence Tools for Your Data and Architectura...Choosing the Right Business Intelligence Tools for Your Data and Architectura...
Choosing the Right Business Intelligence Tools for Your Data and Architectura...
 
Oracle super cluster for oracle e business suite
Oracle super cluster for oracle e business suiteOracle super cluster for oracle e business suite
Oracle super cluster for oracle e business suite
 
Latest Innovations in Database as a Service Enabled by Oracle Enterprise Manager
Latest Innovations in Database as a Service Enabled by Oracle Enterprise ManagerLatest Innovations in Database as a Service Enabled by Oracle Enterprise Manager
Latest Innovations in Database as a Service Enabled by Oracle Enterprise Manager
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
 
Adobe Spark Meetup - 9/19/2018 - San Jose, CA
Adobe Spark Meetup - 9/19/2018 - San Jose, CAAdobe Spark Meetup - 9/19/2018 - San Jose, CA
Adobe Spark Meetup - 9/19/2018 - San Jose, CA
 
Stream Analytics in the Enterprise
Stream Analytics in the EnterpriseStream Analytics in the Enterprise
Stream Analytics in the Enterprise
 

More from Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 

More from Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Recently uploaded

Battle of React State Managers in frontend applications
Battle of React State Managers in frontend applicationsBattle of React State Managers in frontend applications
Battle of React State Managers in frontend applicationsEvangelia Mitsopoulou
 
HBR SERIES METAL HOUSED RESISTORS POWER ELECTRICAL ABSORBS HIGH CURRENT DURIN...
HBR SERIES METAL HOUSED RESISTORS POWER ELECTRICAL ABSORBS HIGH CURRENT DURIN...HBR SERIES METAL HOUSED RESISTORS POWER ELECTRICAL ABSORBS HIGH CURRENT DURIN...
HBR SERIES METAL HOUSED RESISTORS POWER ELECTRICAL ABSORBS HIGH CURRENT DURIN...htrindia
 
National Institute of Standards and Technology (NIST) Cybersecurity Framework...
National Institute of Standards and Technology (NIST) Cybersecurity Framework...National Institute of Standards and Technology (NIST) Cybersecurity Framework...
National Institute of Standards and Technology (NIST) Cybersecurity Framework...MichaelBenis1
 
PrismCRM-RealEstate-SalesCRM_byCode5Company
PrismCRM-RealEstate-SalesCRM_byCode5CompanyPrismCRM-RealEstate-SalesCRM_byCode5Company
PrismCRM-RealEstate-SalesCRM_byCode5CompanyMustafa Kuğu
 
New ThousandEyes Product Features and Release Highlights: February 2024
New ThousandEyes Product Features and Release Highlights: February 2024New ThousandEyes Product Features and Release Highlights: February 2024
New ThousandEyes Product Features and Release Highlights: February 2024ThousandEyes
 
What’s New in CloudStack 4.19, Abhishek Kumar, Release Manager Apache CloudSt...
What’s New in CloudStack 4.19, Abhishek Kumar, Release Manager Apache CloudSt...What’s New in CloudStack 4.19, Abhishek Kumar, Release Manager Apache CloudSt...
What’s New in CloudStack 4.19, Abhishek Kumar, Release Manager Apache CloudSt...ShapeBlue
 
iOncologi_Pitch Deck_2024 slide show for hostinger
iOncologi_Pitch Deck_2024 slide show for hostingeriOncologi_Pitch Deck_2024 slide show for hostinger
iOncologi_Pitch Deck_2024 slide show for hostingerssuser9354ce
 
Trending now: Book subjects on the move in the Canadian market - Tech Forum 2024
Trending now: Book subjects on the move in the Canadian market - Tech Forum 2024Trending now: Book subjects on the move in the Canadian market - Tech Forum 2024
Trending now: Book subjects on the move in the Canadian market - Tech Forum 2024BookNet Canada
 
Synergy in Leadership and Product Excellence: A Blueprint for Growth by CPO, ...
Synergy in Leadership and Product Excellence: A Blueprint for Growth by CPO, ...Synergy in Leadership and Product Excellence: A Blueprint for Growth by CPO, ...
Synergy in Leadership and Product Excellence: A Blueprint for Growth by CPO, ...Product School
 
How We Grew Up with CloudStack and its Journey – Dilip Singh, DataHub
How We Grew Up with CloudStack and its Journey – Dilip Singh, DataHubHow We Grew Up with CloudStack and its Journey – Dilip Singh, DataHub
How We Grew Up with CloudStack and its Journey – Dilip Singh, DataHubShapeBlue
 
My Journey towards Artificial Intelligence
My Journey towards Artificial IntelligenceMy Journey towards Artificial Intelligence
My Journey towards Artificial IntelligenceVijayananda Mohire
 
Elevating Cloud Infrastructure with Object Storage, DRS, VM Scheduling, and D...
Elevating Cloud Infrastructure with Object Storage, DRS, VM Scheduling, and D...Elevating Cloud Infrastructure with Object Storage, DRS, VM Scheduling, and D...
Elevating Cloud Infrastructure with Object Storage, DRS, VM Scheduling, and D...ShapeBlue
 
The Art of the Possible with Graph by Dr Jim Webber Neo4j.pptx
The Art of the Possible with Graph by Dr Jim Webber Neo4j.pptxThe Art of the Possible with Graph by Dr Jim Webber Neo4j.pptx
The Art of the Possible with Graph by Dr Jim Webber Neo4j.pptxNeo4j
 
Progress Report: Ministry of IT under Dr. Umar Saif Aug 23-Feb'24
Progress Report: Ministry of IT under Dr. Umar Saif Aug 23-Feb'24Progress Report: Ministry of IT under Dr. Umar Saif Aug 23-Feb'24
Progress Report: Ministry of IT under Dr. Umar Saif Aug 23-Feb'24Umar Saif
 
How to write an effective Cyber Incident Response Plan
How to write an effective Cyber Incident Response PlanHow to write an effective Cyber Incident Response Plan
How to write an effective Cyber Incident Response PlanDatabarracks
 
Centralized TLS Certificates Management Using Vault PKI + Cert-Manager
Centralized TLS Certificates Management Using Vault PKI + Cert-ManagerCentralized TLS Certificates Management Using Vault PKI + Cert-Manager
Centralized TLS Certificates Management Using Vault PKI + Cert-ManagerSaiLinnThu2
 
Building Bridges: Merging RPA Processes, UiPath Apps, and Data Service to bu...
Building Bridges:  Merging RPA Processes, UiPath Apps, and Data Service to bu...Building Bridges:  Merging RPA Processes, UiPath Apps, and Data Service to bu...
Building Bridges: Merging RPA Processes, UiPath Apps, and Data Service to bu...DianaGray10
 
VM Migration from VMware to CloudStack and KVM – Suresh Anaparti, ShapeBlue
VM Migration from VMware to CloudStack and KVM – Suresh Anaparti, ShapeBlueVM Migration from VMware to CloudStack and KVM – Suresh Anaparti, ShapeBlue
VM Migration from VMware to CloudStack and KVM – Suresh Anaparti, ShapeBlueShapeBlue
 
CloudStack 101: The Best Way to Build Your Private Cloud – Rohit Yadav, VP Ap...
CloudStack 101: The Best Way to Build Your Private Cloud – Rohit Yadav, VP Ap...CloudStack 101: The Best Way to Build Your Private Cloud – Rohit Yadav, VP Ap...
CloudStack 101: The Best Way to Build Your Private Cloud – Rohit Yadav, VP Ap...ShapeBlue
 

Recently uploaded (20)

Battle of React State Managers in frontend applications
Battle of React State Managers in frontend applicationsBattle of React State Managers in frontend applications
Battle of React State Managers in frontend applications
 
HBR SERIES METAL HOUSED RESISTORS POWER ELECTRICAL ABSORBS HIGH CURRENT DURIN...
HBR SERIES METAL HOUSED RESISTORS POWER ELECTRICAL ABSORBS HIGH CURRENT DURIN...HBR SERIES METAL HOUSED RESISTORS POWER ELECTRICAL ABSORBS HIGH CURRENT DURIN...
HBR SERIES METAL HOUSED RESISTORS POWER ELECTRICAL ABSORBS HIGH CURRENT DURIN...
 
National Institute of Standards and Technology (NIST) Cybersecurity Framework...
National Institute of Standards and Technology (NIST) Cybersecurity Framework...National Institute of Standards and Technology (NIST) Cybersecurity Framework...
National Institute of Standards and Technology (NIST) Cybersecurity Framework...
 
PrismCRM-RealEstate-SalesCRM_byCode5Company
PrismCRM-RealEstate-SalesCRM_byCode5CompanyPrismCRM-RealEstate-SalesCRM_byCode5Company
PrismCRM-RealEstate-SalesCRM_byCode5Company
 
New ThousandEyes Product Features and Release Highlights: February 2024
New ThousandEyes Product Features and Release Highlights: February 2024New ThousandEyes Product Features and Release Highlights: February 2024
New ThousandEyes Product Features and Release Highlights: February 2024
 
What’s New in CloudStack 4.19, Abhishek Kumar, Release Manager Apache CloudSt...
What’s New in CloudStack 4.19, Abhishek Kumar, Release Manager Apache CloudSt...What’s New in CloudStack 4.19, Abhishek Kumar, Release Manager Apache CloudSt...
What’s New in CloudStack 4.19, Abhishek Kumar, Release Manager Apache CloudSt...
 
iOncologi_Pitch Deck_2024 slide show for hostinger
iOncologi_Pitch Deck_2024 slide show for hostingeriOncologi_Pitch Deck_2024 slide show for hostinger
iOncologi_Pitch Deck_2024 slide show for hostinger
 
Trending now: Book subjects on the move in the Canadian market - Tech Forum 2024
Trending now: Book subjects on the move in the Canadian market - Tech Forum 2024Trending now: Book subjects on the move in the Canadian market - Tech Forum 2024
Trending now: Book subjects on the move in the Canadian market - Tech Forum 2024
 
Synergy in Leadership and Product Excellence: A Blueprint for Growth by CPO, ...
Synergy in Leadership and Product Excellence: A Blueprint for Growth by CPO, ...Synergy in Leadership and Product Excellence: A Blueprint for Growth by CPO, ...
Synergy in Leadership and Product Excellence: A Blueprint for Growth by CPO, ...
 
How We Grew Up with CloudStack and its Journey – Dilip Singh, DataHub
How We Grew Up with CloudStack and its Journey – Dilip Singh, DataHubHow We Grew Up with CloudStack and its Journey – Dilip Singh, DataHub
How We Grew Up with CloudStack and its Journey – Dilip Singh, DataHub
 
My Journey towards Artificial Intelligence
My Journey towards Artificial IntelligenceMy Journey towards Artificial Intelligence
My Journey towards Artificial Intelligence
 
Elevating Cloud Infrastructure with Object Storage, DRS, VM Scheduling, and D...
Elevating Cloud Infrastructure with Object Storage, DRS, VM Scheduling, and D...Elevating Cloud Infrastructure with Object Storage, DRS, VM Scheduling, and D...
Elevating Cloud Infrastructure with Object Storage, DRS, VM Scheduling, and D...
 
The Art of the Possible with Graph by Dr Jim Webber Neo4j.pptx
The Art of the Possible with Graph by Dr Jim Webber Neo4j.pptxThe Art of the Possible with Graph by Dr Jim Webber Neo4j.pptx
The Art of the Possible with Graph by Dr Jim Webber Neo4j.pptx
 
In sharing we trust. Taking advantage of a diverse consortium to build a tran...
In sharing we trust. Taking advantage of a diverse consortium to build a tran...In sharing we trust. Taking advantage of a diverse consortium to build a tran...
In sharing we trust. Taking advantage of a diverse consortium to build a tran...
 
Progress Report: Ministry of IT under Dr. Umar Saif Aug 23-Feb'24
Progress Report: Ministry of IT under Dr. Umar Saif Aug 23-Feb'24Progress Report: Ministry of IT under Dr. Umar Saif Aug 23-Feb'24
Progress Report: Ministry of IT under Dr. Umar Saif Aug 23-Feb'24
 
How to write an effective Cyber Incident Response Plan
How to write an effective Cyber Incident Response PlanHow to write an effective Cyber Incident Response Plan
How to write an effective Cyber Incident Response Plan
 
Centralized TLS Certificates Management Using Vault PKI + Cert-Manager
Centralized TLS Certificates Management Using Vault PKI + Cert-ManagerCentralized TLS Certificates Management Using Vault PKI + Cert-Manager
Centralized TLS Certificates Management Using Vault PKI + Cert-Manager
 
Building Bridges: Merging RPA Processes, UiPath Apps, and Data Service to bu...
Building Bridges:  Merging RPA Processes, UiPath Apps, and Data Service to bu...Building Bridges:  Merging RPA Processes, UiPath Apps, and Data Service to bu...
Building Bridges: Merging RPA Processes, UiPath Apps, and Data Service to bu...
 
VM Migration from VMware to CloudStack and KVM – Suresh Anaparti, ShapeBlue
VM Migration from VMware to CloudStack and KVM – Suresh Anaparti, ShapeBlueVM Migration from VMware to CloudStack and KVM – Suresh Anaparti, ShapeBlue
VM Migration from VMware to CloudStack and KVM – Suresh Anaparti, ShapeBlue
 
CloudStack 101: The Best Way to Build Your Private Cloud – Rohit Yadav, VP Ap...
CloudStack 101: The Best Way to Build Your Private Cloud – Rohit Yadav, VP Ap...CloudStack 101: The Best Way to Build Your Private Cloud – Rohit Yadav, VP Ap...
CloudStack 101: The Best Way to Build Your Private Cloud – Rohit Yadav, VP Ap...
 

Hadoop for the Data Scientist: Spark in Cloudera 5.5

  • 1. 1© Cloudera, Inc. All rights reserved. Hadoop for the Data Scientist: Spark in Cloudera 5.5 Anand Iyer | Senior Product Manager | Cloudera Sandy Ryza | Senior Data Scientist | Cloudera
  • 2. 2© Cloudera, Inc. All rights reserved. Agenda • Apache Spark Overview • Machine Learning with Hadoop and Spark • Machine Learning Use Cases • What’s Next
  • 3. 3© Cloudera, Inc. All rights reserved. Cloudera Enterprise Making Hadoop Fast, Easy, and Secure A new kind of data platform: • One place for unlimited data • Unified, multi-framework data access Cloudera makes it: • Fast for business • Easy to manage • Secure without compromise OPERATIONS DATA MANAGEMENT STRUCTURED UNSTRUCTURED PROCESS, ANALYZE, SERVE UNIFIED SERVICES RESOURCE MANAGEMENT SECURITY FILESYSTEM RELATIONAL NoSQL STORE INTEGRATE BATCH STREAM SQL SEARCH SDK
  • 4. 4© Cloudera, Inc. All rights reserved. One Platform, Many Workloads Batch, Interactive, and Real-Time. Leading performance and usability in one platform. • End-to-end analytic workflows • Access more data • Work with data in new ways • Enable new users OPERATIONS Cloudera Manager Cloudera Director DATA MANAGEMENT Cloudera Navigator Encrypt and KeyTrustee Optimizer STRUCTURED Sqoop UNSTRUCTURED Kafka, Flume PROCESS, ANALYZE, SERVE UNIFIED SERVICES RESOURCE MANAGEMENT YARN SECURITY Sentry, RecordService FILESYSTEM HDFS RELATIONAL Kudu NoSQL HBase STORE INTEGRATE BATCH Spark, Hive, Pig MapReduce STREAM Spark SQL Impala SEARCH Solr SDK Kite
  • 5. 5© Cloudera, Inc. All rights reserved. Apache Spark Flexible, in-memory data processing for Hadoop Easy Development Flexible Extensible API Fast Batch & Stream Processing • Rich APIs for Scala, Java, and Python • Interactive shell • APIs for different types of workloads: • Batch • Streaming • Machine Learning • Graph • In-Memory processing and caching
  • 6. 6© Cloudera, Inc. All rights reserved. The Spark Ecosystem & Hadoop STRUCTURED Sqoop UNSTRUCTURED Kafka, Flume UNIFIED SERVICES RESOURCE MANAGEMENT YARN SECURITY Sentry, RecordService FILESYSTEM HDFS RELATIONAL Kudu NoSQL HBase STORE INTEGRATE SQL Impala SEARCH Solr SDK Kite BATCH & STREAM Spark Spark Streaming Spark SQL DataFrames MLlib …
  • 7. 7© Cloudera, Inc. All rights reserved. Easy Machine Learning on data distributed over a large cluster of machines
  • 8. 8© Cloudera, Inc. All rights reserved. Process Flow for ML Development Traditional Data Management Model Development Phase Production Modeling Production Scoring* Metadata Management Development Tools (IDEs, source control, notebooks) Scheduling, Workflow, Publishing Data Ingest Data Prep Feature Engineering Visualization Modeling (incl. hyperparameter search & model validation) Feature Generation + Model Building Model Quality, Usage, Perf. Metrics Experiments Batch Scorer Online Model Update Server + Scoring *There may be further steps after scoring such as aggregations, visualizations, reporting, etc
  • 9. 9© Cloudera, Inc. All rights reserved. What is Mllib? Library of machine learning and data mining algorithms and utilities • Implemented in Spark • Invoked within Java, Scala, or Python Spark applications MLlib applications are Spark applications • Requires Spark knowledge to effectively run • Recommended deployment on YARN • MLlib apps require the same set of parameters Spark applications require (number of executors, memory per executor, etc)
  • 10. 10© Cloudera, Inc. All rights reserved. What Does MLlib Contain? • Machine learning models for classification and regression • Recommender System • Clustering Algorithms • Feature Engineering Algorithms and Utilities • Data Mining Algorithms & Basic Statistical Analysis Utilities
  • 11. 11© Cloudera, Inc. All rights reserved. Classification & Regression Traditional Models • Linear and Logistic Regression • Naïve Bayes • Decision Trees • Support Vector Machines
  • 12. 12© Cloudera, Inc. All rights reserved. Classification & Regression Traditional Models • Linear and Logistic Regression • Naïve Bayes • Decision Trees • Support Vector Machines Next-Gen Models • Gradient Boosted Trees • Random Forests
  • 13. 13© Cloudera, Inc. All rights reserved. Clustering Algorithms • K-Means • Power Iteration Clustering (PIC) • Gaussian Mixture Model • Streaming K-Means
  • 14. 14© Cloudera, Inc. All rights reserved. Clustering Algorithms • K-Means • Power Iteration Clustering (PIC) • Gaussian Mixture Model • Streaming K-Means Textual data clustering i.e. identifying “topics” from a corpus of documents: • Latent Dirichlet Allocation (LDA)
  • 15. 15© Cloudera, Inc. All rights reserved. • Predicting the interests of a user, by collecting partial list of preferences from many users • Predicting missing items of a user-item association matrix • Algorithm used: Alternating Least Squares • Admittedly limited choice of algorithms ? ? ? ? ? ? ? ? ? ? Collaborative Filtering For Building Recommender Systems
  • 16. 16© Cloudera, Inc. All rights reserved. Feature Engineering & Modeling Utilities • Feature Scaling & Normalization • Statistical Correlation Functions (Pearson & Spearman’s) • Tests of Statistical Significance • Chi-Squared independence test for feature selection • Evaluation metrics: Precision, Recall, AUROC, F1-Measure, etc
  • 17. 17© Cloudera, Inc. All rights reserved. Feature Engineering & Modeling Utilities • Feature Scaling & Normalization • Statistical Correlation Functions (Pearson & Spearman’s) • Tests of Statistical Significance • Chi-Squared independence test for feature selection • Evaluation metrics: Precision, Recall, AUROC, F1-Measure, etc Dimensionality Reduction: • Principal Component Analysis (PCA) • Singular Value Decomposition (SVD)
  • 18. 18© Cloudera, Inc. All rights reserved. Feature Engineering & Modeling Utilities • Feature Scaling & Normalization • Statistical Correlation Functions (Pearson & Spearman’s) • Tests of Statistical Significance • Chi-Squared independence test for feature selection • Evaluation metrics: Precision, Recall, AUROC, F1-Measure, etc Dimensionality Reduction: • Principal Component Analysis (PCA) • Singular Value Decomposition (SVD) Textual Feature Generation: • Word2Vec • Term Frequency – Inverse Document Frequency (TF-IDF)
  • 19. 19© Cloudera, Inc. All rights reserved. Data Mining: Frequent Pattern Mining Data Mining Urban Legend: Frequent Pattern Mining algorithm on supermarket purchase data revealed “Men who buy diapers have a very high likelihood of buying beer!”
  • 20. 20© Cloudera, Inc. All rights reserved. Data Mining: Frequent Pattern Mining Data Mining Urban Legend: Frequent Pattern Mining algorithm on supermarket purchase data revealed “Men who buy diapers have a very high likelihood of buying beer!” Algorithms in MLlib: • Frequent Pattern-Growth • Association Rule Mining • PrefixSpan
  • 21. 21© Cloudera, Inc. All rights reserved. What about “Deep Learning”? Deep Learning is an umbrella term for large complex Multi- Layer Neural Networks • MLlib contains a robust Multilayer Neural Network implementation
  • 22. 22© Cloudera, Inc. All rights reserved. Pipeline API Hooking the Pieces Together • Inspired by scikit-learn pipelines • ML involves running multiple sequential steps Eg: Text Classification Pipeline Bag of Words Tokenize TF-IDF LDA Scale & Normalize Features Train Classifier
  • 23. 23© Cloudera, Inc. All rights reserved. Pipeline API Hooking the Pieces Together • Inspired by scikit-learn pipelines • ML involves running multiple sequential steps Eg: Text Classification Pipeline Bag of Words Tokenize TF-IDF LDA Scale & Normalize Features Train Classifier Sequence is repeated during Training and Scoring
  • 24. 24© Cloudera, Inc. All rights reserved. Pipeline API: Hooking the pieces together • Inspired by scikit-learn pipelines • ML involves running multiple sequential steps Eg: Text Classification Pipeline Bag of words Tokenize TF-IDF LDA Scale & Normalize Features Train Classifier Sequence is repeated during Training and Scoring Hyper-Parameter Tuning  Repeat Sequence with different parameter values
  • 25. 25© Cloudera, Inc. All rights reserved. Overview of Pipeline API • Create Pipeline as a sequence of Stages: • Transformers: Transform or augment features • Estimators: Fit a model • Re-use Pipeline • Basic save and load functionality available • Invoke Pipeline with different set of parameters passed as ParamMap
  • 26. 26© Cloudera, Inc. All rights reserved. Process Flow for ML Development Traditional Data Management Model Development Phase Production Modeling Production Scoring* Metadata Management Development Tools (IDEs, source control, notebooks) Scheduling, Workflow, Publishing Data Ingest Data Prep Feature Engineering Visualization Modeling (incl. hyperparameter search & model validation) Feature Generation + Model Building Model Quality, Usage, Perf. Metrics Experiments Batch Scorer Online Model Update Server + Scoring *There may be further steps after scoring such as aggregations, visualizations, reporting, etc
  • 27. 27© Cloudera, Inc. All rights reserved. Process Flow for ML Development Traditional Data Management Model Development Phase Production Modeling Production Scoring* Metadata Management Development Tools (IDEs, source control, notebooks) Scheduling, Workflow, Publishing Data Ingest Data Prep Feature Engineering Visualization Modeling (incl. hyperparameter search & model validation) Feature Generation + Model Building Model Quality, Usage, Perf. Metrics Experiments Batch Scorer Online Model Update Server + Scoring *There may be further steps after scoring such as aggregations, visualizations, reporting, etc
  • 28. 28© Cloudera, Inc. All rights reserved. Process Flow for ML Development Traditional Data Management Model Development Phase Production Modeling Production Scoring* Metadata Management Development Tools (IDEs, source control, notebooks) Scheduling, Workflow, Publishing Data Ingest Data Prep Feature Engineering Visualization Modeling (incl. hyperparameter search & model validation) Feature Generation + Model Building Model Quality, Usage, Perf. Metrics Experiments Batch Scorer Online Model Update Server + Scoring *There may be further steps after scoring such as aggregations, visualizations, reporting, etc
  • 29. 29© Cloudera, Inc. All rights reserved. Process Flow for ML Development Traditional Data Management Model Development Phase Production Modeling Production Scoring* Metadata Management Development Tools (IDEs, source control, notebooks) Scheduling, Workflow, Publishing Data Ingest Data Prep Feature Engineering Visualization Modeling (incl. hyperparameter search & model validation) Feature Generation + Model Building Model Quality, Usage, Perf. Metrics Experiments Batch Scorer Online Model Update Server + Scoring *There may be further steps after scoring such as aggregations, visualizations, reporting, etc
  • 30. 30© Cloudera, Inc. All rights reserved. Process Flow for ML Development Traditional Data Management Model Development Phase Production Modeling Production Scoring* Metadata Management Development Tools (IDEs, source control, notebooks) Scheduling, Workflow, Publishing Data Ingest Data Prep Feature Engineering Visualization Modeling (incl. hyperparameter search & model validation) Feature Generation + Model Building Model Quality, Usage, Perf. Metrics Experiments Batch Scorer Online Model Update Server + Scoring *There may be further steps after scoring such as aggregations, visualizations, reporting, etc
  • 31. 31© Cloudera, Inc. All rights reserved. Process Flow for ML Development Traditional Data Management Model Development Phase Production Modeling Production Scoring* Metadata Management Development Tools (IDEs, source control, notebooks) Scheduling, Workflow, Publishing Data Ingest Data Prep Feature Engineering Visualization Modeling (incl. hyperparameter search & model validation) Feature Generation + Model Building Model Quality, Usage, Perf. Metrics Experiments Batch Scorer Online Model Update Server + Scoring *There may be further steps after scoring such as aggregations, visualizations, reporting, etc Score streaming events in Spark Streaming.
  • 32. 32© Cloudera, Inc. All rights reserved. Machine Learning Use Case
  • 33. 33© Cloudera, Inc. All rights reserved. Predicting Influencers at a Large Telco • Customer loyalty difficult and expensive • Aggressive competition
  • 34. 34© Cloudera, Inc. All rights reserved. Social Churn • Churn is not an isolated event! • When influential subscribers leave, they take their friends with them
  • 35. 35© Cloudera, Inc. All rights reserved. Casting This as a Data Science Problem • Can we quantify: Which lost users were the most influential? • Can we predict: Which current subscribers have as much influence?
  • 36. 36© Cloudera, Inc. All rights reserved. The Challenge: Lots Customers, Lots of Data • Over 100 million customers • Over 1 billion connections
  • 37. 37© Cloudera, Inc. All rights reserved. The Challenge: Lots Customers, Lots of Data • Over 100 million customers • Over 1 billion connections
  • 38. 38© Cloudera, Inc. All rights reserved. Calculating Influencer Scores • Connection: pair of users with communication both ways • Influencer score: number of connected users that churn after user X churns
  • 39. 39© Cloudera, Inc. All rights reserved. Predicting Influencer Scores MLlib! • Regression model • Linear regression • Random forests • Features • # of connections, # calls to connections • Internal vs. External
  • 40. 40© Cloudera, Inc. All rights reserved. Breaking Down the Work Building User and Connection Tables Computing Historical Influencer Scores Feature Generation Model Fitting Model Evaluation
  • 41. 41© Cloudera, Inc. All rights reserved. What’s Next
  • 42. 42© Cloudera, Inc. All rights reserved. Roadmap Update MANAGEMENT Initial Spark-on-YARN integration for shared resource management SECURITY SCALE STREAMING New metrics for easier diagnosis Improved Spark-on-YARN for better multi-tenancy, performance, ease of use Automated configurations to optimize over time Visibility into resource utilization Improved PySpark integration for Python access Kerberos-based authorization Fine-grained access control Auditing and lineage (Governance) Integration with Intel’s Advanced Encryption libraries Full PCI compliance Improved integration with HDFS to enable scheduling Reduced memory pressure on larger jobs Dynamic resource utilization and prioritization Stress test at scale with mixed multi-tenant workloads Spark Streaming resiliency for zero data loss Data ingest integration for Kafka and Flume Improved state management for better performance Higher-level language extensions ✔ ✔✔ ✔ ✔✔ ✔
  • 43. 43© Cloudera, Inc. All rights reserved. Download Cloudera 5.5 cloudera.com/downloads
  • 44. 44© Cloudera, Inc. All rights reserved. Data Science & Spark Training Courses university.cloudera.com
  • 45. 45© Cloudera, Inc. All rights reserved. Thank You
  • 46. 46© Cloudera, Inc. All rights reserved. Spark Resources • Learn Spark • O’Reilly Advanced Analytics with Spark eBook (written by Clouderans) • Cloudera Developer Blog: blog.cloudera.com/spark • Spark Page: cloudera.com/spark • Get Trained • Cloudera Spark Training: university.cloudera.com • Try it Out • Cloudera Live Spark Tutorial: cloudera.com/live • Download Cloudera 5.5: cloudera.com/downloads