An Introduction to Spark 
Jai Ranganathan, Senior Director Product Management, Cloudera 
Denny Lee, Senior Director Data Sciences Engineering, Concur
Agenda 
• Cloudera’s Enterprise Data Hub 
• Why Spark? 
• Spark Use Cases 
• Concur Case Study 
• Cloudera and Spark 
• Future of Spark 
2 ©2014 Cloudera, Inc. All rights reserved.
Cloudera’s Enterprise Data Hub 
[Diagram: components of the Enterprise Data Hub]
• Batch processing: MapReduce, Spark
• Analytic SQL: Impala
• Search engine: Solr
• Machine learning: Spark, Mahout, MLlib
• Stream processing: Spark
• 3rd party apps: partners
• Workload management: YARN
• Storage for any type of data (unified, elastic, resilient, secure): filesystem (HDFS), online NoSQL (HBase)
• Data management: Cloudera Navigator
• System management: Cloudera Manager
• Sentry
Spark: Easy and Fast Big Data 
Easy to Develop 
• Rich APIs in Java, Scala, Python 
• Interactive shell 
• 2-5× less code 
Fast to Run 
• General execution graphs 
• In-memory storage 
• Up to 10× faster on disk, 100× in memory
Easy: Expressive API 
• map 
• filter 
• groupBy 
• sort 
• union 
• join 
• leftOuterJoin 
• rightOuterJoin 
• reduce 
• count 
• fold 
• reduceByKey 
• groupByKey 
• cogroup 
• cross 
• zip 
• sample 
• take 
• first 
• partitionBy 
• mapWith 
• pipe 
• save ...
Example: Logistic Regression 
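from math import exp
import numpy

# readPoint, D and iterations are assumed to be defined elsewhere
data = spark.textFile(...).map(readPoint).cache()  # parse once, keep the points in memory
w = numpy.random.rand(D)                           # random initial weights
for i in range(iterations):
    # Sum the per-point gradients of the logistic loss (assumes labels p.y in {-1, +1})
    gradient = data.map(
        lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
    ).reduce(lambda x, y: x + y)
    w -= gradient
print("Final w: %s" % w)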
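To try the same map/reduce loop without a cluster, here is a minimal sketch in plain Python (no Spark, no NumPy). It assumes labels in {-1, +1}. The toy points, the step size, and the names POINTS, dot, point_gradient, train and predict are all made up for illustration and are not part of the original talk.

```python
from functools import reduce
from math import exp

# Toy, linearly separable points: (features, label), label in {-1, +1}.
# These values are invented for illustration only.
POINTS = [
    ((1.0, 2.0), 1),
    ((2.0, 1.5), 1),
    ((-1.0, -1.5), -1),
    ((-2.0, -1.0), -1),
]


def dot(a, b):
    """Dot product of two equal-length tuples."""
    return sum(x * y for x, y in zip(a, b))


def point_gradient(w, x, y):
    """Gradient of log(1 + exp(-y * w.x)) with respect to w for one point."""
    scale = (1.0 / (1.0 + exp(-y * dot(w, x))) - 1.0) * y
    return tuple(scale * xi for xi in x)


def train(points, iterations=50, step=0.5):
    """Plain gradient descent: map each point to a gradient, reduce by summing."""
    w = (0.0,) * len(points[0][0])
    for _ in range(iterations):
        grads = map(lambda p: point_gradient(w, p[0], p[1]), points)
        gradient = reduce(lambda a, b: tuple(u + v for u, v in zip(a, b)), grads)
        w = tuple(wi - step * gi for wi, gi in zip(w, gradient))
    return w


def predict(w, x):
    """Classify by the sign of w.x."""
    return 1 if dot(w, x) >= 0 else -1
```

In Spark the map and reduce steps run across the cached partitions; here they run over a local list, which is enough to see the shape of the computation.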
Spark Introduces the Concept of RDDs to Take Advantage of Memory 
RDD = Resilient Distributed Dataset 
• Memory caching layer that stores data in a distributed, fault-tolerant cache 
• Created by parallel transformations on data in stable storage 
Two observations: 
a. Can fall back to disk when the dataset does not fit in memory 
b. Provides fault tolerance through the concept of lineage (lost partitions are recomputed from the transformations that built them) 
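The lineage idea can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not Spark's implementation: ToyDataset and its evict method are hypothetical names, everything runs on one machine, and "stable storage" is just a Python list.

```python
class ToyDataset:
    """Hypothetical stand-in for an RDD: it remembers how it was derived."""

    def __init__(self, source=None, parent=None, fn=None):
        self._source = source  # set only on the root ("stable storage")
        self._parent = parent  # lineage: the dataset this one came from
        self._fn = fn          # lineage: the transformation applied to the parent
        self._cache = None     # in-memory copy, which may be lost

    def map(self, f):
        return ToyDataset(parent=self, fn=lambda rows: [f(r) for r in rows])

    def filter(self, pred):
        return ToyDataset(parent=self, fn=lambda rows: [r for r in rows if pred(r)])

    def cache(self):
        # Materialize once and keep the result in memory.
        self._cache = self.collect()
        return self

    def evict(self):
        # Simulate losing the cached copy (for example, a failed node).
        self._cache = None

    def collect(self):
        if self._cache is not None:
            return list(self._cache)
        if self._parent is None:
            return list(self._source)
        # No cached copy: replay the recorded transformation on the parent.
        return self._fn(self._parent.collect())
```

Calling evict() and then collect() again returns the same rows, because the result is rebuilt from the recorded transformations rather than restored from a replica.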
Fast: Using RAM, Operator Graphs 
In-Memory Caching 
• Data Partitions read from RAM 
instead of disk 
Operator Graphs 
• Scheduling Optimizations 
• Fault Tolerance 
[Figure: operator graph of RDDs A to F connected by map, filter, groupBy, join and take operations; legend: RDD, cached partition]
Easy: Out of the Box Functionality 
Hadoop Integration 
• Standard Hadoop data formats 
• Runs under YARN in mixed clusters 
Libraries 
• MLlib – Machine learning toolkit 
• GraphX (alpha) – Graph analytics based on PowerGraph abstractions 
• Spark Streaming – Near real-time analytics 
• Spark SQL – Direct SQL interface in a Spark application 
Language support: 
• SparkR (upcoming) 
• Java 8 
• Schema support in Spark’s APIs 
• SQL support in Spark Streaming (upcoming) 
Logistic Regression Performance 
(Data Fits in Memory) 
[Chart: running time (s) against number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark]
• Hadoop: 110 s / iteration 
• Spark: first iteration 80 s, further iterations 1 s
Spark Streaming 
What is it? 
• Run continuous processing of data using Spark’s core API. Extends Spark concepts to fault-tolerant, transformable streams 
• Adds “rolling window” operations, e.g. computing rolling averages or counts for data over the last five minutes 
Why do you care? 
• Same programming paradigm for streaming and batch – reuse knowledge and code in both contexts 
• High-level API with automatic DAG generation – simplicity of development 
• Excellent throughput – scales easily to very large volumes of ingested data 
• Combine elements like MLlib & Oryx into a streaming application 
Example use cases: 
• “On-the-fly” ETL as data is ingested into Hadoop/HDFS 
• Detecting anomalous behavior and triggering alerts. 
• Continuous reporting of summary metrics for incoming data. 
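To make the “rolling window” idea concrete, here is a minimal sketch in plain Python. It is not the Spark Streaming API: RollingWindowCounter is a hypothetical class, timestamps are plain seconds, and events are assumed to arrive in time order.

```python
from collections import deque


class RollingWindowCounter:
    """Keeps a count and average of the values seen in the last window_seconds."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, value) pairs, oldest first
        self._total = 0.0

    def add(self, timestamp, value):
        """Record a new event, then drop events that fell out of the window."""
        self._events.append((timestamp, value))
        self._total += value
        self._expire(timestamp)

    def _expire(self, now):
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, old_value = self._events.popleft()
            self._total -= old_value

    def count(self):
        return len(self._events)

    def average(self):
        return self._total / len(self._events) if self._events else 0.0
```

With the default of 300 seconds this matches the five-minute example on the slide; a streaming engine does the same bookkeeping per batch and per key, distributed across the cluster.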
Streaming Architectures with Spark 
[Diagram components]
• Data sources 
• Integration layer: Flume, Kafka 
• Ingest 
• Spark stream processing: data prep, aggregation / scoring 
• HDFS 
• HBase 
• Spark long-term analytics / model building 
• Real-time result serving
Cloudera Customer Use Cases – Core Spark 
Sector Use case Replaces 
Financial Services 
• Multiple use cases to calculate VaR for portfolio risk analysis: Monte Carlo simulations as well as Var-Covar methods 
• ETL pipeline speed-up 
• Analyzing stock data for 20 years 
Replaces: home-grown applications 
Genomics 
• Two use cases to identify disease-causing genes in the full human genome 
Replaces: MySQL engine 
Data services 
• Trend analysis using statistical methods on large data sets 
• Document classification (LDA) 
• Fraud analytics 
Replaces: Netezza replacement; net new 
Healthcare 
• Calculating Jaccard scores on health care data sets 
Replaces: net new
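For readers unfamiliar with the Jaccard score mentioned in the healthcare use case: it is the size of the intersection of two sets divided by the size of their union. A minimal sketch in plain Python follows; the function name and the empty-set convention are choices made here, not details from the customer's implementation.

```python
def jaccard(a, b):
    """Jaccard score of two collections, treated as sets: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(union)
```

At scale the work is in computing this over many pairs of records, which is where a distributed engine helps.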
Cloudera Customer Use Cases – Streaming 
Sector Use case Replaces 
Financial Services 
• On-line fraud detection 
Replaces: net new 
Many 
• Continuous ETL 
Retail 
• On-line recommender systems 
• Inventory management 
Replaces: custom apps
Spark at Concur
About Concur 
What do we do? 
• Leading provider of spend management solutions and services (Travel, Invoice, TripIt, etc.) in the world 
• Global customer base of 20,000 clients and 25 million users 
• Processing more than $50 billion in Travel & Expense (T&E) spend each year
About the Speaker 
Who Am I? 
• Long-time SQL Server BI guy (24TB Yahoo! Cube) 
• Project Isotope (Hadoop on Windows and Azure) 
• At Concur, helping with Big Data and Data Sciences
A long time ago… 
• We started using Hadoop because 
• It was free 
• i.e. Didn’t want to pay for a big data warehouse 
• Could slowly extract from hundreds of relational data sources, consolidate it, and query it 
• We were not thinking about advanced analytics 
• We were thinking …. “cheaper reporting” 
• We have some hardware lying around … let’s cobble it together and now we have reports
Themes 
Consolidate Visualize Insight Recommend
BTS 
Travel Weather 
Invoice Web Analytics 
Expense
Can quickly switch to map mode and determine where most itineraries are from in 2013 
Or even quickly plot the airport locations on a map to see that Sun Moon Lake Airport is in the center of Taiwan
Starbucks Store #3313 
601 108th Ave NE 
Bellevue, WA (425) 646-9602 
------------------------------- 
Chk 713452 
05/14/2014 11:04 AM 
1961558 Drawer: 1 Reg: 1 
------------------------------- 
Bacon Art Brkfst 3.45 
Warmed 
T1 Latte 2.70 
Triple 1.50 
Soy 0.60 
Gr Vanilla Mac 4.15 
Reload Card 50.00 
AMEX $50.00 
XXXXXXXXXXXXXXXXXX1004 
SBUX Card $13.56 
SUBTOTAL $62.40 
New Caffe Espresso 
Frappuccino(R) Blended beverage 
Our Signature 
Frappuccino(R) roast coffee and 
fresh milk, blended with ice. 
Topped with our new espresso 
whipped cream and new 
Italian roast drizzle 
Expense Categorization 
This is one of my receipts that I had OCRed. 
One of the issues we’re trying to solve is auto-categorizing receipts like this. How can we do it? 
Below is a simplistic solution using WordCount. 
Note: a real solution should involve machine learning algorithms.
Spark assembly has been built with Hive, including Datanucleus jars on classpath 
Welcome to 
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.1.0
      /_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_45) 
Type in expressions to have them evaluated. 
Type :help for more information. 
2014-09-07 22:31:21.064 java[1871:15527] Unable to load realm info from SCDynamicStore 
14/09/07 22:31:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 
Spark context available as sc. 
scala> val receipt = sc.textFile("/usr/local/Cellar/workspace/data/receipt/receipt.txt") 
receipt: org.apache.spark.rdd.RDD[String] = /usr/local/Cellar/workspace/data/receipt/receipt.txt MappedRDD[1] at textFile at <console>:12 
scala> receipt.count 
res0: Long = 30
scala> val words = receipt.flatMap(_.split(" ")) 
words: org.apache.spark.rdd.RDD[String] = FlatMappedRDD[2] at flatMap at <console>:14 
scala> words.count 
res1: Long = 161 
scala> words.distinct.count 
res2: Long = 72 
scala> val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _).map{case(x,y) => (y,x)}.sortByKey(false).map{case(i,j) => (j, i)} 
wordCounts: org.apache.spark.rdd.RDD[(String, Int)] = MappedRDD[12] at map at <console>:16 
scala> wordCounts.take(12) 
res5: Array[(String, Int)] = Array(("",82), (with,2), (Card,2), (new,2), (-------------------------------,2), (Frappuccino(R),2), (roast,2), (1,2), (and,2), (New,1), (Topped,1), (Starbucks,1))
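The top entry in the result above is the empty string, because splitting on a single space turns every run of spaces into empty tokens. A minimal sketch of the same count-and-sort pipeline in plain Python (not Spark) that drops those tokens is shown below. The function name and the sample lines are made up for illustration.

```python
from collections import Counter

# A few invented lines in the style of the receipt, with irregular spacing.
SAMPLE_LINES = [
    "Reload Card  50.00",
    "SBUX Card  $13.56",
    "",
]


def word_counts(lines):
    """Return (word, count) pairs, most frequent first, ignoring empty tokens."""
    counts = Counter()
    for line in lines:
        # str.split() with no argument splits on any whitespace run,
        # so it never yields empty strings.
        counts.update(line.split())
    return counts.most_common()
```

In the Spark shell the equivalent change would be to filter out empty strings before the reduceByKey step.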
What’s next… 
• With Spark 1.1 
• Sort-based shuffling 
• MLlib: correlations, sampling, feature extraction, decision trees 
• GraphX: label propagation
Using AtScale to build a dimensional model on the data stored in Impala / Hive
Slice and filter the Impala model using Tableau
Spark and Cloudera
Why Cloudera? 
Expertise 
• Deep engineering investment – only distribution vendor with engineering contributions to Spark and actual technical know-how 
• Field team, support, training and services with experience in many Spark use cases 
• Driving roadmap for Spark 
Experience 
• More customers running Spark than all other distributions put together 
• Deployments range from a few nodes to 800+ nodes 
• Longest field presence – first vendor to support Spark, and still one of only two vendors with official support 
Partnerships 
• Intel partnership brings 15 Spark developers focused on Cloudera customer use cases 
• Business relationship with Databricks to do joint development on Spark 
Spark Takes Over From MapReduce 
Stage 1 
• Crunch on Spark 
• Search on Spark 
Stage 2 
• Hive on Spark 
• Pig on Spark 
Stage 3 
• MR equivalence 
• Sqoop on Spark 
Cloudera-led multi-organization effort: MapR, Intel, Databricks, IBM 
Spark is Great but… 
• Opaque API limitations 
• Debugging and troubleshooting 
• Complex configuration 
Cloudera University: Spark Training 
Questions & Next Steps 
Download Now – www.cloudera.com/download 
Spark Training – www.cloudera.com/content/cloudera/en/training/courses/spark-training.html 
Thank You

Real Time Data Processing Using Spark Streaming
 
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
 
Dev Ops Training
Dev Ops TrainingDev Ops Training
Dev Ops Training
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
 
Large Scale Data Analytics with Spark and Cassandra on the DSE Platform
Large Scale Data Analytics with Spark and Cassandra on the DSE PlatformLarge Scale Data Analytics with Spark and Cassandra on the DSE Platform
Large Scale Data Analytics with Spark and Cassandra on the DSE Platform
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
 
Spark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingSpark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream Processing
 

More from Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
Cloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
Cloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
Cloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
Cloudera, Inc.
 

More from Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Recently uploaded

GOING AOT WITH GRAALVM FOR SPRING BOOT (SPRING IO)
GOING AOT WITH GRAALVM FOR  SPRING BOOT (SPRING IO)GOING AOT WITH GRAALVM FOR  SPRING BOOT (SPRING IO)
GOING AOT WITH GRAALVM FOR SPRING BOOT (SPRING IO)
Alina Yurenko
 
openEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain SecurityopenEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain Security
Shane Coughlan
 
Fundamentals of Programming and Language Processors
Fundamentals of Programming and Language ProcessorsFundamentals of Programming and Language Processors
Fundamentals of Programming and Language Processors
Rakesh Kumar R
 
Using Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional SafetyUsing Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional Safety
Ayan Halder
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
Google
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
Paco van Beckhoven
 
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Mind IT Systems
 
E-commerce Application Development Company.pdf
E-commerce Application Development Company.pdfE-commerce Application Development Company.pdf
E-commerce Application Development Company.pdf
Hornet Dynamics
 
APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)
Boni García
 
Vitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdfVitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke
 
Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"
Donna Lenk
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
Adele Miller
 
OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024
OpenMetadata
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
rickgrimesss22
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
NYGGS Automation Suite
 
Empowering Growth with Best Software Development Company in Noida - Deuglo
Empowering Growth with Best Software  Development Company in Noida - DeugloEmpowering Growth with Best Software  Development Company in Noida - Deuglo
Empowering Growth with Best Software Development Company in Noida - Deuglo
Deuglo Infosystem Pvt Ltd
 
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissancesAtelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Neo4j
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
Fermin Galan
 
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOMLORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
lorraineandreiamcidl
 
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket ManagementUtilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate
 

Recently uploaded (20)

GOING AOT WITH GRAALVM FOR SPRING BOOT (SPRING IO)
GOING AOT WITH GRAALVM FOR  SPRING BOOT (SPRING IO)GOING AOT WITH GRAALVM FOR  SPRING BOOT (SPRING IO)
GOING AOT WITH GRAALVM FOR SPRING BOOT (SPRING IO)
 
openEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain SecurityopenEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain Security
 
Fundamentals of Programming and Language Processors
Fundamentals of Programming and Language ProcessorsFundamentals of Programming and Language Processors
Fundamentals of Programming and Language Processors
 
Using Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional SafetyUsing Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional Safety
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
 
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
 
E-commerce Application Development Company.pdf
E-commerce Application Development Company.pdfE-commerce Application Development Company.pdf
E-commerce Application Development Company.pdf
 
APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)
 
Vitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdfVitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdf
 
Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
 
OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
 
Empowering Growth with Best Software Development Company in Noida - Deuglo
Empowering Growth with Best Software  Development Company in Noida - DeugloEmpowering Growth with Best Software  Development Company in Noida - Deuglo
Empowering Growth with Best Software Development Company in Noida - Deuglo
 
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissancesAtelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissances
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
 
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOMLORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
 
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket ManagementUtilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
 

The Future of Hadoop: A deeper look at Apache Spark

  • 1. An Introduction to Spark. Jai Ranganathan, Senior Director Product Management, Cloudera; Denny Lee, Senior Director Data Sciences Engineering, Concur
  • 2. Agenda • Cloudera’s Enterprise Data Hub • Why Spark? • Spark Use Cases • Concur Case Study • Cloudera and Spark • Future of Spark 2 ©2014 Cloudera, Inc. All rights reserved.
  • 3. Cloudera’s Enterprise Data Hub [Diagram: processing engines: batch processing (MapReduce, Spark), analytic SQL (Impala), search engine (Solr), machine learning (Spark, partners, Mahout, MLlib), stream processing (Spark), and 3rd party apps; workload management (YARN); storage for any type of data (unified, elastic, resilient, secure): filesystem (HDFS) and online NoSQL (HBase); data management (Cloudera Navigator); system management (Cloudera Manager); security (Sentry)]
  • 4. Spark: Easy and Fast Big Data Easy to Develop • Rich APIs in Java, Scala, Python • Interactive shell Fast to Run • General execution graphs • In-memory storage 2-5× less code. Up to 10× faster on disk, 100× in memory
  • 5. Easy: Expressive API • map • filter • groupBy • sort • union • join • leftOuterJoin • rightOuterJoin • reduce • count • fold • reduceByKey • groupByKey • cogroup • cross • zip • sample • take • first • partitionBy • mapWith • pipe • save ...
  • 6. Example: Logistic Regression
        from math import exp
        import numpy

        # readPoint, D, and iterations are defined elsewhere; labels p.y are in {-1, +1}
        data = spark.textFile(...).map(readPoint).cache()
        w = numpy.random.rand(D)
        for i in range(iterations):
            # gradient of the logistic loss, summed over all cached points
            gradient = data.map(
                lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
            ).reduce(lambda x, y: x + y)
            w -= gradient
        print("Final w: %s" % w)
  • 7. Spark Introduces Concept of RDD to Take Advantage of Memory. RDD = Resilient Distributed Datasets • Memory caching layer that stores data in a distributed, fault-tolerant cache • Created by parallel transformations on data in stable storage. Two observations: a. Can fall back to disk when data-set does not fit in memory; b. Provides fault-tolerance through concept of lineage
  • 8. Fast: Using RAM, Operator Graphs In-Memory Caching • Data Partitions read from RAM instead of disk Operator Graphs • Scheduling Optimizations • Fault Tolerance [Figure: operator graph over RDDs A through F using map, filter, groupBy, join, and take; the legend distinguishes RDDs from cached partitions]
  • 9. Easy: Out of the Box Functionality Hadoop Integration • Standard Hadoop data formats • Runs under YARN in mixed clusters Libraries • MLlib – Machine Learning toolkit • GraphX (alpha) – Graph analytics based on PowerGraph abstractions • Spark Streaming – Near real-time analytics • Spark SQL – direct SQL interface in a Spark application Language support: • SparkR (upcoming) • Java 8 • Schema support in Spark’s APIs • SQL support in Spark Streaming (upcoming)
  • 10. Logistic Regression Performance (Data Fits in Memory) [Chart: running time (s) versus number of iterations (1, 5, 10, 20, 30)] Hadoop: 110 s / iteration. Spark: first iteration 80 s, further iterations 1 s
  • 11. Spark Streaming What is it? • Run continuous processing of data using Spark’s core API. Extends Spark concepts to fault-tolerant, transformable streams • Adds “rolling window” operations, e.g. compute rolling averages or counts for data over the last five minutes. Why do you care? • Same programming paradigm for streaming and batch – reuse knowledge and code in both contexts • High level API with automatic DAG generation – simplicity of development • Excellent throughput – can scale easily to really large volumes of data ingest • Combine elements like MLlib & Oryx into a streaming application Example use cases: • “On-the-fly” ETL as data is ingested into Hadoop/HDFS • Detecting anomalous behavior and triggering alerts • Continuous reporting of summary metrics for incoming data
  • 12. Streaming Architectures with Spark [Diagram labels: Data sources; Integration layer (Flume, Kafka); Ingest; HDFS; Spark stream processing (data prep, aggregation / scoring); HBase (real-time result serving); Spark long-term analytics / model building]
  • 13. Cloudera Customer Use Cases – Core Spark
        Sector | Use case | Replaces
        Financial Services | Multiple use cases to calculate VaR for portfolio risk analysis (Monte Carlo simulations as well as Var-Covar methods); ETL pipeline speed-up; analyzing stock data for 20 years | Home grown applications
        Genomics | Two use cases to identify disease causing genes in full human genome | MySQL engine
        Data services | Trend analysis using statistical methods on large data sets; document classification (LDA); fraud analytics | Netezza replacement; net new
        Healthcare | Calculating Jaccard scores on health care data sets | Net new
  • 14. Cloudera Customer Use Cases – Streaming Sector Use case Replaces Financial Services • On-line fraud detection • Net new Many • Continuous ETL Retail • On-line recommender systems • Inventory management • Custom apps
  • 15. Spark at Concur
  • 16. About Concur What do we do? • Leading provider of spend management solutions and services (Travel, Invoice, TripIt, etc.) in the world • Global customer base of 20,000 clients and 25 million users • Processing more than $50 Billion in Travel & Expense (T&E) spend each year
  • 17. About the Speaker Who Am I? • Long time SQL Server BI guy (24TB Yahoo! Cube) • Project Isotope (Hadoop on Windows and Azure) • At Concur, helping with Big Data and Data Sciences
  • 18. A long time ago… • We started using Hadoop because • It was free • i.e. Didn’t want to pay for a big data warehouse • Could slowly extract from hundreds of relational data sources, consolidate it, and query it • We were not thinking about advanced analytics • We were thinking … “cheaper reporting” • We have some hardware lying around … let’s cobble it together and now we have reports
  • 19. Themes: Consolidate, Visualize, Insight, Recommend
  • 20. BTS, Travel, Weather, Invoice, Web Analytics, Expense
  • 21. Can quickly switch to map mode and determine where most itineraries are from in 2013
  • 22. Or even quickly map out the airport locations on a map to see that Sun Moon Lake Airport is in the center of Taiwan
  • 23. Expense Categorization. One of my receipts that I had OCRed. One of the issues we’re trying to solve is to auto-categorize this, so how can we do this? Below is a simplistic solution using WordCount. Note, a real solution should involve machine learning algorithms. Receipt text: Starbucks Store #3313 601 108th Ave NE Bellevue, WA (425) 646-9602 ------------------------------- Chk 713452 05/14/2014 11:04 AM 1961558 Drawer: 1 Reg: 1 ------------------------------- Bacon Art Brkfst 3.45 Warmed T1 Latte 2.70 Triple 1.50 Soy 0.60 Gr Vanilla Mac 4.15 Reload Card 50.00 AMEX $50.00 XXXXXXXXXXXXXXXXXX1004 SBUX Card $13.56 SUBTOTAL $62.40 New Caffe Espresso Frappuccino(R) Blended beverage Our Signature Frappuccino(R) roast coffee and fresh milk, blended with ice. Topped with our new espresso whipped cream and new Italian roast drizzle
  • 24.
        Spark assembly has been built with Hive, including Datanucleus jars on classpath
        Welcome to
              ____              __
             / __/__  ___ _____/ /__
            _\ \/ _ \/ _ `/ __/  '_/
           /___/ .__/\_,_/_/ /_/\_\   version 1.1.0
              /_/
        Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_45)
        Type in expressions to have them evaluated.
        Type :help for more information.
        2014-09-07 22:31:21.064 java[1871:15527] Unable to load realm info from SCDynamicStore
        14/09/07 22:31:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
        Spark context available as sc.
        scala> val receipt = sc.textFile("/usr/local/Cellar/workspace/data/receipt/receipt.txt")
        receipt: org.apache.spark.rdd.RDD[String] = /usr/local/Cellar/workspace/data/receipt/receipt.txt MappedRDD[1] at textFile at <console>:12
        scala> receipt.count
        res0: Long = 30
  • 25.
        scala> val words = receipt.flatMap(_.split(" "))
        words: org.apache.spark.rdd.RDD[String] = FlatMappedRDD[2] at flatMap at <console>:14
        scala> words.count
        res1: Long = 161
        scala> words.distinct.count
        res2: Long = 72
        scala> val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _).map{case(x,y) => (y,x)}.sortByKey(false).map{case(i,j) => (j, i)}
        wordCounts: org.apache.spark.rdd.RDD[(String, Int)] = MappedRDD[12] at map at <console>:16
        scala> wordCounts.take(12)
        res5: Array[(String, Int)] = Array(("",82), (with,2), (Card,2), (new,2), (-------------------------------,2), (Frappuccino(R),2), (roast,2), (1,2), (and,2), (New,1), (Topped,1), (Starbucks,1))
  • 26.
  • 27. What’s next… • With Spark 1.1 • Sort-based shuffling • MLlib: correlations, sampling, feature extraction, decision trees • GraphX: label propagation
  • 28. Using AtScale to build up a dimensional model based on the data that is stored within Impala / Hive
  • 29. Slice and filter the Impala model using Tableau
  • 30. Spark and Cloudera
  • 31. Why Cloudera? Expertise • Deep engineering investment – only distribution vendor with engineering contributions to Spark and actual technical know-how • Field team, support, training and services with experience in many Spark use cases • Driving roadmap for Spark Experience • Most customers running Spark across all distributions put together • Range from few nodes to 800+ nodes • Longest field presence – first vendor to support Spark, and still one of only two vendors with official support Partnerships • Intel partnership brings 15 Spark developers focused on Cloudera customer use cases • Business relationship with Databricks to do joint development on Spark
  • 32. Spark Takes Over From MapReduce Stage 1 • Crunch on Spark • Search on Spark Stage 2 • Hive on Spark • Pig on Spark Stage 3 • MR equivalence • Sqoop on Spark. Cloudera-led multi-organization effort: MapR, Intel, Databricks, IBM
  • 33. Spark is Great but… • Opaque API limitations • Debugging and troubleshooting • Complex configuration. CLOUDERA UNIVERSITY Spark Training
  • 34. Questions & Next Steps Download Now – www.cloudera.com/download Spark Training – www.cloudera.com/content/cloudera/en/training/courses/spark-training.html
  • 35. Thank You
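
The word-count steps shown in slides 24 and 25 (split lines into words, pair each word with 1, sum by key, sort by count in descending order) can be tried without a cluster. What follows is a minimal plain-Python sketch of the same idea, not the Spark API; `word_counts` and the sample `receipt_lines` are hypothetical names and data chosen for illustration.

```python
from collections import Counter


def word_counts(lines):
    """Count words the way the slide's pipeline does, most frequent first."""
    # Mirrors flatMap(_.split(" ")): split every line on single spaces and flatten.
    words = [word for line in lines for word in line.split(" ")]
    # Mirrors map(x => (x, 1)).reduceByKey(_ + _): one count per occurrence, summed per word.
    counts = Counter(words)
    # Mirrors the swap + sortByKey(false) + swap: order by count, descending.
    return sorted(counts.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical sample input, not the receipt from the slides.
receipt_lines = ["Latte 2.70", "Soy Latte 0.60", "Reload Card 50.00"]
top = word_counts(receipt_lines)
print(top[0])
```

Splitting on a single space produces empty strings wherever the input has consecutive spaces, which is consistent with the empty string topping the counts in the slide's output; a real categorizer would likely filter those tokens out first.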

Editor's Notes

  1. Cloudera’s enterprise data hub (powered by Hadoop) is a data management platform that is unified, compliance-ready, accessible, and open. The enterprise data hub brings everything together in one unified layer, with no copying of data: a single transparent view that allows you to easily meet auditing and compliance goals. It offers a single, unified solution for storage and serialization, data ingest and egress, security and governance, metadata, and resource management. It is compliance-ready for security and governance, and includes authentication, authorization, encryption, audit, RBAC, and lineage behind a single interface with integrated controls. It is accessible through multiple frameworks and familiar tools and skills. And it is completely open: a 100% open source, Apache-licensed platform that is extensible to 3rd party frameworks, with zero lock-in. As mentioned, Cloudera’s enterprise data hub has multiple frameworks integrated into the platform for robust querying. One of the newest and most exciting is Spark, an open source, flexible data processing framework for machine learning and stream processing. Before we dive into Spark, we need to understand why Spark is necessary, and that requires an understanding of MapReduce.
  2. Key idea: add “variables” to the “functions” in functional programming
  3. This is for a 29 GB dataset on 20 EC2 m1.xlarge machines (4 cores each)
  4. Quick view of Android vs. iOS mobile sessions
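
The logistic regression loop from slide 6 can also be tried locally. The sketch below assumes labels in {-1, +1} and uses the standard logistic-loss gradient, with ordinary Python loops standing in for the RDD `map` and `reduce`; `Point`, `dot`, `train`, the step size, and the toy data are hypothetical and do not come from the slides.

```python
import math
from collections import namedtuple

# Hypothetical record type: x is a list of features, y is a label in {-1, +1}.
Point = namedtuple("Point", ["x", "y"])


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def train(points, dims, iterations=100, step=0.1):
    """Batch gradient descent on the logistic loss sum(log(1 + exp(-y * w.x)))."""
    w = [0.0] * dims
    for _ in range(iterations):
        gradient = [0.0] * dims
        for p in points:  # stands in for data.map(...).reduce(...)
            # Per-point gradient: (sigmoid(y * w.x) - 1) * y * x
            scale = (1.0 / (1.0 + math.exp(-p.y * dot(w, p.x))) - 1.0) * p.y
            for j in range(dims):
                gradient[j] += scale * p.x[j]
        # Step against the summed gradient (step size is an assumed value).
        w = [wj - step * gj for wj, gj in zip(w, gradient)]
    return w


# Toy, linearly separable data, assumed purely for illustration.
data = [Point([1.0, 2.0], 1), Point([2.0, 1.5], 1),
        Point([-1.0, -2.0], -1), Point([-2.0, -1.0], -1)]
w = train(data, dims=2)
print("Final w: %s" % w)
```

Every pass re-reads the same points, which is why the slide caches the dataset: per the performance slide, only the first Spark iteration pays the cost of loading the data.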