© Cloudera, Inc. All rights reserved.
Mirko Kämpf | 2015
Apache Spark:
Next Generation Data
Processing for Hadoop
Agenda
• The Data Science Process (DSP)
- Why or when to use Spark
• The role of: Apache Hadoop and Apache Spark
- History & Hadoop Ecosystem
• Apache Spark: Overview and Concepts
• Practical Tips
The Data Science Process
Application of Big-Data-Technology
Images from: http://semanticommunity.info/Data_Science/Doing_Data_Science
Huge Data Sets in Science
Application of Big-Data-Technology
Images from: http://semanticommunity.info/Data_Science/Doing_Data_Science
“Spark offers tools for Data Science
and components for Data
Products.”
—How can Apache Spark fit into my world?
Should I use Apache Spark?
• If all my data fits into Excel spreadsheets?
• If I have a special-purpose application to work with?
• If my current system is just a bit too slow?
Why not?
• Just export as CSV / JSON and use a DataFrame to join with other DS.
• Think about additional analysis methods! Maybe they are already built into Spark.
• OK, Spark will probably not help to speed up your system, but maybe you can
offload data to Hadoop, which frees up some resources.
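The CSV / JSON export suggestion above boils down to a key-based join between two exported data sets. The following is a minimal sketch of that operation in plain Python, without Spark; in Spark the same step would be expressed as a join between two DataFrames. The file contents, the column names (`sample_id`, `temperature`, `site`) and the helper `inner_join` are hypothetical and only serve as illustration.

```python
import csv
import io
import json

# Hypothetical exports: a spreadsheet saved as CSV and a second source saved as JSON.
CSV_EXPORT = """sample_id,temperature
s1,20.5
s2,21.0
s3,19.8
"""
JSON_EXPORT = '[{"sample_id": "s1", "site": "north"}, {"sample_id": "s3", "site": "south"}]'


def inner_join(left_rows, right_rows, key):
    """Join two lists of dicts on `key`, keeping only keys present in both."""
    # Index the right side by key so each left row needs one dictionary lookup.
    right_index = {}
    for row in right_rows:
        right_index.setdefault(row[key], []).append(row)
    joined = []
    for left in left_rows:
        for right in right_index.get(left[key], []):
            joined.append({**left, **right})
    return joined


measurements = list(csv.DictReader(io.StringIO(CSV_EXPORT)))
sites = json.loads(JSON_EXPORT)
joined_rows = inner_join(measurements, sites, "sample_id")

for row in joined_rows:
    print(row)
```

Sample `s2` has no matching entry in the JSON export, so an inner join drops it; whether that is acceptable depends on the analysis.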
“Spark offers fast in-memory processing on
huge distributed and even heterogeneous
datasets.”
—What type of data fits into Spark?
History of Spark
Spark is really young, but has a very
active community!
Timeline: Spark Adoption
Apache Spark:
Overview & Concepts
Hadoop Ecosystem incl. Apache Spark
Spark can be an entry point to your Big Data world …
“Apache Spark is distributed on top of Hadoop
and brings parallel processing
to powerful workstations.”
—Do I need a Hadoop cluster to work with Apache Spark?
Spark vs. MapReduce
How to interact with Spark?
Spark Components
MLlib:
• Basic statistics
- summary statistics, correlations, stratified sampling,
hypothesis testing, random data generation
• Classification and regression
- linear models (SVMs, logistic / linear regression)
- naive Bayes, decision trees
- ensembles of trees (Random Forests / Gradient-Boosted Trees)
- isotonic regression
• Collaborative filtering
- alternating least squares (ALS)
• Clustering
- k-means, Gaussian mixture, power iteration clustering (PIC)
- latent Dirichlet allocation (LDA), streaming k-means
• Dimensionality reduction
- singular value decomposition (SVD)
- principal component analysis (PCA)
• …
GraphX:
• PageRank
• Connected Components
• Triangle Counting
• Pregel API
How to use your code in Spark?
A. Interactively, by loading it into the spark-shell.
B. Contribute to existing Spark projects.
C. Create your module and use it in a spark-shell session.
D. Build a data-product which uses Apache Spark.
For simple and reliable usage of Java classes
and complete third-party libraries, we define
a Spark Module as a self-contained artifact
created by Maven. This module can easily
be shared by multiple users via repositories.
http://blog.cloudera.com/blog/2015/03/how-to-build-re-usable-spark-programs-using-spark-shell-and-maven/
Apache Spark:
Overview & Concepts
Spark Context
RDDs and DataFrames
Creation of RDDs
Datatypes in RDDs
Spark in a Cluster
Spark in a Cluster
DStream: The heart of Spark Streaming
“Efficient hardware utilization, caching,
simple APIs, and access to a variety of data
in Hadoop are key to success.”
—What makes Spark so different, compared to core MapReduce?
Practical Tips
Development Techniques
• Build your tools and analysis procedures in small cycles.
• Test all phases of your work and document carefully.
• Document what you expect! => Requirements management …
• Collect what you get! => Operational logs …
• Reuse well-tested components and modularize your analysis scripts.
• Learn state-of-the-art tools and share your work!
Data Management
• Think about typical access patterns:
- random access to each record or field?
- access to entire groups of records?
- variable-size or fixed-size sets?
- full table scans?
• Optimize for your dominant access pattern!
• Select efficient storage formats: Avro, Parquet
• Index your data in Solr for random access and data exploration
- Indexing can be done with just a few clicks in Hue …
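Why the dominant access pattern matters for the storage format can be sketched in a few lines of plain Python. Avro is a row-oriented format and Parquet a column-oriented one; the example below imitates both layouts with in-memory structures. It is an illustration under assumptions, not how either format is implemented: the records, field names and helper functions are hypothetical.

```python
# Hypothetical records; field names are illustrative only.
records = [
    {"id": 1, "sensor": "a", "value": 0.5},
    {"id": 2, "sensor": "b", "value": 1.5},
    {"id": 3, "sensor": "a", "value": 2.5},
]

# Row-oriented layout with a key index: cheap random access to a whole record.
row_index = {rec["id"]: rec for rec in records}

# Column-oriented layout: cheap scans over a single field across all records.
columns = {name: [rec[name] for rec in records] for name in records[0]}


def lookup_record(record_id):
    """Random access: fetch one complete record by its key."""
    return row_index[record_id]


def mean_of_column(name):
    """Full scan over one field only; the other fields are never touched."""
    values = columns[name]
    return sum(values) / len(values)


print(lookup_record(2))
print(mean_of_column("value"))
```

If most queries look like `mean_of_column`, a columnar layout avoids reading unused fields; if most look like `lookup_record`, a row layout with an index is the better fit.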
Collecting Sensor Data with Spark Streaming …
• Spark Streaming works on fixed time slices only (in the current version, 1.5)
• Use the original time stamp?
- Requires additional storage and bandwidth
- The original system clock defines the resolution
• Use "Spark time" or a local time reference?
- You may lose information!
- You have a limited resolution, defined by the batch size (interval).
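The trade-off between the original time stamp and the batch time can be made concrete with a small simulation in plain Python. This is a sketch of fixed time slicing in general, not of Spark Streaming's internals; the batch size of 2 seconds, the event values and the helper `batch_start` are assumptions chosen for illustration.

```python
from collections import defaultdict

BATCH_SECONDS = 2.0  # assumed batch size; not taken from the slides


def batch_start(arrival_time, batch_seconds=BATCH_SECONDS):
    """Return the start of the fixed time slice that an arrival time falls into."""
    return (arrival_time // batch_seconds) * batch_seconds


# Hypothetical sensor events: (original time stamp at the sensor, arrival time, value).
events = [
    (0.10, 0.30, 1.0),
    (0.90, 1.20, 2.0),
    (1.95, 2.40, 3.0),
    (3.70, 3.90, 4.0),
]

# Group events by the slice they arrive in, keeping the original time stamp as a field.
batches = defaultdict(list)
for original_ts, arrival_ts, value in events:
    batches[batch_start(arrival_ts)].append((original_ts, value))

for start in sorted(batches):
    print(f"batch starting at {start:.1f}s -> {batches[start]}")
```

If only the batch time were kept, the first two events would become indistinguishable in time, and the event measured at 1.95 s would be attributed to the slice starting at 2.0 s because it arrived late. Carrying the original time stamp avoids this at the cost of one extra field per event.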
Thank you!
Enjoy Apache Spark and all your data …

Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
University of Maribor
 
general properties of oerganologametal.ppt
general properties of oerganologametal.pptgeneral properties of oerganologametal.ppt
general properties of oerganologametal.ppt
IqrimaNabilatulhusni
 
Orion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWSOrion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWS
Columbia Weather Systems
 
Lab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerinLab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerin
ossaicprecious19
 
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
Scintica Instrumentation
 

Recently uploaded (20)

Seminar of U.V. Spectroscopy by SAMIR PANDA
 Seminar of U.V. Spectroscopy by SAMIR PANDA Seminar of U.V. Spectroscopy by SAMIR PANDA
Seminar of U.V. Spectroscopy by SAMIR PANDA
 
Richard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlandsRichard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlands
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
 
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
 
RNA INTERFERENCE: UNRAVELING GENETIC SILENCING
RNA INTERFERENCE: UNRAVELING GENETIC SILENCINGRNA INTERFERENCE: UNRAVELING GENETIC SILENCING
RNA INTERFERENCE: UNRAVELING GENETIC SILENCING
 
The ASGCT Annual Meeting was packed with exciting progress in the field advan...
The ASGCT Annual Meeting was packed with exciting progress in the field advan...The ASGCT Annual Meeting was packed with exciting progress in the field advan...
The ASGCT Annual Meeting was packed with exciting progress in the field advan...
 
platelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptxplatelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptx
 
4. An Overview of Sugarcane White Leaf Disease in Vietnam.pdf
4. An Overview of Sugarcane White Leaf Disease in Vietnam.pdf4. An Overview of Sugarcane White Leaf Disease in Vietnam.pdf
4. An Overview of Sugarcane White Leaf Disease in Vietnam.pdf
 
Citrus Greening Disease and its Management
Citrus Greening Disease and its ManagementCitrus Greening Disease and its Management
Citrus Greening Disease and its Management
 
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptxBody fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
 
role of pramana in research.pptx in science
role of pramana in research.pptx in sciencerole of pramana in research.pptx in science
role of pramana in research.pptx in science
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
 
platelets- lifespan -Clot retraction-disorders.pptx
platelets- lifespan -Clot retraction-disorders.pptxplatelets- lifespan -Clot retraction-disorders.pptx
platelets- lifespan -Clot retraction-disorders.pptx
 
Structural Classification Of Protein (SCOP)
Structural Classification Of Protein  (SCOP)Structural Classification Of Protein  (SCOP)
Structural Classification Of Protein (SCOP)
 
EY - Supply Chain Services 2018_template.pptx
EY - Supply Chain Services 2018_template.pptxEY - Supply Chain Services 2018_template.pptx
EY - Supply Chain Services 2018_template.pptx
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
 
general properties of oerganologametal.ppt
general properties of oerganologametal.pptgeneral properties of oerganologametal.ppt
general properties of oerganologametal.ppt
 
Orion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWSOrion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWS
 
Lab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerinLab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerin
 
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
 

Apache Spark in Scientific Applications

  • 1. © Cloudera, Inc. All rights reserved. Mirko Kämpf | 2015 Apache Spark: Next Generation Data Processing for Hadoop
  • 2. Agenda • The Data Science Process (DSP) - Why or when to use Spark • The role of Apache Hadoop and Apache Spark - History & Hadoop Ecosystem • Apache Spark: Overview and Concepts • Practical Tips
  • 3. The Data Science Process: Application of Big-Data-Technology. Images from: http://semanticommunity.info/Data_Science/Doing_Data_Science
  • 4. Huge Data Sets in Science: Application of Big-Data-Technology. Images from: http://semanticommunity.info/Data_Science/Doing_Data_Science
  • 5. “Spark offers tools for Data Science and components for Data Products.” - How can Apache Spark fit into my world?
  • 9. Should I use Apache Spark? • If all my data fits into Excel spreadsheets? • If I have a special-purpose application to work with? • If my current system is just a bit too slow? • Just export as CSV / JSON and use a DataFrame to join with other DS. • Think about additional analysis methods! Maybe they are already built into Apache Spark. • OK, Spark will probably not help to speed up your system, but maybe you can offload data to Hadoop, which frees up some resources. Why not?
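The "export as CSV / JSON and join" step can be tried on a laptop before any cluster is involved. The sketch below uses only Python's standard library to show the join logic itself; it is not Spark's DataFrame API, and the column names (`sample_id`, `temperature`, `site`) and sample records are made up for illustration.

```python
import csv
import io
import json

# Hypothetical exports: a CSV from a spreadsheet and a JSON dump from another tool.
csv_export = "sample_id,temperature\nA,20.1\nB,19.8\nC,21.5\n"
json_export = '[{"sample_id": "A", "site": "north"}, {"sample_id": "B", "site": "south"}]'

# Parse both exports into lists / dicts of plain records.
measurements = list(csv.DictReader(io.StringIO(csv_export)))
sites = {row["sample_id"]: row for row in json.loads(json_export)}

# Inner join on sample_id: keep only measurements with a matching site entry.
joined = [
    {**m, "site": sites[m["sample_id"]]["site"]}
    for m in measurements
    if m["sample_id"] in sites
]

for record in joined:
    print(record)
```

Sample `C` has no site entry, so an inner join drops it. In Spark the same decision shows up as the choice of join type.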
  • 10. “Spark offers fast in-memory processing on huge distributed and even on heterogeneous datasets.” - What type of data fits into Spark?
  • 11. History of Spark: Spark is really young, but has a very active community!
  • 12. Timeline: Spark Adoption
  • 13. Apache Spark: Overview & Concepts
  • 14. Hadoop Ecosystem incl. Apache Spark: Spark can be an entry point to your Big Data world …
  • 15. “Apache Spark is distributed on top of Hadoop and brings parallel processing to powerful workstations.” - Do I need a Hadoop cluster to work with Apache Spark?
  • 16. Spark vs. MapReduce
  • 17. How to interact with Spark?
  • 18. Spark Components
  • 20. MLlib and GraphX
      MLlib:
      - Basic statistics: summary statistics, correlations, stratified sampling, hypothesis testing, random data generation
      - Classification and regression: linear models (SVMs, logistic / linear regression), naive Bayes, decision trees, ensembles of trees (Random Forests / Gradient-Boosted Trees), isotonic regression
      - Collaborative filtering: alternating least squares (ALS)
      - Clustering: k-means, Gaussian mixture, power iteration clustering (PIC), latent Dirichlet allocation (LDA), streaming k-means
      - Dimensionality reduction: singular value decomposition (SVD), principal component analysis (PCA)
      - …
      GraphX:
      - PageRank
      - Connected Components
      - Triangle Counting
      - Pregel API
  • 21. How to use your code in Spark? A. Interactively, by loading it into the spark-shell. B. Contribute to existing Spark projects. C. Create your module and use it in a spark-shell session. D. Build a data product which uses Apache Spark. For simple and reliable usage of Java classes and complete third-party libraries, we define a Spark Module as a self-contained artifact created by Maven. This module can easily be shared by multiple users via repositories. http://blog.cloudera.com/blog/2015/03/how-to-build-re-usable-spark-programs-using-spark-shell-and-maven/
  • 22. Apache Spark: Overview & Concepts
  • 23. Spark Context
  • 24. RDDs and DataFrames
  • 25. Creation of RDDs
  • 26. Datatypes in RDDs
  • 29. Spark in a Cluster
  • 30. Spark in a Cluster
  • 33. DStream: The heart of Spark Streaming
  • 34. “Efficient hardware utilization, caching, simple APIs, and access to a variety of data in Hadoop is key to success.” - What makes Spark so different, compared to core MapReduce?
  • 35. Practical Tips
  • 36. Development Techniques • Build your tools and analysis procedures in small cycles. • Test all phases of your work and document carefully. • Document what you expect! => Requirements management … • Collect what you get! => Operational logs … • Reuse well-tested components and modularize your analysis scripts. • Learn "state of the art" tools and share your work!
  • 37. Data Management • Think about typical access patterns: - random access to each record or field? - access to entire groups of records? - variable-size or fixed-size sets? - "full table scan"? • OPTIMIZE FOR YOUR DOMINANT ACCESS PATTERN! • Select efficient storage formats: Avro, Parquet • Index your data in Solr for random access and data exploration • Indexing can be done with just a few clicks in Hue …
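Why the dominant access pattern matters for the storage format can be seen in a toy model. The following is a minimal sketch in plain Python, not Avro or Parquet code: it only contrasts a row-oriented layout (the idea behind Avro) with a column-oriented layout (the idea behind Parquet). The record fields and values are hypothetical.

```python
# Hypothetical records: (sensor_id, timestamp, temperature)
rows = [
    ("s1", 0, 20.1),
    ("s2", 0, 19.8),
    ("s1", 1, 20.4),
]


def get_record(row_store, index):
    """Row layout: one lookup returns a complete record (good for record access)."""
    return row_store[index]


def to_columns(row_store):
    """Column layout: one list per field (good for scanning a single field)."""
    sensor_ids, timestamps, temperatures = zip(*row_store)
    return {
        "sensor_id": list(sensor_ids),
        "timestamp": list(timestamps),
        "temperature": list(temperatures),
    }


columns = to_columns(rows)

# A "full table scan" over one field touches only that column.
mean_temperature = sum(columns["temperature"]) / len(columns["temperature"])

print(get_record(rows, 1))
print(mean_temperature)
```

Fetching a whole record is a single lookup in the row layout, whereas the column layout would need one lookup per field. Aggregating one field is the opposite case: the column layout reads only the values it needs.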
  • 38. Collecting Sensor Data with Spark Streaming … • Spark Streaming works on fixed time slices only (in the current version, 1.5) • Use the original time stamp? - Requires additional storage and bandwidth - Original system clock defines resolution • Use "Spark-Time" or a local time reference: - You may lose information! - You have a limited resolution, defined by batch size.
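The trade-off between the two time-stamp options can be sketched in a few lines. The code below is a plain-Python model of fixed time slices, not the Spark Streaming API. The function name `slice_into_batches`, the two-second batch size, and the readings are assumptions made for illustration, and it assumes each reading arrives at the moment it was measured.

```python
from collections import defaultdict


def slice_into_batches(readings, batch_seconds, keep_original_time):
    """Group (time, value) readings into fixed-width time slices.

    With keep_original_time=False only the batch start survives as a
    time reference, so resolution is limited to batch_seconds.
    """
    batches = defaultdict(list)
    for reading_time, value in readings:
        batch_start = int(reading_time // batch_seconds) * batch_seconds
        record = (reading_time, value) if keep_original_time else value
        batches[batch_start].append(record)
    return dict(batches)


# Hypothetical sensor readings: (seconds since start, measured value)
readings = [(0.2, 20.1), (0.9, 20.3), (1.4, 20.2), (3.7, 21.0)]

# Option 1: carry the original time stamp (larger records, full resolution).
with_original = slice_into_batches(readings, batch_seconds=2, keep_original_time=True)

# Option 2: rely on the batch time only (smaller records, coarse resolution).
batch_time_only = slice_into_batches(readings, batch_seconds=2, keep_original_time=False)

print(with_original)
print(batch_time_only)
```

In the second variant the three readings taken at 0.2 s, 0.9 s and 1.4 s can no longer be told apart in time; all that remains is that they belong to the slice starting at 0 s.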
  • 39. Thank you! Enjoy Apache Spark and all your data …