Presenter : Andrey Vykhodtsev
Andrey.vykhodtsev@si.ibm.com
*collective work, see slide credits
 Two meetup groups
 Close, but different topics
 Run by me
 I don’t have to be the presenter all the time
 Propose your agenda
 Not a Big Data introduction
 Visit our next Big Data Essentials meetup instead: http://www.meetup.com/Big-Data-Developers-in-Slovenia/events/223871144/
 Not for people without technical background (sorry)
 Not a thorough use case discussion
 Just a technical overview of the technology for beginners
 General purpose distributed computing engine suitable for large scale machine learning and data processing tasks
NOT SO GOOD
 Not the first computing engine
 MapReduce
 MPI
 Not one of a kind
 Flink
 Not so old (mature)
GOOD
 Developing very fast
 Rapidly growing community
 Backed by major vendors
 Innovation
 Designed for iterative data analysis on large scale (supersedes MR)
In-Memory Performance
Ease of Development
Combine Workflows
Unlimited Scale
Enterprise Platform
Wide Range of Data Formats
 A Big Data/DWH developer
 A Data Scientist
 An Analytics Architect
 A CxO of an IT company
Statistician
Business Analyst
Software Engineer
IT WORDS
 Data processing/Transformation
 Machine Learning
 Social Network Analysis
 Streaming/Microbatching
BUSINESS WORDS
 Segmentation
 Campaign response prediction
 Churn avoidance
 CTR prediction
 Behavioral analysis
 Genomics
 ….
 Open Source SystemML
 Educate One Million Data Professionals
 Establish Spark Technology Center
 Founding Member of AMPLab
 Contributing to the Core
 Port many existing applications onto Spark
 Develop applications using Spark
 Distributed platform for thousands of nodes
 Data storage and computation framework
 Open source
 Runs on commodity hardware
 Flexible – everything is loosely coupled
 Driving principles
 Files are stored across the entire cluster
 Programs are brought to the data, not the data to the program
 Distributed file system (DFS) stores blocks across the whole cluster
 Blocks of a single file are distributed across the cluster
 A given block is typically replicated as well for resiliency
 Just like a regular file system, the contents of a file are up to the application
 Unlike a regular file system, you can ask it “where does each block of my file live?”
[Figure: a FILE split into BLOCKS]
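The block idea above can be sketched in a few lines of plain Python. This is only a toy model: the function names, the node names, the block size and the round-robin placement are all assumptions made for illustration, and a real distributed file system uses its own placement policy.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, nodes, replication: int):
    """Toy placement (an assumption): block i goes to `replication` consecutive nodes."""
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


# Hypothetical file and cluster, chosen only to keep the numbers small.
file_contents = b"x" * 1000
blocks = split_into_blocks(file_contents, block_size=256)
nodes = ["node1", "node2", "node3", "node4"]

# "Where does each block of my file live?"
locations = place_blocks(len(blocks), nodes, replication=2)
for block_id, holders in locations.items():
    print(f"block {block_id} ({len(blocks[block_id])} bytes) -> {holders}")
```

Because every block has more than one holder, losing a single node in this toy model still leaves a copy of each block somewhere else, which is the resiliency point made above.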
map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
Content of Input Documents:
Hello World Bye World
Hello IBM
Map 1 emits:
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>
Map 2 emits:
< Hello, 1>
< IBM, 1>
Reduce (final output):
< Bye, 1>
< IBM, 1>
< Hello, 2>
< World, 2>
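The word count walk-through above can be reproduced on one machine with plain Python. This is a minimal sketch of the map, shuffle and reduce steps, not Hadoop code; the function names and the document names doc1 and doc2 are made up for the illustration.

```python
from collections import defaultdict


def map_phase(doc_name, contents):
    """Emit an intermediate (word, 1) pair for every word in one document."""
    return [(word, 1) for word in contents.split()]


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Sum all counts emitted for one word."""
    return word, sum(counts)


# The two input documents from the slide (document names are hypothetical).
documents = {"doc1": "Hello World Bye World", "doc2": "Hello IBM"}

intermediate = []
for name, text in documents.items():
    intermediate.extend(map_phase(name, text))

word_counts = dict(reduce_phase(word, counts) for word, counts in shuffle(intermediate).items())
print(word_counts)
```

In a real cluster each map call would run next to the block holding its document, and the shuffle would move pairs across the network; here everything happens in one process.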
 Spark brings two significant value-adds:
 Bring to MapReduce the same added value that databases (and parallel databases) brought to query processing:
 Let the app developer focus on the WHAT (they need to ask) and let the system figure out HOW (it should be done).
 Enable faster, higher-level application development through higher-level constructs and concepts (the RDD concept)
 Let the system deal with performance (as part of the HOW)
 Leveraging memory (buffer pools in a DBMS, caching RDDs in memory in Spark)
 Maintaining sets of dedicated worker processes ready to go (subagents in a DBMS, Executors in Spark)
 Enabling interactive processing (CLP, SQL*Plus, spark-shell, etc.)
 Be one general purpose engine for multiple types of workloads (SQL, Streaming, Machine Learning, etc.)
 Apache Spark is a fast, general purpose, easy-to-use cluster computing system for large-scale data processing
 Fast
 Leverages aggressively cached in-memory distributed computing and dedicated App Executor processes that stay up even when no jobs are running
 Faster than MapReduce
 General purpose
 Covers a wide range of workloads
 Provides SQL, streaming and complex analytics
 Flexible and easier to use than MapReduce
 Spark is written in Scala, an object-oriented, functional programming language
 Scala, Python and Java APIs
 Scala and Python interactive shells
 Runs on Hadoop, Mesos, standalone or cloud
Logistic regression in Hadoop and Spark
Spark Stack
WordCount
val wordCounts = sc.textFile("README.md")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey((a, b) => a + b)
 Spark is versatile and flexible:
 Can run on YARN / HDFS but also standalone or on Mesos
 Spark engine can be exploited from multiple “entry points”: SQL, Streaming, Machine Learning, Graph Processing
 Normally you code stuff up in one of the
languages
 Scala
 Python
 Java
 I like Python, but in some cases it is slower
 With DataFrames, no difference (more later)
 One of the shells
 Scala shell (spark-shell)
 Python shell
 Code it in the editor and submit with spark-submit
 Use “notebook” (Jupyter, Zeppelin)
 My preferred method. More later
 Enable your IDE to run spark
 PyCharm
 IntelliJ IDEA
 Jupyter
 Zeppelin
 Scala
 Incubated
 Many others
 Spark Notebook
 Ispark
 DataBricks Cloud
 IBM Spark aaS
 IBM Data Scientist Workbench
 Initialize context
 Read data
 Run stuff
 Transformations
 Actions
 Caching
 More later
GOOD STUFF
 Full API exposed
 Concise language
 Documentation is way better
 Faster if you use plain RDDs
 Build tools and dependency tracking
NOT SO GOOD STUFF
 Not so many additional libraries compared to Python
 Pandas
 Matplotlib
 Harder to run in a “notebook”*
 *At the moment
 Harder to learn
 Scala Crash Course
 Holden Karau, DataBricks
http://lintool.github.io/SparkTutorial/slides/day1_Scala_crash_course.pdf
 Martin Odersky’s “Functional Programming Principles in Scala” course
 Books
 Scala for the Impatient
 Scala by Example
GOOD STUFF
 Clean & clear language
 Easy to learn
 Lot of libraries
 Pandas
 Scikit
 matplotlib
 Easy to run in a “notebook”
NOT SO GOOD STUFF
 Slower
 Interpreted language
 Not all API functions exposed
 Streaming
 Sometimes behaves differently
 I think coding in Java for Spark is terrible
 But if you like it messy, there is nobody to stop you :)
 A way to connect to the Spark engine
 Initialized with all runtime parameters
 For example, memory parameters
 Resilient Distributed Dataset
 An abstraction over a generic data collection
 Integers
 Strings
 PairRDD: <key, value> pairs (support additional operations)
 Single logical entity, but under the hood it is a distributed collection
[Figure: an RDD called Names holding Mokhtar, Jacques, Dirk, Cindy, Dan, Susan, Dirk, Frank, Jacques, split across Partition 1, Partition 2 and Partition 3]
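A rough way to picture "one logical collection, several partitions" is the plain-Python sketch below. The `partition` helper and the choice of contiguous chunks are assumptions for illustration; the slide does not say which name lives in which partition, and Spark decides the real layout itself.

```python
def partition(items, num_partitions):
    """Split a list into at most `num_partitions` contiguous chunks (toy layout)."""
    size = -(-len(items) // num_partitions)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


names = ["Mokhtar", "Jacques", "Dirk", "Cindy", "Dan", "Susan", "Dirk", "Frank", "Jacques"]
partitions = partition(names, 3)

# Each partition can be processed on its own; only the small partial results are combined.
partial_counts = [len(p) for p in partitions]
total = sum(partial_counts)
distinct_names = set().union(*(set(p) for p in partitions))

print(partitions)
print(total, len(distinct_names))
```

The caller only ever asks questions about `names` as a whole; the per-partition work and the final combination are the part a distributed engine would spread over worker nodes.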
 You have to pay attention to what kind of operation you are running
 Transformation
 Does not do anything until an action is called
 Actions
 Kick off computation
 Results can be persisted to memory (cache) or to disk (more later)
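The difference between the two kinds of operation can be imitated in plain Python. The `LazyDataset` class below is a hypothetical toy, not the Spark API: it only records transformations and runs them when an action is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    # Transformations: return a new dataset and compute nothing yet.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def _run(self):
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: kick off the computation.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []  # records every element the map function actually touches


def traced_increment(x):
    calls.append(x)
    return x + 1


numbers = LazyDataset(range(1, 11))
evens = numbers.map(traced_increment).filter(lambda x: x % 2 == 0)

calls_before_action = len(calls)  # nothing has run yet
result = evens.collect()          # the action triggers the whole chain
print(calls_before_action, result, len(calls))
```

Building `evens` costs nothing; the work happens only inside `collect()`, which is why it matters whether a given call is a transformation or an action.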
 Three methods for creation
 Distributing a collection of objects from the driver program (using the parallelize method of the Spark context)
val rddNumbers = sc.parallelize(1 to 10)
val rddLetters = sc.parallelize(List("a", "b", "c", "d"))
 Loading an external dataset (file)
val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
 Transformation from another existing RDD
val rddNumbers2 = rddNumbers.map(x => x + 1)
 Transformations are lazily evaluated
 Each returns a pointer to the transformed RDD
 Pair RDD (K,V) functions for MapReduce style transformations
 map
 filter
 flatMap
 reduceByKey
 sortByKey
 join
 See the doc for full list
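What the key-based transformations listed above compute can be shown on ordinary Python lists of (key, value) pairs. The helper functions and the sales and prices data are hypothetical; this sketch mirrors the meaning of reduceByKey, sortByKey and an inner join, not how Spark runs them across a cluster.

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key(pairs, fn):
    """Merge all values that share a key with the binary function `fn`."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return [(key, reduce(fn, values)) for key, values in grouped.items()]


def sort_by_key(pairs):
    """Order pairs by their key."""
    return sorted(pairs, key=lambda kv: kv[0])


def join(left, right):
    """Inner join: for each matching key emit (key, (left_value, right_value))."""
    right_index = defaultdict(list)
    for key, value in right:
        right_index[key].append(value)
    return [(k, (lv, rv)) for k, lv in left for rv in right_index.get(k, [])]


# Made-up sample data.
sales = [("pears", 2), ("apples", 3), ("apples", 4)]
prices = [("apples", 0.5), ("pears", 0.8)]

totals = sort_by_key(reduce_by_key(sales, lambda a, b: a + b))
joined = join(totals, prices)
print(totals)
print(joined)
```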
 Kick off the computation
 Transformations are lazily evaluated
 collect()
 count()
 take()
 reduce()
 first()
 saveAsTextFile()
 Each node stores in memory any partitions of the cached dataset that it computes
 Reuses them in other actions on that dataset (or datasets derived from it)
 Future actions are much faster (often by more than 10x)
 Two methods for RDD persistence: persist() and cache()
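Why caching speeds up later actions can be seen with a small plain-Python toy. `ToyDataset` and its computation counter are assumptions for illustration, not Spark classes: without caching every action recomputes the data, with caching the first action stores the result and later actions reuse it.

```python
class ToyDataset:
    """Toy dataset that recomputes its contents on every action unless cached."""

    def __init__(self, compute):
        self._compute = compute   # function producing the elements
        self._use_cache = False
        self._cached = None
        self.computations = 0     # how many times the data was really computed

    def cache(self):
        """Mark the dataset for caching; nothing is computed yet."""
        self._use_cache = True
        return self

    def _materialize(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        self.computations += 1
        data = list(self._compute())
        if self._use_cache:
            self._cached = data
        return data

    # Actions
    def count(self):
        return len(self._materialize())

    def take(self, n):
        return self._materialize()[:n]


def squares():
    return (x * x for x in range(5))


uncached = ToyDataset(squares)
uncached.count()
uncached.take(2)

cached = ToyDataset(squares).cache()
cached.count()
first_two = cached.take(2)

print(uncached.computations, cached.computations, first_two)
```

In this toy the saving is one recomputation; with a long chain of transformations over a large dataset, skipping the recomputation is where the speed-up quoted above would come from.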
[Figure: Spark scheduling process, from DataBricks]
 RDD Objects: build operator DAG, e.g. rdd1.join(rdd2).groupBy(…).filter(…)
 DAGScheduler: split graph into stages of tasks; submit each stage as ready; agnostic to operators!
 TaskScheduler: launch tasks via cluster manager; retry failed or straggling tasks; doesn’t know about stages
 Worker: execute tasks; store and serve blocks (Threads, Block manager)
 Passed between components: DAG, TaskSet, Task; “stage failed” is reported back
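The "split graph into stages" step can be approximated with a very small plain-Python sketch. It is a strong simplification and rests on assumptions: the pipeline is a straight chain of operation names (hypothetical here), and every operation in the `WIDE` set is treated as a shuffle that starts a new stage. Real Spark works on dependencies between RDDs rather than on names, and a join does not always need a shuffle.

```python
# Operations assumed, for this sketch only, to require a shuffle.
WIDE = {"join", "groupBy", "reduceByKey", "sortByKey"}


def split_into_stages(operations):
    """Cut a linear chain of operations into stages at every shuffle operation."""
    stages = [[]]
    for op in operations:
        if op in WIDE and stages[-1]:
            stages.append([])  # a shuffle closes the current stage
        stages[-1].append(op)
    return stages


# Hypothetical pipeline, loosely following the join / groupBy / filter chain in the figure.
pipeline = ["textFile", "map", "join", "groupBy", "filter"]
stages = split_into_stages(pipeline)

for number, stage in enumerate(stages):
    print(f"stage {number}: {stage}")
```

Operations inside one stage can be pipelined over a partition without moving data, which is why a scheduler would want to group them before handing tasks to the workers.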
[Figure: Spark cluster architecture]
 Driver Program with SparkContext (the App)
 Cluster Manager
 Worker Node: Executor with Cache, running Tasks
 Worker Node: Executor with Cache, running Tasks
 MLlib
 Distributed machine learning libraries
 SparkSQL
 DataFrames
 GraphX
 ML
 SparkR
 Streaming
 Read the Fine Manual
 https://spark.apache.org/docs/latest/index.html
 Take the course
 BigData University
https://bigdatauniversity.com/bdu-wp/bdu-course/spark-fundamentals/
 edX – edx.org search for Spark
 If you’re stuck
 Try the user lists :
https://spark.apache.org/community.html
 Questions?
 Topic for the next meetup?
 Your experiences?
 Want to be a presenter?
 Some slides and text graphics were borrowed from the following sources
 Vincent Poncet, IBM France
 Jacques Roy, IBM US
 Daniel Kikuchi, IBM US
 Mokhtar Kandil, IBM US
 DataBricks
 Spark Docs
 I completely lost track of which slides I copied from which source. I apologize.

More Related Content

What's hot

Oracle big data appliance and solutions
Oracle big data appliance and solutionsOracle big data appliance and solutions
Oracle big data appliance and solutionssolarisyougood
 
Leveraging docker for hadoop build automation and big data stack provisioning
Leveraging docker for hadoop build automation and big data stack provisioningLeveraging docker for hadoop build automation and big data stack provisioning
Leveraging docker for hadoop build automation and big data stack provisioningEvans Ye
 
MOUG17: DB Security; Secure your Data
MOUG17: DB Security; Secure your DataMOUG17: DB Security; Secure your Data
MOUG17: DB Security; Secure your DataMonica Li
 
Understanding the IBM Power Systems Advantage
Understanding the IBM Power Systems AdvantageUnderstanding the IBM Power Systems Advantage
Understanding the IBM Power Systems AdvantageIBM Power Systems
 
Bare-metal performance for Big Data workloads on Docker containers
Bare-metal performance for Big Data workloads on Docker containersBare-metal performance for Big Data workloads on Docker containers
Bare-metal performance for Big Data workloads on Docker containersBlueData, Inc.
 
Running secured Spark job in Kubernetes compute cluster and integrating with ...
Running secured Spark job in Kubernetes compute cluster and integrating with ...Running secured Spark job in Kubernetes compute cluster and integrating with ...
Running secured Spark job in Kubernetes compute cluster and integrating with ...DataWorks Summit
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyPeter Clapham
 
YARN Containerized Services: Fading The Lines Between On-Prem And Cloud
YARN Containerized Services: Fading The Lines Between On-Prem And CloudYARN Containerized Services: Fading The Lines Between On-Prem And Cloud
YARN Containerized Services: Fading The Lines Between On-Prem And CloudDataWorks Summit
 
Containerized Hadoop beyond Kubernetes
Containerized Hadoop beyond KubernetesContainerized Hadoop beyond Kubernetes
Containerized Hadoop beyond KubernetesDataWorks Summit
 
Realizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache BeamRealizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache BeamDataWorks Summit
 
JCConf 2016 - Cloud Computing Applications - Hazelcast, Spark and Ignite
JCConf 2016 - Cloud Computing Applications - Hazelcast, Spark and IgniteJCConf 2016 - Cloud Computing Applications - Hazelcast, Spark and Ignite
JCConf 2016 - Cloud Computing Applications - Hazelcast, Spark and IgniteJoseph Kuo
 
20150314 sahara intro and the future plan for open stack meetup
20150314 sahara intro and the future plan for open stack meetup20150314 sahara intro and the future plan for open stack meetup
20150314 sahara intro and the future plan for open stack meetupWei Ting Chen
 
Hadoop Virtualization - Intel White Paper
Hadoop Virtualization - Intel White PaperHadoop Virtualization - Intel White Paper
Hadoop Virtualization - Intel White PaperBlueData, Inc.
 
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on KubernetesApache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on KubernetesDataWorks Summit
 
Hadoop on Docker
Hadoop on DockerHadoop on Docker
Hadoop on DockerRakesh Saha
 
How to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environmentHow to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environmentBlueData, Inc.
 
Extending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with KubernetesExtending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with KubernetesNicola Ferraro
 
A Closer Look at Apache Kudu
A Closer Look at Apache KuduA Closer Look at Apache Kudu
A Closer Look at Apache KuduAndriy Zabavskyy
 

What's hot (20)

Oracle big data appliance and solutions
Oracle big data appliance and solutionsOracle big data appliance and solutions
Oracle big data appliance and solutions
 
Leveraging docker for hadoop build automation and big data stack provisioning
Leveraging docker for hadoop build automation and big data stack provisioningLeveraging docker for hadoop build automation and big data stack provisioning
Leveraging docker for hadoop build automation and big data stack provisioning
 
MOUG17: DB Security; Secure your Data
MOUG17: DB Security; Secure your DataMOUG17: DB Security; Secure your Data
MOUG17: DB Security; Secure your Data
 
Understanding the IBM Power Systems Advantage
Understanding the IBM Power Systems AdvantageUnderstanding the IBM Power Systems Advantage
Understanding the IBM Power Systems Advantage
 
Bare-metal performance for Big Data workloads on Docker containers
Bare-metal performance for Big Data workloads on Docker containersBare-metal performance for Big Data workloads on Docker containers
Bare-metal performance for Big Data workloads on Docker containers
 
Novinky v Oracle Database 18c
Novinky v Oracle Database 18cNovinky v Oracle Database 18c
Novinky v Oracle Database 18c
 
Running secured Spark job in Kubernetes compute cluster and integrating with ...
Running secured Spark job in Kubernetes compute cluster and integrating with ...Running secured Spark job in Kubernetes compute cluster and integrating with ...
Running secured Spark job in Kubernetes compute cluster and integrating with ...
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journey
 
YARN Containerized Services: Fading The Lines Between On-Prem And Cloud
YARN Containerized Services: Fading The Lines Between On-Prem And CloudYARN Containerized Services: Fading The Lines Between On-Prem And Cloud
YARN Containerized Services: Fading The Lines Between On-Prem And Cloud
 
Deploying Big-Data-as-a-Service (BDaaS) in the Enterprise
Deploying Big-Data-as-a-Service (BDaaS) in the EnterpriseDeploying Big-Data-as-a-Service (BDaaS) in the Enterprise
Deploying Big-Data-as-a-Service (BDaaS) in the Enterprise
 
Containerized Hadoop beyond Kubernetes
Containerized Hadoop beyond KubernetesContainerized Hadoop beyond Kubernetes
Containerized Hadoop beyond Kubernetes
 
Realizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache BeamRealizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache Beam
 
JCConf 2016 - Cloud Computing Applications - Hazelcast, Spark and Ignite
JCConf 2016 - Cloud Computing Applications - Hazelcast, Spark and IgniteJCConf 2016 - Cloud Computing Applications - Hazelcast, Spark and Ignite
JCConf 2016 - Cloud Computing Applications - Hazelcast, Spark and Ignite
 
20150314 sahara intro and the future plan for open stack meetup
20150314 sahara intro and the future plan for open stack meetup20150314 sahara intro and the future plan for open stack meetup
20150314 sahara intro and the future plan for open stack meetup
 
Hadoop Virtualization - Intel White Paper
Hadoop Virtualization - Intel White PaperHadoop Virtualization - Intel White Paper
Hadoop Virtualization - Intel White Paper
 
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on KubernetesApache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
 
Hadoop on Docker
Hadoop on DockerHadoop on Docker
Hadoop on Docker
 
How to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environmentHow to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environment
 
Extending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with KubernetesExtending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with Kubernetes
 
A Closer Look at Apache Kudu
A Closer Look at Apache KuduA Closer Look at Apache Kudu
A Closer Look at Apache Kudu
 

Viewers also liked

Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...Cynthia Saracco
 
Overview - IBM Big Data Platform
Overview - IBM Big Data PlatformOverview - IBM Big Data Platform
Overview - IBM Big Data PlatformVikas Manoria
 
IBM Big Data Analytics Concepts and Use Cases
IBM Big Data Analytics Concepts and Use CasesIBM Big Data Analytics Concepts and Use Cases
IBM Big Data Analytics Concepts and Use CasesTony Pearson
 
Big Data & Analytics Architecture
Big Data & Analytics ArchitectureBig Data & Analytics Architecture
Big Data & Analytics ArchitectureArvind Sathi
 
Big Data - The 5 Vs Everyone Must Know
Big Data - The 5 Vs Everyone Must KnowBig Data - The 5 Vs Everyone Must Know
Big Data - The 5 Vs Everyone Must KnowBernard Marr
 
IBM Watson Analytics Presentation
IBM Watson Analytics PresentationIBM Watson Analytics Presentation
IBM Watson Analytics PresentationIan Balina
 
Georgia Azure Event - Scalable cloud games using Microsoft Azure
Georgia Azure Event - Scalable cloud games using Microsoft AzureGeorgia Azure Event - Scalable cloud games using Microsoft Azure
Georgia Azure Event - Scalable cloud games using Microsoft AzureMicrosoft
 
OpenPOWER Roadmap Toward CORAL
OpenPOWER Roadmap Toward CORALOpenPOWER Roadmap Toward CORAL
OpenPOWER Roadmap Toward CORALinside-BigData.com
 
Presentacin webinar move_up_to_power8_with_scale_out_servers_final
Presentacin webinar move_up_to_power8_with_scale_out_servers_finalPresentacin webinar move_up_to_power8_with_scale_out_servers_final
Presentacin webinar move_up_to_power8_with_scale_out_servers_finalDiego Alberto Tamayo
 
Oracle Solaris Software Integration
Oracle Solaris Software IntegrationOracle Solaris Software Integration
Oracle Solaris Software IntegrationOTN Systems Hub
 
Open Innovation with Power Systems
Open Innovation with Power Systems Open Innovation with Power Systems
Open Innovation with Power Systems IBM Power Systems
 
Expert summit SQL Server 2016
Expert summit   SQL Server 2016Expert summit   SQL Server 2016
Expert summit SQL Server 2016Łukasz Grala
 
Oracle Solaris Secure Cloud Infrastructure
Oracle Solaris Secure Cloud InfrastructureOracle Solaris Secure Cloud Infrastructure
Oracle Solaris Secure Cloud InfrastructureOTN Systems Hub
 

Viewers also liked (20)

Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...
 
Overview - IBM Big Data Platform
Overview - IBM Big Data PlatformOverview - IBM Big Data Platform
Overview - IBM Big Data Platform
 
IBM Big Data Analytics Concepts and Use Cases
IBM Big Data Analytics Concepts and Use CasesIBM Big Data Analytics Concepts and Use Cases
IBM Big Data Analytics Concepts and Use Cases
 
Big Data & Analytics Architecture
Big Data & Analytics ArchitectureBig Data & Analytics Architecture
Big Data & Analytics Architecture
 
Big Data - The 5 Vs Everyone Must Know
Big Data - The 5 Vs Everyone Must KnowBig Data - The 5 Vs Everyone Must Know
Big Data - The 5 Vs Everyone Must Know
 
IBM Watson Analytics Presentation
IBM Watson Analytics PresentationIBM Watson Analytics Presentation
IBM Watson Analytics Presentation
 
Georgia Azure Event - Scalable cloud games using Microsoft Azure
Georgia Azure Event - Scalable cloud games using Microsoft AzureGeorgia Azure Event - Scalable cloud games using Microsoft Azure
Georgia Azure Event - Scalable cloud games using Microsoft Azure
 
OpenPOWER Roadmap Toward CORAL
OpenPOWER Roadmap Toward CORALOpenPOWER Roadmap Toward CORAL
OpenPOWER Roadmap Toward CORAL
 
OpenPOWER Update
OpenPOWER UpdateOpenPOWER Update
OpenPOWER Update
 
The State of Linux Containers
The State of Linux ContainersThe State of Linux Containers
The State of Linux Containers
 
IBM POWER8 as an HPC platform
IBM POWER8 as an HPC platformIBM POWER8 as an HPC platform
IBM POWER8 as an HPC platform
 
Presentacin webinar move_up_to_power8_with_scale_out_servers_final
Presentacin webinar move_up_to_power8_with_scale_out_servers_finalPresentacin webinar move_up_to_power8_with_scale_out_servers_final
Presentacin webinar move_up_to_power8_with_scale_out_servers_final
 
Bitcoin explained
Bitcoin explainedBitcoin explained
Bitcoin explained
 
Blockchain
BlockchainBlockchain
Blockchain
 
Oracle Solaris Software Integration
Oracle Solaris Software IntegrationOracle Solaris Software Integration
Oracle Solaris Software Integration
 
Open Innovation with Power Systems
Open Innovation with Power Systems Open Innovation with Power Systems
Open Innovation with Power Systems
 
IBM Power8 announce
IBM Power8 announceIBM Power8 announce
IBM Power8 announce
 
Expert summit SQL Server 2016
Expert summit   SQL Server 2016Expert summit   SQL Server 2016
Expert summit SQL Server 2016
 
Puppet + Windows Nano Server
Puppet + Windows Nano ServerPuppet + Windows Nano Server
Puppet + Windows Nano Server
 
Oracle Solaris Secure Cloud Infrastructure
Oracle Solaris Secure Cloud InfrastructureOracle Solaris Secure Cloud Infrastructure
Oracle Solaris Secure Cloud Infrastructure
 

Similar to 20150716 introduction to apache spark v3

Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Andrey Vykhodtsev
 
Boston Spark Meetup event Slides Update
Boston Spark Meetup event Slides UpdateBoston Spark Meetup event Slides Update
Boston Spark Meetup event Slides Updatevithakur
 
Azure Databricks is Easier Than You Think
Azure Databricks is Easier Than You ThinkAzure Databricks is Easier Than You Think
Azure Databricks is Easier Than You ThinkIke Ellis
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25thSneha Challa
 
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosApache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosEuangelos Linardos
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irdatastack
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introductionsudhakara st
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Massimo Schenone
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big pictureJ S Jodha
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsDatabricks
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsDataStax Academy
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 
Big Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingBig Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingAnimesh Chaturvedi
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkVincent Poncet
 
About "Apache Cassandra"
About "Apache Cassandra"About "Apache Cassandra"
About "Apache Cassandra"Jihyun Ahn
 

Similar to 20150716 introduction to apache spark v3 (20)

Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
 
Boston Spark Meetup event Slides Update
Boston Spark Meetup event Slides UpdateBoston Spark Meetup event Slides Update
Boston Spark Meetup event Slides Update
 
Azure Databricks is Easier Than You Think
Azure Databricks is Easier Than You ThinkAzure Databricks is Easier Than You Think
Azure Databricks is Easier Than You Think
 
Big data concepts
Big data conceptsBig data concepts
Big data concepts
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
 
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosApache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.ir
 
hadoop-spark.ppt
hadoop-spark.ppthadoop-spark.ppt
hadoop-spark.ppt
 
Bds session 13 14
Bds session 13 14Bds session 13 14
Bds session 13 14
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big picture
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Big Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingBig Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computing
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
About "Apache Cassandra"
About "Apache Cassandra"About "Apache Cassandra"
About "Apache Cassandra"
 

20150716 introduction to apache spark v3

  • 1. Presenter: Andrey Vykhodtsev (Andrey.vykhodtsev@si.ibm.com)
      *Collective work; see the slide credits.
  • 2.
      • Two meetup groups
      • Close, but different topics
      • Run by me
      • I don’t have to be the presenter all the time
      • Propose your agenda
  • 3.
      • Not a Big Data introduction
          • Visit our next Big Data Essentials meetup instead: http://www.meetup.com/Big-Data-Developers-in-Slovenia/events/223871144/
      • Not for people without a technical background (sorry)
      • Not a thorough use case discussion
      • Just a technical overview of the technology for beginners
  • 4. General purpose distributed computing engine suitable for large scale machine learning and data processing tasks
  • 5.
      NOT SO GOOD
      • Not the first computing engine
          • MapReduce
          • MPI
      • Not one of a kind
          • Flink
      • Not so old (mature)
      GOOD
      • Developing very fast
      • Rapidly growing community
      • Backed by major vendors
      • Innovation
      • Designed for iterative data analysis on large scale (supersedes MR)
  • 6.
      • In-Memory Performance
      • Ease of Development
      • Combine Workflows
      • Unlimited Scale
      • Enterprise Platform
      • Wide Range of Data Formats
  • 7.
      • A Big Data/DWH developer
      • A Data Scientist
      • An Analytics Architect
      • A CxO of an IT company
      Figure labels: Statistician, Business Analyst, Software Engineer
  • 8.
      IT WORDS
      • Data processing/Transformation
      • Machine Learning
      • Social Network Analysis
      • Streaming/Microbatching
      BUSINESS WORDS
      • Segmentation
      • Campaign response prediction
      • Churn avoidance
      • CTR prediction
      • Behavioral analysis
      • Genomics
      • ….
  • 9.
  • 10.
  • 11.
      • Open Source SystemML
      • Educate One Million Data Professionals
      • Establish Spark Technology Center
      • Founding Member of AMPLab
      • Contributing to the Core
      • Port many existing applications onto Spark
      • Develop applications using Spark
  • 12.
  • 13.
      • Distributed platform for thousands of nodes
      • Data storage and computation framework
      • Open source
      • Runs on commodity hardware
      • Flexible – everything is loosely coupled
  • 14.
      • Driving principles
          • Files are stored across the entire cluster
          • Programs are brought to the data, not the data to the program
      • Distributed file system (DFS) stores blocks across the whole cluster
          • Blocks of a single file are distributed across the cluster
          • A given block is typically replicated as well for resiliency
          • Just like a regular file system, the contents of a file are up to the application
          • Unlike a regular file system, you can ask it “where does each block of my file live?”
      Figure labels: FILE, BLOCKS
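The block placement idea on this slide can be sketched as a toy model in plain Python. This is only an illustration under stated assumptions: the round-robin placement, the node names, and the sizes are made up for the example, and a real distributed file system uses a more elaborate placement policy.

```python
def place_blocks(file_size, block_size, nodes, replication=3):
    """Split a file into fixed-size blocks and map each block to `replication` distinct nodes.

    Toy round-robin policy (an assumption for illustration, not a real DFS policy).
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    n_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for block_id in range(n_blocks):
        # Replicas of block i go to nodes i, i+1, ... (wrapping around the cluster).
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


# Hypothetical cluster and file: a 300 MB file stored in 128 MB blocks.
nodes = ["node1", "node2", "node3", "node4"]
placement = place_blocks(file_size=300, block_size=128, nodes=nodes, replication=3)

# "Where does each block of my file live?"
for block_id, replicas in placement.items():
    print(f"block {block_id} lives on {replicas}")
```

Knowing where each block lives is what lets a scheduler send the program to a node that already holds the data.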
  • 15.
      map(String key, String value):
        // key: document name
        // value: document contents
        for each word w in value:
          EmitIntermediate(w, "1");

      reduce(String key, Iterator values):
        // key: a word
        // values: a list of counts
        int result = 0;
        for each v in values:
          result += ParseInt(v);
        Emit(AsString(result));

      Content of input documents:
        Document 1: Hello World Bye World
        Document 2: Hello IBM
      Map 1 emits: <Hello, 1> <World, 1> <Bye, 1> <World, 1>
      Map 2 emits: <Hello, 1> <IBM, 1>
      Reduce (final output): <Bye, 1> <IBM, 1> <Hello, 2> <World, 2>
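The pseudocode above can be tried out without a cluster. The following is a minimal single-process sketch in plain Python, assuming the two input documents from the slide; the function names `map_phase`, `shuffle`, and `reduce_phase` are invented for this example and are not part of Hadoop or Spark.

```python
from collections import defaultdict


def map_phase(doc_name, contents):
    # Emit an intermediate (word, 1) pair for every word in the document.
    return [(word, 1) for word in contents.split()]


def shuffle(pairs):
    # Group intermediate values by key; the framework does this between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    # Sum all counts emitted for one word.
    return word, sum(counts)


documents = {"doc1": "Hello World Bye World", "doc2": "Hello IBM"}

intermediate = [
    pair for name, text in documents.items() for pair in map_phase(name, text)
]
result = dict(reduce_phase(word, counts) for word, counts in shuffle(intermediate).items())
print(result)
```

On a real cluster the map calls run on the nodes that hold the input blocks, and the shuffle moves data over the network; here everything happens in one process.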
  • 16. Spark brings two significant value-adds:
      • Bring to MapReduce the same added value that databases (and parallel databases) brought to query processing:
          • Let the app developer focus on the WHAT (they need to ask) and let the system figure out the HOW (it should be done)
          • Enable faster, higher-level application development through higher-level constructs and concepts (the RDD concept)
          • Let the system deal with performance (as part of the HOW)
              • Leveraging memory (buffer pools, caching RDDs in memory)
              • Maintaining sets of dedicated worker processes ready to go (subagents in a DBMS, Executors in Spark)
              • Enabling interactive processing (CLP, SQL*Plus, spark-shell, etc.)
      • Be one general purpose engine for multiple types of workloads (SQL, Streaming, Machine Learning, etc.)
  • 17. Apache Spark is a fast, general purpose, easy-to-use cluster computing system for large-scale data processing
      • Fast
          • Relies on aggressively cached, in-memory distributed computing and keeps dedicated application Executor processes alive even when no jobs are running
          • Faster than MapReduce
      • General purpose
          • Covers a wide range of workloads
          • Provides SQL, streaming and complex analytics
      • Flexible and easier to use than MapReduce
          • Spark is written in Scala, an object-oriented, functional programming language
          • Scala, Python and Java APIs
          • Scala and Python interactive shells
          • Runs on Hadoop, Mesos, standalone or cloud
      Figure labels: Logistic regression in Hadoop and Spark; Spark Stack
      WordCount:
        val wordCounts = sc.textFile("README.md")
          .flatMap(line => line.split(" "))
          .map(word => (word, 1))
          .reduceByKey((a, b) => a + b)
  • 18. Spark is versatile and flexible:
      • Can run on YARN / HDFS but also standalone or on Mesos
      • Spark engine can be exploited from multiple “entry points”: SQL, Streaming, Machine Learning, Graph Processing
  • 19.
      • Normally you code stuff up in one of the languages
          • Scala
          • Python
          • Java
      • I like Python, but in some cases it is slower
          • With DataFrames, no difference (more later)
  • 20.
      • One of the shells
          • Scala shell (spark-shell)
          • Python shell
      • Code it in the editor and submit with spark-submit
      • Use a “notebook” (Jupyter, Zeppelin)
          • My preferred method. More later
      • Enable your IDE to run Spark
          • PyCharm
          • IntelliJ IDEA
  • 21.
  • 22.
      • Jupyter
      • Zeppelin
          • Scala
          • Incubated
      • Many others
          • Spark Notebook
          • Ispark
          • DataBricks Cloud
          • IBM Spark aaS
          • IBM DataScientist Workbench
  • 23.
      • Initialize context
      • Read data
      • Run stuff
          • Transformations
          • Actions
          • Caching
      • More later
  • 24.
      GOOD STUFF
      • Full API exposed
      • Concise language
      • Documentation is way better
      • Faster if you use plain RDDs
      • Build tools and dependency tracking
      NOT SO GOOD STUFF
      • Not so many additional libraries compared to Python
          • Pandas
          • Matplotlib
      • Harder to run in a “notebook”*
          • *At the moment
      • Harder to learn
  • 25.
      • Scala Crash Course
          • Holden Karau, DataBricks: http://lintool.github.io/SparkTutorial/slides/day1_Scala_crash_course.pdf
      • Martin Odersky’s “Functional Programming in Scala” course
      • Books
          • Scala for the Impatient
          • Scala by Example
  • 26.
      GOOD STUFF
      • Clean & clear language
      • Easy to learn
      • Lots of libraries
          • Pandas
          • Scikit
          • matplotlib
      • Easy to run in a “notebook”
      NOT SO GOOD STUFF
      • Slower
          • Interpreted language
      • Not all API functions exposed
          • Streaming
      • Sometimes behaves differently
  • 27.
      • I think coding in Java for Spark is terrible
      • But if you like it messy, there is nobody to stop you
  • 28.
  • 29.
      • A way to connect to the Spark engine
      • Initialized with all runtime parameters
          • For example, memory parameters
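One common place where such runtime parameters show up is the `spark-submit` command line, where `--conf key=value` pairs are passed to the application's context. The sketch below only assembles such a command as a list of arguments and prints it; it does not launch Spark. The script name `my_job.py`, the `local[2]` master, and the chosen values are assumptions for illustration, while `spark.executor.memory` and `spark.executor.cores` are standard Spark configuration properties.

```python
import shlex


def build_submit_command(script, master, conf):
    """Assemble a spark-submit argument list from a dict of configuration properties."""
    cmd = ["spark-submit", "--master", master]
    for key, value in sorted(conf.items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(script)
    return cmd


# Hypothetical job and settings, chosen only for the example.
conf = {"spark.executor.memory": "2g", "spark.executor.cores": "2"}
cmd = build_submit_command("my_job.py", "local[2]", conf)

# Print a shell-safe version of the command instead of running it.
print(shlex.join(cmd))
```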
  • 30.
      • Resilient Distributed Dataset
      • An abstraction over a generic data collection
          • Integers
          • Strings
          • PairRDD: <key, value> pairs (support additional operations)
      • Single logical entity, but under the hood it is a distributed collection
      Figure: an RDD “Names” (Mokhtar, Jacques, Dirk, Cindy, Dan, Susan, Dirk, Frank, Jacques) split across Partition 1, Partition 2 and Partition 3
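The "single logical entity, distributed under the hood" idea can be mimicked in plain Python. This is a toy sketch, not Spark code: the helper `split_into_partitions` is invented for the example, and the assumption that the nine names are split into three contiguous groups of three is only a guess at what the figure shows.

```python
def split_into_partitions(items, num_partitions):
    """Split a list into at most `num_partitions` contiguous chunks."""
    size = -(-len(items) // num_partitions)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


names = ["Mokhtar", "Jacques", "Dirk", "Cindy", "Dan", "Susan", "Dirk", "Frank", "Jacques"]
partitions = split_into_partitions(names, 3)

# Each partition is processed independently (on a cluster: on different nodes) ...
partial_counts = [len(partition) for partition in partitions]
# ... and the partial results are combined into one answer for the logical dataset.
total = sum(partial_counts)

print(partitions)
print(total)
```

The caller only ever asks a question about `names` as a whole; the split into partitions is an execution detail.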
  • 31.
      • You have to pay attention to what kind of operation you are running
      • Transformations
          • Do not do anything until an action is called
      • Actions
          • Kick off computation
          • Results can be persisted to memory (cache) or to disk (more later)
  • 32. Three methods for creation
      • Distributing a collection of objects from the driver program (using the parallelize method of the Spark context)
          val rddNumbers = sc.parallelize(1 to 10)
          val rddLetters = sc.parallelize(List("a", "b", "c", "d"))
      • Loading an external dataset (file)
          val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
      • Transformation from another existing RDD
          val rddNumbers2 = rddNumbers.map(x => x + 1)
  • 33.
      • Transformations are lazily evaluated
      • Each returns a pointer to the transformed RDD
      • Pair RDD (K,V) functions for MapReduce style transformations
      • Common transformations
          • map
          • filter
          • flatMap
          • reduceByKey
          • sortByKey
          • join
      • See the doc for the full list
  • 34.
      • Actions kick off the computation
      • Transformations are lazily evaluated
      • Common actions
          • collect()
          • count()
          • take()
          • reduce()
          • first()
          • saveAsTextFile()
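The split between lazy transformations and eager actions can be illustrated with a small stand-in class in plain Python. This is a toy model under stated assumptions, not Spark's implementation: `LazyDataset` and `add_one` are invented names, and the class only records `map` and `filter` steps until `collect()` or `count()` is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, source, ops=()):
        self._source = source
        self._ops = tuple(ops)

    # Transformations: return a new dataset, do no work yet.
    def map(self, fn):
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._ops + (("filter", predicate),))

    def _compute(self):
        data = iter(self._source)
        for kind, fn in self._ops:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    # Actions: trigger the recorded pipeline.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())


calls = []


def add_one(x):
    calls.append(x)  # record that work actually happened
    return x + 1


numbers = LazyDataset(range(1, 11))
evens = numbers.map(add_one).filter(lambda x: x % 2 == 0)

calls_before = len(calls)   # no element has been processed yet
result = evens.collect()    # the action runs the whole pipeline
calls_after = len(calls)

print(calls_before, result, calls_after)
```

Because nothing runs until an action is called, a mistake inside a transformation typically surfaces only at the action that triggers it.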
  • 35.
      • Each node stores any partitions of the cache that it computes in memory
      • Reuses them in other actions on that dataset (or datasets derived from it)
      • Future actions are much faster (often by more than 10x)
      • Two methods for RDD persistence: persist() and cache()
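Why caching speeds up later actions can be shown with a toy model in plain Python. This is only a sketch, assuming a made-up `CachedDataset` class and an `expensive` function that counts how often it runs; it imitates the behaviour described above and is not Spark's API.

```python
class CachedDataset:
    """Toy dataset that recomputes on every action unless cache() was requested."""

    def __init__(self, compute):
        self._compute = compute
        self._use_cache = False
        self._cached = None

    def cache(self):
        # Only marks the dataset; the data is stored when the first action runs.
        self._use_cache = True
        return self

    def _materialize(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        data = list(self._compute())
        if self._use_cache:
            self._cached = data
        return data

    def count(self):
        return len(self._materialize())

    def collect(self):
        return list(self._materialize())


runs = {"n": 0}


def expensive():
    runs["n"] += 1  # count how many times the data is recomputed
    return (x * x for x in range(5))


uncached = CachedDataset(expensive)
uncached.count()
uncached.collect()
runs_without_cache = runs["n"]

runs["n"] = 0
cached = CachedDataset(expensive).cache()
cached.count()
squares = cached.collect()
runs_with_cache = runs["n"]

print(runs_without_cache, runs_with_cache, squares)
```

Without the cache every action repeats the full computation; with it, only the first action pays the cost.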
  • 36. Figure (source: DataBricks): how a job moves through the scheduler
      • RDD Objects: rdd1.join(rdd2).groupBy(…).filter(…); build operator DAG
      • DAGScheduler (receives the DAG): split graph into stages of tasks; submit each stage as ready; agnostic to operators!
      • TaskScheduler (receives a TaskSet): launch tasks via cluster manager; retry failed or straggling tasks; doesn’t know about stages; reports “stage failed” back
      • Worker (receives a Task): execute tasks; store and serve blocks (Threads, Block manager)
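The "split graph into stages" step can be sketched very roughly in plain Python. This is a heavy simplification under stated assumptions: it treats the job as a linear list of operator names and starts a new stage at every operation that needs a shuffle, whereas the real scheduler works on the dependency graph between RDDs. The `WIDE_OPS` set and the example pipeline are chosen for illustration only.

```python
# Operations assumed here to require a shuffle (data exchange between nodes).
WIDE_OPS = {"reduceByKey", "groupBy", "join", "sortByKey"}


def split_into_stages(ops):
    """Cut a linear chain of operations into stages at every shuffle boundary."""
    stages, current = [], []
    for op in ops:
        if op in WIDE_OPS and current:
            stages.append(current)  # close the stage before the shuffle
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages


pipeline = ["textFile", "flatMap", "map", "reduceByKey", "filter", "sortByKey"]
stages = split_into_stages(pipeline)

for number, stage in enumerate(stages):
    print(f"stage {number}: {stage}")
```

Operations inside one stage can be pipelined on the same partition without moving data, which is why the stage is the unit that gets turned into a set of tasks.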
  • 37.
  • 38. Figure: Spark application (App) on a cluster
      • Driver Program (SparkContext)
      • Cluster Manager
      • Worker Node: Executor (Cache, Task, Task)
      • Worker Node: Executor (Cache, Task, Task)
  • 39.
  • 40.
      • MLlib
          • Distributed machine learning libraries
      • SparkSQL
      • DataFrames
      • GraphX
      • ML
      • SparkR
      • Streaming
  • 41.
      • Read the Fine Manual
          • https://spark.apache.org/docs/latest/index.html
      • Take the course
          • BigData University: https://bigdatauniversity.com/bdu-wp/bdu-course/spark-fundamentals/
          • edX: search edx.org for Spark
      • If you’re stuck
          • Try the user lists: https://spark.apache.org/community.html
  • 42.
      • Questions?
      • Topic for the next meetup?
      • Your experiences?
      • Want to be a presenter?
  • 43.
      • Some slides, text and graphics were borrowed from the following sources
          • Vincent Poncet, IBM France
          • Jacques Roy, IBM US
          • Daniel Kikuchi, IBM US
          • Mokhtar Kandil, IBM US
          • DataBricks
          • Spark Docs
      • I completely lost track of which slides I copied from which source. I apologize.