SlideShare a Scribd company logo
1 of 70
Introduction to Real-time
Big Data with Apache Spark
Introduction
About Me
https://ua.linkedin.com/in/tarasmatyashovsky
Agenda
• Buzzwords
• Spark in a Nutshell
• Spark Concepts
• Spark Core
• live demo session
• Spark SQL
• live demo session
• Road to Production
• Spark Drawbacks
• Our Spark Integration
• Spark is on a Rise
Buzzword for large
and complex data sets
difficult to process using on-hand
database management tools or
traditional data processing applications
https://www.linkedin.com/pulse/decoding-buzzwords-big-data-predictive-analytics-business-gordon
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
Jesus Christ,
It is Big Data,
Get Hadoop!
by Sergey Shelpuk (https://ua.linkedin.com/in/shelpuk) at AI Club Meetup in Lviv
To Hadoop?
http://www.thoughtworks.com/insights/blog/hadoop-or-not-hadoop
• Batch mode, not real-time
• Unstructured or semi-structured data
• MapReduce programming model, e.g.
key/value pairs
Not to Hadoop?
• Real-time, streaming
• Structures which could not be
decomposed to key-value pairs
• Jobs/algorithms which do not yield to
the MapReduce programming model
http://www.thoughtworks.com/insights/blog/hadoop-or-not-hadoop
Not to Hadoop?
• Subset of data is enough
Remove excessive complexity or shrink data set via other
processing techniques, e.g.: hashing, clusterization
• Random, Interactive Access to Data
Well structured data
Bunch of scalable mature (No)SQL DB solutions exist
(Hbase/Cassandra/Columnar scalable DW engines)
• Sensitive Data
Security is still very challenging and immature
Why Spark?
As of mid 2014,
Spark is the most active Big Data project
http://www.slideshare.net/databricks/new-direction-for-spark-in-2015-spark-summit-east
Contributors per month to Spark
Spark
Fast and general-purpose
cluster computing platform
for large-scale data processing
History
Time to Sort 100TB
http://www.slideshare.net/databricks/new-direction-for-spark-in-2015-spark-summit-east
Why Spark is Faster?
Spark processes data in-memory while
Hadoop persists back to the disk
after a map/reduce action
Powered by Spark
https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
Components Stack
https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html
Core Concepts
automatically distribute data across cluster
and
parallelize operations performed on them
Distributed Application
https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html
http://spark.apache.org/docs/latest/cluster-overview.html
Spark Core Abstractions
RDD API
Transformations:
• filter()
• map()
• flatMap()
• distinct()
• union()
• intersection()
• subtract()
• etc.
Actions:
• collect()
• reduce()
• count()
• countByValue()
• first()
• take()
• top()
• etc.
RDD Operations
• transformations are executed on
workers
• actions may transfer data from the
workers to the driver
• сollect() sends all the partitions to the
single driver
http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs
Pair RDD
Transformations:
• reduceByKey()
• groupByKey()
• sortByKey()
• keys()
• values()
• join()
• etc.
Actions:
• countByKey()
• collectAsMap()
• lookup()
• etc.
Sample Application
https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
Requirements
Analytics about Morning@Lohika events:
• unique participants by companies
• most loyal participants
• participants by position
• etc.
https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
Data Format
Simple CSV files
all fields are optional
First Name Last Name Company Position Email Present
Vladimir Tsukur GlobalLogic
Tech/Team
Lead
flushdia@gmail.com 1
Mikalai Alimenkou XP Injection Tech Lead
mikalai.alimenkou@
xpinjection.com
1
Taras Matyashovsky Lohika
Software
Engineer
taras.matyashovsky@
gmail.com
0
https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
Technologies
Technologies:
• Spring Boot 1.2.3.RELEASE
• Spark 1.3.1 - released April 17, 2015
• 2 Spark jar dependencies
• Apache 2.0 license, i.e. free to use
https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
Features
• simple HTTP-based API
• file system: local and HDFS
• data formats: CSV and Parquet
• 3 compatible implementations based on:
• RDD (Spark Core)
• Data Frame DSL (Spark SQL)
• Data Frame SQL (Spark SQL)
• serialization: default Java and Kryo
https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
Demo Time
https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
Cluster
Manager
Worker
Driver
Spark
Context
Executor
Task
Worker
Executor
Task
http://spark.apache.org/docs/latest/cluster-overview.html
Task
Task
Demo Explained
Limited opportunities for
automatic optimization
Functional Programming API Drawback
Structured data processing
Spark SQL
Distributed collection of data
organized into named columns
Data Frame
Data Frame API
• selecting columns
• joining different data sources
• aggregation, e.g. sum, count, average
• filtering
Plan Optimization & Execution
http://web.eecs.umich.edu/~prabal/teaching/resources/eecs582/armbrust15sparksql.pdf
Faster than RDD
http://www.slideshare.net/databricks/spark-sqlsse2015public
Demo Time
https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
Persistence & Caching
• by default stores the data in the JVM
heap as unserialized objects
• possibility to store on disk as
unserialized/serialized objects
• off-heap caching is experimental and
uses
https://spark.apache.org/docs/latest/running-on-mesos.html
Cluster Manager
should be chosen and configured
properly
Monitoring
via web UI(s) and metrics
Monitoring
• master web UI
• worker web UI
• driver web UI
• available only during execution
• history server
• spark.eventLog.enabled = true
Metrics
• based on Coda Hale Metrics library
• can be reported via HTTP, JMX, and
CSV files
https://spark.apache.org/docs/latest/tuning.html
Serialization
https://spark.apache.org/docs/latest/configuration.html#compression-and-serialization
Memory Management
Tune Executor Memory Fraction
RDD Storage (60%)
Shuffle and aggregation
buffers (20%)
User code (20%)
https://spark.apache.org/docs/latest/configuration.html#shuffle-behavior
Memory Management
Tune storage level:
• store in memory and/or on disk
• store as unserialized/serialized objects
• replicate each partition on 1 or 2 cluster
nodes
• store in Tachyon
https://spark.apache.org/docs/latest/programming-guide.html#which-storage-level-to-choose
Level of Parallelism
• spark.task.cpus
• 1 task per partition using 1 core to execute
• spark.default.parallelism
• can be controlled:
• repartition() and coalescence() functions
• degree of parallelism as a operations parameter
• storage system matters
Data Locality
• check data locality via UI
• configure data locality settings if
needed
• spark.locality.wait timeout
• execute certain jobs on a driver
• spark.localExecution.enabled
Java API Drawbacks
• API can be experimental or used just
for development
• Spark Java API can be not up-to-date
as Scala API is main focus
Our Spark Integration
Product
Cloud-based analytics application
Use Cases
• supplement Neo4j database used to
store/query big dimensions
• supplement RDBMS for querying of
high volumes of data
Use Cases
• represent existing computational graph
as flow of Spark-based operations
• predictive analytics based on Spark
MLib component
Lessons Learned
• Spark simplicity is deceptive
• Each use case is unique
• Be really aware:
• Databricks blog
• Mailing lists & Jira
• Pull requests
Spark is kind of magic
Spark is on a Rise
http://www.techrepublic.com/article/can-anything-dim-apache-spark/
Project Tungsten
• the largest change to Spark’s execution
engine since the project’s inception
• focuses on substantially improving the
efficiency of memory and CPU for
Spark applications
• sun.misc.Unsafe
https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
Thank you!
Taras Matyashovsky
taras.matyashovsky@gmail.com
@tmatyashovsky
http://www.filevych.com/
References
https://www.linkedin.com/pulse/decoding-buzzwords-big-data-predictive-analytics-
business-gordon
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
http://www.thoughtworks.com/insights/blog/hadoop-or-not-hadoop
http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-
models/
Learning Spark, by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia (early
release ebook from O'Reilly Media)
https://spark-prs.appspot.com/#all
https://www.gitbook.com/book/databricks/databricks-spark-knowledge-base/details
http://insidebigdata.com/2015/03/06/8-reasons-apache-spark-hot/
https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
http://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-
sorting.html
http://web.eecs.umich.edu/~prabal/teaching/resources/eecs582/armbrust15sparksql.pdf
http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-
better-spark-programs
http://www.slideshare.net/databricks/new-direction-for-spark-in-2015-spark-summit-east
http://www.slideshare.net/databricks/spark-sqlsse2015public
https://spark.apache.org/docs/latest/running-on-mesos.html
http://spark.apache.org/docs/latest/cluster-overview.html
http://www.techrepublic.com/article/can-anything-dim-apache-spark/
http://spark-packages.org/

More Related Content

What's hot

How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapePaco Nathan
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache SparkBTI360
 
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...Edureka!
 
Productionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan ChanProductionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan ChanSpark Summit
 
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...Databricks
 
How to build your query engine in spark
How to build your query engine in sparkHow to build your query engine in spark
How to build your query engine in sparkPeng Cheng
 
Spark Internals Training | Apache Spark | Spark | Anika Technologies
Spark Internals Training | Apache Spark | Spark | Anika TechnologiesSpark Internals Training | Apache Spark | Spark | Anika Technologies
Spark Internals Training | Apache Spark | Spark | Anika TechnologiesAnand Narayanan
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark Mostafa
 
Spark after Dark by Chris Fregly of Databricks
Spark after Dark by Chris Fregly of DatabricksSpark after Dark by Chris Fregly of Databricks
Spark after Dark by Chris Fregly of DatabricksData Con LA
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Databricks
 
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...Richard Seymour
 
The Pushdown of Everything by Stephan Kessler and Santiago Mola
The Pushdown of Everything by Stephan Kessler and Santiago MolaThe Pushdown of Everything by Stephan Kessler and Santiago Mola
The Pushdown of Everything by Stephan Kessler and Santiago MolaSpark Summit
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkSamy Dindane
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark Aakashdata
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark InternalsPietro Michiardi
 

What's hot (20)

How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscape
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Productionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan ChanProductionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan Chan
 
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
How to build your query engine in spark
How to build your query engine in sparkHow to build your query engine in spark
How to build your query engine in spark
 
Hadoop and Spark
Hadoop and SparkHadoop and Spark
Hadoop and Spark
 
Spark Internals Training | Apache Spark | Spark | Anika Technologies
Spark Internals Training | Apache Spark | Spark | Anika TechnologiesSpark Internals Training | Apache Spark | Spark | Anika Technologies
Spark Internals Training | Apache Spark | Spark | Anika Technologies
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 
Spark after Dark by Chris Fregly of Databricks
Spark after Dark by Chris Fregly of DatabricksSpark after Dark by Chris Fregly of Databricks
Spark after Dark by Chris Fregly of Databricks
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...
 
Apache Spark 101
Apache Spark 101Apache Spark 101
Apache Spark 101
 
The Pushdown of Everything by Stephan Kessler and Santiago Mola
The Pushdown of Everything by Stephan Kessler and Santiago MolaThe Pushdown of Everything by Stephan Kessler and Santiago Mola
The Pushdown of Everything by Stephan Kessler and Santiago Mola
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
 

Viewers also liked

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
Big data analysis in java world
Big data analysis in java worldBig data analysis in java world
Big data analysis in java worldSerg Masyutin
 
Tweaking performance on high-load projects
Tweaking performance on high-load projectsTweaking performance on high-load projects
Tweaking performance on high-load projectsDmitriy Dumanskiy
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Sparkdatamantra
 
Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingCloudera, Inc.
 
Python from zero to hero (Twitter Explorer)
Python from zero to hero (Twitter Explorer)Python from zero to hero (Twitter Explorer)
Python from zero to hero (Twitter Explorer)Yuriy Senko
 
Voltdb: Shard It by V. Torshyn
Voltdb: Shard It by V. TorshynVoltdb: Shard It by V. Torshyn
Voltdb: Shard It by V. Torshynvtors
 
JavaScript in Mobile Development
JavaScript in Mobile DevelopmentJavaScript in Mobile Development
JavaScript in Mobile DevelopmentDima Maleev
 
From Pilot to Product - Morning@Lohika
From Pilot to Product - Morning@LohikaFrom Pilot to Product - Morning@Lohika
From Pilot to Product - Morning@LohikaIvan Verhun
 
DuyHai DOAN - Real time analytics with Cassandra and Spark - NoSQL matters Pa...
DuyHai DOAN - Real time analytics with Cassandra and Spark - NoSQL matters Pa...DuyHai DOAN - Real time analytics with Cassandra and Spark - NoSQL matters Pa...
DuyHai DOAN - Real time analytics with Cassandra and Spark - NoSQL matters Pa...NoSQLmatters
 
AWS Simple Workflow: Distributed Out of the Box! - Morning@Lohika
AWS Simple Workflow: Distributed Out of the Box! - Morning@LohikaAWS Simple Workflow: Distributed Out of the Box! - Morning@Lohika
AWS Simple Workflow: Distributed Out of the Box! - Morning@LohikaSerhiy Batyuk
 
Spark - Migration Story
Spark - Migration Story Spark - Migration Story
Spark - Migration Story Roman Chukh
 
Introduction to big data and apache spark
Introduction to big data and apache sparkIntroduction to big data and apache spark
Introduction to big data and apache sparkMohammed Guller
 
Хитрости UX-дизайна: ключевые лайфхаки, которые должен знать разработчик
Хитрости UX-дизайна: ключевые лайфхаки, которые должен знать разработчикХитрости UX-дизайна: ключевые лайфхаки, которые должен знать разработчик
Хитрости UX-дизайна: ключевые лайфхаки, которые должен знать разработчикNick Grachov
 

Viewers also liked (20)

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Big data analysis in java world
Big data analysis in java worldBig data analysis in java world
Big data analysis in java world
 
Tweaking performance on high-load projects
Tweaking performance on high-load projectsTweaking performance on high-load projects
Tweaking performance on high-load projects
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer Training
 
Python from zero to hero (Twitter Explorer)
Python from zero to hero (Twitter Explorer)Python from zero to hero (Twitter Explorer)
Python from zero to hero (Twitter Explorer)
 
Voltdb: Shard It by V. Torshyn
Voltdb: Shard It by V. TorshynVoltdb: Shard It by V. Torshyn
Voltdb: Shard It by V. Torshyn
 
JavaScript in Mobile Development
JavaScript in Mobile DevelopmentJavaScript in Mobile Development
JavaScript in Mobile Development
 
Creation of ideas
Creation of ideasCreation of ideas
Creation of ideas
 
From Pilot to Product - Morning@Lohika
From Pilot to Product - Morning@LohikaFrom Pilot to Product - Morning@Lohika
From Pilot to Product - Morning@Lohika
 
DuyHai DOAN - Real time analytics with Cassandra and Spark - NoSQL matters Pa...
DuyHai DOAN - Real time analytics with Cassandra and Spark - NoSQL matters Pa...DuyHai DOAN - Real time analytics with Cassandra and Spark - NoSQL matters Pa...
DuyHai DOAN - Real time analytics with Cassandra and Spark - NoSQL matters Pa...
 
AWS Simple Workflow: Distributed Out of the Box! - Morning@Lohika
AWS Simple Workflow: Distributed Out of the Box! - Morning@LohikaAWS Simple Workflow: Distributed Out of the Box! - Morning@Lohika
AWS Simple Workflow: Distributed Out of the Box! - Morning@Lohika
 
Spark - Migration Story
Spark - Migration Story Spark - Migration Story
Spark - Migration Story
 
Take a REST!
Take a REST!Take a REST!
Take a REST!
 
Introduction to big data and apache spark
Introduction to big data and apache sparkIntroduction to big data and apache spark
Introduction to big data and apache spark
 
Apache HBase Workshop
Apache HBase WorkshopApache HBase Workshop
Apache HBase Workshop
 
Ayasdi strata
Ayasdi strataAyasdi strata
Ayasdi strata
 
React. Flux. Redux
React. Flux. ReduxReact. Flux. Redux
React. Flux. Redux
 
Хитрости UX-дизайна: ключевые лайфхаки, которые должен знать разработчик
Хитрости UX-дизайна: ключевые лайфхаки, которые должен знать разработчикХитрости UX-дизайна: ключевые лайфхаки, которые должен знать разработчик
Хитрости UX-дизайна: ключевые лайфхаки, которые должен знать разработчик
 
Apache Spark Components
Apache Spark ComponentsApache Spark Components
Apache Spark Components
 

Similar to Introduction to real time big data with Apache Spark

Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformYao Yao
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupRafal Kwasny
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan
 
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...Mark Rittman
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupPaco Nathan
 
End-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache SparkEnd-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache SparkBurak Yavuz
 
Hannes end-of-the-router-tnc17
Hannes end-of-the-router-tnc17Hannes end-of-the-router-tnc17
Hannes end-of-the-router-tnc17Hannes Gredler
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Austin Data Meetup 092014 - Spark
Austin Data Meetup 092014 - SparkAustin Data Meetup 092014 - Spark
Austin Data Meetup 092014 - SparkSteve Blackmon
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedwhoschek
 
Deploying Data Science Engines to Production
Deploying Data Science Engines to ProductionDeploying Data Science Engines to Production
Deploying Data Science Engines to ProductionMostafa Majidpour
 
Spark + H20 = Machine Learning at scale
Spark + H20 = Machine Learning at scaleSpark + H20 = Machine Learning at scale
Spark + H20 = Machine Learning at scaleMateusz Dymczyk
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoMapR Technologies
 
Spark meetup feb 2016
Spark meetup feb 2016Spark meetup feb 2016
Spark meetup feb 2016Todd Niven
 
Spark Streaming @ Scale (Clicktale)
Spark Streaming @ Scale (Clicktale)Spark Streaming @ Scale (Clicktale)
Spark Streaming @ Scale (Clicktale)Yuval Itzchakov
 
Top 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationsTop 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationshadooparchbook
 
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...Databricks
 
Spark Hsinchu meetup
Spark Hsinchu meetupSpark Hsinchu meetup
Spark Hsinchu meetupYung-An He
 

Similar to Introduction to real time big data with Apache Spark (20)

Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User Group
 
End-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache SparkEnd-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache Spark
 
Hannes end-of-the-router-tnc17
Hannes end-of-the-router-tnc17Hannes end-of-the-router-tnc17
Hannes end-of-the-router-tnc17
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Austin Data Meetup 092014 - Spark
Austin Data Meetup 092014 - SparkAustin Data Meetup 092014 - Spark
Austin Data Meetup 092014 - Spark
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmed
 
Deploying Data Science Engines to Production
Deploying Data Science Engines to ProductionDeploying Data Science Engines to Production
Deploying Data Science Engines to Production
 
Spark + H20 = Machine Learning at scale
Spark + H20 = Machine Learning at scaleSpark + H20 = Machine Learning at scale
Spark + H20 = Machine Learning at scale
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
Spark meetup feb 2016
Spark meetup feb 2016Spark meetup feb 2016
Spark meetup feb 2016
 
Spark Streaming @ Scale (Clicktale)
Spark Streaming @ Scale (Clicktale)Spark Streaming @ Scale (Clicktale)
Spark Streaming @ Scale (Clicktale)
 
Top 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationsTop 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applications
 
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
 
Spark Hsinchu meetup
Spark Hsinchu meetupSpark Hsinchu meetup
Spark Hsinchu meetup
 

More from Taras Matyashovsky

Distinguish Pop from Heavy Metal using Apache Spark MLlib
Distinguish Pop from Heavy Metal using Apache Spark MLlibDistinguish Pop from Heavy Metal using Apache Spark MLlib
Distinguish Pop from Heavy Metal using Apache Spark MLlibTaras Matyashovsky
 
Introduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlibIntroduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlibTaras Matyashovsky
 
Morning at Lohika 2nd anniversary
Morning at Lohika 2nd anniversaryMorning at Lohika 2nd anniversary
Morning at Lohika 2nd anniversaryTaras Matyashovsky
 
Influence. The Psychology of Persuasion (in IT)
Influence. The Psychology of Persuasion (in IT)Influence. The Psychology of Persuasion (in IT)
Influence. The Psychology of Persuasion (in IT)Taras Matyashovsky
 
JEEConf 2015 - Introduction to real-time big data with Apache Spark
JEEConf 2015 - Introduction to real-time big data with Apache SparkJEEConf 2015 - Introduction to real-time big data with Apache Spark
JEEConf 2015 - Introduction to real-time big data with Apache SparkTaras Matyashovsky
 
Morning at Lohika 1st anniversary
Morning at Lohika 1st anniversaryMorning at Lohika 1st anniversary
Morning at Lohika 1st anniversaryTaras Matyashovsky
 
New life inside monolithic application
New life inside monolithic applicationNew life inside monolithic application
New life inside monolithic applicationTaras Matyashovsky
 
Distributed applications using Hazelcast
Distributed applications using HazelcastDistributed applications using Hazelcast
Distributed applications using HazelcastTaras Matyashovsky
 
From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.Taras Matyashovsky
 

More from Taras Matyashovsky (12)

Morning 3 anniversary
Morning 3 anniversaryMorning 3 anniversary
Morning 3 anniversary
 
Distinguish Pop from Heavy Metal using Apache Spark MLlib
Distinguish Pop from Heavy Metal using Apache Spark MLlibDistinguish Pop from Heavy Metal using Apache Spark MLlib
Distinguish Pop from Heavy Metal using Apache Spark MLlib
 
Introduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlibIntroduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlib
 
Morning at Lohika 2nd anniversary
Morning at Lohika 2nd anniversaryMorning at Lohika 2nd anniversary
Morning at Lohika 2nd anniversary
 
Confession of an Engineer
Confession of an EngineerConfession of an Engineer
Confession of an Engineer
 
Influence. The Psychology of Persuasion (in IT)
Influence. The Psychology of Persuasion (in IT)Influence. The Psychology of Persuasion (in IT)
Influence. The Psychology of Persuasion (in IT)
 
JEEConf 2015 - Introduction to real-time big data with Apache Spark
JEEConf 2015 - Introduction to real-time big data with Apache SparkJEEConf 2015 - Introduction to real-time big data with Apache Spark
JEEConf 2015 - Introduction to real-time big data with Apache Spark
 
Morning at Lohika 1st anniversary
Morning at Lohika 1st anniversaryMorning at Lohika 1st anniversary
Morning at Lohika 1st anniversary
 
New life inside monolithic application
New life inside monolithic applicationNew life inside monolithic application
New life inside monolithic application
 
Distributed applications using Hazelcast
Distributed applications using HazelcastDistributed applications using Hazelcast
Distributed applications using Hazelcast
 
Morning at Lohika
Morning at LohikaMorning at Lohika
Morning at Lohika
 
From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.
 

Recently uploaded

WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 

Recently uploaded (20)

WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 

Introduction to real time big data with Apache Spark

Editor's Notes

  1. Cluster Manager: Standalone, Apache Mesos, Hadoop Yarn Cluster Manager should be chosen and configured properly Monitoring via web UI(s) and metrics Web UI: master web UI worker web UI driver web UI - available only during execution history server - spark.eventLog.enabled = true Metrics based on Coda Hale Metrics library. Can be reported via HTTP, JMX, and CSV files.
  2. Serialization: default and Kryo Tune Executor Memory Fraction: RDD Storage (60%), Shuffle and Aggregation Buffers (20%), User code (20%) Tune storage level: store in memory and/or on disk store as unserialized/serialized objects replicate each partition on 1 or 2 cluster nodes store in Tachyon Level of Parallelism: spark.task.cpus 1 task per partition using 1 core to execute spark.default.parallelism can be controlled: repartition() and coalescence() functions degree of parallelism as a operations parameter storage system matters Data locality: check data locality via UI configure data locality settings if needed spark.locality.wait timeout execute certain jobs on a driver spark.localExecution.enabled