PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edureka!
** PySpark Certification Training: https://www.edureka.co/pyspark-certification-training **
This Edureka tutorial on PySpark Programming will give you a complete insight into the various fundamental concepts of PySpark. The fundamental concepts include the following (a minimal getting-started sketch appears after the list):
1. PySpark
2. RDDs
3. DataFrames
4. PySpark SQL
5. PySpark Streaming
6. Machine Learning (MLlib)
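As a minimal sketch of where every PySpark program starts, the snippet below creates a SparkSession, the entry point for the DataFrame, SQL, streaming, and MLlib features listed above; the application name is an arbitrary placeholder.

```python
# Minimal PySpark starting point; assumes PySpark is installed (pip install pyspark).
from pyspark.sql import SparkSession

# The SparkSession is the unified entry point for DataFrames, SQL, streaming, and MLlib.
spark = (SparkSession.builder
         .appName("pyspark-fundamentals")   # placeholder application name
         .getOrCreate())

# The SparkContext, the lower-level entry point for RDDs, hangs off the session.
sc = spark.sparkContext
print(spark.version)
```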
Data Wrangling with PySpark for Data Scientists Who Know Pandas with Andrew Ray (Databricks)
Data scientists spend more time wrangling data than making models. Traditional tools like Pandas provide a very powerful data manipulation toolset. Transitioning to big data tools like PySpark allows one to work with much larger datasets, but can come at the cost of productivity.
In this session, learn about data wrangling in PySpark from the perspective of an experienced Pandas user. Topics include best practices, common pitfalls, performance considerations, and debugging.
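To make the Pandas-to-PySpark transition concrete, here is a hedged sketch of the same group-by aggregation in both libraries; the column names and data are illustrative, not taken from the talk.

```python
# Illustrative only: the same aggregation in Pandas and in PySpark.
import pandas as pd
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("wrangling-demo").getOrCreate()

pdf = pd.DataFrame({"city": ["NYC", "NYC", "SF"], "sales": [10, 20, 30]})
print(pdf.groupby("city")["sales"].sum())          # Pandas: eager, single process

sdf = spark.createDataFrame(pdf)                   # hand the data to Spark
(sdf.groupBy("city")                               # PySpark: lazy, distributed
    .agg(F.sum("sales").alias("total_sales"))
    .show())                                       # the action triggers execution
```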
Improving Python and Spark (PySpark) Performance and Interoperability (Wes McKinney)
Slides from Spark Summit East 2017 — February 9, 2017, in Boston. Discusses ongoing development work to accelerate Python-on-Spark performance using Apache Arrow and other tools.
PySpark Tutorial | Introduction to Apache Spark with Python | PySpark Training | Edureka!
** PySpark Certification Training: https://www.edureka.co/pyspark-certification-training **
This Edureka tutorial on PySpark will provide you with detailed and comprehensive knowledge of PySpark: how it works and why Python pairs so well with Apache Spark. You will also learn about RDDs, DataFrames, and MLlib.
PySpark Training | PySpark Tutorial for Beginners | Apache Spark with Python | Edureka!
** PySpark Certification Training: https://www.edureka.co/pyspark-certification-training **
This Edureka tutorial on PySpark Training will help you learn the PySpark API. You will get to know how Python can be used with Apache Spark for Big Data analytics. Edureka's structured training on PySpark will help you master the skills required to become a successful Spark developer using Python and prepare you for the Cloudera Hadoop and Spark Developer Certification Exam (CCA175).
This document discusses PySpark DataFrames. It notes that DataFrames can be constructed from various data sources and are conceptually similar to tables in a relational database. The document explains that DataFrames allow richer optimizations than RDDs due to avoiding context switching between Java and Python. It provides links to resources that demonstrate how to create DataFrames, perform queries using DataFrame APIs and Spark SQL, and use an example flight data DataFrame.
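As a hedged illustration of those ideas, the sketch below builds a DataFrame and runs the same query through both the DataFrame API and Spark SQL; the file name "flights.csv" and its columns (origin, distance) are assumptions for the example, loosely echoing the flight-data demo mentioned above.

```python
# Illustrative sketch; assumes a local "flights.csv" with origin and distance columns.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframes-demo").getOrCreate()

# DataFrames can be constructed from files, Hive tables, or existing RDDs.
df = spark.read.csv("flights.csv", header=True, inferSchema=True)

# Query 1: the DataFrame API (optimized by Catalyst before execution).
df.filter(df.distance > 1000).groupBy("origin").count().show()

# Query 2: the identical question in Spark SQL via a temporary view.
df.createOrReplaceTempView("flights")
spark.sql("""
    SELECT origin, COUNT(*) AS long_flights
    FROM flights WHERE distance > 1000 GROUP BY origin
""").show()
```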
The document discusses scalable machine learning using PySpark. It introduces Apache Spark, an open-source framework for large-scale data processing, and how it allows for both batch and streaming data processing using its in-memory computation engine. The document also provides resources for learning Spark, including tutorials, documentation, and links to large public datasets that can be used for building scalable machine learning models.
Secured (Kerberos-based) Spark Notebook for Data Science: Spark Summit East talk (Spark Summit)
The document discusses securing Spark notebooks for data science by integrating Kerberos authentication. It begins with an overview of Spark notebooks and the current authentication approach. It then covers the requirements for Kerberos integration, how Kerberos works in HDFS and Yarn clusters, and a proposed design to integrate Kerberos into JupyterHub, SparkMagic and Livy to authenticate users and allow secured access to HDFS and Spark from notebooks. Key aspects of the design include custom JupyterHub authenticators and spawners, obtaining service tickets from the KDC, and propagating user identities through the system.
This document provides an overview of Apache Spark, including what it is, its evolution and features, components, and the difference between Spark and Hadoop. Spark was originally developed in 2009 as a fast and general engine for large-scale data processing. It has since become a top-level Apache project and is designed to be up to 100 times faster than Hadoop in memory and 10 times faster on disk. Spark supports SQL, streaming, machine learning and graph processing through components built on its core engine.
This document discusses Apache Spark, an open-source cluster computing framework for big data processing. It provides an overview of Spark, how it fits into the Hadoop ecosystem, why it is useful for big data analytics, and hands-on analysis of data using Spark. Key features that make Spark suitable for big data analytics include simplifying data analysis, built-in machine learning and graph processing libraries, support for multiple programming languages, and faster performance than Hadoop MapReduce.
This document argues that Spark will replace MapReduce as the standard execution engine for Hadoop. Spark is faster than MapReduce for iterative jobs and can be used for batch processing, streaming, machine learning, and graph processing workloads. Cloudera is leading the effort to integrate Spark with Hadoop and make it enterprise-ready through efforts like the One Platform initiative, which unifies Spark and Hadoop for management, security, scale, and streaming capabilities.
Spark + Flashblade: Spark Summit East talk by Brian Gold (Spark Summit)
Modern infrastructure and applications generate extraordinary volumes of log and telemetry data. At Pure Storage, we know this first hand: we have over 5PB of log data from production customers running our all-flash storage systems, from our engineering testbeds, and from test stations at manufacturing partners. Every part of our company — from engineering to sales — now depends on the insights we gather from this data. Given the diversity of our end users, it’s no surprise that our analysis tools comprise a broad mix of reporting queries, stream-processing operations, ad-hoc analyses, and deeper machine-learning algorithms. In this session, we will cover lessons learned from scaling our data warehouse and how we are leveraging Apache Spark’s capabilities as a central hub to meet our analytics demands.
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud (Noritaka Sekiyama)
This document provides an overview and summary of Amazon S3 best practices and tuning for Hadoop/Spark in the cloud. It discusses the relationship between Hadoop/Spark and S3, the differences between HDFS and S3 and their use cases, details on how S3 behaves from the perspective of Hadoop/Spark, well-known pitfalls and tunings related to S3 consistency and multipart uploads, and recent community activities related to S3. The presentation aims to help users optimize their use of S3 storage with Hadoop/Spark frameworks.
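As a hedged sketch of the kind of tuning the talk covers, the snippet below reads and writes S3 through the s3a connector and coalesces output to avoid many small objects; the bucket, paths, and partition count are placeholders, and it assumes the hadoop-aws package and AWS credentials are already configured.

```python
# Placeholder bucket/paths; assumes hadoop-aws and credentials are configured.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-io-demo").getOrCreate()

# s3a:// is the actively maintained S3 filesystem client for Hadoop/Spark.
df = spark.read.parquet("s3a://my-bucket/input/")

# Fewer, larger output files cut down on S3 request overhead and listing cost.
(df.coalesce(8)
   .write.mode("overwrite")
   .parquet("s3a://my-bucket/output/"))
```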
Writing Apache Spark and Apache Flink Applications Using Apache Bahir (Luciano Resende)
Big Data is all about being able to access and process data in various formats, and from various sources. Apache Bahir provides extensions to distributed analytics platforms, giving them access to different data sources. In this talk we will introduce you to Apache Bahir and the various connectors it offers for Apache Spark and Apache Flink. We will also go over the details of how to build, test, and deploy a Spark application using the MQTT data source for the new Apache Spark 2.0 Structured Streaming functionality.
An Insider’s Guide to Maximizing Spark SQL Performance (Takuya UESHIN)
This document provides an overview of optimizing Spark SQL performance. It begins by introducing the speaker and their background with Spark. It then discusses reading query plans, interpreting them to understand optimizations, and tuning plans by pushing down filters, avoiding implicit casts, and applying other techniques. It emphasizes tracking query execution through the Spark UI to analyze jobs, stages, and tasks for bottlenecks. The document aims to help readers understand how to maximize Spark SQL performance.
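A hedged sketch of the plan-reading workflow described here: the dataset path "events/" and the event_date and user_id columns are assumptions for illustration.

```python
# Illustrative only; assumes a Parquet dataset with event_date and user_id columns.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-tuning-demo").getOrCreate()

df = spark.read.parquet("events/")
query = df.filter(df.event_date == "2024-01-01").select("user_id")

# Prints the parsed, analyzed, optimized, and physical plans; in the Parquet
# scan node, PushedFilters shows whether the filter was pushed to the source.
query.explain(True)
```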
This presentation discusses Apache Spark and how it overcomes limitations of MapReduce. It begins by describing strengths of MapReduce like scalability and fault tolerance. It then covers limitations of MapReduce like lack of support for real-time and iterative processing. The presentation explains how Spark addresses these limitations by keeping data in-memory and supporting features like machine learning libraries. It highlights other Spark features such as DataFrames and SparkSQL and concludes by advertising an Apache Spark training course from Edureka.
Tim Spann will present on learning Apache Spark. He is a senior solutions architect who previously worked as a senior field engineer and a startup engineer. airis.DATA, where Spann works, specializes in machine learning and graph solutions using Spark, H2O, Mahout, and Flink on petabyte-scale datasets. The agenda includes an overview of Spark, an explanation of MapReduce, and hands-on exercises to install Spark, run a MapReduce job locally, and build a project with IntelliJ and SBT.
This document is a presentation on Apache Spark that compares its performance to MapReduce. It discusses how Spark is faster than MapReduce, provides code examples of performing word counts in both Spark and MapReduce, and explains features that make Spark suitable for big data analytics, such as simplifying data analysis, providing built-in machine learning and graph libraries, and supporting multiple languages. It also lists many large companies that use Spark for applications like recommendations, business intelligence, and fraud detection.
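For reference, here is the canonical word count in PySpark, the example such comparisons usually center on; the input file name is a placeholder. The whole job is a handful of lines, versus the separate mapper and reducer classes that MapReduce requires.

```python
# The classic word count; "input.txt" is a placeholder path.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
sc = spark.sparkContext

counts = (sc.textFile("input.txt")
            .flatMap(lambda line: line.split())    # "map" phase: emit words
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))      # "reduce" phase: sum per word
print(counts.take(10))
```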
Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se... (Cloudera, Inc.)
Spark MLlib is a library for performing machine learning and associated tasks on massive datasets. With MLlib, fitting a machine-learning model to a billion observations can take only a few lines of code, and leverage hundreds of machines. This talk will demonstrate how to use Spark MLlib to fit an ML model that can predict which customers of a telecommunications company are likely to stop using their service. It will cover the use of Spark's DataFrames API for fast data manipulation, as well as ML Pipelines for making the model development and refinement process easier.
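A hedged sketch of the churn-prediction pipeline the talk describes: the file name, feature columns, and the numeric 0/1 "churned" label are assumptions for illustration, and logistic regression stands in for whatever model the talk actually used.

```python
# Illustrative churn pipeline; dataset path, columns, and model choice are assumed.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("churn-demo").getOrCreate()
df = spark.read.csv("churn.csv", header=True, inferSchema=True)

# Assemble assumed feature columns into the single vector column MLlib expects.
assembler = VectorAssembler(
    inputCols=["total_day_minutes", "customer_service_calls"],
    outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="churned")

train, test = df.randomSplit([0.8, 0.2], seed=42)
model = Pipeline(stages=[assembler, lr]).fit(train)   # one fit call, distributed
model.transform(test).select("churned", "prediction").show(5)
```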
In this era of ever-growing data, the need to analyze it for meaningful business insights becomes more and more significant. There are different Big Data processing alternatives such as Hadoop, Spark, and Storm. Spark, however, is unique in providing both batch and streaming capabilities, making it a preferred choice for lightning-fast Big Data analysis platforms.
http://bit.ly/1BTaXZP – As organizations look for even faster ways to derive value from big data, they are turning to Apache Spark, an in-memory processing framework that offers lightning-fast big data analytics, providing speed, developer productivity, and real-time processing advantages. The Spark software stack includes a core data-processing engine, an interface for interactive querying, Spark Streaming for streaming data analysis, and growing libraries for machine learning and graph analysis. Spark is quickly establishing itself as a leading environment for fast, iterative in-memory and streaming analysis. This talk will give an introduction to the Spark stack, explain how Spark delivers lightning-fast results, and show how it complements Apache Hadoop. By the end of the session, you’ll come away with a deeper understanding of how you can unlock deeper insights from your data, faster, with Spark.
SGI has been designing technical computing hardware and software solutions since 1998. They provide optimized Hadoop solutions using their Rackable and ICE Cube servers and storage systems. SGI's Hadoop solutions offer complete turn-key hardware and software configurations that help customers quickly architect and deploy Apache Hadoop clusters. Their reference architectures are optimized for performance, utilization, power efficiency, and lower costs.
This document provides an overview and introduction to Apache Spark. It discusses what Spark is, how it was developed, why it is useful for big data processing, and how its core components like RDDs, transformations, and actions work. The document also demonstrates examples of using Spark through its interactive shell and shows how to run Spark jobs locally and on a cluster.
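As a small sketch of the transformation/action distinction mentioned above: transformations only record lineage, and nothing runs until an action is called.

```python
# Nothing here touches the data until collect() runs.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-basics").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(10))
squares = rdd.map(lambda x: x * x)              # transformation: lazy
evens = squares.filter(lambda x: x % 2 == 0)    # still lazy, lineage grows
print(evens.collect())                          # action: triggers execution
```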
This session covers how to work with the PySpark interface to develop Spark applications, from loading and ingesting data to applying transformations to it. It covers working with different data sources, applying transformations, and Python best practices for developing Spark apps. The demo covers integrating Apache Spark apps, in-memory processing capabilities, working with notebooks, and integrating analytics tools into Spark applications.
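A hedged sketch of the multi-source loading the session describes; all file paths, formats, and the shared user_id join key are placeholders.

```python
# Placeholder paths and columns; shows multi-source ingestion plus a transform chain.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ingest-demo").getOrCreate()

csv_df = spark.read.option("header", True).csv("data/users.csv")
json_df = spark.read.json("data/events.json")
parquet_df = spark.read.parquet("data/history/")   # third source, shown for variety

# A typical chain: join two sources, derive a column, aggregate.
daily = (csv_df.join(json_df, "user_id")
               .withColumn("day", F.to_date("timestamp"))
               .groupBy("day").count())
daily.show()
```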
Spark Summit EU talk by William Benton (Spark Summit)
The document discusses containerizing Spark clusters on Kubernetes. It describes how the author's Spark cluster looked in 2014 running on Mesos with networked storage. It then covers motivations for microservices architectures and how Spark fits into this. The document outlines architectures for analytics and applications, including responsibilities like transformation, aggregation, training models, and more. It also discusses legacy architectures like data warehouses and Hadoop-style data lakes. Finally, it covers practical considerations and potential pitfalls of containerized Spark clusters like scheduling, security, and storage options.
This document discusses how Apache Spark overcomes the limitations of Hadoop MapReduce. It explains that Spark is up to 100 times faster than MapReduce by keeping data in-memory between jobs rather than writing to disk. It also supports features beyond batch processing like machine learning, streaming, and graph processing through its libraries. Spark constructs jobs as directed acyclic graphs of operators that can be rearranged and optimized to cut down on reading and writing to disk.
This document provides an overview and agenda for a presentation on Apache Ignite. The presentation covers an introduction to Apache Ignite as an in-memory computing platform, use cases, distributed database orchestration using Kubernetes, deploying an Ignite cluster on Kubernetes, and scaling the cluster. It also includes steps to deploy a cloud environment, access the Kubernetes dashboard, create an Ignite service, and check logs.
This document discusses five reasons why Apache Spark is in high demand: 1) low-latency processing by keeping data in memory, 2) support for streaming data through resilient distributed datasets (RDDs), 3) integrated machine learning and graph processing libraries, 4) a DataFrame API for easier data analysis, and 5) the ability to integrate with Hadoop for large-scale data processing. It provides details on Spark's architecture and benchmarks showing its faster performance compared to Hadoop for tasks like sorting large datasets.
This document discusses Apache Spark, an open-source cluster computing framework. It notes that Spark allows for in-memory processing to reduce I/O, is optimized for speed, can operate both in memory and on disk, supports streaming data and machine learning algorithms, integrates DataFrames and graph processing, and can leverage Hadoop for resource management. Major companies like IBM, Cloudera, and eBay use Spark for applications like recommendations, business intelligence, and data analytics.
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training | Edureka!
This Edureka Spark tutorial will help you understand all the basics of Apache Spark. It is ideal for both beginners and professionals who want to learn or brush up on Apache Spark concepts. Below are the topics covered in this tutorial:
1) Big Data Introduction
2) Batch vs Real Time Analytics
3) Why Apache Spark?
4) What is Apache Spark?
5) Using Spark with Hadoop
6) Apache Spark Features
7) Apache Spark Ecosystem
8) Demo: Earthquake Detection Using Apache Spark
This document provides an overview of Apache Spark, including its history, features, architecture, and use cases. Spark started in 2009 at UC Berkeley and was later adopted by the Apache Software Foundation. It provides faster processing than Hadoop by keeping data in memory. Spark supports batch, streaming, and interactive processing on large datasets through its core abstraction, resilient distributed datasets (RDDs).
Spark is an open-source cluster computing framework that can run analytics applications much faster than Hadoop by keeping data in memory rather than on disk. While Spark can access Hadoop's HDFS storage system and is often used as a replacement for Hadoop's MapReduce, Hadoop remains useful for batch processing and Spark is not expected to fully replace it. Spark provides speed, ease of use, and integration of SQL, streaming, and machine learning through its APIs in multiple languages.
Spark is in high demand for several reasons: it offers low-latency processing by keeping data in memory, and it supports streaming analytics, machine learning algorithms, and graph processing. It also introduces DataFrames for easier data analysis and integrates well with Hadoop for processing large datasets. Spark has sorted 100 TB of data three times faster than MapReduce using fewer resources, making it a popular big data processing engine.
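The low-latency claims above rest on keeping data in executor memory; here is a minimal sketch of explicit caching, with synthetic data standing in for a real dataset.

```python
# Synthetic data; shows how caching makes repeated actions cheap.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("caching-demo").getOrCreate()

df = spark.range(10_000_000).withColumnRenamed("id", "n")
df.cache()                        # ask Spark to keep partitions in memory
df.count()                        # the first action materializes the cache
df.filter("n % 2 = 0").count()    # later actions reuse the cached partitions
```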
Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training | Edureka!
This Edureka Spark Hadoop tutorial will help you understand how to use Spark and Hadoop together. It is ideal for both beginners and professionals who want to learn or brush up on their Apache Spark concepts. Below are the topics covered in this tutorial:
1) Spark Overview
2) Hadoop Overview
3) Spark vs Hadoop
4) Why Spark Hadoop?
5) Using Hadoop With Spark
6) Use Case - Sports Analytics (NBA)
Running Spark In Production in the Cloud is Not Easy with Nayur Khan (Databricks)
Apache Spark is the engine powering many data-driven use cases, from data engineering to data science and machine learning applications. At QuantumBlack, Spark is considered a key technology and is used in a number of client engagements, from data engineering, data science, and platform engineering points of view. This talk covers the lessons learned after successfully running Apache Spark workloads in production in the cloud for a number of years. As public cloud adoption grows in the enterprise, more and more organizations are choosing to run Apache Spark workloads on cloud infrastructure. While the cloud presents many benefits, there are a number of challenges that aren’t obvious until you start and sometimes require different approaches or thinking.
The talk looks at a few different areas, starting with the jigsaw pieces you face with open-source software and the balance between a stable platform and room for innovation. It then looks at approaches used to combat the not-so-obvious challenges and trade-offs of using cloud-scalable storage backends for storing and retrieving data. Finally, there is a section on the considerations needed for reliable, manageable, and robust analytics pipelines.
This presentation is the first in a series of Apache Spark tutorials and covers the basics of the Spark framework. Subscribe to my YouTube channel for more updates: https://www.youtube.com/channel/UCNCbLAXe716V2B7TEsiWcoA
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Training | Edureka!
This Edureka "What is Spark" tutorial will introduce you to big data analytics framework - Apache Spark. This tutorial is ideal for both beginners as well as professionals who want to learn or brush up their Apache Spark concepts. Below are the topics covered in this tutorial:
1) Big Data Analytics
2) What is Apache Spark?
3) Why Apache Spark?
4) Using Spark with Hadoop
5) Apache Spark Features
6) Apache Spark Architecture
7) Apache Spark Ecosystem - Spark Core, Spark Streaming, Spark MLlib, Spark SQL, GraphX
8) Demo: Analyze Flight Data Using Apache Spark
Spark can process data faster than Hadoop by keeping data in memory as much as possible to avoid disk I/O. It supports streaming data, machine learning algorithms, graph processing, and SQL queries on structured data through its DataFrame API. Spark can integrate with Hadoop by running on YARN and accessing data in HDFS, as in the sketch below. The key capabilities discussed include low-latency processing, streaming, machine learning, graph processing, DataFrames, and Hadoop integration.
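A hedged sketch of the Hadoop integration just mentioned: a job that, when submitted to YARN, reads its input from HDFS. The namenode host, port, and path are placeholders.

```python
# Placeholder HDFS URI; run with spark-submit --master yarn for YARN execution.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-demo").getOrCreate()

logs = spark.read.text("hdfs://namenode:8020/logs/2024/")
errors = logs.filter(logs.value.contains("ERROR"))   # each line is a "value" column
print(errors.count())
```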
Michal Malohlava talks about the PySparkling Water package for Spark and Python users.
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Apache Spark Architecture (Big Data and Analytics), by Jyotasana Bharti
A slide deck on Apache Spark architecture, covering its features, how it works, and its applications:
Introduction
Features
Understanding Apache Spark Architecture
Working of Apache Spark Architecture
Applications
Conclusion
References
Apache Spark is a lightning-fast cluster computing technology designed for fast computation. It extends Hadoop's MapReduce model to efficiently support more types of computation, including interactive queries and stream processing.
Spark was developed in 2009 in UC Berkeley's AMPLab by Matei Zaharia. It was open-sourced in 2010 under a BSD license, donated to the Apache Software Foundation in 2013, and became a top-level Apache project in February 2014.
This document shares some basic knowledge about Apache Spark.
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial | Simplilearn
This presentation about Apache Spark covers all the basics a beginner needs to get started with Spark: the history of Apache Spark, what Spark is, and the difference between Hadoop and Spark. You will learn about the different components of Spark and how Spark works, with the help of its architecture. You will understand the different cluster managers on which Spark can run. Finally, you will see the various applications of Spark and a use case on Conviva.
Below topics are explained in this Spark presentation:
1. History of Spark
2. What is Spark
3. Hadoop vs Spark
4. Components of Apache Spark
5. Spark architecture
6. Applications of Spark
7. Spark usecase
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
Simplilearn’s Apache Spark and Scala certification training is designed to:
1. Advance your expertise in the Big Data Hadoop Ecosystem
2. Help you master essential Apache Spark skills, such as Spark Streaming, Spark SQL, machine learning programming, GraphX programming, and Spark shell scripting
3. Help you land a Hadoop developer job requiring Apache Spark expertise by giving you a real-life industry project coupled with 30 demos
What skills will you learn?
By completing this Apache Spark and Scala course you will be able to:
1. Understand the limitations of MapReduce and the role of Spark in overcoming these limitations
2. Understand the fundamentals of the Scala programming language and its features
3. Explain and master the process of installing Spark as a standalone cluster
4. Develop expertise in using Resilient Distributed Datasets (RDD) for creating applications in Spark
5. Master Structured Query Language (SQL) using SparkSQL
6. Gain a thorough understanding of Spark streaming features
7. Master and describe the features of Spark ML programming and GraphX programming
Who should take this Scala course?
1. Professionals aspiring for a career in the field of real-time big data analytics
2. Analytics professionals
3. Research professionals
4. IT developers and testers
5. Data scientists
6. BI and reporting professionals
7. Students who wish to gain a thorough understanding of Apache Spark
Learn more at https://www.simplilearn.com/big-data-and-analytics/apache-spark-scala-certification-training
This document provides an overview of Spark Streaming and its ecosystem. It discusses distributed processing with Spark, which can be 10-100x faster than Hadoop. Spark Streaming allows real-time stream processing using resilient distributed datasets (RDDs) and a directed acyclic graph (DAG) execution engine. It also discusses using Kafka and ZooKeeper for streaming data ingestion, and Spark Streaming's at-least-once and exactly-once processing semantics. The document covers combining Spark Streaming with batch processing, machine learning, and SQL for real-time analytics applications.
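A minimal sketch of the micro-batch model described above, using the DStream API with a local socket source for simplicity; a production pipeline like the one discussed would ingest from Kafka through a Kafka connector instead.

```python
# Feed this with: nc -lk 9999 ; a real deployment would use a Kafka source.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-wordcount")
ssc = StreamingContext(sc, 5)               # 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                             # print each batch's counts

ssc.start()
ssc.awaitTermination()
```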
In the past, emerging technologies took years to mature. In the case of big data, effective tools are still emerging while analytics requirements change rapidly, leaving businesses to either keep up or be left behind.
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsight (Alex Zeltov)
Introduction to Big Data Analytics using Apache Spark on HDInsight on Azure (SaaS) and/or HDP on Azure (PaaS).
This workshop provides an introduction to Big Data Analytics using Apache Spark on HDInsight on Azure (SaaS) and/or an HDP deployment on Azure (PaaS). A short lecture introduces Spark and its components.
Spark is a unified framework for big data analytics. It provides one integrated API for developers, data scientists, and analysts to perform diverse tasks that would previously have required separate processing engines, such as batch analytics, stream processing, and statistical modeling. Spark supports a wide range of popular languages, including Python, R, Scala, SQL, and Java, and it can read from diverse data sources and scale to thousands of nodes.
The lecture is followed by a demo. There is also a short lecture on Hadoop and how Spark and Hadoop interact and complement each other. You will learn how to move data into HDFS using Spark APIs, create Hive tables, explore the data with Spark and SQL, transform the data, and then issue some SQL queries. We will be using Scala and/or PySpark for the labs.
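A hedged PySpark sketch of that lab flow, with placeholder paths and table names; it assumes a Spark build with Hive support enabled.

```python
# Placeholder paths/table names; requires Hive support in the Spark build.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-lab-demo")
         .enableHiveSupport()
         .getOrCreate())

# Load raw data (lands in HDFS when run on a Hadoop cluster).
df = spark.read.csv("hdfs:///data/raw/trips.csv", header=True, inferSchema=True)

# Persist as a Hive table, then query it with SQL.
df.write.mode("overwrite").saveAsTable("trips")
spark.sql("SELECT COUNT(*) AS n FROM trips").show()
```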