Apache Spark - Scala - Presentation
20th July 2017
Chandrasekar Umathurappan (Uma)
InfraSpace Technology Corporation
© 2017 InfraSpace Technology Corporation
Agenda
Introduce Spark
Spark-Shell
Demo
Spark Internals
Demo
Spark in an Enterprise
Demo
Apache Spark Ecosystem
Libraries: Spark SQL, Streaming, MLlib, GraphX
Spark Core API
Languages: SQL, R, Python, Scala, Java
Apache Spark™ is a powerful open source processing engine built around speed, ease of use, and
sophisticated analytics. It was originally developed at UC Berkeley in 2009.
Spark Philosophy
Unified engine for complete data applications
High-level user-friendly APIs
Speed
Credits:
https://spark-summit.org/2014/wp-content/uploads/2014/07/A-Deeper-Understanding-of-Spark-Internals-Aaron-Davidson.pdf
https://www.slideshare.net/michiard/introduction-to-spark-internals
Spark Execution Model
1. Create DAG of RDDs to represent computation
2. Create logical execution plan for DAG
3. Schedule and execute individual tasks
4. Cache (persist) RDDs for reuse
Glossary
DAG - Directed Acyclic Graph

RDD - Resilient Distributed Dataset

What are RDDs?
A distributed, partitioned, locality-aware, immutable collection.

Points to data in HDFS or in other RDDs

Lazily Evaluated
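The execution model and lazy evaluation listed above can be sketched in a few lines for spark-shell. This is a minimal illustration, assuming a local Spark 2.x installation; the data and the names numbers, evens and squares are made up for the example and are not from the talk.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: local mode. Inside spark-shell, getOrCreate() reuses the existing session.
val spark = SparkSession.builder()
  .appName("lazy-evaluation-sketch")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// Transformations only describe the computation (the DAG of RDDs); nothing runs yet.
val numbers = sc.parallelize(1 to 10, numSlices = 4) // an RDD split into 4 partitions
val evens   = numbers.filter(_ % 2 == 0)
val squares = evens.map(n => n * n)

// toDebugString prints the lineage that Spark has recorded so far.
println(squares.toDebugString)

// An action makes Spark plan the stages, schedule the tasks and run them.
val total = squares.reduce(_ + _) // 4 + 16 + 36 + 64 + 100 = 220
println(total)
```

Until reduce is called, no task is scheduled; the three vals before it only extend the lineage graph.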
Apache Spark Interactive - Demo
Capabilities Overview
Install Spark
Spark Shell
Run a Program
Set Up Overview
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

brew install scala

brew install apache-spark

spark-shell

Typical Commands
sc - the SparkContext

spark.version - the Spark version
:load <file> - load and run a Scala file
:he - short for :help
Apache Spark Components
[Diagram: the Spark Driver holds the SparkContext (SC) and runs Job A and Job B as tasks (Task A, Task B) on three Executors, each with its own cache.]
Cluster Manager Types:
Concepts:
RDD - Resilient Distributed Dataset
Transformations, Actions, Persistence
DAG - Directed Acyclic Graph
Shuffle
References: http://datastrophic.io/core-concepts-architecture-and-internals-of-apache-spark/
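The concepts on this slide (transformations, actions, persistence and shuffle) can be shown together in one small word count. This is a sketch under assumptions: local mode, and a hypothetical in-memory input standing in for a file in HDFS.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .appName("concepts-sketch")
  .master("local[*]") // assumption: local mode, for illustration only
  .getOrCreate()
val sc = spark.sparkContext

// Hypothetical input; in a real job this would typically come from sc.textFile(...).
val lines = sc.parallelize(Seq("spark makes rdds", "rdds are lazy", "spark is fast"))

// Transformations: flatMap and map work partition by partition, with no data movement.
val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

// reduceByKey brings equal keys together, which requires a shuffle across partitions.
val counts = pairs.reduceByKey(_ + _)

// Persistence: keep the result in memory so both actions below can reuse it.
counts.persist(StorageLevel.MEMORY_ONLY)

// Actions: each one triggers a job; the second can read the cached partitions.
val distinctWords = counts.count()           // 7 distinct words
val sparkCount = counts.lookup("spark").head // "spark" appears twice
println(s"$distinctWords distinct words, 'spark' appears $sparkCount times")
```

Without the persist call, the second action would recompute the whole lineage, including the shuffle.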
Apache Spark Components
https://0x0fff.com/spark-memory-management/
https://www.slideshare.net/databricks/deep-dive-memory-management-in-apache-spark
https://spark.apache.org/docs/latest/tuning.html
Execution: shuffles, joins, sorts and aggregations
Storage: cache
Memory Tuning Considerations
the amount of memory used by your objects

the cost of accessing those objects

the overhead of garbage collection
Preferable >32GB
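The split between execution and storage memory is controlled through Spark configuration properties. The sketch below sets a few of them on a SparkConf; the property names are standard Spark settings, but the values are illustrative assumptions and not recommendations from the talk.

```scala
import org.apache.spark.SparkConf

// All values below are assumptions chosen for illustration.
val conf = new SparkConf()
  .setAppName("memory-tuning-sketch")
  // Heap size of each executor JVM.
  .set("spark.executor.memory", "4g")
  // Share of the heap (after a fixed reserve) used for execution plus storage.
  .set("spark.memory.fraction", "0.6")
  // Portion of that region where cached blocks are protected from eviction.
  .set("spark.memory.storageFraction", "0.5")
  // Kryo usually produces smaller serialized objects than Java serialization.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

// Print the settings that would be passed to a SparkSession or SparkContext.
println(conf.toDebugString)
```

The same properties can also be passed on the command line with --conf instead of being set in code.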
Navigate Data with Spark
http://localhost:9995
Create RDD
Apache Spark Standalone - Demo
Capabilities Overview
Hadoop
spark-submit
Set Up Overview
Install Hortonworks Sandbox
Shell
Spark-submit
Typical Commands
spark-submit
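A program launched with spark-submit is an ordinary Scala object with a main method. The sketch below is a hypothetical example, not the demo from the talk: the object name LineCount, the jar name and the input path are all assumptions.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical application. Package it into a jar (for example with sbt) and
// launch it with something like:
//   spark-submit --class LineCount --master <master-url> line-count.jar <input-path>
object LineCount {
  // Counts the lines of a text file; kept separate from main so it can be reused.
  def countLines(spark: SparkSession, path: String): Long =
    spark.sparkContext.textFile(path).count()

  def main(args: Array[String]): Unit = {
    val path = args.headOption.getOrElse(sys.error("usage: LineCount <input-path>"))
    // The master is supplied by spark-submit, so it is not set in code.
    val spark = SparkSession.builder().appName("LineCount").getOrCreate()
    println(s"$path has ${countLines(spark, path)} lines")
    spark.stop()
  }
}
```

Leaving the master out of the code lets the same jar run locally, on a standalone cluster or on YARN by changing only the --master argument.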

Apache Spark in Enterprise
References: http://blog.cask.co/2016/06/cdap-spark-prototype-to-production/
Sources:
Oracle
SQL Server
HDFS

…

Sink:
Cassandra
…
Analytics:
Tableau
…
WHY CDAP?
The first unified integration platform for big data that cuts down the time to production for data applications and data lakes by 80%.
BENEFITS
Self Service
Build Once and Run Anywhere
Enterprise Ready
Plan
CDAP 4.2.0
http://cask.co/products/cdap/#capabilities
Apache Spark on Hadoop YARN - Demo
Capabilities Overview
Hadoop
Cask
Set Up Overview
Install Hortonworks Sandbox
Shell
Spark-submit
Typical Commands
spark-submit

Apache Spark - Reference
Thank You!
Rob Mueller (InfraSpace)
Mark Soule (The Nerdery)
Byamba Tumurkhuu (Rally Health)
References
https://www.supergloo.com (Todd M)
http://blog.cask.co/2016/06/cdap-spark-prototype-to-production/
http://datastrophic.io/core-concepts-architecture-and-internals-of-apache-spark/
https://spark-summit.org/2014/wp-content/uploads/2014/07/A-Deeper-Understanding-of-Spark-Internals-Aaron-Davidson.pdf
https://www.slideshare.net/michiard/introduction-to-spark-internals
https://0x0fff.com/spark-memory-management/
https://www.slideshare.net/databricks/deep-dive-memory-management-in-apache-spark
https://spark.apache.org/docs/latest/tuning.html
http://cask.co/products/cdap/#capabilities
Next Steps: How can we help you solve challenges?
Organizations: We will work with your teams to solve challenges in operationalizing Big Data Solutions
Individuals: You could run your workloads on our Data Center with public datasets
Our Lab Setup
[Diagram: VPN user accounts connecting through a VPN tunnel]
