2. History
● Developed in 2009 at the UC Berkeley AMPLab.
● Open sourced in 2010.
● Spark has become one of the largest big-data
projects, with more than 400 contributors from 50+ organizations such
as:
– Databricks, Yahoo!, Intel, Cloudera, IBM, …
3. What is Spark?
• Fast and general cluster computing system
interoperable with Hadoop datasets.
4. Where Does Big Data Come From?
It’s all happening online – we could record every:
» Click
» Ad impression
» Billing event
» Fast-forward, pause, …
» Server request
» Transaction
» Network message
» Fault
From sites and services such as:
» Facebook
» Instagram
» TripAdvisor
» Twitter
» YouTube
» …
5. Graph Data
Lots of interesting data has a graph structure:
• Social networks
• Telecommunication Networks
• Computer Networks
• Road networks
• Collaborations/Relationships
• …
Some of these graphs can get quite large
(e.g., Facebook user graph)
(Figure: Log Files – Apache Web Server Log)
6. Why Apache Spark?
• General-purpose cluster computing system
• Originally developed at UC Berkeley, now one of the
largest Apache projects
• Typically faster than Hadoop due to main-memory
processing
• High-level APIs in Java, Scala, Python, and R
Functionality for:
• Map/Reduce
• SQL processing
• Real-time stream processing
• Machine learning
• Graph processing
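As a rough illustration of the Map/Reduce model in the list above, here is a word count in plain Python. This is an illustrative sketch, not Spark's API; in Spark the same shape would appear as flatMap/map/reduceByKey over an RDD.

```python
from collections import defaultdict
from functools import reduce

# Toy input standing in for lines of a distributed file.
lines = ["spark is fast", "spark is general"]

# Map phase: emit a (word, 1) pair for every word.
pairs = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group the pairs by key (word).
groups = defaultdict(list)
for word, count in pairs:
    groups[word].append(count)

# Reduce phase: sum the counts for each word.
counts = {word: reduce(lambda a, b: a + b, vals) for word, vals in groups.items()}
print(counts)  # {'spark': 2, 'is': 2, 'fast': 1, 'general': 1}
```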
7. Apache Spark Ecosystem
• Apache Spark
  • RDDs
• Spark SQL
  • Once known as Shark before being completely
    integrated into Spark
  • For SQL, structured, and semi-structured data
    processing
• Spark Streaming
  • Processing of live data streams
• MLlib/ML
  • Machine learning algorithms
• GraphX
  • Graph processing
18. MapReduce Bottlenecks and Improvements
• Bottlenecks
  • MapReduce is a very I/O-heavy operation
  • The map phase reads from disk, then writes back out
  • The reduce phase reads from disk, then writes back out
• How can we improve it?
  • RAM is becoming very cheap and abundant
  • Use RAM for in-memory data sharing
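The contrast above can be sketched in plain Python: the Hadoop-style loop round-trips through a file on every iteration, while the Spark-style loop loads the data once and reuses the in-memory dataset. A toy illustration only; the file name is arbitrary and `cached` merely stands in for `rdd.cache()`.

```python
import json
import os
import tempfile

data = list(range(1000))

# Hadoop-style: each "job" re-reads its input from disk (a temp file here).
path = os.path.join(tempfile.gettempdir(), "stage.json")
with open(path, "w") as f:
    json.dump(data, f)

total_disk = 0
for _ in range(3):                      # three iterations, three disk reads
    with open(path) as f:
        stage = json.load(f)
    total_disk += sum(stage)
os.remove(path)

# Spark-style: load once, keep the dataset in RAM across iterations.
cached = data                           # stands in for rdd.cache()
total_ram = sum(sum(cached) for _ in range(3))

assert total_disk == total_ram
```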
19. MapReduce vs. Spark (Performance)
                MapReduce Record   Spark Record       Spark Record 1PB
Data size       102.5 TB           100 TB             1000 TB
# Nodes         2100               206                190
# Cores         50400 physical     6592 virtualized   6080 virtualized
Elapsed time    72 mins            23 mins            234 mins
Sort rate       1.42 TB/min        4.27 TB/min        4.27 TB/min
Sort rate/node  0.67 GB/min        20.7 GB/min        22.5 GB/min
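The two sort-rate rows follow from the data size, elapsed time, and node count; a quick arithmetic check (small discrepancies against the published figures come from rounding of the elapsed times):

```python
# Derive sort rate (TB/min) and per-node rate (GB/min) from the table columns.
records = {
    "MapReduce Record": {"tb": 102.5, "mins": 72, "nodes": 2100},
    "Spark Record": {"tb": 100.0, "mins": 23, "nodes": 206},
    "Spark Record 1PB": {"tb": 1000.0, "mins": 234, "nodes": 190},
}
for name, r in records.items():
    rate_tb_min = r["tb"] / r["mins"]                # cluster-wide TB/min
    rate_gb_node = rate_tb_min * 1000 / r["nodes"]   # per-node GB/min
    print(f"{name}: {rate_tb_min:.2f} TB/min, {rate_gb_node:.2f} GB/min/node")
```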
22. RDDs
• Primary abstraction used by Apache Spark
• Resilient Distributed Dataset
  • Fault-tolerant
  • Collection of elements that can be operated on in parallel
  • Distributed collection of data from any source
• Contained in an RDD:
  • Set of dependencies on parent RDDs
    • Lineage (Directed Acyclic Graph – DAG)
  • Set of partitions
    • Atomic pieces of the dataset
  • A function for computing the RDD based on its parents
  • Metadata about its partitioning scheme and data
    placement
• RDDs are immutable
  • Allows for more effective fault tolerance
• Intended to support abstract datasets while also maintaining
MapReduce properties like automatic fault tolerance,
locality-aware scheduling, and scalability.
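The lineage idea above can be sketched with a tiny, hypothetical MiniRDD class (not Spark's real API): each transformation returns a new, immutable dataset that records its parent and the function used to derive it, so lost data can be recomputed from lineage rather than replicated.

```python
# Hypothetical sketch of RDD lineage and immutability (not Spark's real API).
class MiniRDD:
    def __init__(self, partitions, parent=None, fn=None):
        self._partitions = partitions   # list of lists (atomic pieces)
        self.parent = parent            # dependency on the parent RDD
        self.fn = fn                    # how to rebuild this RDD from parent

    def map(self, f):
        # Transformations return a *new* RDD; the original is never mutated.
        fn = lambda parts: [[f(x) for x in p] for p in parts]
        return MiniRDD(fn(self._partitions), parent=self, fn=fn)

    def recompute(self):
        # Fault recovery: rebuild this RDD's data by replaying its lineage.
        if self.parent is None:
            return self._partitions     # base data (e.g., re-read from source)
        return self.fn(self.parent.recompute())

    def collect(self):
        return [x for p in self._partitions for x in p]

base = MiniRDD([[1, 2], [3, 4]])        # two partitions
doubled = base.map(lambda x: x * 2)
doubled._partitions = None              # simulate losing the cached data
assert doubled.recompute() == [[2, 4], [6, 8]]
```

Recomputation from lineage is why immutability matters: because `base` can never change, replaying `fn` always reproduces exactly the lost partitions.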
24. Spark Streaming
• Spark Streaming is an extension of the core Spark API that
enables scalable, high-throughput, fault-tolerant
processing of live data streams.
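Spark Streaming works by cutting the live stream into small batches and processing each batch like a small Spark job. A minimal sketch of that micro-batch idea in plain Python, with a hypothetical `micro_batches` helper (the real API builds DStreams from sources like sockets or Kafka):

```python
from itertools import islice

# Hypothetical micro-batching helper: slice a stream into fixed-size batches.
def micro_batches(stream, batch_size):
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:                   # stream exhausted
            return
        yield batch

live_events = range(10)                 # stands in for a live data stream
# Each batch is processed independently, like a tiny RDD job.
results = [sum(b) for b in micro_batches(live_events, batch_size=4)]
print(results)  # [6, 22, 17]
```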