Spark: Beyond MapReduce

Alin Blidisel - Spark: Big Data Beyond MapReduce

Apache Spark: Introduction, Examples, Data Analysis and Statistics.
Blidisel Alin

Alin Blidisel - Big Data: Beyond MapReduce
WHY SPARK?
Hadoop Spark

SPARK - INTRODUCTION
- created by Matei Zaharia at UC Berkeley
- later donated to the Apache Software Foundation; built to speed up Hadoop-style computation
- is not a modified version of Hadoop
- in-memory cluster computing
- has its own cluster computation management
- designed to cover a wide range of workloads, such as batch applications, iterative algorithms, interactive queries and streaming

SPARK COMPONENTS

FEATURES OF APACHE SPARK
- Lightning-fast processing (claimed 10 to 100 times faster than Hadoop MapReduce)
- Ease of use, as it supports multiple languages
- Support for sophisticated analytics
- Real-time stream processing
- Ability to integrate with Hadoop and existing Hadoop data
- Active and expanding community (more than 250 developers have contributed to Spark already)

RESILIENT DISTRIBUTED DATASETS (RDDs)
- fault-tolerant collection of elements that can be operated on in parallel (distributed and immutable)
- two ways to create RDDs:
  - parallelizing an existing collection in your driver program
  - referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat
- persistence (MEMORY_ONLY*, MEMORY_AND_DISK*, DISK_ONLY, OFF_HEAP)
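
The two creation routes and the persistence setting listed above might look like this in PySpark. This is a minimal sketch, not code from the talk: the function name `build_rdds` and its parameters are illustrative assumptions, and an already running `SparkContext` is passed in as `sc`.

```python
def build_rdds(sc, path, storage_level):
    """Create RDDs in the two ways described above and persist one of them.

    sc            -- an existing pyspark.SparkContext
    path          -- hypothetical location of a text file (local or hdfs://)
    storage_level -- e.g. pyspark.StorageLevel.MEMORY_AND_DISK
    """
    # 1. Parallelize an existing collection from the driver program.
    numbers = sc.parallelize([1, 2, 3, 4, 5])

    # 2. Reference a dataset in external storage.
    lines = sc.textFile(path)

    # RDDs are immutable: map() returns a new RDD and leaves `numbers` as is.
    squares = numbers.map(lambda x: x * x)

    # Ask Spark to keep the RDD around once it has been computed.
    squares.persist(storage_level)
    return squares, lines
```

With PySpark installed, a call could pass `StorageLevel.MEMORY_AND_DISK` (imported from `pyspark`) as the third argument; the file path is a placeholder to be replaced.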

SPARK CLUSTER MODE OVERVIEW

SPARK USER INTERFACE

EXAMPLE: DATA ANALYSIS
Sample data from a sales transactions CSV file:
Transaction_date,Product,Price,Payment_Type,Name,City,State,Country,Account_Created,Last_Login,Latitude,Longitude
1/2/09 6:17,Product1,1200,Mastercard,carolina,Basildon,England,United Kingdom,1/2/09 6:00,1/2/09 6:08,51.5,-1.1166667
1/2/09 4:53,Product1,1200,Visa,Betina,Parkville,MO,United States,1/2/09 4:42,1/2/09 7:49,39.195,-94.68194
1/2/09 13:08,Product1,1200,Mastercard,Federica e Andrea,Astoria,OR,United States,1/1/09 16:21,1/3/09 12:32,46.18806,-123.83
1/3/09 14:44,Product1,1200,Visa,Gouya,Echuca,Victoria,Australia,9/25/05 21:13,1/3/09 14:22,-36.1333333,144.75
1/4/09 12:56,Product2,3600,Visa,Gerd W ,Cahaba Heights,AL,United States,11/15/08 15:47,1/4/09 12:45,33.52056,-86.8025
1/4/09 13:19,Product1,1200,Visa,LAURENCE,Mickleton,NJ,United States,9/24/08 15:19,1/4/09 13:04,39.79,-75.23806
1/4/09 20:11,Product1,1200,Mastercard,Fleur,Peoria,IL,United States,1/3/09 9:38,1/4/09 19:45,40.69361,-89.58889
1/2/09 20:09,Product1,1200,Mastercard,adam,Martin,TN,United States,1/2/09 17:43,1/4/09 20:01,36.34333,-88.85028
1/4/09 13:17,Product1,1200,Mastercard,Renee Elisabeth,Tel Aviv,Tel Aviv,Israel,1/4/09 13:03,1/4/09 22:10,32.0666667,34.7666667

LOAD ORIGINAL CSV FROM HDFS
Create Spark Context and define input parameters
Create RDD from CSV file
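
One way these two steps might be written in PySpark, as a minimal sketch rather than the code shown in the talk. The helper names `parse_line` and `load_sales_rdd`, the application name and the HDFS path are all assumptions; only the header column `Transaction_date` comes from the sample file.

```python
import csv

# First column name of the sample file, used to recognise the header row.
HEADER_PREFIX = "Transaction_date,"


def parse_line(line):
    """Split one CSV line into a list of string fields."""
    return next(csv.reader([line]))


def load_sales_rdd(sc, path):
    """Return an RDD of field lists, one per data row, with the header dropped.

    sc   -- an existing pyspark.SparkContext
    path -- hypothetical file location, e.g. an hdfs:// URI
    """
    lines = sc.textFile(path)
    rows = lines.filter(lambda l: l and not l.startswith(HEADER_PREFIX))
    return rows.map(parse_line)


# Example wiring (needs a Spark installation; names and path are placeholders):
#   from pyspark import SparkContext
#   sc = SparkContext(appName="sales-analysis")
#   sales = load_sales_rdd(sc, "hdfs:///path/to/sales.csv")
```

Parsing with the `csv` module rather than `line.split(",")` is a design choice here: it keeps quoted fields that contain commas intact.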

GET RANDOM DATA AND CREATE A DATAFRAME
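A possible shape for this step in PySpark, sketched under assumptions: the function name `sample_to_dataframe`, the 10% fraction and the seed are illustrative, `rows` is an RDD of field lists, and `sql` is an existing `SQLContext` or `SparkSession`. The column names are taken from the header of the sample file.

```python
# Column names from the header row of the sample CSV file.
COLUMNS = [
    "Transaction_date", "Product", "Price", "Payment_Type", "Name", "City",
    "State", "Country", "Account_Created", "Last_Login", "Latitude", "Longitude",
]


def sample_to_dataframe(rows, sql, fraction=0.1, seed=42):
    """Take a random sample of an RDD of field lists and wrap it in a DataFrame.

    rows     -- RDD whose elements are lists of 12 strings
    sql      -- an existing SQLContext or SparkSession
    fraction -- approximate share of rows to keep (assumed value)
    seed     -- fixed seed so the sample is repeatable (assumed value)
    """
    # sample(withReplacement, fraction, seed): False means each row is
    # picked at most once.
    sampled = rows.sample(False, fraction, seed)
    # Passing a list of names gives every column a name; at this point
    # all values are still strings.
    return sql.createDataFrame(sampled, COLUMNS)
```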

DETERMINE FIELD TYPES
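A minimal sketch of how field types could be guessed from sampled string values. It is plain Python and is not the method used in the talk. The function names are hypothetical, the type labels (`int`, `double`, `timestamp`, `string`) are chosen to match Spark SQL type names, and the date format is an assumption read off the sample rows (`1/2/09 6:17`).

```python
from datetime import datetime

# Date layout seen in the sample rows, e.g. "1/2/09 6:17" (assumed).
DATE_FORMAT = "%m/%d/%y %H:%M"


def infer_type(values):
    """Return 'int', 'double', 'timestamp' or 'string' for a column of strings."""
    if not values:
        return "string"

    def all_parse(parser):
        # True only if every value in the column parses without error.
        try:
            for v in values:
                parser(v.strip())
            return True
        except ValueError:
            return False

    # Try the narrowest type first and fall back step by step.
    if all_parse(int):
        return "int"
    if all_parse(float):
        return "double"
    if all_parse(lambda v: datetime.strptime(v, DATE_FORMAT)):
        return "timestamp"
    return "string"


def infer_column_types(rows, columns):
    """Map each column name to its inferred type, given rows of string fields."""
    return {name: infer_type(list(col)) for name, col in zip(columns, zip(*rows))}
```

Applied to the sample rows, `Price` would come out as `int`, `Latitude` and `Longitude` as `double`, and the three date columns as `timestamp`.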

CREATE NEW DATAFRAME BASED ON THE NEWLY DETERMINED FIELD TYPES
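In PySpark, one way to build such a DataFrame is to cast each column to its target type. The sketch below assumes a DataFrame `df` whose columns are all strings and a hypothetical mapping from column name to Spark SQL type name; `apply_types` is an illustrative name, not an API from the talk.

```python
def apply_types(df, column_types):
    """Return a new DataFrame with each listed column cast to its target type.

    df           -- a pyspark.sql.DataFrame with string columns
    column_types -- dict such as {"Price": "int", "Latitude": "double"}
    """
    for name, type_name in column_types.items():
        # withColumn with an existing name replaces that column; DataFrames
        # are immutable, so each call returns a new DataFrame.
        df = df.withColumn(name, df[name].cast(type_name))
    return df
```

A plain cast is enough for the numeric columns. The date columns in this file use a `month/day/year hour:minute` layout, which a plain cast to `timestamp` is not expected to parse, so they would need an explicit date format instead.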

SAVE DATA IN PARQUET FORMAT
This is the new updated schema
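
Printing the schema and writing Parquet could be sketched as follows in PySpark. The function name `save_as_parquet`, the output path and the choice of `overwrite` mode are assumptions made for illustration.

```python
def save_as_parquet(df, path):
    """Print the DataFrame's schema, then write it to `path` as Parquet.

    df   -- a pyspark.sql.DataFrame with the updated column types
    path -- hypothetical output location, e.g. an hdfs:// URI
    """
    # Shows column names and types, so the updated schema can be checked.
    df.printSchema()
    # "overwrite" replaces any existing output at `path` (assumed choice).
    df.write.mode("overwrite").parquet(path)
```

Parquet stores the schema alongside the data, so reading the files back should return the same column types without repeating the type inference.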

GENERATE STATISTICS
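Which statistics the talk computed is not recorded here, so the following is only an illustrative sketch. It assumes a typed DataFrame `df` with the columns of the sample file and uses two standard DataFrame operations: `describe` for summary statistics and `groupBy(...).count()` for frequencies. The function name and the choice of columns are assumptions.

```python
def sales_statistics(df):
    """Return (summary, by_payment) DataFrames for a typed sales DataFrame.

    summary    -- count, mean, stddev, min and max of the numeric columns
    by_payment -- number of transactions per payment type
    """
    summary = df.describe("Price", "Latitude", "Longitude")
    by_payment = df.groupBy("Payment_Type").count()
    return summary, by_payment
```

Both results are themselves DataFrames, so calling `show()` on them would print the tables.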

© 2016 Atigeo, Corporation. All rights reserved. Atigeo and the xPatterns logo are trademarks of Atigeo. The information herein is for informational purposes only and represents the current view of Atigeo as of the date of this presentation. Because Atigeo must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Atigeo, and Atigeo cannot guarantee the accuracy of any information provided after the date of this presentation. ATIGEO MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

Thank you!
