Introduction to Spark on Hadoop
Carol McDonald
Introduction to Apache Spark on Hadoop
1. An Overview of Apache Spark
© 2014 MapR Technologies
2. Agenda
• MapReduce Refresher
• What is Spark?
• The Difference with Spark
• Examples and Resources
3. MapReduce Refresher
4. MapReduce: A Programming Model
• MapReduce: Simplified Data Processing on Large Clusters (published 2004)
• A parallel and distributed algorithm providing:
– Data locality
– Fault tolerance
– Linear scalability
5. The Hadoop Strategy
http://developer.yahoo.com/hadoop/tutorial/module4.html
• Distribute data (share nothing)
• Distribute computation (parallelization without synchronization)
• Tolerate failures (no single point of failure)
[Diagram: mapping processes on Nodes 1–3 feed reducing processes on Nodes 1–3]
6. Distribute Data: HDFS
• HDFS splits large data files into chunks (64 MB)
• Chunks are replicated across the cluster
• The NameNode holds the location metadata; user processes go to it for metadata access
• DataNodes store and retrieve the physical data over the network
7. Distribute Computation
[Diagram: a MapReduce program reads from data sources, runs on the Hadoop cluster, and produces a result]
8. MapReduce Basics
• Foundational model is based on a distributed file system
– Scalability and fault tolerance
• Map
– Loads the data and defines a set of keys
• Reduce
– Collects the organized key-based data to process and output
– Many use cases do not utilize a reduce task
• Performance can be tuned based on known details of your source files and cluster shape (size, total number of nodes)
9. MapReduce Execution and Data Flow
[Diagram: on each node, files are loaded from the HDFS store via an InputFormat; RecordReaders turn splits into input (k, v) pairs for the map tasks; a Partitioner produces intermediate (k, v) pairs, which are exchanged by all nodes in the "shuffle"; each node sorts and reduces its pairs, and an OutputFormat writes the final (k, v) pairs back to the local HDFS store]
10. MapReduce Example: Word Count
Input: "The time has come," the Walrus said, "To talk of many things: Of shoes—and ships—and sealing-wax
Map output: (the, 1), (time, 1), (has, 1), (come, 1), … (and, 1), … (and, 1), …
Shuffle and sort: (and, [1, 1, 1]), (come, [1, 1, 1]), (has, [1, 1]), (the, [1, 1, 1]), (time, [1, 1, 1, 1]), …
Reduce output: (and, 12), (come, 6), (has, 8), (the, 4), (time, 14), …
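The word-count flow above can be sketched in plain Python — a toy, single-process stand-in for what the framework does across nodes, not actual Hadoop code:

```python
from collections import defaultdict

def map_phase(text):
    # Map: emit a (word, 1) pair for every word in the input split
    return [(word.lower().strip('",.'), 1) for word in text.split()]

def shuffle_phase(pairs):
    # Shuffle and sort: group all values by key, as the framework does automatically
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(groups):
    # Reduce: sum the grouped counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

pairs = map_phase('the time has come the walrus said the time')
counts = reduce_phase(shuffle_phase(pairs))
print(counts['the'])   # 3
print(counts['time'])  # 2
```

In the real system, map and reduce run on different nodes and the intermediate pairs cross the network during the shuffle; the logical flow is the same.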
11. Tolerate Failures
• Failures are expected and managed gracefully
• If a DataNode fails, the NameNode will locate a replica
• If a MapReduce task fails, the JobTracker will schedule another one
12. MapReduce Processing Model
• Define mappers
• Shuffling is automatic
• Define reducers
• For complex work, chain jobs together
– Use a higher-level language or DSL that does this for you
13. MapReduce Design Patterns
• Summarization – inverted index, counting
• Filtering – top ten, distinct
• Aggregation
• Data organization – partitioning
• Join – join data sets
• Metapattern – job chaining
14. Inverted Index Example
Inputs:
alice.txt: "The time has come," the Walrus said
macbeth.txt: tis time to do it
Map output: (time, alice.txt), (has, alice.txt), (come, alice.txt), … (tis, macbeth.txt), (time, macbeth.txt), (do, macbeth.txt), …
Reduce output: (come, alice.txt), (do, macbeth.txt), (has, alice.txt), (time, [alice.txt, macbeth.txt]), …
15. MapReduce Example: Inverted Index
• Input: (filename, text) records
• Output: list of files containing each word
• Map: foreach word in text.split(): output(word, filename)
• Combine: uniquify filenames for each word
• Reduce: def reduce(word, filenames): output(word, sort(filenames))
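The map/combine/reduce steps above can be collapsed into a short Python sketch — again a single-process illustration of the logic, not distributed code:

```python
from collections import defaultdict

def invert(documents):
    # documents: {filename: text}
    index = defaultdict(set)
    # Map + combine: emit (word, filename) pairs, keeping filenames unique per word
    for filename, text in documents.items():
        for word in text.split():
            index[word].add(filename)
    # Reduce: output each word with its sorted list of files
    return {word: sorted(files) for word, files in index.items()}

docs = {'alice.txt': 'the time has come',
        'macbeth.txt': 'tis time to do it'}
index = invert(docs)
print(index['time'])  # ['alice.txt', 'macbeth.txt']
print(index['tis'])   # ['macbeth.txt']
```

Using a set per word plays the role of the combiner: it deduplicates filenames before the final sorted output is produced.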
16. MapReduce: The Good
• Built-in fault tolerance
• Optimized IO path
• Scalable
• Developer focuses on Map/Reduce, not infrastructure
• Simple(?) API
17. MapReduce: The Bad
• Optimized for disk IO
– Doesn't leverage memory well
– Iterative algorithms go through the disk IO path again and again
• Primitive API
– Simple abstraction: key/value in, key/value out
– Basic things like joins require extensive code
• The result is often many files that need to be combined appropriately
18. Free Hadoop MapReduce On-Demand Training
• https://www.mapr.com/services/mapr-academy/big-data-hadoop-online-training
19. What is Hive?
• Data warehouse on top of Hadoop
– Gives the ability to query without programming
– Used for analytical querying of data
• SQL-like execution for Hadoop
• SQL evaluates to MapReduce code
– Submits jobs to your cluster
20. Using HBase as a MapReduce/Hive Source
Example: data warehouse for analytical-processing queries
[Diagram: a Hive SELECT … JOIN query runs a MapReduce application that reads from the HBase database and files (HDFS/MapR-FS) and writes a query result file]
21. Using HBase as a MapReduce or Hive Sink
Example: bulk load data into a table
[Diagram: a Hive INSERT … SELECT runs a MapReduce application that reads files (HDFS/MapR-FS) and writes into the HBase database]
22. Using HBase as a Source & Sink
Example: calculate and store summaries — a pre-computed, materialized view
[Diagram: a Hive SELECT … JOIN runs a MapReduce application that both reads from and writes back to the HBase database]
23. Hive Architecture
• Interfaces: command line interface, web interface, and JDBC/ODBC via the Thrift server
• Driver (compiler, optimizer, executor) backed by the Hive metastore
• Runs on Hadoop (MapReduce + HDFS): JobTracker, NameNode, and DataNodes with TaskTrackers
• The schema metadata — e.g. the Hive table definition for an HBase trades_tall table — is stored in the Hive metastore
24. Hive and HBase
[Diagram: the Hive metastore points to HBase tables, which can be either existing (external) or Hive-managed]
25. Hive HBase – External Table
CREATE EXTERNAL TABLE trades(key string, price bigint, vol bigint)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:price#b,cf1:vol#b")
TBLPROPERTIES ("hbase.table.name" = "/usr/user1/trades_tall");
The Hive table definition for trades points to the external HBase table /usr/user1/trades_tall, e.g.:
key               cf1:price   cf1:vol
AMZN_986186008    12.34       1000
AMZN_986186007    12.00       50
26. Hive HBase – Hive Query
SQL evaluates to MapReduce code:
SELECT AVG(price) FROM trades WHERE key LIKE "GOOG";
[Diagram: queries pass through the parser, planner, and execution engine to reach the HBase tables]
27. Hive HBase – External Table
SQL evaluates to MapReduce code:
SELECT AVG(price) FROM trades WHERE key LIKE "AMZN";
• Selection: WHERE key LIKE …
• Projection: select price
• Aggregation: AVG(price)
Sample rows:
key               cf1:price   cf1:vol
AMZN_986186008    12.34       1000
AMZN_986186007    12.00       50
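The selection/projection/aggregation breakdown above maps directly onto ordinary collection operations. A toy Python sketch with made-up row values (Hive generates equivalent map and reduce code rather than running anything like this in one process):

```python
# Rows as they might come back from an HBase scan (illustrative values only)
rows = [
    {'key': 'AMZN_986186008', 'price': 12.34, 'vol': 1000},
    {'key': 'AMZN_986186007', 'price': 12.00, 'vol': 50},
    {'key': 'GOOG_986186005', 'price': 600.0, 'vol': 75},
]

# Selection: WHERE key LIKE "AMZN%"
selected = [row for row in rows if row['key'].startswith('AMZN')]

# Projection: keep only the price column
prices = [row['price'] for row in selected]

# Aggregation: AVG(price)
avg_price = sum(prices) / len(prices)
print(avg_price)  # 12.17
```

In the generated job, selection and projection happen in the map tasks (close to the data), while the average is computed in the reduce phase.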
28. Hive Query Plan
EXPLAIN SELECT AVG(price) FROM trades WHERE key LIKE "GOOG%";
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
        TableScan
        Filter Operator
          predicate: (key like 'GOOG%') (type: boolean)
        Select Operator
        Group By Operator
      Reduce Operator Tree:
        Group By Operator
        Select Operator
        File Output Operator
29. Hive Query Plan – (2)
hive> SELECT AVG(price) FROM trades WHERE key LIKE "GOOG%";
[Diagram: map() tasks scan the trades table, filter on key like 'GOOG%', and select price; reduce() tasks group and compute the aggregation avg(price), producing the output column col0]
30. Hive Map Reduce
[Diagram: Map() tasks scan (key, row) pairs from each HBase region; after the shuffle, reduce() tasks combine the partial results, and the Hive SELECT … JOIN query writes a result file]
31. Some Hive Design Patterns
• Summarization
– SELECT MIN(delay), MAX(delay), COUNT(*) FROM flights GROUP BY carrier;
• Filtering
– SELECT * FROM trades WHERE key LIKE "GOOG%";
– SELECT price FROM trades ORDER BY price DESC LIMIT 10;
• Join
– SELECT tableA.field1, tableB.field2 FROM tableA JOIN tableB ON tableA.field1 = tableB.field2;
32. What is a Directed Acyclic Graph (DAG)?
• Graph – vertices (points) and edges (lines)
• Directed – edges point in only a single direction
• Acyclic – no looping
• This supports fault tolerance
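The "no looping" property is easy to check mechanically. A minimal sketch, representing a graph as a dict from each vertex to its outgoing edges and using depth-first search (a back edge to a vertex still on the current path means a cycle):

```python
def is_acyclic(graph):
    # graph: {vertex: [neighbors]}; every vertex appears as a key
    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited / on current path / done
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GRAY
        for neighbor in graph[node]:
            if color[neighbor] == GRAY:      # back edge -> cycle
                return False
            if color[neighbor] == WHITE and not visit(neighbor):
                return False
        color[node] = BLACK
        return True

    return all(visit(node) for node in graph if color[node] == WHITE)

dag = {'A': ['B'], 'B': []}          # A -> B: directed, no loop
cyclic = {'A': ['B'], 'B': ['A']}    # A -> B -> A: a loop
print(is_acyclic(dag))     # True
print(is_acyclic(cyclic))  # False
```

This is why a DAG supports fault tolerance: since no step can feed back into an earlier one, any lost intermediate result can be recomputed by replaying the finite chain of steps that produced it.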
33. Hive Query Plan Map Reduce Execution
[Diagram: an unoptimized plan spanning three MapReduce jobs — table scans of t1, reduce sinks RS1–RS4, aggregations AGG1 and AGG2, JOIN1, and file sink FS1 — is optimized into a single job]
34. Iteration: The Bane of MapReduce
Slow!
35. Typical MapReduce Workflows
[Diagram: the output of Job 1 — a SequenceFile on HDFS — becomes the input to Job 2, and so on through the last job; each job runs its own maps and reduces and materializes its output to HDFS]
36. Iterations with In-Memory Caching
• Data partitions are read from RAM instead of disk on each iterative step
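The payoff of in-memory caching can be seen in a toy simulation: reload the data on every iteration (the MapReduce pattern of slide 35) versus load once and iterate over the cached partitions (the Spark pattern). The sleep stands in for disk/HDFS latency; the numbers are illustrative, not a benchmark:

```python
import time

def load_partitions_from_disk():
    # Stand-in for a disk/HDFS read; a chained MapReduce job pays this on every iteration
    time.sleep(0.01)
    return [list(range(i * 100, (i + 1) * 100)) for i in range(4)]

# MapReduce-style iteration: reload the data from "disk" on each step
start = time.perf_counter()
for step in range(10):
    data = load_partitions_from_disk()
    total = sum(sum(p) for p in data)
disk_time = time.perf_counter() - start

# Cached iteration: read once, keep the partitions in RAM across steps
start = time.perf_counter()
cached = load_partitions_from_disk()   # cost paid once
for step in range(10):
    total = sum(sum(p) for p in cached)
cached_time = time.perf_counter() - start

print(disk_time > cached_time)  # True
```

This is exactly the gap Spark's cached RDDs close for iterative algorithms: the IO cost is paid once rather than once per step.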
37. Free HBase On-Demand Training (includes Hive and MapReduce with HBase)
• https://www.mapr.com/services/mapr-academy/big-data-hadoop-online-training
38. Lab – Query HBase Airline Data with Hive
Import mapping to row key and columns:
• Row key: Carrier-FlightNumber-Date-Origin-Destination (e.g. AA-1-2014-01-01-JFK-LAX)
• Column families: delay (aircraft delay, arr delay, carrier delay), info (cncl, cncl code, tailnum), stats (distance, elaptime), timing (arrtime, dep time)
• Sample row: AA-1-2014-01-01-JFK-LAX → 13, 0, N7704, 2475, 385.00, 359, …
Count number of cancellations by reason (code)
$ hive
hive> explain select count(*) as cancellations, cnclcode from flighttable where cncl=1 group by cnclcode order by cancellations asc limit 100;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1
  Stage-0 is a root stage
STAGE PLANS:
  Stage: Stage-1 (Map Reduce)
    Map Operator Tree: TableScan → Filter Operator → Select Operator → Group By Operator (aggregations: count()) → Reduce Output Operator
    Reduce Operator Tree: Group By Operator (aggregations: count(VALUE._col0)) → Select Operator → File Output Operator
  Stage: Stage-2 (Map Reduce)
    Map Operator Tree: TableScan → Reduce Output Operator
    Reduce Operator Tree: Extract → Limit → File Output Operator
      Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
  Stage: Stage-0
    Fetch Operator (limit: 100)
2 MapReduce jobs
$ hive
hive> select count(*) as cancellations, cnclcode from flighttable where cncl=1 group by cnclcode order by cancellations asc limit 100;
Total jobs = 2
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 13.3 sec MAPRFS Read: 0 MAPRFS Write: 0 SUCCESS
Job 1: Map: 1 Reduce: 1 Cumulative CPU: 1.52 sec MAPRFS Read: 0 MAPRFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 14 seconds 820 msec
OK
4598 C
7146 A

Find the longest airline delays
$ hive
hive> select arrdelay, key from flighttable where arrdelay > 1000 order by arrdelay desc limit 10;
MapReduce Jobs Launched: Map: 1 Reduce: 1
OK
1530.0 AA-385-2014-01-18-BNA-DFW
1504.0 AA-1202-2014-01-15-ONT-DFW
1473.0 AA-1265-2014-01-05-CMH-LAX
1448.0 AA-1243-2014-01-21-IAD-DFW
1390.0 AA-1198-2014-01-11-PSP-DFW
1335.0 AA-1680-2014-01-21-SLC-DFW
1296.0 AA-1277-2014-01-21-BWI-DFW
1294.0 MQ-2894-2014-01-02-CVG-DFW
1201.0 MQ-3756-2014-01-01-CLT-MIA
1184.0 DL-2478-2014-01-10-BOS-ATL
Apache Spark

Apache Spark
spark.apache.org · github.com/apache/spark · user@spark.apache.org
• Originally developed in 2009 in UC Berkeley's AMP Lab
• Fully open sourced in 2010 – now a Top Level Project at the Apache Software Foundation
Spark: Fast Big Data
• Easy to develop
  – Rich APIs in Java, Scala, Python
  – Interactive shell
  – 2-5× less code
• Fast to run
  – General execution graphs
  – In-memory storage
The Spark Community

Spark is the Most Active Open Source Project in Big Data
(chart: project contributors in the past year – Spark well ahead of Giraph, Storm, and Tez)
Unified Platform
Spark SQL · Spark Streaming · MLlib (machine learning) · GraphX (graph computation)
– all on Spark (general execution engine)
– running over Hadoop YARN / Mesos
– over a distributed file system (HDFS, MapR-FS, S3, …)
Spark Use Cases
• Iterative algorithms on large amounts of data
• Anomaly detection
• Classification
• Predictions
• Recommendations
Why Iterative Algorithms?
Algorithms that need iterations:
– Clustering (K-Means, Canopy, …)
– Gradient descent (e.g., Logistic Regression, Matrix Factorization)
– Graph algorithms (e.g., PageRank, LineRank, components, paths, reachability, centrality)
– Alternating Least Squares (ALS)
– Graph communities / dense sub-components
– Inference (belief propagation)
– …
Example: Logistic Regression
• Goal: find the best line separating two sets of points, moving from a random initial line toward the target
Logistic Regression – iteration!

data = spark.textFile(...).map(readPoint).cache()
w = numpy.random.rand(D)
for i in range(iterations):
    gradient = data \
        .map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x) \
        .reduce(lambda x, y: x + y)
    w -= gradient
print("Final w: %s" % w)
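A self-contained version of the same gradient update, using NumPy on a toy dataset instead of an RDD (the data, iteration count, and the added learning rate are made up for illustration):

```python
import numpy as np

# Toy, linearly separable data: labels y in {-1, +1}
X = np.array([[2.0, 2.0], [1.0, 2.0], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = np.zeros(2)
lr = 0.1                                   # learning rate (not in the slide's sketch)
for i in range(200):
    margins = y * (X @ w)                  # p.y * w.dot(p.x), for every point at once
    coeff = (1.0 / (1.0 + np.exp(-margins)) - 1.0) * y   # per-point scalar
    gradient = coeff @ X                   # sum over points, as reduce() does
    w -= lr * gradient

print(w)
```

Each iteration is a full pass over the data, which is exactly why caching the points in memory (as `.cache()` does above) pays off.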
Data Sources
• Local files – file:///opt/httpd/logs/access_log
• S3
• Hadoop Distributed File System – regular files, sequence files, any other Hadoop InputFormat
• HBase
• Other NoSQL data stores

How Spark Works
Spark Programming Model
sc = new SparkContext
rDD = sc.textFile("hdfs://…")
rDD.map
The driver program's SparkContext distributes tasks to worker nodes in the cluster.
Resilient Distributed Datasets (RDD)
Spark revolves around RDDs:
• Fault-tolerant
• Read-only collection of elements
• Operated on in parallel
• Cached in memory, or on disk
http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
Working With RDDs

textFile = sc.textFile("SomeFile.txt")                          # RDD
linesWithSpark = textFile.filter(lambda line: "Spark" in line)  # transformation: RDD → RDD
linesWithSpark.count()    # action → value: 74
linesWithSpark.first()    # action → value: # Apache Spark

Transformations build new RDDs from existing ones; an action returns a value to the driver.
MapR Tutorial: Getting Started with Spark on MapR Sandbox
• https://www.mapr.com/products/mapr-sandbox-hadoop/tutorials/spark-tutorial
Example Spark Word Count in Java
Input: "The time has come," the Walrus said, "To talk of many things: Of shoes—and ships—and sealing-wax…" → counts such as (the, 20), (time, 4), (and, 12).

JavaRDD<String> input = sc.textFile(inputFile);
// Split each line into words
JavaRDD<String> words = input.flatMap(
    new FlatMapFunction<String, String>() {
        public Iterable<String> call(String x) {
            return Arrays.asList(x.split(" "));
        }});
// Turn the words into (word, 1) pairs
JavaPairRDD<String, Integer> word1s = words.mapToPair(
    new PairFunction<String, String, Integer>() {
        public Tuple2<String, Integer> call(String x) {
            return new Tuple2(x, 1);
        }});
// Reduce: add the pairs by key to produce counts
JavaPairRDD<String, Integer> counts = word1s.reduceByKey(
    new Function2<Integer, Integer, Integer>() {
        public Integer call(Integer x, Integer y) {
            return x + y;
        }});
Example Spark Word Count in Scala

// Load our input data.
val input = sc.textFile(inputFile)
// Split it up into words.
val words = input.flatMap(line => line.split(" "))
// Transform into pairs and count.
val counts = words
  .map(word => (word, 1))
  .reduceByKey{ case (x, y) => x + y }
// Save the word count back out to a text file.
counts.saveAsTextFile(outputFile)

Output: (the, 20), (time, 4), (and, 12), …
Example Spark Word Count in Scala – the RDD graph, step by step

// Load input data: textFile creates a HadoopRDD, then a MapPartitionsRDD of lines (with partitions)
val input = sc.textFile(inputFile)
// Split it up into words: flatMap adds another MapPartitionsRDD
val words = input.flatMap(line => line.split(" "))

FlatMap is a 1-to-many mapping: flatMap(line => line.split(" ")) turns the line "Ships and wax" into the words "Ships", "and", "wax".
// Transform into pairs: map adds a MapPartitionsRDD of (word, 1)
val counts = words.map(word => (word, 1))

Map is a 1-to-1 mapping: map(word => (word, 1)) turns "and" into ("and", 1) – a pair RDD (JavaPairRDD<String, Integer> in the Java API).
// reduceByKey introduces a ShuffledRDD (plus a MapPartitionsRDD for the result)
val counts = words
  .map(word => (word, 1))
  .reduceByKey{ case (x, y) => x + y }

reduceByKey merges the values for each key: ("and", 1) and ("and", 1) become ("and", 2).
// collect is an action: it brings the result back to the driver as an Array
val countArray = counts.collect()

Full graph: textFile → flatMap → map → reduceByKey → collect
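The flatMap → map → reduceByKey pipeline above can be mimicked on a plain Python list (no Spark), which makes the data shape at each step easy to see:

```python
from collections import defaultdict

lines = ["the time has come", "the walrus said", "the time"]

# flatMap: one line -> many words
words = [w for line in lines for w in line.split(" ")]
# map: word -> (word, 1)
pairs = [(w, 1) for w in words]
# reduceByKey: sum the 1s per word
counts = defaultdict(int)
for word, one in pairs:
    counts[word] += one

print(dict(counts))
```

In Spark the same three steps run per partition, with reduceByKey triggering a shuffle so that all pairs for a given word land in the same partition.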
Components Of Execution

MapR Blog: Getting Started with the Spark Web UI
• https://www.mapr.com/blog/getting-started-spark-web-ui
Spark RDD DAG → Physical Execution Plan
RDD graph: sc.textFile(…) → HadoopRDD → MapPartitionsRDD → flatMap → map → reduceByKey (ShuffledRDD) → collect
The physical plan splits the graph into stages at the shuffle boundary: Stage 1 (up to the shuffle) and Stage 2 (after it).
Physical Execution Plan → Stages and Tasks
Each stage is split into tasks, one per partition. The task scheduler submits the task set; tasks run in executor threads on worker nodes, reading HFile blocks from the local HDFS data node through the block cache.
Summary of Components
• Task: unit of execution
• Stage: group of tasks
  – based on partitions of the RDD
  – tasks run in parallel
• DAG: logical graph of RDD operations
• RDD: parallel dataset with partitions
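The "one task per partition, tasks run in parallel" idea can be sketched with a thread pool in plain Python (purely illustrative – Spark's scheduler is far more involved):

```python
from concurrent.futures import ThreadPoolExecutor

partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]   # an RDD's partitions

def task(partition):
    # a stage's task: process one partition (here, sum its elements)
    return sum(partition)

with ThreadPoolExecutor() as pool:
    partial_sums = list(pool.map(task, partitions))  # one task per partition

total = sum(partial_sums)
print(partial_sums, total)  # -> [6, 9, 30] 45
```

The per-partition results play the role of task outputs that the driver combines into the final value.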
How a Spark Application Runs on a Hadoop Cluster
The driver program on the client node (sc = new SparkContext; rDD = sc.textFile("hdfs://…"); rDD.map) registers with the YARN Resource Manager (coordinated via ZooKeeper). On each worker node, a YARN Node Manager hosts an executor whose tasks work on cached partitions, reading HFile blocks from the local HDFS data node's block cache.
Deploying Spark – Cluster Manager Types
• Standalone mode
• Mesos
• YARN
• EC2
• GCE
Example: Log Mining
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns (based on slides from Pat McDonough).

lines = spark.textFile("hdfs://...")                      # base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))    # transformed RDD
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()
messages.filter(lambda s: "mysql" in s).count()           # action

On the first action, the driver sends tasks to the workers; each worker reads its HDFS block (Block 1, 2, 3), processes it, caches the resulting partition (Cache 1, 2, 3), and returns results to the driver.

messages.filter(lambda s: "php" in s).count()

On later actions, the driver again sends tasks, but each worker processes its data straight from the cache – no HDFS reads – and returns results.

Cache your data → faster results. Full-text search of Wikipedia:
• 60 GB on 20 EC2 machines
• 0.5 sec from cache vs. 20 s for on-disk
Transformations and Actions
RDD Transformations and Actions
RDD → (transformations) → RDD → (action) → value
• Transformations (define a new RDD): map, filter, sample, union, groupByKey, reduceByKey, join, cache, …
• Actions (return a value): reduce, collect, count, save, lookupKey, …
Basic Transformations

> nums = sc.parallelize([1, 2, 3])
# Pass each element through a function
> squares = nums.map(lambda x: x*x)             # => {1, 4, 9}
# Keep elements passing a predicate
> even = squares.filter(lambda x: x % 2 == 0)   # => {4}
# Map each element to zero or more others
> nums.flatMap(lambda x: range(x))              # => {0, 0, 1, 0, 1, 2}
# (range(x) is the sequence of numbers 0, 1, …, x-1)
Basic Actions

> nums = sc.parallelize([1, 2, 3])
# Retrieve RDD contents as a local collection
> nums.collect()   # => [1, 2, 3]
# Return first K elements
> nums.take(2)     # => [1, 2]
# Count number of elements
> nums.count()     # => 3
# Merge elements with an associative function
> nums.reduce(lambda x, y: x + y)  # => 6
# Write elements to a text file
> nums.saveAsTextFile("hdfs://file.txt")
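The same transformations and actions can be mimicked on a plain Python list, which is a handy way to reason about their semantics (no Spark involved):

```python
from functools import reduce

nums = [1, 2, 3]
squares = [x * x for x in nums]                 # map
even = [x for x in squares if x % 2 == 0]       # filter
flat = [y for x in nums for y in range(x)]      # flatMap
first_two = nums[:2]                            # take(2)
total = reduce(lambda x, y: x + y, nums)        # reduce

print(squares, even, flat, first_two, len(nums), total)
```

The difference in Spark is laziness: the transformations only describe new RDDs, and nothing executes until an action like count() or reduce() is called.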
RDD Fault Recovery
• RDDs track lineage information
• which can be used to efficiently recompute lost data

msgs = textFile.filter(lambda s: s.startswith("ERROR"))
               .map(lambda s: s.split("\t")[2])

HDFS file → filter (func = startswith(…)) → filtered RDD → map (func = split(…)) → mapped RDD
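A toy model of lineage-based recovery in plain Python (the class and field names are invented for illustration): each derived dataset remembers its parent and the function that produced it, so a lost result can always be recomputed from the source.

```python
class LineageRDD:
    """Toy stand-in for an RDD: remembers its parent and the
    function that derives it, so lost data can be recomputed."""
    def __init__(self, fn=None, parent=None, data=None):
        self.fn, self.parent, self.data = fn, parent, data

    def compute(self):
        if self.data is not None:               # source data (or a cached result)
            return self.data
        return self.fn(self.parent.compute())   # replay the lineage

log = LineageRDD(data=["ERROR\tts1\tmysql down", "INFO\tts2\tok", "ERROR\tts3\tphp err"])
errors = LineageRDD(lambda rows: [r for r in rows if r.startswith("ERROR")], log)
messages = LineageRDD(lambda rows: [r.split("\t")[2] for r in rows], errors)

print(messages.compute())  # -> ['mysql down', 'php err']
```

If a worker holding a computed partition dies, Spark does essentially this: it walks back up the lineage and re-applies the recorded functions to the surviving source data.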
Passing a Function to Spark
• Spark relies on anonymous function syntax:
  (x: Int) => x * x
• which is shorthand for:
  new Function1[Int, Int] { def apply(x: Int) = x * x }
Dataframes

DataFrame: a distributed collection of data organized into named columns

// Create the DataFrame
val df = sqlContext.read.json("person.json")
// Print the schema in a tree format
df.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)
// |-- height: string (nullable = true)
// Select only the "name" column
df.select("name").show()

https://spark.apache.org/docs/latest/sql-programming-guide.html
DataFrame vs. SQL style

# data frame style
lineitems.groupBy("customer").agg(Map(
  "units" -> "avg",
  "totalPrice" -> "std"))

# or SQL style
SELECT AVG(units), STD(totalPrice)
FROM lineitems
GROUP BY customer
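The same per-group aggregation can be sketched in plain Python, assuming STD means the population standard deviation (as in MySQL-style STD); the data and names are illustrative:

```python
from collections import defaultdict
from statistics import mean, pstdev

lineitems = [
    {"customer": "A", "units": 1, "totalPrice": 10.0},
    {"customer": "A", "units": 3, "totalPrice": 20.0},
    {"customer": "B", "units": 2, "totalPrice": 7.0},
]

# GROUP BY customer
groups = defaultdict(list)
for row in lineitems:
    groups[row["customer"]].append(row)

# AVG(units), STD(totalPrice) per group
result = {
    cust: {"avg_units": mean(r["units"] for r in rows),
           "std_totalPrice": pstdev(r["totalPrice"] for r in rows)}
    for cust, rows in groups.items()
}
print(result)
```

A DataFrame runs the same logical plan, but distributed: the grouping is a shuffle and the aggregates are computed per partition and then merged.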
Demo: Interactive Shell
• Iterative development
  – Cache those RDDs
  – Open the shell and ask questions
  – We have all wished we could do this with MapReduce
  – Compile / save your code for scheduled jobs later
• Scala: spark-shell
• Python: pyspark

MapR Blog: Using Apache Spark DataFrames for Processing of Tabular Data
• https://www.mapr.com/blog/using-apache-spark-dataframes-processing-tabular-data
The physical plan for DataFrames

DataFrame Execution Plan

// Print the physical plan to the console
auction.select("auctionid").distinct.explain()
== Physical Plan ==
Distinct false
 Exchange (HashPartitioning [auctionid#0], 200)
  Distinct true
   Project [auctionid#0]
    PhysicalRDD [auctionid#0,bid#1,bidtime#2,bidder#3,bidderrate#4,openbid#5,price#6,item#7,daystolive#8], MapPartitionsRDD[11] at mapPartitions at ExistingRDD.scala:37
There's a lot more!

Unified Platform
Spark SQL · Spark Streaming · MLlib (machine learning) · GraphX (graph computation) on the Spark execution engine, over Hadoop YARN / Mesos and a distributed file system (HDFS, MapR-FS, S3, …)
Soon to Come
• Spark On Demand Training
  – https://www.mapr.com/services/mapr-academy/
• Blogs and tutorials:
  – Movie Recommendations with Collaborative Filtering
  – Spark Streaming
  – Rewrite this Mahout example with Spark
Examples and Resources

Spark on MapR
• Certified Spark distribution
• Fully supported and packaged by MapR in partnership with Databricks
  – mapr-spark package with Spark, Shark, Spark Streaming today
  – Spark-Python, GraphX, and MLlib soon
• YARN integration
  – Spark can allocate resources from the cluster when needed
115.
© 2014 MapR
Technologies 117 References • Spark web site: http://spark.apache.org/ • https://databricks.com/ • Spark on MapR: – http://www.mapr.com/products/apache-spark • Spark SQL and DataFrame Guide • Apache Spark vs. MapReduce – Whiteboard Walkthrough • Learning Spark - O'Reilly Book • Apache Spark
116.
© 2014 MapR
Technologies 118 Q&A @mapr maprtech kbotzum@mapr.com Engage with us! MapR maprtech mapr-technologies