Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab

Introduction
• A really fast MapReduce-style engine
• 100x faster than Hadoop MapReduce in memory,
• 10x faster on disk
• Builds on paradigms similar to MapReduce
• Integrated with Hadoop

Spark Core - a fast and general engine for large-scale data processing.
Spark Architecture

Languages: SparkR, Java, Python, Scala
Libraries: SQL, DataFrames, Streaming, MLlib, GraphX
Engine: Spark Core
Resource/cluster managers: Standalone, Amazon EC2, Hadoop YARN, Apache Mesos
Data sources: HDFS, HBase, Hive, Tachyon, Cassandra
Why Apache Spark?
Or: why is Apache Spark faster than MapReduce?
Why Apache Spark? - Hadoop MapReduce

[Diagram: Input → read from HDFS → Map() → Reduce() → write output to HDFS]

● The user sends the logic
● In the form of Map() and Reduce() functions (see the sketch below)
● Hadoop tries to execute it near the data
● Saves the result to HDFS
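As a rough illustration of the paradigm (plain Python, not Hadoop's actual Mapper/Reducer API), here is word count expressed as a map phase, a shuffle, and a reduce phase:

# A minimal sketch of the MapReduce idea in plain Python (illustrative only)
from collections import defaultdict

lines = ["spark is fast", "hadoop is batch", "spark keeps data in memory"]

# Map phase: emit (word, 1) pairs from every input line
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group the emitted values by key
groups = defaultdict(list)
for word, one in mapped:
    groups[word].append(one)

# Reduce phase: aggregate the values for each key
counts = {word: sum(ones) for word, ones in groups.items()}
print(counts)  # e.g. {'spark': 2, 'is': 2, ...}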
Hadoop MapReduce - Multiple Phases

[Diagram: HDFS → Map → Reduce 1 → write to HDFS → Map → Reduce 2 → write to HDFS]
Shortcomings of MapReduce

1. Batch-oriented design
   a. Every map-reduce cycle reads from and writes to HDFS
   b. High latency
2. Converting logic to the map-reduce paradigm is difficult
3. In-memory computing was not possible
Shortcomings of MapReduce

[Diagram: the same multi-phase pipeline, with intermediate results kept in RAM instead of written to HDFS]

RAM is about 80 times faster than disk (per the gist below, reading 1 MB sequentially takes ~250 μs from memory vs ~20 ms from disk).
See "Latency Numbers Every Programmer Should Know": https://gist.github.com/jboner/2841832
Getting Started - CloudxLab

We have already installed Apache Spark on CloudxLab,
so you don't have to install anything.
Simply log in to the web console
and get started with the commands.
Getting Started - Downloading

1. Find your Hadoop version:
   ○ [student@hadoop1 ~]$ hadoop version
   ○ Hadoop 2.4.0.2.1.4.0-632
2. Go to https://spark.apache.org/downloads.html
3. Select the release matching your Hadoop version and download it
4. On servers, you could use wget (see the sketch below)
5. Every download can be run in standalone mode
6. Extract it: tar -xzvf spark*.tgz
7. Inside the extracted folder, the bin directory contains the Spark commands
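For example, a download-and-extract session might look like this (the archive name is a placeholder; the real one depends on the Spark release and Hadoop version you select on the downloads page):

$ wget https://archive.apache.org/dist/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.4.tgz
$ tar -xzvf spark-1.6.0-bin-hadoop2.4.tgz
$ cd spark-1.6.0-bin-hadoop2.4
$ ./bin/spark-shell    # runs in standalone (local) mode out of the box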
Getting Started - Spark Programs Overview

Program        Description
spark-shell    Runs the Spark Scala interactive command line
pyspark        Runs the Spark Python interactive command line
sparkR         Runs R on Spark (/usr/spark2.6/bin/sparkR)
spark-submit   Submits a jar or Python application for execution on the cluster
spark-sql      Runs the Spark SQL interactive shell
Starting Spark With the Scala Interactive Shell

$ spark-shell

It is basically the Scala REPL (interactive shell) with one extra variable, "sc", the SparkContext.
Explore it with tab completion: type sc. and press Tab.
Starting Spark With the Python Interactive Shell

$ pyspark

It is basically the Python interactive shell with one extra variable, "sc", the SparkContext.
Check dir(sc) or help(sc); a short session is sketched below.
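A minimal first session might look like the following (sc is created for you by the shell; parallelize, map, collect and reduce are core RDD operations):

>>> sc.version                                # prints the Spark version, e.g. u'1.6.0'
>>> rdd = sc.parallelize([1, 2, 3, 4, 5])     # distribute a small list as an RDD
>>> rdd.map(lambda x: x * x).collect()        # square each element, fetch the results
[1, 4, 9, 16, 25]
>>> rdd.reduce(lambda a, b: a + b)            # sum all elements across the cluster
15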
Getting Started - spark-submit

● To run an example:
  ○ spark-submit --class org.apache.spark.examples.SparkPi
    /usr/hdp/current/spark-client/lib/spark-examples-*.jar 10

The example estimates π by throwing random points into a square enclosing a circle of radius 1 and counting the fraction that falls inside the circle (a Monte Carlo estimate of the circle's area).
  ○ See https://en.wikipedia.org/wiki/Approximations_of_%CF%80#Summing_a_circle.27s_area
  ○ Code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/SparkPi.scala
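The same Monte Carlo idea can be sketched in pyspark (assumes a running pyspark shell where sc already exists; the trailing 10 in the spark-submit command plays the role of the partitions count here):

import random

partitions = 10
n = 100000 * partitions  # total number of random points to throw

def inside(_):
    # Throw a point into the 2x2 square centred on the origin
    x = random.random() * 2 - 1
    y = random.random() * 2 - 1
    return 1 if x * x + y * y < 1 else 0

# Fraction of points inside the circle ~ area(circle) / area(square) = pi / 4
count = sc.parallelize(range(1, n + 1), partitions).map(inside).reduce(lambda a, b: a + b)
print("Pi is roughly %s" % (4.0 * count / n))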
Getting Started - CloudxLab

To launch Spark on Hadoop,
set the environment variables pointing to the Hadoop configuration:

export YARN_CONF_DIR=/etc/hadoop/conf/
export HADOOP_CONF_DIR=/etc/hadoop/conf/
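With these set, you can point Spark at the cluster's resource manager when launching, for example:

$ spark-shell --master yarn    # on older Spark 1.x releases: --master yarn-client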
Getting Started - CloudxLab
We have installed other versions too:
1. /usr/spark2.0.1/bin/spark-shell
2. /usr/spark1.6/bin/spark-shell
3. /usr/spark1.2.1/bin/spark-shell
Getting Started - CloudxLab

How to run Jupyter with pyspark (see the sketch below)
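One common approach (an assumption here, not necessarily the original demo) is to make pyspark use Jupyter as its driver Python, via two environment variables that pyspark honors:

$ export PYSPARK_DRIVER_PYTHON=jupyter
$ export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
$ pyspark    # opens a Jupyter notebook; sc is predefined inside it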
Thank you!
reachus@cloudxlab.com
