The document is a presentation about Apache Spark given on August 25th, 2015 in Pittsburgh by Sneha Challa. It introduces Spark as a fast and general cluster computing engine for large-scale data processing. It discusses Spark's Resilient Distributed Datasets (RDDs) and transformations/actions. It provides examples of Spark APIs such as map and reduce, and explains how to run Spark on standalone, Mesos, YARN, or EC2 clusters. It also covers Spark libraries like MLlib and running machine learning algorithms such as k-means clustering and logistic regression.
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive, by Sachin Aggarwal
We will give a detailed introduction to Apache Spark and explain why and how Spark can change the analytics world. Apache Spark's memory abstraction is the RDD (Resilient Distributed Dataset). One of the key reasons Apache Spark is so different is the introduction of RDDs: you cannot do anything in Apache Spark without knowing about them. We will give a high-level introduction to RDDs, and in the second half we will take a deep dive into them.
This talk gives details about Spark internals and an explanation of the runtime behavior of a Spark application. It explains how high level user programs are compiled into physical execution plans in Spark. It then reviews common performance bottlenecks encountered by Spark users, along with tips for diagnosing performance problems in a production application.
Apache Spark in Depth: Core Concepts, Architecture & Internals, by Anton Kirillov
The slides cover Spark core concepts such as RDDs, the DAG, the execution workflow, the forming of stages of tasks, and the shuffle implementation, and also describe the architecture and main components of the Spark driver. The workshop part covers Spark execution modes and provides a link to a GitHub repo containing example Spark applications and a dockerized Hadoop environment to experiment with.
Created at the University of California, Berkeley, Apache Spark combines a distributed computing system running on computer clusters with a simple and elegant way of writing programs. Spark is considered the first open source software that makes distributed programming really accessible to data scientists. Here you can find an introduction and basic concepts.
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17, by spark-project
Slides from Tathagata Das's talk at the Spark Meetup entitled "Deep Dive with Spark Streaming" on June 17, 2013 in Sunnyvale California at Plug and Play. Tathagata Das is the lead developer on Spark Streaming and a PhD student in computer science in the UC Berkeley AMPLab.
Video: https://www.youtube.com/watch?v=kkOG_aJ9KjQ
This presentation shows the main Spark characteristics: RDDs, transformations, and actions.
I used this presentation for many Spark intro workshops of the Cluj-Napoca Big Data community: http://www.meetup.com/Big-Data-Data-Science-Meetup-Cluj-Napoca/
Video to talk: https://www.youtube.com/watch?v=gd4Jqtyo7mM
Apache Spark is a next-generation engine for large-scale data processing built with Scala. This talk will first show how Spark takes advantage of Scala's functional idioms to produce an expressive and intuitive API. You will learn about the design of Spark RDDs and how that abstraction enables the Spark execution engine to be extended to support a wide variety of use cases (Spark SQL, Spark Streaming, MLlib and GraphX). The Spark source will be referenced to illustrate how these concepts are implemented with Scala.
http://www.meetup.com/Scala-Bay/events/209740892/
These are the slides for the Productionizing your Streaming Jobs webinar on 5/26/2016.
Apache Spark Streaming is one of the most popular stream processing frameworks, enabling scalable, high-throughput, fault-tolerant processing of live data streams. In this talk, we will focus on the following aspects of Spark Streaming:
- Motivation and most common use cases for Spark Streaming
- Common design patterns that emerge from these use cases and tips to avoid common pitfalls while implementing these design patterns
- Performance Optimization Techniques
Survey of Spark for Data Pre-Processing and Analytics, by Yannick Pouliot
A short presentation I gave on why Apache Spark is such an impressive analytics platform, particularly for R and Python users. I also discuss how academia can benefit from an Amazon AWS implementation.
An introduction to Apache Spark, with an emphasis on the RDD API, Spark SQL (DataFrame and Dataset APIs) and Spark Streaming.
Presented at the Desert Code Camp:
http://oct2016.desertcodecamp.com/sessions/all
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop, by MapR Technologies
http://bit.ly/1BTaXZP – Apache Spark is currently one of the most active projects in the Hadoop ecosystem, and as such, there's been plenty of hype about it in recent months. But how much of the discussion is marketing spin, and what are the facts? MapR and Databricks, the company that created and led the development of the Spark stack, cut through the noise to uncover practical advantages of having the full set of Spark technologies at your disposal and reveal the benefits of running Spark on Hadoop.
This presentation was given at a webinar hosted by Data Science Central and co-presented by MapR + Databricks.
To see the webinar, please go to: http://www.datasciencecentral.com/video/let-spark-fly-advantages-and-use-cases-for-spark-on-hadoop
An invited talk I gave at the Zurich Spark Meetup in July 2016, covering the Apache Spark trends I observed at Spark Summit 2016 as well as some personal insights.
Apache Spark Usage in the Open Source Ecosystem, by Databricks
Apache Spark is an active member of the broad open source community beyond the Apache Foundation. Every day, thousands of users combine the capabilities of Spark with other open source software to get their jobs done. This is not by chance: Spark has been designed to behave well with existing ecosystems; for example, PySpark is designed to work well with Pandas, NumPy and other Python packages. In this talk we will present an analysis of libraries and open source tools that are commonly used along with Spark in the JVM, Python and R ecosystems. Our quantitative results are based on the usage of thousands of Spark users. We will show Spark Summit attendees what the rest of their community finds useful to complement the power of Spark and which parts of the Spark API are used in conjunction with the most popular open source libraries.
Taboola's experience with Apache Spark (presentation @ Reversim 2014), by tsliwowicz
At Taboola we receive a constant feed of data (many billions of user events a day) and use Apache Spark together with Cassandra for both real-time data stream processing and offline data processing. We'd like to share our experience with these cutting-edge technologies.
Apache Spark is an open source project: a Hadoop-compatible computing engine that makes big data analysis drastically faster through in-memory computing, and simpler to write through easy APIs in Java, Scala and Python. The project was born as part of PhD work in UC Berkeley's AMPLab (part of BDAS, pronounced "Bad Ass") and turned into an incubating Apache project with more active contributors than Hadoop. Surprisingly, Yahoo! is one of the biggest contributors to the project and already has large production clusters of Spark on YARN.
Spark can run as a standalone cluster, on Apache Mesos with ZooKeeper, or on YARN, and can run side by side with Hadoop/Hive on the same data.
One of the biggest benefits of Spark is that the API is very simple and the same analytics code can be used for both streaming data and offline data processing.
In July 2016, we conducted our Apache Spark Survey to identify insights on how organizations are using Spark and highlight growth trends since our last Spark Survey 2015. The 2016 survey results reflect answers from 900 distinct organizations and 1615 respondents, who were predominantly Apache Spark users. The results show that the Spark community is...
https://databricks.com/blog/2016/09/27/spark-survey-2016-released.html
Spark 2.0 is a major release of Apache Spark that brought many changes to Spark's APIs and libraries. In this KnolX we will look at some of the improvements made in Spark 2.0, and these slides also introduce some of its new features, such as the SparkSession API and Structured Streaming.
Apache Spark: The Analytics Operating System, by Adarsh Pannu
This presentation was delivered by Adarsh Pannu at IBM's Insight Conference in Nov 2015. For a recording, visit: https://www.youtube.com/watch?v=Tbm7HIlmwJQ
The presentation provides an overview of Apache Spark, a general-purpose big data processing engine built around speed, ease of use and sophisticated analytics. It enumerates the benefits of incorporating Spark in the enterprise, including how it allows developers to write fully-featured distributed applications ranging from traditional data processing pipelines to complex machine learning. The presentation uses the Airline "On Time" data set to explore various components of the Spark stack.
RISELab: Enabling Intelligent Real-Time Decisions, keynote by Ion Stoica (Spark Summit)
A long-standing grand challenge in computing is to enable machines to act autonomously and intelligently: to rapidly and repeatedly take appropriate actions based on information in the world around them. To address this challenge, at UC Berkeley we are starting a new five year effort that focuses on the development of data-intensive systems that provide Real-Time Intelligence with Secure Execution (RISE). Following in the footsteps of AMPLab, RISELab is an interdisciplinary effort bringing together researchers across AI, robotics, security, and data systems. In this talk I’ll present our research vision and then discuss some of the applications that will be enabled by RISE technologies.
Time Series Analytics with Spark: Spark Summit East talk by Simon Ouellette (Spark Summit)
spark-timeseries is a Scala / Java / Python library for interacting with time series data on Apache Spark.
Time series are an important part of data science applications, but are notoriously difficult to handle in the context of distributed systems due to their sequential nature. Getting this right is therefore a challenging but important element of progress in the universe of distributed systems applied to data science.
This talk will cover the current overall design of spark-timeseries, the current functionalities, and will provide some usage examples. Because the project is still at an early stage, the talk will also cover the current weaknesses and future improvements that are in the spark-timeseries project roadmap.
Insights Without Tradeoffs Using Structured Streaming, keynote by Michael Armbrust (Spark Summit)
In Spark 2.0, we introduced Structured Streaming, which allows users to continually and incrementally update their view of the world as new data arrives, while still using the same familiar Spark SQL abstractions. I talk about the progress we've made since then on robustness, latency, expressiveness and observability, using examples of production end-to-end continuous applications.
Trends for Big Data and Apache Spark in 2017, by Matei Zaharia (Spark Summit)
Big data remains a rapidly evolving field with new applications and infrastructure appearing every year. In this talk, I’ll cover new trends in 2016 / 2017 and how Apache Spark is moving to meet them. In particular, I’ll talk about work Databricks is doing to make Apache Spark interact better with native code (e.g. deep learning libraries), support heterogeneous hardware, and simplify production data pipelines in both streaming and batch settings through Structured Streaming.
1. Big Data Analytics
- Big Data
- Spark: Big Data Analytics
- Resilient Distributed Datasets (RDD)
- Spark libraries (SQL, DataFrames, MLlib for machine learning, GraphX, and Streaming)
- PFP: Parallel FP-Growth
2. Ubiquitous Computing
- Edge Computing
- Cloudlet
- Fog computing
- Internet of Things (IoT)
- Virtualization
- Virtual Conferencing
- Virtual Events (2D, 3D, and Hybrid)
We are a company driven by inquisitive data scientists, having developed a pragmatic and interdisciplinary approach, which has evolved over the decades working with over 100 clients across multiple industries. Combining several Data Science techniques from statistics, machine learning, deep learning, decision science, cognitive science, and business intelligence, with our ecosystem of technology platforms, we have produced unprecedented solutions. Welcome to the Data Science Analytics team that can do it all, from architecture to algorithms.
Our practice delivers data driven solutions, including Descriptive Analytics, Diagnostic Analytics, Predictive Analytics, and Prescriptive Analytics. We employ a number of technologies in the area of Big Data and Advanced Analytics such as DataStax (Cassandra), Databricks (Spark), Cloudera, Hortonworks, MapR, R, SAS, Matlab, SPSS and Advanced Data Visualizations.
This presentation is designed for Spark Enthusiasts to get started and details of the course are below.
1. Introduction to Apache Spark
2. Functional Programming + Scala
3. Spark Core
4. Spark SQL + Parquet
5. Advanced Libraries
6. Tips & Tricks
7. Where do I go from here?
Apache Spark is an in-memory data processing solution that can work with existing data sources like HDFS and can make use of your existing computation infrastructure, such as YARN or Mesos. This talk will cover a basic introduction to Apache Spark and its various components, such as MLlib, Shark and GraphX, with a few examples.
Volodymyr Lyubinets, "Introduction to big data processing with Apache Spark" (IT Event)
In this talk we'll explore Apache Spark, the most popular cluster computing framework right now. We'll look at the improvements that Spark brought over Hadoop MapReduce and what makes Spark so fast; explore the Spark programming model and RDDs; and look at some sample use cases for Spark and big data in general.
This talk will be interesting for people who have little or no experience with Spark and would like to learn more about it. It will also be interesting to a general engineering audience as we’ll go over the Spark programming model and some engineering tricks that make Spark fast.
In this one-day workshop, we will introduce Spark at a high level. Spark is fundamentally different from writing MapReduce jobs, so no prior Hadoop experience is needed. You will learn how to interact with Spark on the command line and conduct rapid in-memory data analyses. We will then work on writing Spark applications to perform large cluster-based analyses, including SQL-like aggregations, machine learning applications, and graph algorithms. The course will be conducted in Python using PySpark.
This presentation details the capabilities of in-memory analytics using Apache Spark: an overview of Apache Spark with its programming model, cluster mode with Mesos, supported operations, and a comparison with Hadoop MapReduce, elaborating on the Apache Spark stack expansion with Shark, Streaming, MLlib and GraphX.
Hands-on Session on Big Data processing using Apache Spark and Hadoop Distributed File System
This is the first session in the series of "Apache Spark Hands-on"
Topics Covered
+ Introduction to Apache Spark
+ Introduction to RDD (Resilient Distributed Datasets)
+ Loading data into an RDD
+ RDD Operations - Transformation
+ RDD Operations - Actions
+ Hands-on demos using CloudxLab
2. What is Apache Spark?
Spark is an open source computation engine built on top of the popular Hadoop Distributed File System (HDFS): a fast and general cluster computing engine for large-scale data processing.
Efficient:
• In-memory computing
• DAG execution engine
• Up to 10× faster on disk, 100× in memory
Usable:
• 2-5× less code
• Rich APIs in Java, Scala, Python
• Interactive shell
4. Spark Community
Spark was initially developed at UC Berkeley's AMPLab and is being used and developed in a wide variety of companies.
MOST ACTIVE OPEN SOURCE PROJECTS IN BIG DATA
• More than 150 contributors in the past one year
• 25+ companies contributing, and it's growing
Spark was designed both to make traditional MapReduce programming easier and to support new types of applications, with one of the earliest focus areas being machine learning. Spark can be used to build fast end-to-end machine learning workflows.
8. Spark vs MapReduce
Run programs up to 100× faster than Hadoop MapReduce in memory, or 10× faster on disk.
Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.
11. Spark Installation
• Extract the compressed folder spark-1.3.0-bin-hadoop2.4
• From a terminal, go to spark-1.3.0-bin-hadoop2.4/bin
• Run pyspark
• Run rdd = sc.parallelize([0, 1, 2]); rdd.map(lambda x: x*x).collect()
• Get the result [0, 1, 4]
• It's that easy!
• Windows users might need to download and run an additional winutils.exe for smooth running of applications. Download winutils here: http://www.srccodes.com/p/article/39/error-util-shell-failed-locate-winutils-binary-hadoop-binary-path and add it to $HADOOP_HOME/bin
• Download a bigger zip (1.9GB) from http://bit.ly/1FpZAXH
12. Interactive Shell & Spark Context
Interactive shell:
• The fastest way to learn Spark
• Available in Python and Scala
• Runs as an application on an existing Spark cluster, or can run locally
Spark Context:
• Main entry point to Spark functionality
• Available in the shell as the variable sc
• In standalone programs, you'd make your own by creating a SparkContext object (see below for details)
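A minimal sketch of creating your own SparkContext in a standalone program (the app name and master URL below are placeholder examples):
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("MyApp").setMaster("local[*]")  # placeholder app name and master
sc = SparkContext(conf=conf)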
13. Key Concepts – The RDD Distributed Model
RDD – Resilient Distributed Dataset
Programs are written in terms of transformations on these distributed datasets.
• RDD = Resilient Distributed Dataset
• Transformations convert one RDD into another, with no actual calculation
• Actions force calculation of the result
• Lazy evaluation
14. RDDs - Motivation
RDDs are motivated by two types of applications that current data flow systems handle inefficiently:
1) Iterative algorithms - common in graph applications and machine learning
2) Interactive data mining tools
To achieve fault tolerance efficiently, RDDs provide a highly restricted form of shared memory: they are read-only datasets that can only be constructed through bulk operations on other RDDs. However, RDDs are expressive enough to capture a wide class of computations, including MapReduce and specialized computations.
16. Programming Model
Two types of operations on an RDD:
• transformations
• actions
Transformations are lazily evaluated: they are not executed when you issue the command. RDDs are computed (or recomputed) only when an action is executed.
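A quick way to see the laziness from the shell (a sketch; the values are illustrative):
>>> rdd = sc.parallelize(range(1, 5))
>>> squared = rdd.map(lambda x: x * x)  # nothing runs yet: map is a transformation
>>> squared.collect()  # the action triggers the computation
[1, 4, 9, 16]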
18. RDD Computational Model
• Operators on RDDs form a directed acyclic graph (DAG).
• If any partition held by a dead worker is lost, it can be recomputed by retracing the operator DAG.
19. FIRST STEP IN DATA ANALYSIS: Create an RDD
Read data from a text file on the local machine, S3 or HDFS into an RDD. Give life to an RDD using the Spark Context.
# Turn a Python collection into an RDD
> sc.parallelize([7, 8, 9])
# Load a text file from the local FS, HDFS, or S3
> sc.textFile("textfile.txt")
> sc.textFile("directory/*.txt")
> sc.textFile("hdfs://namenode:9000/path/file")
20. Transformations – RDD: map
Pass each element of an RDD through a function
>>>rdd = sc.parallelize(range(1,8))
>>>result_rdd = rdd.map(lambda x: x%3)
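Calling an action (here, collect) returns the computed values:
>>> result_rdd.collect()
[1, 2, 0, 1, 2, 0, 1]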
23. Some more RDD Transformations
rdd.flatMap(f): Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.
>>> rdd = sc.parallelize([2, 3, 4])
>>> sorted(rdd.flatMap(lambda x: range(1, x)).collect())  # collect() is the action applied to the transformation
[1, 1, 1, 2, 2, 3]
rdd.filter(f): Return a new RDD containing only the elements that satisfy a predicate.
>>> rdd = sc.parallelize([1, 2, 3, 4, 5])
>>> rdd.filter(lambda x: x % 2 == 0).collect()
[2, 4]
24. Some more RDD Transformations
sortBy(self, keyfunc, ascending=True, numPartitions=None): Sorts this RDD by the given keyfunc.
>>> data = [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)]
>>> rdd = sc.parallelize(data).sortBy(lambda x: x[0])
rdd.cache(): Cache the RDD in memory for repeated use.
countByKey(self): Count the number of elements for each key.
>>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
>>> rdd.countByKey().items()
join(self, other, numPartitions=None): Return an RDD containing all pairs of elements with matching keys in self and other.
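For example, a small join in the same style (the pairs below are illustrative):
>>> x = sc.parallelize([("a", 1), ("b", 4)])
>>> y = sc.parallelize([("a", 2), ("a", 3)])
>>> sorted(x.join(y).collect())  # inner join on keys; "b" has no match and is dropped
[('a', (1, 2)), ('a', (1, 3))]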
25. Setting the level of parallelism
All the pair RDD operations take an optional second parameter for the number of tasks:
> rdd.reduceByKey(lambda x, y: x + y, 5)
> rdd.groupByKey(5)
> rdd.join(pageViews, 5)
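A quick check that the parameter took effect (a sketch using getNumPartitions(), which PySpark exposes on RDDs):
>>> pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])
>>> counts = pairs.reduceByKey(lambda x, y: x + y, 5)
>>> counts.getNumPartitions()
5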
26. Some RDD Actions
RDD transformations are lazily evaluated; actions kick off the computation on the transformations, e.g. collect(), glom(), etc.
rdd.collect(): Return the RDD content as a list.
>>> rdd = sc.parallelize([1, 2, 3], 3)
>>> rdd2 = rdd.map(lambda x: x*x)
>>> rdd2.collect()
[1, 4, 9]
rdd.glom().collect(): Return a list of lists, one per partition.
>>> rdd = sc.parallelize([0, 1, 2], 3)
>>> rdd2 = rdd.map(lambda x: x*x)
>>> rdd2.glom().collect()
[[0], [1], [4]]
saveAsTextFile(path): Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system.
take(n): Return an array with the first n elements of the dataset.
first(): Return the first element of the dataset. Similar to take(1).
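For example:
>>> rdd = sc.parallelize([10, 20, 30])
>>> rdd.take(2)
[10, 20]
>>> rdd.first()
10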
27. In MapReduce you get only two operators – map and reduce – whereas Spark offers 80+ operations!
Workflows are automatically parallelized on Spark: a whole series of individual tasks is expressed as a single program flow that is lazily evaluated, so that the system has a complete picture of the execution graph.
28. Word Count
from pyspark import SparkContext
logFile = "hdfs://localhost:9000/user/bigdatavm/input"
sc = SparkContext("spark://bigdata-vm:7077", "WordCount")
textFile = sc.textFile(logFile)
wordCounts = textFile.flatMap(lambda line: line.split()) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
wordCounts.saveAsTextFile("hdfs://localhost:9000/user/bigdatavm/output")
29. Fault Tolerant and Persistent
RDDs track lineage information that can be used to efficiently recompute lost data:
msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
    .map(lambda s: s.split("\t")[2])
Spark will persist or cache RDD slices in memory on each node during operations. You can mark an RDD to be persisted with the cache or persist method on the RDD; persist also accepts a storage level.
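For example, a sketch of explicit persistence (MEMORY_AND_DISK is one of several storage levels):
from pyspark import StorageLevel
msgs.persist(StorageLevel.MEMORY_AND_DISK)  # spill partitions to disk if they do not fit in memory
# msgs.cache() is shorthand for persisting with the default memory-only storage level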
31. MLlib
Spark subproject providing machine learning primitives. Initial contribution from the AMPLab @ UC Berkeley; shipped with Spark since version 0.8; 35 contributors.
MLlib is Spark's scalable machine learning library, consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering and dimensionality reduction, as well as underlying optimization primitives.
Highlights include:
• Basic statistics – summary statistics, correlation and stratified sampling, hypothesis testing, random data generation
• Linear models for regression and classification (logistic and linear regression, SVMs)
• Naive Bayes and decision tree classifiers, ensembles of trees
• Collaborative filtering with ALS
• K-means clustering and Gaussian mixtures
• Stochastic gradient descent
• SVD (singular value decomposition) and PCA
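As a small taste of the basic statistics API (a sketch; colStats operates on an RDD of vectors):
from pyspark.mllib.stat import Statistics
from numpy import array
vectors = sc.parallelize([array([1.0, 2.0]), array([3.0, 4.0])])
summary = Statistics.colStats(vectors)
print(summary.mean())  # per-column means: [2.0, 3.0]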
32. Running a Spark Application
Command: spark-submit <python_file_path>
Let's see the implementation of:
1) K-Means
2) Logistic Regression
33. K-Means
# Import the required pyspark functions
from pyspark.mllib.clustering import KMeans
from numpy import array
from math import sqrt
from pyspark import SparkContext
sc = SparkContext()
data = sc.textFile("C:\\Users\\snehachalla\\Downloads\\spark-1.4.1-bin-hadoop2.4\\spark-1.4.1-bin-hadoop2.4\\bin\\kmeans_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')])).cache()
34. K-Means (cont.)
# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=10, runs=1, initializationMode="k-means||")
# Evaluate clustering by computing the sum of squared errors
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))
cost = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Sum of squared error = " + str(cost))
35. Logistic Regression
from pyspark.mllib.classification import LogisticRegressionWithSGD
from numpy import array
# Load and parse the data
data = sc.textFile("mllib/data/sample_svm_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))
# Build the model
model = LogisticRegressionWithSGD.train(parsedData)
# Evaluate the model on the training data
labelsAndPreds = parsedData.map(lambda point: (int(point.item(0)), model.predict(point.take(range(1, point.size)))))
trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(parsedData.count())
print("Training Error = " + str(trainErr))
36. Spark UI
Run Spark in local mode (pyspark); the Spark UI is at http://localhost:4040. There you will be able to see the RDD sizes and identify slow-running tasks.
38. Setting up an EMR Cluster
If your data is too large to compute on your local machine, then you're in the right place. An easy way to get Spark running is with EC2.
• Create an account on aws.amazon.com
• Get a key pair from the AWS console (this is the security for your instance): https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#KeyPairs:sort=keyName
• Create an EMR instance and configure the nodes: https://console.aws.amazon.com/console/home?region=us-east-1#
• Launch the EMR instance: https://console.aws.amazon.com/ec2/v2/home?region=us-east-1
40. For more info on Spark
Website: http://spark.apache.org
Tutorials: http://ampcamp.berkeley.edu
Spark Summit: http://spark-summit.org
GitHub: https://github.com/apache/spark
Mailing lists: user@spark.apache.org, dev@spark.apache.org
Python API documentation: http://spark.apache.org/docs/latest/api/python/
MapReduce has been around as the major framework for distributed computing for 10 years – that is pretty old in technology time! Well-known limitations include:
1. Programmability
a. Requires multiple chained MR steps
b. Specialized systems for applications
2. Performance
a. Writes to disk between each computational step
b. Expensive for apps to "reuse" data
i. Iterative algorithms
ii. Interactive analysis
Most machine learning algorithms are iterative …
Spark provides an efficient way to solve iterative algorithms by keeping the intermediate data in memory. This avoids the overhead of reading and writing the intermediate data to disk, as is the case with MR. Also, when running the same operation again and again, data can be cached in and fetched from memory without performing the same operation again. MR is stateless: let's say a program/application in MR has been executed 10 times, then the whole data set has to be scanned 10 times.
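A sketch of why caching matters for iterative work (the file path and predicate are illustrative): load and parse the input once, mark it cached, and every pass after the first reads from memory instead of rescanning the file.
points = sc.textFile("hdfs://namenode:9000/data/points.txt") \
    .map(lambda line: [float(x) for x in line.split()]).cache()
for i in range(10):
    # each pass reuses the in-memory partitions instead of re-reading from disk
    count = points.filter(lambda p: p[0] > 0).count()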
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
http://www.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-82.pdf
Comprehensive list of actions:
http://spark.apache.org/docs/latest/programming-guide.html#actions
Automatic Parallelization of Complex Flows
When constructing a complex pipeline of MapReduce jobs, the task of correctly parallelizing the sequence of jobs is left to you. Thus, a scheduler tool such as Apache Oozie is often required to carefully construct this sequence.
With Spark, a whole series of individual tasks is expressed as a single program flow that is lazily evaluated so that the system has a complete picture of the execution graph. This approach allows the core scheduler to correctly map the dependencies across different stages in the application, and automatically parallelize the flow of operators without user intervention.
Spark also allows you to access these operators in the context of a full programming language; thus, you can use control statements, functions, and classes as you would in a typical programming environment.
This capability additionally enables certain optimizations in the engine while reducing the burden on the application developer. Win, and win again!
Spark runs on Hadoop, Mesos, standalone, or in the cloud, and can access diverse data sources including HDFS, Cassandra, HBase, and S3. You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, or on Apache Mesos, and access data in HDFS, Cassandra, HBase, Hive, Tachyon, and any Hadoop data source.
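As a sketch (host names are placeholders), the same PySpark application can target any of these cluster managers purely through the master URL; only one SparkContext can be active at a time, so the alternatives are commented out:
from pyspark import SparkContext
sc = SparkContext("local[4]", "MyApp")             # local mode with 4 worker threads
# sc = SparkContext("spark://host:7077", "MyApp")  # standalone cluster
# sc = SparkContext("mesos://host:5050", "MyApp")  # Apache Mesos
# for YARN, submit with: spark-submit --master yarn app.py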
The RDD data model and cached in-memory computing allow Spark to quickly and easily solve workflows and use cases similar to those that are part of Hadoop. Spark has a series of high-level tools at its disposal that are added as component libraries, not integrated into the general computing framework.
Know more here: http://spark.apache.org/docs/latest/mllib-guide.html