Spark is a parallel framework that provides efficient primitives for in-memory data sharing across clusters. It introduces Resilient Distributed Datasets (RDDs), which let data be partitioned across nodes and transformed with operations like map and filter. RDDs track the lineage of transformations so that lost partitions can be recomputed rather than replicated. This lets interactive queries and iterative algorithms run much faster than on disk-based systems like MapReduce by avoiding expensive replication and I/O.
Making Big Data Analytics Interactive and Real-Time
1. Spark: Making Big Data Analytics Interactive and Real-Time
Matei Zaharia, in collaboration with Mosharaf Chowdhury, Tathagata Das, Timothy Hunter, Ankur Dave, Haoyuan Li, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica
spark-project.org
UC Berkeley
2. Overview
Spark is a parallel framework that provides:
» Efficient primitives for in-memory data sharing
» Simple APIs in Scala, Java, SQL
» High generality (applicable to many emerging problems)
This talk will cover:
» What it does
» How people are using it (including some surprises)
» Current research
3. Motivation
MapReduce simplified data analysis on large, unreliable clusters.
But as soon as it got popular, users wanted more:
» More complex, multi-pass applications (e.g. machine learning, graph algorithms)
» More interactive ad-hoc queries
» More real-time stream processing
One reaction: specialized models for some of these apps (e.g. Pregel, Storm)
4. Motivation
Complex apps, streaming, and interactive queries all need one thing that MapReduce lacks: efficient primitives for data sharing.
5. Examples
Iterative jobs: HDFS read → iter. 1 → HDFS write → HDFS read → iter. 2 → HDFS write → …
Interactive queries: Input → HDFS read → query 1 → result 1; query 2 → result 2; query 3 → result 3; …
Slow due to replication and disk I/O, but necessary for fault tolerance.
6. Goal: Sharing at Memory Speed
Iterative jobs: Input → iter. 1 → iter. 2 → … (in memory)
Interactive queries: Input → one-time processing → query 1, query 2, query 3, …
10-100× faster than network/disk, but how to get FT?
7. Challenge
How to design a distributed memory abstraction that is both fault-tolerant and efficient?
8. Existing Systems
Existing in-memory storage systems have interfaces based on fine-grained updates:
» Reads and writes to cells in a table
» E.g. databases, key-value stores, distributed memory
Requires replicating data or logs across nodes for fault tolerance ⇒ expensive!
» 10-100x slower than memory write…
9. Solution: Resilient Distributed Datasets (RDDs)
Provide an interface based on coarse-grained operations (map, group-by, join, …)
Efficient fault recovery using lineage:
» Log one operation to apply to many elements
» Recompute lost partitions on failure
» No cost if nothing fails
11. Generality of RDDs
RDDs can express surprisingly many parallel algorithms
» These naturally apply the same operation to many items
Capture many current programming models
» Data flow models: MapReduce, Dryad, SQL, …
» Specialized models for iterative apps: Pregel, iterative MapReduce, PowerGraph, …
Allow these models to be composed
13. Spark Programming Interface
Language-integrated API in Scala*
Provides:
» Resilient distributed datasets (RDDs)
  • Partitioned collections with controllable caching
» Operations on RDDs
  • Transformations (define RDDs), actions (compute results)
» Restricted shared variables (broadcast, accumulators)
*Also Java, SQL and soon Python
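Since the deck never shows the two restricted shared variables side by side, here is a minimal sketch of both, assuming a SparkContext named sc and using the broadcast/longAccumulator calls from released Spark (2.x, which postdates this talk); names like badRecords are illustrative:

    // Broadcast: read-only data shipped to each node once, not once per task.
    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

    // Accumulator: tasks may only add to it; only the driver reads the total.
    val badRecords = sc.longAccumulator("badRecords")

    val resolved = sc.parallelize(Seq("a", "b", "c")).map { k =>
      val v = lookup.value.get(k)          // read the broadcast copy on the worker
      if (v.isEmpty) badRecords.add(1)     // count keys missing from the lookup
      v.getOrElse(-1)
    }.collect()

    println(s"resolved: ${resolved.mkString(",")}, bad: ${badRecords.value}")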
14. Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")           // base RDD
errors = lines.filter(_.startsWith("ERROR"))   // transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.persist()

cachedMsgs.filter(_.contains("foo")).count     // action
cachedMsgs.filter(_.contains("bar")).count

(Diagram: the driver ships tasks to workers; each worker reads a block of the file, Blocks 1-3, and caches its message partition, Msgs. 1-3, in memory.)
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data).
15. Fault Recovery
RDDs track the graph of transformations that built them (their lineage) to rebuild lost data. E.g.:

messages = textFile(...).filter(_.contains("error"))
                        .map(_.split('\t')(2))

HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))
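Released Spark exposes this lineage graph directly; a minimal sketch, assuming a local SparkContext and an illustrative HDFS path:

    import org.apache.spark.{SparkConf, SparkContext}

    object LineageDemo extends App {
      val sc = new SparkContext(
        new SparkConf().setAppName("lineage-demo").setMaster("local[*]"))

      val messages = sc.textFile("hdfs://.../app.log")   // HadoopRDD
        .filter(_.contains("error"))                     // FilteredRDD
        .map(_.split('\t')(2))                           // MappedRDD

      // Prints the chain of parent RDDs; this is exactly what Spark replays
      // against the input block if a cached partition is lost.
      println(messages.toDebugString)
    }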
17. Example: Logistic Regression
Goal: find best line separating two sets of points.
(Diagram: + and – points in the plane, a random initial line, and the target separator.)
18. Example: Logistic Regression

val data = spark.textFile(...).map(readPoint).persist()
var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
19. Logistic Regression Performance
(Chart: running time in minutes vs. number of iterations, 1-30, for Hadoop and Spark.)
Hadoop: 110 s/iteration. Spark: 80 s for the first iteration, 6 s for further iterations.
20. Example: Collaborative Filtering
Goal: predict users' movie ratings based on past ratings of other movies.
(Diagram: R, a sparse Users × Movies matrix of known ratings 1-5 with many unknown "?" entries.)
21. Model and Algorithm
Model R as the product of user and movie feature matrices A and B of size U×K and M×K: R = A Bᵀ
Alternating Least Squares (ALS):
» Start with random A & B
» Optimize user vectors (A) based on movies
» Optimize movie vectors (B) based on users
» Repeat until converged
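For reference, the optimization the slide alternates over is the standard ALS least-squares objective (not spelled out in the deck; λ is the usual regularization weight):

    \min_{A,B} \sum_{(i,j)\,:\,R_{ij}\ \text{known}} \big(R_{ij} - a_i^\top b_j\big)^2
      + \lambda \Big(\sum_i \lVert a_i \rVert^2 + \sum_j \lVert b_j \rVert^2\Big)

With B fixed, each user vector a_i becomes an independent K×K least-squares solve (symmetrically for each b_j with A fixed), which is why the per-row updateUser/updateMovie calls on the next slides parallelize so cleanly.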
22. Serial ALS

var R = readRatingsMatrix(...)
var A = // array of U random vectors
var B = // array of M random vectors

for (i <- 1 to ITERATIONS) {
  A = (0 until U).map(i => updateUser(i, B, R))   // (0 until U), (0 until M) are Range objects
  B = (0 until M).map(i => updateMovie(i, A, R))
}
23. Naïve Spark ALS

var R = readRatingsMatrix(...)
var A = // array of U random vectors
var B = // array of M random vectors

for (i <- 1 to ITERATIONS) {
  A = spark.parallelize(0 until U, numSlices)
           .map(i => updateUser(i, B, R))
           .collect()
  B = spark.parallelize(0 until M, numSlices)
           .map(i => updateMovie(i, A, R))
           .collect()
}

Problem: R re-sent to all nodes in each iteration.
24. Efficient Spark ALS

var R = spark.broadcast(readRatingsMatrix(...))   // Solution: mark R as a broadcast variable
var A = // array of U random vectors
var B = // array of M random vectors

for (i <- 1 to ITERATIONS) {
  A = spark.parallelize(0 until U, numSlices)
           .map(i => updateUser(i, B, R.value))
           .collect()
  B = spark.parallelize(0 until M, numSlices)
           .map(i => updateMovie(i, A, R.value))
           .collect()
}

Result: 3× performance improvement.
25. Scaling Up Broadcast
(Charts: iteration time in seconds, split into communication and computation, vs. number of machines (10, 30, 60, 90), comparing the initial HDFS-based broadcast with Cornet P2P broadcast.)
[Chowdhury et al, SIGCOMM 2011]
26. Other RDD Operations
Transformations (define a new RDD): map, flatMap, filter, union, sample, join, groupByKey, cogroup, reduceByKey, cross, sortByKey, …
Actions (return a result to the driver program): collect, reduce, count, save, …
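A minimal end-to-end sketch chaining several of these operations (assumes a local SparkContext; the input path is illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    object OpsDemo extends App {
      val sc = new SparkContext(
        new SparkConf().setAppName("ops-demo").setMaster("local[*]"))

      val counts = sc.textFile("hdfs://.../docs")   // define an RDD from storage
        .flatMap(_.split("\\s+"))                   // transformation: lines -> words
        .map(w => (w, 1))                           // transformation: pair each word with 1
        .reduceByKey(_ + _)                         // transformation: per-key aggregation
        .sortByKey()                                // transformation: order alphabetically

      counts.take(10).foreach(println)              // action: pull results to the driver
      sc.stop()
    }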
33. Sample Data
Credit: Tim Hunter, with support of the Mobile Millennium team; P.I. Alex Bayen; traffic.berkeley.edu
34. Challenge
Data is noisy and sparse (1 sample/minute).
Must infer the path taken by each vehicle in addition to the travel time distribution on each link.
36. Solution
EM algorithm to estimate paths and travel time distributions simultaneously:
observations → flatMap → weighted path samples → groupByKey → link parameters, which are broadcast back for the next iteration
37. Results
[Hunter et al, SOCC 2011]
3× speedup from caching, 4.5× from broadcast.
38. Conviva GeoReport
(Chart: time in hours) Hive: 20; Spark: 0.5.
SQL aggregations on many keys with the same filter.
40× gain over Hive from avoiding repeated I/O, deserialization and filtering.
39. Other Programming Models
Pregel on Spark (Bagel)
» 200 lines of code
Iterative MapReduce
» 200 lines of code
Hive on Spark (Shark)
» 5000 lines of code
» Compatible with Apache Hive
» Machine learning ops. in Scala
41. Implementation
Runs on the Apache Mesos cluster manager to coexist w/ Hadoop. (Diagram: Spark, Hadoop, MPI, … running side by side on Mesos across nodes.)
Supports any Hadoop storage system (HDFS, HBase, …)
Easy local mode and EC2 launch scripts
No changes to Scala
42. Task Scheduler
Runs general DAGs
Pipelines functions within a stage
Cache-aware data reuse & locality
Partitioning-aware to avoid shuffles
(Diagram: a job over RDDs A-G split into Stages 1-3 at groupBy, map, join and union boundaries; shaded boxes mark cached data partitions.)
43. Language Integration
Scala closures are Serializable Java objects
» Serialize on master, load & run on workers
Not quite enough
» Nested closures may reference the entire outer scope, pulling in non-Serializable variables not used inside
» Solution: bytecode analysis + reflection
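To make the capture problem concrete, here is a plain-Scala sketch (class and field names are made up; this shows ordinary JVM closure capture, not Spark's closure cleaner itself):

    import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

    class Holder {                              // NOT Serializable
      val big = new Array[Byte](1 << 20)        // state the task never needs
      val threshold = 10                        // the only value the closure uses

      // References a field, so the closure captures the whole Holder instance.
      def makeFilter: Int => Boolean = x => x > threshold

      // Copying into a local first keeps the closure self-contained.
      def makeFilterClean: Int => Boolean = {
        val t = threshold
        x => x > t
      }
    }

    object ClosureDemo extends App {
      def trySerialize(f: Int => Boolean): Unit =
        try {
          new ObjectOutputStream(new ByteArrayOutputStream).writeObject(f)
          println("serializable")
        } catch { case e: NotSerializableException => println("fails: " + e) }

      val h = new Holder
      trySerialize(h.makeFilter)       // fails: dragged in the non-Serializable Holder
      trySerialize(h.makeFilterClean)  // serializable: captures only an Int
    }

Spark's solution on the slide goes further: rather than asking users to hand-copy locals, it trims the captured scope via bytecode analysis and reflection.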
44. Interactive Spark
Modified the Scala interpreter to allow Spark to be used interactively from the command line:
» Track variables that each line depends on
» Ship generated classes to workers
Enables in-memory exploration of big data
47. Motivation
Many “big data” apps need to work in real time
» Site statistics, spam filtering, intrusion detection, …
To scale to 100s of nodes, need:
» Fault-tolerance: for both crashes and stragglers
» Efficiency: don't consume many resources beyond base processing
Challenging in existing streaming systems
48. Traditional Streaming Systems
Continuous processing model
» Each node has long-lived state
» For each record, update state & send new records
(Diagram: input records pushed through nodes 1-3, each holding mutable state.)
49. Traditional Streaming Systems
Fault tolerance via replication or upstream backup. (Diagrams: a synchronized replica of nodes 1-3 vs. standby nodes 1'-3' fed by upstream backup.)
50. Traditional Streaming Systems
Replication: fast recovery, but 2× hardware cost. Upstream backup: only need 1 standby, but slow to recover.
51. Traditional Streaming Systems
Neither approach can handle stragglers. [Borealis, Flux] [Hwang et al, 2005]
52. Observation
Batch processing models, such as MapReduce, do provide fault tolerance efficiently
» Divide job into deterministic tasks
» Rerun failed/slow tasks in parallel on other nodes
Idea: run streaming computations as a series of small, deterministic batch jobs
» Same recovery schemes at much smaller timescale
» To make latency low, store state in RDDs
53. Discretized Stream Processing
(Diagram: at t = 1, the input for that interval is stored reliably as an immutable dataset; a batch operation pulls it and produces another immutable dataset, the output or state, stored as an RDD; the same repeats at t = 2, …)
54. Fault Recovery
Checkpoint state RDDs periodically.
If a node fails/straggles, rebuild lost RDD partitions in parallel on other nodes (diagram: map from input dataset to output dataset, recomputed partition by partition).
Faster recovery than upstream backup, without the cost of replication.
55. How Fast Can It Go?
Can process over 60M records/s (6 GB/s) on 100 nodes at sub-second latency.
(Charts: maximum throughput in millions of records/s vs. nodes in cluster, 0-100, for Grep and Top K Words, under 1 sec and 2 sec latency bounds.)
56. Comparison with Storm
(Charts: MB/s per node vs. record size, 100-1000 bytes, for Spark and Storm, on Grep and Top K Words.)
Storm limited to 100K records/s/node; also tried S4: 10K records/s/node; commercial systems: O(500K) total. These lack Spark's FT guarantees.
57. How Fast Can It Recover?
Recovers from faults/stragglers within 1 second.
(Chart: interval processing time in seconds, 0-1.5 s, over time 0-25 s, with the failure point marked.)
Sliding WordCount on 20 nodes with 10s checkpoint interval.
58. Programming Interface
Extension to Spark: Spark Streaming
» All Spark operators plus new “stateful” ones

// Running count of pageviews by URL (D-stream)
views = readStream("http:...", "1s")
ones = views.map(ev => (ev.url, 1))
counts = ones.runningReduce(_ + _)

(Diagram: at each interval t = 1, 2, …, the "views" discretized stream is mapped to "ones" and reduced into "counts"; boxes are RDDs, segments are partitions.)
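For readers reproducing this on released Spark Streaming, the shipped API spells these ideas differently; a sketch assuming Spark 1.x/2.x streaming, a local socket source, and an illustrative checkpoint path (updateStateByKey plays the role of runningReduce):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object PageViewCounts extends App {
      val conf = new SparkConf().setAppName("pageviews").setMaster("local[2]")
      val ssc  = new StreamingContext(conf, Seconds(1))    // 1s batches, as in the slide
      ssc.checkpoint("/tmp/pageviews-ckpt")                // required by stateful operators

      val views = ssc.socketTextStream("localhost", 9999)  // one URL per line
      val ones  = views.map(url => (url, 1))
      val counts = ones.updateStateByKey[Int] { (hits, prev) =>
        Some(prev.getOrElse(0) + hits.sum)                 // running count per URL
      }
      counts.print()

      ssc.start()
      ssc.awaitTermination()
    }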
59. Incremental Operators

words.reduceByWindow("5s", max)        // associative function
words.reduceByWindow("5s", _+_, _-_)   // associative & invertible

(Diagrams: with only an associative function, each sliding window is recomputed by re-reducing the interval counts it covers; with an invertible function, the next window is derived from the previous one by adding the entering interval and subtracting the leaving one.)
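The invertible form is what makes long sliding windows cheap: instead of re-reducing the whole window each interval, add the interval entering and subtract the one leaving. A self-contained plain-Scala sketch of that arithmetic (window length and counts are made up):

    object SlidingSum extends App {
      val perInterval = Seq(3, 1, 4, 1, 5, 9, 2, 6)  // counts arriving per interval
      val window = 5                                  // window spans 5 intervals

      // Naive: re-reduce the whole window each step -> O(window) per interval.
      val naive = perInterval.indices.map { t =>
        perInterval.slice((t - window + 1).max(0), t + 1).sum
      }

      // Incremental: new = old + entering - leaving -> O(1) per interval.
      var running = 0
      val incremental = perInterval.indices.map { t =>
        running += perInterval(t)
        if (t >= window) running -= perInterval(t - window)
        running
      }

      assert(naive == incremental)  // max, by contrast, has no inverse, so it can't do this
      println(incremental.mkString(", "))
    }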
60. Applications
(Charts: throughput vs. nodes in cluster for two deployments.)
Mobile Millennium traffic estimation: GPS observations/sec (axis to 2000); online EM algorithm.
Conviva video dashboard: active video sessions in millions (axis to 4.0); >50 session-level metrics.
61. Unifying Streaming and Batch
D-streams and RDDs can seamlessly be combined
» Same execution and fault recovery models
Enables powerful features:
» Combining streams with historical data (used in the Mobile Millennium application):
  pageViews.join(historicCounts).map(...)
» Interactive ad-hoc queries on stream state (used in the Conviva app):
  pageViews.slice("21:00", "21:05").topK(10)
62. Benefits of a Unified Stack
Write each algorithm only once
Reuse data across streaming & batch jobs
Query stream state instead of waiting for import
Some users were doing this manually!
» Conviva anomaly detection, Quantifind dashboard
63. Conclusion
“Big data” is moving beyond one-pass batch jobs, to low-latency apps that need data sharing
RDDs offer fault-tolerant sharing at memory speed
Spark uses them to combine streaming, batch & interactive analytics in one system
www.spark-project.org
64. Related Work
DryadLINQ, FlumeJava
» Similar “distributed collection” API, but cannot reuse datasets efficiently across queries
GraphLab, Piccolo, BigTable, RAMCloud
» Fine-grained writes requiring replication or checkpoints
Iterative MapReduce (e.g. Twister, HaLoop)
» Implicit data sharing for a fixed computation pattern
Relational databases
» Lineage/provenance, logical logging, materialized views
Caching systems (e.g. Nectar)
» Store data in files, no explicit control over what is cached
65. Behavior with Not Enough RAM
(Chart: iteration time in seconds vs. % of working set in memory.)
Cache disabled: 68.8 s; 25%: 58.1 s; 50%: 40.7 s; 75%: 29.7 s; fully cached: 11.5 s.
66. RDDs for Debugging
Debugging general distributed apps is very hard.
However, Spark and other recent frameworks run deterministic tasks for fault tolerance.
Leverage this determinism for debugging:
» Log lineage for all RDDs created (small)
» Let user replay any task in jdb, rebuild any RDD to query it interactively, or check assertions