Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)

•

3 likes•1,931 views

The document discusses graph computations and Pregel, an API for graph processing. It introduces Pregel's vertex-centric programming model where computation is organized into supersteps and depends only on neighboring vertices. Examples like PageRank are shown implemented in Pregel. GraphX is also introduced as a library providing Pregel-like abstractions on Spark. The document then discusses distributing matrix computations, covering partitioning schemes for matrices and how to distribute operations like multiplication and singular value decomposition (SVD) across a cluster.

Reza Zadeh
Matrix and Graph Computations
@Reza_Zadeh | http://reza-zadeh.com

Overview

Graph Computations and Pregel"

Introduction to Matrix Computations

Data Flow Models
Restrict the programming interface so that the
system can do more automatically
Express jobs as graphs of high-level operators
» System picks how to split each operator into tasks
and where to run each task
» Run parts twice fault recovery
New example: Pregel (parallel graph google)

Pregel
oogle
Expose specialized APIs to simplify graph
programming.

“Think like a vertex”

Graph-Parallel Pattern

Model / Alg.
State
Computation depends
only on the neighbors

Pregel
Data
Flow

Input
graph
Vertex
state
1
Messages
1

Superstep
1

Vertex
state
2
Messages
2

Superstep
2

.

.

.

Group
by
vertex
ID

Group
by
vertex
ID

Simple
Pregel
in
Spark

Separate
RDDs
for
immutable
graph
state
and

for
vertex
states
and
messages
at
each
iteration

Use
groupByKey
to
perform
each
step

Cache
the
resulting
vertex
and
message
RDDs

Optimization:
co-‐partition
input
graph
and

vertex
state
RDDs
to
reduce
communication

Update ranks in parallel
Iterate until convergence
Rank of
user i

Weighted sum of
neighbors’ ranks

Example: PageRank
R[i] = 0.15 +
X
j2Nbrs(i)
wjiR[j]

PageRank
in
Pregel

Input
graph
Vertex
ranks
1
Contributions
1

Superstep
1
(add
contribs)

Vertex
ranks
2
Contributions
2

Superstep
2
(add
contribs)

.

.

.

Group
&
add
by
vertex

Group
&
add
by
vertex

GraphX
Provides
Pregel
message-‐passing
and
other

operators
on
top
of
RDDs

GraphX: Triplets
The triplets operator joins vertices and edges:

F

E

Map Reduce Triplets
Map-Reduce for each vertex
D

B

A

C

mapF( )

A
B

mapF( )

A
C

A1

A2

reduceF( , )

A1
A2
A

F

E

Example: Oldest Follower
D

B

A

C
What is the age of the oldest
follower for each user?
val oldestFollowerAge = graph
.mrTriplets(
e=> (e.dst.id, e.src.age),//Map
(a,b)=> max(a, b) //Reduce
)
.vertices
23

42

30

19

75

16

Summary of Operators
All operations:
https://spark.apache.org/docs/latest/graphx-programming-guide.html#summary-list-of-operators

Pregel API:
https://spark.apache.org/docs/latest/graphx-programming-guide.html#pregel-api

The GraphX Stack"
(Lines of Code)
GraphX (3575)

Spark

Pregel (28) + GraphLab (50)

PageRank
(5)

Connected
Comp. (10)

Shortest
Path (10)

ALS

(40)

LDA

(120)

K-core

(51)

Triangle

Count

(45)

SVD

(40)

Optimizations
Overloaded vertices have their work
distributed

More examples
In your HW: Single-Source-Shortest Paths
using Pregel

Distributing Matrices
How to distribute a matrix across machines?
»  By Entries (CoordinateMatrix)
»  By Rows (RowMatrix)
»  By Blocks (BlockMatrix)
All of Linear Algebra to be rebuilt using these
partitioning schemes
As
of
version
1.3

Distributing Matrices
Even the simplest operations require thinking
about communication e.g. multiplication

How many different matrix multiplies needed?
»  At least one per pair of {Coordinate, Row,
Block, LocalDense, LocalSparse} = 10
»  More because multiplies not commutative

Distributed Singular Value Decomposition

Singular Value Decomposition
Two cases
»  Tall and Skinny
»  Short and Fat (not really)
»  Roughly Square
SVD method on RowMatrix takes care of
which one to call.

Tall and Skinny SVD
Gets
us

V
and
the

singular
values

Gets
us

U
by
one

matrix
multiplication

Square SVD
ARPACK: Very mature Fortran77 package for
computing eigenvalue decompositions"

JNI interface available via netlib-java"

Distributed using Spark – how?

Square SVD via ARPACK
Only interfaces with distributed matrix via
matrix-vector multiplies

The result of matrix-vector multiply is small.
The multiplication can be distributed.

Square SVD
With 68 executors and 8GB memory in each,
looking for the top 5 singular vectors

This document discusses GraphX, a graph processing system built on Apache Spark. It defines what graphs are, including vertices and edges. It explains that GraphX uses Resilient Distributed Datasets (RDDs) to keep data in memory for iterative graph algorithms. GraphX implements the Pregel computational model where each vertex can modify its state, receive and send messages to neighbors each superstep until halting. The document provides examples of graph algorithms and notes when GraphX is well-suited versus a graph database.

Parallel Graph Analytics

Stefano Romanazzi

Download It

butest

The document discusses using Map-Reduce for machine learning algorithms on multi-core processors. It describes rewriting machine learning algorithms in "summation form" to express the independent computations as Map tasks and aggregating results as Reduce tasks. This formulation allows the algorithms to be parallelized efficiently across multiple cores. Specific machine learning algorithms that have been implemented or analyzed in this Map-Reduce framework are listed.

MLLeap, or How to Productionize Data Science Workflows Using Spark by Mikha...

Spark Summit

MLeap is a machine learning platform that enables data scientists and engineers to collaborate using a single environment. It allows machine learning models trained using Spark to be deployed to production APIs without dependencies on Spark. MLeap addresses the common problems of data scientists and engineers having to re-write data pipelines and model code for production. It provides core machine learning components, a runtime for executing models without Spark, and tools for converting Spark models to the MLeap format. A demo is shown training and deploying models to an API in under 5 minutes.

Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...

Spark Summit

This document discusses using Spark and R for petabyte-scale data science at Comcast. It describes Comcast's top data initiatives, how Spark and SparkR enable machine learning algorithms, and an example of using a Hidden Markov Model with SparkR to analyze streaming data and detect hidden states. Performance testing showed the model processing 1.7 billion observations per day in 30 minutes with 92% accuracy.

Streaming sql w kafka and flink

Kenny Gorman

Scalding intro 20141125

Erik Schmiegelow

Scalding is a Scala API for MapReduce applications built on top of Cascading. Cascading introduces the concepts of source taps (input) and sink taps (output) connected by pipes to abstract the key/value scheme in MapReduce. Within pipes, users define data transformations using operations like GroupBy. Scalding offers a domain specific language for Cascading to operate on data flows with functions rather than constructing objects. It provides three APIs - Field, Type Safe, and Matrix - that allow different levels of access to Cascading functionality. Scalding includes common functions like map, filter, and reduce and also grouping and joining functions to transform data in MapReduce workflows.

Writing an Interactive Interface for SQL on Flink

Eventador

This document discusses designing machine learning algorithms for Apache Spark. It explains that Spark is a fast, general-purpose cluster computing system that is well-suited for machine learning algorithms due to its ability to perform iterative computations in-memory. The document outlines key considerations for writing Spark algorithms, such as minimizing shared state, achieving linear scalability, and reworking algorithms that have quadratic complexity. It provides an example of implementing the Silhouette clustering evaluation metric in a Spark-optimized way that reduces the computational complexity from quadratic to linear.

Spark Summit EU talk by Chris Pool and Jeroen Vlek

Spark Summit

This document discusses a predictive maintenance project for switch failures on Dutch railways. The goal is to predict switch failures 3 weeks in advance using sensor data collected from switches. Features are derived from the sensor data and a decision tree model is used to classify switches into failure risk categories. The current system uses SQL but will transition to Spark for increased data volumes and streaming capabilities. Initial results show high precision and recall. Future work involves deep learning models and predicting specific failure types.

Uber Business Metrics Generation and Management Through Apache Flink

Wenrui Meng

Uber uses Apache Flink to generate and manage business metrics in real-time from raw streaming data sources. The system defines metrics using a domain-specific language and optimizes an execution plan to generate the metrics directly rather than first generating raw datasets. This avoids inefficiencies, inconsistencies, and wasted resources. The system provides a unified way to define metrics from multiple data sources and store results in various databases and warehouses.

Inside Apache SystemML by Frederick Reiss

Spark Summit

1. The document describes the origins and goals of the SystemML project for scalable machine learning. 2. SystemML was created to allow data scientists to write machine learning algorithms in R and automatically compile and optimize them to run efficiently on large datasets in parallel. 3. An example alternating least squares algorithm is shown written concisely in R, while traditional approaches required translating algorithms to other languages like Scala which was error-prone and slowed iteration. SystemML aims to allow the same algorithm to run fast at large scale with the same answer.

Willump: Optimizing Feature Computation in ML Inference

Databricks

Beam summit 2019 - Unifying Batch and Stream Data Processing with Apache Calc...

Khai Tran

Computation convergence problems, auto-generate Beam API code from Pig scripts, convergences at LinkedIn with AORA (Author Once Run Anywhere) principle. Blog post: https://engineering.linkedin.com/blog/2019/01/bridging-offline-and-nearline-computations-with-apache-calcite Code example: Pig script: https://gist.github.com/khaitranq/1d06c27832f15fa52a4a7e2fa7bec340 Beam autogen code: https://gist.github.com/khaitranq/785dbb8495cd382788f3ca8200231d8

Anatomy of Spark SQL Catalyst - Part 2

datamantra

Catalyst is a framework that defines relational operators and expressions for Spark SQL. It converts DataFrame queries into logical and physical query plans. The logical plans are optimized through rules before being converted to Spark plans by strategies. These strategies implement the query planner to transform logical plans into physical plans that can be executed on RDDs. Key components include the analyzer, optimizer, query planner and strategies like filtering that generate Spark plans from logical plans.

Advanced Neo4j Use Cases with the GraphAware Framework

Michal Bachman

The document discusses GraphAware Framework, which makes it easy to build, test, and deploy custom APIs, transaction-driven behavior, and asynchronous computation functionality for Neo4j. It provides examples like representing time series data, tracking graph changes, assigning UUIDs, and running algorithms. GraphAware Framework is open source and supports building both generic and domain-specific Neo4j extensions.

Akka Streams

Diego Pacheco

This document discusses Akka Streams, which is a library for building reactive stream processing applications. It defines key concepts like streams, processing stages, back pressure, and flow control. It also covers topics like stream composition, materialization, using actors in streams, and includes code examples. The document is presented by Diego Pacheco, a principal software architect at Akka Streams.

Online learning with structured streaming, spark summit brussels 2016

Ram Sriharsha

This document summarizes an online presentation about online learning with structured streaming in Spark. The key points are: - Online learning updates model parameters for each data point as it arrives, unlike batch learning which sees the full dataset before updating. - Structured streaming in Spark provides a single API for batch, streaming, and machine learning workloads. It offers exactly-once guarantees and understands external event time. - Streaming machine learning on structured streaming works by having a stateful aggregation query that picks up the last trained model and performs a distributed update and merge on each trigger interval. This allows modeling streaming data in a fault-tolerant way.

Spark architecture

datamantra

Spark uses Resilient Distributed Datasets (RDDs) as its fundamental data structure. RDDs are immutable, lazy evaluated collections of data that can be operated on in parallel. This allows RDD transformations to be computed lazily and combined for better performance. RDDs also support type inference and caching to improve efficiency. Spark programs run by submitting jobs to a cluster manager like YARN or Mesos, which then schedule tasks across worker nodes where the lazy transformations are executed.

Scilab-by-dr-gomez-june2014

Ir. Dr. R.Badlishah Ahmad

Scilab is a free and open-source numerical computation software developed by Scilab Enterprises. It includes a high-level programming language, hundreds of mathematical functions, advanced data structures, and a computation engine that can be easily embedded into applications. Scilab also includes Xcos for dynamic systems modeling and simulation and ATOMS for managing external modules. Scilab has over 2,000 functions and is used widely in industry, academia, and education.

Cape2013 scilab-workshop-19Oct13

Naren P.R.

This document outlines a presentation given at the National Conference on Advances in Process Engineering (CAPE-2013) about Scilab, an open-source computing tool. The presentation covers the basics of Scilab including variables, matrices, functions, control statements, plotting, and file operations. It also describes 12 tutorials that demonstrate key Scilab capabilities like matrix calculations, solving equations of motion, finding polynomial roots, and using loops and conditionals. The goal is to help engineers learn Scilab syntax and programming skills to complement their education.

Anatomy of Data Source API : A deep dive into Spark Data source API

datamantra

Optimized declarative transformation First Eclipse QVTc results

Edward Willink

1) This document discusses the first results from Eclipse QVTc, which implements the declarative QVT Core (QVTc) specification for model transformations. 2) Optimization techniques like reducing multi-input mappings to single-input mappings with local loops, and analyzing dependencies between mappings to generate an efficient static schedule provided a 30x speedup over naive polling execution. 3) While more optimizations are still needed, particularly for QVT Relational (QVTr), this implementation demonstrates that declarative transformations can be efficiently executed by analyzing the graph of dependencies between mappings.

Cypher for Apache Spark

openCypher

Papyrus @ Eclipse Summit Europe 2010

CEA - Commissariat à l'énergie atomique et aux énergies alternatives

Scilab

Kripal Priyadarshi

This document provides an overview of the open source software Scilab for numerical computation in chemical engineering. It discusses how Scilab can be used to solve common engineering problems through examples involving fluid flow, heat transfer, distillation, and equipment design. The key features of Scilab are explained, including its interface, basic commands, matrix operations, and graphical capabilities. Comparisons are made between Scilab and the commercial software Matlab. Optimization techniques in Scilab are also covered.

Modeling & Simulation of CubeSat-based Missions'Concept of Operations

Obeo

Discover how Arcadia/Capella is used to model and simulate concept of operations scenarios for CubeSat-based missions. During this webinar, Danilo Pallamin de Almeida, who worked as a Space Systems Engineer for the NanosatC-BR2 mission at INPE, the Brazilian Institute for Space Research, will present how CubeSat-based missions have been modeled with Capella. The model describing an initial architecture mission and concept of operations (CONOPS) is used to generate a script that configures a satellite simulator with the corresponding mission parameters. You will see how it allows the INPE to: - run concept of operations scenarios simulations, - use the results for power/data-budget analyses and trade studies

Graphs in data structures are non-linear data structures made up of a finite ...

bhargavi804095

GraphQL & DGraph with Go

James Tan

This document discusses GraphQL and DGraph with GO. It begins by introducing GraphQL and some popular GraphQL implementations in GO like graphql-go. It then discusses DGraph, describing it as a distributed, high performance graph database written in GO. It provides examples of using the DGraph GO client to perform CRUD operations, querying for single and multiple objects, committing transactions, and more.

What's hot

Asymmetry in Large-Scale Graph Analysis, Explained

Vasia Kalavri

Designing a machine learning algorithm for Apache Spark

Marco Gaido

Spark Summit EU talk by Chris Pool and Jeroen Vlek

Spark Summit

Uber Business Metrics Generation and Management Through Apache Flink

Wenrui Meng

Inside Apache SystemML by Frederick Reiss

Spark Summit

Willump: Optimizing Feature Computation in ML Inference

Databricks

Beam summit 2019 - Unifying Batch and Stream Data Processing with Apache Calc...

Khai Tran

Anatomy of Spark SQL Catalyst - Part 2

datamantra

Advanced Neo4j Use Cases with the GraphAware Framework

Michal Bachman

Akka Streams

Diego Pacheco

Online learning with structured streaming, spark summit brussels 2016

Ram Sriharsha

Spark architecture

datamantra

Scilab-by-dr-gomez-june2014

Ir. Dr. R.Badlishah Ahmad

Cape2013 scilab-workshop-19Oct13

Naren P.R.

Anatomy of Data Source API : A deep dive into Spark Data source API

datamantra

Optimized declarative transformation First Eclipse QVTc results

Edward Willink

Cypher for Apache Spark

openCypher

Papyrus @ Eclipse Summit Europe 2010

CEA - Commissariat à l'énergie atomique et aux énergies alternatives

Scilab

Kripal Priyadarshi

Modeling & Simulation of CubeSat-based Missions'Concept of Operations

Obeo

What's hot (20)

Asymmetry in Large-Scale Graph Analysis, Explained

Designing a machine learning algorithm for Apache Spark

Spark Summit EU talk by Chris Pool and Jeroen Vlek

Uber Business Metrics Generation and Management Through Apache Flink

Inside Apache SystemML by Frederick Reiss

Willump: Optimizing Feature Computation in ML Inference

Beam summit 2019 - Unifying Batch and Stream Data Processing with Apache Calc...

Anatomy of Spark SQL Catalyst - Part 2

Advanced Neo4j Use Cases with the GraphAware Framework

Akka Streams

Online learning with structured streaming, spark summit brussels 2016

Spark architecture

Scilab-by-dr-gomez-june2014

Cape2013 scilab-workshop-19Oct13

Anatomy of Data Source API : A deep dive into Spark Data source API

Optimized declarative transformation First Eclipse QVTc results

Cypher for Apache Spark

Papyrus @ Eclipse Summit Europe 2010

Scilab

Modeling & Simulation of CubeSat-based Missions'Concept of Operations

Similar to Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)

Graphs in data structures are non-linear data structures made up of a finite ...

bhargavi804095

GraphQL & DGraph with Go

James Tan

Apache Spark: What? Why? When?

Massimo Schenone

Apache Spark is a cluster computing framework that allows for fast, easy, and general processing of large datasets. It extends the MapReduce model to support iterative algorithms and interactive queries. Spark uses Resilient Distributed Datasets (RDDs), which allow data to be distributed across a cluster and cached in memory for faster processing. RDDs support transformations like map, filter, and reduce and actions like count and collect. This functional programming approach allows Spark to efficiently handle iterative algorithms and interactive data analysis.

Apache spark - Architecture , Overview & libraries

Walaa Hamdy Assy

This document provides an overview of Apache Spark, an open-source unified analytics engine for large-scale data processing. It discusses Spark's core APIs including RDDs and transformations/actions. It also covers Spark SQL, Spark Streaming, MLlib, and GraphX. Spark provides a fast and general engine for big data processing, with explicit operations for streaming, SQL, machine learning, and graph processing. The document includes installation instructions and examples of using various Spark components.

Data Processing with Apache Spark Meetup Talk

Eren Avşaroğulları

GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)

Ankur Dave

Tuning and Debugging in Apache Spark

Databricks

Advanced Data Science on Spark-(Reza Zadeh, Stanford)

Spark Summit

The document provides an overview of Spark and its machine learning library MLlib. It discusses how Spark uses resilient distributed datasets (RDDs) to perform distributed computing tasks across clusters in a fault-tolerant manner. It summarizes the key capabilities of MLlib, including its support for common machine learning algorithms and how MLlib can be used together with other Spark components like Spark Streaming, GraphX, and SQL. The document also briefly discusses future directions for MLlib, such as tighter integration with DataFrames and new optimization methods.

SparkSQL: A Compiler from Queries to RDDs

Databricks

SparkSQL, a module for processing structured data in Spark, is one of the fastest SQL on Hadoop systems in the world. This talk will dive into the technical details of SparkSQL spanning the entire lifecycle of a query execution. The audience will walk away with a deeper understanding of how Spark analyzes, optimizes, plans and executes a user’s query. Speaker: Sameer Agarwal This talk was originally presented at Spark Summit East 2017.

No more struggles with Apache Spark workloads in production

Chetan Khatri

Paris Scala Group Event May 2019, No more struggles with Apache Spark workloads in production. Apache Spark Primary data structures (RDD, DataSet, Dataframe) Pragmatic explanation - executors, cores, containers, stage, job, a task in Spark. Parallel read from JDBC: Challenges and best practices. Bulk Load API vs JDBC write An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin Avoid unnecessary shuffle Alternative to spark default sort Why dropDuplicates() doesn’t result consistency, What is alternative Optimize Spark stage generation plan Predicate pushdown with partitioning and bucketing Why not to use Scala Concurrent ‘Future’ explicitly!

Haskell Accelerate

Steve Severance

This document provides an introduction and overview of GPU programming with Haskell using the Accelerate library. It discusses GPU architecture, when GPUs are suitable compared to CPUs, and demonstrates pricing financial options on a GPU using Accelerate. The document outlines the talk, introduces key GPU concepts, demonstrates basic Accelerate usage including arrays and computations, and provides references for further reading on GPU programming in Haskell.

Apache Spark: The Analytics Operating System

Adarsh Pannu

This presentation was delivered by Adarsh Pannu at IBM's Insight Conference in Nov 2015. For a recording, visit: https://www.youtube.com/watch?v=Tbm7HIlmwJQ The presentation provides an overview of Apache Spark, a general-purpose big data processing engine built around speed, ease of use and sophisticated analytics. It enumerates the benefits of incorporating Spark in the enterprise, including how it allows developers to write fully-featured distributed applications ranging from traditional data processing pipelines to complex machine learning. The presentation uses the Airline "On Time" data set to explore various components of the Spark stack.

Recent Developments In SparkR For Advanced Analytics

Databricks

Since its introduction in Spark 1.4, SparkR has received contributions from both the Spark community and the R community. In this talk, we will summarize recent community efforts on extending SparkR for scalable advanced analytics. We start with the computation of summary statistics on distributed datasets, including single-pass approximate algorithms. Then we demonstrate MLlib machine learning algorithms that have been ported to SparkR and compare them with existing solutions on R, e.g., generalized linear models, classification and clustering algorithms. We also show how to integrate existing R packages with SparkR to accelerate existing R workflows.

Apache Spark

Majid Hajibaba

This document provides an overview of Spark, including: - Spark's processing model involves chopping live data streams into batches and treating each batch as an RDD to apply transformations and actions. - Resilient Distributed Datasets (RDDs) are Spark's primary abstraction, representing an immutable distributed collection of objects that can be operated on in parallel. - An example word count program is presented to illustrate how to create and manipulate RDDs to count the frequency of words in a text file.

Unified Big Data Processing with Apache Spark (QCON 2014)

Databricks

This document discusses Apache Spark, a fast and general engine for big data processing. It describes how Spark generalizes the MapReduce model through its Resilient Distributed Datasets (RDDs) abstraction, which allows efficient sharing of data across parallel operations. This unified approach allows Spark to support multiple types of processing, like SQL queries, streaming, and machine learning, within a single framework. The document also outlines ongoing developments like Spark SQL and improved machine learning capabilities.

ScalaTo July 2019 - No more struggles with Apache Spark workloads in production

Chetan Khatri

Spark: The State of the Art Engine for Big Data Processing

Ramaninder Singh Jhajj

The document discusses Spark, an open-source cluster computing framework for large-scale data processing. It outlines Spark's advantages over MapReduce, including its ability to support iterative algorithms through in-memory caching. Spark provides a unified stack including Spark Core for distributed processing, Spark SQL for structured data, GraphX for graphs, MLlib for machine learning, and Spark Streaming for real-time data. Major companies that use Spark are cited.

Hadoop ecosystem

Ran Silberman

Tuning and Debugging in Apache Spark

Patrick Wendell

Video: https://www.youtube.com/watch?v=kkOG_aJ9KjQ This talk gives details about Spark internals and an explanation of the runtime behavior of a Spark application. It explains how high level user programs are compiled into physical execution plans in Spark. It then reviews common performance bottlenecks encountered by Spark users, along with tips for diagnosing performance problems in a production application.

Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...

Databricks

Similar to Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford) (20)

Graphs in data structures are non-linear data structures made up of a finite ...

GraphQL & DGraph with Go

Apache Spark: What? Why? When?

Apache spark - Architecture , Overview & libraries

Data Processing with Apache Spark Meetup Talk

GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)

Tuning and Debugging in Apache Spark

Advanced Data Science on Spark-(Reza Zadeh, Stanford)

SparkSQL: A Compiler from Queries to RDDs

No more struggles with Apache Spark workloads in production

Haskell Accelerate

Apache Spark: The Analytics Operating System

Recent Developments In SparkR For Advanced Analytics

Apache Spark

Unified Big Data Processing with Apache Spark (QCON 2014)

ScalaTo July 2019 - No more struggles with Apache Spark workloads in production

Spark: The State of the Art Engine for Big Data Processing

Hadoop ecosystem

Tuning and Debugging in Apache Spark

Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...

More from Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang

Spark Summit

In this session we will present a Configurable FPGA-Based Spark SQL Acceleration Architecture. It is target to leverage FPGA highly parallel computing capability to accelerate Spark SQL Query and for FPGA’s higher power efficiency than CPU we can lower the power consumption at the same time. The Architecture consists of SQL query decomposition algorithms, fine-grained FPGA based Engine Units which perform basic computation of sub string, arithmetic and logic operations. Using SQL query decomposition algorithm, we are able to decompose a complex SQL query into basic operations and according to their patterns each is fed into an Engine Unit. SQL Engine Units are highly configurable and can be chained together to perform complex Spark SQL queries, finally one SQL query is transformed into a Hardware Pipeline. We will present the performance benchmark results comparing the queries with FGPA-Based Spark SQL Acceleration Architecture on XEON E5 and FPGA to the ones with Spark SQL Query on XEON E5 with 10X ~ 100X improvement and we will demonstrate one SQL query workload from a real customer.

VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...

Spark Summit

In this talk, we’ll present techniques for visualizing large scale machine learning systems in Spark. These are techniques that are employed by Netflix to understand and refine the machine learning models behind Netflix’s famous recommender systems that are used to personalize the Netflix experience for their 99 millions members around the world. Essential to these techniques is Vegas, a new OSS Scala library that aims to be the “missing MatPlotLib” for Spark/Scala. We’ll talk about the design of Vegas and its usage in Scala notebooks to visualize Machine Learning Models.

Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu

Spark Summit

This presentation introduces how we design and implement a real-time processing platform using latest Spark Structured Streaming framework to intelligently transform the production lines in the manufacturing industry. In the traditional production line there are a variety of isolated structured, semi-structured and unstructured data, such as sensor data, machine screen output, log output, database records etc. There are two main data scenarios: 1) Picture and video data with low frequency but a large amount; 2) Continuous data with high frequency. They are not a large amount of data per unit. However the total amount of them is very large, such as vibration data used to detect the quality of the equipment. These data have the characteristics of streaming data: real-time, volatile, burst, disorder and infinity. Making effective real-time decisions to retrieve values from these data is critical to smart manufacturing. The latest Spark Structured Streaming framework greatly lowers the bar for building highly scalable and fault-tolerant streaming applications. Thanks to the Spark we are able to build a low-latency, high-throughput and reliable operation system involving data acquisition, transmission, analysis and storage. The actual user case proved that the system meets the needs of real-time decision-making. The system greatly enhance the production process of predictive fault repair and production line material tracking efficiency, and can reduce about half of the labor force for the production lines.

Improving Traffic Prediction Using Weather Data with Ramya Raghavendra

Spark Summit

As common sense would suggest, weather has a definite impact on traffic. But how much? And under what circumstances? Can we improve traffic (congestion) prediction given weather data? Predictive traffic is envisioned to significantly impact how driver’s plan their day by alerting users before they travel, find the best times to travel, and over time, learn from new IoT data such as road conditions, incidents, etc. This talk will cover the traffic prediction work conducted jointly by IBM and the traffic data provider. As a part of this work, we conducted a case study over five large metropolitans in the US, 2.58 billion traffic records and 262 million weather records, to quantify the boost in accuracy of traffic prediction using weather data. We will provide an overview of our lambda architecture with Apache Spark being used to build prediction models with weather and traffic data, and Spark Streaming used to score the model and provide real-time traffic predictions. This talk will also cover a suite of extensions to Spark to analyze geospatial and temporal patterns in traffic and weather data, as well as the suite of machine learning algorithms that were used with Spark framework. Initial results of this work were presented at the National Association of Broadcasters meeting in Las Vegas in April 2017, and there is work to scale the system to provide predictions in over a 100 cities. Audience will learn about our experience scaling using Spark in offline and streaming mode, building statistical and deep-learning pipelines with Spark, and techniques to work with geospatial and time-series data.

A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...

Spark Summit

Graph is on the rise and it’s time to start learning about scalable graph analytics! In this session we will go over two Spark-based Graph Analytics frameworks: Tinkerpop and GraphFrames. While both frameworks can express very similar traversals, they have different performance characteristics and APIs. In this Deep-Dive by example presentation, we will demonstrate some common traversals and explain how, at a Spark level, each traversal is actually computed under the hood! Learn both the fluent Gremlin API as well as the powerful GraphFrame Motif api as we show examples of both simultaneously. No need to be familiar with Graphs or Spark for this presentation as we’ll be explaining everything from the ground up!

No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...

Spark Summit

Building accurate machine learning models has been an art of data scientists, i.e., algorithm selection, hyper parameter tuning, feature selection and so on. Recently, challenges to breakthrough this “black-arts” have got started. In cooperation with our partner, NEC Laboratories America, we have developed a Spark-based automatic predictive modeling system. The system automatically searches the best algorithm, parameters and features without any manual work. In this talk, we will share how the automation system is designed to exploit attractive advantages of Spark. The evaluation with real open data demonstrates that our system can explore hundreds of predictive models and discovers the most accurate ones in minutes on a Ultra High Density Server, which employs 272 CPU cores, 2TB memory and 17TB SSD in 3U chassis. We will also share open challenges to learn such a massive amount of models on Spark, particularly from reliability and stability standpoints. This talk will cover the presentation already shown on Spark Summit SF’17 (#SFds5) but from more technical perspective.

Apache Spark and Tensorflow as a Service with Jim Dowling

Spark Summit

In Sweden, from the Rise ICE Data Center at www.hops.site, we are providing to reseachers both Spark-as-a-Service and, more recently, Tensorflow-as-a-Service as part of the Hops platform. In this talk, we examine the different ways in which Tensorflow can be included in Spark workflows, from batch to streaming to structured streaming applications. We will analyse the different frameworks for integrating Spark with Tensorflow, from Tensorframes to TensorflowOnSpark to Databrick’s Deep Learning Pipelines. We introduce the different programming models supported and highlight the importance of cluster support for managing different versions of python libraries on behalf of users. We will also present cluster management support for sharing GPUs, including Mesos and YARN (in Hops Hadoop). Finally, we will perform a live demonstration of training and inference for a TensorflowOnSpark application written on Jupyter that can read data from either HDFS or Kafka, transform the data in Spark, and train a deep neural network on Tensorflow. We will show how to debug the application using both Spark UI and Tensorboard, and how to examine logs and monitor training.

Apache Spark and Tensorflow as a Service with Jim Dowling

Spark Summit

MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...

Spark Summit

With the rapid growth of available datasets, it is imperative to have good tools for extracting insight from big data. The Spark ML library has excellent support for performing at-scale data processing and machine learning experiments, but more often than not, Data Scientists find themselves struggling with issues such as: low level data manipulation, lack of support for image processing, text analytics and deep learning, as well as the inability to use Spark alongside other popular machine learning libraries. To address these pain points, Microsoft recently released The Microsoft Machine Learning Library for Apache Spark (MMLSpark), an open-source machine learning library built on top of SparkML that seeks to simplify the data science process and integrate SparkML Pipelines with deep learning and computer vision libraries such as the Microsoft Cognitive Toolkit (CNTK) and OpenCV. With MMLSpark, Data Scientists can build models with 1/10th of the code through Pipeline objects that compose seamlessly with other parts of the SparkML ecosystem. In this session, we explore some of the main lessons learned from building MMLSpark. Join us if you would like to know how to extend Pipelines to ensure seamless integration with SparkML, how to auto-generate Python and R wrappers from Scala Transformers and Estimators, how to integrate and use previously non-distributed libraries in a distributed manner and how to efficiently deploy a Spark library across multiple platforms.

Next CERN Accelerator Logging Service with Jakub Wozniak

Spark Summit

The Next Accelerator Logging Service (NXCALS) is a new Big Data project at CERN aiming to replace the existing Oracle-based service. The main purpose of the system is to store and present Controls/Infrastructure related data gathered from thousands of devices in the whole accelerator complex. The data is used to operate the machines, improve their performance and conduct studies for new beam types or future experiments. During this talk, Jakub will speak about NXCALS requirements and design choices that lead to the selected architecture based on Hadoop and Spark. He will present the Ingestion API, the abstractions behind the Meta-data Service and the Spark-based Extraction API where simple changes to the schema handling greatly improved the overall usability of the system. The system itself is not CERN specific and can be of interest to other companies or institutes confronted with similar Big Data problems.

Powering a Startup with Apache Spark with Kevin Kim

Spark Summit

In Between (A mobile App for couples, downloaded 20M in Global), from daily batch for extracting metrics, analysis and dashboard. Spark is widely used by engineers and data analysts in Between, thanks to the performance and expendability of Spark, data operating has become extremely efficient. Entire team including Biz Dev, Global Operation, Designers are enjoying data results so Spark is empowering entire company for data driven operation and thinking. Kevin, Co-founder and Data Team leader of Between will be presenting how things are going in Between. Listeners will know how small and agile team is living with data (how we build organization, culture and technical base) after this presentation.

Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra

Spark Summit

Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...

Spark Summit

In many cases, Big Data becomes just another buzzword because of the lack of tools that can support both the technological requirements for developing and deploying of the projects and/or the fluency of communication between the different profiles of people involved in the projects. In this talk, we will present Moriarty, a set of tools for fast prototyping of Big Data applications that can be deployed in an Apache Spark environment. These tools support the creation of Big Data workflows using the already existing functional blocks or supporting the creation of new functional blocks. The created workflow can then be deployed in a Spark infrastructure and used through a REST API. For better understanding of Moriarty, the prototyping process and the way it hides the Spark environment to the Big Data users and developers, we will present it together with a couple of examples based on a Industry 4.0 success cases and other on a logistic success case.

How Nielsen Utilized Databricks for Large-Scale Research and Development with...

Spark Summit

Nielsen used Databricks to test new digital advertising rating methodologies on a large scale. Databricks allowed Nielsen to run analyses on thousands of advertising campaigns using both small panel data and large production data. This identified edge cases and performance gains faster than traditional methods. Using Databricks reduced the time required to test and deploy improved rating methodologies to benefit Nielsen's clients.

Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...

Spark Summit

Goal Based Data Production with Sim Simeonov

Spark Summit

Since the invention of SQL and relational databases, data production has been about specifying how data is transformed through queries. While Apache Spark can certainly be used as a general distributed query engine, the power and granularity of Spark’s APIs enables a revolutionary increase in data engineering productivity: goal-based data production. Goal-based data production concerns itself with specifying WHAT the desired result is, leaving the details of HOW the result is achieved to a smart data warehouse running on top of Spark. That not only substantially increases productivity, but also significantly expands the audience that can work directly with Spark: from developers and data scientists to technical business users. With specific data and architecture patterns spanning the range from ETL to machine learning data prep and with live demos, this session will demonstrate how Spark users can gain the benefits of goal-based data production.

Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...

Spark Summit

Have you imagined a simple machine learning solution able to prevent revenue leakage and monitor your distributed application? To answer this question, we offer a practical and a simple machine learning solution to create an intelligent monitoring application based on simple data analysis using Apache Spark MLlib. Our application uses linear regression models to make predictions and check if the platform is experiencing any operational problems that can impact in revenue losses. The application monitor distributed systems and provides notifications stating the problem detected, that way users can operate quickly to avoid serious problems which directly impact the company’s revenue and reduce the time for action. We will present an architecture for not only a monitoring system, but also an active actor for our outages recoveries. At the end of the presentation you will have access to our training program source code and you will be able to adapt and implement in your company. This solution already helped to prevent about US$3mi in losses last year.

Getting Ready to Use Redis with Apache Spark with Dvir Volk

Spark Summit

Getting Ready to use Redis with Apache Spark is a technical tutorial designed to address integrating Redis with an Apache Spark deployment to increase the performance of serving complex decision models. To set the context for the session, we start with a quick introduction to Redis and the capabilities Redis provides. We cover the basic data types provided by Redis and cover the module system. Using an ad serving use-case, we look at how Redis can improve the performance and reduce the cost of using complex ML-models in production. Attendees will be guided through the key steps of setting up and integrating Redis with Spark, including how to train a model using Spark then load and serve it using Redis, as well as how to work with the Spark Redis module. The capabilities of the Redis Machine Learning Module (redis-ml) will be discussed focusing primarily on decision trees and regression (linear and logistic) with code examples to demonstrate how to use these feature. At the end of the session, developers should feel confident building a prototype/proof-of-concept application using Redis and Spark. Attendees will understand how Redis complements Spark and how to use Redis to serve complex, ML-models with high performance.

Deduplication and Author-Disambiguation of Streaming Records via Supervised M...

Spark Summit

Here we present a general supervised framework for record deduplication and author-disambiguation via Spark. This work differentiates itself by – Application of Databricks and AWS makes this a scalable implementation. Compute resources are comparably lower than traditional legacy technology using big boxes 24/7. Scalability is crucial as Elsevier’s Scopus data, the biggest scientific abstract repository, covers roughly 250 million authorships from 70 million abstracts covering a few hundred years. – We create a fingerprint for each content by deep learning and/or word2vec algorithms to expedite pairwise similarity calculation. These encoders substantially reduce compute time while maintaining semantic similarity (unlike traditional TFIDF or predefined taxonomies). We will briefly discuss how to optimize word2vec training with high parallelization. Moreover, we show how these encoders can be used to derive a standard representation for all our entities namely such as documents, authors, users, journals, etc. This standard representation can simplify the recommendation problem into a pairwise similarity search and hence it can offer a basic recommender for cross-product applications where we may not have a dedicate recommender engine designed. – Traditional author-disambiguation or record deduplication algorithms are batch-processing with small to no training data. However, we have roughly 25 million authorships that are manually curated or corrected upon user feedback. Hence, it is crucial to maintain historical profiles and hence we have developed a machine learning implementation to deal with data streams and process them in mini batches or one document at a time. We will discuss how to measure the accuracy of such a system, how to tune it and how to process the raw data of pairwise similarity function into final clusters. Lessons learned from this talk can help all sort of companies where they want to integrate their data or deduplicate their user/customer/product databases.

MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...

Spark Summit

The use of large-scale machine learning and data mining methods is becoming ubiquitous in many application domains ranging from business intelligence and bioinformatics to self-driving cars. These methods heavily rely on matrix computations, and it is hence critical to make these computations scalable and efficient. These matrix computations are often complex and involve multiple steps that need to be optimized and sequenced properly for efficient execution. This work presents new efficient and scalable matrix processing and optimization techniques based on Spark. The proposed techniques estimate the sparsity of intermediate matrix-computation results and optimize communication costs. An evaluation plan generator for complex matrix computations is introduced as well as a distributed plan optimizer that exploits dynamic cost-based analysis and rule-based heuristics The result of a matrix operation will often serve as an input to another matrix operation, thus defining the matrix data dependencies within a matrix program. The matrix query plan generator produces query execution plans that minimize memory usage and communication overhead by partitioning the matrix based on the data dependencies in the execution plan. We implemented the proposed matrix techniques inside the Spark SQL, and optimize the matrix execution plan based on Spark SQL Catalyst. We conduct case studies on a series of ML models and matrix computations with special features on different datasets. These are PageRank, GNMF, BFGS, sparse matrix chain multiplications, and a biological data analysis. The open-source library ScaLAPACK and the array-based database SciDB are used for performance evaluation. Our experiments are performed on six real-world datasets are: social network data ( e.g., soc-pokec, cit-Patents, LiveJournal), Twitter2010, Netflix recommendation data, and 1000 Genomes Project sample. Experiments demonstrate that our proposed techniques achieve up to an order-of-magnitude performance.

More from Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang

VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...

Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu

Improving Traffic Prediction Using Weather Data with Ramya Raghavendra

A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...

No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...

Apache Spark and Tensorflow as a Service with Jim Dowling

MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...

Next CERN Accelerator Logging Service with Jakub Wozniak

Powering a Startup with Apache Spark with Kevin Kim

Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra

Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...

How Nielsen Utilized Databricks for Large-Scale Research and Development with...

Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...

Goal Based Data Production with Sim Simeonov

Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...

Getting Ready to Use Redis with Apache Spark with Dvir Volk

Deduplication and Author-Disambiguation of Streaming Records via Supervised M...

MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...

Recently uploaded

原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样

ihavuls

学校原件一模一样【微信：741003700 】《(unimelb毕业证书)墨尔本大学毕业证》【微信：741003700 】学位证，留信认证（真实可查，永久存档）原件一模一样纸张工艺/offer、雅思、外壳等材料/诚信可靠,可直接看成品样本，帮您解决无法毕业带来的各种难题！外壳，原版制作，诚信可靠，可直接看成品样本。行业标杆！精益求精，诚心合作，真诚制作！多年品质 ,按需精细制作，24小时接单,全套进口原装设备。十五年致力于帮助留学生解决难题，包您满意。本公司拥有海外各大学样板无数，能完美还原。 1:1完美还原海外各大学毕业材料上的工艺：水印，阴影底纹，钢印LOGO烫金烫银，LOGO烫金烫银复合重叠。文字图案浮雕、激光镭射、紫外荧光、温感、复印防伪等防伪工艺。材料咨询办理、认证咨询办理请加学历顾问Q/微741003700 【主营项目】一.毕业证【q微741003700】成绩单、使馆认证、教育部认证、雅思托福成绩单、学生卡等！二.真实使馆公证(即留学回国人员证明,不成功不收费) 三.真实教育部学历学位认证（教育部存档！教育部留服网站永久可查）四.办理各国各大学文凭(一对一专业服务,可全程监控跟踪进度) 如果您处于以下几种情况： ◇在校期间，因各种原因未能顺利毕业……拿不到官方毕业证【q/微741003700】 ◇面对父母的压力，希望尽快拿到； ◇不清楚认证流程以及材料该如何准备； ◇回国时间很长，忘记办理； ◇回国马上就要找工作，办给用人单位看； ◇企事业单位必须要求办理的 ◇需要报考公务员、购买免税车、落转户口 ◇申请留学生创业基金留信网认证的作用: 1:该专业认证可证明留学生真实身份 2:同时对留学生所学专业登记给予评定 3:国家专业人才认证中心颁发入库证书 4:这个认证书并且可以归档倒地方 5:凡事获得留信网入网的信息将会逐步更新到个人身份内，将在公安局网内查询个人身份证信息后，同步读取人才网入库信息 6:个人职称评审加20分 7:个人信誉贷款加10分 8:在国家人才网主办的国家网络招聘大会中纳入资料，供国家高端企业选择人才

"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"

sameer shah

A presentation that explain the Power BI Licensing

AlessioFois2

STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...

sameer shah

writing report business partner b1+ .pdf

VyNguyen709676

Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf

Fernanda Palhano

Intelligence supported media monitoring in veterinary medicine

AndrzejJarynowski

4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...

Social Samosa

一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理

xclpvhuk

原版制作【微信:41543339】【(Unimelb毕业证书)墨尔本大学毕业证】【微信:41543339】《成绩单、外壳、雅思、offer、留信学历认证（永久存档/真实可查）》采用学校原版纸张、特殊工艺完全按照原版一比一制作（包括：隐形水印，阴影底纹，钢印LOGO烫金烫银，LOGO烫金烫银复合重叠，文字图案浮雕，激光镭射，紫外荧光，温感，复印防伪）行业标杆！精益求精，诚心合作，真诚制作！多年品质 ,按需精细制作，24小时接单,全套进口原装设备，十五年致力于帮助留学生解决难题，业务范围有加拿大、英国、澳洲、韩国、美国、新加坡，新西兰等学历材料，包您满意。【我们承诺采用的是学校原版纸张（纸质、底色、纹路），我们拥有全套进口原装设备，特殊工艺都是采用不同进口机器一比一制作，仿真度基本可以达到100%，所有工艺效果都可提前给客户展示，不满意可以根据客户要求进行调整，直到满意为止！】【业务选择办理准则】一、工作未确定，回国需先给父母、亲戚朋友看下文凭的情况，办理一份就读学校的毕业证【微信41543339】文凭即可二、回国进私企、外企、自己做生意的情况，这些单位是不查询毕业证真伪的，而且国内没有渠道去查询国外文凭的真假，也不需要提供真实教育部认证。鉴于此，办理一份毕业证【微信41543339】即可三、进国企，银行，事业单位，考公务员等等，这些单位是必需要提供真实教育部认证的，办理教育部认证所需资料众多且烦琐，所有材料您都必须提供原件，我们凭借丰富的经验，快捷的绿色通道帮您快速整合材料，让您少走弯路。留信网认证的作用: 1:该专业认证可证明留学生真实身份 2:同时对留学生所学专业登记给予评定 3:国家专业人才认证中心颁发入库证书 4:这个认证书并且可以归档倒地方 5:凡事获得留信网入网的信息将会逐步更新到个人身份内，将在公安局网内查询个人身份证信息后，同步读取人才网入库信息 6:个人职称评审加20分 7:个人信誉贷款加10分 8:在国家人才网主办的国家网络招聘大会中纳入资料，供国家高端企业选择人才留信网服务项目： 1、留学生专业人才库服务（留信分析） 2、国（境）学习人员提供就业推荐信服务 3、留学人员区块链存储服务【关于价格问题（保证一手价格）】我们所定的价格是非常合理的，而且我们现在做得单子大多数都是代理和回头客户介绍的所以一般现在有新的单子我给客户的都是第一手的代理价格，因为我想坦诚对待大家不想跟大家在价格方面浪费时间对于老客户或者被老客户介绍过来的朋友，我们都会适当给一些优惠。选择实体注册公司办理，更放心，更安全！我们的承诺：客户在留信官方认证查询网站查询到认证通过结果后付款，不成功不收费！

一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理

y3i0qsdzb

原版办理【微信号:BYZS866】【巴斯大学毕业证(Bath毕业证书)】【微信号:BYZS866】《成绩单、外壳、雅思、offer、真实留信官方学历认证（永久存档/真实可查）》采用学校原版纸张、特殊工艺完全按照原版一比一制作（包括：隐形水印，阴影底纹，钢印LOGO烫金烫银，LOGO烫金烫银复合重叠，文字图案浮雕，激光镭射，紫外荧光，温感，复印防伪）行业标杆！精益求精，诚心合作，真诚制作！多年品质 ,按需精细制作，24小时接单,全套进口原装设备，十五年致力于帮助留学生解决难题，业务范围有加拿大、英国、澳洲、韩国、美国、新加坡，新西兰等学历材料，包您满意。【关于学历材料质量】我们承诺采用的是学校原版纸张（原版纸质、底色、纹路、）我们工厂拥有全套进口原装设备，特殊工艺都是采用不同机器制作，仿真度基本可以达到100%，所有成品以及工艺效果都可提前给客户展示，不满意可以根据客户要求进行调整，直到满意为止！【业务选择办理准则】一、工作未确定，回国需先给父母、亲戚朋友看下文凭的情况，办理一份就读学校的毕业证【微信号BYZS866】文凭即可二、回国进私企、外企、自己做生意的情况，这些单位是不查询毕业证真伪的，而且国内没有渠道去查询国外文凭的真假，也不需要提供真实教育部认证。鉴于此，办理一份毕业证【微信号BYZS866】即可三、进国企，银行，事业单位，考公务员等等，这些单位是必需要提供真实教育部认证的，办理教育部认证所需资料众多且烦琐，所有材料您都必须提供原件，我们凭借丰富的经验，快捷的绿色通道帮您快速整合材料，让您少走弯路。留信网认证的作用: 1:该专业认证可证明留学生真实身份 2:同时对留学生所学专业登记给予评定 3:国家专业人才认证中心颁发入库证书 4:这个认证书并且可以归档倒地方 5:凡事获得留信网入网的信息将会逐步更新到个人身份内，将在公安局网内查询个人身份证信息后，同步读取人才网入库信息 6:个人职称评审加20分 7:个人信誉贷款加10分 8:在国家人才网主办的国家网络招聘大会中纳入资料，供国家高端企业选择人才留信网服务项目： 1、留学生专业人才库服务（留信分析） 2、国（境）学习人员提供就业推荐信服务 3、留学人员区块链存储服务【关于价格问题（保证一手价格）】我们所定的价格是非常合理的，而且我们现在做得单子大多数都是代理和回头客户介绍的所以一般现在有新的单子我给客户的都是第一手的代理价格，因为我想坦诚对待大家不想跟大家在价格方面浪费时间对于老客户或者被老客户介绍过来的朋友，我们都会适当给一些优惠。选择实体注册公司办理，更放心，更安全！我们的承诺：客户在留信官方认证查询网站查询到认证通过结果后付款，不成功不收费！

The Ipsos - AI - Monitor 2024 Report.pdf

Social Samosa

Open Source Contributions to Postgres: The Basics POSETTE 2024

ElizabethGarrettChri

Postgres is the most advanced open-source database in the world and it's supported by a community, not a single company. So how does this work? How does code actually get into Postgres? I recently had a patch submitted and committed and I want to share what I learned in that process. I’ll give you an overview of Postgres versions and how the underlying project codebase functions. I’ll also show you the process for submitting a patch and getting that tested and committed.

原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理

a9qfiubqu

原版制作【微信:41543339】【弗林德斯大学毕业证(Flinders毕业证书)】【微信:41543339】《成绩单、外壳、雅思、offer、真实留信官方学历认证（永久存档/真实可查）》采用学校原版纸张、特殊工艺完全按照原版一比一制作（包括：隐形水印，阴影底纹，钢印LOGO烫金烫银，LOGO烫金烫银复合重叠，文字图案浮雕，激光镭射，紫外荧光，温感，复印防伪）行业标杆！精益求精，诚心合作，真诚制作！多年品质 ,按需精细制作，24小时接单,全套进口原装设备，十五年致力于帮助留学生解决难题，业务范围有加拿大、英国、澳洲、韩国、美国、新加坡，新西兰等学历材料，包您满意。【我们承诺采用的是学校原版纸张（纸质、底色、纹路）我们拥有全套进口原装设备，特殊工艺都是采用不同机器制作，仿真度基本可以达到100%，所有工艺效果都可提前给客户展示，不满意可以根据客户要求进行调整，直到满意为止！】【业务选择办理准则】一、工作未确定，回国需先给父母、亲戚朋友看下文凭的情况，办理一份就读学校的毕业证【微信41543339】文凭即可二、回国进私企、外企、自己做生意的情况，这些单位是不查询毕业证真伪的，而且国内没有渠道去查询国外文凭的真假，也不需要提供真实教育部认证。鉴于此，办理一份毕业证【微信41543339】即可三、进国企，银行，事业单位，考公务员等等，这些单位是必需要提供真实教育部认证的，办理教育部认证所需资料众多且烦琐，所有材料您都必须提供原件，我们凭借丰富的经验，快捷的绿色通道帮您快速整合材料，让您少走弯路。留信网认证的作用: 1:该专业认证可证明留学生真实身份 2:同时对留学生所学专业登记给予评定 3:国家专业人才认证中心颁发入库证书 4:这个认证书并且可以归档倒地方 5:凡事获得留信网入网的信息将会逐步更新到个人身份内，将在公安局网内查询个人身份证信息后，同步读取人才网入库信息 6:个人职称评审加20分 7:个人信誉贷款加10分 8:在国家人才网主办的国家网络招聘大会中纳入资料，供国家高端企业选择人才留信网服务项目： 1、留学生专业人才库服务（留信分析） 2、国（境）学习人员提供就业推荐信服务 3、留学人员区块链存储服务【关于价格问题（保证一手价格）】我们所定的价格是非常合理的，而且我们现在做得单子大多数都是代理和回头客户介绍的所以一般现在有新的单子我给客户的都是第一手的代理价格，因为我想坦诚对待大家不想跟大家在价格方面浪费时间对于老客户或者被老客户介绍过来的朋友，我们都会适当给一些优惠。选择实体注册公司办理，更放心，更安全！我们的承诺：客户在留信官方认证查询网站查询到认证通过结果后付款，不成功不收费！

End-to-end pipeline agility - Berlin Buzzwords 2024

Lars Albertsson

We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines. A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long time does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more. A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream. Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.

一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理

bopyb

毕业原版【微信:176555708】【(GWU,GW毕业证书)乔治·华盛顿大学毕业证】【微信:176555708】成绩单、外壳、offer、留信学历认证（永久存档真实可查）采用学校原版纸张、特殊工艺完全按照原版一比一制作（包括：隐形水印，阴影底纹，钢印LOGO烫金烫银，LOGO烫金烫银复合重叠，文字图案浮雕，激光镭射，紫外荧光，温感，复印防伪）行业标杆！精益求精，诚心合作，真诚制作！多年品质 ,按需精细制作，24小时接单,全套进口原装设备，十五年致力于帮助留学生解决难题，业务范围有加拿大、英国、澳洲、韩国、美国、新加坡，新西兰等学历材料，包您满意。【我们承诺采用的是学校原版纸张（纸质、底色、纹路），我们拥有全套进口原装设备，特殊工艺都是采用不同机器制作，仿真度基本可以达到100%，所有工艺效果都可提前给客户展示，不满意可以根据客户要求进行调整，直到满意为止！】【业务选择办理准则】一、工作未确定，回国需先给父母、亲戚朋友看下文凭的情况，办理一份就读学校的毕业证【微信176555708】文凭即可二、回国进私企、外企、自己做生意的情况，这些单位是不查询毕业证真伪的，而且国内没有渠道去查询国外文凭的真假，也不需要提供真实教育部认证。鉴于此，办理一份毕业证【微信176555708】即可三、进国企，银行，事业单位，考公务员等等，这些单位是必需要提供真实教育部认证的，办理教育部认证所需资料众多且烦琐，所有材料您都必须提供原件，我们凭借丰富的经验，快捷的绿色通道帮您快速整合材料，让您少走弯路。留信网认证的作用: 1:该专业认证可证明留学生真实身份 2:同时对留学生所学专业登记给予评定 3:国家专业人才认证中心颁发入库证书 4:这个认证书并且可以归档倒地方 5:凡事获得留信网入网的信息将会逐步更新到个人身份内，将在公安局网内查询个人身份证信息后，同步读取人才网入库信息 6:个人职称评审加20分 7:个人信誉贷款加10分 8:在国家人才网主办的国家网络招聘大会中纳入资料，供国家高端企业选择人才留信网服务项目： 1、留学生专业人才库服务（留信分析） 2、国（境）学习人员提供就业推荐信服务 3、留学人员区块链存储服务 → 【关于价格问题（保证一手价格）】我们所定的价格是非常合理的，而且我们现在做得单子大多数都是代理和回头客户介绍的所以一般现在有新的单子我给客户的都是第一手的代理价格，因为我想坦诚对待大家不想跟大家在价格方面浪费时间对于老客户或者被老客户介绍过来的朋友，我们都会适当给一些优惠。选择实体注册公司办理，更放心，更安全！我们的承诺：客户在留信官方认证查询网站查询到认证通过结果后付款，不成功不收费！

Learn SQL from basic queries to Advance queries

manishkhaire30

Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively. Key Highlights: Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation. Advanced Queries: Learn to craft complex queries to uncover deep insights from your data. Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets. Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios. Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making. Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data! #DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics

Analysis insight about a Flyball dog competition team's performance

roli9797

Population Growth in Bataan: The effects of population growth around rural pl...

Bill641377

DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx

SaffaIbrahim1

University of New South Wales degree offer diploma Transcript

soxrziqu

Recently uploaded (20)

原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样

"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"

A presentation that explain the Power BI Licensing

STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...

writing report business partner b1+ .pdf

Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf

Intelligence supported media monitoring in veterinary medicine

4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...

一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理

一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理

The Ipsos - AI - Monitor 2024 Report.pdf

Open Source Contributions to Postgres: The Basics POSETTE 2024

原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理

End-to-end pipeline agility - Berlin Buzzwords 2024

一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理

Learn SQL from basic queries to Advance queries

Analysis insight about a Flyball dog competition team's performance

Population Growth in Bataan: The effects of population growth around rural pl...

DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx

University of New South Wales degree offer diploma Transcript

Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)

1. Reza Zadeh Matrix and Graph Computations @Reza_Zadeh | http://reza-zadeh.com

2. Overview Graph Computations and Pregel" Introduction to Matrix Computations

3. Graph Computations and Pregel

4. Data Flow Models Restrict the programming interface so that the system can do more automatically Express jobs as graphs of high-level operators » System picks how to split each operator into tasks and where to run each task » Run parts twice fault recovery New example: Pregel (parallel graph google)

5. Pregel oogle Expose specialized APIs to simplify graph programming. “Think like a vertex”

6. Graph-Parallel Pattern Model / Alg. State Computation depends only on the neighbors

7. Pregel Data Flow Input graph Vertex state 1 Messages 1 Superstep 1 Vertex state 2 Messages 2 Superstep 2 . . . Group by vertex ID Group by vertex ID

8. Simple Pregel in Spark Separate RDDs for immutable graph state and for vertex states and messages at each iteration Use groupByKey to perform each step Cache the resulting vertex and message RDDs Optimization: co-‐partition input graph and vertex state RDDs to reduce communication

9. Update ranks in parallel Iterate until convergence Rank of user i Weighted sum of neighbors’ ranks Example: PageRank R[i] = 0.15 + X j2Nbrs(i) wjiR[j]

10. PageRank in Pregel Input graph Vertex ranks 1 Contributions 1 Superstep 1 (add contribs) Vertex ranks 2 Contributions 2 Superstep 2 (add contribs) . . . Group & add by vertex Group & add by vertex

11. GraphX Provides Pregel message-‐passing and other operators on top of RDDs

12. GraphX: Properties

13. GraphX: Triplets The triplets operator joins vertices and edges:

14. F E Map Reduce Triplets Map-Reduce for each vertex D B A C mapF( ) A B mapF( ) A C A1 A2 reduceF( , ) A1 A2 A

15. F E Example: Oldest Follower D B A C What is the age of the oldest follower for each user? val oldestFollowerAge = graph .mrTriplets( e=> (e.dst.id, e.src.age),//Map (a,b)=> max(a, b) //Reduce ) .vertices 23 42 30 19 75 16

16. Summary of Operators All operations: https://spark.apache.org/docs/latest/graphx-programming-guide.html#summary-list-of-operators Pregel API: https://spark.apache.org/docs/latest/graphx-programming-guide.html#pregel-api

17. The GraphX Stack" (Lines of Code) GraphX (3575) Spark Pregel (28) + GraphLab (50) PageRank (5) Connected Comp. (10) Shortest Path (10) ALS (40) LDA (120) K-core (51) Triangle Count (45) SVD (40)

18. Optimizations Overloaded vertices have their work distributed

19. Optimizations

20. More examples In your HW: Single-Source-Shortest Paths using Pregel

21. Distributing Matrix Computations

22. Distributing Matrices How to distribute a matrix across machines? »  By Entries (CoordinateMatrix) »  By Rows (RowMatrix) »  By Blocks (BlockMatrix) All of Linear Algebra to be rebuilt using these partitioning schemes As of version 1.3

23. Distributing Matrices Even the simplest operations require thinking about communication e.g. multiplication How many different matrix multiplies needed? »  At least one per pair of {Coordinate, Row, Block, LocalDense, LocalSparse} = 10 »  More because multiplies not commutative

24. Distributed Singular Value Decomposition

25. Singular Value Decomposition

26. Singular Value Decomposition Two cases »  Tall and Skinny »  Short and Fat (not really) »  Roughly Square SVD method on RowMatrix takes care of which one to call.

27. Tall and Skinny SVD

28. Tall and Skinny SVD Gets us V and the singular values Gets us U by one matrix multiplication

29. Square SVD ARPACK: Very mature Fortran77 package for computing eigenvalue decompositions" JNI interface available via netlib-java" Distributed using Spark – how?

30. Square SVD via ARPACK Only interfaces with distributed matrix via matrix-vector multiplies The result of matrix-vector multiply is small. The multiplication can be distributed.

31. Square SVD With 68 executors and 8GB memory in each, looking for the top 5 singular vectors

Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)

Similar to Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford) (20)

More from Spark Summit

More from Spark Summit (20)

Recently uploaded

Recently uploaded (20)

Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)