LARGE-SCALE ANALYTICS WITH 
APACHE SPARK 
THOMSON REUTERS R&D 
TWIN CITIES HADOOP USER GROUP 
FRANK SCHILDER 
SEPTEMBER 22, 2014
THOMSON REUTERS 
• The Thomson Reuters Corporation 
– 50,000+ employees 
– 2,000+ journalists at news desks world wide 
– Offices in more than 100 countries 
– $12 billion revenue/year 
• Products: intelligent information for professionals and enterprises 
– Legal: WestlawNext legal search engine 
– Financial: Eikon financial platform; Datastream real-time share price data 
– News: REUTERS news 
– Science: Endnote, ISI journal impact factor, Derwent World Patent Index 
– Tax & Accounting: OneSource tax information 
• Corporate R&D 
– Around 40 researchers and developers (NLP, IR, ML) 
– Three R&D sites in the US and one in the UK: Eagan, MN; Rochester, NY; New York City; and London 
– We are hiring… email me at frank.schilder@thomsonreuters.com
OVERVIEW 
• Speed 
– Data locality, scalability, fault tolerance 
• Ease of Use 
– Scala, interactive Shell 
• Generality 
– SparkSQL, MLLib 
• Comparing ML frameworks 
– Vowpal Wabbit (VW) 
– Sparkling Water 
• The Future
WHAT IS SPARK? 
Apache Spark is a fast and general engine 
for large-scale data processing. 
• Speed: runs iterative MapReduce-style jobs 
faster thanks to in-memory computation via 
Resilient Distributed Datasets (RDDs) 
• Ease of use: enables interactive data analysis 
in Scala, Python, or Java; interactive Shell 
• Generality: offers libraries for SQL, Streaming 
and large-scale analytics (graph processing 
and machine learning) 
• Integrated with Hadoop: runs on Hadoop 2’s 
YARN cluster
ACKNOWLEDGMENTS 
• Matei Zaharia and the AMPLab and Databricks teams for 
fantastic learning material and tutorials on Spark 
• Hiroko Bretz, Thomas Vacek, Dezhao Song, Terry 
Heinze for Spark and Scala support and running 
experiments 
• Adam Glaser for his time as a TSAP intern 
• Mahadev Wudali and Mike Edwards for letting us 
play in the “sandbox” (cluster)
SPEED
PRIMARY GOALS OF SPARK 
• Extend the MapReduce model to better support 
two common classes of analytics apps: 
– Iterative algorithms (machine learning, graphs) 
– Interactive data mining (R, Python) 
• Enhance programmability: 
– Integrate into Scala programming language 
– Allow interactive use from Scala interpreter 
– Make Spark easily accessible from other 
languages (Python, Java)
MOTIVATION 
• Acyclic data flow is inefficient for 
applications that repeatedly reuse a working 
set of data: 
– Iterative algorithms (machine learning, graphs) 
– Interactive data mining tools (R, Python) 
• With current frameworks, apps reload data 
from stable storage on each query
HADOOP MAPREDUCE VS SPARK
SOLUTION: Resilient 
Distributed Datasets (RDDs) 
• Allow apps to keep working sets in memory for 
efficient reuse 
• Retain the attractive properties of MapReduce 
– Fault tolerance, data locality, scalability 
• Support a wide range of applications
PROGRAMMING MODEL 
Resilient distributed datasets (RDDs) 
– Immutable, partitioned collections of objects 
– Created through parallel transformations (map, filter, 
groupBy, join, …) on data in stable storage 
– Functions follow the same patterns as Scala operations 
on lists 
– Can be cached for efficient reuse 
80+ Actions on RDDs 
– count, reduce, save, take, first, …
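A minimal sketch of this model in the Scala shell (assuming sc is the SparkContext; the data and numbers are arbitrary): transformations define new RDDs lazily, and nothing runs until an action is called.

val nums = sc.parallelize(1 to 1000)            // RDD from a local collection
val squares = nums.map(x => x * x)              // transformation: nothing runs yet
val evens = squares.filter(_ % 2 == 0)          // another lazy transformation
evens.cache()                                   // mark for in-memory reuse
println(evens.count())                          // action: the job actually executes
println(evens.take(5).mkString(", "))           // second action reuses the cached data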
EXAMPLE: LOG MINING 
Load error messages from a log into memory, then 
interactively search for various patterns 
val lines = spark.textFile("hdfs://...")              // base RDD
val errors = lines.filter(_.startsWith("ERROR"))      // transformed RDD
val messages = errors.map(_.split('\t')(2))
val cachedMsgs = messages.cache()
[Diagram: the driver ships tasks to workers; each worker reads its HDFS block (Block 1-3) and keeps results in an in-memory cache (Cache 1-3)]
cachedMsgs.filter(_.contains("timeout")).count
cachedMsgs.filter(_.contains("license")).count
. . .
Result: scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data) 
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
BEHAVIOR WITH NOT ENOUGH RAM 
Iteration time (s) as a function of the % of the working set held in memory: 
Cache disabled: 68.8 | 25%: 58.1 | 50%: 40.7 | 75%: 29.7 | Fully cached: 11.5
RDD Fault Tolerance 
RDDs maintain lineage information that can be used 
to reconstruct lost partitions 
Ex: 
messages = textFile(...).filter(_.startsWith("ERROR"))
.map(_.split('\t')(2))
Lineage: HDFS File → filter(func = _.contains(...)) → Filtered RDD → map(func = _.split(...)) → Mapped RDD
Fault Recovery Results 
Iteration time (s) over 10 iterations, with a failure injected in the 6th iteration: 
119, 57, 56, 58, 58, 81, 57, 59, 57, 59 
The failed iteration takes 81 s while lost partitions are rebuilt from lineage; later iterations return to the no-failure level (~57-59 s).
EASE OF USE
INTERACTIVE SHELL 
• Data analysis can be done in the interactive shell. 
– Start from local machine or cluster 
– Use multiple local cores with local[n] 
– Spark context is already set up for you: SparkContext sc 
• Load data from anywhere (local, HDFS, 
Cassandra, Amazon S3 etc.): 
• Start analyzing your data, as in the sketch below:
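A minimal sketch of such a shell session (the file path and search term are placeholders; sc is the SparkContext the shell provides):

// started with e.g.:  spark-shell --master local[4]
val textFile = sc.textFile("data/sample.txt")          // local data file, HDFS, S3, ...
textFile.count()                                       // action: processing starts here
textFile.filter(_.contains("Spark")).take(5)           // first five matching lines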
ANALYZE YOUR DATA 
• Word count in one line: 
• List the word counts: 
• Broadcast variables (e.g. a dictionary or stop-word list), 
because local variables need to be distributed to the workers (see the sketch below):
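A rough sketch of these steps, assuming textFile is the RDD of lines loaded on the previous slide and the stop-word set is a made-up example:

// word count in (roughly) one line
val counts = textFile.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
// list the word counts
counts.take(20).foreach(println)
// broadcast a read-only stop-word list to the workers once,
// instead of shipping it with every task
val stopWords = sc.broadcast(Set("the", "a", "of", "and"))
val filtered = counts.filter { case (word, _) => !stopWords.value.contains(word) }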
RUN A SPARK SCRIPT
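A minimal sketch of a self-contained Spark application launched with spark-submit (class name, input path, jar name, and master setting are hypothetical placeholders):

import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile("hdfs:///data/app.log").cache()    // load and cache the input
    val numErrors = logData.filter(_.contains("ERROR")).count()  // action triggers the job
    println("Lines with ERROR: " + numErrors)
    sc.stop()
  }
}

// packaged with sbt and submitted with, for example:
//   spark-submit --class SimpleApp --master yarn-cluster simple-app_2.10-1.0.jar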
PYTHON SHELL & IPYTHON 
• The interactive shell can also be started as Python 
shell called pySpark: 
• Start analyzing your data in python now: 
• Since it’s Python, you may want to use iPython 
– (a command shell for interactive programming in your 
browser):
IPYTHON AND SPARK 
• The iPython notebook environment and pySpark: 
– Document data analysis results 
– Carry out machine learning experiments 
– Visualize results with matplotlib or other visualization 
libraries 
– Combine with NLP libraries such as NLTK 
• PySpark does not offer the full functionality of 
Spark Shell in Scala (yet) 
• Some bugs (e.g. problems with unicode)
PROJECTS AT R&D USING SPARK 
• Entity linking 
– Alternative name extraction from 
Wikipedia, Freebase, free text, and ClueWeb12, a 
web collection several TB in size (planned) 
• Large-scale text data analysis: 
– creating fingerprints for entities/events 
– Temporal slot filling: Assigning a begin and end time 
stamp to a slot filler (e.g. A is employee of company B 
from BEGIN to END) 
– Large-Scale text classification of Reuters News Archive 
articles (10 years) 
• Language model computation used for search 
query analysis
SPARK MODULES 
• Spark Streaming: 
– Processing real-time data streams (see the sketch after this list) 
• Spark SQL: 
– Support for structured data (JSON, Parquet) and 
relational queries (SQL) 
• MLlib: 
– Machine learning library 
• GraphX: 
– New graph processing API
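As a stand-in illustration for the streaming module listed above, a minimal word-count sketch over a socket stream (host, port, and batch interval are placeholders; sc is an existing SparkContext):

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._   // pair DStream operations

val ssc = new StreamingContext(sc, Seconds(10))        // 10-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)    // text lines arriving on a socket
val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
wordCounts.print()                                     // print the counts of each batch
ssc.start()
ssc.awaitTermination()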
SPARKSQL
SPARK SQL 
• Relational queries expressed in 
– SQL 
– HiveQL 
– Scala Domain specific language (DSL) 
• New type of RDD: SchemaRDD: 
– RDD composed of Row objects 
– Schema defined explicitly or inferred from a Parquet file, a JSON 
data set, or data stored in Hive 
• SPARK SQL is in alpha: API may change in the 
future!
DEFINING A SCHEMA
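A minimal sketch of defining a schema with the Spark 1.1-era API (file path, table name, and fields are made-up examples):

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)              // the schema, as a case class

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD                      // implicit RDD -> SchemaRDD conversion

val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

people.registerTempTable("people")                     // expose the SchemaRDD to SQL
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)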
MLLIB
MLLIB 
• A machine learning module that comes with Spark 
• Shipped since Spark 0.8.0 
• Provides various machine learning algorithms for 
classification and clustering 
• Sparse vector representation since 1.0.0 
• New features in the recently released version 1.1.0: 
– A standard statistics library (e.g. correlation, 
hypothesis testing, sampling) 
– More algorithms ported to Java and Python 
– More feature engineering: TF-IDF, Singular Value 
Decomposition (SVD); see the TF-IDF sketch below
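A minimal sketch of the TF-IDF feature engineering mentioned above, using MLlib's hashing trick (the input path is a placeholder; API as of MLlib 1.1):

import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

val documents: RDD[Seq[String]] = sc.textFile("docs.txt").map(_.split(" ").toSeq)

val hashingTF = new HashingTF()                 // hashing trick, default 2^20 feature dimensions
val tf: RDD[Vector] = hashingTF.transform(documents)

tf.cache()                                      // IDF makes two passes over the data
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)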
MLLIB 
• Provides various machine learning algorithms: 
– Classification: 
• Logistic regression, support vector machine (SVM), naïve 
Bayes, decision trees 
– Regression: 
• Linear regression, regression trees 
– Collaborative Filtering: 
• Alternating least squares (ALS) 
– Clustering: 
• K-means 
– Decomposition 
• Singular value decomposition (SVD), Principal component 
analysis (PCA)
OTHER ML FRAMEWORKS 
• Mahout 
• LIBLINEAR 
• MATLAB 
• Scikit-learn 
• GraphLab 
• R 
• Weka 
• Vowpal Wabbit 
• BigML
LARGE-SCALE ML INFRASTRUCTURE 
• More data implies bigger training sets and richer 
feature sets. 
• More data with a simple ML algorithm often beats 
less data with a complicated ML algorithm 
• Large-scale ML requires big data infrastructure: 
– Faster processing: Hadoop, Spark 
– Feature engineering: Principal Component Analysis, 
Hashing trick, Word2Vec
PREDICTIVE ANALYTICS WITH MLLIB
PREDICTIVE ANALYTICS WITH MLLIB 
http://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html
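In the spirit of the predictive-analytics example above, a minimal binary-classification sketch with MLlib's 1.x API (the LIBSVM input path is a placeholder):

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")   // RDD[LabeledPoint]
val Array(training, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

val model = LogisticRegressionWithSGD.train(training.cache(), numIterations = 100)

val predictionAndLabel = test.map(p => (model.predict(p.features), p.label))
val accuracy = predictionAndLabel.filter { case (p, l) => p == l }.count.toDouble / test.count()
println("Test accuracy: " + accuracy)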
VW AND MLLIB COMPARISON 
• We compared Vowpal Wabbit and MLlib in 
December 2013 (work with Tom Vacek) 
• Vowpal Wabbit (VW) is a large-scale ML tool 
developed by John Langford (Microsoft) 
– Task: binary text classification on Reuters 
articles; compared along: 
– Ease of implementation 
– Feature Extraction 
– Parameter tuning 
– Speed 
– Accessibility of programming languages
VW VS. MLLIB 
• Ease of implementation 
– VW: user tool designed for ML, not programming language 
– MLlib: programming language, some support now (e.g. regularization) 
• Feature Extraction 
– VW: specific capabilities for bi-grams, prefix etc. 
– MLlib: no limit in terms of creating features 
• Parameter tuning 
– VW: no parameter search capability, but multiple parameters can be hand-tuned 
– MLlib: offers cross-validation 
• Speed 
– VW: highly optimized, very fast even on a single machine with multiple cores 
– MLlib: fast with lots of machines 
• Accessibility of programming languages 
– VW: written in C++, a few wrappers (e.g. Python) 
– MLlib: Scala, Python, Java 
• Conclusion end of 2013: VW had a slight advantage, but MLlib has caught up in at 
least some of the areas (e.g. sparse feature representation)
FINDINGS SO FAR 
• Large-scale extraction is a great fit for Spark when 
working with large data sets (> 1GB) 
• Ease of use makes Spark an ideal framework for 
rapid prototyping. 
• MLlib is a fast growing ML library, but “under 
development” 
• Vowpal Wabbit has been shown to crunch even 
large data sets with ease. 
[Chart: 0/1 loss and training time (0-250 s scale) for VW, LIBLINEAR, and Spark local[4]]
OTHER ML FRAMEWORKS 
• Internship by Adam Glaser compared various ML 
frameworks with 5 standard data sets (NIPS) 
– Mass-spectrometric data (cancer), handwritten digit 
detection, Reuters news classification, synthetic data sets 
– Data sets were not very big, but had up to 1,000,000 
features 
• Evaluated accuracy of the generated models and 
speed for training time 
• H2O, GraphLab and Microsoft Azure showed strong 
performance in terms of accuracy and training 
time.
ACCURACY
SPEED
WHAT IS NEXT? 
• 0xdata plans to release Sparkling Water in October 
2014 
• Microsoft Azure also offers a strong platform with 
multiple ML algorithms and an intuitive user interface 
• GraphLab has GraphLab Canvas™ for visualizing your 
data and plans to incorporate more ML algorithms.
CAN’T DECIDE?
CONCLUSIONS
CONCLUSIONS 
• Apache Spark is the most active project in the Hadoop 
ecosystem 
• Spark offers speed and ease of use because of 
– RDDs 
– Interactive shell and 
– Easy integration of Scala, Java, Python scripts 
• Integrated in Spark are modules for 
– Easy data access via SparkSQL 
– Large-scale analytics via MLlib 
• Other ML frameworks enable analytics as well 
• Evaluate which framework is the best fit for your data 
problem
THE FUTURE? 
• Apache Spark will be a unified platform to run 
various workloads: 
– Batch 
– Streaming 
– Interactive 
• And connect with different runtime systems 
– Hadoop 
– Cassandra 
– Mesos 
– Cloud 
– …
THE FUTURE? 
• Spark will extend its offering of large-scale 
algorithms for doing complex analytics: 
– Graph processing 
– Classification 
– Clustering 
– … 
• Other frameworks will continue to offer similar 
capabilities. 
• If you can’t beat them, join them.
http://labs.thomsonreuters.com/about-rd-careers/ 
FRANK.SCHILDER@THOMSONREUTERS.COM
EXTRA SLIDES
Example: Logistic Regression 
Goal: find best line separating two sets of points 
[Figure: two classes of points (+ and –) with the target separating line and a random initial line]
Example: Logistic Regression 
// Assumes: spark is the SparkContext, readPoint parses a line into a point with
// fields x (feature vector) and y (label in {-1, +1}), D is the number of
// dimensions, and ITERATIONS is the number of gradient steps.
val data = spark.textFile(...).map(readPoint).cache()   // load once, keep in memory

var w = Vector.random(D)                                // random initial separating plane

for (i <- 1 to ITERATIONS) {
  // one full-batch gradient step, computed in parallel across the cluster
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient                                         // only w changes between iterations
}

println("Final w: " + w)
Logistic Regression Performance 
Running time (s) vs. number of iterations (1, 5, 10, 20, 30): Hadoop needs about 127 s per iteration; Spark needs 174 s for the first iteration and about 6 s for each further iteration.
Spark Scheduler 
• Dryad-like DAGs 
• Pipelines functions within a stage 
• Cache-aware work reuse & locality 
• Partitioning-aware to avoid shuffles 
[Diagram: example job DAG over RDDs A-G with map, union, groupBy, and join operations split into Stages 1-3; cached data partitions are marked]
Spark Operations 
Transformations (define a new RDD): map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues 
Actions (return a result to the driver program): collect, reduce, count, save, lookupKey
