FUSING APACHE SPARK AND LUCENE FOR NEAR-REALTIME PREDICTIVE MODEL BUILDING
Debasish Das
Principal Engineer
Verizon
Contributors
Platform: Pankaj Rastogi, Venkat Chunduru, Ponrama Jegan, Masoud Tavazoei
Algorithm: Santanu Das, Debasish Das (Dave)
Frontend: Altaff Shaik, Jon Leonhardt
Pramod Lakshmi Narasimha
Principal Engineer
Verizon
© Verizon 2016 All Rights Reserved
Information contained herein is provided AS IS and subject to change without notice. All trademarks used herein are property of their respective owners.
Data Overview
•  Location data
   •  Each srcIp defined as unique row key
   •  Provides approximate location of each key
   •  Timeseries containing latitude, longitude, error bound, duration, timezone for each key
•  Clickstream data
   •  Contains clickstream data of each row key
   •  Contains startTime, duration, httphost, httpuri, upload/download bytes, httpmethod
   •  Compatible with IPFIX/NetFlow formats
Marketing Analytics
Lookalike modeling
Churn reduction
Competitive analysis
Increased share of stomach
•  Anonymous aggregate analysis for customer insights
Data Model
•  Dense dimension, dense measure
   Schema: srcip, date, hour, tld, zip, tldvisits, zipvisits
   Data: 10.1.13.120, d1, H2, macys.com, 94555, 2, 4
•  Sparse dimension, dense measure
   Schema: srcip, date, tld, zip, clickstreamvisits, zipvisits
   Data: 10.1.13.120, d1, {macys.com, kohls.com}, {94555, 94301}, 10, 15
•  Sparse dimension, sparse measure
   Schema: srcip, date, tld, zip, tldvisits, zipvisits
   Data: 10.1.13.120, d1, {macys.com, kohls.com}, {94555, 94301}, {macys.com:4, kohls.com:6}, {94555:8, 94301:7}
   Schema: srcip, week, tld, zip, tldvisits, zipvisits
   Data: 10.1.13.120, week1, {macys.com, kohls.com}, {94555, 94301}, {macys.com:4, kohls.com:6}, {94555:8, 94301:7}
•  Sparse dimension, sparse measure, last N days
   Schema: srcip, tld, zip, tldvisits, zipvisits
   Data: 10.1.13.120, {macys.com, kohls.com}, {94555, 94301}, {macys.com:4, kohls.com:6}, {94555:8, 94301:7}
•  Competing technologies: PowerDrill, Druid, LinkedIn Pinot, Essbase
Document Dataset Representation
•  Example
   Schema: srcip, tld, zip, tldvisits, zipvisits
   Data: 10.1.13.120, {macys.com, kohls.com}, {94555, 94301}, {macys.com:4, kohls.com:6}, {94555:8, 94301:7}
•  DataFrame row to Lucene Document mapping

   Store/schema           Row                           Document
   srcip                  primary key                   docId
   tld, zip               String, Array[String]         SingleValue/MultiValue Indexed Fields
   tldvisits, zipvisits   Double, Map[String, Double]   SparseVector StoredField

•  Distributed collection of srcIp as RDD[Document]
•  ~100M srcip, 1M+ terms (sparse dimensions)
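The mapping in the table can be sketched in plain Scala. This is only an illustration: DeviceRow, DeviceDoc, SparseVec and RowToDoc are hypothetical stand-ins (not Trapezium or Lucene classes), and a single shared term dictionary across tld and zip is an assumption.

```scala
// Hypothetical stand-ins for a DataFrame row and the document built from it.
final case class DeviceRow(
    srcip: String,
    tld: Seq[String],
    zip: Seq[String],
    tldvisits: Map[String, Double],
    zipvisits: Map[String, Double])

// Sparse vector as parallel arrays: sorted dictionary indices and their values.
final case class SparseVec(size: Int, indices: Array[Int], values: Array[Double])

final case class DeviceDoc(
    docId: String,
    indexed: Map[String, Seq[String]], // multi-valued fields that would be indexed
    stored: Map[String, SparseVec])    // measures that would be kept as stored fields

object RowToDoc {
  // Map a sparse measure onto dictionary indices; terms missing from the dictionary are dropped.
  def toSparse(measure: Map[String, Double], dict: Map[String, Int]): SparseVec = {
    val pairs = measure.toSeq
      .flatMap { case (term, v) => dict.get(term).map(i => (i, v)) }
      .sortBy(_._1)
    SparseVec(dict.size, pairs.map(_._1).toArray, pairs.map(_._2).toArray)
  }

  // srcip becomes the docId, dimensions become indexed fields, measures become sparse vectors.
  def convert(row: DeviceRow, dict: Map[String, Int]): DeviceDoc =
    DeviceDoc(
      docId = row.srcip,
      indexed = Map("tld" -> row.tld, "zip" -> row.zip),
      stored = Map(
        "tldvisits" -> toSparse(row.tldvisits, dict),
        "zipvisits" -> toSparse(row.zipvisits, dict)))
}
```

In a real pipeline the indexed and stored maps would be turned into Lucene fields; the sketch stops short of that to stay dependency-free.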
DeviceAnalyzer
•  DeviceAnalyzer goals
   – Search and retrieve devices that match a query
   – Generate statistical and predictive models on the retrieved devices
What is Trapezium?
DAIS open source framework to build batch, streaming and API services
https://github.com/Verizon/trapezium
Trapezium Architecture
[Figure: Trapezium architecture diagram. Labels: data sources D1 to D3, validation (V1), various transactions, outputs O1 to O3.]
Lucene Overview
•  Scalable, full-text search library
•  Focus: Indexing + searching documents
Trapezium LuceneDAO
•  SparkSQL and MLlib are optimized for full scans; column indexing is not supported
•  Why Spark + Lucene integration
   •  Lucene is a battle-tested, Apache-licensed open source project
   •  Adds column search capabilities to Spark
   •  Adds Spark operators (treeAggregate, treeReduce, map) to Lucene
•  LuceneDAO features
   •  Build distributed Lucene shards from a DataFrame
   •  Save shards to HDFS for QueryProcessor (CloudSolr)
   •  Access saved shards through LuceneDAO for ML pipelines
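To make "column search" concrete, here is a minimal in-memory inverted index in plain Scala. It is a sketch of the general idea, not LuceneDAO or Lucene code: InvertedIndexSketch, build and anyOf are hypothetical names, and everything lives in a single local Map rather than in distributed shards.

```scala
object InvertedIndexSketch {
  type DocId = String
  // Postings: (field, term) -> ids of the documents that contain the term in that field.
  type Index = Map[(String, String), Set[DocId]]

  // Build postings from documents given as docId -> (field -> terms).
  def build(docs: Map[DocId, Map[String, Seq[String]]]): Index = {
    val postings = for {
      (id, fields) <- docs.toSeq
      (field, terms) <- fields.toSeq
      term <- terms
    } yield ((field, term), id)
    postings.groupBy(_._1).map { case (key, hits) => key -> hits.map(_._2).toSet }
  }

  // Documents whose `field` contains any of `terms` (OR within one field).
  def anyOf(index: Index, field: String, terms: Seq[String]): Set[DocId] =
    terms.flatMap(t => index.getOrElse((field, t), Set.empty[DocId])).toSet
}
```

An AND across fields is then a set intersection of two anyOf results, which avoids scanning every row the way a full-scan engine would.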
Trapezium Batch
runMode = "BATCH"
dataSource = "HDFS"
dependentWorkflows = {
  workflows = [aggregate]
  frequencyToCheck = 100
}
hdfsFileBatch = {
  batchTime = 86400
  timerStartDelay = 1
  batchInfo = [{
    name = "DeviceStore"
    dataDirectory = {saiph-devqa=/aggregates}
    fileFormat = "parquet"
  }]
}
transactions = [{
  transactionName = "DeviceIndexer"
  inputData = [{name = "DeviceStore"}]
  persistDataName = "indexed"
}]
DeviceAnalyzer: Indexing
[Figure: indexing example. Raw clickstream URIs (for example m.macys.com%2Fshop%2Fsearch%3Fkeyword%3DDress and https://www.walmart.com/ip/Women-Pant-Suit-Roundtree) are reduced to per-device TLD visit counts (ip1, macys.com: 2; ip1, walmart.com: 1; ip1, macys.com: 1; ip2, walmart.com: 1; ip1, amazon.com: 1; ip1, macys.com: 2; ip2, walmart.com: 1). A dictionary maps terms to indices (Macys: 0, Walmart: 1, Amazon: 2), giving one sparse vector per device: ip1 = (0: 5, 1: 1, 2: 1), ip2 = (1: 2).]

object DeviceIndexer extends BatchTransaction {
  def process(dfs: Map[String, DataFrame], batchTime: Time) = {
    val df = dfs("DeviceStore")
    // dm (the term dictionary) is shared with persist() below
    dm = generateDictionary(df)
    // Replace terms by their dictionary indices
    val vectorizedDf = transform(df, dm)
    vectorizedDf
  }
  def persist(df: DataFrame, batchTime: Time) = {
    val converter = SparkLuceneConverter(dm.size)
    val dao = LuceneDAO(batchTime,…).setConverter(converter)
    // Save the dictionary next to the index so queries can decode vectors
    dm.save(path, batchTime)
    dao.index(df, numShards)
  }
}
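The dictionary and vectorization steps can be illustrated without Spark. This is a hedged sketch on local collections: DictionarySketch and its two functions are hypothetical, and generateDictionary here only borrows the slide's name, it is not the Trapezium implementation. Indices are assigned in order of first appearance, which is an assumption.

```scala
object DictionarySketch {
  // Assign each distinct term a dense index in order of first appearance.
  def generateDictionary(terms: Seq[String]): Map[String, Int] =
    terms.distinct.zipWithIndex.toMap

  // Sum visits per (device, term), then key each total by the term's dictionary index.
  def vectorize(
      visits: Seq[(String, String, Int)],
      dict: Map[String, Int]): Map[String, Map[Int, Int]] =
    visits
      .groupBy(_._1)
      .map { case (ip, rows) =>
        val counts = rows.groupBy(_._2).map { case (term, rs) => dict(term) -> rs.map(_._3).sum }
        ip -> counts
      }
}
```

Fed with the seven (ip, tld, count) records from the slide, this yields the sparse vector (0: 5, 1: 1, 2: 1) for ip1 and (1: 2) for ip2.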
LuceneDAO Index Size
rows      InputSize (GB)   IndexSize (GB)
1M        4.0              5.1
4M        14.4             19.0
8M        27.9             35.7
16M       58.8             63.2
73M       276.5            228.0
73M all   276.5            267.1
LuceneDAO Shuffle Size
rows      ShuffleWrite (MB)   Dictionary (MB)
1M        25                  22.0
4M        56                  30.0
8M        85                  31.6
16M       126                 32.2
73M       334                 32.4
73M all   921                 146.5
LuceneDAO Index Runtime
rows      Runtime (s)
1M        135
4M        228
8M        434
16M       571
73M       1726
73M all   2456

20 executors 16 cores
Executor RAM 16 GB
Driver RAM 8 GB
Trapezium API
runMode = "BATCH"
dataSource = "HDFS"
httpServer = {
  provider = "akka"
  hostname = "localhost"
  port = 19999
  contextPath = "/"
  endPoints = [{
    path = "analyzer-api"
    className = "TopKEndPoint"
  }]
}
DeviceAnalyzer: Topk
•  Given a query
   select * from devices where
   tld='macys.com' OR 'nordstorm.com' AND
   (city='SanFrancisco' OR 'Brussels') AND
   (device='Android') …
•  ML: Find topk dimensions highly correlated with the selected devices
•  BI: group by tld order by sum(visits) as tldVisits limit topk

class TopkController(sc: SparkContext) extends SparkServiceEndPoint(sc) {
  override def route = topkRoute
  val converter = SparkLuceneConverter(dm.size)
  // Load the shards written by the most recent "indexer" batch
  val batchTime = Trapezium.getSyncTime("indexer")
  val dao = LuceneDAO(batchTime…)
    .setConverter(converter).load(sc, indexPath)
  val dict = loadDictionary(sc, indexPath, batchTime)
  def topkRoute = {
    post { request =>
      // Retrieve matching devices, then rank dimensions over them
      val devices = dao.search(request)
      val response = getCorrelates(devices, dict, topk)
    }
  }
}

[Figure labels: df[deviceId, vector]; sum, support; mean, median, stddev]
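The BI flavour of topk (group by dimension, order by summed visits, limit k) can be sketched on local collections. TopkSketch and its functions are hypothetical names; the real service would presumably run the aggregation over the retrieved devices with Spark operators rather than on a Seq.

```scala
object TopkSketch {
  // Each retrieved device is a sparse vector: dictionary index -> visits.
  type SparseRow = Map[Int, Double]

  // sum: total of the measure per dimension across the retrieved devices.
  def sum(rows: Seq[SparseRow]): Map[Int, Double] =
    rows.flatMap(_.toSeq).groupBy(_._1).map { case (i, vs) => i -> vs.map(_._2).sum }

  // support: number of retrieved devices that carry an entry for each dimension.
  def support(rows: Seq[SparseRow]): Map[Int, Int] =
    rows.flatMap(_.keys).groupBy(identity).map { case (i, is) => i -> is.size }

  // Top-k dimension names by summed visits; ties are broken by the smaller index.
  def topk(rows: Seq[SparseRow], dict: Map[Int, String], k: Int): Seq[(String, Double)] =
    sum(rows).toSeq
      .sortWith((a, b) => a._2 > b._2 || (a._2 == b._2 && a._1 < b._1))
      .take(k)
      .map { case (i, v) => (dict(i), v) }
}
```

Here dict is the inverse of the indexing dictionary (index -> term), so results can be reported by name.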
Trapezium Stream
runMode = "STREAM"
dataSource = "KAFKA"
kafkaTopicInfo = {
  consumerGroup = "KafkaStreamGroup"
  maxRatePerPartition = 970
  batchTime = "5"
  streamsInfo = [{
    name = "queries"
    topicName = "deviceanalyzer"
  }]
}
transactions = [{
  transactionName = "DeviceAnalyzer"
  inputStreams = [{name: "queries"}]
  persistStreamName = "deviceanalyzer"
  isPersist = "true"
}]
DeviceAnalyzer: Compare
•  Given two queries
   select * from Devices where
   tld='macys.com' OR 'nordstorm.com' AND
   (city='SanFrancisco') AND (device='Android')

   select * from Devices where
   tld='macys.com' OR 'nordstorm.com' AND
   (city='Brussels') AND (device='Android')
•  Find the dimensions that discriminate the devices associated with the two groups

def processStream(streams: Map[String, DStream[Row]], workflowTime: Time) = {
  streams("queries").collect().map { requests =>
    // One device group per query
    val group1 = dao.search(requests(0))
    val group2 = dao.search(requests(1))
    // Fit a model that separates the two groups
    val response = runLDA(group1, group2, dict)
  }
}
def persistStream(responses: RDD[Row], batchTime: Time): Unit = {
  HBaseDAO.write(responses)
}

•  Sparse weighted least squares using Breeze QuadraticMinimizer
•  L1 regularized logistic regression
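As an illustration of how L1 regularization picks out discriminating dimensions, the sketch below fits a tiny L1 regularized logistic regression with proximal gradient descent (ISTA). This is a swapped-in textbook solver on dense arrays, not Breeze QuadraticMinimizer and not the slide's runLDA; L1LogisticSketch, the step size and the toy data are all assumptions.

```scala
object L1LogisticSketch {
  private def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Soft-thresholding: the proximal operator of t * |v|.
  def softThreshold(v: Double, t: Double): Double =
    if (v > t) v - t else if (v < -t) v + t else 0.0

  // Proximal gradient descent (ISTA) on mean logistic loss + lambda * ||w||_1.
  // x: dense feature rows; y: labels in {0, 1}, where 1 marks the first group.
  def fit(
      x: Array[Array[Double]],
      y: Array[Double],
      lambda: Double,
      step: Double,
      iters: Int): Array[Double] = {
    val n = x.length
    val d = x(0).length
    var w = Array.fill(d)(0.0)
    for (_ <- 0 until iters) {
      // Gradient of the mean logistic loss at the current weights.
      val grad = Array.fill(d)(0.0)
      for (i <- 0 until n) {
        val margin = (0 until d).map(j => w(j) * x(i)(j)).sum
        val err = sigmoid(margin) - y(i)
        for (j <- 0 until d) grad(j) += err * x(i)(j) / n
      }
      // Gradient step followed by the L1 proximal step.
      w = Array.tabulate(d)(j => softThreshold(w(j) - step * grad(j), step * lambda))
    }
    w
  }
}
```

The soft-threshold step sets weak weights to exactly zero, so the surviving non-zero weights name the dimensions that separate the two groups.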
DeviceAnalyzer: Augment
•  Given a query
   select * from Devices where
   tld='macys.com' OR 'nordstorm.com'
   AND (city='SanFrancisco' OR 'Brussels')
   AND (device='Android') …
•  Find devices similar to seed as lookalikes
•  Find dimensions that represent lookalikes

object DeviceAnalyzer extends StreamingTransaction {
  val converter = SparkLuceneConverter(dm.size)
  val batchTime = Trapezium.getSyncTime("indexer")
  val dao = LuceneDAO(batchTime…)
    .setConverter(converter).load(sc, indexPath)
  val dict = loadDictionary(sc, indexPath, batchTime)
  // Match-all query: every indexed device
  val all = dao.search("*:*")
  def processStream(streams: Map[String, DStream[Row]]) = {
    streams("queries").collect().map { request =>
      // Seed audience for the lookalike model
      val audience = dao.search(request)
      val response = getLookalikeDimensions(all, audience, dict)
    }
  }
}

•  Sparse weighted least squares using Breeze QuadraticMinimizer
•  L2 regularized linear regression
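The L2 regularized weighted least squares step can be illustrated by solving the normal equations (X^T W X + lambda I) beta = X^T W y directly. This is a dense, single-machine sketch with a hand-written Gaussian elimination; it stands in for, and is not, the sparse Breeze QuadraticMinimizer solve the slide refers to. WeightedRidgeSketch is a hypothetical name, and treating y as 1 for seed devices and 0 otherwise is an assumption.

```scala
object WeightedRidgeSketch {
  // Solve (X^T W X + lambda I) beta = X^T W y by Gaussian elimination with partial pivoting.
  def fit(
      x: Array[Array[Double]],
      y: Array[Double],
      weights: Array[Double],
      lambda: Double): Array[Double] = {
    val d = x(0).length
    // Augmented system [A | b] with A = X^T W X + lambda I and b = X^T W y.
    val a = Array.tabulate(d, d + 1) { (r, c) =>
      if (c < d) {
        val gram = x.indices.map(i => weights(i) * x(i)(r) * x(i)(c)).sum
        if (r == c) gram + lambda else gram
      } else x.indices.map(i => weights(i) * x(i)(r) * y(i)).sum
    }
    for (col <- 0 until d) {
      // Partial pivoting: move the largest remaining entry of this column to the diagonal.
      var pivot = col
      for (r <- col + 1 until d)
        if (math.abs(a(r)(col)) > math.abs(a(pivot)(col))) pivot = r
      val tmp = a(col); a(col) = a(pivot); a(pivot) = tmp
      // Eliminate the column below the diagonal.
      for (r <- col + 1 until d) {
        val f = a(r)(col) / a(col)(col)
        for (c <- col to d) a(r)(c) -= f * a(col)(c)
      }
    }
    // Back substitution.
    val beta = Array.fill(d)(0.0)
    for (r <- d - 1 to 0 by -1) {
      val known = (r + 1 until d).map(c => a(r)(c) * beta(c)).sum
      beta(r) = (a(r)(d) - known) / a(r)(r)
    }
    beta
  }
}
```

With non-negative weights and lambda > 0 the system matrix is positive definite, so the elimination never meets a zero pivot. Large coefficients then point at dimensions that characterize the seed audience.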
FastSummarizer
•  Statistical and predictive operators
   •  sum: sum over numeric measures
   •  support: sum over distinct docID
   •  sumSquared: L2 norm
   •  gram: Uses BLAS sspr
   •  solve: Uses Breeze QuadraticMinimizer to support L1
•  Implemented using Array[Float] for shuffle optimization
   •  Scala/Java for Level 1 operations
   •  OpenBLAS for Level 3 operations
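A few of these operators can be sketched over sparse rows with Array[Float] accumulators. SummarizerSketch is a hypothetical name and the loops are plain Scala rather than BLAS calls; the gram accumulator only mirrors the packed upper-triangular, column-major layout that BLAS sspr updates in place.

```scala
object SummarizerSketch {
  // A document's measure as a sparse vector: (dictionary index, value) pairs.
  type SparseRow = Seq[(Int, Float)]

  // sum: per-dimension total of the measure.
  def sum(rows: Seq[SparseRow], dim: Int): Array[Float] = {
    val acc = new Array[Float](dim)
    for (row <- rows; (i, v) <- row) acc(i) += v
    acc
  }

  // support: per-dimension count of documents that carry the dimension.
  def support(rows: Seq[SparseRow], dim: Int): Array[Float] = {
    val acc = new Array[Float](dim)
    for (row <- rows; i <- row.map(_._1).distinct) acc(i) += 1f
    acc
  }

  // gram: X^T X in packed upper-triangular, column-major storage.
  // Entry (i, j) with i <= j (0-based) is stored at i + j * (j + 1) / 2.
  def gram(rows: Seq[SparseRow], dim: Int): Array[Float] = {
    val packed = new Array[Float](dim * (dim + 1) / 2)
    for (row <- rows; (i, vi) <- row; (j, vj) <- row if i <= j)
      packed(i + j * (j + 1) / 2) += vi * vj
    packed
  }
}
```

Flat Float arrays keep the partial results small and cheap to serialize, which is one plausible reading of the "shuffle optimization" bullet.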
Sync API Benchmark
73M rows, 1M+ search terms
1 measure on 250K sparse dimensions
20 executors 8 cores
32 GB driver RAM, 16 GB executor RAM
akka-http cores: 24 default

topk
qps   runtime (s)
1     1.389
5     1.663
10    3.214
20    5.992
40    12.174
Async API Benchmark
73M rows, 1M+ search terms
1 measure on 250K sparse dimensions
20 executors 8 cores
32 GB driver RAM, 16 GB executor RAM
forkjoinpool = 40
Kafka Fetch + compare/augment + HBase Persist

predictions
qps   compare (s)   augment (s)
1     9             16
5     13            36
10    23            70
20    42            142
topk tld + apps
Augment: Auto Enthusiastic
Augment Model Performance
Compare: Leisure vs Business Travellers
THANK YOU.
Q&A
Join us and make machines intelligent
Data & Artificial Intelligence Systems
499 Hamilton Ave, Palo Alto
California

Apache Carbondata: An Indexed Columnar File Format for Interactive Query with...
 

Spark Summit EU talk by Debasish Das and Pramod Narasimha

  • 1. FUSING APACHE SPARK AND LUCENE FOR NEAR-REALTIME PREDICTIVE MODEL BUILDING
    Debasish Das, Principal Engineer, Verizon
    Pramod Lakshmi Narasimha, Principal Engineer, Verizon
    Contributors
    Platform: Pankaj Rastogi, Venkat Chunduru, Ponrama Jegan, Masoud Tavazoei
    Algorithm: Santanu Das, Debasish Das (Dave)
    Frontend: Altaff Shaik, Jon Leonhardt
  • 2. Data Overview
    •  Location data
       •  Each srcIp defined as unique row key
       •  Provides approximate location of each key
       •  Timeseries containing latitude, longitude, error bound, duration, timezone for each key
    •  Clickstream data
       •  Contains clickstream data of each row key
       •  Contains startTime, duration, httphost, httpuri, upload/download bytes, httpmethod
       •  Compatible with IPFIX/Netflow formats
    © Verizon 2016 All Rights Reserved. Information contained herein is provided AS IS and subject to change without notice. All trademarks used herein are property of their respective owners.
  • 3. Marketing Analytics
    •  Lookalike modeling
    •  Churn reduction
    •  Competitive analysis
    •  Increased share of stomach
    •  Anonymous aggregate analysis for customer insights
  • 4. Data Model
      •  Dense dimension, dense measure
          Schema: srcip, date, hour, tld, zip, tldvisits, zipvisits
          Data: 10.1.13.120, d1, H2, macys.com, 94555, 2, 4
      •  Sparse dimension, dense measure
          Schema: srcip, date, tld, zip, clickstreamvisits, zipvisits
          Data: 10.1.13.120, d1, {macys.com, kohls.com}, {94555, 94301}, 10, 15
      •  Sparse dimension, sparse measure
          Schema: srcip, date, tld, zip, tldvisits, zipvisits
          Data: 10.1.13.120, d1, {macys.com, kohls.com}, {94555, 94301}, {macys.com:4, kohls.com:6}, {94555:8, 94301:7}
          Schema: srcip, week, tld, zip, tldvisits, zipvisits
          Data: 10.1.13.120, week1, {macys.com, kohls.com}, {94555, 94301}, {macys.com:4, kohls.com:6}, {94555:8, 94301:7}
      •  Sparse dimension, sparse measure, last N days
          Schema: srcip, tld, zip, tldvisits, zipvisits
          Data: 10.1.13.120, {macys.com, kohls.com}, {94555, 94301}, {macys.com:4, kohls.com:6}, {94555:8, 94301:7}
      •  Competing technologies: PowerDrill, Druid, LinkedIn Pinot, Essbase
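The step from the dense model to the sparse-dimension, sparse-measure model can be sketched in plain Scala collections. This is only an illustration under assumptions: `DenseRow`, `SparseRow`, and `rollUp` are hypothetical names, the field names follow the schemas on the slide, and the slides do not say how the actual aggregation job is written (it presumably runs on Spark DataFrames rather than local sequences).

```scala
// Hypothetical record types; field names follow the schemas on the slide.
final case class DenseRow(srcip: String, date: String, hour: String,
                          tld: String, zip: String,
                          tldvisits: Double, zipvisits: Double)

final case class SparseRow(srcip: String,
                           tld: Set[String], zip: Set[String],
                           tldvisits: Map[String, Double],
                           zipvisits: Map[String, Double])

object DataModelSketch {
  // Sum the measure for each distinct dimension value.
  private def sumBy(pairs: Seq[(String, Double)]): Map[String, Double] =
    pairs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

  // Collapse all dense rows of one srcip into a single sparse row
  // (assumption: per-row visit counts are additive across hours).
  def rollUp(rows: Seq[DenseRow]): Seq[SparseRow] =
    rows.groupBy(_.srcip).toSeq.sortBy(_._1).map { case (ip, rs) =>
      val tldVisits = sumBy(rs.map(r => r.tld -> r.tldvisits))
      val zipVisits = sumBy(rs.map(r => r.zip -> r.zipvisits))
      SparseRow(ip, tldVisits.keySet, zipVisits.keySet, tldVisits, zipVisits)
    }
}
```

Dropping `date` and `hour` from the grouping key corresponds to the "last N days" variant, where a device is summarized by one row.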
  • 5. Document Dataset Representation
      •  Example
          Schema: srcip, tld, zip, tldvisits, zipvisits
          Data: 10.1.13.120, {macys.com, kohls.com}, {94555, 94301}, {macys.com:4, kohls.com:6}, {94555:8, 94301:7}
      •  DataFrame row to Lucene Document mapping
          Store/schema          Row                           Document
          srcip                 primary key                   docId
          tld, zip              String, Array[String]         SingleValue/MultiValue indexed fields
          tldvisits, zipvisits  Double, Map[String, Double]   SparseVector StoredField
      •  Distributed collection of srcIp as RDD[Document]
      •  ~100M srcip, 1M+ terms (sparse dimensions)
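The row-to-document mapping in the table above can be mimicked without Spark or Lucene on the classpath. In the sketch below, `DeviceRow`, `SparseVec`, and `Doc` are hypothetical stand-ins, not the real `Row`, `SparseVector`, or Lucene `Document` classes, and the single shared dictionary is an assumption made for brevity.

```scala
// Hypothetical stand-ins for a DataFrame row and a Lucene document;
// neither type is the real Spark or Lucene class.
final case class DeviceRow(srcip: String,
                           tld: Seq[String], zip: Seq[String],
                           tldvisits: Map[String, Double],
                           zipvisits: Map[String, Double])

final case class SparseVec(size: Int, indices: Array[Int], values: Array[Double])

final case class Doc(docId: String,
                     indexed: Map[String, Seq[String]], // searchable terms per field
                     stored: Map[String, SparseVec])    // measures kept for retrieval

object RowToDocSketch {
  // Encode a term -> value map as a sparse vector over a shared dictionary.
  def toSparse(m: Map[String, Double], dict: Map[String, Int]): SparseVec = {
    val pairs = m.toSeq
      .collect { case (t, v) if dict.contains(t) => dict(t) -> v }
      .sortBy(_._1)
    SparseVec(dict.size, pairs.map(_._1).toArray, pairs.map(_._2).toArray)
  }

  // Dimensions become indexed multi-value fields; measures become stored vectors.
  def toDoc(r: DeviceRow, dict: Map[String, Int]): Doc =
    Doc(r.srcip,
        Map("tld" -> r.tld, "zip" -> r.zip),
        Map("tldvisits" -> toSparse(r.tldvisits, dict),
            "zipvisits" -> toSparse(r.zipvisits, dict)))
}
```

Keeping the measures as stored sparse vectors rather than indexed terms matches the table: the dimensions drive search, while the vectors are only read back for the documents that match.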
  • 6. DeviceAnalyzer
      •  DeviceAnalyzer goals
          •  Search and retrieve devices that match a query
          •  Generate statistical and predictive models on the retrieved devices
  • 7. What is Trapezium?
      DAIS open source framework to build batch, streaming and API services
      https://github.com/Verizon/trapezium
  • 8. Trapezium Architecture
      [Architecture diagram; labels: Trapezium, D1, D2, D3, Validation, V1, VARIOUS TRANSACTIONS, O1, O2, O3]
  • 9. Lucene Overview
      •  Scalable, full-text search library
      •  Focus: indexing + searching documents
  • 10. Trapezium LuceneDAO
      •  SparkSQL and MLlib are optimized for full scans; column indexing is not supported
      •  Why Spark + Lucene integration
          •  Lucene is a battle-tested, Apache-licensed open source project
          •  Adds column search capabilities to Spark
          •  Adds Spark operators (treeAggregate, treeReduce, map) to Lucene
      •  LuceneDAO features
          •  Build distributed Lucene shards from a DataFrame
          •  Save shards to HDFS for QueryProcessor (CloudSolr)
          •  Access saved shards through LuceneDAO for ML pipelines
  • 11. Trapezium Batch
      runMode = "BATCH"
      dataSource = "HDFS"
      dependentWorkflows = {
        workflows = [aggregate]
        frequencyToCheck = 100
      }
      hdfsFileBatch = {
        batchTime = 86400
        timerStartDelay = 1
        batchInfo = [{
          name = "DeviceStore"
          dataDirectory = {saiph-devqa=/aggregates}
          fileFormat = "parquet"
        }]
      }
      transactions = [{
        transactionName = "DeviceIndexer"
        inputData = [{name = "DeviceStore"}]
        persistDataName = "indexed"
      }]
  • 12. DeviceAnalyzer: Indexing
      •  Raw clickstream URLs (examples)
          /?ref=1108&?url=http://www.macys.com&id=5
          www.walmart.com%2Fc%2Fep%2Frange-hood-filters&sellermemid=459
          http%3A%2F%2Fm.macys.com%2Fshop%2Fproduct%2Fjockey-elance-cotton
          /?ref=1108&?url=http://www.macys.com&id=5
          m.amazon.com%2Fshop%2Fproduct%2Fjockey-elance-cotton
          https://www.walmart.com/ip/Women-Pant-Suit-Roundtree
          walmart://ip/?veh=dsn&wmlspartner
          m.macys.com%2Fshop%2Fsearch%3Fkeyword%3DDress
      •  Extracted (srcip, tld: visits)
          ip1, macys.com: 2
          ip1, walmart.com: 1
          ip1, macys.com: 1
          ip2, walmart.com: 1
          ip1, amazon.com: 1
          ip1, macys.com: 2
          ip2, walmart.com: 1
      •  Dictionary: Macys, 0; Walmart, 1; Amazon, 2
      •  Sparse vectors
          ip1: index 0 = 5, index 1 = 1, index 2 = 1
          ip2: index 1 = 2
      •  Indexer transaction (pseudocode from the slide)
          object DeviceIndexer extends BatchTransaction {
            process(dfs: Map[String, DataFrame], batchTime: Time): {
              df = dfs("DeviceStore")
              dm = generateDictionary(df)
              vectorizedDf = transform(df, dm)
            }
            persist(df: DataFrame, batchTime: Time): {
              converter = SparkLuceneConverter(dm.size)
              dao = LuceneDAO(batchTime,…).setConverter(converter)
              dm.save(path, batchTime)
              dao.index(df, numShards)
            }
          }
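The dictionary and vectorization steps on this slide can be reproduced on the slide's own example data. The sketch below is a local, plain-Scala approximation: the name `generateDictionary` comes from the slide, but its body here, the `vectorize` helper, and the first-appearance index order are assumptions, since the slides do not show the Trapezium implementation.

```scala
object IndexingSketch {
  // Assign each distinct term an index in order of first appearance
  // (assumption: the slides do not state how indices are ordered).
  def generateDictionary(counts: Seq[(String, String, Int)]): Map[String, Int] =
    counts.map(_._2).distinct.zipWithIndex.toMap

  // Sum visits per (srcip, term) and key them by dictionary index,
  // giving one sparse vector per srcip.
  def vectorize(counts: Seq[(String, String, Int)],
                dict: Map[String, Int]): Map[String, Map[Int, Int]] =
    counts.groupBy(_._1).map { case (ip, rows) =>
      ip -> rows.groupBy(r => dict(r._2)).map { case (i, rs) => i -> rs.map(_._3).sum }
    }
}

object IndexingExample {
  // The (srcip, tld, visits) triples listed on the slide.
  val counts: Seq[(String, String, Int)] = Seq(
    ("ip1", "macys.com", 2), ("ip1", "walmart.com", 1), ("ip1", "macys.com", 1),
    ("ip2", "walmart.com", 1), ("ip1", "amazon.com", 1), ("ip1", "macys.com", 2),
    ("ip2", "walmart.com", 1))
  val dict: Map[String, Int] = IndexingSketch.generateDictionary(counts)
  val vectors: Map[String, Map[Int, Int]] = IndexingSketch.vectorize(counts, dict)
}
```

On this input the dictionary maps macys.com, walmart.com, and amazon.com to 0, 1, and 2, and ip1 ends up with 5 visits at index 0, which agrees with the vectors drawn on the slide.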
  • 13. LuceneDAO Index Size
      rows      InputSize(gb)   IndexSize(gb)
      1M        4.0             5.1
      4M        14.4            19.0
      8M        27.9            35.7
      16M       58.8            63.2
      73M       276.5           228.0
      73M all   276.5           267.1
  • 14. LuceneDAO Shuffle Size
      rows      ShuffleWrite(mb)   Dictionary(mb)
      1M        25                 22.0
      4M        56                 30.0
      8M        85                 31.6
      16M       126                32.2
      73M       334                32.4
      73M all   921                146.5
  • 15. LuceneDAO Index Runtime
      Setup: 20 executors, 16 cores, 16 GB executor RAM, 8 GB driver RAM
      rows      Runtime (s)
      1M        135
      4M        228
      8M        434
      16M       571
      73M       1726
      73M all   2456
  • 16. Trapezium Api
      runMode = "BATCH"
      dataSource = "HDFS"
      httpServer = {
        provider = "akka"
        hostname = "localhost"
        port = 19999
        contextPath = "/"
        endPoints = [{
          path = "analyzer-api"
          className = "TopKEndPoint"
        }]
      }
  • 17. DeviceAnalyzer: Topk
      •  Given a query
          select * from devices where tld='macys.com' OR 'nordstorm.com' AND (city='SanFrancisco' OR 'Brussels') AND (device='Android') …
      •  ML: Find the topk dimensions highly correlated with the selected devices
      •  BI: group by tld order by sum(visits) as tldVisits limit topk
      •  Endpoint (pseudocode from the slide)
          class TopkController(sc: SparkContext) extends SparkServiceEndPoint(sc) {
            override def route : topkRoute
            converter = SparkLuceneConverter(dm.size)
            batchTime = Trapezium.getSyncTime("indexer")
            dao = LuceneDAO(batchTime…).setConverter(converter).load(sc, indexPath)
            dict = loadDictionary(sc, indexPath, batchTime)
            def topkRoute : {
              post { request => {
                devices = dao.search(request)
                response = getCorrelates(devices, dict, topk)
              } }
            }
          }
      •  Diagram labels: df[deviceId, vector]; sum, support; mean, median, stddev
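The BI variant of topk (group by tld, order by sum(visits), limit topk) reduces to a per-dimension sum over the retrieved sparse vectors followed by a sort. A minimal local sketch, assuming each retrieved device is a map from dictionary index to visits; `TopkSketch.topk` is a hypothetical helper and does not reproduce `getCorrelates`, whose logic the slides do not show.

```scala
object TopkSketch {
  // devices: one sparse vector per retrieved device (dictionary index -> visits).
  // Returns the k dimensions with the largest total visits, ties broken by index.
  def topk(devices: Seq[Map[Int, Double]],
           dict: Map[String, Int], k: Int): Seq[(String, Double)] = {
    val names = dict.map(_.swap) // index -> term
    val sums = devices.flatMap(_.toSeq)
      .groupBy(_._1)
      .map { case (i, vs) => i -> vs.map(_._2).sum }
    sums.toSeq
      .sortWith((a, b) => a._2 > b._2 || (a._2 == b._2 && a._1 < b._1))
      .take(k)
      .map { case (i, s) => names(i) -> s }
  }
}
```

In the deck's setting the per-dimension sums would be computed with distributed operators such as treeAggregate; the local `groupBy` here only illustrates the arithmetic.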
  • 18. Trapezium Stream
      runMode = "STREAM"
      dataSource = "KAFKA"
      kafkaTopicInfo = {
        consumerGroup = "KafkaStreamGroup"
        maxRatePerPartition = 970
        batchTime = "5"
        streamsInfo = [{
          name = "queries"
          topicName = "deviceanalyzer"
        }]
      }
      transactions = [{
        transactionName = "DeviceAnalyzer"
        inputStreams = [{name: "queries"}]
        persistStreamName = "deviceanalyzer"
        isPersist = "true"
      }]
  • 19. DeviceAnalyzer: Compare
      •  Given two queries
          select * from Devices where tld='macys.com' OR 'nordstorm.com' AND (city='SanFrancisco') AND (device='Android')
          select * from Devices where tld='macys.com' OR 'nordstorm.com' AND (city='Brussels') AND (device='Android')
      •  Find the dimensions that discriminate between the devices in the two groups
      •  Streaming transaction (pseudocode from the slide)
          def processStream(streams: Map[String, DStream[Row]], workflowTime: Time): {
            streams("queries").collect().map { requests =>
              group1 = dao.search(requests(0))
              group2 = dao.search(requests(1))
              response = runLDA(group1, group2, dict)
            }
          }
          def persistStream(responses: RDD[Row], batchTime: Time) {
            HBaseDAO.write(responses)
          }
      •  Sparse weighted least squares using Breeze QuadraticMinimizer
      •  L1 regularized logistic regression
  • 20. DeviceAnalyzer: Augment
      •  Given a query
          select * from Devices where tld='macys.com' OR 'nordstorm.com' AND (city='SanFrancisco' OR 'Brussels') AND (device='Android') …
      •  Find devices similar to the seed as lookalikes
      •  Find dimensions that represent the lookalikes
      •  Streaming transaction (pseudocode from the slide)
          object DeviceAnalyzer extends StreamingTransaction {
            converter = SparkLuceneConverter(dm.size)
            batchTime = Trapezium.getSyncTime("indexer")
            dao = LuceneDAO(batchTime…).setConverter(converter).load(sc, indexPath)
            dict = loadDictionary(sc, indexPath, batchTime)
            all = dao.search("*:*")
            def processStream(streams: Map[String, DStream[Row]]) : {
              streams("queries").collect().map { request =>
                audience = dao.search(request)
                response = getLookalikeDimensions(all, audience, dict)
              }
            }
          }
      •  Sparse weighted least squares using Breeze QuadraticMinimizer
      •  L2 regularized linear regression
  • 21. FastSummarizer
      •  Statistical and predictive operators
          •  sum: sum over numeric measures
          •  support: sum over distinct docID
          •  sumSquared: L2 norm
          •  gram: uses BLAS sspr
          •  solve: uses Breeze QuadraticMinimizer to support L1
      •  Implemented using Array[Float] for shuffle optimization
      •  Scala/Java for Level 1 operations
      •  OpenBLAS for Level 3 operations
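The first four operators can be illustrated on small sparse vectors. The sketch below is an interpretation, not the FastSummarizer code: it assumes support counts the documents in which a dimension is present, treats sumSquared as the per-dimension sum of squares, and builds the Gram matrix in the packed upper-triangular, column-major layout that BLAS `sspr` updates, using plain loops instead of a BLAS call. `SummarizerSketch` and the `Sparse` alias are hypothetical names.

```scala
object SummarizerSketch {
  type Sparse = Map[Int, Float] // dictionary index -> measure

  // sum: per-dimension total of the measure.
  def sum(docs: Seq[Sparse], n: Int): Array[Float] = {
    val out = new Array[Float](n)
    for (d <- docs; (i, v) <- d) out(i) += v
    out
  }

  // support: number of documents in which each dimension is present (assumption).
  def support(docs: Seq[Sparse], n: Int): Array[Float] = {
    val out = new Array[Float](n)
    for (d <- docs; i <- d.keys) out(i) += 1f
    out
  }

  // sumSquared: per-dimension sum of squared measures.
  def sumSquared(docs: Seq[Sparse], n: Int): Array[Float] = {
    val out = new Array[Float](n)
    for (d <- docs; (i, v) <- d) out(i) += v * v
    out
  }

  // gram: upper triangle of the sum over documents of x * x^T, packed column by
  // column; entry (i, j) with i <= j is stored at position j * (j + 1) / 2 + i.
  def gram(docs: Seq[Sparse], n: Int): Array[Float] = {
    val out = new Array[Float](n * (n + 1) / 2)
    for (d <- docs; (j, vj) <- d; (i, vi) <- d if i <= j)
      out(j * (j + 1) / 2 + i) += vi * vj
    out
  }
}
```

Every result is a flat `Array[Float]`, so partial results from different partitions can be merged by element-wise addition, which is the shape a treeAggregate-style reduction needs.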
  • 22. Sync API Benchmark
      Setup: 73M rows; 1M+ search terms; 1 measure on 250K sparse dimensions; 20 executors, 8 cores; 32 GB driver RAM; 16 GB executor RAM; akka-http cores: 24 default
      topk
      qps   runtime(s)
      1     1.389
      5     1.663
      10    3.214
      20    5.992
      40    12.174
  • 23. Async API Benchmark
      Setup: 73M rows; 1M+ search terms; 1 measure on 250K sparse dimensions; 20 executors, 8 cores; 32 GB driver RAM; 16 GB executor RAM; forkjoinpool = 40
      Measured path: Kafka fetch + compare/augment + HBase persist
      predictions
      qps   compare(s)   augment(s)
      1     9            16
      5     13           36
      10    23           70
      20    42           142
  • 24. topk tld + apps
  • 25. Augment: Auto Enthusiastic
  • 26. Augment Model Performance
  • 27. Compare: Leisure vs Business Travellers
  • 28. THANK YOU. Q&A
      Join us and make machines intelligent
      Data & Artificial Intelligence Systems
      499 Hamilton Ave, Palo Alto, California