SlideShare a Scribd company logo
Information Retrieval
and Data Science
Thamme Gowda
@thammegowda
Karanjeet Singh
@_karanjeet
A	web-crawler	on	Apache	Spark
Feb	7-9,	2017Spark	Summit	East	2017,	Boston 1
SPARKLER
Dr. Chris Mattmann
@chrismattmann
https://github.com/USCDataScience/sparkler
Information Retrieval
and Data Science
ABOUT
2
Information	Retrieval	and	Data	Science	(IRDS)	Group
University	of	Southern	California,	Los	Angeles,	CA
Home	page:	https://irds.usc.edu Email:	irds-L@mymaillists.usc.edu
Thamme	Gowda Dr.	Chris	MattmannKaranjeet	Singh
Graduate	Student
@thammegowda
Graduate	Student
@karanjeet_tw
Director,	IRDS
@chrismattmann
Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Information Retrieval
and Data Science
OVERVIEW
● About Sparkler
● Motivations for building Sparkler
● Sparkler technology stack, internals
● Features of Sparkler
● Dashboard
● Demo
● What’s Next ?
3Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Information Retrieval
and Data Science
ABOUT:	SPARKLER
● New Open Source Web Crawler
•A bot program that can fetch resources from the web
● Name: Spark Crawler
● Inspired by Apache Nutch
● Like Nutch: Distributed crawler that can scale horizontally
● Unlike Nutch: Runs on top of Apache Spark
● Easy to deploy and easy to use
4Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Information Retrieval
and Data Science
MOTIVATION	#1
● Challenges in DARPA MEMEX*
•MEMEX System has crawlers to fetch deep and dark web data
•ML based analysis to assist law keeping agencies
•Crawls are blackbox, we wanted real-time progress reports
● Dr. Chris Mattmann was considering an upgrade since 3 years
● Technology upgrade needed
5Feb	7-9,	2017Spark	Summit	East	2017,	Boston
* http://memex.jpl.nasa.gov/
Information Retrieval
and Data Science
WHY A NEW CRAWLER?
6
Modern	Hadoop	cluster	has	no	Hadoop	(Map-Reduce)	left	in	it!
https://twitter.com/cutting/status/796566255830503424
Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Information Retrieval
and Data Science
MOTIVATION	#2
● Challenges at DATOIN
•Intro: Datoin.com is a distributed text analytics platform
•Late 2014 - migrated the infrastructure from Hadoop Map Reduce to
Apache Spark
•But the crawler component (powered by Apache Nutch) was left behind
● Met Dr. Chris Mattmann at USC in Web Search Engines class
•Enquired about his thoughts for running Nutch on Spark
7Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Information Retrieval
and Data Science
SPARKLER: TECH STACK
● Batch crawling (similar to Apache Nutch)
● Apache Solr as crawl database
● Multi module Maven project with OSGi bundles
● Stream crawled content through Apache Kafka
● Parses everything using Apache Tika
● Crawl visualization - Banana
8Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Information Retrieval
and Data Science
SPARKLER: INTERNALS & WORKFLOW
9Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Information Retrieval
and Data Science
SPARKLER: CRAWLDB
10Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Information Retrieval
and Data Science
SPARKLER: RDD
11Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Information Retrieval
and Data Science
SPARKLER: LINKS PIPELINE
12Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Information Retrieval
and Data Science
SPARKLER: OUTPUT CONSUMPTION
13Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Information Retrieval
and Data Science
SPARKLER: FEATURES
14Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Information Retrieval
and Data Science
SPARKLER #1: Lucene/Solr powered Crawldb
● Crawldb needed indexing
•For real time analytics
•For instant visualizations
● This is internal data structure of sparkler
•Exposed over REST API
•Used by Sparkler-ui, the web application
● We chose Apache Solr
● Standalone Solr server or Solr cloud? Yes!
● Glued the crawldb and spark using CrawldbRDD
15Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Information Retrieval
and Data Science
SPARKLER #2: URL Partitioning
● Politeness
•Doesn’t hit same server too many times in distributed mode
● First version
•Group by: Host name
•Sort by: depth, score
● Customization is easy
•Write your own Solr query
•Take advantage of boosting to alter the ranking
● Partitions the dataset based on the above criteria
● Lazy evaluations and delay between the requests
•Performs parsing instead of waiting
•Inserts delay only when it is necessary
16Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Information Retrieval
and Data Science
SPARKLER #3: OSGI Plugins
● Plugins Interfaces are inspired by Nutch
● Plugins are developed as per Open Service Gateway Interface (OSGI)
● We chose Apache Felix implementation of OSGI
● Migrated a plugin from Nutch
•Regex URL Filter Plugin → The most used plugin in Nutch
● Added JavaScript plugin (described in the next slide)
● //TODO: Migrate more plugins from Nutch
•Mavenize nutch [NUTCH-2293]
17Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Information Retrieval
and Data Science
SPARKLER #4: JavaScript Rendering
● Java Script Execution* has first class support
•Allows Sparkler to crawl the Deep/Dark web too
● Distributable on Spark Cluster without pain
•Pure JVM based JavaScript engine
● This is an implementation of FetchFunction
● FetchFunction
•Stream<URL> → Stream<Content>
•Note: URLS are grouped by host
•Preserves cookies and reuses sessions for each iteration
18
Thanks to: Madhav Sharan
Member of USC IRDS* JBrowserDriver by MachinePublishers
Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Information Retrieval
and Data Science
SPARKLER #5: Output in Kafka Streams
● Crawler is sometimes input for the applications that does deeper analysis
•Can’t fit all those deeper analysis into crawler
● Integrating to such applications made easy via Queues
● We chose Apache Kafka
•Suits our need
•Distributable, Scalable, Fault Tolerant
● FIXME: Larger messages such as Videos
● This is optional, default output on Shared File System (such as HDFS),
compatible with Nutch
19
Thanks to: Rahul Palamuttam
MS CS @ Stanford University; Intern @ NASA JPL
Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Information Retrieval
and Data Science
SPARKLER #6: Tika, the universal parser
● Apache Tika
•Is a toolkit of parsers
•Detects and extracts metadata, text, and URLS
•Over a thousand different file types
● Main application is to discover outgoing links
● The default Implementation for our ParseFunction
20Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Information Retrieval
and Data Science
SPARKLER #7: Visual Analytics
● Charts and Graphs provides nice summary of crawl job
● Real time analytics
● Example:
•Distribution of URLS across hosts/domains
•Temporal activities
•Status reports
● Customizable in real time
● Using Banana Dashboard from Lucidworks
● Sparkler has a sub component named sparkler-ui
21
Thanks to: Manish Dwibedy
MS CS University of Southern California
Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Information Retrieval
and Data Science
SPARKLER #7 DASHBOARD
22Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Information Retrieval
and Data Science
SPARKLER #8: Deployment
● Docker
● Juju Charms
23
Thanks to: Tom Barber
Spicule Analytics & NASA-JPL
Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Information Retrieval
and Data Science
SPARKLER #Next: What’s coming?
● Scoring Crawled Pages (Work in progress)
● Focused Crawling (Work in progress)
● Domain Discovery (Work in progress)
● Detailed documentation and tutorials on wiki (Work in progress)
● Interactive UI
● Crawl Graph Analysis
● Other useful plugins from Nutch
24Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Being used for Polar Deep Insights project
https://www.earthcube.org/group/polar-data-insights-search-analytics-deep-scientific-web
Information Retrieval
and Data Science
DEMO
25
https://github.com/USCDataScience/sparkler
Feb	7-9,	2017Spark	Summit	East	2017,	Boston
$ bin/dockler.sh
Information Retrieval
and Data Science
QUESTIONS?
26
https://github.com/USCDataScience/sparkler
Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Information Retrieval
and Data Science
THANK YOU
27
https://github.com/USCDataScience/sparkler
Feb	7-9,	2017Spark	Summit	East	2017,	Boston

More Related Content

What's hot

Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
Hermeto Romano
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
DataWorks Summit
 
Graph and RDF databases
Graph and RDF databasesGraph and RDF databases
Graph and RDF databases
Nassim Bahri
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
Venkata Naga Ravi
 
Deep Dive on Amazon Aurora
Deep Dive on Amazon AuroraDeep Dive on Amazon Aurora
Deep Dive on Amazon Aurora
Amazon Web Services
 
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
Edureka!
 
How netflix manages petabyte scale apache cassandra in the cloud
How netflix manages petabyte scale apache cassandra in the cloudHow netflix manages petabyte scale apache cassandra in the cloud
How netflix manages petabyte scale apache cassandra in the cloud
Vinay Kumar Chella
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
Rahul Jain
 
Introduction to elasticsearch
Introduction to elasticsearchIntroduction to elasticsearch
Introduction to elasticsearch
hypto
 
Azure Data Factory v2
Azure Data Factory v2Azure Data Factory v2
Azure Data Factory v2
inovex GmbH
 
Apache Spark PDF
Apache Spark PDFApache Spark PDF
Apache Spark PDF
Naresh Rupareliya
 
Stream Processing – Concepts and Frameworks
Stream Processing – Concepts and FrameworksStream Processing – Concepts and Frameworks
Stream Processing – Concepts and Frameworks
Guido Schmutz
 
Streaming data for real time analysis
Streaming data for real time analysisStreaming data for real time analysis
Streaming data for real time analysis
Amazon Web Services
 
Spark on yarn
Spark on yarnSpark on yarn
Spark on yarn
datamantra
 
Chicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An IntroductionChicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An Introduction
Cloudera, Inc.
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
DataArt
 
How Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayHow Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per day
DataWorks Summit
 
DASK and Apache Spark
DASK and Apache SparkDASK and Apache Spark
DASK and Apache Spark
Databricks
 
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Simplilearn
 
Elasticsearch Introduction
Elasticsearch IntroductionElasticsearch Introduction
Elasticsearch Introduction
Roopendra Vishwakarma
 

What's hot (20)

Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
Graph and RDF databases
Graph and RDF databasesGraph and RDF databases
Graph and RDF databases
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Deep Dive on Amazon Aurora
Deep Dive on Amazon AuroraDeep Dive on Amazon Aurora
Deep Dive on Amazon Aurora
 
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
 
How netflix manages petabyte scale apache cassandra in the cloud
How netflix manages petabyte scale apache cassandra in the cloudHow netflix manages petabyte scale apache cassandra in the cloud
How netflix manages petabyte scale apache cassandra in the cloud
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Introduction to elasticsearch
Introduction to elasticsearchIntroduction to elasticsearch
Introduction to elasticsearch
 
Azure Data Factory v2
Azure Data Factory v2Azure Data Factory v2
Azure Data Factory v2
 
Apache Spark PDF
Apache Spark PDFApache Spark PDF
Apache Spark PDF
 
Stream Processing – Concepts and Frameworks
Stream Processing – Concepts and FrameworksStream Processing – Concepts and Frameworks
Stream Processing – Concepts and Frameworks
 
Streaming data for real time analysis
Streaming data for real time analysisStreaming data for real time analysis
Streaming data for real time analysis
 
Spark on yarn
Spark on yarnSpark on yarn
Spark on yarn
 
Chicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An IntroductionChicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An Introduction
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
How Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayHow Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per day
 
DASK and Apache Spark
DASK and Apache SparkDASK and Apache Spark
DASK and Apache Spark
 
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
 
Elasticsearch Introduction
Elasticsearch IntroductionElasticsearch Introduction
Elasticsearch Introduction
 

Similar to Sparkler—Crawler on Apache Spark: Spark Summit East talk by Karanjeet Singh and Thamme Gowda Narayanaswamy

End-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache SparkEnd-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache Spark
Burak Yavuz
 
Vote Early, Vote Often: From Napkin to Canvassing Application in a Single Wee...
Vote Early, Vote Often: From Napkin to Canvassing Application in a Single Wee...Vote Early, Vote Often: From Napkin to Canvassing Application in a Single Wee...
Vote Early, Vote Often: From Napkin to Canvassing Application in a Single Wee...
Jim Czuprynski
 
Incrementally streaming rdbms data to your data lake automagically
Incrementally streaming rdbms data to your data lake automagicallyIncrementally streaming rdbms data to your data lake automagically
Incrementally streaming rdbms data to your data lake automagically
Timothy Spann
 
Jumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on DatabricksJumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on Databricks
Databricks
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks
Databricks
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
Anyscale
 
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
Spark Summit
 
Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...
Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...
Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...
Evention
 
Secure Streaming-as-a-Service with Kafka/Spark/Flink in Hopsworks
Secure Streaming-as-a-Service with Kafka/Spark/Flink in HopsworksSecure Streaming-as-a-Service with Kafka/Spark/Flink in Hopsworks
Secure Streaming-as-a-Service with Kafka/Spark/Flink in Hopsworks
Theofilos Kakantousis
 
NEW LAUNCH! How to build graph applications with SPARQL and Gremlin using Ama...
NEW LAUNCH! How to build graph applications with SPARQL and Gremlin using Ama...NEW LAUNCH! How to build graph applications with SPARQL and Gremlin using Ama...
NEW LAUNCH! How to build graph applications with SPARQL and Gremlin using Ama...
Amazon Web Services
 
Applying large scale text analytics with graph databases
Applying large scale text analytics with graph databasesApplying large scale text analytics with graph databases
Applying large scale text analytics with graph databases
Data Ninja API
 
Processing genetic data at scale
Processing genetic data at scaleProcessing genetic data at scale
Processing genetic data at scale
Mark Schroering
 
Devteach 2017 Store 2 million of audit a day into elasticsearch
Devteach 2017 Store 2 million of audit a day into elasticsearchDevteach 2017 Store 2 million of audit a day into elasticsearch
Devteach 2017 Store 2 million of audit a day into elasticsearch
Taswar Bhatti
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and Interoperability
Wes McKinney
 
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
Spark Summit
 
Machine Learning with SparkR
Machine Learning with SparkRMachine Learning with SparkR
Machine Learning with SparkR
Olgun Aydın
 
Python Web Scraper for ACM and Google Scholar.pptx
Python Web Scraper for ACM and Google Scholar.pptxPython Web Scraper for ACM and Google Scholar.pptx
Python Web Scraper for ACM and Google Scholar.pptx
ASIMKHAN840563
 
Hadoop databases for oracle DBAs
Hadoop databases for oracle DBAsHadoop databases for oracle DBAs
Hadoop databases for oracle DBAs
Maxym Kharchenko
 
Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...
DataWorks Summit/Hadoop Summit
 
Analyzing Pwned Passwords with Spark - OWASP Meetup July 2018
Analyzing Pwned Passwords with Spark - OWASP Meetup July 2018Analyzing Pwned Passwords with Spark - OWASP Meetup July 2018
Analyzing Pwned Passwords with Spark - OWASP Meetup July 2018
Kelley Robinson
 

Similar to Sparkler—Crawler on Apache Spark: Spark Summit East talk by Karanjeet Singh and Thamme Gowda Narayanaswamy (20)

End-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache SparkEnd-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache Spark
 
Vote Early, Vote Often: From Napkin to Canvassing Application in a Single Wee...
Vote Early, Vote Often: From Napkin to Canvassing Application in a Single Wee...Vote Early, Vote Often: From Napkin to Canvassing Application in a Single Wee...
Vote Early, Vote Often: From Napkin to Canvassing Application in a Single Wee...
 
Incrementally streaming rdbms data to your data lake automagically
Incrementally streaming rdbms data to your data lake automagicallyIncrementally streaming rdbms data to your data lake automagically
Incrementally streaming rdbms data to your data lake automagically
 
Jumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on DatabricksJumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on Databricks
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
 
Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...
Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...
Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...
 
Secure Streaming-as-a-Service with Kafka/Spark/Flink in Hopsworks
Secure Streaming-as-a-Service with Kafka/Spark/Flink in HopsworksSecure Streaming-as-a-Service with Kafka/Spark/Flink in Hopsworks
Secure Streaming-as-a-Service with Kafka/Spark/Flink in Hopsworks
 
NEW LAUNCH! How to build graph applications with SPARQL and Gremlin using Ama...
NEW LAUNCH! How to build graph applications with SPARQL and Gremlin using Ama...NEW LAUNCH! How to build graph applications with SPARQL and Gremlin using Ama...
NEW LAUNCH! How to build graph applications with SPARQL and Gremlin using Ama...
 
Applying large scale text analytics with graph databases
Applying large scale text analytics with graph databasesApplying large scale text analytics with graph databases
Applying large scale text analytics with graph databases
 
Processing genetic data at scale
Processing genetic data at scaleProcessing genetic data at scale
Processing genetic data at scale
 
Devteach 2017 Store 2 million of audit a day into elasticsearch
Devteach 2017 Store 2 million of audit a day into elasticsearchDevteach 2017 Store 2 million of audit a day into elasticsearch
Devteach 2017 Store 2 million of audit a day into elasticsearch
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and Interoperability
 
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
 
Machine Learning with SparkR
Machine Learning with SparkRMachine Learning with SparkR
Machine Learning with SparkR
 
Python Web Scraper for ACM and Google Scholar.pptx
Python Web Scraper for ACM and Google Scholar.pptxPython Web Scraper for ACM and Google Scholar.pptx
Python Web Scraper for ACM and Google Scholar.pptx
 
Hadoop databases for oracle DBAs
Hadoop databases for oracle DBAsHadoop databases for oracle DBAs
Hadoop databases for oracle DBAs
 
Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...
 
Analyzing Pwned Passwords with Spark - OWASP Meetup July 2018
Analyzing Pwned Passwords with Spark - OWASP Meetup July 2018Analyzing Pwned Passwords with Spark - OWASP Meetup July 2018
Analyzing Pwned Passwords with Spark - OWASP Meetup July 2018
 

More from Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
Spark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
Spark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
Spark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 

More from Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Recently uploaded

一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
hqfek
 
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
eudsoh
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
Vietnam Cotton & Spinning Association
 
06-18-2024-Princeton Meetup-Introduction to Milvus
06-18-2024-Princeton Meetup-Introduction to Milvus06-18-2024-Princeton Meetup-Introduction to Milvus
06-18-2024-Princeton Meetup-Introduction to Milvus
Timothy Spann
 
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
eoxhsaa
 
Q4FY24 Investor-Presentation.pdf bank slide
Q4FY24 Investor-Presentation.pdf bank slideQ4FY24 Investor-Presentation.pdf bank slide
Q4FY24 Investor-Presentation.pdf bank slide
mukulupadhayay1
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
ytypuem
 
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
Márton Kodok
 
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
osoyvvf
 
一比一原版南昆士兰大学毕业证如何办理
一比一原版南昆士兰大学毕业证如何办理一比一原版南昆士兰大学毕业证如何办理
一比一原版南昆士兰大学毕业证如何办理
ugydym
 
PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)
PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)
PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)
Rebecca Bilbro
 
Template xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptxTemplate xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptx
TeukuEriSyahputra
 
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCAModule 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
yuvarajkumar334
 
Senior Engineering Sample EM DOE - Sheet1.pdf
Senior Engineering Sample EM DOE  - Sheet1.pdfSenior Engineering Sample EM DOE  - Sheet1.pdf
Senior Engineering Sample EM DOE - Sheet1.pdf
Vineet
 
Econ3060_Screen Time and Success_ final_GroupProject.pdf
Econ3060_Screen Time and Success_ final_GroupProject.pdfEcon3060_Screen Time and Success_ final_GroupProject.pdf
Econ3060_Screen Time and Success_ final_GroupProject.pdf
blueshagoo1
 
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
oaxefes
 
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
9gr6pty
 
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
z6osjkqvd
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
bmucuha
 
A gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented GenerationA gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented Generation
dataschool1
 

Recently uploaded (20)

一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
 
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
 
06-18-2024-Princeton Meetup-Introduction to Milvus
06-18-2024-Princeton Meetup-Introduction to Milvus06-18-2024-Princeton Meetup-Introduction to Milvus
06-18-2024-Princeton Meetup-Introduction to Milvus
 
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
 
Q4FY24 Investor-Presentation.pdf bank slide
Q4FY24 Investor-Presentation.pdf bank slideQ4FY24 Investor-Presentation.pdf bank slide
Q4FY24 Investor-Presentation.pdf bank slide
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
 
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
 
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
 
一比一原版南昆士兰大学毕业证如何办理
一比一原版南昆士兰大学毕业证如何办理一比一原版南昆士兰大学毕业证如何办理
一比一原版南昆士兰大学毕业证如何办理
 
PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)
PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)
PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)
 
Template xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptxTemplate xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptx
 
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCAModule 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
 
Senior Engineering Sample EM DOE - Sheet1.pdf
Senior Engineering Sample EM DOE  - Sheet1.pdfSenior Engineering Sample EM DOE  - Sheet1.pdf
Senior Engineering Sample EM DOE - Sheet1.pdf
 
Econ3060_Screen Time and Success_ final_GroupProject.pdf
Econ3060_Screen Time and Success_ final_GroupProject.pdfEcon3060_Screen Time and Success_ final_GroupProject.pdf
Econ3060_Screen Time and Success_ final_GroupProject.pdf
 
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
 
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
 
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
A gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented GenerationA gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented Generation
 

Sparkler—Crawler on Apache Spark: Spark Summit East talk by Karanjeet Singh and Thamme Gowda Narayanaswamy

  • 1. Information Retrieval and Data Science Thamme Gowda @thammegowda Karanjeet Singh @_karanjeet A web-crawler on Apache Spark Feb 7-9, 2017Spark Summit East 2017, Boston 1 SPARKLER Dr. Chris Mattmann @chrismattmann https://github.com/USCDataScience/sparkler
  • 2. Information Retrieval and Data Science ABOUT 2 Information Retrieval and Data Science (IRDS) Group University of Southern California, Los Angeles, CA Home page: https://irds.usc.edu Email: irds-L@mymaillists.usc.edu Thamme Gowda Dr. Chris MattmannKaranjeet Singh Graduate Student @thammegowda Graduate Student @karanjeet_tw Director, IRDS @chrismattmann Feb 7-9, 2017Spark Summit East 2017, Boston
  • 3. Information Retrieval and Data Science OVERVIEW ● About Sparkler ● Motivations for building Sparkler ● Sparkler technology stack, internals ● Features of Sparkler ● Dashboard ● Demo ● What’s Next ? 3Feb 7-9, 2017Spark Summit East 2017, Boston
  • 4. Information Retrieval and Data Science ABOUT: SPARKLER ● New Open Source Web Crawler •A bot program that can fetch resources from the web ● Name: Spark Crawler ● Inspired by Apache Nutch ● Like Nutch: Distributed crawler that can scale horizontally ● Unlike Nutch: Runs on top of Apache Spark ● Easy to deploy and easy to use 4Feb 7-9, 2017Spark Summit East 2017, Boston
  • 5. Information Retrieval and Data Science MOTIVATION #1 ● Challenges in DARPA MEMEX* •MEMEX System has crawlers to fetch deep and dark web data •ML based analysis to assist law keeping agencies •Crawls are blackbox, we wanted real-time progress reports ● Dr. Chris Mattmann was considering an upgrade since 3 years ● Technology upgrade needed 5Feb 7-9, 2017Spark Summit East 2017, Boston * http://memex.jpl.nasa.gov/
  • 6. Information Retrieval and Data Science WHY A NEW CRAWLER? 6 Modern Hadoop cluster has no Hadoop (Map-Reduce) left in it! https://twitter.com/cutting/status/796566255830503424 Feb 7-9, 2017Spark Summit East 2017, Boston
  • 7. Information Retrieval and Data Science MOTIVATION #2 ● Challenges at DATOIN •Intro: Datoin.com is a distributed text analytics platform •Late 2014 - migrated the infrastructure from Hadoop Map Reduce to Apache Spark •But the crawler component (powered by Apache Nutch) was left behind ● Met Dr. Chris Mattmann at USC in Web Search Engines class •Enquired about his thoughts for running Nutch on Spark 7Feb 7-9, 2017Spark Summit East 2017, Boston
  • 8. Information Retrieval and Data Science SPARKLER: TECH STACK ● Batch crawling (similar to Apache Nutch) ● Apache Solr as crawl database ● Multi module Maven project with OSGi bundles ● Stream crawled content through Apache Kafka ● Parses everything using Apache Tika ● Crawl visualization - Banana 8Feb 7-9, 2017Spark Summit East 2017, Boston
  • 9. Information Retrieval and Data Science SPARKLER: INTERNALS & WORKFLOW 9Feb 7-9, 2017Spark Summit East 2017, Boston
  • 10. Information Retrieval and Data Science SPARKLER: CRAWLDB 10Feb 7-9, 2017Spark Summit East 2017, Boston
  • 11. Information Retrieval and Data Science SPARKLER: RDD 11Feb 7-9, 2017Spark Summit East 2017, Boston
  • 12. Information Retrieval and Data Science SPARKLER: LINKS PIPELINE 12Feb 7-9, 2017Spark Summit East 2017, Boston
  • 13. Information Retrieval and Data Science SPARKLER: OUTPUT CONSUMPTION 13Feb 7-9, 2017Spark Summit East 2017, Boston
  • 14. Information Retrieval and Data Science SPARKLER: FEATURES 14Feb 7-9, 2017Spark Summit East 2017, Boston
  • 15. Information Retrieval and Data Science SPARKLER #1: Lucene/Solr powered Crawldb ● Crawldb needed indexing •For real time analytics •For instant visualizations ● This is internal data structure of sparkler •Exposed over REST API •Used by Sparkler-ui, the web application ● We chose Apache Solr ● Standalone Solr server or Solr cloud? Yes! ● Glued the crawldb and spark using CrawldbRDD 15Feb 7-9, 2017Spark Summit East 2017, Boston
  • 16. Information Retrieval and Data Science SPARKLER #2: URL Partitioning ● Politeness •Doesn’t hit same server too many times in distributed mode ● First version •Group by: Host name •Sort by: depth, score ● Customization is easy •Write your own Solr query •Take advantage of boosting to alter the ranking ● Partitions the dataset based on the above criteria ● Lazy evaluations and delay between the requests •Performs parsing instead of waiting •Inserts delay only when it is necessary 16Feb 7-9, 2017Spark Summit East 2017, Boston
  • 17. Information Retrieval and Data Science SPARKLER #3: OSGI Plugins ● Plugins Interfaces are inspired by Nutch ● Plugins are developed as per Open Service Gateway Interface (OSGI) ● We chose Apache Felix implementation of OSGI ● Migrated a plugin from Nutch •Regex URL Filter Plugin → The most used plugin in Nutch ● Added JavaScript plugin (described in the next slide) ● //TODO: Migrate more plugins from Nutch •Mavenize nutch [NUTCH-2293] 17Feb 7-9, 2017Spark Summit East 2017, Boston
  • 18. Information Retrieval and Data Science SPARKLER #4: JavaScript Rendering ● Java Script Execution* has first class support •Allows Sparkler to crawl the Deep/Dark web too ● Distributable on Spark Cluster without pain •Pure JVM based JavaScript engine ● This is an implementation of FetchFunction ● FetchFunction •Stream<URL> → Stream<Content> •Note: URLS are grouped by host •Preserves cookies and reuses sessions for each iteration 18 Thanks to: Madhav Sharan Member of USC IRDS* JBrowserDriver by MachinePublishers Feb 7-9, 2017Spark Summit East 2017, Boston
  • 19. Information Retrieval and Data Science SPARKLER #5: Output in Kafka Streams ● Crawler is sometimes input for the applications that does deeper analysis •Can’t fit all those deeper analysis into crawler ● Integrating to such applications made easy via Queues ● We chose Apache Kafka •Suits our need •Distributable, Scalable, Fault Tolerant ● FIXME: Larger messages such as Videos ● This is optional, default output on Shared File System (such as HDFS), compatible with Nutch 19 Thanks to: Rahul Palamuttam MS CS @ Stanford University; Intern @ NASA JPL Feb 7-9, 2017Spark Summit East 2017, Boston
  • 20. Information Retrieval and Data Science SPARKLER #6: Tika, the universal parser ● Apache Tika •Is a toolkit of parsers •Detects and extracts metadata, text, and URLS •Over a thousand different file types ● Main application is to discover outgoing links ● The default Implementation for our ParseFunction 20Feb 7-9, 2017Spark Summit East 2017, Boston
  • 21. Information Retrieval and Data Science SPARKLER #7: Visual Analytics ● Charts and Graphs provides nice summary of crawl job ● Real time analytics ● Example: •Distribution of URLS across hosts/domains •Temporal activities •Status reports ● Customizable in real time ● Using Banana Dashboard from Lucidworks ● Sparkler has a sub component named sparkler-ui 21 Thanks to: Manish Dwibedy MS CS University of Southern California Feb 7-9, 2017Spark Summit East 2017, Boston
  • 22. Information Retrieval and Data Science SPARKLER #7 DASHBOARD 22Feb 7-9, 2017Spark Summit East 2017, Boston
  • 23. Information Retrieval and Data Science SPARKLER #8: Deployment ● Docker ● Juju Charms 23 Thanks to: Tom Barber Spicule Analytics & NASA-JPL Feb 7-9, 2017Spark Summit East 2017, Boston
  • 24. Information Retrieval and Data Science SPARKLER #Next: What’s coming? ● Scoring Crawled Pages (Work in progress) ● Focused Crawling (Work in progress) ● Domain Discovery (Work in progress) ● Detailed documentation and tutorials on wiki (Work in progress) ● Interactive UI ● Crawl Graph Analysis ● Other useful plugins from Nutch 24Feb 7-9, 2017Spark Summit East 2017, Boston Being used for Polar Deep Insights project https://www.earthcube.org/group/polar-data-insights-search-analytics-deep-scientific-web
  • 25. Information Retrieval and Data Science DEMO 25 https://github.com/USCDataScience/sparkler Feb 7-9, 2017Spark Summit East 2017, Boston $ bin/dockler.sh
  • 26. Information Retrieval and Data Science QUESTIONS? 26 https://github.com/USCDataScience/sparkler Feb 7-9, 2017Spark Summit East 2017, Boston
  • 27. Information Retrieval and Data Science THANK YOU 27 https://github.com/USCDataScience/sparkler Feb 7-9, 2017Spark Summit East 2017, Boston