NETWORK SEARCH ENGINE
  
Data Pipeline

Stages: Raw Data, Ingestion and File System, Batch Processing, Data Store, Web Framework

Raw Data: Common Crawl on AWS S3 (Wikipedia 2015)
Snapshot of Entire Internet, 160 TB per Month!
Source of Truth: S3
12 x s3 Medium, $0.80 per hour, ~$160 per month
Elasticsearch: 8 x T4 Large, $1.01 per hour
Cassandra: 8 x T4 Large, $1.01 per hour
Flask: 1 x T2 Micro, Free

System Costs: ~$2800 per month
If spot instances used: ~$300 per month

CC Blast: Custom Python Module

Source: Common Crawl on AWS S3 (Wikipedia 2015)
Query: *.wikipedia.org
808 buckets
15,000 sub-bucket
~50/2000 Wikipedia URLs
Filetypes:
Text: WET
Hyperlinks: WAT
JSON
Source of Truth
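
The `*.wikipedia.org` query can be illustrated with a small sketch. The slides do not show how CC Blast is implemented, so the function names below are hypothetical, and the sketch assumes only that each WET/WAT record carries a `WARC-Target-URI` header line, as Common Crawl WARC-format files do.

```python
from urllib.parse import urlparse


def is_wikipedia_url(url):
    """Return True when the URL's host matches the pattern *.wikipedia.org."""
    host = urlparse(url).hostname or ""
    return host.endswith(".wikipedia.org")


def target_uris(record_lines):
    """Yield the WARC-Target-URI header value of each record."""
    prefix = "WARC-Target-URI:"
    for line in record_lines:
        if line.startswith(prefix):
            yield line[len(prefix):].strip()


def wikipedia_uris(record_lines):
    """Keep only the target URIs that belong to a Wikipedia subdomain."""
    return [uri for uri in target_uris(record_lines) if is_wikipedia_url(uri)]


# Hypothetical sample of WET-style header lines (not real crawl data).
sample = [
    "WARC/1.0",
    "WARC-Type: conversion",
    "WARC-Target-URI: https://en.wikipedia.org/wiki/Data_science",
    "",
    "WARC/1.0",
    "WARC-Type: conversion",
    "WARC-Target-URI: http://example.com/page",
]
print(wikipedia_uris(sample))
```

Matching on the parsed hostname rather than on the raw URL string avoids false positives such as `wikipedia.org.example.com`.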
Challenge:
How to optimize database for low latency querying?
URL (keys)
Documents (values)

[Figure: Network Map – Wikipedia Contributor. QUERY: Telecommunications. Node labels include Tyler, Ben Casey, Barb, Dana.]

Hybrid Database - Schema

Elasticsearch: Index by Text

Property    Key: Text              Value: Nodes
Date        Texta (Physics)        a,b,d
            Textb (Engineering)    a
            Textc (Science)        a,c

Cassandra: Key = URL, Order by Rank

Property    Key: URL            Clustering: Rank    Value: Links
Date        /Data_Science       189                 a-c, a-d, …
            /Insight_Data       186                 c-a, c-h, …
            /Spark_Streaming    185                 a-b, b-c
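
A rough in-memory sketch of how the two stores might cooperate at query time is given below. It is not the deck's implementation: plain Python containers stand in for Elasticsearch and Cassandra, the `search` function is hypothetical, and the rule that a URL matches when one of its links touches a node returned by the text index is an assumption. Only the links listed on the slide are included; the elided ones are left out.

```python
# Text index (Elasticsearch role): text -> nodes, values taken from the slide.
text_index = {
    "Physics": ["a", "b", "d"],
    "Engineering": ["a"],
    "Science": ["a", "c"],
}

# Ranked store (Cassandra role): rows of (url, rank, links).
ranked_rows = [
    ("/Data_Science", 189, ["a-c", "a-d"]),
    ("/Insight_Data", 186, ["c-a", "c-h"]),
    ("/Spark_Streaming", 185, ["a-b", "b-c"]),
]


def search(text, limit=10):
    """Look up nodes by text, then return matching URLs ordered by rank."""
    nodes = set(text_index.get(text, []))
    hits = []
    for url, rank, links in ranked_rows:
        # Collect both endpoints of every link stored for this URL.
        endpoints = {node for link in links for node in link.split("-")}
        if nodes & endpoints:  # assumed matching rule
            hits.append((url, rank))
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits[:limit]


print(search("Engineering", limit=2))
```

Storing rows already ordered by rank means the top results can be read without sorting at query time, which is the point of the "Order by Rank" clustering on the slide.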

Engineering Challenges:
Approximation of PageRank with low latency
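
One common way to trade accuracy for latency is to stop PageRank's power iteration after a small, fixed number of passes. The sketch below shows that idea only; the slides do not say which approximation was used, and the function name, damping factor, iteration count, and sample edges are all assumptions.

```python
def approximate_pagerank(edges, damping=0.85, iterations=10):
    """Approximate PageRank with a fixed number of power-iteration passes."""
    nodes = sorted({node for edge in edges for node in edge})
    if not nodes:
        return {}
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    count = len(nodes)
    rank = {node: 1.0 / count for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread over all nodes.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping) / count + damping * dangling / count
        new_rank = {node: base for node in nodes}
        for src in nodes:
            if out_links[src]:
                share = damping * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical link graph in the a-b, b-c style used on the schema slide.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
ranks = approximate_pagerank(edges)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Fewer iterations give a coarser ranking in less time; the ordering of the top pages often stabilizes well before the scores themselves converge.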

[Figure: Network Map – Wikipedia Contributor. QUERY: ALL. Node labels include Mary, Tyler, Ben Casey, Barb, Dana.]

[Figure: Network Map – Wikipedia Contributor. QUERY: Data Engineering. Node labels include Tyler, Ben Casey, Barb, Dana.]

Data Science Application
Pearson Correlation
Page Rank Correlation to ‘Data Engineering’
Biology, Biotech, Tech
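
For reference, the Pearson correlation between two lists of page-rank scores can be computed as in the sketch below. The score values are hypothetical placeholders, not results from the deck.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical page-rank scores for the same pages under two queries.
data_engineering_ranks = [189.0, 186.0, 185.0, 120.0]
other_topic_ranks = [170.0, 175.0, 160.0, 90.0]
print(pearson(data_engineering_ranks, other_topic_ranks))
```

A value near 1 would indicate that pages ranked highly for one query also tend to rank highly for the other.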
Matthew Rubashkin
BioE PhD, UC Berkeley
Origin
Write Capacity of Cassandra

Insight_150115_Demo