Insight_150115_Demo

Data Pipeline
Raw Data -> Ingestion and File System -> Batch Processing -> Data Store -> Web Framework
Common Crawl
AWS S3
Wikipedia 2015
160 TB per Month!
Snapshot of Entire Internet

Data Pipeline
Raw Data: Common Crawl, AWS S3, Wikipedia 2015 (160 TB per Month!)
Ingestion and File System: S3, Source of Truth
Data Stores: Cassandra, Elasticsearch
Web Framework: Flask
Instances: 12 x s3 Medium, 8 x T4 Large, 8 x T4 Large, 1 x T2 Micro
Instance costs: $0.80 per hour (~$160 per month), $1.01 per hour, $1.01 per hour, Free
System Costs: ~$2800 per month
If spot instances used: ~$300 per month

Common Crawl
AWS S3
Wikipedia 2015
CC Blast: Custom Python Module
Query: *.wikipedia.org
808 buckets
15,000 Sub-bucket
~50/2000 wikipedia URLs
Filetypes:
Text – WET
Hyperlinks – WAT
JSON
Source of Truth
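
The slide does not show how CC Blast applies the `*.wikipedia.org` query, so the following is only a minimal sketch of one way such a host filter could work. The function names `matches_host` and `filter_urls` and the sample URLs are hypothetical and are not taken from the CC Blast module.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def matches_host(url, pattern="*.wikipedia.org"):
    """Return True if the URL's host matches the wildcard pattern.

    Hypothetical helper: the real CC Blast matching logic is not shown
    on the slide.
    """
    # urlparse lowercases the hostname; it is None for malformed URLs.
    host = urlparse(url).hostname or ""
    return fnmatchcase(host, pattern)


def filter_urls(urls, pattern="*.wikipedia.org"):
    """Keep only the URLs whose host matches the pattern, preserving order."""
    return [url for url in urls if matches_host(url, pattern)]


if __name__ == "__main__":
    # Made-up sample URLs for illustration only.
    sample = [
        "http://en.wikipedia.org/wiki/Telecommunications",
        "http://example.com/wikipedia.org",
        "https://de.wikipedia.org/wiki/Physik",
    ]
    for url in filter_urls(sample):
        print(url)
```

Matching on the parsed host rather than the full URL string avoids false positives where `wikipedia.org` only appears in the path.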

Challenge:
How to optimize database for low latency querying?
URL (keys)
Documents (values)
[Figure: Network Map – Wikipedia Contributor, QUERY: Telecommunications. Node labels include Tyler, Ben Casey, Barb, Dana.]

Hybrid Database - Schema
Cassandra
Elasticsearch
Date
Value
Cassandra: Key = URL, Order by Rank
URL                 Clustering: Rank    Value: Links
/Data_Science       189                 a-c, a-d, …
/Insight_Data       186                 c-a, c-h, …
/Spark_Streaming    185                 a-b, b-c

Elasticsearch: Index by Text
Key: Text (Property)    Value: Nodes
Texta (Physics)         a,b,d
Textb (Engineering)     a
Textc (Science)         a,c
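
To make the two lookup paths concrete, here is a rough sketch that models both stores as in-memory Python dictionaries. It does not use any Elasticsearch or Cassandra client, and the names `TEXT_INDEX`, `URL_TABLE`, `nodes_for_text`, `urls_by_rank` and `links_for_url` are hypothetical. Descending rank order is an assumption based on the order of the rows on the slide.

```python
# In-memory stand-ins for the two stores. A real system would query
# Elasticsearch and Cassandra through their client libraries (not shown).

# Elasticsearch side: text key -> property label and node list (from the slide).
TEXT_INDEX = {
    "Texta": {"property": "Physics", "nodes": ["a", "b", "d"]},
    "Textb": {"property": "Engineering", "nodes": ["a"]},
    "Textc": {"property": "Science", "nodes": ["a", "c"]},
}

# Cassandra side: URL key -> (rank, links). The slide elides further links
# with "…"; only the links it actually lists are included here.
URL_TABLE = {
    "/Data_Science": (189, [("a", "c"), ("a", "d")]),
    "/Insight_Data": (186, [("c", "a"), ("c", "h")]),
    "/Spark_Streaming": (185, [("a", "b"), ("b", "c")]),
}


def nodes_for_text(text_key):
    """Text lookup: return the nodes stored under a text key, or []."""
    entry = TEXT_INDEX.get(text_key)
    return entry["nodes"] if entry else []


def urls_by_rank(limit=None):
    """Return URLs ordered by rank, highest first (assumed ordering)."""
    ordered = sorted(URL_TABLE.items(), key=lambda item: item[1][0], reverse=True)
    return [url for url, _ in ordered][:limit]


def links_for_url(url):
    """Key lookup: return the link pairs stored for one URL."""
    return URL_TABLE[url][1]


if __name__ == "__main__":
    print(nodes_for_text("Texta"))
    print(urls_by_rank(limit=2))
    print(links_for_url("/Spark_Streaming"))
```

The split mirrors the slide: free-text questions go to the text index, while URL lookups read a single key whose rows are already sorted by rank.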

Engineering Challenges:
Approximation of page rank with low latency
[Figure: network maps for QUERY: ALL and QUERY: Data Engineering. Node labels include Mary, Tyler, Ben Casey, Barb, Dana.]
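
The slides do not say how the page rank approximation was done. One common way to trade accuracy for latency is to run the standard power iteration for a small, fixed number of steps, sketched below. The function name `approximate_pagerank`, the damping factor of 0.85, the iteration count and the example graph are all assumptions for illustration.

```python
def approximate_pagerank(links, damping=0.85, iterations=10):
    """Approximate PageRank with a fixed number of power-iteration steps.

    links: iterable of (source, target) pairs.
    Returns a dict mapping each node to its approximate rank (sums to 1).
    """
    nodes = sorted({node for edge in links for node in edge})
    count = len(nodes)
    outgoing = {node: [] for node in nodes}
    for source, target in links:
        outgoing[source].append(target)

    # Start from a uniform distribution over all nodes.
    rank = {node: 1.0 / count for node in nodes}
    for _ in range(iterations):
        # Every node receives the teleport share first.
        updated = {node: (1.0 - damping) / count for node in nodes}
        for node in nodes:
            targets = outgoing[node]
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    updated[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / count
                for target in nodes:
                    updated[target] += share
        rank = updated
    return rank


if __name__ == "__main__":
    # Made-up four-edge graph for illustration.
    example = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
    for node, score in approximate_pagerank(example).items():
        print(node, round(score, 3))
```

Fewer iterations give a faster but coarser answer, so the iteration count is the main knob for latency in this sketch.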

Data Science Application
Pearson Correlation
Page Rank Correlation to ‘Data Engineering’
Biology, Biotech, Tech
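
For reference, the Pearson correlation between two equal-length lists of scores can be computed as in the sketch below. The function name `pearson` and the sample numbers are illustrative only and are not data from the slides.

```python
import math


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and the two sums of squared deviations.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant list")
    return covariance / math.sqrt(spread_x * spread_y)


if __name__ == "__main__":
    # Made-up rank values for two topics across the same three pages.
    topic_a = [1.0, 2.0, 3.0]
    topic_b = [2.0, 4.0, 6.0]
    print(pearson(topic_a, topic_b))
```

A value near 1 means the two rank lists rise and fall together, a value near -1 means they move in opposite directions, and a value near 0 means no linear relationship.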

Matthew Rubashkin
BioE PhD, UC Berkeley
Origin
