Yahoo! Hack India: Hyderabad 2013 | Building Data Products At Scale

Sanket Patil speaking on Building Data Products At Scale

Published in: Technology, Education

  1. BUILDING DATA PRODUCTS AT SCALE
  2. DATAWEAVE: WHAT WE DO?
     • Aggregate large amounts of data publicly available on the web, and serve it to businesses in readily usable forms
     • Serve actionable data through APIs, Visualizations, and Dashboards
     • Provide a reporting and analytics layer on top of datasets and APIs
  3. DATAWEAVE PLATFORM
     [Architecture diagram] Inputs -- pricing data, open government data, social media data, product attributes; unstructured, spread across sources, and temporally changing -- feed the Big Data Platform. Outputs: API feeds, data services, dashboards, visualizations and widgets, data APIs.
  4. HOW DOES IT WORK - 1?
     • Crawling/Scraping: from a large number of data sources
     • Cleaning/Deduplication: remove as much noise as possible
     • Data Normalization: represent related data together in standard forms
  5. HOW DOES IT WORK - 2?
     • Store/Index: store optimally to support several complex queries
     • Create "Views": on top of data for easy consumption, through APIs, visualizations, dashboards, and reports
     • Package data as a product: to solve a bunch of related pain points in a certain domain (e.g., PriceWeave for retail)
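The pipeline described on the two slides above -- crawl, clean, normalize, index, serve views -- can be sketched as a chain of small functions. This is a minimal illustration, not DataWeave's actual code; all function and field names are hypothetical:

```python
# Toy crawl -> clean -> normalize -> index -> view pipeline.
# All names and record shapes are illustrative.

def crawl(sources):
    """Fetch raw records from each source (stubbed here)."""
    return [{"source": s, "title": f" Widget from {s} ", "price": "10"} for s in sources]

def clean(records):
    """Strip noise and drop duplicates by a normalized key."""
    seen, out = set(), []
    for r in records:
        key = r["title"].strip().lower()
        if key not in seen:
            seen.add(key)
            out.append({**r, "title": r["title"].strip()})
    return out

def normalize(records):
    """Coerce fields to a standard internal representation."""
    return [{**r, "price": float(r["price"])} for r in records]

def index(records):
    """Store records keyed for the queries we expect to serve."""
    return {r["title"].lower(): r for r in records}

def view(idx, title):
    """A 'view' over indexed data, ready for an API or dashboard."""
    return idx.get(title.lower())

idx = index(normalize(clean(crawl(["siteA", "siteB"]))))
```

Each stage consumes the previous stage's output, so stages can be run, scaled, and rerun independently -- the property the later slides rely on.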
  6. AGGREGATION AND EXTRACTION
     [Architecture diagram] Public data on the web feeds the Aggregation Layer (distributed crawler infrastructure), which feeds the Extraction Layer (offline extraction of factual data).
  7. AGGREGATION LAYER
     Customized crawler infrastructure
     • vertical-specific crawlers
     • capable of crawling the "deep web"
     Highly scalable
     • 500+ websites on a daily basis
     • more with the addition of hardware
     Robust to failures (404s, timeouts, server restarts)
     • stateless distributed workers
     • crawl state maintained in a separate data store
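The "stateless workers, crawl state in a separate data store" design on this slide can be sketched as below. The queue and state store are in-memory stand-ins for what would realistically be a message broker and a database; all names are illustrative:

```python
import queue

# Crawl state lives outside the workers: any worker can pick up any URL,
# and a crashed worker loses nothing -- the store still knows what's pending.
state_store = {}          # url -> "pending" | "done" | "failed"
url_queue = queue.Queue()

def enqueue(urls):
    for u in urls:
        state_store[u] = "pending"
        url_queue.put(u)

def fetch(url):
    """Stub fetch; a real worker issues an HTTP request and may hit
    404s, timeouts, or server restarts."""
    if "bad" in url:
        raise IOError("404")
    return f"<html>{url}</html>"

def worker():
    """Stateless: everything it needs comes from the queue and the store."""
    while not url_queue.empty():
        url = url_queue.get()
        try:
            fetch(url)
            state_store[url] = "done"
        except IOError:
            state_store[url] = "failed"  # a retry policy could re-enqueue here

enqueue(["http://a.example/1", "http://bad.example/2"])
worker()
```

Because workers hold no state of their own, scaling to more websites is a matter of adding hardware, as the slide claims.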
  8. DATA EXTRACTION LAYER
     • Extract as many data points from crawled pages as possible
     • Completely offline process, independent of crawling
     • Highly parallelized -- scales in a straightforward manner
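Extraction parallelizes trivially because each crawled page is processed independently of the others. A sketch using a thread pool (a production system would shard across processes or machines; the regex and field names are made up for illustration):

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pattern for one data point on a crawled page.
PRICE_RE = re.compile(r'data-price="([\d.]+)"')

def extract(page):
    """Pull as many data points as possible from one crawled page."""
    m = PRICE_RE.search(page)
    return {"price": float(m.group(1)) if m else None}

# Stand-ins for stored crawl snapshots.
pages = [f'<div data-price="{i}.50"></div>' for i in range(4)]

# Pages are independent, so throughput scales with the workers added.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(extract, pages))
```

Running extraction offline, against stored snapshots rather than live fetches, is what decouples it from crawling: either side can be rerun or scaled without touching the other.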
  9. NORMALIZATION
     [Architecture diagram] The Extraction Layer (offline extraction of factual data) feeds the Normalization Layer, which applies machine learning techniques -- remove noise, fill gaps in data, represent data, clustering -- backed by a knowledge base.
  10. NORMALIZATION LAYER
      • Remove noise, remove duplicates
      • Gather data from multiple sources and fill "gaps" in info
      • Normalize data points to a standard internal representation
      • Cluster related data together (machine learning techniques)
      • Build a "knowledge base" -- continuous learning
      • "Human in the loop" for data validation
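A toy version of the clustering step: group records whose titles are near-duplicates, here using stdlib string similarity as a stand-in for the machine-learning techniques the slide mentions. The threshold is arbitrary and the titles are invented:

```python
from difflib import SequenceMatcher

def similar(a, b, threshold=0.8):
    """Crude similarity; real systems use learned features instead."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def cluster(titles):
    """Greedy clustering: attach each title to the first similar cluster."""
    clusters = []
    for t in titles:
        for c in clusters:
            if similar(t, c[0]):
                c.append(t)
                break
        else:
            clusters.append([t])
    return clusters

titles = ["Apple iPhone 5 16GB", "apple iphone 5 16 gb", "Samsung Galaxy S4"]
groups = cluster(titles)
```

Clusters like these are what let the same product aggregated from multiple sources be merged, with each source filling gaps in the others' data.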
  11. DATA STORAGE AND SERVING
      [Architecture diagram] Distributed data storage (crawl snapshots, processed data, clustered data) feeds a highly responsive Serving Layer (indexes, views, filters, pre-computed results), which backs data APIs, visualizations, dashboards, and reports.
  12. DATA STORAGE LAYER
      • Store snapshots of crawl data -- never throw away raw data!
      • Store processed data -- both individual data points as well as "clusters" of related data points
      • Distributed data stores
      • Highly scalable -- add more hardware
      • Highly available -- replication
  13. SERVING LAYER
      This is the system as far as a user is concerned! Must be highly responsive.
      Process data offline and periodically push it to the serving layer:
      • create indexes for fast data retrieval
      • create views to serve queries that are known a priori
      • minimize computation to the extent possible
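The "process offline, push to serving" idea in miniature: a batch job builds indexes and precomputes answers to queries known a priori, so serving is a dictionary lookup rather than a scan. All record shapes and names below are hypothetical:

```python
from collections import defaultdict

# Stand-ins for processed records from the storage layer.
records = [
    {"sku": "a1", "category": "shoes", "price": 50.0},
    {"sku": "a2", "category": "shoes", "price": 70.0},
    {"sku": "b1", "category": "bags",  "price": 90.0},
]

def build_views(records):
    """Offline batch job: index by sku and precompute per-category stats."""
    by_sku = {r["sku"]: r for r in records}
    by_cat = defaultdict(list)
    for r in records:
        by_cat[r["category"]].append(r["price"])
    # A query known a priori: average price per category.
    avg_price = {c: sum(ps) / len(ps) for c, ps in by_cat.items()}
    return {"by_sku": by_sku, "avg_price": avg_price}

views = build_views(records)

def get_avg_price(category):
    """Serving path: no computation at request time, just a lookup."""
    return views["avg_price"].get(category)
```

The expensive work happens in the periodic batch job; the request path touches only precomputed structures, which is what keeps the serving layer responsive.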
  14. DATAWEAVE PLATFORM
      [Architecture diagram, repeated from slide 3] Inputs -- pricing data, open government data, social media data, product attributes; unstructured, spread across sources, and temporally changing -- feed the Big Data Platform. Outputs: API feeds, data services, dashboards, visualizations and widgets, data APIs.
  15. THANK YOU
      Sanket Patil
      sanket@dataweave.in
      +91-9900063093
      2013 DataWeave
      On Facebook: www.facebook.com/DataWeave
      Catch us on Twitter: @dataweavein
      www.dataweave.in
