Yahoo! Hack India: Hyderabad 2013 | Building Data Products At Scale
Sanket Patil speaking on Building Data Products At Scale


Published in: Technology, Education

Transcript

  • 1. BUILDING DATA PRODUCTS AT SCALE
  • 2. DATAWEAVE: WHAT WE DO
    • Aggregate large amounts of data publicly available on the web, and serve it to businesses in readily usable forms
    • Serve actionable data through APIs, visualizations, and dashboards
    • Provide a reporting and analytics layer on top of datasets and APIs
  • 3. DATAWEAVE PLATFORM [architecture diagram: pricing data, open government data, and social media data, which are unstructured, spread across sources, and temporally changing, flow into the Big Data Platform, which serves data APIs, API feeds, data services, dashboards, and visualizations and widgets]
  • 4. HOW DOES IT WORK - 1?
    • Crawling/Scraping: from a large number of data sources
    • Cleaning/Deduplication: remove as much noise as possible
    • Data Normalization: represent related data together in standard forms
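The crawl-clean-normalize steps above can be sketched as a tiny pipeline. This is a minimal illustration, not DataWeave's actual code: the record fields, price formats, and the content-hash deduplication are all assumptions made for the example.

```python
import hashlib

# Hypothetical raw records scraped from two retail sites; the field names
# and price formats are illustrative, not DataWeave's actual schema.
raw_records = [
    {"title": "  Acme Phone X ", "price": "Rs. 9,999", "source": "site-a"},
    {"title": "Acme Phone Y", "price": "4999", "source": "site-b"},
    {"title": "Acme Phone X", "price": "Rs. 9,999", "source": "site-a"},  # duplicate
]

def normalize(record):
    """Map a raw record onto a standard internal representation."""
    price = record["price"].replace("Rs.", "").replace(",", "").strip()
    return {"title": record["title"].strip(), "price": float(price)}

def dedupe(records):
    """Drop records whose normalized content hashes to one already seen."""
    seen, out = set(), []
    for rec in records:
        norm = normalize(rec)
        key = hashlib.sha1(f"{norm['title']}|{norm['price']}".encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            out.append(norm)
    return out

clean = dedupe(raw_records)  # three raw records collapse to two clean ones
```

Hashing the normalized form (rather than the raw text) is what lets the same product crawled twice, with cosmetic differences, collapse to one record.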
  • 5. HOW DOES IT WORK - 2?
    • Store/Index: store optimally to support several complex queries
    • Create "views": on top of data for easy consumption, through APIs, visualizations, dashboards, and reports
    • Package data as a product: to solve a set of related pain points in a certain domain (e.g., PriceWeave for retail)
  • 6. AGGREGATION AND EXTRACTION [diagram: public data on the web feeds the aggregation layer (distributed crawler infrastructure), which feeds the extraction layer (offline extraction of factual data)]
  • 7. AGGREGATION LAYER
    • Customized crawler infrastructure: vertical-specific crawlers, capable of crawling the "deep web"
    • Highly scalable: 500+ websites on a daily basis; more with the addition of hardware
    • Robust to failures (404s, timeouts, server restarts): stateless distributed workers; crawl state maintained in a separate data store
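The "stateless workers, state in a separate store" design can be sketched as follows. This is a toy model under stated assumptions: the shared store is a plain dict standing in for an external data store, `fetch()` is a stand-in for a real HTTP client, and the retry limit is invented for the example.

```python
# Minimal sketch of a stateless crawl worker. All crawl state lives in a
# shared store (a dict here; a real deployment would use an external store),
# so any worker, including one restarted after a crash, can resume the crawl.

MAX_RETRIES = 3

def fetch(url):
    """Stand-in fetcher: pretend any URL containing 'bad' returns a 404."""
    if "bad" in url:
        raise IOError("404 Not Found")
    return f"<html>{url}</html>"

def run_worker(state):
    """Process pending URLs; the worker itself keeps no state."""
    for url, meta in state.items():
        if meta["status"] != "pending":
            continue
        try:
            meta["body"] = fetch(url)
            meta["status"] = "done"
        except IOError:
            meta["retries"] += 1
            meta["status"] = "pending" if meta["retries"] < MAX_RETRIES else "failed"

state = {
    "http://example.com/a": {"status": "pending", "retries": 0},
    "http://example.com/bad": {"status": "pending", "retries": 0},
}
for _ in range(MAX_RETRIES):  # successive passes may be different workers
    run_worker(state)
```

Because failure counts and statuses live in the store rather than in worker memory, 404s and timeouts only ever mark a URL for retry; nothing is lost when a worker dies.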
  • 8. DATA EXTRACTION LAYER
    • Extract as many data points from crawled pages as possible
    • Completely offline process, independent of crawling
    • Highly parallelized -- scales in a straightforward manner
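Because extraction runs offline over already-stored pages, it parallelizes as a plain map over the page set. A minimal sketch, with invented page snippets and regex-based extraction standing in for whatever parsers are actually used:

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Pages were already crawled and stored; extraction is independent of the
# crawler and just maps an extractor over them in parallel.
pages = [
    '<span class="price">9999</span><h1>Acme Phone X</h1>',
    '<span class="price">4999</span><h1>Acme Phone Y</h1>',
]

def extract(html):
    """Pull factual data points (title, price) out of one stored page."""
    price = re.search(r'class="price">(\d+)<', html)
    title = re.search(r"<h1>(.*?)</h1>", html)
    return {"title": title.group(1), "price": int(price.group(1))}

with ThreadPoolExecutor(max_workers=4) as pool:
    records = list(pool.map(extract, pages))
```

Since each page is extracted independently, scaling out is just a matter of adding workers, which is the "straightforward" scaling the slide refers to.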
  • 9. NORMALIZATION [diagram: the extraction layer (offline extraction of factual data) feeds the normalization layer, which uses machine learning techniques and a knowledge base to remove noise, fill gaps in data, represent data, and cluster it]
  • 10. NORMALIZATION LAYER
    • Remove noise, remove duplicates
    • Gather data from multiple sources and fill "gaps" in info
    • Normalize data points to a standard internal representation
    • Cluster related data together (machine learning techniques)
    • Build a "knowledge base" -- continuous learning
    • "Human in the loop" for data validation
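The clustering step can be illustrated with a deliberately simple stand-in: grouping product names by a canonical key. The real system uses machine learning techniques per the slide; the token-sorting key below is an assumption made purely to show the shape of the operation.

```python
from collections import defaultdict

def canonical_key(name):
    """Normalize a product name into a cluster key (lowercase, sorted tokens)."""
    return " ".join(sorted(name.lower().replace("-", " ").split()))

# Illustrative names for the same product crawled from different sources.
names = ["Acme Phone-X 16GB", "acme 16gb phone x", "Acme Phone Y"]

clusters = defaultdict(list)
for n in names:
    clusters[canonical_key(n)].append(n)
# Two clusters: the first two names land together, the third stands alone.
```

In practice this is where the "human in the loop" pays off: borderline clusters get validated and fed back into the knowledge base, so the clustering improves continuously.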
  • 11. DATA STORAGE AND SERVING [diagram: distributed data storage (crawl snapshots, processed data, clustered data) feeds a highly responsive serving layer (indexes, views, filters, pre-computed results), which backs data APIs, visualizations, dashboards, and reports]
  • 12. DATA STORAGE LAYER
    • Store snapshots of crawl data -- never throw away raw data!
    • Store processed data -- both individual data points and "clusters" of related data points
    • Distributed data stores
    • Highly scalable -- add more hardware
    • Highly available -- replication
  • 13. SERVING LAYER
    • This is the system as far as a user is concerned! It must be highly responsive.
    • Process data offline and periodically push it to the serving layer:
      • create indexes for fast data retrieval
      • create views to serve queries that are known a priori
      • minimize online computation to the extent possible
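The offline-view idea above can be sketched in a few lines: an offline job folds processed records into a precomputed index, and the online path is a pure lookup. The "lowest price per product" view is an invented example, chosen only because it fits the retail use case mentioned earlier.

```python
# Processed records as they might leave the normalization layer (illustrative).
processed = [
    {"title": "Acme Phone X", "price": 9999, "source": "site-a"},
    {"title": "Acme Phone X", "price": 9499, "source": "site-b"},
    {"title": "Acme Phone Y", "price": 4999, "source": "site-a"},
]

def build_min_price_view(records):
    """Offline: fold records into a precomputed lowest-price index."""
    view = {}
    for rec in records:
        cur = view.get(rec["title"])
        if cur is None or rec["price"] < cur:
            view[rec["title"]] = rec["price"]
    return view

view = build_min_price_view(processed)  # periodically pushed to serving

def serve_query(title):
    """Online: answer a known-a-priori query with an index lookup only."""
    return view.get(title)
```

All the folding happens offline; the user-facing call does no computation at query time, which is what keeps the serving layer responsive.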
  • 14. DATAWEAVE PLATFORM [recap of the platform architecture diagram from slide 3]
  • 15. THANK YOU Sanket Patil | sanket@dataweave.in | +91-9900063093 | 2013 DataWeave | On Facebook: www.facebook.com/DataWeave | Catch us on Twitter: @dataweavein | www.dataweave.in