Yahoo! Hack India: Hyderabad 2013 | Building Data Products At Scale
Sanket Patil speaking on Building Data Products At Scale


Published in: Technology, Education

Transcript

  • 1. BUILDING DATA PRODUCTS AT SCALE
  • 2. DATAWEAVE: WHAT WE DO
    • Aggregate large amounts of data publicly available on the web, and serve it to businesses in readily usable forms
    • Serve actionable data through APIs, visualizations, and dashboards
    • Provide a reporting and analytics layer on top of datasets and APIs
  • 3. DATAWEAVE PLATFORM [architecture diagram: pricing data, open government data, and social media data, which are unstructured, spread across sources, and temporally changing, flow into the Big Data Platform, which serves data APIs, API feeds, data services, dashboards, and visualizations and widgets]
  • 4. HOW DOES IT WORK - 1?
    • Crawling/Scraping: from a large number of data sources
    • Cleaning/Deduplication: remove as much noise as possible
    • Data Normalization: represent related data together in standard forms
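The crawl-clean-normalize steps above can be sketched as a tiny pipeline. This is a minimal illustration, not DataWeave's actual code: the record fields, price formats, and the content-hash deduplication are all assumptions made for the example.

```python
import hashlib

# Hypothetical raw records scraped from two retail sites; the field names
# and price formats are illustrative, not DataWeave's actual schema.
raw_records = [
    {"title": "  Acme Phone X ", "price": "Rs. 9,999", "source": "site-a"},
    {"title": "Acme Phone Y", "price": "4999", "source": "site-b"},
    {"title": "Acme Phone X", "price": "Rs. 9,999", "source": "site-a"},  # duplicate
]

def normalize(record):
    """Map a raw record onto a standard internal representation."""
    price = record["price"].replace("Rs.", "").replace(",", "").strip()
    return {"title": record["title"].strip(), "price": float(price)}

def dedupe(records):
    """Drop records whose normalized content hashes to one already seen."""
    seen, out = set(), []
    for rec in records:
        norm = normalize(rec)
        key = hashlib.sha1(f"{norm['title']}|{norm['price']}".encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            out.append(norm)
    return out

clean = dedupe(raw_records)  # three raw records collapse to two clean ones
```

Hashing the normalized form (rather than the raw text) is what lets the same product crawled twice, with cosmetic differences, collapse to one record.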
  • 5. HOW DOES IT WORK - 2?
    • Store/Index: store optimally to support several complex queries
    • Create "views": on top of data for easy consumption, through APIs, visualizations, dashboards, and reports
    • Package data as a product: to solve a set of related pain points in a certain domain (e.g., PriceWeave for retail)
  • 6. AGGREGATION AND EXTRACTION [diagram: public data on the web feeds the aggregation layer (distributed crawler infrastructure), which feeds the extraction layer (offline extraction of factual data)]
  • 7. AGGREGATION LAYER
    • Customized crawler infrastructure: vertical-specific crawlers, capable of crawling the "deep web"
    • Highly scalable: 500+ websites on a daily basis; more with the addition of hardware
    • Robust to failures (404s, timeouts, server restarts): stateless distributed workers; crawl state maintained in a separate data store
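The "stateless workers, state in a separate store" design can be sketched as follows. This is a toy model under stated assumptions: the shared store is a plain dict standing in for an external data store, `fetch()` is a stand-in for a real HTTP client, and the retry limit is invented for the example.

```python
# Minimal sketch of a stateless crawl worker. All crawl state lives in a
# shared store (a dict here; a real deployment would use an external store),
# so any worker, including one restarted after a crash, can resume the crawl.

MAX_RETRIES = 3

def fetch(url):
    """Stand-in fetcher: pretend any URL containing 'bad' returns a 404."""
    if "bad" in url:
        raise IOError("404 Not Found")
    return f"<html>{url}</html>"

def run_worker(state):
    """Process pending URLs; the worker itself keeps no state."""
    for url, meta in state.items():
        if meta["status"] != "pending":
            continue
        try:
            meta["body"] = fetch(url)
            meta["status"] = "done"
        except IOError:
            meta["retries"] += 1
            meta["status"] = "pending" if meta["retries"] < MAX_RETRIES else "failed"

state = {
    "http://example.com/a": {"status": "pending", "retries": 0},
    "http://example.com/bad": {"status": "pending", "retries": 0},
}
for _ in range(MAX_RETRIES):  # successive passes may be different workers
    run_worker(state)
```

Because failure counts and statuses live in the store rather than in worker memory, 404s and timeouts only ever mark a URL for retry; nothing is lost when a worker dies.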
  • 8. DATA EXTRACTION LAYER
    • Extract as many data points from crawled pages as possible
    • Completely offline process, independent of crawling
    • Highly parallelized -- scales in a straightforward manner
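Because extraction runs offline over already-stored pages, it parallelizes as a plain map over the page set. A minimal sketch, with invented page snippets and regex-based extraction standing in for whatever parsers are actually used:

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Pages were already crawled and stored; extraction is independent of the
# crawler and just maps an extractor over them in parallel.
pages = [
    '<span class="price">9999</span><h1>Acme Phone X</h1>',
    '<span class="price">4999</span><h1>Acme Phone Y</h1>',
]

def extract(html):
    """Pull factual data points (title, price) out of one stored page."""
    price = re.search(r'class="price">(\d+)<', html)
    title = re.search(r"<h1>(.*?)</h1>", html)
    return {"title": title.group(1), "price": int(price.group(1))}

with ThreadPoolExecutor(max_workers=4) as pool:
    records = list(pool.map(extract, pages))
```

Since each page is extracted independently, scaling out is just a matter of adding workers, which is the "straightforward" scaling the slide refers to.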
  • 9. NORMALIZATION [diagram: the extraction layer (offline extraction of factual data) feeds the normalization layer, which uses machine learning techniques and a knowledge base to remove noise, fill gaps in data, represent data, and cluster it]
  • 10. NORMALIZATION LAYER
    • Remove noise, remove duplicates
    • Gather data from multiple sources and fill "gaps" in info
    • Normalize data points to a standard internal representation
    • Cluster related data together (machine learning techniques)
    • Build a "knowledge base" -- continuous learning
    • "Human in the loop" for data validation
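The clustering step can be illustrated with a deliberately simple stand-in: grouping product names by a canonical key. The real system uses machine learning techniques per the slide; the token-sorting key below is an assumption made purely to show the shape of the operation.

```python
from collections import defaultdict

def canonical_key(name):
    """Normalize a product name into a cluster key (lowercase, sorted tokens)."""
    return " ".join(sorted(name.lower().replace("-", " ").split()))

# Illustrative names for the same product crawled from different sources.
names = ["Acme Phone-X 16GB", "acme 16gb phone x", "Acme Phone Y"]

clusters = defaultdict(list)
for n in names:
    clusters[canonical_key(n)].append(n)
# Two clusters: the first two names land together, the third stands alone.
```

In practice this is where the "human in the loop" pays off: borderline clusters get validated and fed back into the knowledge base, so the clustering improves continuously.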
  • 11. DATA STORAGE AND SERVING [diagram: distributed data storage (crawl snapshots, processed data, clustered data) feeds a highly responsive serving layer (indexes, views, filters, pre-computed results), which backs data APIs, visualizations, dashboards, and reports]
  • 12. DATA STORAGE LAYER
    • Store snapshots of crawl data -- never throw away raw data!
    • Store processed data -- both individual data points and "clusters" of related data points
    • Distributed data stores
    • Highly scalable -- add more hardware
    • Highly available -- replication
  • 13. SERVING LAYER
    • This is the system as far as a user is concerned! It must be highly responsive.
    • Process data offline and periodically push it to the serving layer:
      • create indexes for fast data retrieval
      • create views to serve queries that are known a priori
      • minimize online computation to the extent possible
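The offline-view idea above can be sketched in a few lines: an offline job folds processed records into a precomputed index, and the online path is a pure lookup. The "lowest price per product" view is an invented example, chosen only because it fits the retail use case mentioned earlier.

```python
# Processed records as they might leave the normalization layer (illustrative).
processed = [
    {"title": "Acme Phone X", "price": 9999, "source": "site-a"},
    {"title": "Acme Phone X", "price": 9499, "source": "site-b"},
    {"title": "Acme Phone Y", "price": 4999, "source": "site-a"},
]

def build_min_price_view(records):
    """Offline: fold records into a precomputed lowest-price index."""
    view = {}
    for rec in records:
        cur = view.get(rec["title"])
        if cur is None or rec["price"] < cur:
            view[rec["title"]] = rec["price"]
    return view

view = build_min_price_view(processed)  # periodically pushed to serving

def serve_query(title):
    """Online: answer a known-a-priori query with an index lookup only."""
    return view.get(title)
```

All the folding happens offline; the user-facing call does no computation at query time, which is what keeps the serving layer responsive.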
  • 14. DATAWEAVE PLATFORM [recap of the platform architecture diagram from slide 3]
  • 15. THANK YOU Sanket Patil | sanket@dataweave.in | +91-9900063093 | 2013 DataWeave | On Facebook: www.facebook.com/DataWeave | Catch us on Twitter: @dataweavein | www.dataweave.in