Building a Business on Hadoop, HBase, and Open Source Distributed Computing
Jul. 27, 2009
This talk presents a fundamental approach to thinking about scalability and shows how Hadoop, HBase, and Lucene enable companies to process very large amounts of data. It also covers how social media is making the traditional RDBMS irrelevant.
Social Media and Scaling
•Scalability Matters Now.
•Social media produces large, complex data
•Anyone can collect the web
•Make a Twitter in a few days
•Easy to get TBs of data
•Big Data enabling new fields for companies
Avoiding Impedance Mismatch
•Most problems can be divided into High or Low latency
•Get a lot of data eventually, or a little now
•MapReduce vs. Sharding/Indexing
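The trade-off on this slide can be sketched in a few lines of Python. This is a toy illustration only, not how Hadoop or Lucene are implemented: `batch_count` stands in for a high-latency, MapReduce-style job that reads everything, while `build_index` and `lookup` stand in for a low-latency index. The sample posts and all function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical sample data standing in for collected posts.
POSTS = {
    1: "iron maiden rules",
    2: "hadoop crunches data",
    3: "iron man is a movie",
}

def batch_count(posts):
    """High latency: scan every record to answer a question about all of them."""
    counts = Counter()
    for text in posts.values():
        counts.update(text.split())
    return counts

def build_index(posts):
    """Pay the scan cost once, up front, to build an inverted index."""
    index = defaultdict(set)
    for post_id, text in posts.items():
        for word in text.split():
            index[word].add(post_id)
    return index

def lookup(index, word):
    """Low latency: fetch a small answer without touching the raw data."""
    return sorted(index.get(word, ()))

word_counts = batch_count(POSTS)     # a lot of data, eventually
index = build_index(POSTS)
iron_posts = lookup(index, "iron")   # a little data, now
```

Matching each question to the right path (full scan or index lookup) is one way to read "avoiding impedance mismatch".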
Ecosystem
[Diagram: layered stack]
•Compiled Processing: Pig, Cascading, Hive
•Katta / Applications
•Raw Processing: Zookeeper, MapReduce
•Structured Storage: HBase
•Unstructured Storage: Hadoop DFS
Simple Workflow
[Diagram. Stages: Collect; Unstructured Analysis; Semantic Analysis; Structured Analysis; Store in HBase; Indexing; Store in Hadoop; Load/Replicate Indexes; Pull Shards; Search. Technology labels: Hadoop; Hadoop + HBase; Lucene + Solr + Katta.]
Unstructured Processing Cluster
[Diagram. Labels: Internet; Collect; Unstructured Analysis; Semantic Analysis; Structured; Store; HTML; XML; HBase Records.]
Hadoop + MR
•Special: Crunch web-scale data fast
•Sacrifice: Low-Latency, Transactions, Random Access, Updates
•Structure: Chunked flat files
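As a rough illustration of the processing model behind these bullets, here is a minimal in-process sketch of map, shuffle, and reduce over chunked input. It does not use the Hadoop API; `CHUNKS`, `map_chunk`, `shuffle`, and `reduce_groups` are hypothetical names, and each inner list stands in for one block of a large flat file.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical chunks: each one stands in for a block of a large flat file.
CHUNKS = [
    ["hadoop hbase", "hadoop lucene"],
    ["hbase hadoop"],
]

def map_chunk(lines):
    """Map: emit a (word, 1) pair for every word in one chunk."""
    return [(word, 1) for line in lines for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_groups(groups):
    """Reduce: collapse each key's values to a single total."""
    return {key: sum(values) for key, values in groups.items()}

# Each chunk is mapped independently, so the work can be spread across machines.
mapped = chain.from_iterable(map_chunk(chunk) for chunk in CHUNKS)
totals = reduce_groups(shuffle(mapped))
```

Every step reads or writes whole chunks sequentially. Nothing here updates a record in place or looks one up by key, which mirrors the sacrifices listed on the slide.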
Structured Processing Cluster
[Diagram. Labels: Enriched Data; Structured Analysis; Unstructured Cluster; Store in HBase; Indexing; Store in Hadoop; Search Cluster; HBase Records; Sharded Lucene Index.]
Lucene Index
Document Structure
ContentID: 00BAC189
Title: Iron Maiden Rules
Body: I think Janick Gers is an amazing guitarist blah blah
PostDT: 20090718
ParentID: 0FDEADBEEF
Permalink: www.roadtofailure.com/post?=20
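One way to picture this document in code is as a small typed record. The sketch below uses plain Python, not the Lucene API. The snake_case field names are a hypothetical mapping of the slide's fields, `PostDT` is assumed to be formatted as yyyymmdd, and `shard_for` is an illustrative routing rule: the slides do not say how documents are assigned to index shards.

```python
from dataclasses import dataclass
from datetime import date, datetime
import zlib

@dataclass(frozen=True)
class PostDocument:
    content_id: str
    title: str
    body: str
    post_dt: str      # kept as the raw string from the slide
    parent_id: str
    permalink: str

    def posted_on(self) -> date:
        """Parse PostDT, assuming it is formatted as yyyymmdd."""
        return datetime.strptime(self.post_dt, "%Y%m%d").date()

def shard_for(doc: PostDocument, num_shards: int) -> int:
    """Hypothetical routing rule: stable hash of ContentID modulo shard count."""
    return zlib.crc32(doc.content_id.encode("ascii")) % num_shards

doc = PostDocument(
    content_id="00BAC189",
    title="Iron Maiden Rules",
    body="I think Janick Gers is an amazing guitarist blah blah",
    post_dt="20090718",
    parent_id="0FDEADBEEF",
    permalink="www.roadtofailure.com/post?=20",
)
shard = shard_for(doc, num_shards=4)
```

A stable hash such as CRC32 is used instead of Python's built-in `hash` so that the same ContentID maps to the same shard across processes.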