Building a Business on Hadoop, HBase, and Open Source Distributed Computing

This is a talk on a fundamental approach to thinking about scalability, and how Hadoop, HBase, and Lucene are enabling companies to process amazing amounts of data. It's also about how Social Media is making the traditional RDBMS irrelevant.

Usage Rights

CC Attribution-NoDerivs License

Presentation Transcript

  • Building a Business on Open Source Distributed Computing
      • company: www.visibletechnologies.com
      • blog: www.roadtofailure.com
      • twitter: @lusciouspear
  • Social Media and Scaling
      • Scalability matters now.
      • Social media produces large, complex data
      • Anyone can collect the web
      • Make a Twitter in a few days
      • Easy to get TBs of data
      • Big Data is enabling new fields for companies
  • What Visible Does
      • BI and Brand Management on Social Media
      • Listen, Monitor, Engage
  • Old Product: RDBMS
      • A few MSSQL servers on boxes
      • Lots of ETL
      • Several TB, inserts slow, deletes impossible, random failures
  • Why RDBMS Bad
      • Nonlinear scale cost
      • Used as a storage abstraction
      • Mainly Select, Join, Group, Count
      • Specialized scale-out ones are ‘meh’
      • Impedance mismatch: tries to be both high-throughput and low-latency
      • Swiss-army knife: unstable, transactions, advanced SQL, tuning
  • Why OSS?
      • Previously all MS
      • It exists!
      • Scaling + Licensing = No
      • Can’t build a platform without source
      • It’s Enterprise Now!
  • Goals for New Platform
      • “Golden Timeline”
      • Search/Analyze *any* data
      • Linear Cost
      • Not Hacked Together
      • “Collect the Social Internet”
  • HOW TO SCALE
      • What makes you special?
      • What are you willing to sacrifice?
      • How will you structure the data?
  • Avoiding Impedance Mismatch
      • Most problems can be divided into High or Low latency
      • Get a lot of data eventually, or a little now
      • MapReduce vs. Sharding/Indexing
  • Ecosystem (diagram of the stack, top to bottom)
      • Compiled Processing: Pig, Cascading, Hive
      • Applications: Katta / Zookeeper
      • Raw Processing: MapReduce
      • Structured Storage: HBase
      • Unstructured Storage: Hadoop DFS
  • Simple Workflow (diagram)
      • Stages: Collect, Semantic Analysis, Unstructured Analysis, Store in HBase, Structured Analysis, Indexing, Store in Hadoop, Pull Indexes, Load/Replicate Shards, Search
      • Tools: Hadoop, Hadoop + HBase, Lucene + Solr + Katta
  • Unstructured Processing Cluster (diagram)
      • Stages: Internet, Collect, Store, Semantic Analysis, Unstructured Analysis, Structured, HBase
      • Data: HTML, XML, Records
  • Hadoop + MR
      • Special: Crunch web-scale data fast
      • Sacrifice: Low-Latency, Transactions, Random Access, Updates
      • Structure: Chunked flat files
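
The trade-off on the Hadoop + MR slide can be illustrated with a toy, single-process sketch of the map, shuffle, and reduce steps over chunked flat files. This is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample lines are assumptions made up for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Emit (word, 1) for every word in one chunk of a flat file."""
    for line in chunk:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(chunks):
    """Run map over every chunk, then shuffle, then reduce each group."""
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Hypothetical input: each inner list stands in for one chunk of a flat file.
chunks = [
    ["iron maiden rules", "maiden tour dates"],
    ["iron maiden live"],
]
counts = run_job(chunks)
```

Each chunk is read once, front to back, and nothing is updated in place, which is why this model gives up low latency, random access, and updates in exchange for throughput.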
  • Structured Processing Cluster (diagram)
      • Stages: Enriched Data, Structured Analysis, Store in HBase, Indexing, Store in Hadoop
      • Connected clusters: Unstructured Cluster, Search Cluster
      • Data: HBase Records, Lucene Index, Sharded Lucene Index
  • Document Structure
      • ContentID: 00BAC189
      • Title: Iron Maiden Rules
      • Body: I think Janick Gers is an amazing guitarist blah blah
      • PostDT: 20090718
      • ParentID: 0FDEADBEEF
      • Permalink: www.roadtofailure.com/post?=20
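
As a rough sketch, the document record on the slide above could be modeled as a typed structure and flattened into a row key plus column cells. The field names mirror the slide; the `Document` class, the `to_row` helper, and the column family name `d` are assumptions for illustration, not the talk's actual schema.

```python
from dataclasses import dataclass, asdict
from datetime import date, datetime


@dataclass(frozen=True)
class Document:
    content_id: str
    title: str
    body: str
    post_dt: date
    parent_id: str
    permalink: str


def parse_post_dt(raw):
    """Parse the slide's YYYYMMDD date format."""
    return datetime.strptime(raw, "%Y%m%d").date()


def to_row(doc, family="d"):
    """Flatten a document into (row_key, {"family:qualifier": bytes}).

    The column family name is a hypothetical choice.
    """
    fields = asdict(doc)
    row_key = fields.pop("content_id")
    cells = {
        f"{family}:{name}": str(value).encode("utf-8")
        for name, value in fields.items()
    }
    return row_key, cells


doc = Document(
    content_id="00BAC189",
    title="Iron Maiden Rules",
    body="I think Janick Gers is an amazing guitarist blah blah",
    post_dt=parse_post_dt("20090718"),
    parent_id="0FDEADBEEF",
    permalink="www.roadtofailure.com/post?=20",
)
row_key, cells = to_row(doc)
```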
  • HBase
      • Special: Scalable random/sequential access almost as fast as RDBMS
      • Sacrifice: Joins, Secondary Indexes, Transactions (kind of)
      • Structure: BigTable - column oriented
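
To make the HBase slide's "random/sequential access" concrete, here is a toy in-memory table that keeps row keys sorted, so a single-row lookup and a key-range scan are both cheap. It is a sketch of the BigTable-style idea only, not the HBase client API; `ToyTable`, the `date#id` row-key layout, and the column names are assumptions.

```python
import bisect


class ToyTable:
    """Rows sorted by key; each row maps "family:qualifier" to a value."""

    def __init__(self):
        self._keys = []  # sorted row keys
        self._rows = {}  # row key -> {column: value}

    def put(self, row_key, column, value):
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Random access: fetch one row by key."""
        return dict(self._rows.get(row_key, {}))

    def scan(self, start, stop):
        """Sequential access: yield rows with start <= key < stop."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, dict(self._rows[key])


table = ToyTable()
table.put("20090719#B2", "d:title", "Later post")
table.put("20090718#00BAC189", "d:title", "Iron Maiden Rules")
table.put("20090718#00BAC18A", "d:title", "Another post")
table.put("20090717#A1", "d:title", "Earlier post")

one_day = [key for key, _ in table.scan("20090718", "20090719")]
```

There is no join and no secondary index here: any query that is not a key lookup or key-range scan has to be designed into the row key, which is the sacrifice the slide names.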
  • Search Cluster (diagram)
      • Stages: Pull Indexes from HDFS, Load/Replicate Shards, Search
      • Data: Lucene Indexes
  • Search
  • Katta + Solr
      • Special: Sharded search
      • Sacrifice: Consistency, high-throughput
      • Structure: Reverse (inverted) index
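
A minimal sketch of sharded search over an inverted index, assuming documents are assigned to shards by a checksum of their ID and a query fans out to every shard before merging hits. The `Shard` and `ShardedIndex` classes are hypothetical and do not reflect the Katta, Solr, or Lucene APIs.

```python
import zlib
from collections import defaultdict


class Shard:
    """One inverted index: term -> set of document IDs."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, term):
        return set(self.postings.get(term.lower(), set()))


class ShardedIndex:
    def __init__(self, num_shards):
        self.shards = [Shard() for _ in range(num_shards)]

    def add(self, doc_id, text):
        # crc32 gives a stable shard assignment across runs (an assumption;
        # any deterministic partitioning function would do).
        slot = zlib.crc32(doc_id.encode("utf-8")) % len(self.shards)
        self.shards[slot].add(doc_id, text)

    def search(self, term):
        """Fan the query out to every shard and merge the hits."""
        hits = set()
        for shard in self.shards:
            hits |= shard.search(term)
        return sorted(hits)


index = ShardedIndex(num_shards=3)
index.add("a", "Iron Maiden Rules")
index.add("b", "Hadoop cluster notes")
index.add("c", "Maiden tour dates")
results = index.search("maiden")
```

Because each shard is built and replicated independently, a newly indexed document may be visible on one replica before another, which is one way to read the slide's "Sacrifice: Consistency".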
  • BI
      • Group, Sort, Filter, Count, Sum
      • Semi-additive (Avg) rare but not hard
      • MapReduce Jobs
      • Faceted Search
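
The BI slide's point that an average is "not hard" as a MapReduce job can be sketched like this: each chunk emits a partial (sum, count) per group, and the reducer adds those partials and divides only at the end. The function names and the `brand_a` / `brand_b` sample data are assumptions for illustration.

```python
from collections import defaultdict


def map_partial(records):
    """Per-chunk partial aggregate: group -> [sum, count]."""
    partial = defaultdict(lambda: [0.0, 0])
    for group, value in records:
        partial[group][0] += value
        partial[group][1] += 1
    return partial


def reduce_partials(partials):
    """Merge partial sums and counts, then derive the average once."""
    total = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for group, (s, n) in partial.items():
            total[group][0] += s
            total[group][1] += n
    return {
        group: {"sum": s, "count": n, "avg": s / n}
        for group, (s, n) in total.items()
    }


# Hypothetical per-chunk records of (group, value).
chunk_1 = [("brand_a", 4), ("brand_a", 2), ("brand_b", 10)]
chunk_2 = [("brand_a", 6)]
result = reduce_partials([map_partial(chunk_1), map_partial(chunk_2)])
```

Averaging the per-chunk averages for `brand_a` would give (3 + 6) / 2 = 4.5, while carrying sums and counts gives 12 / 3 = 4.0, which is why the partials hold both numbers.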
  • Examples
  • Challenges
      • Scaling Search
      • Understanding Latency
      • What do we need ‘now’? Can customers wait for big data?
      • Monitoring
  • Recap: Rules for Scaling
      • RDBMS is not a Swiss-Army Knife
      • Know your sacrifices
      • Know your specialness
      • Know your data structure
      • Ponder Latency
  • What Next?
      • HBase Analytics?
      • “What would make a bank trust it”
      • Teach people to think about data
  • ...
  • The End
      • company: www.visibletechnologies.com
      • blog: www.roadtofailure.com
      • twitter: @lusciouspear
      • email: bradfordstephens@gmail.com