Building a Business on Hadoop, HBase, and Open Source Distributed Computing

This is a talk on a fundamental approach to thinking about scalability, and how Hadoop, HBase, and Lucene are enabling companies to process amazing amounts of data. It's also about how Social Media is making the traditional RDBMS irrelevant.

Published in: Technology

Building a Business on Hadoop, HBase, and Open Source Distributed Computing

  1. Building a Business on Open Source Distributed Computing
     company: www.visibletechnologies.com
     blog: www.roadtofailure.com
     twitter: @lusciouspear
  8. Social Media and Scaling
     •Scalability Matters Now.
     •SM produces large, complex data
     •Anyone can collect the web
     •Make a Twitter in a few days
     •Easy to get TBs of data
     •Big Data enabling new fields for companies
  11. What Visible Does
     •BI and Brand Management on Social Media
     •Listen, Monitor, Engage
  15. Old Product: RDBMS
     •A few MSSQL servers on boxes
     •Lots of ETL
     •Several TB, inserts slow, deletes impossible, random fail
  22. Why RDBMS Bad
     •Nonlinear scale cost
     •Used as a storage abstraction
     •Mainly Select, Join, Group, Count
     •Specialized Scale-Out ones ‘meh’
     •Impedance Mismatch - Try to be High-Throughput, Low-Latency
     •Swiss-army knife, unstable, transactions, advanced SQL, tuning
  28. Why OSS?
     •Previously all MS
     •It exists!
     •Scaling + Licensing = No
     •Can’t build a platform without source
     •It’s Enterprise Now!
  34. Goals for New Platform
     •“Golden Timeline”
     •Search/Analyze *any* data
     •Linear Cost
     •Not Hacked Together
     •“Collect the Social Internet”
  38. HOW TO SCALE
     •What makes you special?
     •What are you willing to sacrifice?
     •How will you structure the data?
  42. Avoiding Impedance Mismatch
     •Most problems can be divided into High or Low latency
     •Get a lot of data eventually, or a little now
     •MapReduce vs. Sharding/Indexing
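
The split between "a lot of data eventually" and "a little now" can be illustrated with a small sketch. It is not taken from the talk: the sample posts, the shard count, and the function names are all hypothetical. One function scans every record to answer an aggregate question (the batch, MapReduce-style path); the other routes a key to a single shard and returns one record (the sharding/indexing path).

```python
from collections import Counter

NUM_SHARDS = 3  # hypothetical shard count, chosen only for this example

# Hypothetical sample data standing in for collected social media posts.
posts = [
    {"id": "a1", "author": "ann", "body": "iron maiden rules"},
    {"id": "b2", "author": "bob", "body": "maiden voyage"},
    {"id": "c3", "author": "ann", "body": "hadoop rules"},
]


def posts_per_author(records):
    """Batch path: touch every record, so cost grows with the data set."""
    return Counter(record["author"] for record in records)


def shard_for(key):
    """Pick a shard deterministically from the bytes of the key."""
    return sum(key.encode("utf-8")) % NUM_SHARDS


def build_shards(records):
    """Distribute records across shards by id, one dict per shard."""
    shards = [dict() for _ in range(NUM_SHARDS)]
    for record in records:
        shards[shard_for(record["id"])][record["id"]] = record
    return shards


def lookup(shards, key):
    """Low-latency path: consult exactly one shard, answer only 'get by id'."""
    return shards[shard_for(key)].get(key)


author_counts = posts_per_author(posts)
shards = build_shards(posts)
hit = lookup(shards, "b2")
```

The batch function can answer any question about the data but must read all of it; the lookup is fast but only serves the access pattern the shards were built for.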
  43. Ecosystem (diagram)
     Compiled Processing: Pig, Cascading, Hive
     Applications: Katta / Zookeeper
     Raw Processing: MapReduce
     Structured Storage: HBase
     Unstructured Storage: Hadoop DFS
  44. Simple Workflow (diagram)
     Hadoop: Collect, Semantic Analysis, Unstructured Analysis
     Hadoop + HBase: Structured Analysis, Store in HBase, Indexing, Store in Hadoop
     Lucene + Solr + Katta: Pull Indexes, Load/Replicate Shards, Search
  45. Unstructured Processing Cluster (diagram)
     Stages: Internet, Collect, Store, Semantic Analysis, Unstructured Analysis, Structured HBase
     Data formats: HTML, XML, Records
  49. Hadoop + MR
     •Special: Crunch web-scale data fast
     •Sacrifice: Low-Latency, Transactions, Random Access, Updates
     •Structure: Chunked flat files
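
As a rough illustration of the "chunked flat files" structure, the following sketch imitates the map, shuffle, and reduce phases in plain Python. It does not use Hadoop's API; the chunk size, the sample lines, and every function name are assumptions made for the example.

```python
from collections import defaultdict
from itertools import islice


def chunked(lines, chunk_size):
    """Yield successive lists of lines, standing in for fixed-size file blocks."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def map_chunk(chunk):
    """Map phase: emit a (word, 1) pair for every word in one chunk."""
    for line in chunk:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce phase: collapse each key's values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines, chunk_size=2):
    """Run the three phases in sequence over chunked input."""
    pairs = (
        pair
        for chunk in chunked(lines, chunk_size)
        for pair in map_chunk(chunk)
    )
    return reduce_counts(shuffle(pairs))


# Hypothetical input lines standing in for a flat file of collected posts.
lines = ["Iron Maiden rules", "Maiden voyage", "Hadoop rules"]
counts = word_count(lines)
```

Each chunk is mapped without reference to any other chunk, which is what lets a real cluster spread the work across machines. The same property explains the sacrifices listed on the slide: nothing here supports updating a single record in place or fetching one quickly.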
  50. Structured Processing Cluster (diagram)
     Labels: Unstructured Cluster, HBase Records, Structured Analysis, Enriched Data, Store in HBase, Indexing, Sharded Lucene Index, Store in Hadoop, Lucene Index, Search Cluster
  51. Document Structure
     ContentID: 00BAC189
     Title: Iron Maiden Rules
     Body: I think Janick Gers is an amazing guitarist blah blah
     PostDT: 20090718
     ParentID: 0FDEADBEEF
     Permalink: www.roadtofailure.com/post?=20
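
One way to picture this record in code is a small typed structure. The sketch below is an assumption, not the platform's actual schema: the class name, the Python field names, and the reading of PostDT as a YYYYMMDD date are all inferred from the sample values on the slide.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class Document:
    """Hypothetical in-memory form of the record shown on the slide."""
    content_id: str
    title: str
    body: str
    post_date: date
    parent_id: str
    permalink: str


def parse_document(fields):
    """Build a Document from the slide's field names.

    Assumes PostDT is formatted as YYYYMMDD, which matches the sample value.
    """
    return Document(
        content_id=fields["ContentID"],
        title=fields["Title"],
        body=fields["Body"],
        post_date=datetime.strptime(fields["PostDT"], "%Y%m%d").date(),
        parent_id=fields["ParentID"],
        permalink=fields["Permalink"],
    )


doc = parse_document({
    "ContentID": "00BAC189",
    "Title": "Iron Maiden Rules",
    "Body": "I think Janick Gers is an amazing guitarist blah blah",
    "PostDT": "20090718",
    "ParentID": "0FDEADBEEF",
    "Permalink": "www.roadtofailure.com/post?=20",
})
```

The ParentID field suggests that replies point back to the post they answer, so a thread could be rebuilt by grouping documents on that value.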
  55. HBase
     •Special: Scalable random/sequential access almost as fast as RDBMS
     •Sacrifice: Joins, Secondary Indexes, Transactions (kind of)
     •Structure: BigTable - column oriented
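
The BigTable-style structure can be approximated by a toy in-memory table: rows sorted by key, each row holding a sparse set of "family:qualifier" columns. This is a sketch of the data model only, not the HBase client API; the class, the column names ("content:title", "meta:postdt"), and the second and third sample rows are invented for illustration.

```python
import bisect


class ToyTable:
    """In-memory stand-in for a BigTable-style table (not the HBase API)."""

    def __init__(self):
        self._keys = []  # row keys kept sorted so range scans are cheap
        self._rows = {}  # row key -> {"family:qualifier": value}

    def put(self, row_key, column, value):
        """Write one cell; rows may have different sets of columns."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Random access: fetch one row by key (empty dict if absent)."""
        return dict(self._rows.get(row_key, {}))

    def scan(self, start, stop):
        """Sequential access: rows with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, dict(self._rows[key])


table = ToyTable()
table.put("00BAC189", "content:title", "Iron Maiden Rules")
table.put("00BAC189", "meta:postdt", "20090718")
table.put("00BAC200", "content:title", "Hadoop Rules")
table.put("0FDEADBEEF", "content:title", "Parent Post")

scanned_keys = [key for key, _ in table.scan("00BAC000", "00BAD000")]
row = table.get("00BAC189")
```

Both access paths depend on the row key alone, which is consistent with the sacrifices on the slide: there is no join and no lookup by any column other than the key unless the application maintains its own index.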
  56. Search Cluster (diagram)
     Labels: Lucene Indexes from HDFS, Pull Indexes, Load/Replicate Shards, Search, Lucene Indexes (two shards)
  57. Search
  61. Katta + Solr
     •Special: Sharded search
     •Sacrifice: Consistency, high-throughput
     •Structure: Reverse index
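
The "reverse index" on this slide is what is more commonly called an inverted index: a map from each term to the documents that contain it. The sketch below shows the idea with two shards and a scatter-gather query. It does not use the Katta, Solr, or Lucene APIs; the shard contents and function names are hypothetical.

```python
from collections import defaultdict


def build_index(docs):
    """Inverted index for one shard: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_shard(index, query):
    """Return ids of documents in one shard that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


def search(shard_indexes, query):
    """Scatter the query to every shard, then merge (gather) the hits."""
    hits = set()
    for index in shard_indexes:
        hits |= search_shard(index, query)
    return sorted(hits)


# Hypothetical documents split across two shards.
shard_docs = [
    {"d1": "Iron Maiden rules", "d2": "Maiden voyage"},
    {"d3": "Hadoop rules", "d4": "Iron Maiden live"},
]
shard_indexes = [build_index(docs) for docs in shard_docs]

iron_maiden_hits = search(shard_indexes, "iron maiden")
rules_hits = search(shard_indexes, "rules")
```

Because each shard is built and loaded separately, a newly indexed document becomes visible only after its shard is rebuilt or replaced, which is one way to read the "Consistency" sacrifice.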
  66. BI
     •Group, Sort, Filter, Count, Sum
     •Semi-additive (Avg) rare but not hard
     •MapReduce Jobs
     •Faceted Search
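
Counts and sums can be merged across partitions by simple addition, whereas an average cannot: averaging per-partition averages gives the wrong answer when partitions differ in size. A common workaround, sketched here under assumed names and data, is to carry (count, sum) pairs through the job and divide only at the end. The "brand" and "score" fields are hypothetical.

```python
from collections import defaultdict


def partial_aggregate(rows):
    """Per-partition pass: group by brand, keeping [count, total] per group."""
    partials = defaultdict(lambda: [0, 0.0])
    for brand, score in rows:
        partials[brand][0] += 1
        partials[brand][1] += score
    return partials


def merge(partitions):
    """Combine partial results; counts and totals simply add."""
    merged = defaultdict(lambda: [0, 0.0])
    for partials in partitions:
        for brand, (count, total) in partials.items():
            merged[brand][0] += count
            merged[brand][1] += total
    return merged


def summarize(merged):
    """Final pass: derive the average from the merged count and total."""
    return {
        brand: {"count": count, "sum": total, "avg": total / count}
        for brand, (count, total) in sorted(merged.items())
    }


# Hypothetical (brand, score) rows split across two partitions.
partition_a = [("acme", 4.0), ("acme", 2.0), ("globex", 5.0)]
partition_b = [("acme", 6.0), ("globex", 1.0)]

report = summarize(merge([partial_aggregate(partition_a),
                          partial_aggregate(partition_b)]))
```

For "acme" the per-partition averages are 3.0 and 6.0, whose mean is 4.5, while the true average over all three rows is 4.0. Carrying the pair avoids that error.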
  67. Examples
  72. Challenges
     •Scaling Search
     •Understanding Latency
     •What do we need ‘now’? Can customers wait for big data?
     •Monitoring
  78. Recap: Rules for Scaling
     •RDBMS is not a Swiss-Army Knife
     •Know your sacrifices
     •Know your specialness
     •Know your data structure
     •Ponder Latency
  82. What Next?
     •HBase Analytics?
     •“What would make a bank trust it”
     •Teach people to think about data
  83. ...
  84. The End
     company: www.visibletechnologies.com
     blog: www.roadtofailure.com
     twitter: @lusciouspear
     bradfordstephens@gmail.com
