Building a Business on Hadoop, HBase, and Open Source Distributed Computing

This talk presents a fundamental approach to thinking about scalability and describes how Hadoop, HBase, and Lucene enable companies to process very large amounts of data. It also argues that social media is making the traditional RDBMS irrelevant.

Published in: Technology

  1. Building a Business on Open Source Distributed Computing •company: www.visibletechnologies.com •blog: www.roadtofailure.com •twitter: @lusciouspear
  8. Social Media and Scaling •Scalability Matters Now. •SM produces large, complex data •Anyone can collect the web •Make a Twitter in a few days •Easy to get TBs of data •Big Data enabling new fields for companies
  11. What Visible Does •BI and Brand Management on Social Media •Listen, Monitor, Engage
  15. Old Product: RDBMS •A few MSSQL servers on boxes •Lots of ETL •Several TB, inserts slow, deletes impossible, random fail
  22. Why RDBMS Bad •Nonlinear scale cost •Used as a storage abstraction •Mainly Select, Join, Group, Count •Specialized Scale-Out ones ‘meh’ •Impedance Mismatch - Try to be High-Throughput, Low-Latency •Swiss-army knife, unstable, transactions, advanced SQL, tuning
  28. Why OSS? •Previously all MS •It exists! •Scaling + Licensing = No •Can’t build a platform without source •It’s Enterprise Now!
  34. Goals for New Platform •“Golden Timeline” •Search/Analyze *any* data •Linear Cost •Not Hacked Together •“Collect the Social Internet”
  38. HOW TO SCALE •What makes you special? •What are you willing to sacrifice? •How will you structure the data?
  42. Avoiding Impedance Mismatch •Most problems can be divided into High or Low latency •Get a lot of data eventually, or a little now •MapReduce vs. Sharding/Indexing
  43. Ecosystem [diagram of stacked layers, top to bottom] •Compiled Processing: Pig, Cascading, Hive •Raw Processing: MapReduce •Structured Storage: HBase •Unstructured Storage: Hadoop DFS •Also shown: Zookeeper, Katta / Applications
  44. Simple Workflow [diagram; recovered labels: Collect, Unstructured Analysis, Semantic Analysis, Structured Analysis, Store in HBase, Store in Hadoop, Indexing, Load/Pull Indexes, Replicate Shards, Search; technologies named: Hadoop, HBase, Lucene, Solr, Katta]
  45. Unstructured Processing Cluster [diagram; recovered labels: Internet, Collect, Store, Unstructured Analysis, Semantic Analysis, Structured, HBase; data formats named: HTML, XML, Records]
  49. Hadoop + MR •Special: Crunch web-scale data fast •Sacrifice: Low-Latency, Transactions, Random Access, Updates •Structure: Chunked flat files
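
The trade-off on the Hadoop + MR slide can be illustrated with a minimal, single-process sketch of the map, shuffle, and reduce phases. This is not Hadoop code: the function names and the two sample chunks are assumptions made for illustration, and a real job would distribute the chunks across a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_chunk(lines):
    """Map phase: emit (word, 1) for every word in one chunk of a flat file."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(groups):
    """Reduce phase: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


# Two hypothetical chunks standing in for blocks of one large flat file.
chunks = [
    ["iron maiden rules", "maiden tour dates"],
    ["iron maiden review"],
]

# Each chunk is mapped independently, which is what lets the work spread out.
mapped = chain.from_iterable(map_chunk(chunk) for chunk in chunks)
word_counts = reduce_counts(shuffle(mapped))
```

Every chunk is read in full and nothing is updated in place, which mirrors the sacrifices listed on the slide: high throughput over the whole data set, but no low-latency random access.
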
  50. Structured Processing Cluster [diagram; recovered labels: Enriched Data, Structured Analysis, Unstructured Cluster, Store in HBase, Store in Hadoop, Indexing, Search Cluster, HBase Records, Sharded Lucene Index, Lucene Index]
  51. Document Structure •ContentID: 00BAC189 •Title: Iron Maiden Rules •Body: I think Janick Gers is an amazing guitarist blah blah •PostDT: 20090718 •ParentID: 0FDEADBEEF •Permalink: www.roadtofailure.com/post?=20
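
As a rough sketch, the record on the Document Structure slide could be modeled as a typed value. The field names and sample values come from the slide; the `Document` class, the `parse_document` helper, and the reading of `PostDT` as a `YYYYMMDD` date are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class Document:
    content_id: str
    title: str
    body: str
    post_dt: date
    parent_id: str
    permalink: str


def parse_document(raw: dict) -> Document:
    """Build a Document from a raw record keyed by the slide's field names."""
    return Document(
        content_id=raw["ContentID"],
        title=raw["Title"],
        body=raw["Body"],
        # Assumption: PostDT is a compact YYYYMMDD date string.
        post_dt=datetime.strptime(raw["PostDT"], "%Y%m%d").date(),
        parent_id=raw["ParentID"],
        permalink=raw["Permalink"],
    )


raw = {
    "ContentID": "00BAC189",
    "Title": "Iron Maiden Rules",
    "Body": "I think Janick Gers is an amazing guitarist blah blah",
    "PostDT": "20090718",
    "ParentID": "0FDEADBEEF",
    "Permalink": "www.roadtofailure.com/post?=20",
}
doc = parse_document(raw)
```
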
  55. HBase •Special: Scalable random/sequential access almost as fast as RDBMS •Sacrifice: Joins, Secondary Indexes, Transactions (kind of) •Structure: BigTable - column oriented
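
The BigTable-style structure named on the HBase slide can be pictured as a map kept sorted by row key, with each row holding `family:qualifier` columns. The sketch below is a toy in-memory model, not the HBase client API; `TinyTable`, its methods, and the column names are hypothetical.

```python
from bisect import bisect_left, insort


class TinyTable:
    """Toy model of a sorted, column-oriented table (not the HBase API)."""

    def __init__(self):
        self._row_keys = []  # kept sorted so range scans are cheap
        self._rows = {}      # row key -> {"family:qualifier": value}

    def put(self, row_key, column, value):
        if row_key not in self._rows:
            insort(self._row_keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Random access by row key."""
        return dict(self._rows.get(row_key, {}))

    def scan(self, start_key, stop_key):
        """Sequential access: rows with start_key <= row_key < stop_key."""
        i = bisect_left(self._row_keys, start_key)
        while i < len(self._row_keys) and self._row_keys[i] < stop_key:
            key = self._row_keys[i]
            yield key, dict(self._rows[key])
            i += 1


table = TinyTable()
table.put("00BAC189", "content:title", "Iron Maiden Rules")
table.put("00BAC189", "meta:post_dt", "20090718")
table.put("00BAC18A", "content:title", "Another post")
table.put("0FDEADBEEF", "content:title", "Parent post")

scanned_keys = [key for key, _ in table.scan("00BAC", "00BAD")]
```

Lookups and scans go through the row key only. Finding rows by title would mean reading every row, which is the "no secondary indexes, no joins" sacrifice from the slide.
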
  56. Search Cluster [diagram; recovered labels: Load/Pull Indexes from HDFS, Replicate Shards, Search, Lucene Indexes]
  57. Search
  61. Katta + Solr •Special: Sharded search •Sacrifice: Consistency, high-throughput •Structure: Reverse index
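
A minimal sketch of sharded search over a reverse (inverted) index follows. It does not use Katta, Solr, or Lucene: the `Shard` and `ShardedIndex` classes and the CRC32-based shard assignment are assumptions chosen to keep the example self-contained.

```python
import zlib
from collections import defaultdict


class Shard:
    """One shard: an inverted index from term to the ids of matching documents."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, term):
        return set(self.postings.get(term.lower(), ()))


class ShardedIndex:
    """Routes each document to one shard and fans each query out to all shards."""

    def __init__(self, shard_count):
        self.shards = [Shard() for _ in range(shard_count)]

    def _shard_for(self, doc_id):
        # Assumption: a stable hash of the document id picks the shard.
        return self.shards[zlib.crc32(doc_id.encode("utf-8")) % len(self.shards)]

    def add(self, doc_id, text):
        self._shard_for(doc_id).add(doc_id, text)

    def search(self, term):
        hits = set()
        for shard in self.shards:  # in a real cluster these calls run in parallel
            hits |= shard.search(term)
        return hits


index = ShardedIndex(shard_count=3)
index.add("a", "Iron Maiden Rules")
index.add("b", "Maiden tour dates")
index.add("c", "Solr notes")

maiden_hits = index.search("maiden")
```

Because each shard is updated and replicated on its own schedule, a query can briefly see some shards ahead of others, which is one way to read the "Sacrifice: Consistency" bullet.
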
  66. BI •Group, Sort, Filter, Count, Sum •Semi-additive (Avg) rare but not hard •MapReduce Jobs •Faceted Search
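
The BI slide's distinction between additive measures (Count, Sum) and semi-additive ones (Avg) can be sketched as follows. Averages from separate partitions cannot simply be averaged again, so each partition carries a (sum, count) pair and the division happens once at the end. The `brand` and `score` fields and the sample data are hypothetical.

```python
from collections import defaultdict


def partial_aggregate(records):
    """Per-partition step: group by brand and keep (sum, count) per group."""
    acc = defaultdict(lambda: [0.0, 0])
    for brand, score in records:
        acc[brand][0] += score
        acc[brand][1] += 1
    return {brand: tuple(pair) for brand, pair in acc.items()}


def merge(partials):
    """Final step: add up the partial sums and counts, then derive the average."""
    total = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for brand, (score_sum, count) in partial.items():
            total[brand][0] += score_sum
            total[brand][1] += count
    return {
        brand: {"sum": score_sum, "count": count, "avg": score_sum / count}
        for brand, (score_sum, count) in total.items()
    }


# Hypothetical (brand, score) records split across two partitions.
partition_a = [("acme", 4.0), ("acme", 2.0), ("globex", 5.0)]
partition_b = [("acme", 6.0)]

report = merge([partial_aggregate(partition_a), partial_aggregate(partition_b)])
```

Here the per-partition averages for `acme` are 3.0 and 6.0, whose mean is 4.5, while the correct overall average is 12.0 / 3 = 4.0. Carrying sums and counts is what makes Avg "rare but not hard" in a MapReduce job.
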
  67. Examples
  72. Challenges •Scaling Search •Understanding Latency •What do we need ‘now’? Can customers wait for big data? •Monitoring
  78. Recap: Rules for Scaling •RDBMS is not a Swiss-Army Knife •Know your sacrifices •Know your specialness •Know your data structure •Ponder Latency
  82. What Next? •HBase Analytics? •“What would make a bank trust it” •Teach people to think about data
  83. ...
  84. The End •company: www.visibletechnologies.com •blog: www.roadtofailure.com •twitter: @lusciouspear •bradfordstephens@gmail.com
