Big Data and Me Bhupesh Bansal Feb 3, 2012
Relational Model Architecture
Reference: http://www.slideshare.net/adorepump/voldemort-nosql
LinkedIn 2006
Reference: http://www.slideshare.net/linkedin/linked-in-javaone-2008-tech-session-comm
Relational model
- The relational model is a triumph of computer science:
  - General
  - Concise
  - Well understood
- But then again:
  - SQL is a pain
  - Hard to build re-usable data structures
  - Hides performance issues/details
Specialized Systems Architecture
Reference: http://www.slideshare.net/adorepump/voldemort-nosql
LinkedIn 2007
Reference: http://www.slideshare.net/linkedin/linked-in-javaone-2008-tech-session-comm
Specialized systems
- Specialized systems are efficient (10-100x)
  - Search: inverted index
  - Offline: Hadoop, Teradata, Oracle DWH
  - Memcached
  - In-memory systems (social graph)
- Specialized systems are scalable
- New data and problems
  - Graphs, sequences, and text
Batch Driven Architecture
Reference: http://www.slideshare.net/bhupeshbansal/hadoop-user-group-jan2010
Motivation I: Big Data
Reference: algo2.iti.kit.edu/.../fopraext/index.html
Motivation II: Data Driven Features
Motivation III: Makes Money
Motivation IV: Big Data is cool
Reference: http://www.slideshare.net/BenSiscovick/the-business-of-big-data-ia-ventures-8577588
Big Data Challenges
- Large scale data processing
  - Use all available signals, e.g. weblogs, social signals (Twitter/Facebook/LinkedIn)
- Data-driven applications
  - Refine data and push it back to users for consumption
- Near real time feedback loop
  - Keep continuously improving
Why is this hard?
- Large scale data processing
  - TB/PB of data
  - Traditional storage systems cannot handle the scale
- Data-driven applications
  - Need to run complex machine learning algorithms at this data scale
- Near real time analysis
  - Improves application performance and usage
Some good news!!
- Hadoop
  - Biggest single driver for the large scale data economy
  - Scales, works, easy to use
- Memcached
  - Works, scales and is fast
- Open source world
  - Lots of awesome people working on awesome systems, e.g. HBase, memcached, Voldemort, Kafka, Mahout, etc.
- Sharing across companies
  - Common practices/knowledge sharing across companies
What works!!
- Simplicity
  - Go with the simplest design possible.
- Near real time
  - Async/batch processing
    - Push computation to the background as much as possible
- Duplicate data everywhere
  - Build a customized solution for each problem
  - Duplicate data as needed
- Data river
  - Publish events and let all systems consume at their own pace
- Monitoring/alerting
  - Keep a close eye on things and build a strong dev-ops team
What doesn’t work!!
- Magic systems
  - Auto-configure, auto-tuning
  - Very hard to get right; instead have easy configuration and better monitoring
- Open source
  - If not supported by a strong engineering team internally
  - Be ready to have folks spend 30-40% of their time understanding and helping open source components
- Silver bullets
  - One system to solve all scaling problems, e.g. HBase
  - Build separate systems for separate problems
- Central data source
  - Don’t lock your data; let it flow
  - Use Kafka, Scribe or any publish/subscribe system
Open source
- Very, very important for any company today
  - Do not reinvent the wheel
    - Do not write a line of code if not needed
  - 90/10% rule
    - Pick up open source solutions, fix what is broken
  - Big plus for hiring
  - Stand on the shoulders of the crowd
Open source: Storage
- Problem: you want to store TBs of data for user consumption in real time
  - Latency < 50 ms
  - Scale: 10,000+ QPS
- Solutions
  - Bigtable design, e.g. HBase
  - Amazon Dynamo design, e.g. Voldemort
  - Cache with persistence, e.g. Membase
  - Document based storage, e.g. MongoDB
Open source: Publish/Subscribe
- Problem: a data river for all other systems to get their feed
- Solutions
  - Strong data guarantees, e.g. ActiveMQ, RabbitMQ, HornetQ
  - Log feeds, e.g. Scribe, Flume
  - Kafka
    - A great mix of both worlds
Open source: Real time analysis
- Problem: analyze a stream of data and do simple analysis/reporting
- Solutions
  - Splunk
    - General purpose but high-maintenance, expensive analysis tool
  - OpenTSDB
    - Simple but scalable metrics reporting
  - Yahoo S4 / Twitter Storm
    - Online map-reduce-ish
    - New systems will need lots of love and care
Open source: Search
- Problem: unstructured queries on data
- Solutions
  - Lucene
    - Most tested, common search library (but just a library)
  - Solr
    - Old system with a lot of users but bad design
  - Elastic Search
    - Very well designed but a new system
  - LinkedIn's open source search systems
    - SenseiDB, Zoie
Open source: Batch computation
- Problem: you want to process TBs of data
- The solution is simple: use Hadoop
  - Hadoop workflow managers
    - Azkaban
    - Oozie
  - Query
    - Native Java code
    - Cascading
    - Hive
    - Pig
Open source: Other
- Serialization
  - Avro, Thrift, Protocol Buffers
- Compression
  - Snappy, LZO
- Monitoring
  - Ganglia
My personal picks!!
- Storage
  - Pure key-value lookup: Voldemort
  - Range queries, Hadoop job support: HBase
  - Batch-generated, read-only data serving: Voldemort
- Publish/Subscribe
  - HornetQ or Kafka
- Search
  - ElasticSearch
- Hadoop
  - Azkaban
  - Hive and native Java code
Jeff Dean’s Thoughts
- Very practical advice on building good, reliable distributed systems. Highlights:
  - Back-of-the-envelope calculations
    - Understand your base numbers well
  - Scale for 10X, not 100X
  - Embrace chaos/failure and design around it
  - Monitoring/status hooks at all levels
  - Important not to try to be all things for everybody
Reference: http://www.slideshare.net/xlight/google-designs-lessons-and-advice-from-building-large-distributed-systems
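As a back-of-the-envelope illustration (hypothetical numbers, not from the deck): scanning 1 TB sequentially at roughly 100 MB/s per disk takes about 10,000 seconds, close to 3 hours, on a single machine; spread evenly over 100 machines the same scan drops to roughly 100 seconds. Knowing a handful of such base numbers makes these estimates quick to sanity-check.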
How Voldemort was born?
References:
1) http://www.slideshare.net/bhupeshbansal/hadoop-user-group-jan2010
2) http://www.slideshare.net/adorepump/voldemort-nosql
Why NoSQL?
- TBs of data
- Sharding is the only way to scale
  - No joins possible (data is split across machines)
- Specialized systems, e.g. search and the network feed, break the relational model
- Constraints, triggers, etc. disappear
- Lots of denormalization
- Latency is key
  - Relational DBs depend on a caching layer to achieve high throughput and low latency
Inspired by Amazon Dynamo & Memcached
- Amazon’s Dynamo storage system
  - Works across data centers
  - Eventual consistency
  - Commodity hardware
- Memcached
  - Actually works
  - Really fast
  - Really simple
ACID vs CAP
- ACID
  - Great for a single centralized server.
- CAP Theorem
  - Consistency (strict), Availability, Partition tolerance
  - Impossible to achieve all three at the same time in a distributed system
  - Can choose 2 out of 3
  - Dynamo chooses high availability and partition tolerance
    - by sacrificing strict consistency for eventual consistency
Consistent Hashing
- Key space is partitioned
  - Many small partitions
- Partitions never change
  - Partition ownership can change
- Replication
  - Each partition is stored by N nodes
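Below is a minimal Java sketch of the fixed-partition scheme the slide describes; it is illustrative only (the partition count, node IDs and hash function are arbitrary choices), not Voldemort's actual routing code:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of fixed-partition consistent hashing: the key space is split into
 * many small partitions, each node owns some partitions, and a key's replicas
 * are the next N distinct owning nodes met while walking the partition ring.
 */
public class ConsistentHashingSketch {

    private final int numPartitions;      // fixed for the lifetime of the cluster
    private final int[] partitionOwner;   // partition id -> node id (ownership can change)
    private final int replicationFactor;  // N

    public ConsistentHashingSketch(int numPartitions, int[] partitionOwner, int n) {
        this.numPartitions = numPartitions;
        this.partitionOwner = partitionOwner;
        this.replicationFactor = n;
    }

    /** Key -> partition mapping is fixed; only partition ownership moves between nodes. */
    int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    /** Walk the ring from the key's partition and collect N distinct owning nodes. */
    List<Integer> preferenceList(String key) {
        List<Integer> nodes = new ArrayList<>();
        int p = partitionFor(key);
        for (int i = 0; i < numPartitions && nodes.size() < replicationFactor; i++) {
            int owner = partitionOwner[(p + i) % numPartitions];
            if (!nodes.contains(owner)) {
                nodes.add(owner);
            }
        }
        return nodes;
    }

    public static void main(String[] args) {
        // 8 partitions spread over 4 nodes, replication factor N = 3 (all hypothetical numbers).
        int[] owners = {0, 1, 2, 3, 0, 1, 2, 3};
        ConsistentHashingSketch ring = new ConsistentHashingSketch(8, owners, 3);
        System.out.println("member:42 -> " + ring.preferenceList("member:42"));
    }
}
```

Adding or removing a node only reassigns ownership of a few partitions; the key-to-partition mapping itself never moves, which is the property the slide is after.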
R + W > N
- N - the replication factor
- R - the number of blocking reads
- W - the number of blocking writes
- If R + W > N
  - then we have a quorum-like algorithm
  - Guarantees that we will read the latest write OR fail
- R, W, N can be tuned for different use cases
  - W = 1: highly available writes
  - R = 1: read-intensive workloads
  - Knobs to tune performance, durability and availability
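To make the quorum condition concrete (a worked example, not from the original deck): with N = 3, R = 2, W = 2 we have R + W = 4 > 3, so the 2 replicas consulted on a read must overlap the 2 replicas touched by the latest successful write in at least one node, and the client can return the newest version it sees. With R = 1, W = 1 (R + W = 2, which is not greater than N), a read may land on the one replica the last write never reached and return stale data.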
Versioning & Conflict Resolution
- Eventual consistency allows multiple versions of a value
  - Need a way to understand which value is latest
  - Need a way to say values are not comparable
- Solutions
  - Timestamps
  - Vector clocks
    - Provide a causal (partial) ordering of writes
    - No locking or blocking necessary
Vector Clock
- A vector clock [Lamport] provides a way to order events in a distributed system.
- A vector clock is a tuple {t1, t2, ..., tn} of counters.
- Each value update has a master node
  - When data is written with master node i, it increments ti.
  - All the replicas will receive the same version
  - Helps resolve consistency between writes on multiple replicas
- If you get network partitions
  - You can have a case where two vector clocks are not comparable.
  - In this case Voldemort returns both values to clients for conflict resolution
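A small Java sketch of the comparison rule implied above; it is a simplification of what a real implementation needs (Voldemort's own VectorClock class is richer), but it shows how "newer", "older" and "not comparable" fall out of the counters:

```java
import java.util.Arrays;

/** Simplified vector-clock comparison: one counter per node, equal-length clocks. */
public class VectorClockSketch {

    enum Occurred { BEFORE, AFTER, CONCURRENT }

    static Occurred compare(long[] v1, long[] v2) {
        boolean v1Bigger = false, v2Bigger = false;
        for (int i = 0; i < v1.length; i++) {
            if (v1[i] > v2[i]) v1Bigger = true;
            if (v2[i] > v1[i]) v2Bigger = true;
        }
        if (v1Bigger && !v2Bigger) return Occurred.AFTER;   // v1 supersedes v2
        if (v2Bigger && !v1Bigger) return Occurred.BEFORE;  // v2 supersedes v1
        // Not comparable (concurrent writes); identical clocks also fall through
        // here in this simplified sketch. Both values go back to the client.
        return Occurred.CONCURRENT;
    }

    public static void main(String[] args) {
        long[] a = {2, 1, 0};   // node 0 wrote twice, node 1 once
        long[] b = {1, 1, 0};   // an older version of the same value
        long[] c = {1, 1, 1};   // a concurrent write mastered on node 2
        System.out.println(Arrays.toString(a) + " vs " + Arrays.toString(b) + ": " + compare(a, b)); // AFTER
        System.out.println(Arrays.toString(a) + " vs " + Arrays.toString(c) + ": " + compare(a, c)); // CONCURRENT
    }
}
```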
Client API
- Data is organized into “stores”, i.e. tables
- Key-value only
  - But values can be arbitrarily rich or complex
    - Maps, lists, nested combinations …
- Four operations
  - PUT (key K, value V)
  - GET (key K)
  - MULTI-GET (Iterator<Key> K)
  - DELETE (key K) / (key K, version ver)
  - No range scans
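For concreteness, a short Java sketch of what using such a store-oriented client looks like, loosely following Voldemort's published quickstart; the bootstrap URL and store name are placeholders, and exact method signatures may differ between releases:

```java
import java.util.Arrays;
import java.util.Map;

import voldemort.client.ClientConfig;
import voldemort.client.SocketStoreClientFactory;
import voldemort.client.StoreClient;
import voldemort.client.StoreClientFactory;
import voldemort.versioning.Versioned;

public class ClientApiSketch {
    public static void main(String[] args) {
        // Bootstrap against any node in the cluster (placeholder URL).
        StoreClientFactory factory = new SocketStoreClientFactory(
                new ClientConfig().setBootstrapUrls("tcp://localhost:6666"));
        StoreClient<String, String> client = factory.getStoreClient("test"); // a "store" ~ a table

        client.put("member:42", "{\"firstName\":\"Jane\"}");          // PUT
        Versioned<String> value = client.get("member:42");             // GET returns value + vector clock
        Map<String, Versioned<String>> many =
                client.getAll(Arrays.asList("member:42", "member:43")); // MULTI-GET
        client.delete("member:42");                                     // DELETE
        // Note: no range scans -- keys must be known up front.
        System.out.println(value + " / fetched " + many.size());
    }
}
```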
Voldemort Physical Deployment
 
Read-only storage engine
- Throughput vs. latency
- Index building done in Hadoop
- Fully parallel transfer
- Very efficient on-disk structure
- Heavy reliance on OS page cache
- Rollback!
Reference: http://www.slideshare.net/bhupeshbansal/hadoop-user-group-jan2010
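One way to picture the "very efficient on-disk structure": an index file built offline as a sorted array of fixed-size (key hash, offset) entries, memory-mapped so the OS page cache does the caching, and searched with a plain binary search. This is a hypothetical sketch for illustration, not Voldemort's actual file format:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

/**
 * Sketch of a read-only store lookup: the index file, generated by a batch job,
 * holds fixed-size entries of (8-byte key hash, 8-byte data offset) sorted by hash.
 * The file is memory-mapped; a lookup is a binary search, served mostly from page cache.
 * (Index capped at 2 GB here because MappedByteBuffer is int-addressed.)
 */
public class ReadOnlyIndexSketch {
    private static final int ENTRY_SIZE = 16; // 8-byte hash + 8-byte offset

    private final MappedByteBuffer index;
    private final long numEntries;

    public ReadOnlyIndexSketch(String indexPath) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(indexPath, "r");
             FileChannel channel = file.getChannel()) {
            this.index = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            this.numEntries = channel.size() / ENTRY_SIZE;
        }
    }

    /** Binary search the sorted index; returns the data-file offset or -1 if the key is absent. */
    public long lookup(long keyHash) {
        long low = 0, high = numEntries - 1;
        while (low <= high) {
            long mid = (low + high) >>> 1;
            long hashAtMid = index.getLong((int) (mid * ENTRY_SIZE));
            if (hashAtMid < keyHash) low = mid + 1;
            else if (hashAtMid > keyHash) high = mid - 1;
            else return index.getLong((int) (mid * ENTRY_SIZE + 8));
        }
        return -1;
    }
}
```

Because the files are immutable and built in Hadoop, swapping in a new data set (or rolling back to the previous one) is just a matter of pointing at a different set of files.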
What do we use Hadoop/Voldemort for?
Batch Driven Architecture
Reference: http://www.slideshare.net/bhupeshbansal/hadoop-user-group-jan2010
Data Flow Driven Architecture
Reference: http://sna-projects.com/blog/2011/08/kafka/
Questions
Speaker notes
  • Example: member data; it does not make sense to repeatedly join positions, emails, groups, etc. Explain joins, and how to model this better in Java: a JSON-like data model.
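As an illustration of that JSON-like model (hypothetical field names, not from the deck), a member value might carry everything that would otherwise require joins:

```json
{
  "memberId": 42,
  "firstName": "Jane",
  "positions": [{"company": "Acme", "title": "Engineer", "start": "2009-03"}],
  "emails": ["jane@example.com"],
  "groups": [101, 202]
}
```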
  • Statistical learning as the ultimate agile development tool (Peter Norvig): “business logic” expressed through data rather than code.
  • No joins: across data domains because of service APIs, within data domains because of performance. Natural operation: getAll(id…). Latency: if you want to call 30 services on your main pages, they had better be quick (30 * 20 ms = 600 ms).
  • Strong consistency: all clients see the same view, even in the presence of updates. High availability: all clients can find some replica of the data, even in the presence of failures. Partition tolerance: the system properties hold even when the system is partitioned. High availability is the mantra for websites: better to deal with inconsistencies, because their primary need is to scale well and allow for a smooth user experience.
  • Hashing: why do we need it? The basic problem is that clients need to know which data lives where. There are many ways of solving it: central configuration, or hashing. Linear hashing works, but the issue is a dynamic cluster: when you add new slots, the key-hash-to-node mapping changes for a lot of entries. Consistent hashing preserves the key-to-node mapping for most keys and only changes the minimal amount needed. How to do it? Use an arbitrary number of partitions; each node is allocated many partitions (better load balancing and fault tolerance), a few hundred to a few thousand. The key-to-partition mapping is fixed and only ownership of partitions can change.
  • Give an example of reads and writes with vector clocks; pros and cons vs. Paxos and 2PC. The user can supply a strategy for handling cases where v1 and v2 are not comparable.
  • A fancy way of doing optimistic locking.
  • Very simple APIs. No range scans, and no iterator on the key set / entry set: very hard to make performant, though there are plans to provide such an iterator.
  • Explain partitions. Make things fast by removing slow things, not by tuning. The HTTP client was not performant. Separate caching layer.
  • Transfer time: 30 minutes. Can max out a Gb network, so be careful.
  • Questions, comments, etc