
Big Data: Whats, Wheres and Whens



This is the slide deck from the presentation I gave at Kamikaze Big Data Conference. It gives a very high-level view of the thinking that goes into choosing data stores for our data. Since the presentation was created for attendees who were expected to be non-technical, the ideas might seem hazy. If you want to discuss Big Data in detail, you can send me an email at ajinkyaster at gmail dot com

Published in: Technology
  • :) I remember so much about this...
    was so proud..

Big Data: Whats, Wheres and Whens

  1. Big Data @ LinkedIn: The Whats, Wheres and Whens. By Ajinkya Harkare
  2. About me
  3. What is Big Data?
  4. 3 V’s of Big Data • Volume: no. of bytes, no. of records • Velocity: batch, stream, changing • Variety: structured, unstructured, semi-structured
  5. LinkedIn by Numbers • Chart: members in millions by year (data labels: 2, 4, 8, 17, 32, 55, 90, 145, 187) • New members joining: ~2/sec • Searches in 2011: ~4.2 billion
  6. That’s Big!!! We need a scalable storage platform for this. To scale at cost, we need a cluster built on commodity hardware.
  7. But which platform should I use? How should I retrieve data from it? How should I analyze data on it?
  8. Lucky us! We are spoilt for choice.
  9. Let’s focus on storage and retrieval… Why do we even have so many options for data stores?
  10. Why can’t “one size fit all”? Why can’t one technology do everything for us? Truth is, it can’t be done… The reasons: • Disparate data • Varied access patterns
  11. Data and its access patterns: Profile Update Case (Add a Position) • Has a relational structure • Requires strong transactional behavior • Store type: Relational Databases
  12. Data and its access patterns: PYMK (People You May Know) Case • Is always accessed through a primary key • Requires high availability and low latency • Is semi-structured • Store type: Key/Value Stores
  13. Other data and its access patterns: Article Subscription • Only a few columns of the table are accessed at a time • Has repetitive values in columns (like y/n) that can be compressed • Store type: Column-Oriented Stores
  14. Other data and its access patterns: Resume Churning • Record schema changes from row to row • Needs very low online latency • Store type: Document-Oriented Stores
  15. Latency Requirements • Online case: latency ~ milliseconds – OLTP – Frequently updated/read data • Near-line case: latency ~ seconds – Intermediate store between online and offline – Streams/deltas for data transmission – Status updates for target audience • Offline case: latency ~ hours – OLAP – Data exploration – Historic/aggregate/analytics data
  16. No single system can unify all of these, but a combination of systems might. This solution is Polyglot Persistence.
  17. Polyglot Persistence: Do not base your stack on one storage platform. Rather, choose the right one for the problem at hand. Use different technologies for different use cases. This principle also makes sense for non-storage use cases: Polyglot Platforms.
  18. LinkedIn Stack: Polyglot Platforms • Online stack • Near-line stack • Offline stack
  19. Online Case: High Availability & Throughput • Open source • Highly available distributed key/value store (open source) • Skills, People You May Know, Company Follow • Consistent distributed key/value store • Supports full-text search • InMail, Company Profiles
  20. Near-line Case • Open source • Distributed pub/sub messaging system • High throughput; low latency • Parallel data load into Hadoop • Change data capture using pub/sub model • Low-latency, high-throughput buffer with deep filtering • Works with a variety of data source technologies
  21. Offline Case: Analytics, Warehousing • Flexible, fault-tolerant and available framework • Distributed network of commodity hardware • Offline analytics, log processing and learning • Linearly scalable • Extensive parallel processing • Can handle many concurrent users • Shared-nothing architecture • Not suitable for small OLTP workloads
  22. Finally… Tons of technologies to choose from. What you need to care about is: • Polyglot Persistence • Knowing your data and its access patterns
  23. Thanks!!
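
The idea in slides 11 to 15, that each kind of data and access pattern points to a different kind of store, can be illustrated with a small sketch. This is not from the deck: the `AccessPattern` fields, the `suggest_store` and `latency_tier` functions, the rule order, and the numeric latency thresholds are all hypothetical names and assumptions chosen only to mirror the bullet points on those slides. Real decisions weigh far more factors than these few flags.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessPattern:
    """Hypothetical traits taken from the bullets on slides 11 to 14."""
    relational: bool = False
    needs_transactions: bool = False
    primary_key_only: bool = False
    reads_few_columns: bool = False
    schema_varies_per_record: bool = False


def suggest_store(p: AccessPattern) -> str:
    """Return a store category for a pattern (rule order is an assumption)."""
    if p.relational and p.needs_transactions:
        return "relational database"
    if p.schema_varies_per_record:
        return "document-oriented store"
    if p.reads_few_columns:
        return "column-oriented store"
    if p.primary_key_only:
        return "key/value store"
    return "no clear fit"


def latency_tier(budget_seconds: float) -> str:
    """Bucket a latency budget into the tiers of slide 15.

    The cut-offs (1 second, 1 hour) are illustrative assumptions; the slide
    only says ~milliseconds, ~seconds and ~hours.
    """
    if budget_seconds < 1:
        return "online"
    if budget_seconds < 3600:
        return "near-line"
    return "offline"


if __name__ == "__main__":
    use_cases = {
        "Profile update": AccessPattern(relational=True, needs_transactions=True),
        "PYMK": AccessPattern(primary_key_only=True),
        "Article subscription": AccessPattern(reads_few_columns=True),
        "Resume": AccessPattern(schema_varies_per_record=True),
    }
    for name, pattern in use_cases.items():
        print(f"{name}: {suggest_store(pattern)}")
```

Because different use cases return different categories, a stack serving all of them ends up with more than one storage technology, which is the polyglot persistence point of slides 16 and 17.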