
The Big Data Ecosystem at LinkedIn

Published in: Technology, Business
  • Good news for users, bad news for distributed systems nerds. Filesystems take a decade to mature; don't expect this to be any easier.

    1. The Big Data Ecosystem at LinkedIn
        • Jay Kreps
    2. Me
        • Background in data not infrastructure
        • LinkedIn’s SNA team
        • Original co-author of some LinkedIn open source projects (Voldemort, Azkaban, Kafka)
    3. This Talk
        • We are in a renaissance of data infrastructure.
        • How do all these pieces fit together?
    4. Why the current obsession with “Big Data”?
    5. The goal of modern data infrastructure is to make many small computers act like one big one.
    6. The Old Picture
    7. The New Picture
    8. Polyglot persistence?
    9. Infrastructure Icebergs
        • 90k lines of tooling and monitoring, 30k lines of logic
        • Dedicated engineers, operations
        • Training
        • First three nines come from operations
    10. This is (still) a very immature space. Which systems should we have?
    11. Infrastructure is sculpted by applications and constraints
        • Projects are defined by trade-offs
    12. Constraints
        • Hardware
        • Jeff Dean: Numbers everyone should know
        • David Patterson: Latency lags bandwidth
        • $$$
        • Other
        • Path dependence
        • Complexity
        • Resources
    13. Applications
    14. Common categories of non-CRUD
        • Recommendations & Matching
        • Graphs
        • Search
        • Data Normalization
        • News feed
        • Analysis & Monitoring
    15. Social Graph
    16. Search
    17. Recommendations: People
    18. Recommendations: Jobs
    19. Recommendations: Newsfeed
    20. Data Normalization
    21. Analytics
    22. Infrastructure
        • Search
        • Lucene
        • Bobo (facets), Zoie (real-time indexing), Sensei (distribution)
        • Social Graph
        • Storage
        • Oracle
        • Voldemort
        • Espresso
        • Streams
        • Databus
        • Kafka
        • Offline
        • Hadoop & friends (Pig, Hive, Azkaban, etc.)
    23. Three Major Paradigms
        • Request/Response
        • Search
        • Social Graph
        • Storage
        • Streams
        • Kafka
        • Batch
        • Hadoop
    24. Most features are multi-paradigm
    25. Request/Response
        • Search
        • Social Graph
        • Storage
        • Voldemort
        • Espresso
    26. Request/Response Patterns
        • Broker, scatter-gather
        • Storage systems: only
        • Partitioning strategy
        • Latency oriented
    27. Batch: Hadoop
        • Uses
        • Ad hoc
        • Production batch
        • Ecosystem
        • Hive, Pig
        • Azkaban (workflow)
        • Avro data
        • Data in: Kafka
        • Data out: Voldemort, Kafka
    28. Why do batch if you have real-time?
        • Batch advantages
        • Safety
        • Easy
        • Throughput
        • Simplicity
        • Economics
        • Tricky bit: engineering the data cycle
    29. Why do streaming?
        • You have to glue all these systems together
        • Throughput as good as batch
        • Latency much better
        • Metaphor more natural for low latency than Hadoop
    30. What makes successful infrastructure systems?
        • Operability and Operations
        • Monitoring
        • Simplicity
        • Documentation
        • Broad adoption
        • Lazy users
        • Open source
    31. Open Source
        • Data > Infrastructure
        • Open source creates better code, even with few outside contributors
        • Commercial infrastructure not interesting
    32. Open Source Projects
        • We made
        • Voldemort: Key/Value storage
        • Sensei, Bobo, Zoie: Elastic, faceted, real-time search with Lucene
        • Kafka: Persistent, distributed data streams
        • Norbert: Cluster aware RPC, load balancing, and group membership
        • And others…
        • We stole
        • Hadoop, Pig, Hive
        • Lucene
        • Netty, Jetty
        • Zookeeper
        • Avro
        • Apache Traffic Server
    33. The End
        • jay.kreps@gmail.com
        • http://www.linkedin.com/in/jaykreps
        • http://twitter.com/jaykreps
        • http://sna-projects.com
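
The request/response slide names a broker, scatter-gather, and a partitioning strategy without showing how they fit together. The following is a minimal sketch of that pattern, not LinkedIn's implementation: `Partition`, `Broker`, and the CRC32-modulo partitioning are hypothetical names and choices made for illustration, and the partitions live in one process rather than on separate machines.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


class Partition:
    """Hypothetical in-memory shard holding a subset of the documents."""

    def __init__(self):
        self.docs = {}

    def put(self, key, text):
        self.docs[key] = text

    def search(self, term):
        # Each partition answers only for the documents it owns.
        return sorted(k for k, text in self.docs.items() if term in text)


class Broker:
    """Hypothetical broker: routes writes by key, fans reads out to all shards."""

    def __init__(self, num_partitions):
        self.partitions = [Partition() for _ in range(num_partitions)]

    def partition_for(self, key):
        # Assumed strategy: a stable hash modulo the partition count,
        # so the same key always lands on the same partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def put(self, key, text):
        self.partitions[self.partition_for(key)].put(key, text)

    def search(self, term):
        # Scatter the query to every partition in parallel,
        # then gather the partial results and merge them.
        with ThreadPoolExecutor(max_workers=len(self.partitions)) as pool:
            partials = list(pool.map(lambda p: p.search(term), self.partitions))
        return sorted(k for part in partials for k in part)


broker = Broker(num_partitions=4)
broker.put("member:1", "hadoop engineer")
broker.put("member:2", "search engineer")
broker.put("member:3", "data scientist")
results = broker.search("engineer")
```

A key lookup touches a single partition, while a search has to wait for the slowest partition to answer, which is one reason the slide describes these systems as latency oriented.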