The Big Data Revolution is an Evolution

Dealing with data requires more than a data store; it requires an infrastructure. At SimpleReach, we have 5 data storage layers to service all of our data needs. These range from high-volume, high-velocity data ingestion with real-time analytics to ad-hoc historical analysis with search capabilities. To communicate effectively between applications, the data stores sit behind a service architecture that provides consistent data access patterns and failover/redundancy. This talk is the story of how we arrived at this architecture and some of the lessons we learned along the way.

Published in: Technology

The Big Data Revolution is an Evolution

  1. 1. The Big Data Revolution is an Evolution • Eric Lubow @elubow elubow@simplereach.co
  2. 2. Overview • Evolution • SimpleReach • Data Stores / Languages • Architecture Implementation (slide footer: Big Data Revolution is an Evolution • Eric Lubow @elubow • #NYCassandra2013)
  3. 3. We're in the midst of an evolution, not a revolution.
  4. 4. The 2 Truths
  5. 5. The Real Truth: Even with the right tools, 80% of the work of building a big data system is acquiring and refining …
  6. 6. 30m plays/day + 4m user ratings + 75k movies metadata + 24.4m user metadata = David Fincher + Kevin Spacey + British House of Cards; Mitch Hurwitz + Will Arnett + Jason Bateman + Arrested Development
  7. 7. BRING IT TOGETHER
  8. 8. revolution evolution Insufficient New Products Capabilities Scale/Need Development & Changes Integration
  11. 11. SimpleReach • Millions of URLs per day • Over 1 billion pageviews per month • 250m events per day (~3k events/second) • Auto-scale 90-130 machines depending on traffic
  12. 12. HUMBLE BEGINNINGS
  13. 13. Scale
  14. 14. AND THEN... C*
  15. 15. Cassandra (C*) • Large data volume ingestion at high velocity • Really fast writes to many locations (eventual consistency) • Query by column groups within rows (slicing) • TTLs for small group aggregation • Wrote Helenus, a Node.js driver for Cassandra
  16. 16. MongoDB • Fast atomic increments (Node.js is native JSON) • Sharding • Solid ORM for Rails (Mongoid) • B-Tree indexes • Document based via JSON • TTLs for ephemeral data
  17. 17. Redis • Supports hundreds of thousands of transactions per second • Great caching engine • Supports useful variable types like sets, sorted sets, and lists • Everything is guaranteed to be memory mapped
  18. 18. Infobright • Works with the standard MySQL driver • Column store for ad-hoc analytics queries in SQL • Heavy compression of data (avg 12:1)
  19. 19. The c0dez • Polyglottany doesn’t only apply to data stores • Each language has its own benefit to each stack layer • Each language has its own individual benefits • Each language has its own development benefits
  21. 21. Cons • Redis - Can only utilize a single core. SerDe price. • Infobright - DELETE/UPDATEs are VERY expensive • Cassandra - No btree indexes or probabilistic counters • Mongo - Indexes must fit in memory. Forced Replica ping times • Python - Whitespace. Community • Ruby - Not high performance enough for our standards
  22. 22. Evolution Takes Work • Service Oriented Architecture (Internal API) • Data accuracy checks: visual and programmatic • Built framework for testing out engines (Storage, Queueing, etc.) • Access to many toolsets (for all languages, DBs, Engines)
  23. 23. Service (diagram labels: Solr, C*, Real-time C*, Internal API)
  24. 24. Path of a Packet (diagram labels as extracted: Fire, Solr, Hos, C*, Internal API, Consumers, EP, Queue, Internet, Mong, API, Redis, SC, IB)
  25. 25. Architecture Distribution • Availability zones: US-EAST-1a, US-EAST-1b, US-EAST-1e • Cassandra: CASSANDRA-0001, CASSANDRA-0002, CASSANDRA-0003, CASSANDRA-0010, CASSANDRA-0011, CASSANDRA-0012 • Redis: REDIS-0001A, REDIS-0001B • Infobright: INFOBRIGHT-0001, INFOBRIGHT-0002 • MongoDB: MONGO-SHARD-0000-A, MONGO-SHARD-0000-B, MONGO-SHARD-0001-B, MONGO-SHARD-0001-A, MONGO-SHARD-0002-B, MONGO-SHARD-0002-A • Internal API: iAPI-0001, iAPI-0002, iAPI-0003
  26. 26. The Schrute of the Problem
  27. 27. Evolving Amazon Tools • Full Featured API • Simple Queuing Service • Data Pipelining • OpsWorks • CloudFormation • Redshift Analytics • CloudSearch • Elastic Beanstalk • Elastic MapReduce • Simple Workflow Coordinator • S3 / Glacier
  28. 28. DevOps Wizardry • Extensive use of AWS • Monitor: Nagios, StatsD, and Graphite • Manage: Chef, OpsWorks, csshX • Deployments
  29. 29. Summary • Solutions Require Evolution • Build, Use, and Integrate Tools • Abstraction • Distribution • Monitoring & Automation
  30. 30. Evolution Takes Time: A revolution only lasts fifteen years, a period which coincides with the …
  31. 31. We’re … (Ask us about Food Coma Fridays)
  32. 32. Questions are guaranteed in life. Answers aren’t. Eric Lubow @elubow elubow@simplereach.co Thank you.
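
The SimpleReach slide equates 250m events per day with roughly 3k events per second. The arithmetic behind that figure can be sketched as below. The function name is made up for illustration, and the result is a flat daily average, so actual peak rates would presumably be higher than this number.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in a day


def average_events_per_second(events_per_day: int) -> float:
    """Convert a daily event count into a flat average per-second rate."""
    return events_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    rate = average_events_per_second(250_000_000)
    # 250,000,000 / 86,400 is about 2,893.5, which the slide rounds to ~3k
    print(f"{rate:,.1f} events/second on average")
```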

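The abstract and the "Evolution Takes Work" slide describe data stores sitting behind an internal API that gives applications consistent access patterns and failover/redundancy. The talk does not show how that API is built, so what follows is only a minimal sketch of the general idea, assuming in-memory stand-ins for the real stores. Every class and method name here is hypothetical and is not taken from SimpleReach's code.

```python
from typing import Dict, List, Optional


class StoreUnavailable(Exception):
    """Raised when a store (or every replica) cannot serve a request."""


class InMemoryStore:
    """Hypothetical stand-in for one data store node."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.healthy = True
        self._data: Dict[str, int] = {}

    def get(self, key: str) -> Optional[int]:
        if not self.healthy:
            raise StoreUnavailable(self.name)
        return self._data.get(key)

    def put(self, key: str, value: int) -> None:
        if not self.healthy:
            raise StoreUnavailable(self.name)
        self._data[key] = value


class InternalAPI:
    """One access point for callers; replicas are tried in order."""

    def __init__(self, replicas: List[InMemoryStore]) -> None:
        if not replicas:
            raise ValueError("at least one replica is required")
        self._replicas = replicas

    def put(self, key: str, value: int) -> int:
        """Write to every reachable replica and return how many accepted."""
        accepted = 0
        for replica in self._replicas:
            try:
                replica.put(key, value)
                accepted += 1
            except StoreUnavailable:
                continue  # redundancy: the other replicas still hold the write
        if accepted == 0:
            raise StoreUnavailable("no replica accepted the write")
        return accepted

    def get(self, key: str) -> Optional[int]:
        """Read from the first reachable replica (failover on error)."""
        for replica in self._replicas:
            try:
                return replica.get(key)
            except StoreUnavailable:
                continue  # failover: fall through to the next replica
        raise StoreUnavailable("no replica could serve the read")


if __name__ == "__main__":
    a, b = InMemoryStore("store-a"), InMemoryStore("store-b")
    api = InternalAPI([a, b])
    api.put("pageviews:example", 42)
    a.healthy = False  # simulate losing the first node
    print(api.get("pageviews:example"))  # still served, from store-b
```

Callers only ever talk to `InternalAPI`, so a store can be swapped or lose a node without changes to application code, which is the kind of abstraction the summary slide lists.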