2. Overview
• SimpleReach
• Definitions and Data Stores
• Evolution to Polyglottany
• Tie It Together
• Final Thoughts
• Questions
Polyglottany Is Not A Sin Eric Lubow @elubow
4. Size
• 150m events recorded per day and growing
• 600m pageviews per month and growing
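For a rough sense of scale, the figures above can be converted to average per-second rates. This is plain arithmetic on the slide's numbers; the 30-day month is an assumption, and real traffic is bursty, so peaks would sit above these averages.

```python
# Average rates implied by the slide's figures.
# Assumption: a 30-day month; real traffic peaks above the average.
SECONDS_PER_DAY = 24 * 60 * 60

events_per_day = 150_000_000
pageviews_per_month = 600_000_000

events_per_second = events_per_day / SECONDS_PER_DAY
pageviews_per_second = pageviews_per_month / (30 * SECONDS_PER_DAY)

print(f"{events_per_second:.0f} events/s on average")
print(f"{pageviews_per_second:.0f} pageviews/s on average")
```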
5. Polyglot Persistence
Polyglot Persistence, like polyglot programming, is all about
choosing the right persistence option for the task at hand.
http://www.sleberknight.com/blog/sleberkn/entry/polyglot_persistence
6. Right Tool For The Job
7. Decisions. Decisions.
Data
• What are my query patterns?
• Is my data ingestion high volume/high velocity?
• Am I batch loading data?
• Am I write heavy or read heavy?
• Are data relationships important?
• Does my data need to be immediately available everywhere?
• Are my display requirements for realtime data?
• Do I need to aggregate data on the fly?
• Is my data structured or unstructured?
• Does my data lend itself to a specific design pattern?
Tech
• How fault tolerant is the system?
• Is the encryption/authentication/authorization support sufficient for my needs?
• What supporting tools do I need?
• Is there support for my language?
• Are there monitoring architectures already built?
• Are there best practices guides already?
• Will the data need to be distributed?
Financial
• Am I cloud based?
• Am I hardware based?
• Am I a cloud/iron hybrid?
• How much am I willing to spend?
• How much am I willing to spend if something goes wrong?
Other
• Do I have legal requirements (HIPAA/FIPS/Sarbanes Oxley/PII)?
• What kind of enterprise support is available?
• What is the community like?
• Does the product roadmap pertain to my roadmap?
8. No One Size Fits All
9. Tools
C*
17. Cassandra C*
• Large data volume ingestion at high velocity
• Really fast writes to many locations (eventual consistency)
• Query by column groups within rows (slicing)
• OpsCenter
• Data toolkit: more than a data storage layer
• TTLs for small group aggregation
• Wrote Helenus, a Node.js driver for Cassandra
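One way to read "TTLs for small group aggregation" together with row slicing: write each event as its own column with a TTL, then count the columns still alive in the row to get a rolling total. The sketch below is an in-memory stand-in, not Cassandra code; `TtlRowStore`, the row key and the 300-second window are hypothetical. In CQL, the expiring write would carry a `USING TTL` clause.

```python
import time


class TtlRowStore:
    """In-memory stand-in for wide rows whose columns expire (hypothetical)."""

    def __init__(self, clock=time.time):
        self._clock = clock
        # row key -> {column name: expiry timestamp}
        self._rows = {}

    def insert(self, row_key, column, ttl_seconds):
        """Write one column that stops being visible after ttl_seconds."""
        row = self._rows.setdefault(row_key, {})
        row[column] = self._clock() + ttl_seconds

    def count_live(self, row_key):
        """Count the columns in a row that have not expired yet."""
        now = self._clock()
        row = self._rows.get(row_key, {})
        return sum(1 for expires_at in row.values() if expires_at > now)


# A fake clock keeps the example deterministic.
now = [1000.0]
store = TtlRowStore(clock=lambda: now[0])

store.insert("article-42:tweets", "event-1", ttl_seconds=300)
store.insert("article-42:tweets", "event-2", ttl_seconds=300)
now[0] += 200
store.insert("article-42:tweets", "event-3", ttl_seconds=300)
recent = store.count_live("article-42:tweets")

now[0] += 150  # the first two events are now past their TTL
later = store.count_live("article-42:tweets")
print(recent, later)
```

Expired columns simply drop out of the count, so no separate cleanup job is needed to keep the window small.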
18. MongoDB
• Fast atomic increments (Node.js is native JSON)
• Sharding
• Solid ODM for Rails (Mongoid)
• Fast access for pub/sub of durable/persisted documents
• B-Tree Indexes
• Document based via JSON
• TTLs for ephemeral data
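The atomic increments mentioned above map to MongoDB's `$inc` update operator. The sketch below only builds the filter and update documents so it runs without a server; the `url` and `counts.*` field names are hypothetical, not SimpleReach's schema.

```python
def build_increment(url, network, amount=1):
    """Return (filter, update) documents for an atomic counter bump."""
    query = {"url": url}
    update = {"$inc": {f"counts.{network}": amount, "counts.total": amount}}
    return query, update


query, update = build_increment("http://example.com/post", "twitter")

# With pymongo these would be sent in one round trip, for example:
#   collection.update_one(query, update, upsert=True)
print(query)
print(update)
```

Because the server applies `$inc` itself, concurrent writers do not need to read the counter first.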
19. Redis
• Supports hundreds of thousands of transactions per second
• Great caching engine
• Supports useful data types like sets, sorted sets, and lists
• Everything is guaranteed to be memory mapped (mmap)
• Transactional and supports bulk operations
• Centralized queueing and locking system
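Sorted sets are a natural fit for rankings such as "most shared content right now". The sketch below mimics the semantics of the Redis `ZINCRBY` and `ZREVRANGE` commands with a small in-memory class so it runs without a server; `SortedSetSketch` and the member names are hypothetical.

```python
class SortedSetSketch:
    """In-memory stand-in for one Redis sorted set (hypothetical helper)."""

    def __init__(self):
        self._scores = {}

    def zincrby(self, member, amount):
        """Add amount to the member's score, creating it at 0 if missing."""
        self._scores[member] = self._scores.get(member, 0.0) + amount
        return self._scores[member]

    def zrevrange(self, start, stop):
        """Return (member, score) pairs from highest to lowest score.

        Like Redis, the range is inclusive and -1 means the last element.
        """
        ranked = sorted(
            self._scores.items(),
            key=lambda pair: (pair[1], pair[0]),
            reverse=True,
        )
        end = None if stop == -1 else stop + 1
        return ranked[start:end]


trending = SortedSetSketch()
trending.zincrby("post-a", 3)
trending.zincrby("post-b", 5)
trending.zincrby("post-a", 4)

top = trending.zrevrange(0, 0)
everything = trending.zrevrange(0, -1)
print(top, everything)
```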
20. Infobright
• Works with standard MySQL driver
• Column Stores for ad-hoc analytics queries in SQL
• Databases built for business intelligence
• Heavy compression of data
• Pre-aggregated data (Knowledge Grid)
21. Ruby, Node.js, Python
• Polyglottany doesn’t only apply to data stores
• Each language has its own benefits for each data storage layer
• Each language has its own individual benefits
• JSON, APIs, Performance
23. Cons
• Redis - Can only utilize a single core. Serialization/deserialization (SerDe) cost.
• MySQL Column Store - DELETE/UPDATEs are VERY expensive
• Cassandra - No B-tree indexes
• Mongo - Indexes must fit in memory. Forced Replica ping times
• Python - Whitespace. Community
• Ruby - Not high performance enough for our standards
• JavaScript (Node.js) - Bad for CPU or IO intensive workloads
24. Tying It Together
Even with the right tools, 80% of the work of building a
big data system is acquiring and refining the raw data into
usable data.
26. Tying It Together
• Service Oriented Architecture (Internal API)
• Data accuracy checks: visual and programmatic
• Built framework for testing out storage engines
• Access to many toolsets (for all languages and DBs)
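A framework for testing out storage engines can be as small as one shared interface plus a replay of identical events into each candidate. The sketch below is an assumption about what such a harness might look like, not the one built at SimpleReach; `DictStore`, `LossyStore` and `evaluate` are hypothetical names, and the in-memory classes stand in for real adapters.

```python
from typing import Dict, List, Tuple


class DictStore:
    """Trivial in-memory backend standing in for a real data store."""

    def __init__(self) -> None:
        self._data: Dict[str, int] = {}

    def increment(self, key: str, amount: int) -> None:
        self._data[key] = self._data.get(key, 0) + amount

    def get(self, key: str) -> int:
        return self._data.get(key, 0)


class LossyStore(DictStore):
    """Deliberately broken backend that silently drops every third write."""

    def __init__(self) -> None:
        super().__init__()
        self._writes = 0

    def increment(self, key: str, amount: int) -> None:
        self._writes += 1
        if self._writes % 3 != 0:
            super().increment(key, amount)


def evaluate(backends, events: List[Tuple[str, int]]) -> Dict[str, bool]:
    """Replay the same events into every backend and check the totals."""
    expected: Dict[str, int] = {}
    for key, amount in events:
        expected[key] = expected.get(key, 0) + amount

    results = {}
    for name, store in backends.items():
        for key, amount in events:
            store.increment(key, amount)
        results[name] = all(store.get(k) == v for k, v in expected.items())
    return results


events = [("post-a", 1), ("post-b", 2), ("post-a", 1)]
results = evaluate({"dict": DictStore(), "lossy": LossyStore()}, events)
print(results)
```

Putting the same two methods behind an internal API means a new engine can be trialled by writing one adapter rather than touching every caller.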
28. Distributed Architecture
US-EAST-1a US-EAST-1b US-EAST-1e
CASSANDRA-0001 CASSANDRA-0002 CASSANDRA-0003
CASSANDRA-0010 CASSANDRA-0011 CASSANDRA-0012
REDIS-0001A REDIS-0001B
MYSQL-0001 MYSQL-0002
MONGO-SHARD-0000-A MONGO-SHARD-0000-B
MONGO-SHARD-0001-B MONGO-SHARD-0001-A
MONGO-SHARD-0002-B MONGO-SHARD-0002-A
iAPI-0001 iAPI-0002 iAPI-0003
29. Points To Consider
• Data consistency - Is the data the same in all data stores?
• How important is data durability?
• Managing many servers (Chef, AWS, CSSH)
• Managing and learning many different applications and tuning for them
• Expertise
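The data consistency point can be checked programmatically by pulling the same keys from each store and flagging disagreements. This is a minimal sketch that assumes each store can export a plain key-to-value snapshot; the store names and values below are made up for illustration.

```python
def find_mismatches(snapshots):
    """Return {key: {store: value}} for keys whose values disagree.

    snapshots maps a store name to a {key: value} dict; a key missing
    from a store is reported with the value None.
    """
    all_keys = set()
    for data in snapshots.values():
        all_keys.update(data)

    mismatches = {}
    for key in sorted(all_keys):
        values = {name: data.get(key) for name, data in snapshots.items()}
        if len(set(values.values())) > 1:
            mismatches[key] = values
    return mismatches


# Hypothetical snapshots of the same counters from three stores.
snapshots = {
    "cassandra": {"post-a": 7, "post-b": 5},
    "mongo": {"post-a": 7, "post-b": 4},
    "redis": {"post-a": 7},
}
mismatches = find_mismatches(snapshots)
print(mismatches)
```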
30. Expertise
• What happens when you need help?
• How do you become experts?
• What happens when you need more experts?
31. Summary
• Polyglottany is not a sin
• Know your data read/write patterns
• Know the tools available to you
• Know your compromises
• Expertise
33. Questions are guaranteed in life.
Answers aren’t.
Eric Lubow
@elubow
elubow@simplereach.com
#MongoBoston
Thank you.
Editor's Notes
SimpleReach is a social intelligence tool for content creators. We track every social action, on every major network, across the entire web in real time. That means every like, tweet, pin, stumble, and many more.