The Evolution of Hadoop at Stripe

Published in: Technology, Sports

  1. 1. THE EVOLUTION OF HADOOP AT STRIPE colin marc @colinmarc
  2. 2. ABOUT STRIPE • payments for the web • based in SF • last time I checked, ~75 people (stripe.com/about) • main product is an API
  3. 3. WITH US, DATA WAS AN AFTERTHOUGHT
  4. 4. A LOT OF OUR DATA IS IN MONGO • MongoDB is a fantastic application database • uses BSON - like JSON, but has a binary representation • MongoDB is schemaless, but has indexed queries and other features that are nice for applications
  5. 5. APPLICATION DBS SUCK FOR ANALYSIS • well, sometimes. relational databases are OK • MongoDB is awful (for this) • no joins • scans are painful • no declarative query language
  6. 6. SOLUTION: PUT THE DATA SOMEWHERE ELSE
  7. 7. V1: TSV + IMPALA • threw together a Hadoop cluster on the developer boxes • “nightly” script dumped models to TSV files in HDFS • janky script output the schema from our models • query from Impala
  8. 8. ASIDE: IMPALA IS PRETTY COOL • developed by Cloudera • absurdly fast queries over HDFS • SQL is great • most of our questions are ad-hoc [slide image annotations: “secrets =(”, “woah”]
  9. 9. A NICE EXPERIMENT, BUT... • schema translation is hard • SLOW SLOW SLOW • TSV is not a great format • script never runs • not production data
  10. 10. V2: MONGO -> HBASE • Impala can query HBase, I think? • @nelhage wrote MoSQL - let’s do the same thing, but put the data in HBase! • translating from one k/v store to another is easier
  11. 11. ZEROWING http://github.com/stripe/zerowing
  12. 12. FIRST, SNAPSHOT • using Mongo-Hadoop, map over your MongoDB database • HFileOutputFormat, completeBulkLoad
  13. 13. THEN, STREAM • tail the MongoDB oplog, like a replica set member • replicate inserts/updates/deletes by _id
  14. 14. HAVING DATA IN HDFS IS A GREAT
  15. 15. THEN, QUERY IT WITH IMPALA...UM • wait, impala can’t actually query HBase effectively • 30-40x slower over the same data • limiting factor is HBase scan speed, I think
  16. 16. LOST IN TRANSLATION • our schema problem is still there! • BSON is typed, but HBase is just strings • nested hashes still don’t work • lists??? • what is the canonical schema?
  17. 17. V3: PARQUET + THRIFT • instead of storing k/v pairs, just store the raw BSON blobs • write your MR jobs against HBase if you want up-to-date data • also periodically dump out Parquet files • use thrift definitions to manage schema
  18. 18. USING THRIFT AS SCHEMA • thrift is a nice way to define what fields we expect to be in the BSON • in most cases, we can do the translation automatically • decode on the backend, instead of during replication • no information loss
  19. 19. GENERATE THRIFT DEFINITIONS? • thrift still isn’t the canonical schema for our application - that exists in our ODM • wrote a quick ruby script to generate thrift definitions from our application models
  20. 20. PARQUET <3 THRIFT • columnar, read-optimized • with a little bit of glue, serialize any basic thrift struct easily
  21. 21. IMPALA <3 PARQUET • more glue can automatically import parquet files into Impala • Impala and parquet are designed to work well with each other • nested structs don’t work yet =(
  22. 22. SCALDING <3 PARQUET • we use scalding for a lot of MapReduce stuff • added ParquetSource to scalding to make this easy (source and sink)
  23. 23. THIS WORKS FOR ANY DATA • use thrift to define an intermediate or derived data type, and you get, for free: • serialization using parquet • easy MR jobs with scalding • ad-hoc querying with Impala
  24. 24. OVERVIEW [diagram] • Application Land: MongoDB • ZeroWing • Hadoop Land: HBase, Hadoop MR, Parquet Snapshots, Impala
  25. 25. QUESTIONS? • meeeee: @colinmarc • Stripe: stripe.com • we’re hiring! stripe.com/jobs • ZeroWing: github.com/stripe/zerowing • Impala: github.com/cloudera/impala • Parquet: parquet.github.com
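
Slide 13 describes replicating inserts/updates/deletes by _id from the MongoDB oplog. A minimal Python sketch of that idea follows. It is not ZeroWing's code: a plain dict stands in for the HBase table, the helper names `apply_oplog_entry` and `replay` are made up for illustration, and only the basic oplog entry shapes are handled (`op` of "i", "u" or "d", with `o` and `o2`). Updates are assumed to be either whole-document replacements or top-level `$set`/`$unset` modifiers; dotted paths and other operators are left out.

```python
from typing import Any, Dict, Iterable

# Stand-in for the HBase table: one row per document, keyed by _id (assumption).
Store = Dict[Any, Dict[str, Any]]


def apply_oplog_entry(store: Store, entry: Dict[str, Any]) -> None:
    """Apply one oplog-style entry to the store, keyed by _id."""
    op = entry["op"]
    if op == "i":
        # Insert: "o" holds the full document.
        doc = entry["o"]
        store[doc["_id"]] = dict(doc)
    elif op == "u":
        # Update: "o2" identifies the document, "o" describes the change.
        _id = entry["o2"]["_id"]
        change = entry["o"]
        if any(key.startswith("$") for key in change):
            # Modifier-style update; only top-level $set/$unset are handled here.
            doc = store.setdefault(_id, {"_id": _id})
            for field, value in change.get("$set", {}).items():
                doc[field] = value
            for field in change.get("$unset", {}):
                doc.pop(field, None)
        else:
            # Whole-document replacement.
            store[_id] = {**change, "_id": _id}
    elif op == "d":
        # Delete: "o" holds the _id of the removed document.
        store.pop(entry["o"]["_id"], None)
    # Any other op type (no-ops, commands) is ignored in this sketch.


def replay(store: Store, entries: Iterable[Dict[str, Any]]) -> Store:
    """Apply entries in order, as a tailing process would."""
    for entry in entries:
        apply_oplog_entry(store, entry)
    return store
```

Because every operation is keyed by _id, replaying the same entries on top of a snapshot converges to the same rows, which is what makes the snapshot-then-stream approach from slides 12 and 13 workable.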

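Slides 18 and 19 describe using thrift definitions as the schema for raw BSON blobs, and generating those definitions from application models. The deck says this was done with a Ruby script against the ODM; the Python sketch below only illustrates the shape of the idea. The `CHARGE_FIELDS` model, the type mapping, and the function names are all hypothetical.

```python
from typing import Any, Dict

# Hypothetical mapping from model field types to Thrift IDL type names.
THRIFT_TYPES = {str: "string", int: "i64", float: "double", bool: "bool"}

# Hypothetical application model: field name -> expected type.
CHARGE_FIELDS: Dict[str, type] = {"id": str, "amount": int, "paid": bool}


def thrift_struct(name: str, fields: Dict[str, type]) -> str:
    """Render a Thrift IDL struct for a model, one optional field per attribute."""
    lines = [f"struct {name} {{"]
    for index, (field, py_type) in enumerate(fields.items(), start=1):
        lines.append(f"  {index}: optional {THRIFT_TYPES[py_type]} {field};")
    lines.append("}")
    return "\n".join(lines)


def decode(fields: Dict[str, type], doc: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the fields the schema expects, and only when the type matches exactly."""
    return {
        key: doc[key]
        for key, expected in fields.items()
        if key in doc and type(doc[key]) is expected
    }


print(thrift_struct("Charge", CHARGE_FIELDS))
```

Since decoding happens when the data is read rather than during replication, the stored blob still holds any fields the schema does not mention. One caveat with this sketch: it numbers fields by their position in the model, and Thrift field ids need to stay stable as a schema evolves, so a real generator would have to pin ids rather than recompute them.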