Data flow in the data center


Published in: Technology, Business

  1. DATA FLOW IN THE DATA CENTER • Adam Cataldo • @djscrooge • November 7, 2013
  2. Wealthfront & Me • Wealthfront is the largest and fastest growing software-based financial advisor • We manage the first $10,000 for free and the rest for only 0.25% a year • Our automated trading system continuously rebalances a portfolio of low-cost ETFs, with continuous tax-loss harvesting for accounts over $100,000 • I’ve been working on the data platform we use for website optimization, investment research, business analytics, and operations
  3. Why the Ptolemy conference? • This is not a talk about modeling, simulation, and design of concurrent, real-time embedded systems • This is a talk about the design of a data analytics system • It turns out many of the patterns are the same in both fields
  4. MapReduce & Hadoop
  5. Hadoop at a Glance • Scales well for large data sets • Industry standard for data processing • Optimized for batch-processing throughput • Long latency • Overkill for small data sets
  6. Cascading
  7. Why Cascading? • Most real problems require multiple MapReduce jobs • Provides a data-flow abstraction to specify data transformations • Builds on standard database concepts: joins, groups, and so on • Provides decent testing capabilities, which we’ve extended
  8. From SQL to Cascading • SQL: select name from users join mails on • Cascading: Pipe joined = new CoGroup(users, "email", mails, "to"); Pipe name = new Retain(joined, "lastName");
  9. Cascading to Hadoop [Diagram: mails and users each feed their own mappers (mails mappers, users mappers); join reducers combine the mapper outputs into the result]
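The mappers-then-join-reducers flow on slide 9 can be illustrated with a small in-memory simulation. This is only a sketch of a reduce-side join under stated assumptions, not how Cascading or Hadoop implement it: the class `ReduceSideJoin`, its method names, and the sample rows are hypothetical, while the field names `email`, `to`, and `lastName` are taken from slide 8.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReduceSideJoin {
    // A row tagged with the input it came from ("users" or "mails").
    static final class Tagged {
        final String source;
        final Map<String, String> row;
        Tagged(String source, Map<String, String> row) {
            this.source = source;
            this.row = row;
        }
    }

    // Map phase: emit (join key, tagged row) for every row of one input.
    static void map(String source, String keyField, List<Map<String, String>> rows,
                    Map<String, List<Tagged>> shuffle) {
        for (Map<String, String> row : rows) {
            shuffle.computeIfAbsent(row.get(keyField), k -> new ArrayList<>())
                   .add(new Tagged(source, row));
        }
    }

    // Reduce phase: within each key group, pair every users row with every
    // mails row (inner join) and keep only the lastName field.
    static List<String> reduce(Map<String, List<Tagged>> shuffle) {
        List<String> result = new ArrayList<>();
        for (List<Tagged> group : shuffle.values()) {
            for (Tagged user : group) {
                if (!user.source.equals("users")) continue;
                for (Tagged mail : group) {
                    if (mail.source.equals("mails")) {
                        result.add(user.row.get("lastName"));
                    }
                }
            }
        }
        return result;
    }

    // The sorted map stands in for Hadoop's shuffle, which groups rows by key.
    static List<String> join(List<Map<String, String>> users, List<Map<String, String>> mails) {
        Map<String, List<Tagged>> shuffle = new TreeMap<>();
        map("users", "email", users, shuffle);
        map("mails", "to", mails, shuffle);
        return reduce(shuffle);
    }

    public static void main(String[] args) {
        List<Map<String, String>> users = List.of(
            Map.of("email", "ann@example.com", "lastName", "Ames"),
            Map.of("email", "bob@example.com", "lastName", "Baker"));
        List<Map<String, String>> mails = List.of(
            Map.of("to", "ann@example.com"),
            Map.of("to", "ann@example.com"));
        System.out.println(join(users, mails));
    }
}
```

In a real cluster the map calls run in parallel on separate machines and the grouping is done by the framework; the single shared map here only mimics that grouping.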
  10. Getting data ready for Cascading [Diagram: extract, transform, load. Labels: Production MySQL DB, Avro files, Amazon Simple Storage Service]
  11. Why Avro? • A compact data format, capable of storing large data sets • We compress with Google Snappy • Compressed files are splittable into 128 MB chunks • De facto file format for Hadoop
  12. Running Cascading Jobs [Diagram labels: Production MySQL DB, Amazon Simple Storage Service, Elastic MapReduce, Online Systems, Redshift data warehouse]
  13. What do we do with the data? • We use it to track how well the investment product is performing • We use it to track how well the business is performing • We use it to monitor our production systems • We use it to test how well new features perform on the website
  14. Bandit Testing • When rolling new features out, we expose the new version to some users and the old version to the rest • We monitor what percentage of users “convert”: sign up, fund account, etc. • We gradually send more traffic to the winning variant of the experiment • Similar to A/B testing, but way faster
  15. Does anyone know where the name bandit testing comes from?
  16. Thompson Sampling 1. Estimate, using Bayesian inference, the probability that each variant of the experiment performs best 2. Weight the percentage of traffic sent to each variant according to this probability 3. End the experiment when one variant has a 95% chance of winning, or when the losing arms have no more than a 5% chance of beating the winner by more than 1% 4. In 2012, Kaufmann et al. proved the optimality of Thompson sampling
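Steps 1 to 3 on slide 16 can be sketched for conversion data as follows. This is a minimal illustration, not Wealthfront's implementation, and it rests on assumptions the slides do not state: each visitor either converts or does not, each variant's conversion rate gets a uniform Beta(1, 1) prior, and the probability of being best is estimated by Monte Carlo sampling. The class and method names are hypothetical, and only the first stopping condition (95% chance of winning) is shown.

```java
import java.util.Arrays;
import java.util.Random;

public class ThompsonSampling {
    // Gamma(k, 1) draw for integer k >= 1, as a sum of k Exp(1) draws.
    static double sampleGamma(int k, Random rng) {
        double sum = 0.0;
        for (int i = 0; i < k; i++) {
            // 1 - U lies in (0, 1], so the logarithm is always finite.
            sum -= Math.log(1.0 - rng.nextDouble());
        }
        return sum;
    }

    // Beta(a, b) draw for integer a, b >= 1, built from two Gamma draws.
    static double sampleBeta(int a, int b, Random rng) {
        double x = sampleGamma(a, rng);
        double y = sampleGamma(b, rng);
        return x / (x + y);
    }

    // Step 1: estimate P(variant i has the highest conversion rate).
    // With a Beta(1, 1) prior, the posterior for a variant is
    // Beta(conversions + 1, visitors - conversions + 1).
    static double[] probabilityBest(int[] conversions, int[] visitors, int draws, Random rng) {
        int n = conversions.length;
        double[] wins = new double[n];
        for (int d = 0; d < draws; d++) {
            int best = 0;
            double bestRate = -1.0;
            for (int i = 0; i < n; i++) {
                double rate = sampleBeta(conversions[i] + 1,
                                         visitors[i] - conversions[i] + 1, rng);
                if (rate > bestRate) {
                    bestRate = rate;
                    best = i;
                }
            }
            wins[best] += 1.0;
        }
        for (int i = 0; i < n; i++) {
            wins[i] /= draws;
        }
        return wins;
    }

    // Step 3, first condition only: stop once some variant reaches the threshold.
    static boolean hasWinner(double[] pBest, double threshold) {
        for (double p : pBest) {
            if (p >= threshold) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Random rng = new Random(7);
        // Hypothetical counts: old version first, new version second.
        double[] pBest = probabilityBest(new int[] {30, 45}, new int[] {1000, 1000}, 10000, rng);
        // Step 2: the traffic share for each variant is its probability of being best.
        System.out.println("traffic weights: " + Arrays.toString(pBest));
        System.out.println("stop experiment: " + hasWinner(pBest, 0.95));
    }
}
```

The second stopping condition on the slide (losing arms having at most a 5% chance of beating the winner by more than 1%) would need the sampled rate differences as well, so it is left out of this sketch.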
  17. What’s Redshift? • Amazon’s cloud-based data warehouse database • To support ad-hoc analysis, we copy all raw and computed data into Redshift • It’s a column-oriented database, optimized for aggregate queries and joins over large batch sizes
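To see why a column-oriented layout suits aggregate queries, compare the two in-memory layouts below. This is a toy illustration of the general idea only and says nothing about Redshift's internals; the class `ColumnStoreSketch` and the fields `accountId`, `state`, and `balance` are hypothetical.

```java
public class ColumnStoreSketch {
    // Row-oriented layout: each record keeps all of its fields together.
    static final class Row {
        final long accountId;
        final String state;
        final double balance;
        Row(long accountId, String state, double balance) {
            this.accountId = accountId;
            this.state = state;
            this.balance = balance;
        }
    }

    // Column-oriented layout: one contiguous array per field.
    static final class Columns {
        final long[] accountId;
        final String[] state;
        final double[] balance;
        Columns(int size) {
            accountId = new long[size];
            state = new String[size];
            balance = new double[size];
        }
    }

    // Convert rows into per-field arrays.
    static Columns toColumns(Row[] rows) {
        Columns c = new Columns(rows.length);
        for (int i = 0; i < rows.length; i++) {
            c.accountId[i] = rows[i].accountId;
            c.state[i] = rows[i].state;
            c.balance[i] = rows[i].balance;
        }
        return c;
    }

    // Equivalent of SELECT SUM(balance): reads the balance array only and
    // never touches accountId or state.
    static double totalBalance(Columns c) {
        double total = 0.0;
        for (double b : c.balance) {
            total += b;
        }
        return total;
    }

    public static void main(String[] args) {
        Row[] rows = {
            new Row(1L, "CA", 100.0),
            new Row(2L, "NY", 250.5),
            new Row(3L, "CA", 49.5)
        };
        System.out.println(totalBalance(toColumns(rows)));
    }
}
```

With the row layout, the same sum would have to walk every record and skip over the fields it does not need, which is the extra work a column store avoids on wide tables.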
  18. What are the technical challenges? • Testing complicated analytics computations is nontrivial - We ended up writing a small library to make testing Cascading jobs simpler • Running multiple Hadoop jobs on large datasets takes a long time - We use Spark for prototyping, to get a speedup • Your assumptions about the constraints on the data are always wrong
  19. Where’s this heading? • We have a unique collection of consumer web data and financial data • There are many ways we can combine this data to make our product better • Hypothetical example: suggest portfolio risk adjustments based on a client’s withdrawal patterns
  20. How is this relevant? • We use data flow as the primary model of computation • While the time scales are much slower, we have timing constraints, called SLAs, imposed by production use cases • We have to make sure all code can safely execute concurrently on multiple machines, cores, and threads
  21. Disclosure • Nothing in this presentation should be construed as a solicitation or offer, or recommendation, to buy or sell any security. Financial advisory services are only provided to investors who become Wealthfront clients pursuant to a written agreement, which investors are urged to read and carefully consider in determining whether such agreement is suitable for their individual facts and circumstances. Past performance is no guarantee of future results, and any hypothetical returns, expected returns, or probability projections may not reflect actual future performance. Investors should review Wealthfront’s website for additional information about advisory services.