wealthfront.com

DATA FLOW
IN THE DATA CENTER

Adam Cataldo @djscrooge
November 7, 2013

Wealthfront & Me
• Wealthfront is the largest and fastest-growing software-based financial advisor
• We manage the first $10,000 for free and the rest for only
0.25% a year
• Our automated trading system continuously rebalances
a portfolio of low-cost ETFs, with continuous tax-loss
harvesting for accounts over $100,000
• I’ve been working on the data platform we use for
website optimization, investment research, business
analytics, and operations

Why the Ptolemy conference?
• This is not a talk about modeling, simulation, and
design of concurrent, real-time embedded systems
• This is a talk about the design of a data analytics
system
• It turns out many of the patterns are the same in both
fields

MapReduce & Hadoop

Hadoop at a Glance
• Scales well for large data sets
• Industry standard for data processing
• Optimized for high-throughput batch processing

• Long latency
• Overkill for small data sets

Cascading

Why Cascading?
• Most real problems require multiple MapReduce jobs
• Provides a data-flow abstraction to specify data
transformations
• Builds on standard database concepts: joins, groups,
and so on
• Provides decent testing capabilities, which we’ve
extended

From SQL to Cascading

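select name from users join mails on users.email = mails.to;

// Join users and mails on users.email = mails.to, then keep only the name field
Pipe joined = new CoGroup(users, new Fields("email"), mails, new Fields("to"));
Pipe name = new Retain(joined, new Fields("name"));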

Cascading to Hadoop

[Diagram: the users and mails inputs each feed their own set of mappers; the mapper outputs meet in the join reducers, which produce the result]
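The map and reduce phases on this slide can be imitated in plain Java. The following is only a sketch of a reduce-side join under simple assumptions: it runs in one process, it is not Cascading or Hadoop code, and the class name, method names, and the {name, email} and {to, subject} record layouts are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a reduce-side join; plain Java, not Cascading or Hadoop code.
class ReduceSideJoinSketch {

    // Everything the shuffle hands to one reducer call: the records sharing a join key.
    static final class Group {
        final List<String> userNames = new ArrayList<>();
        final List<String> mailSubjects = new ArrayList<>();
    }

    // Map + shuffle: key each user by email and each mail by recipient, then group by key.
    static Map<String, Group> mapAndShuffle(List<String[]> users, List<String[]> mails) {
        Map<String, Group> groups = new TreeMap<>();
        for (String[] user : users) {   // assumed layout: {name, email}
            groups.computeIfAbsent(user[1], k -> new Group()).userNames.add(user[0]);
        }
        for (String[] mail : mails) {   // assumed layout: {to, subject}
            groups.computeIfAbsent(mail[0], k -> new Group()).mailSubjects.add(mail[1]);
        }
        return groups;
    }

    // Reduce: within each key, emit one name per (user, mail) pair, like an inner join.
    static List<String> join(List<String[]> users, List<String[]> mails) {
        List<String> names = new ArrayList<>();
        for (Group group : mapAndShuffle(users, mails).values()) {
            for (String name : group.userNames) {
                for (int i = 0; i < group.mailSubjects.size(); i++) {
                    names.add(name);
                }
            }
        }
        return names;
    }

    public static void main(String[] args) {
        List<String[]> users = List.of(
            new String[] {"Ada", "ada@example.com"},
            new String[] {"Bob", "bob@example.com"});
        List<String[]> mails = List.of(
            new String[] {"ada@example.com", "hello"},
            new String[] {"ada@example.com", "again"},
            new String[] {"carol@example.com", "lost"});
        // A user appears once per matching mail; unmatched users and mails are dropped.
        System.out.println(join(users, mails));
    }
}
```

In a real cluster the grouping step is the shuffle between mappers and reducers, so each key's records may arrive from many machines; the sketch collapses that into a single sorted map.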

Getting data ready for Cascading

[Diagram: data is extracted from the production MySQL DB, transformed into Avro files, and loaded into Amazon Simple Storage Service]

Why Avro?

• A compact data format, capable of storing large data sets
• We compress with Google Snappy
• Compressed files remain splittable into 128 MB chunks
• De facto file format for Hadoop

Running Cascading Jobs
[Diagram: components shown are the production MySQL DB, Amazon Simple Storage Service, Elastic MapReduce, online systems, and the Redshift data warehouse]

What do we do with the data?
• We use it to track how well the investment product is
performing
• We use it to track how well the business is performing
• We use it to monitor our production systems
• We use it to test how well new features perform on the
website

Bandit Testing
• When rolling new features out, we expose
the new version to some users and the old
version to the rest
• We monitor what percent of users
“convert”: sign up, fund account, etc.
• We gradually send more traffic to the
winning variant of the experiment
• Similar to A/B testing, but way faster

Does anyone know where the name bandit testing comes from?

Thompson Sampling
1. Estimate the probability for each variant of the
experiment that it performs best, using Bayesian
inference
2. Weight the percentage of traffic sent to each variant
according to this probability
3. End the experiment when one variant has a 95%
chance of winning, or when the losing arms have no
more than a 5% chance of beating the winner by more
than 1%
4. In 2012, Kaufmann et al. proved that Thompson sampling
is asymptotically optimal
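Steps 1 to 3 can be sketched in a few lines of Java. This is a minimal illustration, not Wealthfront's implementation: it assumes each visitor either converts or does not, puts a uniform Beta(1, 1) prior on every variant's conversion rate, estimates the "probability of being best" by Monte Carlo, and checks only the first stopping condition (95% chance of winning). The class name, method names, and sample counts are hypothetical.

```java
import java.util.Random;

// Hypothetical sketch of Thompson sampling for conversion experiments.
class ThompsonSamplingSketch {

    // Draw from Gamma(shape, 1) for an integer shape as a sum of exponential draws.
    static double sampleGamma(int shape, Random rng) {
        double sum = 0.0;
        for (int i = 0; i < shape; i++) {
            sum -= Math.log(1.0 - rng.nextDouble()); // 1 - U lies in (0, 1], so the log is finite
        }
        return sum;
    }

    // Draw from Beta(a, b) for integer a and b as a ratio of two gamma draws.
    static double sampleBeta(int a, int b, Random rng) {
        double x = sampleGamma(a, rng);
        double y = sampleGamma(b, rng);
        return x / (x + y);
    }

    // Step 1: estimate P(variant i has the highest conversion rate) by sampling
    // each variant's Beta(1 + conversions, 1 + non-conversions) posterior.
    static double[] probabilityBest(int[] conversions, int[] visitors, int draws, Random rng) {
        double[] wins = new double[conversions.length];
        for (int d = 0; d < draws; d++) {
            int best = 0;
            double bestRate = -1.0;
            for (int i = 0; i < conversions.length; i++) {
                double rate = sampleBeta(1 + conversions[i], 1 + visitors[i] - conversions[i], rng);
                if (rate > bestRate) {
                    bestRate = rate;
                    best = i;
                }
            }
            wins[best] += 1.0;
        }
        for (int i = 0; i < wins.length; i++) {
            wins[i] /= draws;
        }
        return wins;
    }

    // Step 3, first condition only: stop once some variant is at least 95% likely to be best.
    static boolean shouldStop(double[] probabilityBest) {
        for (double p : probabilityBest) {
            if (p >= 0.95) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        int[] conversions = {30, 60};   // made-up counts for two variants
        int[] visitors = {1000, 1000};
        // Step 2: the estimated probabilities double as the traffic weights.
        double[] weights = probabilityBest(conversions, visitors, 2000, rng);
        System.out.printf("traffic weights: %.3f / %.3f, stop = %b%n",
            weights[0], weights[1], shouldStop(weights));
    }
}
```

Because the weights are recomputed as data arrives, a clearly losing variant quickly receives little traffic, which is one way to read the claim that bandit testing finishes faster than a fixed-split A/B test.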
What’s Redshift?
• Amazon’s cloud-based data
warehouse database
• To support ad-hoc analysis,
we copy all raw and computed
data into Redshift
• It’s a column-oriented
database, optimized for
aggregate queries and joins
over large batch sizes

What are the technical challenges?
• Testing complicated analytics computations is nontrivial
  - We ended up writing a small library to make testing Cascading jobs simpler
• Running multiple Hadoop jobs on large datasets takes a long time
  - We use Spark for prototyping, to get a speedup
• Your assumptions about the constraints on the data are always wrong

Where’s this heading?
• We have a unique collection of
consumer web data and
financial data
• There are many ways we can
combine this data to make our
product better
• Hypothetical example: suggest
portfolio risk adjustments
based on a client’s withdrawal
patterns

How is this relevant?
• We use data flow as the
primary model of computation
• While the time scales are much
slower, we have timing
constraints, called SLAs,
imposed by production use
cases
• We have to make sure all code
can safely execute
concurrently on multiple
machines, cores, and threads

Disclosure
Nothing in this presentation should be construed as
a solicitation or offer, or recommendation, to buy
or sell any security. Financial advisory services
are only provided to investors who become
Wealthfront clients pursuant to a written agreement,
which investors are urged to read and carefully
consider in determining whether such agreement is
suitable for their individual facts and
circumstances. Past performance is no guarantee of
future results, and any hypothetical returns,
expected returns, or probability projections may not
reflect actual future performance. Investors should
review Wealthfront’s website for additional
information about advisory services.
DevOps and Testing slides at DASA Connect
 

Data flow in the data center

  • 1. wealthfront.com DATA FLOW IN THE DATA CENTER Adam Cataldo @djscrooge November 7, 2013
  • 2. Wealthfront & Me • Wealthfront is the largest and fastest-growing software-based financial advisor • We manage the first $10,000 for free and the rest for only 0.25% a year • Our automated trading system continuously rebalances a portfolio of low-cost ETFs, with continuous tax-loss harvesting for accounts over $100,000 • I’ve been working on the data platform we use for website optimization, investment research, business analytics, and operations
  • 3. Why the Ptolemy conference? • This is not a talk about modeling, simulation, and design of concurrent, real-time embedded systems • This is a talk about the design of a data analytics system • It turns out many of the patterns are the same in both fields
  • 5. Hadoop at a Glance • Scales well for large data sets • Industry standard for data processing • Optimized for high-throughput batch processing • Long latency • Overkill for small data sets
  • 7. Why Cascading? • Most real problems require multiple MapReduce jobs • Provides a data-flow abstraction to specify data transformations • Builds on standard database concepts: joins, groups, and so on • Provides decent testing capabilities, which we’ve extended
  • 8. From SQL to Cascading select name from users join mails on users.email=mails.to Pipe joined = new CoGroup(users, new Fields("email"), mails, new Fields("to")); Pipe name = new Retain(joined, new Fields("name"));
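
What the join on this slide computes can be sketched in plain Java, without Cascading or Hadoop. This only illustrates the semantics of the query. The `User` and `Mail` types and the `namesOfRecipients` method are hypothetical names chosen for the example, and nothing here reflects how Cascading plans or runs the join.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class JoinSketch {
    // Hypothetical row types; the field names mirror the SQL on the slide.
    static final class User {
        final String name;
        final String email;
        User(String name, String email) { this.name = name; this.email = email; }
    }

    static final class Mail {
        final String to;
        final String subject;
        Mail(String to, String subject) { this.to = to; this.subject = subject; }
    }

    // select name from users join mails on users.email = mails.to
    static List<String> namesOfRecipients(List<User> users, List<Mail> mails) {
        // Group users by the join key, much as a reducer sees rows grouped by key.
        Map<String, List<User>> usersByEmail = new HashMap<>();
        for (User u : users) {
            usersByEmail.computeIfAbsent(u.email, k -> new ArrayList<>()).add(u);
        }
        // Inner join: one output row per matching (user, mail) pair, keeping only the name.
        List<String> names = new ArrayList<>();
        for (Mail m : mails) {
            for (User u : usersByEmail.getOrDefault(m.to, Collections.<User>emptyList())) {
                names.add(u.name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        List<User> users = Arrays.asList(
            new User("Ada", "ada@example.com"),
            new User("Bob", "bob@example.com"));
        List<Mail> mails = Arrays.asList(
            new Mail("ada@example.com", "hello"),
            new Mail("ada@example.com", "again"),
            new Mail("nobody@example.com", "lost"));
        System.out.println(namesOfRecipients(users, mails)); // prints [Ada, Ada]
    }
}
```

As in SQL, a user with two matching mails appears twice, and mails with no matching user are dropped.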
  • 10. Getting data ready for Cascading [Diagram: an extract, transform, load pipeline from the production MySQL DB, through Avro files, to Amazon Simple Storage Service]
  • 11. Why Avro? • A compact data format, capable of storing large data sets • We compress with Google Snappy • Compressed files are splittable into 128 MB chunks • De facto file format for Hadoop
  • 12. Running Cascading Jobs [Diagram components: Elastic MapReduce, Production MySQL DB, Amazon Simple Storage Service, Online Systems, Redshift data warehouse]
  • 13. What do we do with the data? • We use it to track how well the investment product is performing • We use it to track how well the business is performing • We use it to monitor our production systems • We use it to test how well new features perform on the website
  • 14. Bandit Testing • When rolling new features out, we expose the new version to some users and the old version to the rest • We monitor what percentage of users “convert”: sign up, fund account, etc. • We gradually send more traffic to the winning variant of the experiment • Similar to A/B testing, but way faster
  • 15. Does anyone know where the name bandit testing comes from?
  • 16. Thompson Sampling 1. Estimate the probability for each variant of the experiment that it performs best, using Bayesian inference 2. Weight the percentage of traffic sent to each variant according to this probability 3. End the experiment when one variant has a 95% chance of winning, or when the losing arms have no more than a 5% chance of beating the winner by more than 1% 4. In 2012, Kaufmann et al. proved the asymptotic optimality of Thompson sampling
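
Steps 1 to 3 above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the implementation described in the talk: it assumes conversion is a yes/no outcome per visitor, puts a uniform Beta(1, 1) prior on each variant's conversion rate, and estimates the probability of being best by Monte Carlo sampling. The class and method names are hypothetical, and only the first stopping condition (a 95% chance of winning) is shown.

```java
import java.util.Arrays;
import java.util.Random;

final class ThompsonSampling {
    // Draw from Beta(a, b) for positive integers a and b: the a-th smallest of
    // a + b - 1 independent uniform draws has exactly this distribution.
    static double sampleBeta(int a, int b, Random rng) {
        double[] u = new double[a + b - 1];
        for (int i = 0; i < u.length; i++) {
            u[i] = rng.nextDouble();
        }
        Arrays.sort(u);
        return u[a - 1];
    }

    // Step 1: Monte Carlo estimate of P(variant i has the highest conversion rate).
    // Assumes a uniform Beta(1, 1) prior, so the posterior for each variant is
    // Beta(1 + conversions, 1 + non-conversions).
    // Step 2: the returned probabilities double as the traffic weights.
    static double[] probabilityBest(int[] conversions, int[] visitors, int draws, Random rng) {
        double[] wins = new double[conversions.length];
        for (int d = 0; d < draws; d++) {
            int best = 0;
            double bestRate = -1.0;
            for (int i = 0; i < conversions.length; i++) {
                int failures = visitors[i] - conversions[i];
                double rate = sampleBeta(1 + conversions[i], 1 + failures, rng);
                if (rate > bestRate) {
                    bestRate = rate;
                    best = i;
                }
            }
            wins[best] += 1.0;
        }
        for (int i = 0; i < wins.length; i++) {
            wins[i] /= draws;
        }
        return wins;
    }

    // Step 3 (first condition only): index of a variant whose chance of being
    // best reaches the threshold, or -1 if the experiment should keep running.
    static int winner(double[] probabilityBest, double threshold) {
        for (int i = 0; i < probabilityBest.length; i++) {
            if (probabilityBest[i] >= threshold) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] conversions = {40, 60};   // made-up counts for illustration
        int[] visitors = {1000, 1000};
        double[] p = probabilityBest(conversions, visitors, 2000, new Random(42));
        System.out.println("traffic weights: " + Arrays.toString(p));
        System.out.println("winner index (or -1): " + winner(p, 0.95));
    }
}
```

Because traffic follows the estimated probabilities, a variant that is clearly losing quickly receives very few users, which is where the speedup over a fixed 50/50 split comes from.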
  • 17. What’s Redshift? • Amazon’s cloud-based data warehouse database • To support ad-hoc analysis, we copy all raw and computed data into Redshift • It’s a column-oriented database, optimized for aggregate queries and joins over large batch sizes
  • 18. What are the technical challenges? • Testing complicated analytics computations is nontrivial - We ended up writing a small library to make testing Cascading jobs simpler • Running multiple Hadoop jobs on large datasets takes a long time - We use Spark for prototyping, to get a speedup • Your assumptions about the constraints on the data are always wrong
  • 19. Where’s this heading? • We have a unique collection of consumer web data and financial data • There are many ways we can combine this data to make our product better • Hypothetical example: suggest portfolio risk adjustments based on a client’s withdrawal patterns
  • 20. How is this relevant? • We use data flow as the primary model of computation • While the time scales are much slower, we have timing constraints, called SLAs, imposed by production use cases • We have to make sure all code can safely execute concurrently on multiple machines, cores, and threads
  • 21. Disclosure Nothing in this presentation should be construed as a solicitation or offer, or recommendation, to buy or sell any security. Financial advisory services are only provided to investors who become Wealthfront clients pursuant to a written agreement, which investors are urged to read and carefully consider in determining whether such agreement is suitable for their individual facts and circumstances. Past performance is no guarantee of future results, and any hypothetical returns, expected returns, or probability projections may not reflect actual future performance. Investors should review Wealthfront’s website for additional information about advisory services.