Adam Cataldo discusses how Wealthfront uses data analytics and data flows. Wealthfront is an automated financial advisor that manages portfolios for a low fee. Cataldo works on Wealthfront's data platform, which uses Hadoop and Cascading to process large amounts of data from users, investments, and business operations. This data is used for website optimization, investment research, and monitoring systems. Cascading provides a data flow abstraction to specify transformations across multiple MapReduce jobs. Avro is used to store and transport data efficiently in Hadoop. Results are analyzed in Amazon Redshift for ad-hoc queries.
2. Wealthfront & Me
• Wealthfront is the largest and fastest-growing software-based financial advisor
• We manage the first $10,000 for free, and the rest for only 0.25% a year
• Our automated trading system continuously rebalances a portfolio of low-cost ETFs, with continuous tax-loss harvesting for accounts over $100,000
• I’ve been working on the data platform we use for website optimization, investment research, business analytics, and operations
3. Why the Ptolemy conference?
• This is not a talk about modeling, simulation, and design of concurrent, real-time embedded systems
• This is a talk about the design of a data analytics system
• It turns out many of the patterns are the same in both fields
5. Hadoop at a Glance
• Scales well for large data sets
• Industry standard for data processing
• Optimized for high-throughput batch processing
• Long latency
• Overkill for small data sets
7. Why Cascading?
• Most real problems require multiple MapReduce jobs
• Provides a data-flow abstraction to specify data transformations
• Builds on standard database concepts: joins, groups, and so on
• Provides decent testing capabilities, which we’ve extended
8. From SQL to Cascading
select name from users join mails on users.email = mails.to

Pipe joined = new CoGroup(users, new Fields("email"), mails, new Fields("to"));
Pipe name = new Retain(joined, new Fields("name"));
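To make the mapping concrete, here is a minimal, self-contained sketch of how such a flow might be wired up and run with Cascading 2.x on Hadoop. The tap paths, delimiters, and field lists are hypothetical illustrations, not taken from the deck:

import cascading.cascade.Cascades;
import cascading.flow.Flow;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.CoGroup;
import cascading.pipe.Pipe;
import cascading.pipe.assembly.Retain;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class UserMailJoin {
  public static void main(String[] args) {
    // Hypothetical tab-delimited inputs and output on HDFS
    Tap users = new Hfs(new TextDelimited(new Fields("name", "email"), "\t"), "hdfs:///data/users");
    Tap mails = new Hfs(new TextDelimited(new Fields("to", "subject"), "\t"), "hdfs:///data/mails");
    Tap out = new Hfs(new TextDelimited(new Fields("name"), "\t"), "hdfs:///data/out");

    Pipe usersPipe = new Pipe("users");
    Pipe mailsPipe = new Pipe("mails");

    // Join on users.email = mails.to, then keep only the name field
    Pipe joined = new CoGroup(usersPipe, new Fields("email"), mailsPipe, new Fields("to"));
    Pipe name = new Retain(joined, new Fields("name"));

    Flow flow = new HadoopFlowConnector().connect(
        Cascades.tapsMap(Pipe.pipes(usersPipe, mailsPipe), Tap.taps(users, mails)), out, name);
    flow.complete(); // Cascading plans this into one or more MapReduce jobs and runs them
  }
}

The point of the abstraction shows in the last two lines: the developer declares the data flow, and Cascading decides how many MapReduce jobs it takes to execute it.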
10. Getting data ready for Cascading
(Diagram) Production MySQL DB → extract → transform → Avro files → load → Amazon Simple Storage Service
11. Why Avro?
• A compact data format, capable of storing large data sets
• We compress with Google Snappy
• Compressed files are still splittable into 128 MB chunks
• The de-facto file format for Hadoop
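As a concrete aside, writing a Snappy-compressed Avro container file takes only a few lines with the Avro Java API. The User schema below is a hypothetical stand-in for actual record types, a sketch rather than Wealthfront's code:

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroSnappyWriter {
  public static void main(String[] args) throws Exception {
    // Hypothetical record schema for a users extract
    Schema schema = SchemaBuilder.record("User").fields()
        .requiredString("name")
        .requiredString("email")
        .endRecord();

    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "Ada");
    user.put("email", "ada@example.com");

    // Write a Snappy-compressed Avro container file
    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.setCodec(CodecFactory.snappyCodec());
      writer.create(schema, new File("users.avro"));
      writer.append(user);
    }
  }
}

Because the codec compresses each block inside the container file separately, the file stays splittable, which is what makes the format work well with Hadoop.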
12. Running Cascading Jobs
(Diagram) Cascading jobs run on Elastic MapReduce over the Avro files in Amazon Simple Storage Service; results feed the Redshift data warehouse, the production MySQL DB, and online systems
13. What do we do with the data?
• We use it to track how well the investment product is performing
• We use it to track how well the business is performing
• We use it to monitor our production systems
• We use it to test how well new features perform on the website
14. Bandit Testing
• When rolling new features out, we expose the new version to some users and the old version to the rest
• We monitor what percent of users “convert”: sign up, fund account, etc.
• We gradually send more traffic to the winning variant of the experiment
• Similar to A/B testing, but way faster
16. Thompson Sampling
1. Estimate, for each variant of the experiment, the probability that it performs best, using Bayesian inference
2. Weight the percentage of traffic sent to each variant according to this probability
3. End the experiment when one variant has a 95% chance of winning, or when the losing arms have no more than a 5% chance of beating the winner by more than 1%
4. In 2012, Kaufmann et al. proved the optimality of Thompson sampling
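A minimal sketch of steps 1 and 2 for binary conversions, assuming a uniform Beta(1, 1) prior on each variant's conversion rate. It uses Apache Commons Math for the Beta draws and is an illustration, not Wealthfront's implementation:

import org.apache.commons.math3.distribution.BetaDistribution;

public class ThompsonSampler {
  // Per-variant conversion counts; the posterior for variant v is
  // Beta(successes[v] + 1, failures[v] + 1)
  private final int[] successes;
  private final int[] failures;

  public ThompsonSampler(int variants) {
    successes = new int[variants];
    failures = new int[variants];
  }

  // Pick the variant to show: draw once from each posterior, take the argmax.
  // Better variants win more draws, so they automatically receive more traffic.
  public int chooseVariant() {
    int best = 0;
    double bestDraw = Double.NEGATIVE_INFINITY;
    for (int v = 0; v < successes.length; v++) {
      double draw = new BetaDistribution(successes[v] + 1, failures[v] + 1).sample();
      if (draw > bestDraw) {
        bestDraw = draw;
        best = v;
      }
    }
    return best;
  }

  // Record whether the user who saw variant v converted
  public void record(int v, boolean converted) {
    if (converted) successes[v]++; else failures[v]++;
  }
}

The stopping rule in step 3 can be estimated the same way: repeatedly draw from every posterior at once and count how often each variant comes out on top.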
17. What’s Redshift?
• Amazon’s cloud-based data warehouse
• To support ad-hoc analysis, we copy all raw and computed data into Redshift
• It’s a column-oriented database, optimized for aggregate queries and joins over large batch sizes
18. What are the technical challenges?
• Testing complicated analytics computations is nontrivial
  - We ended up writing a small library to make testing Cascading jobs simpler
• Running multiple Hadoop jobs on large datasets takes a long time
  - We use Spark for prototyping, to get a speedup
• Your assumptions about the constraints on the data are always wrong
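The testing library mentioned above is internal to Wealthfront, but as a rough illustration of the kind of test it simplifies, Cascading's local mode can run an assembly in-process with no Hadoop cluster. The file names, fields, and assembly under test here are all hypothetical:

import java.io.FileWriter;
import cascading.flow.local.LocalFlowConnector;
import cascading.pipe.Pipe;
import cascading.pipe.assembly.Retain;
import cascading.scheme.local.TextDelimited;
import cascading.tap.SinkMode;
import cascading.tap.local.FileTap;
import cascading.tuple.Fields;

public class RetainAssemblyCheck {
  public static void main(String[] args) throws Exception {
    // Tiny fixture on local disk; no HDFS involved
    try (FileWriter w = new FileWriter("users.tsv")) {
      w.write("Ada\tada@example.com\n");
    }

    FileTap source = new FileTap(new TextDelimited(new Fields("name", "email"), "\t"), "users.tsv");
    FileTap sink = new FileTap(new TextDelimited(new Fields("name"), "\t"), "out.tsv", SinkMode.REPLACE);

    // The assembly under test: keep only the name field
    Pipe pipe = new Retain(new Pipe("users"), new Fields("name"));

    // Runs in-process, so a unit test can simply assert on the contents of out.tsv
    new LocalFlowConnector().connect(source, sink, pipe).complete();
  }
}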
19. Where’s this heading?
• We have a unique collection of consumer web data and financial data
• There are many ways we can combine this data to make our product better
• Hypothetical example: suggest portfolio risk adjustments based on a client’s withdrawal patterns
20. How is this relevant?
• We use data flow as the primary model of computation
• While the time scales are much slower, we have timing constraints, called SLAs, imposed by production use cases
• We have to make sure all code can safely execute concurrently on multiple machines, cores, and threads
21. Disclosure
Nothing in this presentation should be construed as a solicitation or offer, or recommendation, to buy or sell any security. Financial advisory services are only provided to investors who become Wealthfront clients pursuant to a written agreement, which investors are urged to read and carefully consider in determining whether such agreement is suitable for their individual facts and circumstances. Past performance is no guarantee of future results, and any hypothetical returns, expected returns, or probability projections may not reflect actual future performance. Investors should review Wealthfront’s website for additional information about advisory services.