- How do we currently think about Data Science?
- Why is infrastructure important to our field?
- Two tools we've built on Sailthru's Data Science team to deal with these problems are "Stolos" and "Relay.Mesos".
Jeremy Stanley, EVP/Data Scientist, Sailthru, at MLconf NYC
Cost Effectively Scaling Machine Learning Systems in the Cloud: E-commerce and publishing clients use Sailthru to personalize billions of digital experiences for their customers weekly. Earlier this year, Sailthru launched Sightlines to allow clients to predict the future behavior of individual users. In this talk we cover how we scaled Sightlines cost effectively in the cloud by combining inexpensive computing resources with an efficient architecture and an implementation that is easy to maintain and evolve.
To access computing resources cost effectively, we utilize Amazon spot instances and Apache Mesos to pool together large quantities of CPU and memory. This approach can be orders of magnitude more cost effective than traditional deployments, but requires sophisticated automation and orchestration tools, and a fine-grained fault tolerant application architecture.
Given cost effective resources, the next challenge was to design the application to be efficient. Simple sampling and data pre-processing techniques significantly limit the computational requirements without adversely impacting model performance. Further, by controlling how often we run various components of the pipeline, we minimize cost while keeping models up to date.
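The sampling idea in this abstract can be illustrated with reservoir sampling, which keeps a fixed-size uniform sample of an arbitrarily large event stream in O(k) memory. This is a generic sketch of the technique, not necessarily what Sailthru used; the event stream and sample size are made up for illustration:

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Algorithm R: a uniform random sample of k items from a stream
    of unknown length, using O(k) memory."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)        # replace with decreasing probability
            if j < k:
                sample[j] = item
    return sample

# Sample 1,000 events from a million-event stream before training a model.
events = range(1_000_000)
subset = reservoir_sample(events, k=1000)
```

Capping the sample size this way bounds both memory and downstream training cost regardless of how large the raw stream grows.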
The final challenge is to make such a system maintainable and easy to evolve. This includes removing single points of failure, automating infrastructure management, building distributed logging and monitoring capabilities, and running identical A/B production environments to enable aggressive, iterative changes to the code base and architecture in production.
We hope to demonstrate that the challenges faced in scaling a complex machine learning system in the cloud are at least as interesting as the science behind it, and to provide some insight into modern tools and methods for addressing these scalability challenges.
Video and slides synchronized, mp3 and slide download available at http://bit.ly/2lGNybu.
Stefan Krawczyk discusses how his team at StitchFix uses the cloud to enable over 80 data scientists to be productive. He also talks about prototyping ideas, algorithms and analyses; how they set up and keep schemas in sync between Hive, Presto, Redshift and Spark; and how they make access easy for their data scientists. Filmed at qconsf.com.
Stefan Krawczyk is Algo Dev Platform Lead at StitchFix, where he’s leading development of the algorithm development platform. He spent formative years at Stanford, LinkedIn, Nextdoor & Idibon, working on everything from growth engineering, product engineering, data engineering, to recommendation systems, NLP, data science and business intelligence.
Nubank is the leading fintech in Latin America. Using bleeding-edge technology, design, and data, the company aims to fight complexity and empower people to take control of their finances. We are disrupting an outdated and bureaucratic system by building a simple, safe and 100% digital environment.
In order to succeed, we need to constantly make better decisions at the speed of insight, and that is what we aim for when building Nubank’s Data Platform. In this talk we explore and share the guiding principles behind an automated, scalable, declarative and self-service platform with more than 200 contributors, mostly non-technical, who build 8,000 distinct datasets ingesting data from 800 databases, leveraging Apache Spark’s expressiveness and scalability.
The topics we want to explore are:
– Making data-ingestion a no-brainer when creating new services
– Reducing the cycle time to deploy new Datasets and Machine Learning models to production
– Closing the loop and leveraging knowledge processed in the analytical environment to take decisions in production
– Providing the perfect level of abstraction to users
You will get from this talk:
– Our love for ‘The Log’ and how we use it to decouple databases from their schemas and distribute the work of keeping schemas up to date across the entire team.
– How we made data ingestion so simple using Kafka Streams that teams stopped using databases for analytical data.
– The huge benefits of relying on the DataFrame API to create datasets, which made it possible to have end-to-end tests verifying that the 8,000 datasets work without even running a Spark job, and much more.
– The importance of creating the right amount of abstractions and restrictions to have the power to optimize.
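The claim that thousands of datasets can be verified without executing any job can be illustrated with a schema-level contract check: validate that each dataset's declared inputs and outputs line up before any data is touched. Everything below (the function, the customers/purchases example) is a hypothetical sketch of the idea, not Nubank's implementation:

```python
def check_dataset(inputs, required, produced):
    """Verify a dataset definition against its input schemas without
    running any job: every required input column must be provided by
    some source. `inputs` maps source name -> set of columns it provides."""
    available = set().union(*inputs.values())
    missing = required - available
    if missing:
        raise ValueError(f"missing input columns: {sorted(missing)}")
    return {"columns": sorted(produced)}

# Hypothetical dataset joining customers and purchases:
schema = check_dataset(
    inputs={"customers": {"customer_id", "signup_date"},
            "purchases": {"customer_id", "amount"}},
    required={"customer_id", "amount"},
    produced={"customer_id", "total_spent"},
)
```

Because the check operates on declared schemas rather than data, a test suite can run it across every dataset definition in milliseconds, which is the spirit of testing 8,000 datasets without a Spark job.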
Introduction to Artificial Intelligence and Machine Learning (bigdata trunk)
A workshop to introduce Artificial Intelligence and Machine Learning for beginners. It starts with the basics, terminology and concepts of machine learning, and compares them with deep learning and artificial intelligence. It highlights ML and AI offerings like Jupyter Notebook, Azure ML, Amazon SageMaker, TensorFlow, etc.
Doing Analytics Right - Building the Analytics Environment (Tasktop)
Implementing analytics for development processes is challenging. As discussed in the previous webinars, the right analytics are determined by the goals of the organization, not by the available data. So implementing your analytics solutions will require an efficient analytics and data architecture, including the ability to combine and stage data from heterogeneous sources. An architecture that excludes the ability to gain access to the necessary data will create a barrier to deploying your newly designed analytics program, and will force you back into the “light is brighter here” anti-pattern.
This webinar will describe the technical considerations of implementing the data architecture for your analytics program, and explain how Tasktop can help.
Projects failing because of “communication issues” is something I hear quite frequently. But how agile can we be in project communication? I will share my experience by overviewing the main lessons learned in the areas of:
Work planning/scheduling insights and hidden risks;
Tips on communication among team members and with outside stakeholders;
Tools & techniques for organizing effective and transparent communication;
Change requests and project information management: why and how?
An illustrated guide to microservices (ploneconf, 10-21-2016), Ambassador Labs
A (simpler) Microservices Definition
A Microservice is a unit of business logic.
A Microservice application is a distributed composition of business logic via services.
This presentation has slides from a talk that I gave at the annual Experimental Biology meeting, 2015, on our curriculum for Big Data Analytics in the Inland Empire.
The sole purpose of sharing these slides is to educate beginners in IT and Computer Science/Engineering. Credit should go to the referenced material and also to CICRA campus, Colombo 4, Sri Lanka, where I taught these in 2017.
Maximize Big Data ROI via Best of Breed Patterns and Practices (Jeff Bertman)
Abstract:
Not long ago the question was whether your organization had big data. Did you have the volume, the velocity, the technology? Now those basics are largely a given for most of the people attending this event. The path to success is still fuzzy, however, with so many technologies to choose from, and so many ways to use them.
This presentation triangulates in a holistic manner on the modern business dilemma: how can we leverage technology to improve revenue, profit, market share, and numerous other success criteria? That said, this is not about the analytics or KPIs, although it is about measurable improvement. It’s about lining up the right technologies and using them in effective, proven ways to maximize Return on Investment (ROI). Since the slant here is holistic, we’ll show how to blend infrastructure, tools, methods, and talent to avoid and constantly trim technical debt, and to produce success stories that are consistently repeatable, not a byproduct of individual heroics.
Introduction talk at the University of Strathclyde (Scotland) Algorithms Workshop, providing a quick overview of the fundamental and practical reasons why algorithms are/are not technical black boxes. (This talk does not address issues of trade secret or other business reasons for lack of transparency). The presentation was given to an audience of academics and students at the law department.
From prototype to production - The journey of re-designing SmartUp.io (Máté Lang)
A talk about the journey of a small tech team re-designing SmartUp.io from scratch, and the technical path from MVP to production.
A high-level overview of architecture and tech-stack decisions, best practices and culture.
(SPOT205) 5 Lessons for Managing Massive IT Transformation Projects (Amazon Web Services)
Choice Hotels is undertaking a multiyear, $20 million project to recreate our core business engines on AWS. In trying to approach this complex undertaking, we determined that the project itself is a system too. You can apply principles of good architecture and design work in how you approach the project structure and management. Come to this talk by Choice Hotels’ CTO to learn five key lessons and 20 concrete takeaways that you can implement today to help your AWS projects succeed.
Data Engineer's Lunch #85: Designing a Modern Data Stack (Anant Corporation)
What are the design considerations that go into architecting a modern data warehouse? This presentation will cover some of the requirements analysis, design decisions, and execution challenges of building a modern data lake/data warehouse.
How we integrate Machine Learning Algorithms into our IT Platform at Outfittery (OUTFITTERY)
Outfittery's mission is to provide relevant fashion to men. In the past, it was our stylists who put together the best outfits for our customers. But about a year ago we started to rely more on intelligent algorithms to augment our human experts.
This transition to become a data driven company has left its marks on our IT landscape:
In the beginning we just did simple A/B tests. Then we wanted to use more complex logic, so we added a generic data enrichment layer. Later we also provided easy configurability to steer processes. And this in turn enabled us to orchestrate our machine learning algorithms as self-contained Docker containers within a Kubernetes cluster. All in all it's a nice setup that we are pretty happy with.
It then took us some time to realise that we had actually built a delivery platform that can deliver any pure function our data scientists come up with, directly into our microservice landscape. We have just now started to use it that way: we put their R&D experiments directly into production... :-)
This talk will guide you through this journey, explain how this platform is built, and what we do with it.
ML in the Browser: Interactive Experiences with TensorFlow.js (C4Media)
Video and slides synchronized, mp3 and slide download available at https://bit.ly/39SddUL.
Victor Dibia provides a friendly introduction to machine learning and covers concrete steps on how front-end developers can create their own ML models and deploy them as part of web applications. He discusses his experience building Handtrack.js, a library for prototyping real-time hand-tracking interactions in the browser. Filmed at qconsf.com.
Victor Dibia is a Research Engineer with Cloudera’s Fast Forward Labs. Prior to this, he was a Research Staff Member at the IBM TJ Watson Research Center, New York. His research interests are at the intersection of human computer interaction, computational social science, and applied AI.
This presentation was given at one of the DSATL Meetups in March 2018, in partnership with Southern Data Science Conference 2018 (www.southerndatascience.com).
4. Talk Outline
Part 1:
● What is Data Science?
● Where should we spend our time as data scientists?
Part 2:
● How we balance infrastructure, optimization and problem formulation at Sailthru.
19. Components of a Solid Infrastructure
● Lots of Machinery: VMs, Containers
● Machines require coordination, redundancy and fault tolerance: CAP Theorem
20. Components of a Solid Infrastructure
● Resource Allocation: Fair Scheduling, Bin Packing
● Control strategies: Auto Scaling, Feedback, PID
● Communication algorithms: Gossip, Paxos, ...
● Configuration: Dynamic Persistence, Namespaces
● Monitoring: Anomaly Detection, Visualization
● Data Storage: Relational, Graph, Key-Value
● SO MANY TOOLS!
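One of the topics on this slide, bin packing for resource allocation, can be sketched with the classic first-fit-decreasing heuristic: place the largest demand first, into the first machine that still has room. The function and the memory figures below are illustrative, not code from any particular scheduler:

```python
def first_fit_decreasing(demands, capacity):
    """Pack task resource demands into as few fixed-capacity bins
    (machines) as the first-fit-decreasing heuristic allows."""
    bins = []  # each bin: [remaining_capacity, [placed demands]]
    for d in sorted(demands, reverse=True):
        if d > capacity:
            raise ValueError(f"demand {d} exceeds machine capacity")
        for b in bins:
            if b[0] >= d:          # first machine with room wins
                b[0] -= d
                b[1].append(d)
                break
        else:
            bins.append([capacity - d, [d]])  # open a new machine
    return [b[1] for b in bins]

# Tasks needing this many GB of RAM, on machines with 8 GB each:
print(first_fit_decreasing([5, 4, 3, 2, 2], capacity=8))
```

The same shape of problem (fitting CPU and RAM demands onto pooled machines) is what a Mesos-style resource allocator solves, just with multiple dimensions and live offers instead of a static list.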
21. So What is Data Science?
● Problem Formulation
● Infrastructure
● Optimization
23. As a Data Scientist, ...
...when do I:
○ build infrastructure that supports my ideas
○ optimize my existing models and problems
○ find new problems to work on
27. ● Sailthru is a personalization platform.
● We help our clients communicate with their customers.
● Our goal is to maximize the lifetime value of these customers so that our clients do well, customers are happy, and Sailthru is successful.
29. Sightlines - Example Use Cases
● Incentivize users with low chance of purchasing
● Personalize discounts above expected order value
● Suppress users likely to opt out of messages
● Engage users unlikely to open on other channels
34. What problem does it solve?
A Directed Acyclic Multi-Graph task dependency scheduler designed to simplify complex, distributed pipelines.
It creates application queues that can be consumed from in any order.
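The core behavior of a DAG task-dependency scheduler, running each task only after all of its upstream dependencies complete, can be sketched with a Kahn-style topological sort. This is a generic illustration of the idea, not Stolos code, and the pipeline tasks are hypothetical:

```python
from collections import defaultdict, deque

def runnable_order(deps):
    """deps maps task -> set of upstream tasks it depends on.
    Returns a valid execution order, or raises on a cycle."""
    indegree = {t: len(ups) for t, ups in deps.items()}
    downstream = defaultdict(list)
    for t, ups in deps.items():
        for u in ups:
            downstream[u].append(t)
    ready = deque(t for t, d in indegree.items() if d == 0)
    order = []
    while ready:
        t = ready.popleft()           # task whose dependencies are all done
        order.append(t)
        for d in downstream[t]:
            indegree[d] -= 1
            if indegree[d] == 0:      # last dependency just finished
                ready.append(d)
    if len(order) != len(deps):
        raise ValueError("cycle detected in task graph")
    return order

# A toy ML pipeline: extract -> clean -> {features, labels} -> train
pipeline = {
    "extract": set(),
    "clean": {"extract"},
    "features": {"clean"},
    "labels": {"clean"},
    "train": {"features", "labels"},
}
print(runnable_order(pipeline))
```

In a distributed setting, the `ready` queue is what gets exposed to workers: any task in it can be consumed in any order, which matches the queue-based consumption the slide describes.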
38. What problem does it solve?
Relay actively minimizes the difference between a measured signal and a target signal.
Relay.Mesos plugs Relay into a tool called Mesos.
→ Lets us auto-scale consumers of queued Stolos jobs
42. The PID Algorithm
e_t = SP − PV_t
MV_t = Kp · e_t + Ki · Σ e_τ · Δt
PV = Process Variable (Signal)
SP = Set Point (Target)
MV = Manipulated Variable (Output)
t = index on timesteps
**The “D” in PID is excluded here
43. The PID Algorithm
MV_t = Kp · e_t + Ki · Σ e_τ · Δt + Kd · Δe_t / Δt
(same notation as above; the derivative term Kd · Δe_t / Δt is now included)
48. Sightlines - On Mesos
(diagram: CPU units and RAM allocated across the Mesos cluster)