Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2lGNybu.
Stefan Krawczyk discusses how his team at Stitch Fix uses the cloud to enable over 80 data scientists to be productive. He also talks about prototyping ideas, algorithms and analyses, how they set up and keep schemas in sync between Hive, Presto, Redshift and Spark, and how they make access easy for their data scientists. Filmed at qconsf.com.
Stefan Krawczyk is Algo Dev Platform Lead at Stitch Fix, where he’s leading development of the algorithm development platform. He spent formative years at Stanford, LinkedIn, Nextdoor & Idibon, working on everything from growth engineering, product engineering and data engineering to recommendation systems, NLP, data science and business intelligence.
Data Day Texas 2017: Scaling Data Science at Stitch Fix - Stefan Krawczyk
At Stitch Fix we have a lot of Data Scientists, around eighty at last count. One reason I think we have so many is that we do things differently. To get their work done, Data Scientists have access to whatever resources they need (within reason), because they’re responsible for their work end to end: they collaborate with their business partners on objectives and then prototype, iterate, productionize, monitor and debug everything and anything required to get the desired output. They’re full data-stack data scientists!
The teams in the organization do a variety of different tasks:
- Clothing recommendations for clients.
- Clothes reordering recommendations.
- Time series analysis & forecasting of inventory, client segments, etc.
- Warehouse worker path routing.
- NLP.
… and more!
They’re also quite prolific at what they do -- we are approaching 4500 job definitions at last count. So one might be wondering now, how have we enabled them to get their jobs done without getting in the way of each other?
This is where the Data Platform team comes into play. With the goal of lowering the cognitive overhead and engineering effort required on the part of the Data Scientist, the Data Platform team tries to provide abstractions and infrastructure to help the Data Scientists. The relationship is a collaborative partnership, where the Data Scientist is free to make their own decisions and thus choose the way they do their work, and the onus then falls on the Data Platform team to convince Data Scientists to use their tools; the easiest way to do that is by designing the tools well.
In regard to scaling Data Science, the Data Platform team has helped establish some patterns and infrastructure that help alleviate contention. Contention on:
- Access to Data
- Access to Compute Resources:
  - Ad-hoc compute (think prototype, iterate, workspace)
  - Production compute (think where things are executed once they’re needed regularly)
For the talk (and this post) I only focused on how we reduced contention on Access to Data, & Access to Ad-hoc Compute to enable Data Science to scale at Stitch Fix. With that I invite you to take a look through the slides.
Enabling Data Scientists to easily create and own Kafka Consumers - Stefan Krawczyk
At Stitch Fix, we hire Full Stack Data Scientists (145+) and expect them to perform diverse functions: from conception to modeling to implementation to measurement. Since Kafka is the way we get event data, a Data Scientist will inevitably need to write a Kafka consumer to complete their implementation work, e.g. to transform some client data into features, perform a model prediction, or allocate someone to an A/B test. In this talk I’ll go over how we built an opinionated Kafka client that lets Data Scientists easily deploy and own production Kafka consumers by focusing on writing Python functions rather than fighting pitfalls with Kafka.
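As a rough illustration of the "write a Python function, let the client handle the rest" idea, here is a minimal sketch. It is not Stitch Fix's actual client: `run_consumer` and `to_features` are hypothetical names, the event fields are made up, and an in-memory list of bytes stands in for a Kafka topic so the example runs without a broker.

```python
import json
from typing import Any, Callable, Dict, Iterable, List

Message = Dict[str, Any]


def run_consumer(
    raw_messages: Iterable[bytes],
    handler: Callable[[Message], Any],
    on_error: Callable[[bytes, Exception], None] = lambda raw, exc: None,
) -> List[Any]:
    """Decode each raw message, pass it to `handler`, and collect the results.

    Messages that fail to decode, or that make `handler` raise, are handed to
    `on_error` and skipped, so one bad event does not stop the loop.
    (Hypothetical wrapper: a real client would also manage offsets, retries, etc.)
    """
    results = []
    for raw in raw_messages:
        try:
            results.append(handler(json.loads(raw)))
        except Exception as exc:  # deliberately broad: keep consuming
            on_error(raw, exc)
    return results


def to_features(event: Message) -> Message:
    """Example handler: turn a client event into a small feature dict."""
    return {"client_id": event["client_id"], "n_items": len(event["items"])}


if __name__ == "__main__":
    # An in-memory stand-in for a topic; the second message is malformed.
    stream = [
        b'{"client_id": 1, "items": ["shirt", "jeans"]}',
        b"not json",
        b'{"client_id": 2, "items": []}',
    ]
    failures: List[bytes] = []
    features = run_consumer(stream, to_features, lambda raw, exc: failures.append(raw))
    print(features)
    print(failures)
```

The point of the design is that the Data Scientist only owns `to_features`; decoding and error handling live in the shared wrapper.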
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018 - Codemotion
Monitoring an entire application is not a simple task, but with the right tools it is not a hard task either. However, events like Black Friday can push your application to the limit, and even cause crashes. As the system is stressed, it generates a lot more logs, which may crash the monitoring system as well. In this talk I will walk through the best practices when using the Elastic Stack to centralize and monitor your logs. I will also share some tricks to help you with the huge increase in traffic typical of Black Friday.
This presentation is an attempt to demystify the practice of building reliable data processing pipelines. We go through the necessary pieces needed to build a stable processing platform: data ingestion, processing engines, workflow management, schemas, and pipeline development processes. The presentation also includes component choice considerations and recommendations, as well as best practices and pitfalls to avoid, most learnt through expensive mistakes.
SASI, Cassandra on the full text search ride - DuyHai Doan - Codemotion Milan... - Codemotion
Apache Cassandra is a scalable database with high availability features, but these come with severe limitations in terms of querying capabilities. Since the introduction of SASI in Cassandra 3.4, those limitations belong to the past. Now you can create performant indices on your columns as well as benefit from full text search capabilities with the introduction of the new LIKE %term% syntax. To illustrate how SASI works, we'll use a database of 100 000 albums and artists.
As companies adopt data processing technologies and add data-driven features to user-facing products, the need for effective automated test techniques for data processing applications increases. We go through the anatomy of scalable data streaming applications, and how to set up test harnesses for reliable integration testing of such applications. We cover a few common anti-patterns that make asynchronous tests fragile, and corresponding patterns for remediation. We will also mention virtualisation components suitable for our testing scenarios.
RDBMSs have their data modeling methodology and diagrams. What about Cassandra? Let's discover the key principles of Cassandra data modeling with the Chebotko methodology. Have a look at KDM, a Chebotko modeling tool. And finally, let's talk about the time dimension in Cassandra.
This presentation was made for the Lyon Cassandra Users meetup (France).
The Internet of Things is currently a burgeoning market, and is often associated with specialized data stores. However, PostgreSQL is just as capable for this use case and can offer some compelling advantages. We’ll explore various ways to store and structure IoT data in PostgreSQL, and how range types and different types of indexes can be of use. We’ll also take a quick look at some extensions designed for this use case, then look at powerful SQL features which can really help when analyzing IoT data streams, and at how the power of a real SQL database can be a key advantage.
Outlines the vision and philosophy for Wakari.io with a basic overview of popular Python data analysis packages. Most of the talk is conducted in Wakari and is not visible on these slides. 90 minutes for PyData NYC, November 8th 2013.
Apache Spark vs rest of the world – Problems and Solutions by Arkadiusz Jachn... - Big Data Spain
Apache Spark is a great solution for building Big Data applications. It provides really fast SQL-like processing, a machine learning library, and a streaming module for near real time processing of data streams. Unfortunately, during application development and production deployments we often encounter many difficulties in mixing various data sources or bulk loading of computed data to SQL or NoSQL databases.
https://www.bigdataspain.org/2017/talk/apache-spark-vs-rest-of-the-world-problems-and-solutions
Big Data Spain 2017
16th - 17th November Kinépolis Madrid
A Developer’s View into Spark's Memory Model with Wenchen Fan - Databricks
As part of Project Tungsten, we started an ongoing effort to substantially improve the memory and CPU efficiency of Apache Spark’s backend execution and push performance closer to the limits of modern hardware. In this talk, we’ll take a deep dive into Apache Spark’s unified memory model and discuss how Spark exploits the memory hierarchy and leverages application semantics to manage memory explicitly (both on and off-heap) to eliminate the overheads of the JVM object model and garbage collection.
Dirty data? Clean it up! - Datapalooza Denver 2016 - Dan Lynn
Dan Lynn (AgilData) & Patrick Russell (Craftsy) present on how to do data science in the real world. We discuss data cleansing, ETL, pipelines, hosting, and share several tools used in the industry.
Test strategies for data processing pipelines, v2.0 - Lars Albertsson
This talk will present recommended patterns and corresponding anti-patterns for testing data processing pipelines. We will suggest technology and architecture to improve testability, both for batch and streaming processing pipelines. We will primarily focus on testing for the purpose of development productivity and product iteration speed, but briefly also cover data quality testing.
I give a talk through my Graph Database and Python learning journey at PyCon Australia 2015. It should be up on PyVideo soon enough.
Note: A great question was asked regarding why I didn't cover Postgres on the "what should I use" slide. Definitely consider Postgres, especially if you've got existing expertise in it. Rhys Elsemore's talk (Just Use Postgres) at the same conference is excellent.
This talk gives details about Spark internals and an explanation of the runtime behavior of a Spark application. It explains how high level user programs are compiled into physical execution plans in Spark. It then reviews common performance bottlenecks encountered by Spark users, along with tips for diagnosing performance problems in a production application.
Streaming machine learning is being integrated in Spark 2.1+, but you don’t need to wait. Holden Karau and Seth Hendrickson demonstrate how to do streaming machine learning using Spark’s new Structured Streaming and walk you through creating your own streaming model. By the end of this session, you’ll have a better understanding of Spark’s Structured Streaming API as well as how machine learning works in Spark.
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2l2Rr6L.
Doug Daniels discusses the cloud-based platform they have built at Datadog and how it differs from a traditional datacenter-based analytics stack. He walks through the decisions they have made at each layer, covers the pros and cons of these decisions and discusses the tooling they have built. Filmed at qconsf.com.
Doug Daniels is a Director of Engineering at Datadog, where he works on high-scale data systems for monitoring, data science, and analytics. Prior to joining Datadog, he was CTO at Mortar Data and an architect and developer at Wireless Generation, where he designed data systems to serve more than 4 million students in 49 states.
How to Develop and Operate Cloud First Data Platforms - Alluxio, Inc.
Alluxio Online Meetup
Feb 11, 2020
Speakers:
Du Li, Electronic Arts
Bin Fan, Alluxio
In cloud-based software stacks, there are varying degrees of automation across different layers: infrastructure, platform, and application. The mismatch in automation often breaks balance in devops, causing ops nightmares in platforms and applications. This talk will overview two projects at Electronic Arts (EA) that address the mismatch by data orchestration: One project automatically generates configurations for all components in a large monitoring system, which reduces the daily average number of alerts from ~1000 to ~20. The other project introduces Alluxio for caching and unifying address space across ETL and analytics workloads, which substantially simplifies architecture, improves performance, and reduces ops overheads.
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas - MongoDB
Moving to a new home is daunting. Packing up all your things, getting a vehicle to move it all, unpacking it, updating your mailing address, and making sure you did not leave anything behind. Well, the move to MongoDB Atlas is similar, but all the logistics are already figured out for you by MongoDB.
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ... - NETWAYS
At Uber we use high cardinality monitoring to observe and detect issues with our 4,000 microservices running on Mesos and across our infrastructure systems and servers. We’ll cover how we put the resulting 6 billion plus time series to work in a variety of different ways, auto-discovering services and their usage of other systems at Uber, setting up and tearing down alerts automatically for services, sending smart alert notifications that rollup different failures into individual high level contextual alerts, and more. We’ll also talk about how we accomplish all this with a global view of our systems with M3, our open source metrics platform. We’ll take a deep dive look at how we use M3DB, now available as an open source Prometheus long term storage backend, to horizontally scale our metrics platform in a cost efficient manner with a system that’s still sane to operate with petabytes of metrics data.
Data Day Seattle 2017: Scaling Data Science at Stitch Fix - Stefan Krawczyk
The Netflix Way to deal with Big Data Problems - Monal Daxini
Netflix is a data driven company with a unique culture. Come take a holistic tour of the Big Data ecosystem, and how Netflix culture catalyzes the development of systems. Then ogle at how we quickly evolved and scaled the event pipeline to 1 trillion events per day and over 1.4 PB of event data without service disruption, and with a small team.
Thinking DevOps in the Era of the Cloud - Demi Ben-Ari
The lines between Development and Operations people have gotten blurry, and lots of skills need to be held by both sides. In this talk we'll cover the considerations to take into account when creating development and production environments, mentioning Continuous Integration, Continuous Deployment and the buzzword "DevOps", and also talk about some real implementations in the industry. And of course, how can we leave out the real enabler of the whole deal, "The Cloud", which gives us a tool set that makes life much easier when implementing all of these practices.
Triangle Devops Meetup covering Netflix open source, cloud architecture, and what Andrew did in his first year working as a senior software engineer in the cloud platform group.
Details:
• DevOps and Business Intelligence?
• CI/CD Pipelines: What are they?
• Database Deployments: State based vs Migration based
• Snowflake features for CI/CD
• Azure DevOps: Build and Release Pipelines
• Putting it all together: End to End solution
• Demo
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017 - Demi Ben-Ari
Once you start working with distributed Big Data systems, you start discovering a whole bunch of problems you won’t find in monolithic systems.
All of a sudden, monitoring all of the components becomes a big data problem itself.
In the talk we’ll mention all of the aspects that you should take into consideration when monitoring a distributed system once you’re using tools like:
Web Services, Apache Spark, Cassandra, MongoDB, Amazon Web Services.
Beyond the tools, what should you monitor about the actual data that flows in the system?
And we’ll cover the simplest solution with your day-to-day open source tools; the surprising thing is that it comes not from an Ops guy.
Virtual training intro to InfluxDB - June 2021 - InfluxData
In this training webinar, we will walk you through the basics of InfluxDB – the purpose-built time series database. InfluxDB has everything you need from a time series platform in a single binary – a multi-tenanted time series database, UI and dashboarding tools, background processing and monitoring agent. This one-hour session will include the training and time for live Q&A.
What you will learn:
- Core concepts of time series databases
- An overview of the InfluxDB platform
- How to ingest and query data in InfluxDB
Mario Molina, Software Engineer
CDC (change data capture) systems are usually used to identify changes in data sources, then capture and replicate those changes to other systems. Companies use CDC to sync data across systems, to migrate to the cloud, or even to apply stream processing, among other things.
In this presentation we’ll see CDC patterns, how to use it in Apache Kafka, and do a live demo!
https://www.meetup.com/Mexico-Kafka/events/277309497/
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D... - Demi Ben-Ari
Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion... - Codemotion
Once you start working with Big Data systems, you discover a whole bunch of problems you won’t find in monolithic systems. Monitoring all of the components becomes a big data problem itself. In the talk we’ll mention all of the aspects that you should take into consideration when monitoring a distributed system using tools like Web Services, Spark, Cassandra, MongoDB and AWS. Beyond the tools, what should you monitor about the actual data that flows in the system? We’ll cover the simplest solution with your day-to-day open source tools; the surprising thing is that it comes not from an Ops guy.
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion... - Codemotion
Once you start working with Big Data systems, you discover a whole bunch of problems you won’t find in monolithic systems. Monitoring all of the components becomes a big data problem itself. In the talk, we’ll mention all of the aspects that you should take into consideration when monitoring a distributed system using tools like Web Services, Spark, Cassandra, MongoDB, AWS. Not only the tools, what should you monitor about the actual data that flows in the system? We’ll cover the simplest solution with your day to day open source tools, the surprising thing, that it comes not from an Ops Guy.
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...Demi Ben-Ari
Once you start working with distributed Big Data systems, you start discovering a whole bunch of problems you won’t find in monolithic systems.
All of a sudden to monitor all of the components becomes a big data problem itself.
In the talk we’ll mention all of the aspects that you should take in consideration when monitoring a distributed system once you’re using tools like:
Web Services, Apache Spark, Cassandra, MongoDB, Amazon Web Services.
Not only the tools, what should you monitor about the actual data that flows in the system?
And we’ll cover the simplest solution with your day to day open source tools, the surprising thing, that it comes not from an Ops Guy.
Data Science in the Cloud @StitchFix
1. Data Science
in the Cloud
Stefan Krawczyk
@stefkrawczyk
linkedin.com/in/skrawczyk
November 2016
2. InfoQ.com: News & Community Site
• Over 1,000,000 software developers, architects and CTOs read the site worldwide every month
• 250,000 senior developers subscribe to our weekly newsletter
• Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese)
• Post content from our QCon conferences
• 2 dedicated podcast channels: The InfoQ Podcast, with a focus on Architecture and The Engineering Culture Podcast, with a focus on building
• 96 deep dives on innovative topics packed as downloadable emags and minibooks
• Over 40 new content items per week
Watch the video with slide synchronization on InfoQ.com!
https://www.infoq.com/presentations/stitchfix-cloud
3. Purpose of QCon
- to empower software development by facilitating the spread of knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Presented at QCon San Francisco
www.qconsf.com
32. This is Usually Nothing to Worry About
● OS handles correct access
● DB has ACID properties
A
B
33. This is Usually Nothing to Worry About
● OS handles correct access
● DB has ACID properties
● But it’s easy to outgrow these options with big data or a big team.
A
B
34. ● Amazon’s Simple Storage Service
● Infinite* storage
● Can write, read, delete, BUT NOT append.
● Looks like a file system*:
○ URIs: my.bucket/path/to/files/file.txt
● Scales well
S3
* For all intents and purposes
35. ● Hadoop service, that stores:
○ Schema
○ Partition information, e.g. date
○ Data location for a partition
Hive Metastore
36. ● Hadoop service, that stores:
○ Schema
○ Partition information, e.g. date
○ Data location for a partition
Hive Metastore:
Hive Metastore
Partition Location
20161001 s3://bucket/sold_items/20161001
...
20161031 s3://bucket/sold_items/20161031
sold_items
41. But if we’re not careful
● S3 is eventually consistent
● These bugs are hard to track down
A
B
42. ● Use Hive Metastore to control partition source of truth
● Principles:
○ Never delete
○ Always write to a new place each time a partition changes
● Stitch Fix solution:
○ Use an inner directory → called Batch ID
Hive Metastore to the Rescue
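The "always write to a new place" principle can be sketched as a small path builder. This is a minimal illustration, not Stitch Fix's implementation: the function name `batch_location` is hypothetical, and treating the Batch ID as a write timestamp in YYYYMMDDHHMMSS form is an assumption based on the shape of the IDs shown in the slides.

```python
from datetime import datetime


def batch_location(table_root: str, partition: str, written_at: datetime) -> str:
    """Return a fresh prefix for one write of one partition.

    Assumption: the Batch ID is the write time as YYYYMMDDHHMMSS.
    Every write gets its own inner directory, so nothing is overwritten.
    """
    batch_id = written_at.strftime("%Y%m%d%H%M%S")
    return f"{table_root.rstrip('/')}/{partition}/{batch_id}/"


# Two writes of the same partition land in two different places.
first = batch_location("s3://bucket/sold_items", "20161031",
                       datetime(2016, 11, 1, 0, 22, 56))
second = batch_location("s3://bucket/sold_items", "20161031",
                        datetime(2016, 11, 2, 23, 42, 52))
```

Because each write targets a new prefix, a reader that resolved the old location keeps seeing a complete, unchanged set of files while the new batch is being written.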
44. Batch ID Pattern
Date Location
20161001 s3://bucket/sold_items/20161001/20161002002334/
... ...
20161031 s3://bucket/sold_items/20161031/20161101002256/
sold_items
45. ● Overwriting a partition is just a matter of updating the location
Batch ID Pattern
Date Location
20161001 s3://bucket/sold_items/20161001/20161002002334/
... ...
20161031 s3://bucket/sold_items/20161031/20161101002256/
s3://bucket/sold_items/20161031/20161102234252
sold_items
46. ● Overwriting a partition is just a matter of updating the location
● To the user this is a hidden inner directory
Batch ID Pattern
Date Location
20161001 s3://bucket/sold_items/20161001/20161002002334/
... ...
20161031 s3://bucket/sold_items/20161031/20161101002256/
s3://bucket/sold_items/20161031/20161102234252
sold_items
50. ● Full partition history
○ Can rollback
■ Data Scientists are less afraid of mistakes
○ Can create audit trails more easily
■ What data changed and when
○ Can anchor downstream consumers to a particular batch ID
Batch ID Pattern Benefits
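The benefits above follow from keeping every location a partition has ever pointed to. The sketch below uses an in-memory class as a stand-in for the partition-to-location mapping; it does not call the Hive Metastore, and the class and method names (`PartitionPointers`, `publish`, `rollback_to`) are hypothetical.

```python
class PartitionPointers:
    """In-memory stand-in for a metastore's partition -> location mapping."""

    def __init__(self):
        self._history = {}  # partition -> list of locations, oldest first

    def publish(self, partition, location):
        """Point the partition at a newly written location; old data stays put."""
        self._history.setdefault(partition, []).append(location)

    def current(self, partition):
        """Location that readers of this partition should use right now."""
        return self._history[partition][-1]

    def history(self, partition):
        """Audit trail: every location the partition has pointed to, in order."""
        return list(self._history[partition])

    def rollback_to(self, partition, location):
        """Re-point the partition at an earlier batch; nothing is deleted."""
        if location not in self._history.get(partition, []):
            raise ValueError("unknown batch for this partition")
        self._history[partition].append(location)


pointers = PartitionPointers()
old = "s3://bucket/sold_items/20161031/20161101002256/"
new = "s3://bucket/sold_items/20161031/20161102234252/"
pointers.publish("20161031", old)
pointers.publish("20161031", new)
pointers.rollback_to("20161031", old)  # a bad batch is undone by moving the pointer
```

A downstream consumer that needs reproducible input can store one of the returned locations and read from it directly instead of asking for the current one.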
52. Ad hoc Infra: In the Beginning...
53. Ad hoc Infra: Evolution I
54. Ad hoc Infra: Evolution II
[Slides 52-54, table residue: columns Workstation, Env. Mgmt., Scalability; rating rows Low/Low, Medium/Medium, High/High]
55. Ad hoc Infra: Evolution III
[Table residue: same columns; rating rows Low/Low, Medium/Medium, Low/High]
56. ● Control of environment
○ Data Scientists don’t need to worry about env.
● Isolation
○ Can host many Docker containers on a single machine.
● Better host management
○ Allows central control of machine types.
Why Does Docker Lower Overhead?
58. ● Has:
○ Our internal API libraries
○ Jupyter Notebook:
■ PySpark
■ IPython
○ Python libs:
■ scikit, numpy, scipy, pandas, etc.
○ RStudio
○ R libs:
■ dplyr, magrittr, ggplot2, lme4, boot, etc.
● Mounts User NFS
● User has terminal access to file system via Jupyter for git, pip, etc.
Our Docker Image
62. ● Docker tightly integrates with the Linux Kernel.
○ Hypothesis:
■ Anything that makes uninterruptible calls to the kernel can:
● Break the ECS agent because the container doesn’t respond.
● Break isolation between containers.
■ E.g. Mounting NFS
● Docker Hub:
○ Switched to Artifactory
Our Docker Problems So Far
72. ● Naïve scheme of JSON + Zlib works well:
● Double vs Float: do you really need to store that much precision?
● For more inspiration look to columnar DBs and how they compress columns
Observations
import json
import zlib
...
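# compress: zlib operates on bytes, so encode the JSON text first (Python 3)
compressed = zlib.compress(json.dumps(value).encode("utf-8"))
# decompress, then decode back to text before parsing
original = json.loads(zlib.decompress(compressed).decode("utf-8"))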
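To make the first two observations concrete, here is a small self-contained sketch. The payload is made up for illustration, and packing with `struct` is just one assumed way to store 32-bit floats instead of 64-bit doubles; the talk does not say how the values were actually stored.

```python
import json
import struct
import zlib

# Hypothetical payload: a feature vector stored as a single value.
value = {"user_id": 42, "features": [0.125 * i for i in range(1000)]}

# JSON + zlib round trip; zlib works on bytes, so the JSON text is encoded first.
raw = json.dumps(value).encode("utf-8")
compressed = zlib.compress(raw)
restored = json.loads(zlib.decompress(compressed).decode("utf-8"))

# Double vs float: 32-bit floats take half the space of 64-bit doubles,
# at the cost of precision for values that are not exactly representable.
count = len(value["features"])
as_doubles = struct.pack(f"<{count}d", *value["features"])
as_floats = struct.pack(f"<{count}f", *value["features"])

sizes = {
    "json": len(raw),
    "json+zlib": len(compressed),
    "doubles": len(as_doubles),
    "floats": len(as_floats),
}
```

Comparing the entries in `sizes` on real data is a quick way to decide whether the extra precision is worth the storage.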
73. To Batch or Not To Batch:
When is batch inefficient?
74. ● Online:
○ Computation occurs synchronously when needed.
● Streamed:
○ Computation is triggered by one or more events.
Online & Streamed Computation
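The difference between the two modes is mostly about what triggers the work. A toy sketch, where the feature store, the scoring function and the event shape are all hypothetical:

```python
from collections import defaultdict

# Hypothetical in-memory state, for illustration only.
purchase_counts = defaultdict(int)
cached_scores = {}


def score(user_id):
    # Stand-in for a model: any function of the user's current features.
    return purchase_counts[user_id] * 10


def handle_request(user_id):
    """Online: compute synchronously, at the moment the answer is needed."""
    return score(user_id)


def handle_event(event):
    """Streamed: an incoming event triggers recomputation for the affected user only."""
    user_id = event["user_id"]
    purchase_counts[user_id] += 1
    cached_scores[user_id] = score(user_id)


for event in [{"user_id": "a"}, {"user_id": "a"}, {"user_id": "b"}]:
    handle_event(event)

online_result = handle_request("a")
streamed_result = cached_scores["a"]
```

In both cases only the users that were touched get recomputed, which is the contrast with a nightly batch job that recomputes everything.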
79. Online & Streamed Computation
● Do you need to recompute:
○ features for all users?
○ predicted results for all users?
● Are you heavily dependent on your ETL running every night?
● Online vs Streamed depends on in-house factors:
○ Number of models
○ How often they change
○ Cadence of output required
○ In-house eng. expertise
○ etc.
Very likely you start with a batch system
We use an online system for recommendations
86. ● Dedicated infrastructure → More room on batch infrastructure
○ Hopefully $$$ savings
○ Hopefully less stressed Data Scientists
● Requires better software engineering practices
○ Code portability/reuse
○ Designing APIs/Tools Data Scientists will use
● Prototyping on AWS Lambda & Kinesis was surprisingly quick
○ Need to compile C libs on an Amazon Linux instance
Online/Streaming Thoughts
93. Ever:
● Had someone leave and then nobody understands how they trained their models?
○ Or you didn’t remember yourself?
● Had a performance dip in models and had trouble figuring out why?
○ Or not known what’s changed between model deployments?
● Wanted to compare model performance over time?
● Wanted to train a model in R/Python/Spark and then deploy it to a web server?
102. ● Isn’t that just saving the coefficients/model values?
○ NO!
● Why?
Produce Model Artifacts
How do you deal with organizational drift?
Helps a lot with model debugging & diagnosis!
Makes it easy to keep an archive and track changes over time
Can more easily use in downstream processes
103. ● Analogous to software libraries
● Packaging:
○ Zip/Jar file
Produce Model Artifacts
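One way to read "analogous to software libraries" is that the artifact bundles the model values together with the context needed to understand them later. A minimal sketch, assuming a zip file that holds two JSON documents; the file names, metadata fields and values are hypothetical and not a format described in the talk:

```python
import io
import json
import zipfile


def package_model(coefficients, metadata):
    """Bundle model values and their context into one zip artifact (as bytes)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("model.json", json.dumps(coefficients, sort_keys=True))
        archive.writestr("metadata.json", json.dumps(metadata, sort_keys=True))
    return buffer.getvalue()


def load_model(artifact):
    """Read the model values and metadata back out of the artifact."""
    with zipfile.ZipFile(io.BytesIO(artifact)) as archive:
        coefficients = json.loads(archive.read("model.json"))
        metadata = json.loads(archive.read("metadata.json"))
    return coefficients, metadata


artifact = package_model(
    coefficients={"intercept": 0.5, "weights": [1.0, -2.0]},  # made-up values
    metadata={
        "trained_on": "s3://bucket/sold_items/20161031/20161102234252",
        "code_version": "hypothetical-git-sha",
        "feature_names": ["f1", "f2"],
    },
)
coefficients, metadata = load_model(artifact)
```

Keeping the training data location and code version next to the coefficients is what makes the archive useful for debugging and for comparing deployments over time.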