Title
Real-time, Advanced Analytics and Recommendations using Machine Learning, Graph Processing, Natural Language Processing, and Approximations with Apache Spark, Stanford CoreNLP, and Twitter Algebird
BONUS: Netflix Recommendations: Then and Now
Agenda
Intro
Live, Interactive Recommendations Demo
Spark ML, GraphX, Streaming, Kafka, Cassandra, Docker
Types of Similarity
Euclidean vs. Non-Euclidean Similarity
User-to-User Similarity
Content-based, Item-to-Item Similarity (Amazon)
Collaborative-based, User-to-Item Similarity (Netflix)
Graph-based, Item-to-Item Similarity Pathway (Spotify)
Similarity Approximations at Scale
Twitter Algebird
MinHash and Bucketing
Locality Sensitive Hashing (LSH)
BONUS: Netflix Recommendations: From Ratings to Real-Time
DVD-Ratings-based $1M Netflix Prize (2009)
Streaming-based "Trending Now" (2016)
Wrap Up
Q & A
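The "MinHash and Bucketing" agenda item can be illustrated with a minimal sketch. It is not the talk's demo code and does not use Twitter Algebird; the salted MD5 hash family, the function names, and the sample viewing histories are all assumptions made for illustration. It compares the exact Jaccard similarity of two non-empty sets with the MinHash estimate, which is the fraction of signature positions on which the two sets agree.

```python
import hashlib


def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity: |A intersect B| / |A union B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(seed: int, item: str) -> int:
    # One hash function per seed, derived from a salted MD5 digest
    # (an illustrative choice, not what any particular library does).
    digest = hashlib.md5(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items: set, num_hashes: int = 256) -> list:
    """For each hash function, keep the minimum hash value over the set.

    Assumes `items` is non-empty.
    """
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of signature positions where the two sets agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


# Hypothetical viewing histories for two users.
alice = {"Inception", "Interstellar", "Memento", "Dunkirk", "Tenet", "Heat"}
bob = {"Inception", "Interstellar", "Memento", "Heat", "Alien", "Arrival"}

exact = jaccard(alice, bob)  # 4 shared titles out of 8 distinct titles
approx = estimate_jaccard(minhash_signature(alice), minhash_signature(bob))
print(f"exact={exact:.2f} approx={approx:.2f}")
```

The signature has a fixed length regardless of set size, which is what makes the comparison cheap at scale; more hash functions tighten the estimate at the cost of a longer signature.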
Bio
Chris Fregly is a Principal Data Solutions Engineer for the newly-formed IBM Spark Technology Center, an Apache Spark Contributor, and a Netflix Open Source Committer.
Chris is also the founder of the global Advanced Apache Spark Meetup and author of the upcoming book, Advanced Spark @ advancedspark.com.
Previously, Chris was a Data Solutions Engineer at Databricks and a Streaming Data Engineer at Netflix.
Related Links
https://github.com/fluxcapacitor/pipeline/wiki
http://cdn.oreillystatic.com/en/assets/1/event/105/Algebra%20for%20Scalable%20Analytics%20Presentation.pdf
http://static.echonest.com/BoilTheFrog/
http://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf
http://blog.echen.me/2011/10/24/winning-the-netflix-prize-a-summary/
http://www.cc.gatech.edu/~zha/CSE8801/CF/kdd-fp074-koren.pdf
5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep... - Athens Big Data
Title: Real-Time Training and Deploying Spark ML Recommendations With Kafka and NetflixOSS
Speaker: Chris Fregly (https://linkedin.com/in/cfregly/)
Date: Monday, October 17, 2016
Event: https://meetup.com/Athens-Big-Data/events/234546355/
Feature Talk: Real-time Aggregations, Approximations, Similarities, and Recommendations at Scale using Spark Streaming, ML, GraphX, Kafka, Cassandra, Docker, CoreNLP, Word2Vec, LDA, and Twitter Algebird
Talk Abstract: Starting with a live, interactive demo generating audience-specific recommendations, we'll dive deep into each of the key components, including NiFi, Kafka, Stanford CoreNLP, Docker, Word2Vec, LDA, Twitter Algebird, and Spark Streaming, SQL, ML, and GraphX. As a bonus, we'll discuss the latest Netflix Recommendations Pipeline and related open source projects.
Talk Agenda:
• Intro
• Live, Interactive Recommendations Demo
• Spark Streaming, ML, GraphX, Kafka, Cassandra, Docker, CoreNLP, Word2Vec, LDA, and Twitter Algebird (advancedspark.com)
• Types of Similarity
• Euclidean vs. Non-Euclidean Similarity
• Jaccard Similarity
• Cosine Similarity
• LogLikelihood Similarity
• Edit Distance
• Text-based Similarities and Analytics
• Word2Vec
• LDA Topic Extraction
• TextRank
• Similarity-based Recommendations
• User-to-User
• Content-based, Item-to-Item (Amazon)
• Collaborative-based, User-to-Item (Netflix)
• Graph-based, Item-to-Item "Pathways" (Spotify)
• Aggregations, Approximations, and Similarities at Scale
• Twitter Algebird
• MinHash and Bucketing
• Locality Sensitive Hashing (LSH)
• BloomFilters
• CountMin Sketch
• HyperLogLog
• Q & A
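The "Euclidean vs. Non-Euclidean Similarity" items in the agenda above can be made concrete with a small sketch. The function names and the sample rating vectors are assumptions chosen for illustration, not material from the talk. The example shows two users with the same tastes but different rating magnitudes: their Euclidean distance is non-zero, while their cosine similarity is approximately 1.0. A Levenshtein edit distance is included as a second non-Euclidean measure.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; ignores magnitude."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def edit_distance(s, t):
    """Levenshtein distance, computed with one rolling row of the DP table."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# Hypothetical rating vectors: same tastes, twice the magnitude.
light_rater = [1.0, 2.0, 0.0, 1.0]
heavy_rater = [2.0, 4.0, 0.0, 2.0]

print(euclidean_distance(light_rater, heavy_rater))  # non-zero (sqrt of 6)
print(cosine_similarity(light_rater, heavy_rater))   # approximately 1.0
print(edit_distance("kitten", "sitting"))            # 3
```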
Speaker Bio: Chris Fregly is a Research Engineer @ Flux Capacitor AI in SF, an Apache Spark Contributor, and a Netflix Open Source Committer.
Chris is also the founder of the global Advanced Apache Spark Meetup and author of the upcoming book, Advanced Spark @ advancedspark.com.
Previously, Chris was a Data Solutions Engineer at Databricks and a Streaming Data Engineer at Netflix.
Helsinki Spark Meetup Nov 20, 2015: Spark After Dark 1.5: Real-time, Advanced Analytics with Spark 1.5, Kafka, Cassandra, ElasticSearch, Zeppelin, and Docker
Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015 - Chris Fregly
* Title *
Spark After Dark 1.5: Real-time, Advanced Analytics with Spark 1.5, Kafka, Cassandra, ElasticSearch, Zeppelin, and Docker
* Abstract *
Combining the most popular and technically-deep material from his wildly popular Advanced Apache Spark Meetup, Chris Fregly will provide code-level deep dives into the latest performance and scalability advancements within the Apache Spark Ecosystem by exploring the following:
1) Building a Scalable and Performant Spark SQL/DataFrames Data Source Connector such as Spark-CSV, Spark-Cassandra, Spark-ElasticSearch, and Spark-Redshift
2) Speeding Up Spark SQL Queries using Partition Pruning and Predicate Pushdowns with CSV, JSON, Parquet, Avro, and ORC
3) Tuning Spark Streaming Performance and Fault Tolerance with KafkaRDD and KinesisRDD
4) Maintaining Stability during High Scale Streaming Ingestion using Approximations and Probabilistic Data Structures from Spark, Redis, and Twitter's Algebird
5) Building Effective Machine Learning Models using Feature Engineering, Dimension Reduction, and Natural Language Processing with MLlib/GraphX, ML Pipelines, DIMSUM, Locality Sensitive Hashing, and Stanford's CoreNLP
6) Tuning Core Spark Performance by Acknowledging Mechanical Sympathy for the Physical Limitations of OS and Hardware Resources such as CPU, Memory, Network, and Disk with Project Tungsten, Asynchronous Netty, and Linux epoll
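Item 4 in the list above refers to approximations and probabilistic data structures for high-scale streaming ingestion. One such structure is the Count-Min Sketch, shown here as a minimal sketch in plain Python. It is not the Spark, Redis, or Algebird implementation; the class name, the salted SHA-1 hashing, and the default width and depth are assumptions for illustration. The sketch keeps a fixed-size table of counters, so memory stays constant as the stream grows, and its estimates can overcount but never undercount.

```python
import hashlib


class CountMinSketch:
    """Fixed-memory frequency estimator (illustrative, not a library API)."""

    def __init__(self, width: int = 272, depth: int = 5):
        self.width = width    # counters per row
        self.depth = depth    # number of rows, one hash function per row
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row: int, item: str) -> int:
        # Salt the hash with the row number to get one hash function per row.
        digest = hashlib.sha1(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item: str) -> int:
        # Collisions only inflate counters, so the minimum across rows
        # is the tightest available upper bound on the true count.
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.depth))


# Hypothetical event stream.
events = ["spark"] * 5 + ["kafka"] * 3 + ["cassandra"]
cms = CountMinSketch()
for event in events:
    cms.add(event)

print(cms.estimate("spark"), cms.estimate("kafka"), cms.estimate("cassandra"))
```

A wider table reduces the size of the overcount, and more rows reduce the chance that every row collides for the same item.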
* Demos *
This talk features many interesting and audience-interactive demos - as well as code-level deep dives into many of the projects listed above.
All demo code is available on Github at the following link: https://github.com/fluxcapacitor/pipeline/wiki
In addition, the entire demo environment has been Dockerized and made available for download on Docker Hub at the following link: https://hub.docker.com/r/fluxcapacitor/pipeline/
* Speaker Bio *
Chris Fregly is a Principal Data Solutions Engineer for the newly-formed IBM Spark Technology Center, an Apache Spark Contributor, a Netflix Open Source Committer, as well as the Organizer of the global Advanced Apache Spark Meetup and Author of the Upcoming Book, Advanced Spark.
Previously, Chris was a Data Solutions Engineer at Databricks and a Streaming Data Engineer at Netflix.
When Chris isn’t contributing to Spark and other open source projects, he’s creating book chapters, slides, and demos to share knowledge with his peers at meetups and conferences throughout the world.
Brussels Spark Meetup Oct 30, 2015: Spark After Dark 1.5: Real-time, Advanc... - Chris Fregly
Scotland Data Science Meetup Oct 13, 2015: Spark SQL, DataFrames, Catalyst, ... - Chris Fregly
This talk highlights the Data Sources API, which participates in the Spark SQL DataFrame Catalyst Optimizer. We dive deep into the advanced open source Cassandra implementation at github.com/datastax/spark-cassandra-connector. We discuss data locality and cluster deployment, as well as the pros and cons of mixing OLAP and OLTP workloads.
We also implement a SimpleDataSource, which is a basic implementation of the Data Sources API.
All analysis is done with Apache Zeppelin.
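The idea behind a data source that cooperates with the optimizer can be sketched in a few lines. The following is a toy in plain Python, not the SimpleDataSource from the talk and not Spark's actual interface. It is loosely modeled on the shape of Spark's Scala `PrunedFilteredScan.buildScan(requiredColumns, filters)`, and the class, the tuple-based filter format, and the sample rows are assumptions for illustration. The source receives the columns and predicates the query needs and applies them while scanning, so unneeded rows and columns never leave the source.

```python
from typing import Any, Dict, Iterator, List, Tuple

Row = Dict[str, Any]
# Hypothetical filter format: (column, operator, value).
# Only "=" and ">" are supported in this toy.
Filter = Tuple[str, str, Any]


class SimpleDataSource:
    """In-memory source that applies pushed-down filters and column pruning."""

    def __init__(self, rows: List[Row]):
        self.rows = rows

    def build_scan(self, required_columns: List[str],
                   filters: List[Filter]) -> Iterator[Row]:
        for row in self.rows:
            # Predicate pushdown: drop non-matching rows at the source.
            if all(self._matches(row, f) for f in filters):
                # Column pruning: return only the requested columns.
                yield {c: row[c] for c in required_columns}

    @staticmethod
    def _matches(row: Row, f: Filter) -> bool:
        column, op, value = f
        if op == "=":
            return row[column] == value
        if op == ">":
            return row[column] > value
        raise ValueError(f"unsupported operator: {op}")


# Hypothetical sample data.
people = [
    {"name": "Ann", "age": 41, "city": "Athens"},
    {"name": "Bo", "age": 25, "city": "Zurich"},
    {"name": "Cy", "age": 35, "city": "Paris"},
]
source = SimpleDataSource(people)
result = list(source.build_scan(["name"], [("age", ">", 30)]))
print(result)  # [{'name': 'Ann'}, {'name': 'Cy'}]
```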
Explore IoT in Big Data while brewing beer. All verticals are instrumenting devices to learn more about their process to help cut costs or improve efficiency.
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An... - Chris Fregly
https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/227622666/
Title: Spark on Kubernetes
Abstract: Engineers across several organizations are working on support for Kubernetes as a cluster scheduler backend within Spark. While designing this, we have encountered several challenges in translating Spark to use idiomatic Kubernetes constructs natively. This talk is about our high level design decisions and the current state of our work.
Speaker:
Anirudh Ramanathan is a software engineer on the Kubernetes team at Google. His focus is on running stateful and batch workloads. Previously, he worked on GGC (Google Global Cache) and prior to that, on the infrastructure team at NVIDIA.
IBM Runtimes Performance Observations with Apache Spark - AdamRobertsIBM
In this talk, presented at the Spark London meetup on 23 November 2016, I detail our findings on Apache Spark from IBM's Runtime Technologies department. I share best practices we observed while profiling Spark on a variety of workloads, and I help Spark users profile their own applications. I also touch on how anybody can develop with fast networking capabilities (RDMA) and achieve substantial performance speedups using GPUs.
This is the deck that I used for the talks that I gave in the Silicon Valley / San Francisco bay area at various events in April and May 2016.
1. Introduces Big Data and related challenges.
2. Briefly covers some of the important open-source big data related technologies.
3. Introduces Hadoop
4. Introduces Spark Core, Spark SQL, MLlib and GraphX
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms - DataStax Academy
Apache Spark has grown to be one of the largest open source communities in big data, with over 190 developers and dozens of companies contributing. The latest 1.0 release alone includes contributions from 117 people. A clean API, interactive shell, distributed in-memory computation, stream processing, interactive SQL, and libraries delivering everything from machine learning to graph processing make it an excellent unified platform to solve a number of problems. Apache Spark works very well with a growing number of big data solutions, including Cassandra and Hadoop. Come learn about Apache Spark and see how easy it is for you to get started using Spark to build your own high performance big data applications today.
A fresh look at Spark 2 features and what happens under the hood. We examine the latest Apache Spark release with a critical eye: still loving it, but also criticising it.
This presentation focuses on Apache Spark’s MLlib library for distributed ML, focusing on how we simplified elements of production-grade ML by building MLlib on top of Spark’s distributed DataFrame API.
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap... - Chris Fregly
* Title *
Spark After Dark 1.5: Deep Dive Into Latest Perf and Scale Improvements in Spark Ecosystem
Best Practices for Building and Deploying Data Pipelines in Apache Spark - Databricks
Many data pipelines share common characteristics and are often built in similar but bespoke ways, even within a single organisation. In this talk, we will outline the key considerations which need to be applied when building data pipelines, such as performance, idempotency, reproducibility, and tackling the small file problem. We’ll work towards describing a common Data Engineering toolkit which separates these concerns from business logic code, allowing non-Data-Engineers (e.g. Business Analysts and Data Scientists) to define data pipelines without worrying about the nitty-gritty production considerations.
We’ll then introduce an implementation of such a toolkit in the form of Waimak, our open-source library for Apache Spark (https://github.com/CoxAutomotiveDataSolutions/waimak), which has massively shortened our route from prototype to production. Finally, we’ll define new approaches and best practices about what we believe is the most overlooked aspect of Data Engineering: deploying data pipelines.
Rental Cars and Industrialized Learning to Rank with Sean Downes - Databricks
Data can be viewed as the exhaust of online activity. With the rise of cloud-based data platforms, barriers to data storage and transfer have crumbled. The demand for creative applications and learning from those datasets has accelerated. Rapid acceleration can quickly accrue disorder, and disorderly data design can turn the deepest data lake into an impenetrable swamp.
In this talk, I will discuss the evolution of the data science workflow at Expedia with a special emphasis on Learning to Rank problems. From the heroic early days of ad-hoc Spark exploration to our first production sort model on the cloud, we will explore the process of industrializing the workflow. Layered over our story, I will share some best practices and suggestions on how to keep your data productive, or even pull your organization out of the data swamp.
Intro to PySpark: Python Data Analysis at scale in the Cloud - Daniel Zivkovic
Why would you care? Because PySpark is a cloud-agnostic analytics tool for Big Data processing, "hidden" in:
* AWS Glue - Managed ETL Service
* Amazon EMR - Big Data Platform
* Google Cloud Dataproc - Cloud-native Spark and Hadoop
* Azure HDInsight - Microsoft implementation of Apache Spark in the cloud
In this #ServerlessTO talk, Jonathan Rioux - Head of Data Science at EPAM Canada & author of PySpark in Action book (https://www.manning.com/books/pyspark-in-action), will get you acquainted with PySpark - Python API for Spark.
Event details: https://www.meetup.com/Serverless-Toronto/events/269124392/
Event recording: https://youtu.be/QGxytMbrjGY
Like always, BIG thanks to our knowledge sponsor Manning Publications – who generously offered to raffle not 1 but 3 of Jonathan's books!
RSVP for more exciting (online) events at https://www.meetup.com/Serverless-Toronto/events/
Tactical Data Science Tips: Python and Spark Together - Databricks
Running Spark and Python data science workloads can be challenging given the complexity of the various data science tools in the ecosystem, such as scikit-learn, TensorFlow, Spark, Pandas, and MLlib. These tools and architectures present important trade-offs to consider when moving from proofs of concept to production. While a proof of concept may be relatively straightforward, moving to production can be challenging because it is difficult to understand not just the short-term effort to develop a solution, but also the long-term cost of supporting it.
This talk will discuss important tactical patterns for evaluating projects, running proofs of concept to inform going to production, and finally the key tactics we use internally at Databricks to take data and machine learning projects into production. This session will cover some architectural choices involving Spark, PySpark, Pandas, notebooks, various machine learning toolkits, as well as frameworks and technologies necessary to support them.
This webinar focuses on the particular use case of graph databases in Network & IT-Management. This webinar is designed for people who work with Network Management at telecom companies or professionals within industries that handle and rely on complex networks.
We’ll start with an overview of Neo4j and graph thinking within networks, explaining how networks are naturally modelled as graphs. We’ll explain how graph databases help mitigate some of the major challenges that Network and Security Managers face on a daily basis, including intrusions and other cyber crimes, performance optimization, outage simulations, fraud prevention, and more.
Similar to Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC Meetup Feb 22 2016 (15)
Pandas on AWS - Let me count the ways.pdf - Chris Fregly
Chris Fregly (Principal Solution Architect, AI and machine learning at AWS) will give a brief presentation on the various ways to run scalable Pandas, Modin, and Ray workloads on AWS. He will then answer questions from the audience and the moderator, Alejandro Herrera of Ponder.
Chris Fregly is a Principal Solution Architect for AI and Machine Learning at Amazon Web Services (AWS) based in San Francisco, California. He is the organizer of the Global Data Science on AWS meetup. He is co-author of the O'Reilly Book, "Data Science on AWS."
Related Links
O'Reilly Book: https://www.amazon.com/dp/1492079391/
Website: https://datascienceonaws.com
Meetup: https://meetup.datascienceonaws.com
GitHub Repo: https://github.com/data-science-on-aws/
YouTube: https://youtube.datascienceonaws.com
Slideshare: https://slideshare.datascienceonaws.com
Ray AI Runtime (AIR) on AWS - Data Science On AWS MeetupChris Fregly
RSVP Webinar: https://www.eventbrite.com/e/webinarkubeflow-tensorflow-tfx-pytorch-gpu-spark-ml-amazonsagemaker-tickets-45852865154
Talk #0: Introductions and Meetup Announcements By Chris Fregly and Antje Barth
Talk #1: Ray Overview, Ray AI Runtime on AWS using Amazon SageMaker, EC2, EMR, EKS by Chris Fregly, Principal Specialist Solution Architect, AI and Machine Learning @ AWS
Talk #2: Deep-dive Blueprints for Amazon Elastic Kubernetes Service (EKS) including Ray and Spark by Apoorva Kulkarni, Sr. Specialist Solution Architect, Containers and Kubernetes @ AWS
RSVP Webinar: https://www.eventbrite.com/e/webinarkubeflow-tensorflow-tfx-pytorch-gpu-spark-ml-amazonsagemaker-tickets-45852865154
Zoom link: https://us02web.zoom.us/j/82308186562
Related Links
O'Reilly Book: https://www.amazon.com/dp/1492079391/
Website: https://datascienceonaws.com
Meetup: https://meetup.datascienceonaws.com
GitHub Repo: https://github.com/data-science-on-aws/
YouTube: https://youtube.datascienceonaws.com
Slideshare: https://slideshare.datascienceonaws.com
Amazon reInvent 2020 Recap: AI and Machine LearningChris Fregly
Amazon reInvent 2020 Recap: AI and Machine Learning
Video here: https://youtu.be/YSXe02Y5pHM
NEW RELEASE! Build, Automate, Manage, and Scale ML Workflows with the NEW Amazon SageMaker Pipelines by Hallie Crosby Weishahn.
Description of Talk and Demo
AWS recently announced Amazon SageMaker Pipelines (https://aws.amazon.com/sagemaker/pipelines/), the first purpose-built, easy-to-use Continuous Integration and Continuous Delivery (CI/CD) service for machine learning.
SageMaker Pipelines has three main components which improve the operational resilience and reproducibility of your workflows: 1) pipelines, 2) model registry, and 3) projects.
In this talk and demo, Hallie will walk us through the new Amazon SageMaker Pipelines feature including MLOps support.
Date/Time
9-10am US Pacific Time (Third Monday of Every Month)
RSVP: https://www.eventbrite.com/e/1-hr-free-workshop-pipelineai-gpu-tpu-spark-ml-tensorflow-ai-kubernetes-kafka-scikit-tickets-45852865154
Meetup:
https://www.meetup.com/Data-Science-on-AWS/
Zoom:
https://zoom.us/j/690414331
Webinar ID: 690 414 331
Phone:
+1 646 558 8656 (US Toll) or +1 408 638 0968 (US Toll)
Related Links
Meetup: https://meetup.datascienceonaws.com
GitHub Repo: https://github.com/data-science-on-aws/
O'Reilly Book: https://datascienceonaws.com
YouTube: https://youtube.datascienceonaws.com
Slideshare: https://slideshare.datascienceonaws.com
Support: https://support.pipeline.ai
Monthly Workshop: https://www.eventbrite.com/e/full-day-workshop-kubeflow-gpu-kerastensorflow-20-tf-extended-tfx-kubernetes-pytorch-xgboost-tickets-63362929227
RSVP: https://www.eventbrite.com/e/1-hr-free-workshop-pipelineai-gpu-tpu-spark-ml-tensorflow-ai-kubernetes-kafka-scikit-tickets-45852865154
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod...Chris Fregly
Waking the Data Scientist at 2am:
Detect Model Degradation on Production Models with Amazon SageMaker Endpoints & Model Monitor
In this talk, I describe how to deploy a model into production and monitor its performance using SageMaker Model Monitor. With Model Monitor, I can detect if a model's predictive performance has degraded - and alert an on-call data scientist to take action and improve the model at 2am while the DevOps folks sleep soundly through the night.
Topics: AI and Machine Learning, Model Deployment, Anomaly Detection, Amazon SageMaker Endpoints, and Model Monitor
Quantum Computing with Amazon Braket
In this talk, I describe some fundamental principles of quantum computing including qu-bits, superposition, and entanglement. I will demonstrate how to perform secure quantum computing tasks across many Quantum Processing Units (QPUs) using Amazon Braket, IAM, and S3.
AI and Machine Learning, Quantum Computing, Amazon Braket, QPU
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-PersonChris Fregly
In this talk, we present tips and best practices for scaling a large workshop for 1,000's of simultaneous attendees - both online and in-person. While our workshop is focused on AI and machine learning on AWS, we generalize our learnings for any domain or specialization.
Video: https://youtu.be/T0L0JxDaPkc
RSVP Here: https://www.eventbrite.com/e/full-day-workshop-kubeflow-kerastensorflow-20-tf-extended-tfx-kubernetes-pytorch-xgboost-airflow-tickets-63362929227
Description
In this workshop, we build real-world machine learning pipelines using TensorFlow Extended (TFX), KubeFlow, Airflow, and MLflow.
Described in the 2017 paper, TFX is used internally by thousands of Google data scientists and engineers across every major product line within Google.
KubeFlow is a modern, end-to-end pipeline orchestration framework that embraces the latest AI best practices including hyper-parameter tuning, distributed model training, and model tracking.
Airflow is the most-widely used pipeline orchestration framework in machine learning and data engineering.
MLflow is a lightweight experiment-tracking system recently open-sourced by Databricks, the creators of Apache Spark. MLflow supports Python, Java/Scala, and R - and offers native support for TensorFlow, Keras, and Scikit-Learn.
Pre-requisites
Modern browser - and that's it!
Every attendee will receive a cloud instance
Nothing will be installed on your local laptop
Everything can be downloaded at the end of the workshop
Location
Online Workshop
The link will be sent a few hours before the start of the workshop.
Only registered users will receive the link.
If you do not receive the link a few hours before the start of the workshop, please send your Eventbrite registration confirmation to support@pipeline.ai for help.
Agenda
1. Create a Kubernetes cluster
2. Install KubeFlow, Airflow, TFX, and Jupyter
3. Setup ML Training Pipelines with KubeFlow and Airflow
4. Transform Data with TFX Transform
5. Validate Training Data with TFX Data Validation
6. Train Models with Jupyter, Keras/TensorFlow 2.0, PyTorch, XGBoost, and KubeFlow
7. Run a Notebook Directly on Kubernetes Cluster with KubeFlow
8. Analyze Models using TFX Model Analysis and Jupyter
9. Perform Hyper-Parameter Tuning with KubeFlow
10. Select the Best Model using KubeFlow Experiment Tracking
11. Run Multiple Experiments with MLflow Experiment Tracking
12. Reproduce Model Training with TFX Metadata Store
13. Deploy the Model to Production with TensorFlow Serving and Istio
14. Save and Download your Workspace
Key Takeaways
Attendees will gain experience training, analyzing, and serving real-world Keras/TensorFlow 2.0 models in production using model frameworks and open-source tools.
RSVP Here: https://www.eventbrite.com/e/full-day-workshop-kubeflow-kerastensorflow-20-tf-extended-tfx-kubernetes-pytorch-xgboost-airflow-tickets-63362929227
https://youtu.be/T0L0JxDaPkc
Title
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTorch + XGBoost + Airflow + MLflow + Spark + Jupyter + TPU
Video
https://youtu.be/vaB4IM6ySD0
Description
In this workshop, we build real-world machine learning pipelines using TensorFlow Extended (TFX), KubeFlow, and Airflow.
Described in the 2017 paper, TFX is used internally by thousands of Google data scientists and engineers across every major product line within Google.
KubeFlow is a modern, end-to-end pipeline orchestration framework that embraces the latest AI best practices including hyper-parameter tuning, distributed model training, and model tracking.
Airflow is the most-widely used pipeline orchestration framework in machine learning.
Pre-requisites
Modern browser - and that's it!
Every attendee will receive a cloud instance
Nothing will be installed on your local laptop
Everything can be downloaded at the end of the workshop
Location
Online Workshop
Agenda
1. Create a Kubernetes cluster
2. Install KubeFlow, Airflow, TFX, and Jupyter
3. Setup ML Training Pipelines with KubeFlow and Airflow
4. Transform Data with TFX Transform
5. Validate Training Data with TFX Data Validation
6. Train Models with Jupyter, Keras/TensorFlow 2.0, PyTorch, XGBoost, and KubeFlow
7. Run a Notebook Directly on Kubernetes Cluster with KubeFlow
8. Analyze Models using TFX Model Analysis and Jupyter
9. Perform Hyper-Parameter Tuning with KubeFlow
10. Select the Best Model using KubeFlow Experiment Tracking
11. Reproduce Model Training with TFX Metadata Store and Pachyderm
12. Deploy the Model to Production with TensorFlow Serving and Istio
13. Save and Download your Workspace
Key Takeaways
Attendees will gain experience training, analyzing, and serving real-world Keras/TensorFlow 2.0 models in production using model frameworks and open-source tools.
Related Links
1. PipelineAI Home: https://pipeline.ai
2. PipelineAI Community Edition: http://community.pipeline.ai
3. PipelineAI GitHub: https://github.com/PipelineAI/pipeline
4. Advanced Spark and TensorFlow Meetup (SF-based, Global Reach): https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup
5. YouTube Videos: https://youtube.pipeline.ai
6. SlideShare Presentations: https://slideshare.pipeline.ai
7. Slack Support: https://joinslack.pipeline.ai
8. Web Support and Knowledge Base: https://support.pipeline.ai
9. Email Support: support@pipeline.ai
Speaker: Umayah Abdennabi
Agenda
* Intro Grammarly (Umayah Abdennabi, 5 mins)
* Meetup Updates and Announcements (Chris, 5 mins)
* Custom Functions in Spark SQL (30 mins)
Speaker: Umayah Abdennabi
Spark comes with a rich Expression library that can be extended to make custom expressions. We will look into custom expressions and why you would want to use them.
* TF 2.0 + Keras (30 mins)
Speaker: Francesco Mosconi
Tensorflow 2.0 was announced at the March TF Dev Summit, and it brings many changes and upgrades. The most significant change is the inclusion of Keras as the default model building API. In this talk, we'll review the main changes introduced in TF 2.0 and highlight the differences between open source Keras and tf.keras
* SQUAD Deep-Dive: Question & Answer with Context (45 mins)
Speaker: Brett Koonce (https://quarkworks.co)
SQuAD (Stanford Question Answer Dataset) is an NLP challenge based around answering questions by reading Wikipedia articles, designed to be a real-world machine learning benchmark. We will look at several different ways to tackle the SQuAD problem, building up to state of the art approaches in terms of time, complexity, and accuracy.
https://rajpurkar.github.io/SQuAD-explorer/
https://dawn.cs.stanford.edu/benchmark/#squad
Food and drinks will be provided. The event will be held at Grammarly's office at One Embarcadero Center on the 9th floor. When you arrive at One Embarcadero, take the escalator to the second floor where you will find the lobby and elevators to the office suites. Come on up to the 9th floor (no need to check in at security), and ring the Grammarly doorbell.
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...Chris Fregly
Traditional machine learning pipelines end with life-less models sitting on disk in the research lab. These traditional models are typically trained on stale, offline, historical batch data. Static models and stale data are not sufficient to power today's modern, AI-first Enterprises that require continuous model training, continuous model optimizations, and lightning-fast model experiments directly in production. Through a series of open source, hands-on demos and exercises, we will use PipelineAI to breathe life into these models using 4 new techniques that we’ve pioneered:
* Continuous Validation (V)
* Continuous Optimizing (O)
* Continuous Training (T)
* Continuous Explainability (E).
The Continuous "VOTE" techniques has proven to maximize pipeline efficiency, minimize pipeline costs, and increase pipeline insight at every stage from continuous model training (offline) to live model serving (online.)
Attendees will learn to create continuous machine learning pipelines in production with PipelineAI, TensorFlow, and Kafka.
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...Chris Fregly
Perform Online Predictions using Slack
A/B and multi-armed bandit model compare
Train Online Models with Kafka Streams
Create new models quickly
Deploy to production safely
Mirror traffic to validate online performance
Any Framework, Any Hardware, Any Cloud
Dashboard to manage the lifecycle of models from local development to live production
Generates optimized runtimes for the models
Custom targeting rules, shadow mode, and percentage-based rollouts to safely test features in live production
Continuous model training, model validation, and pipeline optimization
https://youtu.be/zpkH9oiIovU
https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/258276286/
Related Links
PipelineAI Home: https://pipeline.ai
PipelineAI Community Edition: https://community.pipeline.ai
PipelineAI GitHub: https://github.com/PipelineAI/pipeline
PipelineAI Quick Start: https://quickstart.pipeline.ai
Advanced Spark and TensorFlow Meetup (SF-based, Global Reach): https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup
YouTube Videos: https://youtube.pipeline.ai
SlideShare Presentations: https://slideshare.pipeline.ai
Slack Support:
https://joinslack.pipeline.ai
Web Support and Knowledge Base: https://support.pipeline.ai
Email Support: help@pipeline.ai
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...Chris Fregly
Chris Fregly, Founder @ PipelineAI, will walk you through a real-world, complete end-to-end Pipeline-optimization example. We highlight hyper-parameters - and model pipeline phases - that have never been exposed until now.
While most Hyperparameter Optimizers stop at the training phase (ie. learning rate, tree depth, ec2 instance type, etc), we extend model validation and tuning into a new post-training optimization phase including 8-bit reduced precision weight quantization and neural network layer fusing - among many other framework and hardware-specific optimizations.
Next, we introduce hyperparameters at the prediction phase including request-batch sizing and chipset (CPU v. GPU v. TPU).
Lastly, we determine a PipelineAI Efficiency Score of our overall Pipeline including Cost, Accuracy, and Time. We show techniques to maximize this PipelineAI Efficiency Score using our massive PipelineDB along with the Pipeline-wide hyper-parameter tuning techniques mentioned in this talk.
Bio
Chris Fregly is Founder and Applied AI Engineer at PipelineAI, a Real-Time Machine Learning and Artificial Intelligence Startup based in San Francisco.
He is also an Apache Spark Contributor, a Netflix Open Source Committer, founder of the Global Advanced Spark and TensorFlow Meetup, author of the O’Reilly Training and Video Series titled, "High Performance TensorFlow in Production with Kubernetes and GPUs."
Previously, Chris was a Distributed Systems Engineer at Netflix, a Data Solutions Engineer at Databricks, and a Founding Member and Principal Engineer at the IBM Spark Technology Center in San Francisco.
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...Chris Fregly
https://pipeline.ai
With PipelineAI, You Can…
* Generate Hardware-Specific Model Optimizations
* Deploy and Compare Models in Live Production
* Optimize Complete AI Pipeline Across Many Models
* Hyper-Parameter Tune Both Training & Predicting Phases
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...Chris Fregly
https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/244971261/
Based on this blog post: https://mengdong.github.io/2017/07/15/distributed-tensorflow-with-gpu-on-kubernetes-and-mapr/
youtube video:
https://www.youtube.com/watch?v=3phz1_B-rR4
http://pipeline.ai
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...Chris Fregly
Online Workshop
Note: A GPU-based cloud instance will be provided to each attendee for the duration of this event!!
At 8am PT on the morning of this workshop, we will email the Webinar details to your email address registered with Eventbrite.
If this email address is not up to date - or you do not get the email by 8am PT - please email your Eventbrite confirmation to help@pipeline.ai and we'll send you the details.
http://pipeline.ai
Title
PipelineAI Distributed Spark ML + Tensorflow AI + GPU Workshop
Time
Start: 9am PT Time
End: 1pm PT Time
Highlights
We will each build an end-to-end, continuous Tensorflow AI model training and deployment pipeline on our own GPU-based cloud instance.
At the end, we will combine our cloud instances to create the LARGEST Distributed Tensorflow AI Training and Serving Cluster in the WORLD!
Pre-requisites
Just a modern browser, internet connection, and a good night's sleep! We'll provide the rest.
Agenda
Spark ML
TensorFlow AI
Storing and Serving Models with HDFS
Trade-offs of CPU vs. *GPU, Scale Up vs. Scale Out
CUDA + cuDNN GPU Development Overview
TensorFlow Model Checkpointing, Saving, Exporting, and Importing
Distributed TensorFlow AI Model Training (Distributed Tensorflow)
TensorFlow's Accelerated Linear Algebra Framework (XLA)
TensorFlow's Just-in-Time (JIT) Compiler, Ahead of Time (AOT) Compiler
Centralized Logging and Visualizing of Distributed TensorFlow Training (Tensorboard)
Distributed Tensorflow AI Model Serving/Predicting (TensorFlow Serving)
Centralized Logging and Metrics Collection (Prometheus, Grafana)
Continuous TensorFlow AI Model Deployment (TensorFlow, Airflow)
Hybrid Cross-Cloud and On-Premise Deployments (Kubernetes)
High-Performance and Fault-Tolerant Micro-services (NetflixOSS)
More Info including GitHub and Docker Repos
http://pipeline.ai
Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC Meetup Feb 22 2016
1. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
Spark & Recommendations
Spark, Streaming, Machine Learning, Graph Processing,
Approximations, Probabilistic Data Structures, NLP
Apache Spark Maryland Meetup
Thanks to Tetra Concepts & Jailbreak Brewing Co!!
Feb 22nd, 2016
Chris Fregly
Principal Data Solutions Engineer
We’re Hiring! (Only Nice People)
advancedspark.com!
2. Who Am I?
Streaming Data Engineer
Netflix OSS Committer
Data Solutions Engineer
Apache Spark Contributor
Principal Data Solutions Engineer
IBM Spark Technology Center
Meetup Organizer
Advanced Apache Spark Meetup
Book Author
Advanced Spark (due 2016)
3. Recent World Tour: Freg-a-Palooza!
London Spark Meetup (Oct 12th)
Scotland Data Science Meetup (Oct 13th)
Dublin Spark Meetup (Oct 15th)
Barcelona Spark Meetup (Oct 20th)
Madrid Big Data Meetup (Oct 22nd)
Paris Spark Meetup (Oct 26th)
Amsterdam Spark Summit (Oct 27th)
Brussels Spark Meetup (Oct 30th)
Zurich Big Data Meetup (Nov 2nd)
Geneva Spark Meetup (Nov 5th)
Oslo Big Data Hadoop Meetup (Nov 19th)
Helsinki Spark Meetup (Nov 20th)
Stockholm Spark Meetup (Nov 23rd)
Copenhagen Spark Meetup (Nov 25th)
Istanbul Spark Meetup (Nov 26th)
Budapest Spark Meetup (Nov 28th)
Singapore Spark Meetup (Dec 1st)
Sydney Spark Meetup (Dec 8th)
Melbourne Spark Meetup (Dec 9th)
Toronto Spark Meetup (Dec 14th)
4. Advanced Apache Spark Meetup
http://advancedspark.com
Meetup Metrics
Top 5 Most-active Spark Meetup!
2600 Members in just 6 mos!!
2600 Docker downloads (demos)
Meetup Mission
Deep-dive into Spark and related open source projects
Surface key patterns and idioms
Focus on distributed systems, scale, and performance
5. Live, Interactive Demo!!
Audience Participation Required
(cell phone or laptop)
6. demo.advancedspark.com
[Architecture diagram. Left-hand labels: End User, ElasticSearch, Spark ML, Data Scientist. Right-hand labels: Kafka, Spark Streaming, Cassandra and Redis, Zeppelin and iPython.]
7. Presentation Outline
Scaling with Parallelism and Composability
Similarity and Recommendations
When to Approximate
Common Algorithms and Data Structures
Common Libraries and Tools
Netflix Recommendations and Data Pipeline
8. Scaling with Parallelism
[Figure. Labels: Peter; O(log n); O(log n).]
9. Scaling with Composability
Max: (a max b max c max d) == (a max b) max (c max d)
Set Union: (a U b U c U d) == (a U b) U (c U d)
Addition: (a + b + c + d) == (a + b) + (c + d)
Multiply: (a * b * c * d) == (a * b) * (c * d)
Division??
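The regrouping shown on this slide is what lets a reduction be split across partitions and recombined. A minimal Python sketch of that idea follows; the helper name `chunked_reduce` and the sample values are illustrative assumptions, not something from the talk.

```python
from functools import reduce
import operator


def chunked_reduce(op, values, chunk_size=2):
    """Reduce each chunk on its own, then reduce the partial results.

    This mimics a parallel reduce: each chunk could live on a different machine.
    """
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    partials = [reduce(op, chunk) for chunk in chunks]
    return reduce(op, partials)


values = [3, 4, 7, 8]
sets = [{1}, {2}, {1, 3}, {4}]

# Associative operations: grouped and sequential results agree.
max_ok = chunked_reduce(max, values) == reduce(max, values)
add_ok = chunked_reduce(operator.add, values) == reduce(operator.add, values)
mul_ok = chunked_reduce(operator.mul, values) == reduce(operator.mul, values)
union_ok = chunked_reduce(operator.or_, sets) == reduce(operator.or_, sets)

# Division is not associative, so regrouping changes the answer.
sequential = reduce(operator.truediv, values)       # ((3 / 4) / 7) / 8
grouped = chunked_reduce(operator.truediv, values)  # (3 / 4) / (7 / 8)
print(max_ok, add_ok, mul_ok, union_ok, sequential, grouped)
```

Only the operations that pass the regrouping check are safe to push down to partitions and merge later.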
What about Division?
Division
(a / b / c / d) != (a / b) / (c / d)
(3 / 4 / 7 / 8) != (3 / 4) / (7 / 8)
(((3 / 4) / 7) / 8) != ((3 * 8) / (4 * 7))
0.0134 != 0.857
Not Composable
What were the Egyptians thinking?!
“Divide like an Egyptian”
What about Average?
AVG (3, 5, 5, 7) == 5
Overall AVG over [value, count] pairs: [3, 1], [5, 1], [5, 1], [7, 1]
((3 + 5) + (5 + 7)) / ((1 + 1) + (1 + 1)) == 20 / 4 == 5
Add, Add, Add? Composable!
Single Divide at the End? Doesn’t need to be Composable!
Pairwise AVG
(3 + 5) / 2 + (5 + 7) / 2 == 8 / 2 + 12 / 2 == 20 / 2 == 10 != 5
Divide, Add, Divide? Not Composable
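The average example can be sketched in Python by carrying (sum, count) pairs and dividing once at the end. The function names `to_pair`, `combine`, and `finish` are assumptions made for this illustration.

```python
from functools import reduce


def to_pair(value):
    """Lift a raw value into a (sum, count) pair."""
    return (value, 1)


def combine(left, right):
    """Associative merge of two (sum, count) pairs: add sums, add counts."""
    return (left[0] + right[0], left[1] + right[1])


def finish(pair):
    """The single divide, applied once at the very end."""
    total, count = pair
    return total / count


values = [3, 5, 5, 7]
pairs = [to_pair(v) for v in values]

# Any grouping of combine() gives the same (sum, count) pair.
left_to_right = reduce(combine, pairs)
pairwise = combine(combine(pairs[0], pairs[1]), combine(pairs[2], pairs[3]))
average = finish(pairwise)

# Dividing early and then adding is the non-composable version.
naive = (3 + 5) / 2 + (5 + 7) / 2
print(left_to_right, pairwise, average, naive)
```

The pair is the composable part; the division is just a final presentation step.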
Similarity
Euclidean Similarity
Exists in Euclidean, flat space
Based on Euclidean distance
Linear measure
Bias towards magnitude
Cosine Similarity
Angular measure
Adjusts for Euclidean magnitude bias
Normalizes to unit vectors
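To make the magnitude bias concrete, here is a small Python sketch comparing Euclidean distance with cosine similarity. The three user vectors are made-up illustration data.

```python
import math


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


light_user = [1.0, 2.0, 0.0]
heavy_user = [10.0, 20.0, 0.0]  # same taste, 10x the activity
other_user = [0.0, 2.0, 1.0]    # different taste, similar activity

# Euclidean distance says the heavy user is far away (magnitude bias).
d_heavy = euclidean_distance(light_user, heavy_user)
d_other = euclidean_distance(light_user, other_user)

# Cosine compares direction only, so the heavy user is the closest match.
c_heavy = cosine_similarity(light_user, heavy_user)
c_other = cosine_similarity(light_user, other_user)
print(d_heavy, d_other, c_heavy, c_other)
```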
Jaccard Similarity
Set similarity measurement
Size of set intersection / size of set union
Based on Jaccard distance
Bias towards popularity
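A minimal Python sketch of Jaccard similarity and distance over tag sets follows. The tag sets are invented for illustration, and returning 1.0 for two empty sets is a convention chosen here.

```python
def jaccard_similarity(a, b):
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen for this sketch
    return len(a & b) / len(a | b)


def jaccard_distance(a, b):
    return 1.0 - jaccard_similarity(a, b)


top_gun_tags = {"action", "navy", "jets", "80s"}
taps_tags = {"drama", "military", "80s"}

similarity = jaccard_similarity(top_gun_tags, taps_tags)  # 1 shared tag of 6 total
distance = jaccard_distance(top_gun_tags, taps_tags)
print(similarity, distance)
```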
Log Likelihood Similarity
Adjusts for popularity bias
Netflix “Shawshank” problem
Word Similarity
Based on edit distance
Calculate char differences between words
Deletes, transposes, replaces, inserts
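One way to count those four operations is the optimal string alignment variant of edit distance, sketched below in Python. Choosing this variant is an assumption; the slide does not say which edit distance was used.

```python
def edit_distance(a, b):
    """Optimal string alignment distance.

    Deletes, inserts, replaces, and transposes of adjacent characters cost 1 each.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # delete
                d[i][j - 1] + 1,         # insert
                d[i - 1][j - 1] + cost,  # replace (or match)
            )
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transpose
    return d[-1][-1]


print(edit_distance("spark", "sprak"), edit_distance("kitten", "sitting"))
```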
Document Similarity
TF/IDF
Term Freq / Inverse Document Freq
Used by most search engines
Word2Vec
Words embedded in a vector space, near similar words
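A rough Python sketch of TF/IDF weighting plus cosine comparison is shown below. It assumes the plain textbook formulas (tf = count / document length, idf = log(N / document frequency)); real libraries typically use smoothed variants, and the three documents are invented.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = [
    "navy pilots fly fast jets",
    "navy cadets defend the academy",
    "pilots fly fast jets in the navy",
]
vectors = tf_idf(docs)

# "navy" appears in every document, so its idf (and weight) is zero.
sim_0_1 = cosine(vectors[0], vectors[1])
sim_0_2 = cosine(vectors[0], vectors[2])
print(vectors[0]["navy"], sim_0_1, sim_0_2)
```

Terms that appear everywhere stop contributing, so the first and third documents come out as the similar pair.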
Similarity Pathway
i.e. closest recommendations between 2 people
Calculating Similarity
Exact Brute-Force
“All-pairs similarity”
aka “Pair-wise similarity”, “Similarity join”
Cartesian O(n^2) shuffle and comparison
Approximate
Sampling
Bucketing (aka “Partitioning”, “Clustering”)
Remove data with low probability of similarity
Reduce shuffle and comparisons
Bonus: Document Summary
Text Rank
aka “Sentence Rank”
TF/IDF + Similarity Graph + PageRank
Intuition
Surface summary sentences (abstract)
Most similar to all others (TF/IDF + Similarity Graph)
Most influential sentences (PageRank)
Similarity Graph
Vertex is movie, tag, actor, plot summary, etc.
Edges are relationships and weights
Topic-Sensitive PageRank
Graph diffusion algorithm
Pre-process graph, add vector of probabilities to each vertex
Probability of landing at this vertex from every other vertex
Recommendations
Basic Terminology
User: User seeking recommendations
Item: Item being recommended
Explicit User Feedback: like, rating, movie view, profile read, search
Implicit User Feedback: click, hover, scroll, navigation
Instances: Rows of user feedback/input data
Overfitting: Training a model too closely to the training data & hyperparameters
Hold Out Split: Holding out some of the instances to avoid overfitting
Features: Columns of instance rows (of feedback/input data)
Cold Start Problem: Not enough data to personalize (new)
Hyperparameter: Model-specific config knobs for tuning (tree depth, iterations)
Model Evaluation: Compare predictions to actual values of hold out split
Feature Engineering: Modify, reduce, combine features
Features
Binary: True or False
Numeric Discrete: Integers
Numeric: Real Values
Binning: Convert Continuous into Discrete (Time of Day->Morning, Afternoon)
Categorical Ordinal: Size (Small->Medium->Large), Ratings (1->5)
Categorical Nominal: Independent, Favorite Sports Teams, Dating Spots
Temporal: Time-based, Time of Day, Binge Viewing
Text: Movie Titles, Genres, Tags, Reviews (Tokenize, Stop Words, Stemming)
Media: Images, Audio, Video
Geographic: (Longitude, Latitude), Geohash
Latent: Hidden Features within Data (Collaborative Filtering)
Derived: Age of Movie, Duration of User Subscription
Feature Engineering
Dimension Reduction
Reduce number of features in feature space
Principal Component Analysis (PCA)
Helps find the principal features that best describe variance in data
Peel the dimensional layers back until you describe the data
One-Hot Encoding
Convert nominal categorical feature values to 0’s, 1’s
Remove numerical relationship between the categories
1 binary column per category
Bears    -> 1  -->  Bears    -> [1,0,0]
49’ers   -> 2  -->  49’ers   -> [0,1,0]
Steelers -> 3  -->  Steelers -> [0,0,1]
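The team example can be reproduced with a few lines of Python. The function name `one_hot_encode` and the first-seen column order are assumptions of this sketch.

```python
def one_hot_encode(values):
    """Map each distinct category to a vector containing a single 1."""
    categories = list(dict.fromkeys(values))  # distinct values, first-seen order
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        encoded.append(row)
    return categories, encoded


teams = ["Bears", "49'ers", "Steelers", "Bears"]
categories, encoded = one_hot_encode(teams)
print(categories, encoded)
```

No vector is "larger" than another, which removes the accidental ordering that the integer codes 1, 2, 3 would imply.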
Normalize and Standardize Features
Goal
Scale features to standard size
Required by many ML algos
Normalize Features
Calculate L1 (or L2, etc) norm
Divide elements by norm
org.apache.spark.ml.feature.Normalizer
Standardize Features
Apply standard normal transformation
Mean == 0
StdDev == 1
org.apache.spark.ml.feature.StandardScaler
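The two transformations can be sketched in plain Python as follows. This is not the Spark implementation: it normalizes one row with the L2 norm and standardizes one column with the population standard deviation, and library defaults may differ.

```python
import math


def l2_normalize(row):
    """Divide each element of a feature vector by the vector's L2 norm."""
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row] if norm else list(row)


def standardize(column):
    """Shift a feature column to mean 0 and scale it to standard deviation 1."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / n)  # population std
    return [(x - mean) / std for x in column] if std else [0.0] * n


unit_row = l2_normalize([3.0, 4.0])
scaled_column = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(unit_row, scaled_column)
```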
Non-Personalized Recommendations
Cold Start Problem
“Cold Start” problem
New user, don’t know their preference, must show something!
Movies with highest-rated actors
Top K Aggregations
Most desirable singles
PageRank of likes and dislikes
Facebook social graph
Friend-based recommendations
Personalized Recommendations
Clustering (aka. Nearest Neighbors)
User-to-User Clustering (User Behavior)
Similar items viewed or rated
Similar viewing pattern (i.e. binge or casual)
Item-to-Item Clustering (Item Description)
Similar item tags/metadata (Jaccard Similarity, Locality Sensitive Hash)
Similar profile text and categories (TF/IDF, Word2Vec, NLP)
Similar images/facial structures (Convolutional Neural Nets, Eigenfaces)
http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.htm
My OKCupid Profile
My Hinge Profile
Dating Site
Bonus: NLP Conversation Bot
“If your responses to my generic opening lines are positive, I may read your profile.”
Spark ML and Stanford CoreNLP: TF/IDF, DecisionTrees, Sentiment Analysis
User-to-Item Collaborative Filtering
Matrix Factorization
① Factor the large matrix (left) into 2 smaller matrices (right)
② Smaller matrices, when multiplied, approximate original
③ Fill in the missing values within the large matrix
④ Surface latent features from within user-item interaction
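The four steps can be sketched with a tiny stochastic-gradient-descent factorization in Python. The ratings matrix, the rank, and every hyperparameter below are illustrative assumptions; production systems (for example ALS in Spark) work differently at scale.

```python
import random


def predict(U, V, u, i):
    return sum(a * b for a, b in zip(U[u], V[i]))


def factorize(ratings, rank=2, epochs=2000, lr=0.01, reg=0.02, seed=0):
    """Factor a sparse ratings matrix (None = missing) into user factors U
    and item factors V, training only on the observed entries."""
    rng = random.Random(seed)
    n_users, n_items = len(ratings), len(ratings[0])
    U = [[rng.random() for _ in range(rank)] for _ in range(n_users)]
    V = [[rng.random() for _ in range(rank)] for _ in range(n_items)]
    observed = [(u, i, r) for u, row in enumerate(ratings)
                for i, r in enumerate(row) if r is not None]
    for _ in range(epochs):
        for u, i, r in observed:
            err = r - predict(U, V, u, i)
            for k in range(rank):
                pu, qi = U[u][k], V[i][k]
                U[u][k] += lr * (err * qi - reg * pu)
                V[i][k] += lr * (err * pu - reg * qi)
    return U, V


def rmse(ratings, U, V):
    errs = [(r - predict(U, V, u, i)) ** 2
            for u, row in enumerate(ratings)
            for i, r in enumerate(row) if r is not None]
    return (sum(errs) / len(errs)) ** 0.5


ratings = [
    [5, 3, None, 1],
    [4, None, None, 1],
    [1, 1, None, 5],
    [1, None, None, 4],
    [None, 1, 5, 4],
]
U0, V0 = factorize(ratings, epochs=0)  # untrained starting point
U, V = factorize(ratings)
before, after = rmse(ratings, U0, V0), rmse(ratings, U, V)
filled_in = predict(U, V, 0, 2)        # a value missing from the original matrix
print(before, after, filled_in)
```

The columns of U and V are the latent features: nobody names them, they simply fall out of fitting the observed ratings.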
Item-to-Item Collaborative Filtering
Made famous by Amazon Paper ~2003
Problem
As # of users grew, user-item collab filtering didn’t scale
Solution
Offline/Batch
Generate itemId -> List[userId] vectors
Online/Real-time
For each item in cart, recommend similar items from vector space
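The offline and online halves can be sketched in Python as below. Cosine similarity over binary user sets is one common choice and is an assumption here, as are the function names and the sample interactions.

```python
import math
from collections import defaultdict


def build_item_vectors(interactions):
    """Offline/batch step: itemId -> set of userIds who interacted with it."""
    item_users = defaultdict(set)
    for user_id, item_id in interactions:
        item_users[item_id].add(user_id)
    return dict(item_users)


def item_similarity(users_a, users_b):
    """Cosine similarity between two binary user vectors."""
    if not users_a or not users_b:
        return 0.0
    return len(users_a & users_b) / math.sqrt(len(users_a) * len(users_b))


def recommend(cart, item_users, top_n=2):
    """Online step: score items outside the cart by similarity to cart items."""
    scores = defaultdict(float)
    for cart_item in cart:
        for candidate, users in item_users.items():
            if candidate not in cart:
                scores[candidate] += item_similarity(item_users[cart_item], users)
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, score in ranked[:top_n] if score > 0]


interactions = [
    ("u1", "Top Gun"), ("u1", "Taps"),
    ("u2", "Top Gun"), ("u2", "Taps"), ("u2", "A Few Good Men"),
    ("u3", "Top Gun"), ("u3", "A Few Good Men"),
    ("u4", "Frozen"),
]
item_users = build_item_vectors(interactions)
recommendations = recommend(["Top Gun"], item_users)
print(recommendations)
```

The expensive part (building the item vectors) depends on the catalog rather than on the visitor, which is why it can run in batch.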
When to Approximate?
Memory or time constrained queries
Relative vs. exact counts are OK (# errors between then and now)
Using machine learning or graph algos
Inherently probabilistic and approximate
Finding topics in documents (LDA)
Finding similar pairs of users, items, words at scale (LSH)
Finding top influencers (PageRank)
Streaming aggregations
Inherently sloppy collection (exactly once?)
Approximate as much as you can get away with!
Ask for forgiveness later !!
When NOT to Approximate?
If you’ve ever heard the term…
“Sarbanes-Oxley”
…at the office after 2002.
A Few Good Algorithms
You can’t handle the approximate!
Common to These Algos & Data Structs
Low, fixed size in memory
Known error bounds
Store large amount of data
Less memory than Java/Scala collections
Tunable tradeoff between size and error
Rely on multiple hash functions or operations
Size of hash range defines error
Bloom Filter
Set.contains(key): Boolean
“Hash Multiple Times and Flip the Bits Wherever You Land”
Bloom Filter
Approximate set membership for key
False positives are possible: the filter reports contains(), but the key was never added
False negatives are not: if the filter reports !contains(), the key was never added
Elements are only added, never removed
Bloom Filter in Action
set(key)
contains(key): Boolean
Images by @avibryant
TRUE -> maybe contains
FALSE -> definitely does not contain.
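A small Python sketch of "hash multiple times and flip the bits wherever you land" follows. The class, its default sizes, and the salted SHA-256 hashing are assumptions of this illustration, not Algebird's implementation.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Derive several hash positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def contains(self, key):
        # True means "maybe contains"; False means "definitely does not contain".
        return all(self.bits[pos] for pos in self._positions(key))

    def merge(self, other):
        """Union of two filters with identical parameters: bitwise OR."""
        merged = BloomFilter(self.num_bits, self.num_hashes)
        merged.bits = [a or b for a, b in zip(self.bits, other.bits)]
        return merged


hour1, hour2, empty = BloomFilter(), BloomFilter(), BloomFilter()
hour1.add("user1001")
hour2.add("user2009")
both_hours = hour1.merge(hour2)
print(both_hours.contains("user1001"), both_hours.contains("user2009"))
```

Because merging is a bitwise OR, filters built on separate partitions combine in any order, which is the composability property from earlier in the talk.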
CountMin Sketch
Frequency Count and TopK
“Hash Multiple Times and Add 1 Wherever You Land”
CountMin Sketch (CMS)
Approximate frequency count and TopK for key
i.e. “Heavy Hitters” on Twitter
Matei Zaharia
Martin Odersky
Donald Trump
CountMin Sketch In Action (TopK, Count)
Images derived from @avibryant
Use Case: TopK movies using total views
add(Top Gun, 2)
add(A Few Good Men, 1)
add(Taps, 1)
getCount(Top Gun): Long
Multiple hash functions (1 hash function per row)
Binary hash output (1 element per column)
2 occurrences of “Top Gun” for slightly additional complexity
Find minimum of all rows
Can overestimate, but never underestimate
[Figure: sketch grid with cells labeled Top Gun (x 2), A Few Good Men, and Taps; callouts mark where other keys overlap Top Gun and A Few Good Men]
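The add and getCount calls from this slide can be sketched in Python as follows. The class, its depth and width defaults, and the salted SHA-256 hashing are assumptions made for illustration.

```python
import hashlib


class CountMinSketch:
    def __init__(self, depth=4, width=64):
        self.depth = depth  # number of rows, one hash function per row
        self.width = width  # counters per row
        self.table = [[0] * width for _ in range(depth)]

    def _column(self, row, key):
        digest = hashlib.sha256(f"{row}:{key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, count=1):
        for row in range(self.depth):
            self.table[row][self._column(row, key)] += count

    def get_count(self, key):
        # Collisions only inflate counters, so the minimum is the best estimate.
        return min(self.table[row][self._column(row, key)]
                   for row in range(self.depth))

    def merge(self, other):
        """Element-wise sum of two sketches with identical parameters."""
        merged = CountMinSketch(self.depth, self.width)
        for r in range(self.depth):
            merged.table[r] = [a + b for a, b in zip(self.table[r], other.table[r])]
        return merged


sketch = CountMinSketch()
sketch.add("Top Gun", 2)
sketch.add("A Few Good Men", 1)
sketch.add("Taps", 1)
top_gun_views = sketch.get_count("Top Gun")
doubled = sketch.merge(sketch)
print(top_gun_views, doubled.get_count("Top Gun"))
```

Taking the minimum across rows is what bounds the error: an estimate is wrong only if the key collides with something else in every row.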
HyperLogLog
Count Distinct
“Hash Multiple Times and Uniformly Distribute Where You Land”
HyperLogLog (HLL)
Approximate count distinct
Slight twist
Special hash function creates uniform distribution
Error estimate
14 bits for size of range
m = 2^14 = 16,384 hash slots
error = 1.04/(sqrt(16,384)) = .81%
[Figure callout: “Not many of these”]
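A compact Python sketch of HyperLogLog with 2^14 registers is given below. It follows the textbook estimator with the small-range (linear counting) correction; the class, the SHA-1 based hash, and the user ID ranges are assumptions of this illustration and not how Algebird or Redis implement it.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p=14):
        self.p = p        # bits used to pick a register
        self.m = 1 << p   # number of registers (hash slots)
        self.registers = [0] * self.m

    def add(self, key):
        digest = hashlib.sha1(str(key).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")          # 64-bit hash
        index = h >> (64 - self.p)                     # first p bits
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1   # leading zeros + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)          # constant for m >= 128
        estimate = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            estimate = self.m * math.log(self.m / zeros)  # small-range correction
        return round(estimate)

    def merge(self, other):
        """Union of two sketches: element-wise max of the registers."""
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def relative_error(self):
        return 1.04 / math.sqrt(self.m)


hour1, hour2 = HyperLogLog(), HyperLogLog()
for user_id in range(0, 6000):
    hour1.add(f"user{user_id}")
for user_id in range(4000, 10000):
    hour2.add(f"user{user_id}")
    hour2.add(f"user{user_id}")  # duplicates do not change the registers
both_hours = hour1.merge(hour2)  # 10,000 distinct users across both hours
print(hour1.count(), hour2.count(), both_hours.count(), hour1.relative_error())
```

The merged sketch estimates the distinct users across both hours without double-counting the overlap, using the same fixed number of registers.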
HyperLogLog In Action (Count Distinct)
Use Case: Number of distinct users who view a movie
[Figure: user IDs placed along a uniform hash range in three panels titled “Top Gun: Hour 1”, “Top Gun: Hour 2”, and “Top Gun: Hour 1 + 2”; axis ranges shown are 0 to 16 and 0 to 32]
Uniform Distribution: Estimate distinct # of users by inspecting just the beginning
Combine across different scales
Locality Sensitive Hashing
Set Similarity
“Pre-process Items into Buckets, Compare Within Buckets”
Locality Sensitive Hashing (LSH)
Approximate set similarity
Hash designed to cluster similar items
Avoids cartesian all-pairs comparison
Pre-process m rows into b buckets
b << m
Hash items multiple times
Similar items hash to overlapping buckets
Compare just contents of buckets
Much smaller cartesian … and parallel !!
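One common way to realize this for Jaccard-style set similarity is MinHash signatures split into bands, sketched here in Python. The banding parameters, function names, and sample tag sets are assumptions; the talk's own demo code may differ.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def hash64(seed, item):
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def min_hash_signature(items, num_hashes):
    """For each salted hash function, keep the smallest hash over the set."""
    return [min(hash64(seed, item) for item in items) for seed in range(num_hashes)]


def candidate_pairs(sets_by_id, bands=8, rows_per_band=2):
    """Hash each signature band to a bucket; ids sharing a bucket are candidates."""
    buckets = defaultdict(set)
    for set_id, items in sets_by_id.items():
        signature = min_hash_signature(items, bands * rows_per_band)
        for band in range(bands):
            chunk = tuple(signature[band * rows_per_band:(band + 1) * rows_per_band])
            buckets[(band, chunk)].add(set_id)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


sets_by_id = {
    "top_gun": {"action", "navy", "jets", "80s"},
    "top_gun_dup": {"action", "navy", "jets", "80s"},
    "frozen": {"animation", "musical", "snow"},
}
pairs = candidate_pairs(sets_by_id)
print(pairs)
```

Only the pairs that share a bucket go on to an exact similarity check, which replaces the full cartesian comparison with many small, independent ones.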
DIMSUM
Set Similarity
“Pre-process and ignore data that is unlikely to be similar.”
DIMSUM
“Dimension Independent Matrix Square Using MR”
Remove vectors with low probability of similarity
RowMatrix.columnSimilarities(threshold)
Twitter DIMSUM Case Study
40% efficiency gain over brute-force Cosine Sim
Common Tools to Approximate
Twitter Algebird: Composable Library
Redis: Distributed Cache
Apache Spark: Big Data Processing
Twitter Algebird
Rooted in Algebraic Fundamentals!
Parallel
Associative
Composable
Examples
Min, Max, Avg
BloomFilter (Set.contains(key))
HyperLogLog (Count Distinct)
CountMin Sketch (TopK Count)
Redis
Implementation of HyperLogLog (Count Distinct)
12KB per item count
2^64 max # of items
0.81% error (Tunable)
Add user views for given movie
PFADD TopGun_HLL user1001 user2009 user3005
PFADD TopGun_HLL user3003 user1001 (ignore duplicates)
Get distinct count (cardinality) of set
PFCOUNT TopGun_HLL
Returns: 4 (distinct users viewed this movie)
Union 2 HyperLogLog Data Structures
PFMERGE TopGun_HLL Taps_HLL
Spark Approximations
Spark Core
RDD.count*Approx()
Spark SQL
PartialResult
approxCountDistinct(column), HyperLogLogPlus
Spark ML
Stratified sampling
PairRDD.sampleByKey(withReplacement, fractions: Map[K, Double])
DIMSUM sampling
Probabilistic sampling reduces amount of comparison shuffle
RowMatrix.columnSimilarities(threshold)
Spark Streaming
A/B testing
StreamingTest.setTestMethod(“welch”).registerStream(dstream)
Demos!
Counting
Exact Count vs. Approx HyperLogLog, CountMin Sketch
HashSet vs. HyperLogLog (Memory)
HashSet vs. CountMin Sketch (Memory)
Set Similarity
Brute Force vs. Locality Sensitive Hashing Similarity
Brute Force Cartesian All Pair Similarity
47 seconds
Locality Sensitive Hash All Pair Similarity
6 seconds
Many More Demos!
or
Download Docker
Clone Github
http://advancedspark.com
Netflix Recommendation & Data Pipeline
From 5 Stars to Trending Now
Netflix Has a Lot of Data
Netflix has a lot of data about a lot of users and a lot of movies.
Netflix can use this data to buy new movies.
Netflix is global.
Netflix can use this data to choose original programming.
Netflix knows that a lot of people like politics and Kevin Spacey.
Summary: Buy NFLX Stock!
My favorite movie: “Harold and Kumar Go to White Castle”
The UK doesn’t have White Castle. Renamed my favourite movie to: “Harold and Kumar Get the Munchies”
This broke my unit tests!
$1 Million Netflix Prize (2006-2009)
Goal
Improve movie predictions by 10% (RMSE)
Dataset
(userId, movieId, rating, timestamp)
Test data withheld to calculate RMSE upon submission
Winning algorithm
10.06% improvement (RMSE)
Ensemble of 500+ ML models combined with GBDTs
Computationally impractical
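For readers unfamiliar with the metric, here is a short Python sketch of RMSE and of the relative improvement the prize measured. The ratings and predictions are invented toy numbers, not Netflix Prize data.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def improvement(baseline_rmse, new_rmse):
    """Relative RMSE reduction, as a fraction of the baseline."""
    return (baseline_rmse - new_rmse) / baseline_rmse


held_out = [4, 3, 5, 1]        # withheld test ratings (toy values)
baseline_pred = [3, 3, 3, 3]   # e.g. always predict a middling rating
candidate_pred = [4, 3, 4, 2]  # a better model's predictions

baseline_rmse = rmse(baseline_pred, held_out)
candidate_rmse = rmse(candidate_pred, held_out)
gain = improvement(baseline_rmse, candidate_rmse)
print(baseline_rmse, candidate_rmse, gain)
```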
Secrets to the Winning Algorithms
Adjust for the following human biases…
① Alice Effect: user rates lower than the average user
② Inception Effect: movie is rated higher than the average movie
③ Overall mean rating of a movie
④ Number of people who have rated a movie
⑤ Mood, time of day, day of week, season, weather
⑥ Number of days since user’s first rating
⑦ Number of days since movie’s first rating
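The first three adjustments can be sketched as a simple baseline predictor: overall mean plus a per-movie offset plus a per-user offset. This unregularized Python version is an assumption for illustration only; the winning solutions used more elaborate, regularized and time-aware variants, and the ratings below are invented.

```python
from collections import defaultdict


def fit_baseline(ratings):
    """ratings: list of (user, movie, rating). Returns (mu, user_bias, movie_bias)."""
    mu = sum(r for _, _, r in ratings) / len(ratings)

    movie_residuals = defaultdict(list)
    for _, movie, r in ratings:
        movie_residuals[movie].append(r - mu)
    movie_bias = {m: sum(v) / len(v) for m, v in movie_residuals.items()}

    user_residuals = defaultdict(list)
    for user, movie, r in ratings:
        user_residuals[user].append(r - mu - movie_bias[movie])
    user_bias = {u: sum(v) / len(v) for u, v in user_residuals.items()}
    return mu, user_bias, movie_bias


def predict(mu, user_bias, movie_bias, user, movie):
    # Unknown users or movies fall back to a bias of 0.
    return mu + user_bias.get(user, 0.0) + movie_bias.get(movie, 0.0)


ratings = [
    ("alice", "Inception", 4), ("alice", "Taps", 2),
    ("bob", "Inception", 5), ("bob", "Taps", 4),
]
mu, user_bias, movie_bias = fit_baseline(ratings)
new_user_guess = predict(mu, user_bias, movie_bias, "carol", "Inception")
print(mu, user_bias, movie_bias, new_user_guess)
```

In this toy data Alice ends up with a negative offset and Inception with a positive one, so a model trained on the residuals no longer has to relearn those effects.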
Netflix Data Pipeline - Then
v1.0!
v2.0!
Netflix Data Pipeline - Now
v3.0!
8 million events per second
Netflix Recommendation Pipeline
Throw away batch-generated user factors (U)
Netflix Common ML Algorithms
Logistic Regression
Linear Regression
Gradient Boosted Decision Trees
Random Forest
Matrix Factorization
SVD
Restricted Boltzmann Machines
Deep Neural Nets
Markov Models
LDA
Clustering
Ensembles
Netflix Trending Now
Time of day
Personalized to user (viewing history, past ratings)
Personalized to events (Valentine’s Day)
“VHS”
Number of Plays
Number of Impressions
Calculate Take Rate
Bonus: Pandora Time of Day Recs
Work Days
Play familiar music
User is less likely to accept new music
Evenings and Weekends
Play new music
User is more likely to accept new music
Netflix Social Integration
Post to Facebook after movie start (5 mins)
Recommend without needing viewing history
Helps with Cold Start problem
Netflix Search
No results? No problem… Show similar results!
Empty searches are good!
Explicit feedback for future recommendations
Content to buy and produce!
Bonus: Netflix in 2004
Netflix noticed people started to rate movies higher!?
Why?
Significant UI improvements made around that time
Recommendation improvements (Cinematch)
Thank You!!
Chris Fregly @cfregly
IBM Spark Tech Center
http://spark.tc
San Francisco, California, USA
http://advancedspark.com
Sign up for the Meetup and Book
Contribute to Github Repo
Run all Demos using Docker
Find me: LinkedIn, Twitter, Github, Email, Fax
Image derived from http://www.duchess-france.org/
advancedspark.com
@cfregly