Speaker: Umayah Abdennabi
Agenda
* Intro to Grammarly (Umayah Abdennabi, 5 mins)
* Meetup Updates and Announcements (Chris, 5 mins)
* Custom Functions in Spark SQL (30 mins)
Speaker: Umayah Abdennabi
Spark comes with a rich Expression library that can be extended to make custom expressions. We will look into custom expressions and why you would want to use them.
* TF 2.0 + Keras (30 mins)
Speaker: Francesco Mosconi
TensorFlow 2.0 was announced at the March TF Dev Summit, and it brings many changes and upgrades. The most significant change is the inclusion of Keras as the default model-building API. In this talk, we'll review the main changes introduced in TF 2.0 and highlight the differences between open source Keras and tf.keras.
* SQuAD Deep-Dive: Question & Answer with Context (45 mins)
Speaker: Brett Koonce (https://quarkworks.co)
SQuAD (Stanford Question Answering Dataset) is an NLP challenge based around answering questions by reading Wikipedia articles, designed to be a real-world machine learning benchmark. We will look at several different ways to tackle the SQuAD problem, building up to state-of-the-art approaches in terms of time, complexity, and accuracy.
https://rajpurkar.github.io/SQuAD-explorer/
https://dawn.cs.stanford.edu/benchmark/#squad
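As a rough illustration of how answers to SQuAD-style questions are commonly scored, the sketch below computes exact match and token-level F1 between a predicted answer and a reference answer. The normalization steps (lowercasing, stripping punctuation and English articles) are modeled on the widely used SQuAD v1.1 evaluation approach; the function names are illustrative, and this is not a drop-in replacement for the official scorer.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, truth: str) -> bool:
    """True when both answers are identical after normalization."""
    return normalize_answer(prediction) == normalize_answer(truth)


def f1_score(prediction: str, truth: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(truth).split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("The Eiffel Tower.", "eiffel tower"))
    print(f1_score("in Paris, France", "Paris"))
```

Exact match rewards only fully correct spans, while F1 gives partial credit when the predicted span overlaps the reference, which is why the two are usually reported together.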
Food and drinks will be provided. The event will be held at Grammarly's office at One Embarcadero Center on the 9th floor. When you arrive at One Embarcadero, take the escalator to the second floor where you will find the lobby and elevators to the office suites. Come on up to the 9th floor (no need to check in at security), and ring the Grammarly doorbell.
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUs - Chris Fregly
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUs @ Strata London, May 24 2017
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUs - Advanced Spark and TensorFlow Meetup May 23 2017 @ Hotels.com London
We'll discuss how to deploy TensorFlow, Spark, and scikit-learn models on GPUs with Kubernetes across multiple cloud providers including AWS, Google, and Azure - as well as on-premises.
In addition, we'll discuss how to optimize TensorFlow models for high-performance inference using the latest TensorFlow XLA (Accelerated Linear Algebra) framework including the JIT and AOT Compilers.
GitHub Repo (100% Open Source!)
https://github.com/fluxcapacitor/pipeline
http://pipeline.io
Migrating Apache Spark ML Jobs to Spark + Tensorflow on Kubeflow - Databricks
This talk will take two existing Spark ML pipelines (Frank The Unicorn, for predicting PR comments (Scala) – https://github.com/franktheunicorn/predict-pr-comments & Spark ML on Spark Errors (Python)) and explore the steps involved in migrating them to a combination of Spark and TensorFlow. Using the open source Kubeflow project (now with Spark support as of 0.5), we will create two integrated end-to-end pipelines to explore the challenges involved & look at areas of improvement (e.g. Apache Arrow).
Big Data Beyond the JVM - Strata San Jose 2018 - Holden Karau
Many of the recent big data systems, like Hadoop, Spark, and Kafka, are written primarily in JVM languages. At the same time, there is a wealth of tools for data science and data analytics that exist outside of the JVM. Holden Karau and Rachel Warren explore the state of the current big data ecosystem and explain how to best work with it in non-JVM languages. While much of the focus will be on Python + Spark, the talk will also include interesting anecdotes about how these lessons apply to other systems (including Kafka).
Holden and Rachel detail how to bridge the gap using PySpark and discuss other solutions like Kafka Streams as well. They also outline the challenges of pure Python solutions like dask. Holden and Rachel start with the current architecture of PySpark and its evolution. They then turn to the future, covering Arrow-accelerated interchange for Python functions, how to expose Python machine learning models into Spark, and how to use systems like Spark to accelerate training of traditional Python models. They also dive into what other similar systems are doing as well as what the options are for (almost) completely ignoring the JVM in the big data space.
Python users will learn how to more effectively use systems like Spark and understand how the design is changing. JVM developers will gain an understanding of how to use Python code from data scientists and Python developers while avoiding the traditional trap of needing to rewrite everything.
Accelerating Big Data beyond the JVM - Fosdem 2018 - Holden Karau
Many popular big data technologies (such as Apache Spark, BEAM, Flink, and Kafka) are built in the JVM, and many interesting tools are built in other languages (ranging from Python to CUDA). For simple operations the cost of copying the data can quickly dominate, and in complex cases can limit our ability to take advantage of specialty hardware. This talk explores how improved formats are being integrated to reduce these hurdles to co-operation.
Many popular big data technologies (such as Apache Spark, BEAM, and Flink) are built in the JVM, while many interesting AI tools are built in other languages, and some require copying data to the GPU. As many folks have experienced, while we may wish we could spend all of our time playing with cool algorithms, we often need to spend more of it on data prep. Having to copy our data slowly between the JVM and the target language of computation can remove much of the benefit of being able to access our specialized tooling. Thankfully, as illustrated in the soon-to-be-released Spark 2.3, Apache Arrow and related tools offer the ability to reduce this overhead. This talk will explore how Arrow is being integrated into Spark and how it can be integrated into other systems, as well as the limitations and places where Apache Arrow will not magically save us.
Link: https://fosdem.org/2018/schedule/event/big_data_outside_jvm/
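To make the copying cost described above a little more concrete, here is a toy sketch that contrasts sending records one at a time with sending a single column-oriented batch. It uses only Python's `pickle` and `struct` modules; it does not use Apache Arrow's format or API, and the function names and the two-column `(id, score)` layout are assumptions made purely for illustration.

```python
import pickle
import struct


def send_row_at_a_time(rows):
    """Serialize each (id, score) row separately: one message per record."""
    return [pickle.dumps(row) for row in rows]


def receive_row_at_a_time(messages):
    """Deserialize every message individually on the receiving side."""
    return [pickle.loads(message) for message in messages]


def send_columnar_batch(rows):
    """Pack the row count, then all ids (int64), then all scores (float64)."""
    ids = [row[0] for row in rows]
    scores = [row[1] for row in rows]
    count = len(rows)
    # "<" selects little-endian with no padding, so offsets are predictable.
    return struct.pack(f"<q{count}q{count}d", count, *ids, *scores)


def receive_columnar_batch(payload):
    """Read each column back with one unpack call per column."""
    (count,) = struct.unpack_from("<q", payload, 0)
    ids = struct.unpack_from(f"<{count}q", payload, 8)
    scores = struct.unpack_from(f"<{count}d", payload, 8 + 8 * count)
    return list(zip(ids, scores))


if __name__ == "__main__":
    rows = [(i, i * 0.5) for i in range(1000)]
    messages = send_row_at_a_time(rows)
    batch = send_columnar_batch(rows)
    print(len(messages), "row messages vs 1 columnar buffer of", len(batch), "bytes")
```

The row-at-a-time path pays a serialization call per record on both sides, while the batch path pays a fixed number of calls regardless of row count. That per-record overhead is the kind of cost that columnar interchange formats aim to reduce.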
This session covers how Spark applications are unit tested and recommends an approach for doing so. It includes writing unit tests with and without the Spark Testing Base package, a Spark package containing base classes to use when writing tests with Spark.
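One common way to test "without" a Spark-specific test library is to keep the per-record logic in plain functions and test those directly. The sketch below assumes a hypothetical word-count job whose logic lives in `tokenize` and `count_words`; these names are illustrative and no Spark API is involved. In a real job the same functions would be handed to Spark transformations.

```python
import unittest


def tokenize(line: str) -> list:
    """Split a line into lowercase words (hypothetical job logic)."""
    return line.lower().split()


def count_words(lines) -> dict:
    """Count word occurrences across an iterable of lines."""
    counts = {}
    for line in lines:
        for word in tokenize(line):
            counts[word] = counts.get(word, 0) + 1
    return counts


class CountWordsTest(unittest.TestCase):
    def test_counts_are_case_insensitive(self):
        self.assertEqual(
            count_words(["Spark spark", "Beam"]),
            {"spark": 2, "beam": 1},
        )

    def test_empty_input_gives_empty_counts(self):
        self.assertEqual(count_words([]), {})


def run_tests() -> unittest.TestResult:
    """Run the test case programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(CountWordsTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Tests like these run in milliseconds because they never start a cluster. They do not cover distributed behaviour such as partitioning or serialization, which is where base classes that manage a local Spark context become useful.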
Testing and validating distributed systems with Apache Spark and Apache Beam ... - Holden Karau
As distributed data parallel systems, like Spark, are used for more mission-critical tasks, it is important to have effective tools for testing and validation. This talk explores the general considerations and challenges of testing systems like Spark through spark-testing-base and other related libraries.
With over 40% of folks automatically deploying the results of their Spark jobs to production, testing is especially important. Many of the tools for working with big data systems (like notebooks) are great for exploratory work, and can give a false sense of security (as well as additional excuses not to test). This talk explores why testing these systems is hard, special considerations for simulating "bad" partitioning, figuring out when your stream tests are stopped, and solutions to these challenges.
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. This talk will examine how to debug Apache Spark applications, the different options for logging in PySpark, as well as some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose, and this talk will examine how to effectively search logs from Apache Spark to spot common problems. In addition to the internal logging, this talk will look at options for logging from within our program itself.
Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but this talk will look at how to effectively use Spark’s current accumulators for debugging, as well as look to the future at data property type accumulators, which may be coming to Spark in a future version.
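The recompute problem can be illustrated with a toy model in plain Python. The `LazyDataset` and `ToyAccumulator` classes below are hypothetical stand-ins, not Spark APIs. They only mimic two behaviours: transformations are recorded lazily, and every action re-runs them unless the result has been cached.

```python
class ToyAccumulator:
    """Simple counter standing in for an accumulator in this toy model."""

    def __init__(self):
        self.value = 0

    def add(self, amount: int) -> None:
        self.value += amount


class LazyDataset:
    """Stores a recipe (source plus functions) and evaluates it per action."""

    def __init__(self, source, funcs=()):
        self._source = list(source)
        self._funcs = tuple(funcs)
        self._cache_enabled = False
        self._cached = None

    def map(self, func):
        # Nothing runs here; the function is only recorded.
        return LazyDataset(self._source, self._funcs + (func,))

    def cache(self):
        self._cache_enabled = True
        return self

    def _compute(self):
        if self._cached is not None:
            return self._cached
        rows = self._source
        for func in self._funcs:
            rows = [func(row) for row in rows]
        if self._cache_enabled:
            self._cached = rows
        return rows

    def count(self):
        return len(self._compute())

    def sum(self):
        return sum(self._compute())


def count_bad_rows(use_cache: bool) -> int:
    """Run two actions over the same dataset and report the accumulator."""
    bad_rows = ToyAccumulator()

    def parse(raw):
        try:
            return int(raw)
        except ValueError:
            bad_rows.add(1)  # side effect inside a transformation
            return 0

    parsed = LazyDataset(["1", "x", "3", "oops"]).map(parse)
    if use_cache:
        parsed.cache()
    parsed.count()
    parsed.sum()
    return bad_rows.value
```

With two unparseable rows and two actions, the uncached run reports four bad rows and the cached run reports two. This is the kind of over-counting that makes accumulator values inside transformations hard to trust.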
In addition to reading logs, and instrumenting our program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems.
Debuggers are a wonderful tool; however, when you have 100 computers the “wonder” can be a bit more like “pain”. This talk will look at how to connect remote debuggers, but also remind you that it’s probably not the easiest path forward.
Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung - Spark Summit
R is a very popular platform for Data Science. Apache Spark is a highly scalable data platform. How could we have the best of both worlds? How could a Data Scientist leverage the rich 9000+ packages on CRAN, and integrate Spark into their existing Data Science toolset?
In this talk we will walk through many examples of how several new features in Apache Spark 2.x enable this. We will also look at exciting recent and upcoming changes in Apache Spark 2.x releases.
A fast introduction to PySpark with a quick look at Arrow based UDFs - Holden Karau
This talk will introduce Apache Spark (one of the most popular big data tools), the different built-ins (from SQL to ML), and, of course, everyone's favorite wordcount example. Once we've got the nice parts out of the way, we'll talk about some of the limitations and the work being undertaken to address them. We'll also look at the cases where Spark is more like trying to hammer a screw. Since we want to finish on a happy note, we will close out by looking at the new vectorized UDFs in PySpark 2.3.
London Spark Meetup Project Tungsten Oct 12 2015 - Chris Fregly
Building on a previous talk about how Spark beat Hadoop @ 100TB Daytona GraySort, we present low-level details of Project Tungsten which includes many CPU and Memory optimizations.
Flink Forward SF 2017: Dean Wampler - Streaming Deep Learning Scenarios with... - Flink Forward
As a low-latency streaming tool, Flink offers the possibility of using machine learning, even "deep learning" (neural networks), with low latency. The growing FlinkML library provides some of the infrastructure support required for this goal, combined with third-party tools. This talk is a progress report on several scenarios we are developing at Lightbend, which combine Flink, Deeplearning4J, Spark, and Kafka to analyze cluster telemetry for anomaly detection, predictive autoscaling, and other scenarios. I'll focus on the pragmatics of training deep learning models in a streaming context, using batch and mini-batch training, combined with low-latency application of those models. I'll discuss the architecture we're using and highlight trade-offs of particular tools for certain design problems in the implementation. I'll discuss the drawbacks and workarounds of our design and finish with a look at how future developments in Flink could improve its support for scenarios like ours.
A Library for Emerging High-Performance Computing Clusters - Intel® Software
Deployed next-generation architectures and systems are characterized by high concurrency, low memory per core, and multilevels of hierarchy and heterogeneity. These characteristics bring out new challenges in energy efficiency, fault-tolerance, and scalability. Next-generation programming models and their associated middleware and runtimes have a responsibility to tackle these challenges.
This talk focuses on challenges and opportunities in designing efficient runtimes using a formula (MPI+X) to accelerate applications for emerging high-performance computing (HPC) systems with millions of processors and featuring next-generation interconnects. Energy-aware designs and codesign schemes for such environments are also emphasized. View features and sample performance numbers from the MVAPICH2 libraries.
To hit Ruby 3x3, we must first figure out **what** we're going to measure and **how** we're going to measure it, in order to get what we actually want. I'll cover some standard definitions of benchmarking in dynamic languages, as well as the tradeoffs that must be made when benchmarking. I'll look at some of the possible benchmarks that could be considered for Ruby 3x3, and evaluate them for what they're good for measuring, and what they're less good for measuring, in order to help the Ruby community decide what the 3x goal is going to be measured against.
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ... - Chris Fregly
Chris Fregly, Founder @ PipelineAI, will walk you through a real-world, complete end-to-end Pipeline-optimization example. We highlight hyper-parameters - and model pipeline phases - that have never been exposed until now.
While most hyperparameter optimizers stop at the training phase (e.g., learning rate, tree depth, EC2 instance type), we extend model validation and tuning into a new post-training optimization phase including 8-bit reduced-precision weight quantization and neural network layer fusing - among many other framework- and hardware-specific optimizations.
Next, we introduce hyperparameters at the prediction phase including request-batch sizing and chipset (CPU v. GPU v. TPU).
Lastly, we determine a PipelineAI Efficiency Score of our overall Pipeline including Cost, Accuracy, and Time. We show techniques to maximize this PipelineAI Efficiency Score using our massive PipelineDB along with the Pipeline-wide hyper-parameter tuning techniques mentioned in this talk.
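For readers unfamiliar with the 8-bit reduced-precision weight quantization mentioned above, the following is a minimal sketch of one common variant: symmetric per-tensor quantization to integers in [-127, 127]. It is an assumption-laden illustration in plain Python, not the scheme used by PipelineAI or by any particular framework, and the function names are hypothetical.

```python
def quantize_int8(weights):
    """Map float weights to integers in [-127, 127] using one shared scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Avoid dividing by zero when every weight is 0 (or the list is empty).
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.5, -1.27, 0.0, 1.0]
    quantized, scale = quantize_int8(weights)
    restored = dequantize(quantized, scale)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(quantized, scale, worst)
```

Each weight is stored in one byte instead of four, at the price of a rounding error of at most half the scale per weight. Whether that error is acceptable is an empirical question, which is why it makes sense to treat quantization as something to validate after training.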
Bio
Chris Fregly is Founder and Applied AI Engineer at PipelineAI, a Real-Time Machine Learning and Artificial Intelligence Startup based in San Francisco.
He is also an Apache Spark Contributor, a Netflix Open Source Committer, founder of the Global Advanced Spark and TensorFlow Meetup, and author of the O’Reilly Training and Video Series titled "High Performance TensorFlow in Production with Kubernetes and GPUs."
Previously, Chris was a Distributed Systems Engineer at Netflix, a Data Solutions Engineer at Databricks, and a Founding Member and Principal Engineer at the IBM Spark Technology Center in San Francisco.
LINCX is an OpenFlow switch written in Erlang and running on LING (Erlang on Xen). It shows some remarkable performance. The presentation discusses various speed-related optimizations.
At the recent sold-out Spark & Machine Learning Meetup in Brussels, Holden Karau of the Spark Technology Center delivered a lightning talk called A very brief introduction to extending Spark ML for custom models: Talk + Demo.
Holden took a look at Apache SparkML™ pipelines. Inspired by scikit-learn, they have the potential to make machine learning tasks much easier. This talk looked at how to extend Spark ML with custom model types when the built-in options don't meet your needs.
Introduction to Spark ML Pipelines Workshop - Holden Karau
Introduction to Spark ML Pipelines Workshop slides. Companion Jupyter notebooks in Python & Scala are available from my GitHub at https://github.com/holdenk/spark-intro-ml-pipeline-workshop
Big Data Spain - Nov 17 2016 - Madrid Continuously Deploy Spark ML and Tensor... - Chris Fregly
In this talk, I describe some recent advancements in Streaming ML and AI Pipelines to enable data scientists to rapidly train and test on streaming data - and ultimately deploy models directly into production on their own with low friction and high impact.
With proper tooling and monitoring, data scientists have the freedom and responsibility to experiment rapidly on live, streaming data - and deploy directly into production as often as necessary. I’ll describe this tooling - and demonstrate a real production pipeline using Jupyter Notebook, Docker, Kubernetes, Spark ML, Kafka, TensorFlow, Jenkins, and Netflix Open Source.
Deploy Spark ML and Tensorflow AI Models from Notebooks to Microservices - No... - Chris Fregly
In this completely 100% Open Source demo-based talk, Chris Fregly from PipelineIO will be addressing an area of machine learning and artificial intelligence that is often overlooked: the real-time, end-user-facing "serving" layer in a hybrid-cloud and on-premise deployment environment using Jupyter, NetflixOSS, Docker, and Kubernetes.
Serving models to end-users in real-time in a highly-scalable, fault-tolerant manner requires not only an understanding of machine learning fundamentals, but also an understanding of distributed systems and scalable microservices.
Chris will combine his work experience from both Databricks and Netflix to present a 100% open source, real-world, hybrid-cloud, on-premise, and NetflixOSS-based production-ready environment to serve your notebook-based Spark ML and TensorFlow AI models with highly-scalable and highly-available robustness.
Speaker Bio
Chris Fregly is a Research Scientist at PipelineIO - a Streaming Analytics and Machine Learning Startup in San Francisco.
Chris is an Apache Spark Contributor, Netflix Open Source Committer, Founder of the Global Advanced Spark and TensorFlow Meetup, and Author of the upcoming book, Advanced Spark, and Creator of the upcoming O'Reilly video series, Scaling TensorFlow Distributed in Production.
Previously, Chris was an engineer at Databricks and Netflix - as well as a Founding Member of the IBM Spark Technology Center in San Francisco.
This session talks about how unit testing of Spark applications is done, as well as tells the best way to do it. This includes writing unit tests with and without Spark Testing Base package, which is a spark package containing base classes to use when writing tests with Spark.
Testing and validating distributed systems with Apache Spark and Apache Beam ...Holden Karau
As distributed data parallel systems, like Spark, are used for more mission-critical tasks, it is important to have effective tools for testing and validation. This talk explores the general considerations and challenges of testing systems like Spark through spark-testing-base and other related libraries.
With over 40% of folks automatically deploying the results of their Spark jobs to production, testing is especially important. Many of the tools for working with big data systems (like notebooks) are great for exploratory work, and can give a false sense of security (as well as additional excuses not to test). This talk explores why testing these systems are hard, special considerations for simulating "bad" partioning, figuring out when your stream tests are stopped, and solutions to these challenges.
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. This talk will examine how to debug Apache Spark applications, the different options for logging in PySpark, as well as some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose, and this talk will examine how to effectively search logs from Apache Spark to spot common problems. In addition to the internal logging, this talk will look at options for logging from within our program itself.
Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but this talk will look at how to effectively use Spark’s current accumulators for debugging as well as a look to future for data property type accumulators which may be coming to Spark in future version.
In addition to reading logs, and instrumenting our program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems.
Debuggers are a wonderful tool, however when you have 100 computers the “wonder” can be a bit more like “pain”. This talk will look at how to connect remote debuggers, but also remind you that it’s probably not the easiest path forward.
Scalable Data Science with SparkR: Spark Summit East talk by Felix CheungSpark Summit
R is a very popular platform for Data Science. Apache Spark is a highly scalable data platform. How could we have the best of both worlds? How could a Data Scientist leverage the rich 9000+ packages on CRAN, and integrate Spark into their existing Data Science toolset?
In this talk we will walkthrough many examples how several new features in Apache Spark 2.x will enable this. We will also look at exciting changes in and coming next in Apache Spark 2.x releases.
A fast introduction to PySpark with a quick look at Arrow based UDFsHolden Karau
This talk will introduce Apache Spark (one of the most popular big data tools), the different built ins (from SQL to ML), and, of course, everyone's favorite wordcount example. Once we've got the nice parts out of the way, we'll talk about some of the limitations and the work being undertaken to improve those limitations. We'll also look at the cases where Spark is more like trying to hammer a screw. Since we want to finish on a happy note, we will close out with looking at the new vectorized UDFs in PySpark 2.3.
London Spark Meetup Project Tungsten Oct 12 2015Chris Fregly
Building on a previous talk about how Spark beat Hadoop @ 100TB Daytona GraySort, we present low-level details of Project Tungsten which includes many CPU and Memory optimizations.
Flink Forward SF 2017: Dean Wampler - Streaming Deep Learning Scenarios with...Flink Forward
As a low-latency streaming tool, Flink offers the possibility of using machine learning, even "deep learning" (neural networks), with low latency. The growing FlinkML library provides some of the infrastructure support required for this goal, combined with third-party tools. This talk is a progress report on several scenarios we are developing at Lightbend, which combine Flink, Deeplearning4J, Spark, and Kafka to analyze cluster telemetry for anomaly detection, predictive autoscaling, and other scenarios. I'll focus on the pragmatics of training deep learning models in a streaming context, using batch and mini-batch training, combined with low-latency application of those models. I'll discuss the architecture we're using and highlight trade offs of particular tools for certain design problems in the implementation. I'll discuss the drawbacks and workarounds of our design and finish with a look at how future developments in Flink could improve its support for scenarios like ours.
A Library for Emerging High-Performance Computing ClustersIntel® Software
Deployed next-generation architectures and systems are characterized by high concurrency, low memory per core, and multilevels of hierarchy and heterogeneity. These characteristics bring out new challenges in energy efficiency, fault-tolerance, and scalability. Next-generation programming models and their associated middleware and runtimes have a responsibility to tackle these challenges.
This talk focuses on challenges and opportunities in designing efficient runtimes using a formula (MPI+X) to accelerate applications for emerging high-performance computing (HPC) systems with millions of processors and featuring next-generation interconnects. Energy-aware designs and codesign schemes for such environments are also emphasized. View features and sample performance numbers from the MVAPICH2 libraries.
To hit Ruby3x3, we must first figure out **what** we're going to measure, **how** we're going to measure it, in order to get what we actually want. I'll cover some standard definitions of benchmarking in dynamic languages, as well as the tradeoffs that must be made when benchmarking. I'll look at some of the possible benchmarks that could be considered for Ruby 3x3, and evaluate them for what they're good for measuring, and what they're less good for measuring, in order to help the Ruby community decide what the 3x goal is going to be measured against.
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...Chris Fregly
Chris Fregly, Founder @ PipelineAI, will walk you through a real-world, complete end-to-end Pipeline-optimization example. We highlight hyper-parameters - and model pipeline phases - that have never been exposed until now.
While most Hyperparameter Optimizers stop at the training phase (ie. learning rate, tree depth, ec2 instance type, etc), we extend model validation and tuning into a new post-training optimization phase including 8-bit reduced precision weight quantization and neural network layer fusing - among many other framework and hardware-specific optimizations.
Next, we introduce hyperparameters at the prediction phase including request-batch sizing and chipset (CPU v. GPU v. TPU).
Lastly, we determine a PipelineAI Efficiency Score of our overall Pipeline including Cost, Accuracy, and Time. We show techniques to maximize this PipelineAI Efficiency Score using our massive PipelineDB along with the Pipeline-wide hyper-parameter tuning techniques mentioned in this talk.
Bio
Chris Fregly is Founder and Applied AI Engineer at PipelineAI, a Real-Time Machine Learning and Artificial Intelligence Startup based in San Francisco.
He is also an Apache Spark Contributor, a Netflix Open Source Committer, founder of the Global Advanced Spark and TensorFlow Meetup, and author of the O’Reilly Training and Video Series titled "High Performance TensorFlow in Production with Kubernetes and GPUs."
Previously, Chris was a Distributed Systems Engineer at Netflix, a Data Solutions Engineer at Databricks, and a Founding Member and Principal Engineer at the IBM Spark Technology Center in San Francisco.
LINCX is an OpenFlow switch written in Erlang and running on LING (Erlang on Xen). It shows some remarkable performance. The presentation discusses various speed-related optimizations.
At the recent sold-out Spark & Machine Learning Meetup in Brussels, Holden Karau of the Spark Technology Center delivered a lightning talk called "A very brief introduction to extending Spark ML for custom models: Talk + Demo."
Holden took a look at Apache SparkML™ pipelines. Inspired by scikit-learn, they have the potential to make machine learning tasks much easier. This talk looked at how to extend Spark ML with custom model types when the built-in options don't meet your needs.
Introduction to Spark ML Pipelines Workshop (Holden Karau)
Introduction to Spark ML Pipelines Workshop slides. Companion Jupyter notebooks in Python and Scala are available from my GitHub at https://github.com/holdenk/spark-intro-ml-pipeline-workshop
Big Data Spain - Nov 17 2016 - Madrid Continuously Deploy Spark ML and Tensor... (Chris Fregly)
In this talk, I describe some recent advancements in Streaming ML and AI Pipelines to enable data scientists to rapidly train and test on streaming data - and ultimately deploy models directly into production on their own with low friction and high impact.
With proper tooling and monitoring, data scientists have the freedom and responsibility to experiment rapidly on live, streaming data - and deploy directly into production as often as necessary. I’ll describe this tooling - and demonstrate a real production pipeline using Jupyter Notebook, Docker, Kubernetes, Spark ML, Kafka, TensorFlow, Jenkins, and Netflix Open Source.
Deploy Spark ML and Tensorflow AI Models from Notebooks to Microservices - No... (Chris Fregly)
In this 100% Open Source, demo-based talk, Chris Fregly from PipelineIO will address an area of machine learning and artificial intelligence that is often overlooked: the real-time, end-user-facing "serving" layer in a hybrid-cloud and on-premise deployment environment using Jupyter, NetflixOSS, Docker, and Kubernetes.
Serving models to end-users in real-time in a highly-scalable, fault-tolerant manner requires not only an understanding of machine learning fundamentals, but also an understanding of distributed systems and scalable microservices.
Chris will combine his work experience from both Databricks and Netflix to present a 100% open source, real-world, hybrid-cloud, on-premise, and NetflixOSS-based production-ready environment to serve your notebook-based Spark ML and TensorFlow AI models with highly-scalable and highly-available robustness.
Speaker Bio
Chris Fregly is a Research Scientist at PipelineIO - a Streaming Analytics and Machine Learning Startup in San Francisco.
Chris is an Apache Spark Contributor, Netflix Open Source Committer, Founder of the Global Advanced Spark and TensorFlow Meetup, Author of the upcoming book Advanced Spark, and Creator of the upcoming O'Reilly video series Scaling TensorFlow Distributed in Production.
Previously, Chris was an engineer at Databricks and Netflix - as well as a Founding Member of the IBM Spark Technology Center in San Francisco.
Advanced Spark and Tensorflow Meetup - London - Nov 15, 2016 - Deploy Spark M... (Chris Fregly)
YouTube Video: https://www.youtube.com/watch?v=RnnweVC7wFc
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An... (Chris Fregly)
https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/227622666/
Title: Spark on Kubernetes
Abstract: Engineers across several organizations are working on support for Kubernetes as a cluster scheduler backend within Spark. While designing this, we have encountered several challenges in translating Spark to use idiomatic Kubernetes constructs natively. This talk is about our high level design decisions and the current state of our work.
Speaker:
Anirudh Ramanathan is a software engineer on the Kubernetes team at Google. His focus is on running stateful and batch workloads. Previously, he worked on GGC (Google Global Cache) and prior to that, on the infrastructure team at NVIDIA.
Advanced Spark and TensorFlow Meetup 08-04-2016 One Click Spark ML Pipeline D... (Chris Fregly)
Empowering the Data Scientist with "1-Click" Production Deployment and Canary Testing of High-Performance and Highly-Scalable Spark ML and TensorFlow Models directly from Jupyter/iPython Notebooks using Docker, Kubernetes, Netflix OSS, Microservices, and Spinnaker.
With proper tooling and metrics, Data Scientists can directly deploy, analyze, A/B test, roll back, and scale out their Spark ML and TensorFlow models in live production serving with zero friction.
We will show you the open source tools that we've built based on Docker, Kubernetes, Netflix Open Source, Microservices, Spinnaker - and even Chaos Monkey!
Speaker: Chris Fregly @ PipelineIO, formerly Databricks and Netflix
Feature Talk: Real-time Aggregations, Approximations, Similarities, and Recommendations at Scale using Spark Streaming, ML, GraphX, Kafka, Cassandra, Docker, CoreNLP, Word2Vec, LDA, and Twitter Algebird
Talk Abstract: Starting with a live, interactive demo generating audience-specific recommendations, we'll dive deep into each of the key components including NiFi, Kafka, Stanford CoreNLP, Docker, Word2Vec, LDA, Twitter Algebird, Spark Streaming, SQL, ML, GraphX. As a bonus, we'll discuss the latest Netflix Recommendations Pipeline and related open source projects.
Talk Agenda:
• Intro
• Live, Interactive Recommendations Demo
• Spark Streaming, ML, GraphX, Kafka, Cassandra, Docker, CoreNLP, Word2Vec, LDA, and Twitter Algebird (advancedspark.com)
• Types of Similarity
• Euclidean vs. Non-Euclidean Similarity
• Jaccard Similarity
• Cosine Similarity
• LogLikelihood Similarity
• Edit Distance
• Text-based Similarities and Analytics
• Word2Vec
• LDA Topic Extraction
• TextRank
• Similarity-based Recommendations
• User-to-User
• Content-based, Item-to-Item (Amazon)
• Collaborative-based, User-to-Item (Netflix)
• Graph-based, Item-to-Item "Pathways" (Spotify)
• Aggregations, Approximations, and Similarities at Scale
• Twitter Algebird
• MinHash and Bucketing
• Locality Sensitive Hashing (LSH)
• BloomFilters
• CountMin Sketch
• HyperLogLog
• Q & A
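As a rough illustration of the approximation structures listed in the agenda, the following sketch implements a tiny Count-Min Sketch in plain Python. It is not Twitter Algebird's implementation (Algebird is a Scala library); the class name, the default `width` and `depth`, and the use of SHA-256 as the hash family are assumptions made for this example.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter that uses a fixed amount of memory."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, item):
        # One salted hash per row selects the column to update or read.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._columns(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions can only inflate a cell, so the minimum is the best guess.
        return min(self.table[row][col] for row, col in self._columns(item))


sketch = CountMinSketch()
for word in ["spark"] * 5 + ["kafka"] * 3 + ["cassandra"]:
    sketch.add(word)
```

The estimate never undercounts an item; it can only overcount when other items collide with it in every row. A wider table lowers that risk at the cost of memory.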
Speaker Bio: Chris Fregly is a Research Engineer @ Flux Capacitor AI in SF, an Apache Spark Contributor, and a Netflix Open Source Committer.
Chris is also the founder of the global Advanced Apache Spark Meetup and author of the upcoming book, Advanced Spark @ advancedspark.com.
Previously, Chris was a Data Solutions Engineer at Databricks and a Streaming Data Engineer at Netflix.
Gradient Descent, Back Propagation, and Auto Differentiation - Advanced Spark... (Chris Fregly)
Advanced Spark and TensorFlow Meetup 08-04-2016
Fundamental Algorithms of Neural Networks including Gradient Descent, Back Propagation, Auto Differentiation, Partial Derivatives, Chain Rule
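A minimal sketch of how gradient descent, partial derivatives, and the chain rule fit together, using a single linear neuron and mean squared error. The data, learning rate, and epoch count are assumptions chosen for illustration. The gradients here are derived by hand; automatic differentiation frameworks derive the same expressions mechanically.

```python
def train(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit pred = w * x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(xs, ys):
            error = (w * x + b) - y
            # Chain rule: dL/dw = dL/dpred * dpred/dw = (2 * error) * x
            grad_w += 2 * error * x / n
            # Chain rule: dL/db = dL/dpred * dpred/db = (2 * error) * 1
            grad_b += 2 * error / n
        # Step against the gradient to reduce the loss.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Synthetic data generated from y = 2x + 1, so the target parameters are known.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]
w, b = train(xs, ys)
```

Back propagation applies the same chain-rule step layer by layer, reusing each layer's intermediate derivative when computing the one before it.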
Title:
Real-time, Advanced Analytics and Recommendations using Machine Learning, Natural Language Processing, Graph Processing, and Approximations with Apache Spark, Stanford CoreNLP, and Twitter Algebird
Agenda
Intro
Live, Interactive Recommendations Demo
Spark ML, GraphX, Streaming, Kafka, Cassandra, Docker
Types of Similarity
Euclidean vs. Non-Euclidean Similarity
User-to-User Similarity
Content-based, Item-to-Item Similarity (Amazon)
Collaborative-based, User-to-Item Similarity (Netflix)
Graph-based, Item-to-Item Similarity Pathway (Spotify)
Similarity Approximations at Scale
Twitter Algebird
MinHash and Bucketing
Locality Sensitive Hashing (LSH)
Netflix Recommendations: From Ratings to Real-Time
DVD-Ratings-based $1M Netflix Prize (2009)
Streaming-based "Trending Now" (2016)
Wrap Up
Q & A
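The MinHash item in the agenda can be sketched as follows: the fraction of hash functions on which two sets share the same minimum value approximates their Jaccard similarity. This is an illustration in plain Python, not the Algebird API; the helper names, the number of hash functions, and the sample title sets are hypothetical.

```python
import hashlib


def _hash(seed, token):
    """One member of a salted hash family, built from SHA-256."""
    digest = hashlib.sha256(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens, num_hashes=128):
    """Keep only the smallest hash value of the set under each hash function."""
    return [min(_hash(seed, t) for t in tokens) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def exact_jaccard(a, b):
    return len(a & b) / len(a | b)


# Hypothetical viewing histories, made up for this example.
liked_by_alice = {"stranger things", "narcos", "the crown", "black mirror"}
liked_by_bob = {"stranger things", "narcos", "the crown", "ozark"}

sig_alice = minhash_signature(liked_by_alice)
sig_bob = minhash_signature(liked_by_bob)
estimate = estimated_jaccard(sig_alice, sig_bob)
exact = exact_jaccard(liked_by_alice, liked_by_bob)
```

The signatures have a fixed length regardless of set size, which is what makes comparisons cheap at scale. Bucketing and LSH then group signatures that agree on bands of positions, so only likely-similar pairs are compared at all.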
*Bio*
Chris Fregly is a Principal Data Solutions Engineer for the newly-formed IBM Spark Technology Center, an Apache Spark Contributor, and a Netflix Open Source Committer. Chris is also the founder of the global Advanced Apache Spark Meetup and author of the upcoming book, Advanced Spark @ advancedspark.com. Previously, Chris was a Data Solutions Engineer at Databricks and a Streaming Data Engineer at Netflix.
*Related Links*
https://github.com/fluxcapacitor/pipeline/wiki
http://cdn.oreillystatic.com/en/assets/1/event/105/Algebra%20for%20Scalable%20Analytics%20Presentation.pdf
http://static.echonest.com/BoilTheFrog/
http://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf
http://blog.echen.me/2011/10/24/winning-the-netflix-prize-a-summary/
http://www.cc.gatech.edu/~zha/CSE8801/CF/kdd-fp074-koren.pdf
Scotland Data Science Meetup Oct 13, 2015: Spark SQL, DataFrames, Catalyst, ... (Chris Fregly)
This talk highlights the Data Sources API, which participates in the Spark SQL DataFrame Catalyst Optimizer. We dive deep into the advanced open source Cassandra implementation at github.com/datastax/spark-cassandra-connector. We discuss data locality and cluster deployment, as well as the pros and cons of mixing OLAP and OLTP workloads.
We also implement a SimpleDataSource, which is a basic implementation of the Data Sources API.
All analysis is done with Apache Zeppelin.
Demi Ben Ari - Apache Spark 101 - First Steps into distributed computing:
The world has changed: one huge server won’t do the job anymore, and the ability to scale out will be your savior. Apache Spark is a fast and general engine for big data processing, with streaming, SQL, machine learning, and graph processing. This talk shows the basics of Apache Spark and distributed computing.
Demi is a software engineer, entrepreneur, and international tech speaker.
Demi has over 10 years of experience building systems in both near-real-time applications and Big Data distributed systems.
Demi is a co-founder of the “Big Things” Big Data community and Google Developer Group Cloud.
Demi is a Big Data expert but is interested in all kinds of technologies, from front-end to back-end: whatever moves data around.
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies (Max Irwin)
Presentation as given to the Haystack Conference, which outlines research and techniques for automatic extraction of keywords, concepts, and vocabularies from text corpora.
This contains the agenda of the Spark Meetup I organised in Bangalore on Friday, the 23rd of Jan 2014. It carries the slides for the talk I gave on distributed deep learning over Spark.
Building and deploying LLM applications with Apache Airflow (Kaxil Naik)
Behind the growing interest in Generative AI and LLM-based enterprise applications lies an expanded set of requirements for data integrations and ML orchestration. Enterprises want to use proprietary data to power LLM-based applications that create new business value, but they face challenges in moving beyond experimentation. The pipelines that power these models need to run reliably at scale, bringing together data from many sources and reacting continuously to changing conditions.
This talk focuses on the design patterns for using Apache Airflow to support LLM applications created using private enterprise data. We’ll go through a real-world example of what this looks like, as well as a proposal to improve Airflow and to add additional Airflow Providers to make it easier to interact with LLMs such as the ones from OpenAI (such as GPT4) and the ones on HuggingFace, while working with both structured and unstructured data.
In short, this shows how these Airflow patterns enable reliable, traceable, and scalable LLM applications within the enterprise.
https://airflowsummit.org/sessions/2023/keynote-llm/
Introduction to Apache Flink - Fast and reliable big data processing (Till Rohrmann)
This presentation introduces Apache Flink, a massively parallel data processing engine that is currently undergoing incubation at the Apache Software Foundation. It presents Flink's programming primitives and shows how easily a distributed PageRank algorithm can be implemented with Flink. Intriguing features such as dedicated memory management, Hadoop compatibility, streaming, and automatic optimisation make it a unique system in the world of Big Data processing.
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdf (Sease)
If you want to expand your queries or documents with synonyms in Apache Lucene, you need a predefined file containing the list of terms that share the same meaning. It’s not always easy to find a list of basic synonyms for a language and, even if you find one, it doesn’t necessarily match your contextual domain.
The term “daemon” in the domain of operating system articles is not a synonym of “devil” but it’s closer to the term “process”.
Word2Vec is a two-layer neural network that takes as input a text and outputs a vector representation for each word in the dictionary. Two words with similar meanings are identified with two vectors close to each other.
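To show what "vectors close to each other" means in practice, here is a small sketch that ranks candidate synonyms by cosine similarity. The three-dimensional vectors are made up for illustration and do not come from a trained Word2Vec model; real embeddings usually have hundreds of dimensions, and the helper names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings for an operating-systems corpus (not trained values).
vectors = {
    "daemon": [0.9, 0.1, 0.2],
    "process": [0.8, 0.2, 0.1],
    "devil": [0.1, 0.9, 0.3],
}


def nearest(word):
    """Return the other word whose vector is closest to the given word's."""
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine_similarity(vectors[word], vectors[w]))


closest_to_daemon = nearest("daemon")
```

With these toy values, "process" ranks above "devil" for "daemon", mirroring the domain-specific behavior described above. A model trained on a different corpus could produce the opposite ranking.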
Code as Data workshop: Using source{d} Engine to extract insights from git re... (source{d})
This workshop will teach you basic git concepts (such as references, commits, and blobs) and how they can be mapped into a series of relational tables.
Once we understand the basic concepts, we will show how language classification and program parsing are available as SQL custom functions, how to use them correctly, and how to obtain aggregate results with `GROUP BY` and friends. We will discuss Universal Abstract Syntax Trees and how some advanced checks can be done on top of this language-agnostic structure. Running these checks at scale requires some extra knowledge, and we’ll discuss the challenges and their possible solutions.
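The idea of exposing language classification as a SQL custom function and aggregating with `GROUP BY` can be sketched with Python's built-in `sqlite3` module. This is not the source{d} Engine API: the `files` table, the `language` function, and the extension-based classifier are hypothetical stand-ins chosen so the example runs anywhere.

```python
import sqlite3

# Naive stand-in for a real language classifier: look at the file extension.
EXTENSION_TO_LANGUAGE = {".py": "Python", ".go": "Go", ".rb": "Ruby"}


def language(path):
    for extension, name in EXTENSION_TO_LANGUAGE.items():
        if path.endswith(extension):
            return name
    return "Other"


conn = sqlite3.connect(":memory:")
# Register the Python function so SQL queries can call language(path).
conn.create_function("language", 1, language)

conn.execute("CREATE TABLE files (commit_hash TEXT, path TEXT)")
conn.executemany(
    "INSERT INTO files VALUES (?, ?)",
    [("a1", "main.go"), ("a1", "util.go"), ("b2", "app.py"), ("b2", "README.md")],
)

# Aggregate the custom function's result like any other column.
rows = conn.execute(
    "SELECT language(path) AS lang, COUNT(*) FROM files GROUP BY lang ORDER BY lang"
).fetchall()
```

Once the function is registered, it composes with ordinary SQL, so counting files per language is a single query rather than a custom script.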
To finish, we will also discuss how the information in git repositories encodes a form of social network which can be used to better understand the engineering processes of a given organization.
Sjug #26 ml is in java but is dl too - ver1.04 - tomasz sikora 2018-03-23 (Tomasz Sikora)
Vint Cerf, recognised as one of "the fathers of the Internet" as co-inventor of TCP/IP, said: "And programming computers was so fascinating. You create your own little universe, and then it does what you tell it to do." Now that computers are learning, these words give developers new ground for debate.
So what can we do from a good old Software Craftsmanship perspective? As a developer who still believes in Java, can I use my beloved platform? How can I tackle problems requiring deeper models? In this session, Tomasz tried to answer those questions, opening hundreds more ;). He configured and ran a few models, trying to explore the field from as practical a perspective as possible.
For more, visit https://twitter.com/tomaszsikora and http://silesia.jug.pl/
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa... (Helena Edelson)
O'Reilly webcast with Evan Chan and me on the new SNACK Stack (a play on SMACK) with FiloDB: Scala, Spark Streaming, Akka, Cassandra, FiloDB and Kafka.
Language-agnostic data analysis workflows and reproducible research (Andrew Lowe)
This was a talk that I gave at CERN at the Inter-experimental Machine Learning (IML) Working Group Meeting in April 2017 about language-agnostic (or polyglot) analysis workflows. I show how it is possible to work in multiple languages and switch between them without leaving the workflow you started. Additionally, I demonstrate how an entire workflow can be encapsulated in a markdown file that is rendered to a publishable paper with cross-references and a bibliography (and with a raw LaTeX file produced as a by-product) in a simple process, making the whole analysis workflow reproducible. For experimental particle physics, ROOT is the ubiquitous data analysis tool, and has been for the last 20 years, so I also talk about how to exchange data to and from ROOT.
Overview of the SPARQL-Generate language and latest developments (Maxime Lefrançois)
SPARQL-Generate is an extension of SPARQL 1.1 for querying not only RDF datasets but also documents in arbitrary formats. The solution bindings can then be used to output RDF (SPARQL-Generate) or text (SPARQL-Template).
Anyone familiar with SPARQL can easily learn SPARQL-Generate, and learning SPARQL-Generate helps you learn SPARQL.
The open-source implementation (Apache 2 license) is based on Apache Jena and can be used to execute transformations from a combination of RDF and any kind of documents in XML, JSON, CSV, HTML, GeoJSON, CBOR, streams of messages using WebSocket or MQTT... (easily extensible)
Recent extensions and improvements include:
- heavy refactoring to support parallelization
- more expressive iterators and functions
- simple generation of RDF lists
- support of aggregates
- generation of HDT (thanks Ana for the use case)
- partial implementation of STTL for the generation of Text (https://ns.inria.fr/sparql-template/)
- partial implementation of LDScript (http://ns.inria.fr/sparql-extension/)
- integration of all these types of rules to decouple or compose queries, e.g.:
  - call a SPARQL-Generate query in the SPARQL FROM clause
  - plug a SPARQL-Generate or a SPARQL-Template query to the output of a SPARQL-Select function
- a Sublime Text package for local development
One of the biggest problems of software projects is that, while the practice of software development is commonly thought of as engineering, it is inherently a creative discipline; hence, many things about it are hard to measure. While simple yardsticks like test coverage and cyclomatic complexity are important for code quality, what other metrics can we apply to answer questions about our code? What coding conventions or development practices can we implement to make our code easier to measure? We'll take a tour through some processes and tools you can implement to begin improving code quality in your team or organization, and see what a difference it makes to long-term project maintainability. More importantly, we'll look at how we can move beyond today's tools to answer higher-level questions of code quality. Can 'good code' be quantified?
Context-aware Fast Food Recommendation with Ray on Apache Spark at Burger King (Databricks)
For fast food recommendation use cases, user behavior sequences and context features (such as time, weather, and location) are both important factors to take into consideration. At Burger King, we have developed a new state-of-the-art recommendation model called Transformer Cross Transformer (TxT). It applies Transformer encoders to capture both user behavior sequences and complicated context features, and combines both transformers through the latent cross for joint context-aware fast food recommendations. Online A/B tests show not only that TxT outperforms existing methods but also that it can be successfully applied to other fast food recommendation use cases outside of Burger King.
Similar to Atlanta MLconf Machine Learning Conference 09-23-2016 (20)
Pandas on AWS - Let me count the ways.pdf (Chris Fregly)
Chris Fregly (Principal Solution Architect, AI and machine learning at AWS) will give a brief presentation on the various ways to run scalable Pandas, Modin, and Ray on AWS. He will then answer questions from the audience and from the moderator, Alejandro Herrera of Ponder.
Chris Fregly is a Principal Solution Architect for AI and Machine Learning at Amazon Web Services (AWS) based in San Francisco, California. He is the organizer of the Global Data Science on AWS meetup. He is co-author of the O'Reilly Book, "Data Science on AWS."
Related Links
O'Reilly Book: https://www.amazon.com/dp/1492079391/
Website: https://datascienceonaws.com
Meetup: https://meetup.datascienceonaws.com
GitHub Repo: https://github.com/data-science-on-aws/
YouTube: https://youtube.datascienceonaws.com
Slideshare: https://slideshare.datascienceonaws.com
Ray AI Runtime (AIR) on AWS - Data Science On AWS Meetup (Chris Fregly)
RSVP Webinar: https://www.eventbrite.com/e/webinarkubeflow-tensorflow-tfx-pytorch-gpu-spark-ml-amazonsagemaker-tickets-45852865154
Talk #0: Introductions and Meetup Announcements By Chris Fregly and Antje Barth
Talk #1: Ray Overview, Ray AI Runtime on AWS using Amazon SageMaker, EC2, EMR, EKS by Chris Fregly, Principal Specialist Solution Architect, AI and Machine Learning @ AWS
Talk #2: Deep-dive Blueprints for Amazon Elastic Kubernetes Service (EKS) including Ray and Spark by Apoorva Kulkarni, Sr. Specialist Solution Architect, Containers and Kubernetes @ AWS
Zoom link: https://us02web.zoom.us/j/82308186562
Amazon re:Invent 2020 Recap: AI and Machine Learning (Chris Fregly)
Video here: https://youtu.be/YSXe02Y5pHM
NEW RELEASE! Build, Automate, Manage, and Scale ML Workflows with the NEW Amazon SageMaker Pipelines by Hallie Crosby Weishahn.
Description of Talk and Demo
AWS recently announced Amazon SageMaker Pipelines (https://aws.amazon.com/sagemaker/pipelines/), the first purpose-built, easy-to-use Continuous Integration and Continuous Delivery (CI/CD) service for machine learning.
SageMaker Pipelines has three main components which improve the operational resilience and reproducibility of your workflows: 1) pipelines, 2) model registry, and 3) projects.
In this talk and demo, Hallie will walk us through the new Amazon SageMaker Pipelines feature including MLOps support.
Date/Time
9-10am US Pacific Time (Third Monday of Every Month)
RSVP: https://www.eventbrite.com/e/1-hr-free-workshop-pipelineai-gpu-tpu-spark-ml-tensorflow-ai-kubernetes-kafka-scikit-tickets-45852865154
Meetup:
https://www.meetup.com/Data-Science-on-AWS/
Zoom:
https://zoom.us/j/690414331
Webinar ID: 690 414 331
Phone:
+1 646 558 8656 (US Toll) or +1 408 638 0968 (US Toll)
Related Links
Meetup: https://meetup.datascienceonaws.com
GitHub Repo: https://github.com/data-science-on-aws/
O'Reilly Book: https://datascienceonaws.com
YouTube: https://youtube.datascienceonaws.com
Slideshare: https://slideshare.datascienceonaws.com
Support: https://support.pipeline.ai
Monthly Workshop: https://www.eventbrite.com/e/full-day-workshop-kubeflow-gpu-kerastensorflow-20-tf-extended-tfx-kubernetes-pytorch-xgboost-tickets-63362929227
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod... (Chris Fregly)
Waking the Data Scientist at 2am:
Detect Model Degradation on Production Models with Amazon SageMaker Endpoints & Model Monitor
In this talk, I describe how to deploy a model into production and monitor its performance using SageMaker Model Monitor. With Model Monitor, I can detect if a model's predictive performance has degraded - and alert an on-call data scientist to take action and improve the model at 2am while the DevOps folks sleep soundly through the night.
Topics: AI and Machine Learning, Model Deployment, Anomaly Detection, Amazon SageMaker Endpoints, and Model Monitor
Quantum Computing with Amazon Braket
In this talk, I describe some fundamental principles of quantum computing including qubits, superposition, and entanglement. I will demonstrate how to perform secure quantum computing tasks across many Quantum Processing Units (QPUs) using Amazon Braket, IAM, and S3.
Topics: AI and Machine Learning, Quantum Computing, Amazon Braket, QPU
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person (Chris Fregly)
In this talk, we present tips and best practices for scaling a large workshop to thousands of simultaneous attendees - both online and in-person. While our workshop is focused on AI and machine learning on AWS, we generalize our learnings for any domain or specialization.
Video: https://youtu.be/T0L0JxDaPkc
RSVP Here: https://www.eventbrite.com/e/full-day-workshop-kubeflow-kerastensorflow-20-tf-extended-tfx-kubernetes-pytorch-xgboost-airflow-tickets-63362929227
Description
In this workshop, we build real-world machine learning pipelines using TensorFlow Extended (TFX), KubeFlow, Airflow, and MLflow.
Described in the 2017 paper, TFX is used internally by thousands of Google data scientists and engineers across every major product line within Google.
KubeFlow is a modern, end-to-end pipeline orchestration framework that embraces the latest AI best practices including hyper-parameter tuning, distributed model training, and model tracking.
Airflow is the most widely used pipeline orchestration framework in machine learning and data engineering.
MLflow is a lightweight experiment-tracking system recently open-sourced by Databricks, the creators of Apache Spark. MLflow supports Python, Java/Scala, and R - and offers native support for TensorFlow, Keras, and Scikit-Learn.
Pre-requisites
Modern browser - and that's it!
Every attendee will receive a cloud instance
Nothing will be installed on your local laptop
Everything can be downloaded at the end of the workshop
Location
Online Workshop
The link will be sent a few hours before the start of the workshop.
Only registered users will receive the link.
If you do not receive the link a few hours before the start of the workshop, please send your Eventbrite registration confirmation to support@pipeline.ai for help.
Agenda
1. Create a Kubernetes cluster
2. Install KubeFlow, Airflow, TFX, and Jupyter
3. Set Up ML Training Pipelines with KubeFlow and Airflow
4. Transform Data with TFX Transform
5. Validate Training Data with TFX Data Validation
6. Train Models with Jupyter, Keras/TensorFlow 2.0, PyTorch, XGBoost, and KubeFlow
7. Run a Notebook Directly on Kubernetes Cluster with KubeFlow
8. Analyze Models using TFX Model Analysis and Jupyter
9. Perform Hyper-Parameter Tuning with KubeFlow
10. Select the Best Model using KubeFlow Experiment Tracking
11. Run Multiple Experiments with MLflow Experiment Tracking
12. Reproduce Model Training with TFX Metadata Store
13. Deploy the Model to Production with TensorFlow Serving and Istio
14. Save and Download your Workspace
Key Takeaways
Attendees will gain experience training, analyzing, and serving real-world Keras/TensorFlow 2.0 models in production using model frameworks and open-source tools.
Title
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTorch + XGBoost + Airflow + MLflow + Spark + Jupyter + TPU
Video
https://youtu.be/vaB4IM6ySD0
Description
In this workshop, we build real-world machine learning pipelines using TensorFlow Extended (TFX), KubeFlow, and Airflow.
Described in the 2017 paper, TFX is used internally by thousands of Google data scientists and engineers across every major product line within Google.
KubeFlow is a modern, end-to-end pipeline orchestration framework that embraces the latest AI best practices including hyper-parameter tuning, distributed model training, and model tracking.
Airflow is the most widely used pipeline orchestration framework in machine learning.
Pre-requisites
Modern browser - and that's it!
Every attendee will receive a cloud instance
Nothing will be installed on your local laptop
Everything can be downloaded at the end of the workshop
Location
Online Workshop
Agenda
1. Create a Kubernetes cluster
2. Install KubeFlow, Airflow, TFX, and Jupyter
3. Set Up ML Training Pipelines with KubeFlow and Airflow
4. Transform Data with TFX Transform
5. Validate Training Data with TFX Data Validation
6. Train Models with Jupyter, Keras/TensorFlow 2.0, PyTorch, XGBoost, and KubeFlow
7. Run a Notebook Directly on Kubernetes Cluster with KubeFlow
8. Analyze Models using TFX Model Analysis and Jupyter
9. Perform Hyper-Parameter Tuning with KubeFlow
10. Select the Best Model using KubeFlow Experiment Tracking
11. Reproduce Model Training with TFX Metadata Store and Pachyderm
12. Deploy the Model to Production with TensorFlow Serving and Istio
13. Save and Download your Workspace
Key Takeaways
Attendees will gain experience training, analyzing, and serving real-world Keras/TensorFlow 2.0 models in production using model frameworks and open-source tools.
Related Links
1. PipelineAI Home: https://pipeline.ai
2. PipelineAI Community Edition: http://community.pipeline.ai
3. PipelineAI GitHub: https://github.com/PipelineAI/pipeline
4. Advanced Spark and TensorFlow Meetup (SF-based, Global Reach): https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup
5. YouTube Videos: https://youtube.pipeline.ai
6. SlideShare Presentations: https://slideshare.pipeline.ai
7. Slack Support: https://joinslack.pipeline.ai
8. Web Support and Knowledge Base: https://support.pipeline.ai
9. Email Support: support@pipeline.ai
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...Chris Fregly
Traditional machine learning pipelines end with lifeless models sitting on disk in the research lab. These traditional models are typically trained on stale, offline, historical batch data. Static models and stale data are not sufficient to power today's modern, AI-first enterprises that require continuous model training, continuous model optimizations, and lightning-fast model experiments directly in production. Through a series of open source, hands-on demos and exercises, we will use PipelineAI to breathe life into these models using 4 new techniques that we’ve pioneered:
* Continuous Validation (V)
* Continuous Optimizing (O)
* Continuous Training (T)
* Continuous Explainability (E).
The Continuous "VOTE" techniques have proven to maximize pipeline efficiency, minimize pipeline costs, and increase pipeline insight at every stage from continuous model training (offline) to live model serving (online).
Attendees will learn to create continuous machine learning pipelines in production with PipelineAI, TensorFlow, and Kafka.
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...Chris Fregly
Perform Online Predictions using Slack
A/B and multi-armed bandit model compare
Train Online Models with Kafka Streams
Create new models quickly
Deploy to production safely
Mirror traffic to validate online performance
Any Framework, Any Hardware, Any Cloud
Dashboard to manage the lifecycle of models from local development to live production
Generates optimized runtimes for the models
Custom targeting rules, shadow mode, and percentage-based rollouts to safely test features in live production
Continuous model training, model validation, and pipeline optimization
https://youtu.be/zpkH9oiIovU
https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/258276286/
Related Links
PipelineAI Home: https://pipeline.ai
PipelineAI Community Edition: https://community.pipeline.ai
PipelineAI GitHub: https://github.com/PipelineAI/pipeline
PipelineAI Quick Start: https://quickstart.pipeline.ai
Advanced Spark and TensorFlow Meetup (SF-based, Global Reach): https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup
YouTube Videos: https://youtube.pipeline.ai
SlideShare Presentations: https://slideshare.pipeline.ai
Slack Support: https://joinslack.pipeline.ai
Web Support and Knowledge Base: https://support.pipeline.ai
Email Support: help@pipeline.ai
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...Chris Fregly
https://pipeline.ai
With PipelineAI, You Can…
* Generate Hardware-Specific Model Optimizations
* Deploy and Compare Models in Live Production
* Optimize Complete AI Pipeline Across Many Models
* Hyper-Parameter Tune Both Training & Predicting Phases
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...Chris Fregly
https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/244971261/
Based on this blog post: https://mengdong.github.io/2017/07/15/distributed-tensorflow-with-gpu-on-kubernetes-and-mapr/
YouTube video: https://www.youtube.com/watch?v=3phz1_B-rR4
http://pipeline.ai
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...Chris Fregly
Online Workshop
Note: A GPU-based cloud instance will be provided to each attendee for the duration of this event!!
At 8am PT on the morning of this workshop, we will email the Webinar details to your email address registered with Eventbrite.
If this email address is not up to date - or you do not get the email by 8am PT - please email your Eventbrite confirmation to help@pipeline.ai and we'll send you the details.
http://pipeline.ai
Title
PipelineAI Distributed Spark ML + Tensorflow AI + GPU Workshop
Time
Start: 9am PT
End: 1pm PT
Highlights
We will each build an end-to-end, continuous Tensorflow AI model training and deployment pipeline on our own GPU-based cloud instance.
At the end, we will combine our cloud instances to create the LARGEST Distributed Tensorflow AI Training and Serving Cluster in the WORLD!
Pre-requisites
Just a modern browser, internet connection, and a good night's sleep! We'll provide the rest.
Agenda
Spark ML
TensorFlow AI
Storing and Serving Models with HDFS
Trade-offs of CPU vs. GPU, Scale Up vs. Scale Out
CUDA + cuDNN GPU Development Overview
TensorFlow Model Checkpointing, Saving, Exporting, and Importing
Distributed TensorFlow AI Model Training (Distributed Tensorflow)
TensorFlow's Accelerated Linear Algebra Framework (XLA)
TensorFlow's Just-in-Time (JIT) Compiler, Ahead of Time (AOT) Compiler
Centralized Logging and Visualizing of Distributed TensorFlow Training (Tensorboard)
Distributed Tensorflow AI Model Serving/Predicting (TensorFlow Serving)
Centralized Logging and Metrics Collection (Prometheus, Grafana)
Continuous TensorFlow AI Model Deployment (TensorFlow, Airflow)
Hybrid Cross-Cloud and On-Premise Deployments (Kubernetes)
High-Performance and Fault-Tolerant Micro-services (NetflixOSS)
More Info including GitHub and Docker Repos
http://pipeline.ai
2. Who am I?
Chris Fregly, Research Scientist @ PipelineIO, San Francisco
Previously, Engineer @ Netflix, Databricks, and IBM Spark
Contributor @ Apache Spark, Committer @ Netflix OSS
Founder @ Advanced Spark and TensorFlow Meetup
Author @ Advanced Spark (advancedspark.com)
7. Confession #1
I Failed Linguistics in College!
Chose Pass/Fail Option
(90 (mid-term) + 70 (final)) / 2 = 80 = C+
How did a C+ turn into an F?
ZER0 (0) CLASS PARTICIPATION?!
8. Confession #2
I Hated Statistics in College
2 Degrees: Mechanical + Manufacturing Engg
Approximations were Bad!
I Wasn’t a Fluffy Physics Major
Though, I Kinda Wish I Was!
9. Wait… Please Don’t Leave!
I’m Older and Wiser Now
Approximate is the New Exact
Computational Linguistics and NLP are My Jam!
11. What is Tensorflow?
General Purpose Numerical Computation Engine
Happens to be good for neural nets!
Tooling
Tensorboard (port 6006 == `goog`)
DAG-based like Spark!
Computation graph is logical plan
Stored in Protobufs
TF converts logical -> physical plan
Lots of Libraries
TFLearn (Tensorflow’s Scikit-learn Impl)
Tensorflow Serving (Prediction Layer)
Distributed and GPU-Optimized
12. What are Neural Networks?
Like All ML, Goal is to Minimize Loss (Error)
Error relative to known outcome of labeled data
Mostly Supervised Learning Classification
Labeled training data
Training Steps
Step 1: Randomly Guess Input Weights
Step 2: Calculate Error Against Labeled Data
Step 3: Determine Gradient Value, +/- Direction
Step 4: Back-propagate Gradient to Update Each Input Weight
Step 5: Repeat from Step 2 with New Weights until Convergence
Activation
Function
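The training steps above can be sketched on the smallest possible case: a single linear neuron with one weight and one bias, trained by gradient descent on mean squared error. This is an illustrative sketch, not code from the talk. The function name `train_linear_neuron`, the learning rate, the epoch count, and the toy data are all assumptions, and with no hidden layers the "back-propagate" step reduces to a direct weight update.

```python
import random


def train_linear_neuron(samples, learning_rate=0.05, epochs=2000, seed=0):
    """Fit y ~ w * x + b by gradient descent on mean squared error."""
    rng = random.Random(seed)
    # Step 1: randomly guess the weights.
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    n = len(samples)
    for _ in range(epochs):
        # Step 2: calculate the error against the labeled data.
        errors = [(w * x + b) - y for x, y in samples]
        # Step 3: gradient of the mean squared error (value and sign).
        grad_w = 2 * sum(e * x for e, (x, _) in zip(errors, samples)) / n
        grad_b = 2 * sum(errors) / n
        # Step 4: move each weight against its gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        # Step 5: the loop repeats from Step 2 with the updated weights.
    return w, b


# Hypothetical labeled data generated from y = 2x + 1.
data = [(x, 2 * x + 1) for x in [0.0, 1.0, 2.0, 3.0]]
w, b = train_linear_neuron(data)
```

A real network repeats the same loop, with the chain rule carrying the gradient back through each layer.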
13. Activation Functions
Goal: Learn and Train a Model on Input Data
Non-Linear Functions
Find Non-Linear Fit of Input Data
Common Activation Functions
Sigmoid Function (sigmoid)
Output range: (0, 1)
Hyperbolic Tangent (tanh)
Output range: (-1, 1)
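As a small illustration of the two activation functions named above, here is a sketch using only the Python standard library. The branch in `sigmoid` is an added safeguard against overflow for large negative inputs and is not something the slide discusses.

```python
import math


def sigmoid(x):
    """Logistic sigmoid: squashes any real input into the open interval (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, use the equivalent form that avoids a huge exp(-x).
    z = math.exp(x)
    return z / (1.0 + z)


def tanh(x):
    """Hyperbolic tangent: squashes any real input into (-1, 1)."""
    return math.tanh(x)


inputs = [-5.0, -1.0, 0.0, 1.0, 5.0]
sigmoid_outputs = [sigmoid(x) for x in inputs]
tanh_outputs = [tanh(x) for x in inputs]
```

Both curves are non-linear and bounded. Sigmoid is centered at 0.5 and tanh at 0.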
17. Convolutional Neural Networks
Feed-forward
Do not form a cycle
Apply Many Layers (aka. Filters) to Input
Each Layer/Filter Picks up on Features
Features not necessarily human-grokkable
Examples of Human-grokkable Filters
3 color filters: RGB
Moving AVG for time series
Brute Force
Try Diff numLayers & layerSizes
18. CNN Use Case: Stitch Fix
Stitch Fix Also Uses NLP to Analyze Return/Reject Comments
Stitch Fix Strata Conf SF 2016:
Using Deep Learning to Create New Clothing Styles!
19. Recurrent Neural Networks
Forms a Cycle (vs. Feed-forward)
Maintains State over Time
Keep track of context
Learns sequential patterns
Decay over time
Use Cases
Speech
Text/NLP Prediction
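To make "maintains state over time" concrete, here is a minimal scalar sketch of a vanilla RNN update, h_t = tanh(w_x * x_t + w_h * h_prev). The weights `w_x` and `w_h` are fixed, made-up values for illustration only. A real RNN learns them and uses vectors and matrices, not scalars.

```python
import math


def rnn_step(x, h_prev, w_x=0.5, w_h=0.8):
    """One recurrent update: the new state mixes the input with the old state."""
    return math.tanh(w_x * x + w_h * h_prev)


def run_rnn(inputs, h0=0.0):
    """Feed a sequence through the cell, carrying the hidden state forward."""
    h = h0
    states = []
    for x in inputs:
        h = rnn_step(x, h)
        states.append(h)
    return states


# Same values, different order: the final state differs because
# the cell remembers when the signal arrived.
early_signal = run_rnn([1.0, 0.0, 0.0])
late_signal = run_rnn([0.0, 0.0, 1.0])
```

In `early_signal` the state shrinks at every later step, which is the "decay over time" noted above.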
20. RNN Sequences
Input: Image
Output: Classification
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Input: Image
Output: Text (Captions)
Input: Text
Output: Class (Sentiment)
Input: Text (English)
Output: Text (Spanish)
Input Layer
Hidden Layer
Output Layer
21. Character-based RNNs
Tokens are Characters vs. Words/Phrases
Microsoft trains every 3 characters
Fewer Combinations of Possible Neighbors
Only 26 alpha character tokens vs. millions of word tokens
Preserves state between 1st and 2nd ‘l’; improves prediction
22. Long Short Term Memory (LSTM)
More Complex State Update Function than Vanilla RNN
26. Use Cases
Document Summary
TextRank: TF/IDF + PageRank
Article Classification and Similarity
LDA: calculate top `k` topic distribution
Machine Translation
word2vec: compare word embedding vectors
Must Convert Text to Numbers!
27. Core Concepts
Corpus
Collection of text
e.g., documents, articles, genetic codes
Embeddings
Tokens represented/embedded in vector space
Learned, hidden features (~PCA, SVD)
Similar tokens cluster together, analogies cluster apart
k-skip-gram
Skip k neighbors when defining tokens
n-gram
Treat n consecutive tokens as a single token
Composable:
1-skip, bi-gram
(every other word)
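The n-gram and skip-gram definitions above can be sketched in a few lines of Python. The helper names `ngrams` and `skip_bigrams` are hypothetical. The skip variant follows the slide's "every other word" reading, pairing each token with the one exactly k positions past its neighbor. Other definitions allow up to k skipped tokens.

```python
def ngrams(tokens, n):
    """Treat every run of n consecutive tokens as a single unit."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skip_bigrams(tokens, k):
    """Pair each token with the token that sits k positions past its neighbor."""
    step = k + 1
    return [(tokens[i], tokens[i + step]) for i in range(len(tokens) - step)]


words = "the quick brown fox jumps".split()
bigrams = ngrams(words, 2)
one_skip_bigrams = skip_bigrams(words, 1)
```

For the five words above, `bigrams` pairs adjacent words such as ("the", "quick"), while `one_skip_bigrams` pairs every other word such as ("the", "brown").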
28. Parsers and POS Taggers
Describe grammatical sentence structure
Requires context of entire sentence
Helps reason about sentence
80% obvious, simple token neighbors
Major bottleneck in NLP pipeline!
29. Pre-trained Parsers and Taggers
Penn Treebank
Parser and Part-of-Speech Tagger
Human-annotated (!)
Trained on 4.5 million words
Parsey McParseface
Trained by SyntaxNet
30. Feature Engineering
Lower-case
Preserve proper nouns using caret (`^`)
“MLconf” => “^m^lconf”
“Varsity” => “^varsity”
Encode Common N-grams (Phrases)
Create a single token using underscore (`_`)
“Senior Developer” => “senior_developer”
Stemming and Lemmatization
Try to avoid: let the neural network figure this out
Can preserve part of speech (POS) using “_noun”, “_verb”
“banking” => “banking_verb”
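The encodings above are simple string transforms. The sketch below reproduces the slide's examples with three hypothetical helpers (`mark_caps`, `join_phrases`, `tag_pos`). It assumes the phrase list and the part-of-speech label come from elsewhere, for example a phrase dictionary and a POS tagger.

```python
def mark_caps(text):
    """Lower-case the text, prefixing each former capital with a caret."""
    return "".join("^" + ch.lower() if ch.isupper() else ch for ch in text)


def join_phrases(text, phrases):
    """Lower-case the text and fuse each known phrase into one token."""
    lowered = text.lower()
    for phrase in phrases:
        lowered = lowered.replace(phrase, phrase.replace(" ", "_"))
    return lowered


def tag_pos(token, pos):
    """Append a part-of-speech label supplied by an external tagger."""
    return f"{token}_{pos}"


caps_example = mark_caps("MLconf")
phrase_example = join_phrases("Senior Developer", ["senior developer"])
pos_example = tag_pos("banking", "verb")
```

Plain substring replacement keeps the sketch short. A production pipeline would match on token boundaries instead.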
32. Count-based Models
Goal: Convert Text to Vector of Neighbor Co-occurrences
Bag of Words (BOW)
Simple hashmap with word counts
Loses neighbor context
Term Frequency / Inverse Document Frequency (TF/IDF)
Normalizes based on token frequency
GloVe
Matrix factorization on co-occurrence matrix
Highly parallelizable, reduce dimensions, capture global co-occurrence stats
Log smoothing of probability ratios
Stores word vector diffs for fast analogy lookups
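Bag of Words and TF/IDF are small enough to sketch with the standard library. The weighting below, term frequency times log(N / document frequency), is one common TF/IDF variant and is chosen here as an assumption. The slide does not say which variant it has in mind, and the three toy documents are invented.

```python
import math
from collections import Counter


def bag_of_words(tokens):
    """Word counts only: the order and neighbors of the tokens are lost."""
    return Counter(tokens)


def tf_idf(docs):
    """Weight each term by its in-document frequency and its rarity across docs."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


docs = [
    "spark runs sql".split(),
    "spark runs ml".split(),
    "tensorflow runs ml".split(),
]
counts = bag_of_words(docs[0])
weights = tf_idf(docs)
```

A term that appears in every document, like "runs", gets weight zero, while a term unique to one document, like "sql", gets the highest weight in it.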
33. Neural-based Predictive Models
Goal: Predict Text using Learned Embedding Vectors
word2vec
Shallow neural network
Local: nearby words predict each other
Fixed word embedding vector size (e.g., 300)
Optimizer: Mini-batch Stochastic Gradient Descent (SGD)
SyntaxNet
Deep(er) neural network
Global(er)
Not a Recurrent Neural Net (RNN)!
Can combine with BOW-based models (e.g., word2vec CBOW)
34. word2vec
CBOW word2vec
Predict target word from source context
A single source context is an observation
Loses useful distribution information
Good for small datasets
Skip-gram word2vec (Inverse of CBOW)
Predict source context words from target word
Each (source context, target word) tuple is observation
Better for large datasets
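The difference between the two word2vec variants shows up in how training examples are built from a sentence. The sketch below only generates the examples and trains nothing. The helper names and the window size are assumptions. Skip-gram pairs are written as (input target word, context word to predict).

```python
def cbow_examples(tokens, window):
    """One observation per position: the whole context predicts the target."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            examples.append((context, target))
    return examples


def skipgram_examples(tokens, window):
    """One observation per (target, context word) pair."""
    return [
        (target, context_word)
        for context, target in cbow_examples(tokens, window)
        for context_word in context
    ]


sentence = "the cat sat down".split()
cbow = cbow_examples(sentence, window=1)
skipgram = skipgram_examples(sentence, window=1)
```

CBOW yields one observation per word, with the context pooled. Skip-gram yields one per word pair, so the same sentence produces more training observations.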
36. *2vec
lda2vec
LDA (global) + word2vec (local)
From Chris Moody @ Stitch Fix
like2vec
Embedding-based Recommender
37. word2vec vs. GloVe
Both are Fundamentally Similar
Capture local co-occurrence statistics (neighbors)
Capture distance between embedding vectors (analogies)
GloVe
Count-based
Also captures global co-occurrence statistics
Requires upfront pass through entire dataset
38. SyntaxNet POS Tagging
Determine coarse-grained grammatical role of each word
Multiple contexts, multiple roles
Neural Net
Inputs: stack, buffer
Results: POS probability distro
Already
Tagged
39. SyntaxNet Dependency Parser
Determine fine-grained roles using grammatical relationships
“Transition-based”, Incremental Dependency Parser
Globally Normalized using Beam Search with Early Update
Parsey McParseface: Pre-trained Parser/Tagger avail in 40 langs
Fine-grained
Coarse-grained
40. SyntaxNet Use Case: Nutrition
Nutrition and Health Startup in SF (Stealth)
Using Google’s SyntaxNet
Rate Recipes and Menus by Nutritional Value
Correct
Incorrect
42. Thank You, Atlanta!
Chris Fregly, Research Scientist @ PipelineIO
All Source Code, Demos, and Docker Images
@ pipeline.io
Join the Global Meetup for all Slides and Videos
@ advancedspark.com