This document discusses simplifying training and serving deep learning models with big data in Python using TensorFlow, Apache Beam, and Apache Flink. It covers using TF.Transform to preprocess data, running TensorFlow models on Spark using TensorFlowOnSpark, and the progress being made on non-JVM support in Apache Beam and TensorFlow on Flink. It also briefly discusses other big data systems and the challenges of interacting with them from outside the Java Virtual Machine.
Powering Tensorflow with big data using Apache Beam, Flink, and Spark - OSCON...Holden Karau
TensorFlow is all kinds of fancy, from helping startups raise their series A in Silicon Valley to detecting if something is a cat. However, when things start to get “real,” you may find yourself no longer just dealing with mnist.csv but instead needing to do large-scale data prep as well as training.
Holden Karau details how to use TensorFlow in conjunction with Apache Spark, Flink, and Beam to create a full machine learning pipeline—including the annoying feature engineering and data prep components that we like to pretend don’t exist. Holden also explains why these feature prep stages need to be integrated into the serving layer. She concludes by examining changing industry trends, like Apache Arrow, and how they impact cross-language development for things like deep learning. Even if you’re not trying to raise a round of funding in Silicon Valley, this talk will give you tools to tackle interesting machine learning problems at scale.
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018Holden Karau
Apache Spark has driven a lot of adoption of both Scala and functional programming concepts in non-traditional industries. Many programmers in the big data world come looking for a solution to scaling their code, quickly find themselves dealing with immutable data structures and lambdas, and those who love it stay. However, there is a dark side (of escape): much of Spark’s functional programming is changing, and even though it encourages functional programming it does so in a variety of languages with different expectations (in-line XML as a valid part of your language is fun!). This talk will look at how Spark does a good job of introducing folks to concepts like immutability, but also places where we maybe don’t do a great job of setting up developers for a life of functional programming. Things like accumulators, our three different models for streaming data, and an “interesting” approach to closures (come to find out what the ClosureCleaner does, stay to find out why). The talk will close out with a look at how the functionally inspired API is exposed in the different languages, and how this impacts the kind of code written (Scala, Java, and Python – other languages are supported by Spark but I don’t want to re-learn JavaScript or learn R just for this talk). Pictures of cute animals will be included in the slides to distract from the sad parts.
Video: https://www.youtube.com/watch?v=EDJfpkDpoE4
Sharing (or stealing) the jewels of python with big data & the jvm (1)Holden Karau
With the new Apache Arrow integration in PySpark 2.3, it is now starting to become reasonable to look at the Python world and ask “what else do we want to steal besides tensorflow”, or, as a Python developer, to ask “how can I get my code into production without it being rewritten into a mess of Java?”
Regardless of your specific side(s) in the JVM/Python divide, collaboration is getting a lot faster, so let’s learn how to share! In this brief talk we will examine sharing some of the wonders of spaCy with the Java world, which still has a somewhat lackluster set of options for NLP.
Intro - End to end ML with Kubeflow @ SignalConf 2018Holden Karau
There are many great tools for training machine learning models, ranging from scikit-learn to Apache Spark and TensorFlow. However, many of these systems largely leave open the question of how to use our models outside of the batch world (like in a reactive application). Different options exist for persisting the results and using them for live training, and we will explore the trade-offs of the different formats and their corresponding serving/prediction layers.
Using Spark ML on Spark Errors - What do the clusters tell us?Holden Karau
If you’re subscribed to user@spark.apache.org, or work in a large company, you may see some common Spark error messages. Even attending Spark Summit over the past few years you may have seen talks like “Top K Mistakes in Spark.” While cool non-machine-learning-based tools do exist to examine Spark’s logs — they don’t use machine learning and therefore are not as cool, but they are also limited by the amount of effort humans can put into writing rules for them. This talk will look at what happens when we train “regular” clustering models on stack traces, and explore DL models for classifying user messages to the Spark list. Come for the reassurance that the robots are not yet able to fix themselves, and stay to learn how to work better with the help of our robot friends. The tl;dr of this talk is Spark ML on Spark output, plus a little bit of Tensorflow, is fun for the whole family, but probably shouldn’t automatically respond to user list posts just yet.
A peek into Python's Metaclass and Bytecode from a Smalltalk UserKoan-Sin Tan
Understanding the object model and bytecode is a crucial part of understanding an interpreted object-oriented language. Smalltalk, one of the oldest object-oriented programming languages, has a great object model and has used bytecode and a VM since the 1970s. Guido once said "I remember being surprised by its use of metaclasses (which is quite different from that in Python or Ruby!) when I read about them much later." and "Smalltalk's bytecode was a bigger influence of Python's bytecode though." It is interesting to compare Smalltalk's and Python's metaclasses and bytecode.
Devel::NYTProf v3 - 200908 (OUTDATED, see 201008)Tim Bunce
Slides of my talk on Devel::NYTProf and optimizing perl code at the Italian Perl Workshop (IPW09). It covers the new features in NYTProf v3 and a new section outlining a multi-phase approach to optimizing your perl code.
30 mins long plus 10 mins of questions. Best viewed fullscreen.
Quick intro to Neural Machine Translation and overview of the Joey NMT toolkit.
Code: https://github.com/joeynmt/joeynmt
Demo: https://github.com/joeynmt/joeynmt/blob/master/joey_demo.ipynb
Accelerating Big Data beyond the JVM - Fosdem 2018Holden Karau
Many popular big data technologies (such as Apache Spark, BEAM, Flink, and Kafka) are built in the JVM, and many interesting tools are built in other languages (ranging from Python to CUDA). For simple operations the cost of copying the data can quickly dominate, and in complex cases can limit our ability to take advantage of specialty hardware. This talk explores how improved formats are being integrated to reduce these hurdles to co-operation.
Many popular big data technologies (such as Apache Spark, BEAM, and Flink) are built in the JVM, while many interesting AI tools are built in other languages, some requiring copies to the GPU. As many folks have experienced, while we may wish to spend all of our time playing with cool algorithms -- we often need to spend more of our time working on data prep. Having to copy our data slowly between the JVM and the target language of computation can remove much of the benefit of being able to access our specialized tooling. Thankfully, as illustrated in the soon-to-be-released Spark 2.3, Apache Arrow and related tools offer the ability to reduce this overhead. This talk will explore how Arrow is being integrated into Spark, and how it can be integrated into other systems, but also limitations and places where Apache Arrow will not magically save us.
Link: https://fosdem.org/2018/schedule/event/big_data_outside_jvm/
Making the big data ecosystem work together with Python & Apache Arrow, Apach...Holden Karau
Slides from PyData London exploring how the big data ecosystem (currently) works together as well as how different parts of the ecosystem work with Python. Proof-of-concept examples are provided using nltk & spacy with Spark. Then we look to the future and how we can improve.
Debugging PySpark - Spark Summit East 2017Holden Karau
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. This talk will examine how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, as well as some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose, and this talk will examine how to effectively search logs from Apache Spark to spot common problems. In addition to the internal logging, this talk will look at options for logging from within our program itself.
Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but this talk will look at how to effectively use Spark’s current accumulators for debugging, as well as a look to the future for data property accumulators which may be coming to Spark in a future version.
In addition to reading logs, and instrumenting our program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems.
Video: https://www.youtube.com/watch?v=A0jYQlxc2FU&feature=youtu.be
Big Data Beyond the JVM - Strata San Jose 2018Holden Karau
Many of the recent big data systems, like Hadoop, Spark, and Kafka, are written primarily in JVM languages. At the same time, there is a wealth of tools for data science and data analytics that exist outside of the JVM. Holden Karau and Rachel Warren explore the state of the current big data ecosystem and explain how to best work with it in non-JVM languages. While much of the focus will be on Python + Spark, the talk will also include interesting anecdotes about how these lessons apply to other systems (including Kafka).
Holden and Rachel detail how to bridge the gap using PySpark and discuss other solutions like Kafka Streams as well. They also outline the challenges of pure Python solutions like dask. Holden and Rachel start with the current architecture of PySpark and its evolution. They then turn to the future, covering Arrow-accelerated interchange for Python functions, how to expose Python machine learning models into Spark, and how to use systems like Spark to accelerate training of traditional Python models. They also dive into what other similar systems are doing as well as what the options are for (almost) completely ignoring the JVM in the big data space.
Python users will learn how to more effectively use systems like Spark and understand how the design is changing. JVM developers will gain an understanding of how to work with Python code from data scientists and Python developers while avoiding the traditional trap of needing to rewrite everything.
Keeping the fun in functional w/ Apache Spark @ Scala Days NYCHolden Karau
Apache Spark has been a great driver of not only Scala adoption, but introducing a new generation of developers to functional programming concepts. As Spark places more emphasis on its newer DataFrame & Dataset APIs, it’s important to ask ourselves how we can benefit from this while still keeping our fun functional roots. We will explore the cases where the Dataset APIs empower us to do cool things we couldn’t before, what the different approaches to serialization mean, and how to figure out when the shiny new API is actually just trying to steal your lunch money (aka CPU cycles).
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. This talk will examine how to debug Apache Spark applications, the different options for logging in PySpark, as well as some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose, and this talk will examine how to effectively search logs from Apache Spark to spot common problems. In addition to the internal logging, this talk will look at options for logging from within our program itself.
Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but this talk will look at how to effectively use Spark’s current accumulators for debugging, as well as a look to the future for data property accumulators which may be coming to Spark in a future version.
In addition to reading logs, and instrumenting our program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems.
Debuggers are a wonderful tool, however when you have 100 computers the “wonder” can be a bit more like “pain”. This talk will look at how to connect remote debuggers, but also remind you that it’s probably not the easiest path forward.
Are general purpose big data systems eating the world?Holden Karau
Every time there is a new piece of big data technology we often see many different specific implementations of the concepts, which often eventually consolidate down to a few viable options and then frequently end up getting rolled into part of another larger project. This talk will examine this trend in the big data ecosystem, look at the exceptions to the "rule", and look at how better interchange formats like Apache Arrow have the potential to change this going forward. In addition to general vague happy feelings (or sad depending on your ideas about how software should be made), this talk will look at some specific examples with deep learning, so if anyone is looking for a little bit of pixie dust to sprinkle on a failing business plan to take to Silicon Valley to raise a series A, you'll get something out of this as well.
Video - https://www.youtube.com/watch?v=P_YKrLFZQJo
An introduction into Spark ML plus how to go beyond when you get stuckData Con LA
Abstract:-
This talk will introduce Spark's new machine learning framework (Spark ML) and how to train basic models with it. A companion Jupyter notebook for people to follow along with will be provided. Once we've got the basics down we'll look at what to do when we find we need more than the tools available in Spark ML (and I'll try and convince people to contribute to my latest side project -- Sparkling ML).
Bio:-
Holden Karau is a transgender Canadian, Apache Spark committer, an active open source contributor, and coauthor of Learning Spark and High Performance Spark. When not in San Francisco working as a software development engineer at IBM’s Spark Technology Center, Holden speaks internationally about Spark and holds office hours at coffee shops at home and abroad. She makes frequent contributions to Spark, specializing in PySpark and machine learning. Prior to IBM, she worked on a variety of distributed, search, and classification problems at Alpine, Databricks, Google, Foursquare, and Amazon. She holds a bachelor of mathematics in computer science from the University of Waterloo. Outside of computers she enjoys scootering and playing with fire.
Prototype4Production Presented at FOSSASIA2015 at SingaporeDhruv Gohil
Topic: Protoype for Production. Get ready to launch in a week with Django+Ansible and friends! Speaker: Dhruvkumar Gohil, IshiSystems Description: Sharing our whole Idea to Execution to Production work flow and tooling (all open source) centred around awesome Django. Ansible + AngujarJS + Postgresql Full Text Search + Supervisord + Nginx+Uwsgi.
Migrating Apache Spark ML Jobs to Spark + Tensorflow on KubeflowDatabricks
This talk will take two existing Spark ML pipelines (Frank The Unicorn, for predicting PR comments (Scala) – https://github.com/franktheunicorn/predict-pr-comments & Spark ML on Spark Errors (Python)) and explore the steps involved in migrating them into a combination of Spark and Tensorflow. Using the open source Kubeflow project (now with Spark support as of 0.5), we will create two integrated end-to-end pipelines to explore the challenges involved & look at areas of improvement (e.g. Apache Arrow, etc.).
Introduction to Spark ML Pipelines WorkshopHolden Karau
Introduction to Spark ML Pipelines Workshop slides - companion Jupyter notebooks in Python & Scala are available from my github at https://github.com/holdenk/spark-intro-ml-pipeline-workshop
Performance optimization techniques for Java codeAttila Balazs
The presentation covers the basics of performance optimizations for real-world Java code. It starts with a theoretical overview of the concepts, followed by several live demos showing how performance bottlenecks can be diagnosed and eliminated. The demos include some non-trivial multi-threaded examples inspired by real-world applications.
How to Choose a Deep Learning FrameworkNavid Kalaei
The trend of neural networks has attracted a huge community of researchers and practitioners. However, not all of the front runners are masters of deep learning, and the colorful array of frameworks can be confusing, especially for newcomers. In this presentation, I demystified the leading deep learning frameworks and provided a guideline on how to choose the most suitable option.
Introducing Kubeflow (w. Special Guests Tensorflow and Apache Spark)DataWorks Summit
Data Science, Machine Learning, and Artificial Intelligence has exploded in popularity in the last five years, but the nagging question remains, “How to put models into production?” Engineers are typically tasked to build one-off systems to serve predictions which must be maintained amid a quickly evolving back-end serving space which has evolved from single-machine, to custom clusters, to “serverless”, to Docker, to Kubernetes. In this talk, we present KubeFlow- an open source project which makes it easy for users to move models from laptop to ML Rig to training cluster to deployment. In this talk we will discuss, “What is KubeFlow?”, “why scalability is so critical for training and model deployment?”, and other topics.
Users can deploy models written in Python's sklearn, R, Tensorflow, Spark, and many more. The magic of Kubernetes allows data scientists to write models on their laptops, deploy to an ML-Rig, and then devOps can move that model into production with all of the bells and whistles such as monitoring, A/B tests, multi-arm bandits, and security.
A super fast introduction to Spark and glance at BEAMHolden Karau
Apache Spark is one of the most popular general purpose distributed systems, with built-in libraries to support everything from ML to SQL. Spark has APIs across languages including Scala, Java, Python, and R -- with more 3rd party language support (like Julia & C#). Apache BEAM is a cross-platform tool for building on top of different distributed systems, but it's in its early stages. This talk will introduce the core concepts of Apache Spark, and look to the potential future of Apache BEAM.
Apache Spark has two core abstractions for representing distributed data and computations. This talk will introduce the basics of RDDs and Spark DataFrames & Datasets, and Spark’s method for achieving resiliency. Since it’s a big data talk, we will include the almost-required wordcount example, and end the Spark part with follow-up pointers on Spark’s new ML APIs. For folks who are interested we’ll then talk a bit about portability, and how Apache BEAM aims to improve portability (as well as its unique approach to cross-language support).
Slides from Holden's talk at https://www.meetup.com/Wellington-Data-Scaling-Chats/events/mdcsdpyxcbxb/
OpenLineage for Stream Processing | Kafka Summit LondonHostedbyConfluent
"OpenLineage is an open platform for the collection and analysis of data lineage, which includes an open standard for lineage data collection, integration libraries for the most common tools, and a metadata repository/reference implementation (Marquez).
In recent months, stream processing, which is an important use case for Apache Kafka, has gained the particular focus of the OpenLineage community with many useful features completed or begun, including:
* A seamless OpenLineage & Apache Flink integration,
* Support for streaming jobs in Marquez,
* Progress on a built-in lineage API within the Flink codebase.
Cross-platform lineage allows for a holistic overview of data flow and its dependencies within organizations, including stream processing.
This talk will provide an overview of the most recent developments in the OpenLineage Flink integration and share what’s in store for this important collaboration.
This talk is a must-attend for those wishing to stay up-to-date on lineage developments in the stream processing world.
Simplifying training and serving deep learning models with big data in Python using TensorFlow - PyData Berlin
1. Simplifying Training &
Serving Deep Learning
with Big Data in Python
w/ Tensorflow, Apache Beam, and Apache Flink
+ Possible Spark bonus
@holdenkarau
2. Holden:
● My name is Holden Karau
● Preferred pronouns are she/her
● Developer Advocate at Google
● Apache Spark PMC, Beam contributor
● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & High Performance Spark
● Twitter: @holdenkarau
● Slide share http://www.slideshare.net/hkarau
● Code review livestreams: https://www.twitch.tv/holdenkarau /
https://www.youtube.com/user/holdenkarau
● Spark Talk Videos http://bit.ly/holdenSparkVideos
3.
4. Who I think you wonderful humans are?
● Nice enough people
● Don’t mind pictures of cats
● Probably pretty familiar with Spark
● Maybe somewhat familiar with Beam?
Lori Erickson
5. What we did get to:
● TensorFlowOnSpark w/basic Apache Arrow
● Python & Go on Beam on Flink prototype (in random
github branch, not finished)
○ Everyone loves wordcount right? right?....
● New Beam architecture allowing for better portability &
handling dependencies (like Tensorflow)
● TF Transform demo on Apache Flink via Apache Beam
● DO NOT non-JVM BEAM on Flink IN PRODUCTION
Vladimir Pustovit
6. DO NOT USE THIS IN PRODUCTION TODAY
● I’m serious, I don’t want to die or cause the next
financial meltdown with software I’m a part of
● By Today I mean June 8 2018, but it’s probably going to
not be great for at least a “little while”
Vladimir Pustovit
PROTambako The Jaguar
7. What will be covered?
Most likely:
● Where we are today in non-JVM support in Beam
● And why this matters for Tensorflow
● What the rest of Big Data ecosystem looks like going outside the JVM
● A partial TF Transform demo + TFMA links + possible system crash
If there is time (e.g. demo runs quickly, wifi works, several other happy unlikely
things):
● TensorFlowOnSpark
● Apache Arrow - How this changes “everything”*
8. So why do I need to power DL w/Big Data?
● Deep learning is most effective with large sample sets for training
● As much as some may say that no feature prep is required even if you’re
looking at mnist.csv you probably have _some_ feature prep
● Since we need big data for training we need to do our feature prep with it
● Even if you're just trying to raise some VC money it's going to go a lot better if
you add some keywords about a large proprietary dataset
9. TensorFlow isn’t enough on its own
● Enter TFX & friends like Kubeflow
○ Current related TFX OSS components: TF.Transform TF.Serving (with
more coming)
● Alternatives: piles of custom code re-created at serving time.
○ Yay job security?
PROJennifer C.
10. How do I do feature prep? (old skool)
● Write custom preparation jobs in your favourite big data tool (mine is Apache
Spark, but maybe yours is something else)
● Run it, train on the prepared data
● Rewrite your feature prep code to run at serving time
○ Error prone and sad
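As a rough illustration of the problem (a hedged PySpark sketch; the file names and columns are made up): the prep logic and the statistics it depends on live only in the batch job, so the serving layer has to re-implement them by hand.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
raw = spark.read.csv("train.csv", header=True, inferSchema=True)

# Compute a statistic over the full training set...
mean_x = raw.agg(F.mean("x").alias("mean_x")).collect()[0]["mean_x"]

# ...and bake it into the prepared features.
prepared = raw.withColumn("x_centered", F.col("x") - F.lit(mean_x))
prepared.write.parquet("prepared/")
# At serving time you must remember mean_x and redo this subtraction yourself.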
11. Enter: TF.Transform
● For pre-processing of your data
○ e.g. where you spend 90% of your dev time anyways
● Integrates into serving time :D
● OSS
● Runs on top of Apache Beam, but current release not yet outside of GCP
○ On master this can run on Flink, but probably has bugs currently.
○ Please don’t use this in production today unless you're on
GCP/Dataflow
PROKathryn Yengel
12. Defining a Transform processing function
# assumes: import tensorflow_transform as tft
def preprocessing_fn(inputs):
    # pull the raw feature columns out of the input batch
    x = inputs['x']
    y = inputs['y']
    s = inputs['s']
    # analyzers like tft.mean are computed over the full dataset by Beam
    x_centered = x - tft.mean(x)
    y_normalized = tft.scale_to_0_1(y)
    s_int = tft.string_to_int(s)
    return {'x_centered': x_centered,
            'y_normalized': y_normalized,
            's_int': s_int}
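For context, here is a minimal sketch (not from the slides) of how a preprocessing_fn like the one above gets run through the Beam-based tf.Transform APIs; the raw_data values and temp_dir are illustrative, and exact module paths vary a bit between tf.Transform releases:

import tempfile
import tensorflow as tf
import tensorflow_transform.beam.impl as tft_beam
from tensorflow_transform.tf_metadata import dataset_metadata, dataset_schema

# A tiny in-memory "dataset" plus a schema describing the raw features.
raw_data = [{'x': 1.0, 'y': 2.0, 's': 'hello'},
            {'x': 3.0, 'y': 4.0, 's': 'world'}]
raw_metadata = dataset_metadata.DatasetMetadata(dataset_schema.from_feature_spec({
    'x': tf.FixedLenFeature([], tf.float32),
    'y': tf.FixedLenFeature([], tf.float32),
    's': tf.FixedLenFeature([], tf.string),
}))

# AnalyzeAndTransformDataset runs the analyzers (tft.mean, etc.) as a Beam job
# and returns both the transformed data and a reusable transform_fn for serving.
with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
    (transformed_data, transformed_metadata), transform_fn = (
        (raw_data, raw_metadata)
        | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))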
15. Some common use-cases: Scale to ..., Bag of Words / N-Grams, Bucketization, Feature Crosses
● Scale to ...: tft.scale_to_z_score
● Bag of Words / N-Grams: tft.ngrams, tft.string_to_int, tf.string_split
● Bucketization: tft.apply_buckets, tft.quantiles
● Feature Crosses: tft.string_to_int, tf.string_join
...
16. BEAM Beyond the JVM: Current release
● Non JVM BEAM doesn’t work outside of Google’s environment yet
● tl;dr : uses grpc / protobuf
○ Similar to the common design but with more efficient representations (often)
● But exciting new plans to unify the runners and ease the support of different
languages (called SDKs)
○ See https://beam.apache.org/contribute/portability/
● If this is exciting, you can come join me on making BEAM work in Python3
○ Yes we still don’t have that :(
○ But we're getting closer & you can come join us on BEAM-2874 :D
Emma
17. BEAM Beyond the JVM: Master + Experiments
● Common interface for setting up jobs
● Portability framework allows SDK harnesses in arbitrary languages to be kicked off
● Runners ship in their own docker containers (goodbye dependency hell, hello
container hell)
○ Also for now rolling containers leaves something to be desired (e.g. edit docker file by hand)
● Hacked up Python SDK to sort of talk to the new interface
● Go SDK talks to the new interface, still missing some features
● Need permissions? Run on GKE or plumb through permissions file :(
Nick
18. BEAM Beyond the JVM: Master w/ experiments — (architecture diagram of the portability framework, with several components marked *ish)
19. So what does that look like? (diagram: a Driver talking over gRPC to Workers 1..K, each SDK harness running in its own Docker container)
20. BEAM Beyond the JVM: The “future” (e.g. not now) — (architecture diagram: more runners and SDKs behind the portability framework, still marked *ish)
21. So how TF does this relate to TF?
● Tensorflow is in Python (kind of)
● Once we finish the Python SDK on Beam on Flink adventure you can use all
sorts of cool libraries (like TFT/TFX) to do your tensorflow work
○ You can use them today too if your use case is on Dataflow
○ If you don’t mind bugs you can experiment with them on Flink too
● You will be able to manage your dependencies
● You will be able to (in theory) re-use dataprep code at serving time
○ 80% less copy n’ paste code with slight mistakes that get out of date!**
● No that doesn’t work today
● Or tomorrow
● But… eventually
○ Standard OSS excuse “patches welcome” (sort of if you can find the branch :p)
**Not a guarantee, see your vendor for details.
22. Ooor from the chicago taxi data...
for key in taxi.DENSE_FLOAT_FEATURE_KEYS:
    # Preserve this feature as a dense float, setting nan's to the mean.
    outputs[key] = transform.scale_to_z_score(inputs[key])
for key in taxi.VOCAB_FEATURE_KEYS:
    # Build a vocabulary for this feature.
    outputs[key] = transform.string_to_int(
        inputs[key], top_k=taxi.VOCAB_SIZE,
        num_oov_buckets=taxi.OOV_SIZE)
for key in taxi.BUCKET_FEATURE_KEYS:
    outputs[key] = transform.bucketize(inputs[key],
                                       taxi.FEATURE_BUCKET_COUNT)  # bucket count constant from the taxi example (assumed)
23. Beam Demo Time!!!!!
● TFT!!!!!! So amazing!!!!!!!!!
○ Want to move to SF and raise a series A? Pay attention :p
● Based on my testing on Saturday there is a 1 in 3 chance this will hard lock
my computer
○ But that’s OK, we can reboot and take questions while I bounce :)
● That’s your friendly reminder not to run any of this in production (yet)
● The demo code is forked from axelmagn/model-analysis which is forked from
tensorflow/model-analysis and runs on master of Apache Beam w/Flink
○ The example does something… possibly questionable with the evaluation dataset currently,
but #todofixmelater
24. Ok now what?
● Integrate this into your model serving pipeline of choice
○ Don’t have one or open to change? Checkout TFMA which can directly serve it
● There’s a guide (it doesn’t show Flink because not released yet) but steps are
similar
○ But you’re not using this in production today anyways?
○ Right?
Nick Perla
25. (Optional) Second Beam Demo Time!!!!!
● Word count!!!!!! So amazing!!!!!!!!!
● Based on my testing on Saturday there is a 2 in 3 chance this will hard lock
my computer
● That’s your friendly reminder not to run any of this in production
● Demo shell script of fun (go only) & python + go
26. What do the rest of the systems do?
● Spoiler: mostly it’s not better
○ Although it tends to be more finished
○ Sometimes it's different
● Different tradeoffs, maybe better for your use case but all tradeoffs
Kate Neilan
28. PySpark
● The Python interface to Spark
● Same general technique used as the basis for the C#, R, Julia, etc.
interfaces to Spark
● Fairly mature, integrates well-ish into the ecosystem, less a Pythonrific API
● Has some serious performance hurdles from the design
29. So what does that look like? (diagram: the Driver talks to the JVM via Py4J; on Workers 1..K the Java process pipes data to a Python subprocess)
31. So how does that impact Py[X]
forall X in {Big Data}-{Native Python Big Data}
● Double serialization cost makes everything more
expensive
● Python worker startup takes a bit of extra time
● Python memory isn’t controlled by the JVM - easy to go
over container limits if deploying on YARN or similar
● Error messages make ~0 sense
● Dependency management makes limited sense
● features aren’t automatically exposed, but exposing
them is normally simple
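One practical consequence of “Python memory isn’t controlled by the JVM”: on YARN (or similar) you generally want to leave explicit headroom for the Python worker processes. A hedged sketch with illustrative sizes — config key names vary a bit across Spark versions (older releases use spark.yarn.executor.memoryOverhead):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.executor.memory", "4g")
         # extra container room for the Python workers, outside the JVM heap
         .config("spark.executor.memoryOverhead", "2g")
         .getOrCreate())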
33. The “future”*: faster interchange
● By future I mean availability today but running it in production is “adventurous”
● Unifying our cross-language experience
○ And not just “normal” languages, CUDA counts yo
Tambako The Jaguar
35. What does the future look like?* (benchmark chart)
*Source: https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html.
*Vendor benchmark. Trust but verify.
36. Arrow (a poorly drawn big data view) — (ecosystem diagram; logos trademarks of their respective projects)
37. Rewriting your code because why not
spark.catalog.registerFunction(
    "add", lambda x, y: x + y, IntegerType())
=>
add = pandas_udf(lambda x, y: x + y, IntegerType())
Jennifer C.
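A slightly fuller sketch of that rewrite, hedged because the decorator options changed between Spark releases (this is the Spark 2.3-era scalar pandas UDF style; the session, DataFrame, and column names are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2), (3, 4)], ["x", "y"])

@pandas_udf(IntegerType())
def add(x, y):
    # x and y arrive as whole pandas Series (Arrow-backed batches),
    # not one row at a time like a classic Python UDF.
    return x + y

df.select(add("x", "y").alias("sum")).show()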
38. And we can do this in TFOnSpark*:
unionRDD.foreachPartition(TFSparkNode.train(self.cluster_info,
                                            self.cluster_meta, qname))
Will transform into something magical (aka fast but unreliable) on the next slide!
Delaina Haslam
39. Which becomes
train_func = TFSparkNode.train(self.cluster_info,
                               self.cluster_meta, qname)

@pandas_udf("int")
def do_train(inputSeries1, inputSeries2):
    # Sad hack for now
    modified_series = map(lambda x: (x[0], x[1]),
                          zip(inputSeries1, inputSeries2))
    train_func(modified_series)
    return pandas.Series([0] * len(inputSeries1))
ljmacphee
40. And this now looks like: (updated ecosystem diagram; logos trademarks of their respective projects)
41. TFOnSpark Possible vNext+1?
● Avoid funneling the data through Python native types
○ For now the Spark Arrow UDFs aren’t perfect for this
○ But we can (and are) improving them
● mmapped Arrow?
● Skip Python on the workers handling data entirely (idk I’m lazy so probably
not)
Renars
42. References
● TFMA + TFT example guide -
https://www.tensorflow.org/tfx/model_analysis/examples/chicago_taxi
● Apache Beam github repo (w/early alpha portable Flink support)-
https://beam.apache.org/
● TFMA Example fork for use w/Beam on Flink -
● TensorFlowOnSpark -https://github.com/yahoo/TensorFlowOnSpark
● Spark Deep Learning Pipelines - https://github.com/databricks/spark-deep-learning
● flink-tensorflow - https://github.com/FlinkML/flink-tensorflow
● TF.Transform - https://github.com/tensorflow/transform
● Beam portability design: https://beam.apache.org/contribute/portability/
● Beam on Flink + portability https://issues.apache.org/jira/browse/BEAM-2889
PROR. Crap Mariner
43. k thnx bye :)
If you care about Spark testing and
don’t hate surveys:
http://bit.ly/holdenTestingSpark
I need to give a testing talk next
month, help a “friend” out.
Will tweet results
“eventually” @holdenkarau
Do you want more realistic
benchmarks? Share your UDFs!
http://bit.ly/pySparkUDF
Pssst: Have feedback on the presentation? Give me a
shout (holden@pigscanfly.ca) if you feel comfortable doing
so :)
Give feedback on this presentation
http://bit.ly/holdenTalkFeedback
44. What’s the rest of big data outside the JVM
look like?
Most of the tools are built in the JVM, so how do we play together?
● Pickling, Strings, JSON, XML, oh my!
● Unix pipes
● Sockets
What about if we don’t want to copy the data all the time?
● Or standalone “pure”* re-implementations of everything
○ Reasonable option for things like Kafka where you would have the I/O regardless.
○ Also cool projects like dask (pure python) -- but hard to talk to existing ecosystem
David Brown
45. Hadoop “streaming” (Python/R)
● Unix pipes!
● Involves a data copy, formats get sad
● But the overhead of a Map/Reduce task is pretty high anyways...
Lisa Larsson
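To make the “Unix pipes” point concrete, a hedged sketch of a Hadoop Streaming mapper — just a script reading stdin and writing tab-separated key/value pairs, with every record copied through the pipe as text (the jar path and input/output paths in the launch command are illustrative):

#!/usr/bin/env python
# mapper.py -- word count mapper for Hadoop Streaming.
# Typically launched with something like:
#   hadoop jar hadoop-streaming.jar -files mapper.py \
#     -input in/ -output out/ -mapper mapper.py -reducer aggregate
import sys

for line in sys.stdin:
    for word in line.split():
        # emit "word<TAB>1"; Hadoop groups by key before the reducer
        print("%s\t%d" % (word, 1))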
46. Kafka: re-implement all the things
● Multiple options for connecting to Kafka from outside of the JVM (yay!)
● They implement the protocol to talk to Kafka (yay!)
● This involves duplicated client work, and sometimes the clients can be slow
(solution, FFI bindings to C instead of Java)
● Buuuut -- we can’t access all of the cool Kafka business (like Kafka Streams)
and features depend on client libraries implementing them (easy to slip below
parity)
Smokey Combs
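For example, a hedged sketch with kafka-python, one of the clients that re-implements the wire protocol (the topic and broker address are illustrative) — you get consume/produce without a JVM, but not Kafka Streams:

from kafka import KafkaConsumer

# Talks straight to the brokers over the Kafka protocol -- no JVM involved.
consumer = KafkaConsumer("events",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.key, message.value)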
47. Dask: a new beginning?
● Pure* python implementation
● Provides real enough DataFrame interface for distributed data
● Also your standard-ish distributed collections
● Multiple backends
● Primary challenge: interacting with the rest of the big data ecosystem
○ Arrow & friends might make this better with time too, buuut….
● See https://dask.pydata.org/en/latest/ &
http://dask.pydata.org/en/latest/spark.html
Lisa Zins
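A hedged taste of that DataFrame interface (the file pattern and column name are illustrative):

import dask.dataframe as dd

# Lazily builds a partitioned, pandas-like DataFrame across many CSVs.
df = dd.read_csv("events-*.csv")
# Nothing runs until .compute(); then work is scheduled on the chosen backend.
counts = df.groupby("user_id").size().compute()
print(counts.head())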
Editor's Notes
Photo from https://www.flickr.com/photos/lorika/4148361363/in/photolist-7jzriM-9h3my2-9Qn7iD-bp55TS-7YCJ4G-4pVTXa-7AFKbm-bkBfKJ-9Qn6FH-aniTRF-9LmYvZ-HD6w6-4mBo3t-8sekvz-mgpFzD-5z6BRK-de513-8dVhBu-bBZ22n-4Vi2vS-3g13dh-e7aPKj-b6iHHi-4ThGzv-7NcFNK-aniTU6-Kzqxd-7LPmYs-4ok2qy-dLY9La-Nvhey-Kte6U-74B7Ma-6VfnBK-6VjrY7-58kAY9-7qUeDK-4eoSxM-6Vjs5A-9v5Pvb-26mja-4scwq3-GHzAL-672eVr-nFUomD-4s8u8F-5eiQmQ-bxXXCc-5P9cCT-5GX8no
introduce spark rdds, purple blog diagrams
https://www.flickr.com/photos/pustovit/15867520885/in/photolist-qbac9i-9XLrR4-74scWq-bpnxfN-qYAD3D-e6u5Ej-oztsCu-qJMG4L-7b4y4a-gu8Wa-8MzgVR-b5gHki-djzdH3-82TowY-qJc99b-pC6yth-ifAkvP-mju1Ce-3ACPG6-F9aWR2-5QQL1U-4Hav3S-dGHvJj-jxQLth-djzdgd-dL24wn-8znjgb-aZxA6H-gDkWNo-djzcZF-22NYH5r-9amo58-apqLdG-fZhoSH-cDjpEQ-nLbBuK-6EuEN2-dAN5KN-asBjbL-Vx2zFR-djzdki-SRkhd1-djzcnF-Tc9FAf-qsduzQ-djzd4P-9wCiZT-8JALTP-eqbpop-R2S2FR
As just discussed, TFT provides utility functions that run analyzers as needed. These are data processing jobs that can be run in arbitrary environments using the Beam SDK.
What happens behind the scene is that the analyzers run as a distributed data processing graph, and the result is put into the output graph as constants.
Some of the common use cases:
Just talked about ones on left:
Scale scores
Bucketize
Text features: apply bag of words or n grams
For Feature crosses: cross strings and generate vocabs of the result of those crosses
As mentioned before, tf.Transform is powerful in that you can chain these transformations.
SparkContext uses Py4J to launch a JVM and create a JavaSparkContext.
On The executors, java subprocess launches a python subprocess via a pipe.
The data is serialized using cPickle and sent to the python subprocess
SparkContext uses Py4J to launch a JVM and create a JavaSparkContext.
On The executors, java subprocess launches a python subprocess via a pipe.
The data is serialized using cPickle and sent to the python subprocess
What does python memory isn’t controlled by the JVM mean?
Double serialization from language transform