Presentation given on Monday 10 September at the ROOT Users' Workshop 2018 in Sarajevo. Progress update on the Automated Parallel Computation of Collaborative Statistical Models project, a collaboration between the Netherlands eScience Center and Nikhef.
We present an update on our recent efforts to further parallelize RooFit. We have performed extensive benchmarks and identified at least three bottlenecks that will benefit from parallelization. To tackle these and possible future bottlenecks, we designed a parallelization layer that allows us to parallelize existing classes with minimal effort, but with high performance and retaining as much of the existing class's interface as possible. The high-level parallelization model is a task-stealing approach. The implementation is currently based on the bi-directional memory mapped pipe (BidirMMapPipe), but could in the future be replaced by other modes of communication between processes.
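The task-queue pattern behind this kind of parallelization layer can be sketched in a few lines of plain Python. This is purely illustrative: the actual RooFit layer uses forked processes communicating over BidirMMapPipe, not Python threads, but the load-balancing idea (idle workers pull the next available task) is the same.

```python
import threading
import queue

def worker(tasks, results):
    while True:
        item = tasks.get()
        if item is None:               # sentinel: no more work
            break
        idx, x = item
        results.put((idx, x * x))      # stand-in for one expensive partial result

def parallel_square(values, n_workers=4):
    tasks, results = queue.Queue(), queue.Queue()
    workers = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(n_workers)]
    for w in workers:
        w.start()
    for item in enumerate(values):     # enqueue all tasks
        tasks.put(item)
    for _ in workers:                  # one sentinel per worker
        tasks.put(None)
    out = [results.get() for _ in values]
    for w in workers:
        w.join()
    return [v for _, v in sorted(out)] # restore input order

squares = parallel_square([1, 2, 3, 4])
print(squares)  # [1, 4, 9, 16]
```

Because workers pull tasks as they finish, faster workers naturally take on more work, which is the load-balancing effect the task-stealing model is after.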
Concurrency and parallelism in Python are always hot topics. This talk looks at the variety of forms of concurrency and parallelism. In particular, it gives an overview of the forms of message-passing concurrency that have become popular in languages like Scala and Go. A Python library called python-csp, which implements similar ideas in a Pythonic way, will be introduced, and we will look at how this style of programming can be used to avoid deadlocks, race hazards and "callback hell".
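The CSP style the talk describes can be sketched with only the standard library: sequential processes communicate over channels instead of sharing state or nesting callbacks. python-csp's real API differs; this shows only the channel idea, with queues standing in for channels.

```python
import threading
import queue

def producer(out_chan):
    for i in range(5):
        out_chan.put(i)
    out_chan.put(None)                 # None acts as a "channel closed" marker

def doubler(in_chan, out_chan):
    while (v := in_chan.get()) is not None:
        out_chan.put(v * 2)
    out_chan.put(None)

a, b = queue.Queue(), queue.Queue()
threading.Thread(target=producer, args=(a,)).start()
threading.Thread(target=doubler, args=(a, b)).start()

received = []
while (v := b.get()) is not None:
    received.append(v)
print(received)  # [0, 2, 4, 6, 8]
```

Note that the consuming code is a straight-line loop: no callbacks, and no shared mutable state to race on.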
Beating the (sh** out of the) GIL - Multithreading vs. Multiprocessing (Guy K. Kloss)
Talk given at the June 2008 meeting of the New Zealand Python User Group in Auckland.
Outline: An overview of approaches to parallel/concurrent programming in Python.
Code demonstrated in the presentation can be found here:
http://www.kloss-familie.de/moin/TalksPresentations
Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language, and is a lot of fun to use!
Process the Twitter stream using Storm & Redstorm with Ruby & JRuby. Full working demo, code on github https://github.com/colinsurprenant/tweitgeist and live demo http://tweitgeist.needium.com/
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016 (MLconf)
Comparing TensorFlow NLP Options: word2Vec, gloVe, RNN/LSTM, SyntaxNet, and Penn Treebank: Through code samples and demos, we’ll compare the architectures and algorithms of the various TensorFlow NLP options. We’ll explore both feed-forward and recurrent neural networks such as word2vec, gloVe, RNN/LSTM, SyntaxNet, and Penn Treebank using the latest TensorFlow libraries.
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16 (MLconf)
Say What You Mean: Scaling Machine Learning Algorithms Directly from Source Code: Scaling machine learning applications is hard. Even with powerful systems like Spark, TensorFlow, and Theano, the code you write has more to do with getting these systems to work at all than it does with your algorithm itself. But it doesn’t have to be this way!
In this talk, I’ll discuss an alternate approach we’ve taken with Pyfora, an open-source platform for scalable machine learning and data science in Python. I’ll show how it produces efficient, large scale machine learning implementations directly from the source code of single-threaded Python programs. Instead of programming to a complex API, you can simply say what you mean and move on. I’ll show some classes of problem where this approach truly shines, discuss some practical realities of developing the system, and I’ll talk about some future directions for the project.
Apache Storm and Twitter Streaming API integration (Uday Vakalapudi)
1) Storm is a distributed, real-time computation system.
2) The input stream of a Storm cluster is handled by a component called a spout. The spout passes the data to a bolt; a bolt either persists the data in some form of storage or passes it on to another bolt. You can imagine a Storm cluster as a chain of bolts, each applying some transformation to the data emitted by the spout.
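The spout-to-bolt chain described above can be mimicked in a few lines of plain Python. This is only the conceptual shape (Storm's real API involves topologies, tasks, and acking, and is typically written in Java/Clojure):

```python
def spout():
    """Emit a stream of tuples (finite here for the demo; unbounded in Storm)."""
    for word in ["storm", "makes", "streams", "easy"]:
        yield word

def upper_bolt(stream):
    """A transforming bolt: pass each tuple on to the next bolt, modified."""
    for word in stream:
        yield word.upper()

def count_bolt(stream):
    """A terminal bolt: 'persist' results instead of passing them on."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

result = count_bolt(upper_bolt(spout()))
print(result)  # {'STORM': 1, 'MAKES': 1, 'STREAMS': 1, 'EASY': 1}
```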
1) Real-time systems must guarantee that data is processed.
2) They should also be horizontally scalable: adding a few nodes should improve the throughput of the cluster.
3) They should be fault-tolerant: if an error occurs or a node goes down, the system should keep working without interruption.
4) We want to eliminate intermediate message brokers, which are complex and slow: instead of messages going directly from producers to consumers, they pass through a third-party broker, which typically persists the input data to disk. This whole process adds extra time to data processing.
5) Hadoop can tolerate this because it is already a high-latency system: a few hours of downtime still leaves you with high latency. In a real-time system, however, a few hours of downtime means you are no longer real-time, so the robustness requirements are much harder. Storm satisfies all of these properties.
1) Both Hadoop and Storm are distributed, fault-tolerant systems, but Hadoop is mainly used for batch processing, whereas Storm is used for real-time computation.
2) Storm has no built-in storage system; it follows a "come and get some" strategy. Hadoop, on the other hand, has HDFS as its storage file system.
1) Both Storm and Flume are used for real-time data processing, but Flume does not give you a real-time computation system. Moreover, Flume depends on its channel component, a message broker, for guaranteed data processing: the channel always persists data before delivering it to a consumer. Storm has no intermediate-broker concept and stays as lightweight as possible; whatever business logic you want to write goes into a Storm bolt.
Introduction to Deep Learning with Python (indico data)
A presentation by Alec Radford, Head of Research at indico Data Solutions, on deep learning with Python's Theano library.
The emphasis of the presentation is on high-performance computing, natural language processing (using recurrent neural nets), and large-scale learning with GPUs.
Video of the talk available here: https://www.youtube.com/watch?v=S75EdAcXHKk
An Introduction to TensorFlow architecture (Mani Goswami)
Introduces you to the internals of TensorFlow and dives deep into the distributed version of TensorFlow. Refer to https://github.com/manigoswami/tensorflow-examples for examples.
Caffe (Convolutional Architecture for Fast Feature Embedding) is a deep learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center (BVLC) and by community contributors.
Caffe’s expressive architecture encourages application and innovation. Models and optimization are defined by configuration without hard-coding. Switch between CPU and GPU by setting a single flag to train on a GPU machine, then deploy to commodity clusters or mobile devices.
Caffe’s extensible code fosters active development. In Caffe’s first year, it was forked by over 1,000 developers and had many significant changes contributed back. Thanks to these contributors the framework tracks the state of the art in both code and models.
Speed makes Caffe perfect for research experiments and industry deployment. Caffe can process over 60M images per day with a single NVIDIA K40 GPU*. That’s 1 ms/image for inference and 4 ms/image for learning. We believe that Caffe is the fastest convnet implementation available.
Caffe already powers academic research projects, startup prototypes, and even large-scale industrial applications in vision, speech, and multimedia. Join our community of brewers on the caffe-users group and GitHub.
This tutorial is designed to equip researchers and developers with the tools and know-how needed to incorporate deep learning into their work. Both the ideas and implementation of state-of-the-art deep learning models will be presented. While deep learning and deep features have recently achieved strong results in many tasks, a common framework and shared models are needed to advance further research and applications and reduce the barrier to entry. To this end we present the Caffe framework, public reference models, and working examples for deep learning. Join our tour from the 1989 LeNet for digit recognition to today’s top ILSVRC14 vision models. Follow along with do-it-yourself code notebooks. While focusing on vision, general techniques are covered.
Data Science at the Command Line
Don't forget to think about this good old tool that is so handy, interactive and efficient, before rushing to Hadoop, Spark, Storm, etc.
You can do streaming, you can make joins on csv, and even machine learning at the command line... and have a lot of fun.
Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15 (MLconf)
Is Machine Learning Code for 100 Rows or a Billion the Same?: We have built an automatically distributed, implicitly parallel data science platform for running large-scale machine learning applications. By abstracting away the computer science required to scale machine learning models, the Ufora platform lets data scientists focus on building data science models in simple scripting code, without having to worry about building large-scale distributed systems, their race conditions, fault tolerance, etc. This automatic approach requires solving some interesting challenges, like optimal data layout for different ML models. For example, when a data scientist says “do a linear regression on this 100GB dataset”, Ufora needs to figure out how to automatically distribute and lay out that data across a cluster of machines in order to minimize travel over the wire. Running a GBM against the same dataset might require a completely different layout of that data. This talk will cover how the platform works, in terms of data and thread distribution, how it generates parallel processes out of single-threaded programs, and more.
These slides are for a brief seminar I gave in the Ph.D. course "Perspectives in Parallel Computing" (held by Prof. Marco Danelutto) at the University of Pisa (Italy).
They are a rapid introduction to Apache Storm and how it relates to classical algorithmic-skeleton parallel frameworks.
TF 2.0 is designed to improve usability and productivity. As an enthusiastic TF user, I am very excited. Personally, I think the most important question about usability is "how does TF provide a user-friendly API?" Setting aside the other aspects of TF 2.0, this post is a quick review from an API-usage perspective.
In this talk, I briefly look back at the past year of TensorFlow development and introduce the features planned for development and adoption on TensorFlow's upcoming roadmap. I also discuss the 2017-2018 trends and overall direction of machine-learning framework development.
Title
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTorch + XGBoost + Airflow + MLflow + Spark + Jupyter + TPU
Video
https://youtu.be/vaB4IM6ySD0
Description
In this workshop, we build real-world machine learning pipelines using TensorFlow Extended (TFX), KubeFlow, and Airflow.
Described in the 2017 paper, TFX is used internally by thousands of Google data scientists and engineers across every major product line within Google.
KubeFlow is a modern, end-to-end pipeline orchestration framework that embraces the latest AI best practices including hyper-parameter tuning, distributed model training, and model tracking.
Airflow is the most-widely used pipeline orchestration framework in machine learning.
Pre-requisites
Modern browser - and that's it!
Every attendee will receive a cloud instance
Nothing will be installed on your local laptop
Everything can be downloaded at the end of the workshop
Location
Online Workshop
Agenda
1. Create a Kubernetes cluster
2. Install KubeFlow, Airflow, TFX, and Jupyter
3. Setup ML Training Pipelines with KubeFlow and Airflow
4. Transform Data with TFX Transform
5. Validate Training Data with TFX Data Validation
6. Train Models with Jupyter, Keras/TensorFlow 2.0, PyTorch, XGBoost, and KubeFlow
7. Run a Notebook Directly on Kubernetes Cluster with KubeFlow
8. Analyze Models using TFX Model Analysis and Jupyter
9. Perform Hyper-Parameter Tuning with KubeFlow
10. Select the Best Model using KubeFlow Experiment Tracking
11. Reproduce Model Training with TFX Metadata Store and Pachyderm
12. Deploy the Model to Production with TensorFlow Serving and Istio
13. Save and Download your Workspace
Key Takeaways
Attendees will gain experience training, analyzing, and serving real-world Keras/TensorFlow 2.0 models in production using model frameworks and open-source tools.
Related Links
1. PipelineAI Home: https://pipeline.ai
2. PipelineAI Community Edition: http://community.pipeline.ai
3. PipelineAI GitHub: https://github.com/PipelineAI/pipeline
4. Advanced Spark and TensorFlow Meetup (SF-based, Global Reach): https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup
5. YouTube Videos: https://youtube.pipeline.ai
6. SlideShare Presentations: https://slideshare.pipeline.ai
7. Slack Support: https://joinslack.pipeline.ai
8. Web Support and Knowledge Base: https://support.pipeline.ai
9. Email Support: support@pipeline.ai
Introduction to data processing using Hadoop and Pig (Ricardo Varela)
In this talk we give an introduction to big-data processing and review the basic concepts of MapReduce programming with Hadoop. We also discuss the use of Pig to simplify the development of data-processing applications.
YDN Tuesdays are geek meetups organized on the first Tuesday of each month by YDN in London.
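The MapReduce model behind Hadoop reduces, conceptually, to three pure functions: a map phase emitting key/value pairs, a shuffle grouping by key, and a reduce phase aggregating each group. A toy word count in Python (Hadoop distributes exactly this pattern across a cluster):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Emit (word, 1) for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Group pairs by key; Hadoop does this with a distributed sort.
    return groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))

def reduce_phase(grouped):
    # Aggregate the values for each key.
    return {key: sum(v for _, v in values) for key, values in grouped}

docs = ["big data big", "data processing"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 2, 'processing': 1}
```

Pig's value, as the talk notes, is letting you express pipelines like this declaratively instead of writing the phases by hand.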
Streaming machine learning is being integrated in Spark 2.1+, but you don’t need to wait. Holden Karau and Seth Hendrickson demonstrate how to do streaming machine learning using Spark’s new Structured Streaming and walk you through creating your own streaming model. By the end of this session, you’ll have a better understanding of Spark’s Structured Streaming API as well as how machine learning works in Spark.
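Streaming model updates of the kind described above boil down to folding each incoming micro-batch into the current model state. A stdlib sketch of that idea, with a running mean standing in for model parameters (Spark's Structured Streaming API is of course very different):

```python
def update(state, batch):
    # Fold one micro-batch into the (count, total) state.
    count, total = state
    return (count + len(batch), total + sum(batch))

def mean(state):
    # The "model" can be queried at any time while updates continue.
    count, total = state
    return total / count if count else 0.0

state = (0, 0.0)
for micro_batch in [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]:
    state = update(state, micro_batch)
    print(mean(state))  # 1.5, then 2.0, then 3.5
```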
Online learning, Vowpal Wabbit and Hadoop (Héloïse Nonne)
Online learning has recently caught a lot of attention, following some competitions, and especially after Criteo released an 11GB training set for a Kaggle contest.
Online learning makes it possible to process massive data: the learner consumes data sequentially, using a small amount of memory and limited CPU resources. It is also particularly well suited to handling time-evolving data.
Vowpal Wabbit has become quite popular: it is a handy, light and efficient command-line tool for online learning on gigabytes of data, even on a standard laptop with standard memory. After a reminder of the principles of online learning, we present how to run Vowpal Wabbit on Hadoop in a distributed fashion.
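The online-learning loop described above can be sketched in a few lines: one pass over the data, one example at a time, constant memory. This mirrors what Vowpal Wabbit does conceptually (VW itself adds feature hashing, adaptive learning rates, and much more):

```python
import math

def sgd_logistic(stream, n_features, lr=0.5):
    """Online logistic regression: each example updates w once, then is discarded."""
    w = [0.0] * n_features
    for x, y in stream:                        # y in {0, 1}
        z = sum(wi * xi for wi, xi in zip(w, x))
        p = 1.0 / (1.0 + math.exp(-z))         # predicted probability
        g = p - y                              # gradient of the log loss w.r.t. z
        w = [wi - lr * g * xi for wi, xi in zip(w, x)]
    return w

# Tiny separable stream: label is 1 exactly when the first feature is positive.
data = [([1.0, 0.0], 1), ([-1.0, 0.0], 0)] * 50
w = sgd_logistic(iter(data), n_features=2)
print(w[0] > 0)  # True: the learned weight points the right way
```

Memory use is just the weight vector, independent of the stream length, which is why this style scales to data that does not fit in RAM.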
In this talk we present the latest proposals from the Barcelona Supercomputing Center (BSC) for the OpenMP parallel programming model, related to its tasking model. We focus on the opportunities these extensions open up for the runtime that supports parallel execution and for the co-design of runtime-aware architectures. The last part of the talk presents how this task-based model forms the backbone of the two parallelism courses at the Barcelona School of Informatics (FIB) of the Universitat Politècnica de Catalunya (UPC).
Travis Oliphant, "Python for Speed, Scale, and Science" (Fwdays)
Python is sometimes discounted as slow because of its dynamic typing and interpreted nature, and as unsuitable for scale because of the GIL. But in this talk, I will show how, with the help of talented open-source contributors around the world, we have been able to build systems in Python that are fast and scale to many machines, and how this has helped Python take over science.
Slides to support Austin Machine Learning Meetup, 1/19/2015.
Overview of the techniques used in recent Kaggle code to perform online logistic regression with FTRL-proximal (SGD, L1/L2 regularization) and the hash trick.
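The "hash trick" mentioned above maps arbitrarily many raw feature strings into a fixed-size weight vector by hashing feature names, so memory stays bounded no matter how many distinct features appear in the stream. A minimal sketch (the feature names below are hypothetical):

```python
D = 2 ** 20                                   # fixed model size, chosen up front

def hashed_indices(features):
    """Map raw feature strings to bounded indices into a weight vector of size D."""
    return [hash(f) % D for f in features]

row = ["site=example.com", "ad_id=12345", "device=mobile"]
idx = hashed_indices(row)
print(all(0 <= i < D for i in idx))  # True: every feature fits in D slots
```

Collisions (two features hashing to the same slot) are accepted as noise; in practice a large enough D makes them rare, and the model never needs a feature dictionary.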
In this deck, Huihuo Zheng from Argonne National Laboratory presents: Data Parallel Deep Learning.
"The Argonne Training Program on Extreme-Scale Computing (ATPESC) provides intensive, two weeks of training on the key skills, approaches, and tools to design, implement, and execute computational science and engineering applications on current high-end computing systems and the leadership-class computing systems of the future."
Watch the video: https://wp.me/p3RLHQ-lsl
Learn more: https://extremecomputingtraining.anl.gov/archive/atpesc-2019/agenda-2019/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Distributed computing and hyper-parameter tuning with Ray (Jan Margeta)
Come along and learn how to execute your Python computations in parallel, seamlessly scale the training of your machine learning models from a single machine to a cluster, find the right hyper-parameters, and accelerate your Pandas pipelines with large dataframes.
In this talk, we will cover Ray in action with plenty of examples. Ray is a flexible, high-performance distributed execution framework. Ray is well suited to deep learning workflows but its utility goes far beyond that.
Ray has several interesting and unique components, such as actors, or Plasma - an in-memory object store with zero-copy reads (particularly useful for working on large objects), and includes powerful hyper-parameter tuning tools.
We will compare Ray with its alternatives such as Dask or Celery, and see when it is more convenient or where it might even completely replace them.
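The fan-out/gather pattern that Ray provides (via `@ray.remote` tasks collected with `ray.get`) can be approximated locally with the standard library; Ray's value is doing this across a whole cluster with its shared-memory object store. A stdlib sketch of hyper-parameter search in that shape (the toy scoring function is hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

def train(params):
    # Stand-in for a model-training task; returns a (score, params) pair.
    lr = params["lr"]
    return (1.0 - abs(lr - 0.1), params)       # pretend lr=0.1 is optimal

grid = [{"lr": v} for v in (0.001, 0.01, 0.1, 1.0)]
with ThreadPoolExecutor() as pool:
    results = list(pool.map(train, grid))      # fan out the trials, then gather

best_score, best_params = max(results, key=lambda r: r[0])
print(best_params)  # {'lr': 0.1}
```

With Ray the structure is the same, but each `train` call becomes a remote task that can land on any node in the cluster.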
Golang Performance: microbenchmarks, profilers, and a war story (Aerospike)
Slides for Brian Bulkowski's talk about Golang performance:
microbenchmarks, profilers, and a war story about optimizing the Aerospike Database Go client.
http://www.meetup.com/Go-lang-Developers-NYC/events/216650022/
High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...) (EUDAT)
Giuseppe will present the differences between high-performance and high-throughput applications. High-throughput computing (HTC) refers to computations where individual tasks do not need to interact while running. It differs from high-performance computing (HPC), where frequent and rapid exchanges of intermediate results are required to perform the computations. HPC codes are based on tightly coupled MPI, OpenMP, GPGPU, and hybrid programs and require low-latency interconnected nodes. HTC makes use of unreliable components, distributing the work out to every node and collecting results at the end of all parallel tasks.
Visit: https://www.eudat.eu/eudat-summer-school
On the necessity and inapplicability of pythonYung-Yu Chen
Python is a popular scripting language adopted by numerical software vendors to help users solve challenging numerical problems. It provides easy-to-use interface and offers decent speed through array operations, but it is not suitable for engineering the low-level constructs. To make good numerical software, developers need to be familiar with C++ and computer architecture. The gap of understandings between the high-level applications and low-level implementation motivated me to organize a course to train computer scientists what it takes to build numerical software that the users (application experts) want. This talk will portray a bird view of the advantages and disadvantages of Python and where and how C++ should be used in the context of numerical software. The information may be used to map out a plan to acquire the necessary skill sets for making the software.
Recording https://www.youtube.com/watch?v=OwA-Xt_Ke3Y
Online learning with structured streaming, spark summit brussels 2016Ram Sriharsha
Structured Streaming is a new API in Spark 2.0 that simplifies the end to end development of continuous applications. One such continuous application is online model updates: Online models are incrementally updated with new data and can be continuously queried while being updated. As a result, they can be fast to train and leverage new data faster than offline algorithms. In this talk, we give a brief introduction the area of online learning and describe how online model updates can be built using structured streaming APIs. The end result is a robust pipeline for updating models that is scalable, fast and fault tolerant.
Resource-Efficient Deep Learning Model Selection on Apache SparkDatabricks
Deep neural networks (deep nets) are revolutionizing many machine learning (ML) applications. But there is a major bottleneck to broader adoption: the pain of model selection.
From Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNetEric Haibin Lin
Training large deep learning models like Mask R-CNN and BERT takes lots of time and compute resources. Using MXNet, the Amazon Web Services deep learning framework team has been working with NVIDIA to optimize many different areas to cut the training time from hours to minutes.
Greater Chicago Area - Independent Non-Profit Organization Management Professional
View clifford sugerman's professional profile on LinkedIn. LinkedIn is the world's largest business network, helping professionals like clifford sugerman discover.
Cancer cell metabolism: special Reference to Lactate PathwayAADYARAJPANDEY1
Normal Cell Metabolism:
Cellular respiration describes the series of steps that cells use to break down sugar and other chemicals to get the energy we need to function.
Energy is stored in the bonds of glucose and when glucose is broken down, much of that energy is released.
Cell utilize energy in the form of ATP.
The first step of respiration is called glycolysis. In a series of steps, glycolysis breaks glucose into two smaller molecules - a chemical called pyruvate. A small amount of ATP is formed during this process.
Most healthy cells continue the breakdown in a second process, called the Kreb's cycle. The Kreb's cycle allows cells to “burn” the pyruvates made in glycolysis to get more ATP.
The last step in the breakdown of glucose is called oxidative phosphorylation (Ox-Phos).
It takes place in specialized cell structures called mitochondria. This process produces a large amount of ATP. Importantly, cells need oxygen to complete oxidative phosphorylation.
If a cell completes only glycolysis, only 2 molecules of ATP are made per glucose. However, if the cell completes the entire respiration process (glycolysis - Kreb's - oxidative phosphorylation), about 36 molecules of ATP are created, giving it much more energy to use.
IN CANCER CELL:
Unlike healthy cells that "burn" the entire molecule of sugar to capture a large amount of energy as ATP, cancer cells are wasteful.
Cancer cells only partially break down sugar molecules. They overuse the first step of respiration, glycolysis. They frequently do not complete the second step, oxidative phosphorylation.
This results in only 2 molecules of ATP per each glucose molecule instead of the 36 or so ATPs healthy cells gain. As a result, cancer cells need to use a lot more sugar molecules to get enough energy to survive.
Unlike healthy cells that "burn" the entire molecule of sugar to capture a large amount of energy as ATP, cancer cells are wasteful.
Cancer cells only partially break down sugar molecules. They overuse the first step of respiration, glycolysis. They frequently do not complete the second step, oxidative phosphorylation.
This results in only 2 molecules of ATP per each glucose molecule instead of the 36 or so ATPs healthy cells gain. As a result, cancer cells need to use a lot more sugar molecules to get enough energy to survive.
introduction to WARBERG PHENOMENA:
WARBURG EFFECT Usually, cancer cells are highly glycolytic (glucose addiction) and take up more glucose than do normal cells from outside.
Otto Heinrich Warburg (; 8 October 1883 – 1 August 1970) In 1931 was awarded the Nobel Prize in Physiology for his "discovery of the nature and mode of action of the respiratory enzyme.
WARNBURG EFFECT : cancer cells under aerobic (well-oxygenated) conditions to metabolize glucose to lactate (aerobic glycolysis) is known as the Warburg effect. Warburg made the observation that tumor slices consume glucose and secrete lactate at a higher rate than normal tissues.
Multi-source connectivity as the driver of solar wind variability in the heli...Sérgio Sacani
The ambient solar wind that flls the heliosphere originates from multiple
sources in the solar corona and is highly structured. It is often described
as high-speed, relatively homogeneous, plasma streams from coronal
holes and slow-speed, highly variable, streams whose source regions are
under debate. A key goal of ESA/NASA’s Solar Orbiter mission is to identify
solar wind sources and understand what drives the complexity seen in the
heliosphere. By combining magnetic feld modelling and spectroscopic
techniques with high-resolution observations and measurements, we show
that the solar wind variability detected in situ by Solar Orbiter in March
2022 is driven by spatio-temporal changes in the magnetic connectivity to
multiple sources in the solar atmosphere. The magnetic feld footpoints
connected to the spacecraft moved from the boundaries of a coronal hole
to one active region (12961) and then across to another region (12957). This
is refected in the in situ measurements, which show the transition from fast
to highly Alfvénic then to slow solar wind that is disrupted by the arrival of
a coronal mass ejection. Our results describe solar wind variability at 0.5 au
but are applicable to near-Earth observatories.
This pdf is about the Schizophrenia.
For more details visit on YouTube; @SELF-EXPLANATORY;
https://www.youtube.com/channel/UCAiarMZDNhe1A3Rnpr_WkzA/videos
Thanks...!
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...Sérgio Sacani
We characterize the earliest galaxy population in the JADES Origins Field (JOF), the deepest
imaging field observed with JWST. We make use of the ancillary Hubble optical images (5 filters
spanning 0.4−0.9µm) and novel JWST images with 14 filters spanning 0.8−5µm, including 7 mediumband filters, and reaching total exposure times of up to 46 hours per filter. We combine all our data
at > 2.3µm to construct an ultradeep image, reaching as deep as ≈ 31.4 AB mag in the stack and
30.3-31.0 AB mag (5σ, r = 0.1” circular aperture) in individual filters. We measure photometric
redshifts and use robust selection criteria to identify a sample of eight galaxy candidates at redshifts
z = 11.5 − 15. These objects show compact half-light radii of R1/2 ∼ 50 − 200pc, stellar masses of
M⋆ ∼ 107−108M⊙, and star-formation rates of SFR ∼ 0.1−1 M⊙ yr−1
. Our search finds no candidates
at 15 < z < 20, placing upper limits at these redshifts. We develop a forward modeling approach to
infer the properties of the evolving luminosity function without binning in redshift or luminosity that
marginalizes over the photometric redshift uncertainty of our candidate galaxies and incorporates the
impact of non-detections. We find a z = 12 luminosity function in good agreement with prior results,
and that the luminosity function normalization and UV luminosity density decline by a factor of ∼ 2.5
from z = 12 to z = 14. We discuss the possible implications of our results in the context of theoretical
models for evolution of the dark matter halo mass function.
Richard's entangled aventures in wonderlandRichard Gill
Since the loophole-free Bell experiments of 2020 and the Nobel prizes in physics of 2022, critics of Bell's work have retreated to the fortress of super-determinism. Now, super-determinism is a derogatory word - it just means "determinism". Palmer, Hance and Hossenfelder argue that quantum mechanics and determinism are not incompatible, using a sophisticated mathematical construction based on a subtle thinning of allowed states and measurements in quantum mechanics, such that what is left appears to make Bell's argument fail, without altering the empirical predictions of quantum mechanics. I think however that it is a smoke screen, and the slogan "lost in math" comes to my mind. I will discuss some other recent disproofs of Bell's theorem using the language of causality based on causal graphs. Causal thinking is also central to law and justice. I will mention surprising connections to my work on serial killer nurse cases, in particular the Dutch case of Lucia de Berk and the current UK case of Lucy Letby.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.Sérgio Sacani
The return of a sample of near-surface atmosphere from Mars would facilitate answers to several first-order science questions surrounding the formation and evolution of the planet. One of the important aspects of terrestrial planet formation in general is the role that primary atmospheres played in influencing the chemistry and structure of the planets and their antecedents. Studies of the martian atmosphere can be used to investigate the role of a primary atmosphere in its history. Atmosphere samples would also inform our understanding of the near-surface chemistry of the planet, and ultimately the prospects for life. High-precision isotopic analyses of constituent gases are needed to address these questions, requiring that the analyses are made on returned samples rather than in situ.
Nutraceutical market, scope and growth: Herbal drug technologyLokesh Patil
As consumer awareness of health and wellness rises, the nutraceutical market—which includes goods like functional meals, drinks, and dietary supplements that provide health advantages beyond basic nutrition—is growing significantly. As healthcare expenses rise, the population ages, and people want natural and preventative health solutions more and more, this industry is increasing quickly. Further driving market expansion are product formulation innovations and the use of cutting-edge technology for customized nutrition. With its worldwide reach, the nutraceutical industry is expected to keep growing and provide significant chances for research and investment in a number of categories, including vitamins, minerals, probiotics, and herbal supplements.
4. Collaborative Statistical Modeling
• RooFit: build models together
• Teams: 10-100 physicists
• Collaborations: ~3000 → ~100 teams
• 1 goal
• Pretty impressive to an outsider
5. Collaborative Statistical Modeling with RooFit
Making RooFit faster (~30x; ~hours → ~minutes)
• More efficient collaboration
• Faster iteration/debugging
• Faster feedback between teams
• Next-level physics modeling ambitions, retaining interactive workflow
1. Complex likelihood models, e.g.
   a) Higgs fit to all channels, ~200 datasets, O(1000) parameters, now O(few) hours
   b) EFT framework: again 10-100x more expensive
2. Unbinned ML fits with very large data samples
3. Unbinned ML fits with MC-style numeric integrals
Higgs @ ATLAS: 20k+ nodes, 125k hours
Expression tree of C++ objects for mathematical components (variables, operators, functions, integrals, datasets, etc.)
Coupled with data, event "observables"
7. Making fitting in RooFit faster: how?
Serial: benchmarks show no obvious bottlenecks; RooFit is already highly optimized (pre-calculation/memoization, MPFE)
Parallel
8. Faster fitting: (how) can we do it?
Levels of parallelism:
1. Gradient (parameter partial derivatives) in minimizer
2. Likelihood
3. Integrals (normalization) & other expensive shared components
[Diagram: "Vector" elements at each level: likelihood over events; likelihood over (unequal) components; integrals etc.]
9. Faster fitting: (how) can we do it?
Heterogeneous: sizes, types
• Multiple strategies
• How to split up?
• Small components → need low latency/overhead
• Large components as well…
• How to divide over cores?
• Load balancing → task-based approach: work stealing
10. Design: MultiProcess task-stealing framework
Task stealing: a worker pool executes Job tasks
No threads, process-based: "bipes" (BidirMMapPipe) handle fork, mmap, pipes
[Diagram: Master ↔ Queue ↔ Worker 1, Worker 2, …, connected by bipes]
Master: main RooFit process; submits Jobs to the queue, waits for results (or does other things in between)
Worker loop: worker requests Job task → Queue pops task → worker executes task → worker sends result to Queue → … repeat … → Job done: Queue sends to Master on request
Queue loop: act on input from Master or Workers (mainly to avoid a loop in Master / user code)
[Class diagram: Serial class (likelihood, gradient, …) + MP::Vector / MP::Job / MP::TaskManager → Parallelized class]
template <class T> class MP::Vector : public T, public MP::Job
11. MultiProcess usage for devs
template <class T> class MP::Vector : public T, public MP::Job
class Parallel : public MP::Vector<Serial>
[Class diagram: Serial class + MP::Vector / MP::Job / MP::TaskManager → Parallelized class]
12. MultiProcess usage for devs
class xSquaredSerial {
public:
xSquaredSerial(vector<double> x_init)
: x(move(x_init))
, x_squared(x.size()) {}
virtual void evaluate() {
for (size_t ix = 0; ix < x.size(); ++ix) {
x_squared[ix] = x[ix] * x[ix];
}
}
vector<double> get_result() {
evaluate();
return x_squared;
}
protected:
vector<double> x;
vector<double> x_squared;
};
class xSquaredParallel
: public RooFit::MultiProcess::Vector<xSquaredSerial> {
public:
xSquaredParallel(size_t N_workers, vector<double> x_init) :
RooFit::MultiProcess::Vector<xSquaredSerial>(N_workers, x_init)
{}
private:
void evaluate_task(size_t task) override {
results[task] = x[task] * x[task];
}
public:
void evaluate() override {
if (get_manager()->is_master()) {
// do necessary synchronization before work_mode
// enable work mode: workers will start stealing work from queue
get_manager()->set_work_mode(true);
// master fills queue with tasks
for (size_t task_id = 0; task_id < x.size(); ++task_id) {
get_manager()->to_queue(JobTask(id, task_id));
}
// wait for task results back from workers to master
gather_worker_results();
// end work mode
get_manager()->set_work_mode(false);
// put gathered results in desired container (same as used in serial class)
for (size_t task_id = 0; task_id < x.size(); ++task_id) {
x_squared[task_id] = results[task_id];
}
}
}
};
13. MultiProcess for users
vector<double> x {1, 4, 5, 6.48074};
xSquaredSerial xsq_serial(x);
size_t N_workers = 4;
xSquaredParallel xsq_parallel(N_workers, x);
// get the same results, but now faster:
xsq_serial.get_result();
xsq_parallel.get_result();
// use parallelized version in your existing functions
void some_function(xSquaredSerial* xsq);
some_function(&xsq_parallel); // no problem!
15. Parallel likelihood fits: unbinned, MPFE
Before: max ~2x
Now (with CPU affinity fixed): max ~20x (more for larger fits)
[Plot: run-time vs N(cores): actual performance vs expected performance (ideal parallelization)]
16. Parallel likelihood fits: binned
[Plot: run-time vs N(cores) in binned fits: actual performance vs expected performance (ideal parallelization) and CPU time (single core); room for improvement, WIP]
17. Gradient parallelization
0th step: get Minuit to use an external derivative
1st step: replicate Minuit2 behavior
• NumericalDerivator (Lorenzo)
• Modified to exactly (floating-point bit-wise) replicate Minuit2
• → RooGradMinimizer
2nd step: calculate the partial derivative for each parameter in parallel
18. Gradient parallelization
First benchmarks (yesterday): ggF workspace (Carsten), migrad fit
Scaling not perfect and erratic (+/- 5s); similar to what we saw for likelihoods without CPU pinning; probably due to too much synchronization

         | RooMinimizer | MultiProcess GradMinimizer
         |              | 1 worker | 2 workers | 3 workers | 4 workers | 6 workers | (…) | 8 workers
run time | 28s          | 33s      | 20s       | 15s       | 14s       | 17s       | (…) | 11s
21. Future work
Load balancing
PDF timings change dynamically due to RooFit precalculation strategies
… not a problem for numerical integrals
Analytical derivatives (automated? CLAD)
23. Numerical integrals
[Plot: individual NI timings, with variation over runs and iterations; maxima and minima; sum of the slowest integrals/cores per iteration over the entire run (single-core total runtime: 3.2s)]
24. Faster fitting: MultiProcess design
RooFit::MultiProcess::Vector<YourSerialClass>
Serial class: likelihood (e.g. RooNLLVar) or gradient (Minuit)
Interface: subclass + MP
Define "vector elements"
Group elements into tasks (to be executed in parallel)
RooFit::MultiProcess::SharedArg<T>
RooFit::MultiProcess::TaskManager
25. Faster fitting: MultiProcess design
RooFit::MultiProcess::Vector<YourSerialClass>
RooFit::MultiProcess::SharedArg<T>
Normalization integrals or other shared expensive objects
Parallel task definition specific to type of object
… design in progress
RooFit::MultiProcess::TaskManager
26. Faster fitting: MultiProcess design
RooFit::MultiProcess::Vector<YourSerialClass>
RooFit::MultiProcess::SharedArg<T>
RooFit::MultiProcess::TaskManager
Queue gathers tasks and communicates with worker pool
Workers steal tasks from queue
Worker pool: forked processes (BidirMMapPipe)
• performant and already used in RooFit
• no thread-safety concerns
• instead: communication concerns
• … flexible design, implementation can be replaced (e.g. TBB)
28. Faster fitting: single-core profiling with Callgrind, Cachegrind, Instruments
Higgs ggF & 9-channel fits (workspaces by Lydia Brenner)
Most time spent on:
1. Memory access → RooVectorDataStore::get() (4% / 32%), 0.3% LL cache misses (expensive!)
   • Row-wise access pattern on a column-wise data store (and std::vector<std::vector>)
2. Logarithms: 12%
3. Interpolation → RooStats::HistFactory::FlexibleInterpVar (10%)
29. Faster fitting: single core improvements
RooLinkedList::findArg: ~ 5% of memory access instructions
RooLinkedList::At took considerable time in Gaussian test fit (Vince)
std::vector lookup à 1.6x speedup! WIP
30. Faster fitting: future work
Reorder tree evaluation → CPU cache use, vectorization
Smarter fitting (stochastic minimizer, analytical gradient, CLAD)
Front-end / back-end separation (e.g. TensorFlow back-end)
31. Faster fitting: single-core profiling meta-conclusions
Profiling functions & classes: Valgrind, gprof, Instruments, … etc.
Profiling objects (e.g. call-trees, e.g. RooFit…): … DIY?
33. Parallel likelihood fits: existing RooFit implementation details
RooRealMPFE / BidirMMapPipe
Custom multi-process message-passing protocol
• POSIX fork, pipe, mmap
Communication "overhead" (delay between sending and receiving messages): ~1e-4 seconds
• serverLoop waits for a message & runs server-side code
• messages used sparingly
• data transfer over memory-mapped pipes
34. TensorFlow experiments
Fits on identical model & data (single i7 machine)
TensorFlow: no pre-calculation / caching!
Major advantage of RooFit for binned fits (e.g. morphing histograms)
(feature request for memoization: https://github.com/tensorflow/tensorflow/issues/5323)
N.B.: measured before the CPU affinity fix; RooFit is now even faster (but limited to running on one machine)

             | RooFit (MINUIT) | TensorFlow (BFGS)
Unbinned fit | 0.1s            | 0.01 - 0.1s (dep. on precision)
Binned fit   | 0.7ms           | 2.3ms