These slides outline the common distributed computing abstractions needed to implement data science at scale. They start with a characterization of the computations required to realize common machine learning algorithms at scale. Introductions to Hadoop MapReduce, Spark, and GraphLab are currently covered. Going forward, we shall add Flink, Titan, and TensorFlow, cover how to realize machine learning/deep learning algorithms on top of these frameworks, and discuss the trade-offs between them.
This deck covers some of the open problems in the big data analytics space, starting with a discussion of state-of-the-art analytics using Spark and Hadoop YARN. It examines whether each of these is the appropriate technology and explores alternatives wherever possible. It ends with a discussion of an important open problem: how to build a single system that handles big data pipelines without explicit data transfers.
Big data analytics beyond Hadoop – the "seven giants" categorization of computing/ML problems. Hadoop is good for giant 1, whereas Spark is good for giants 2, 3, and 4. GraphLab is appropriate for giant 5, while Storm is good for real-time processing.
Covers the basics of artificial neural networks and the motivation for deep learning, and explains certain deep learning networks, including deep belief networks and autoencoders. It also details the challenges of implementing a deep learning network at scale and explains how we implemented a distributed deep learning network over Spark.
Graph Databases and Machine Learning | November 2018 – TigerGraph
Graph Databases and Machine Learning: Finding a Happy Marriage. Graph databases and machine learning both represent powerful tools for getting more value from data; learn how they can form a harmonious marriage to up-level machine learning.
Predictive Maintenance Using Recurrent Neural Networks – Justin Brandenburg
My presentation from AnacondaCON 2018, where I discussed using recurrent neural networks, Python, TensorFlow, and the MapR Platform to develop and deploy a predictive maintenance model for an IoT device in the manufacturing industry.
Leveraging NLP and Deep Learning for Document Recommendations in the Cloud – Databricks
Efficient recommender systems are critical to the success of many industries, such as job recommendation, news recommendation, and e-commerce. This talk illustrates how to build an efficient document recommender system by leveraging Natural Language Processing (NLP) and Deep Neural Networks (DNNs). The end-to-end flow of the document recommender system is built on AWS at scale, using Analytics Zoo for Spark and BigDL. The system first turns text-rich documents into embeddings by incorporating Global Vectors (GloVe), then trains a K-means model using native Spark APIs to cluster users into several groups. It further trains a recommender model for each group and gives an ensemble prediction for each test record. By adopting the end-to-end Analytics Zoo pipeline, we saw roughly a 10% improvement in mean reciprocal rank and a 6% improvement in precision over the search recommendations in a job recommendation study.
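The clustering step this abstract describes (documents turned into embeddings, then users grouped with K-means) can be sketched without any of the stack it names. The following is a minimal plain-NumPy illustration, not the Analytics Zoo/Spark code; the tiny "GloVe" table, documents, and cluster count are all invented for the example:

```python
import numpy as np

def doc_embedding(tokens, glove, dim):
    """Average per-word GloVe-style vectors into one document vector."""
    vecs = [glove[t] for t in tokens if t in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def kmeans(X, k, iters=20, seed=0):
    """Tiny k-means, standing in for the native Spark K-means step."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Invented toy "GloVe" table with two obvious topic groups.
glove = {"job": np.array([1.0, 0.0]), "hire": np.array([0.9, 0.1]),
         "news": np.array([0.0, 1.0]), "press": np.array([0.1, 0.9])}
docs = [["job", "hire"], ["hire"], ["news", "press"], ["news"]]
X = np.stack([doc_embedding(d, glove, dim=2) for d in docs])
labels, _ = kmeans(X, k=2)  # job-related docs and news docs separate
```

In the full pipeline a separate recommender model would then be trained per cluster, and their predictions ensembled.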
Speaker: Guoqiong Song
Comparing Big Data and Simulation Applications and Implications for Software ... – Geoffrey Fox
At eScience in the Cloud 2014, Redmond WA, April 30 2014
There is perhaps a broad consensus as to important issues in practical parallel computing as applied to large scale simulations; this is reflected in supercomputer architectures, algorithms, libraries, languages, compilers and best practice for application development.
However, the same is not so true for data-intensive computing, even though commercial clouds devote much more resources to data analytics than supercomputers devote to simulations.
We look at a sample of over 50 big data applications to identify characteristics of data intensive applications and to deduce needed runtime and architectures.
We suggest a big data version of the famous Berkeley dwarfs and NAS parallel benchmarks.
Our analysis builds on combining HPC and the Apache software stack that is well used in modern cloud computing.
Initial results on Azure and HPC clusters are presented.
Talk given by Pedro Mário Cruz e Silva, Solution Architect at NVIDIA, as part of the program of the VIII Semana de Inverno de Geofísica (Winter School of Geophysics), on 19 July 2017.
State of the Art Robot Predictive Maintenance with Real-time Sensor Data – Mathieu Dumoulin
Our Strata Beijing 2017 presentation slides, where we show how to use data from a movement sensor, in real time, to do anomaly detection at scale using standard enterprise big data software.
II-SDV 2017: The Next Era: Deep Learning for Biomedical Research – Dr. Haxel Consult
Deep learning is hot, making waves, delivering results, and is somewhat of a buzzword today. There is a desire to apply deep learning to anything digital. Unlike the brain, these artificial neural networks have a very strict, predefined structure. The brain is made up of neurons that talk to each other via electrical and chemical signals; artificial neural networks do not differentiate between these two types of signals. They are essentially a series of advanced statistics-based exercises that review the past to indicate the likely future. Another buzzword used over the last few years across all industries is "big data". In biomedical and health sciences, both unstructured and structured information constitute "big data". On the one hand, deep learning needs a lot of data; on the other, "big data" has value only when it generates actionable insight. Given this, the two areas are destined to be married. The time is ripe for a synergistic association that will benefit pharmaceutical companies. It may be only a short time before we have vice presidents of machine learning or deep learning in pharmaceutical and biotechnology companies. This presentation reviews the prominent deep learning methods and discusses their usefulness in biomedical and health informatics.
Python is dominating the fast-growing data-science landscape. This talk provides a foundational overview of the practice of data science and some of the most popular Python libraries for doing data science. It also provides an overview of how Anaconda brings it all together.
Fast data in times of crisis with GPU accelerated database QikkDB | Business ... – Matej Misik
Graphics cards (GPUs) open up new ways of processing and analyzing big data, enabling millisecond selections over billions of rows, as well as telling stories about data. #QikkDB
How do you present data so that everyone understands it? Data analysis is for scientists, but data storytelling is for everyone: managers, product owners, sales teams, the general public. #TellStory
Learn about high-performance computing with GPUs and how to present data, with a rich COVID-19 data story example, in the upcoming webinar.
5th Multicore World
15-17 February 2016 – Shed 6, Wellington, New Zealand
http://openparallel.com/multicore-world-2016/
We start by dividing applications into data plus model components and classifying each component (whether from Big Data or Big Simulations) in the same way. This leads to 64 properties divided into 4 views: Problem Architecture (macro patterns); Execution Features (micro patterns); Data Source and Style; and finally the Processing (runtime) view.
We discuss convergence software built around HPC-ABDS (High Performance Computing enhanced Apache Big Data Stack) http://hpc-abds.org/kaleidoscope/ and show how one can merge Big Data and HPC (Big Simulation) concepts into a single stack.
We give examples of data analytics running on HPC systems including details on persuading Java to run fast.
Some details can be found at http://dsc.soic.indiana.edu/publications/HPCBigDataConvergence.pdf
Graph Data: a New Data Management Frontier – Demai Ni
Graph Data: a New Data Management Frontier -- Huawei’s view and Call for Collaboration by Demai Ni:
Huawei provides enterprise databases and is actively exploring the latest technology to provide an end-to-end data management solution on the cloud. We are looking to bridge classic RDBMS to graph databases on a distributed platform.
Big Data Analysis in Hydrogen Station using Spark and Azure ML – Jongwook Woo
A decision forest machine learning algorithm is adopted to find the features that affect the temperature of the fueling valve and controller, and to predict that temperature.
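As a rough illustration of the ensemble idea behind a decision forest (not the actual Azure ML model), here is a toy plain-Python sketch: several one-feature regression stumps are fit and their predictions averaged. The features and data are invented stand-ins for sensor readings:

```python
def fit_stump(X, y, feature):
    """One-feature regression stump: split at the feature's median,
    predict the mean target value on each side of the split."""
    vals = sorted(row[feature] for row in X)
    thresh = vals[len(vals) // 2]
    left = [t for row, t in zip(X, y) if row[feature] <= thresh]
    right = [t for row, t in zip(X, y) if row[feature] > thresh]
    mean = lambda s: sum(s) / len(s) if s else sum(y) / len(y)
    return feature, thresh, mean(left), mean(right)

def forest_predict(stumps, row):
    """Average the stump predictions -- the 'forest' ensembling step."""
    preds = [(lo if row[f] <= t else hi) for f, t, lo, hi in stumps]
    return sum(preds) / len(preds)

# Invented toy data: [ambient temp, flow rate] -> valve temperature.
X = [[10, 1.0], [12, 1.1], [30, 2.0], [32, 2.2]]
y = [5.0, 6.0, 20.0, 22.0]
stumps = [fit_stump(X, y, f) for f in (0, 1)]
pred = forest_predict(stumps, [31, 2.1])  # a hot, high-flow reading
```

A real decision forest also randomizes the training subsets and grows full trees, but the averaging of many weak learners shown here is the core of the method.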
An introduction to streaming data; the difference between batch processing and stream processing; research issues in streaming data processing; performance evaluation metrics; and tools for stream processing.
In this deck from the HPC User Forum in Milwaukee, Tim Barr from Cray presents: Perspective on HPC-enabled AI.
"Cray’s unique history in supercomputing and analytics has given us front-line experience in pushing the limits of CPU and GPU integration, network scale, tuning for analytics, and optimizing for both model and data parallelization. Particularly important to machine learning is our holistic approach to parallelism and performance, which includes extremely scalable compute, storage and analytics."
Watch the video: https://wp.me/p3RLHQ-hpw
Learn more: http://cray.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Deep Learning: Evolution of ML from Statistical to Brain-like Computing – Data... – Impetus Technologies
Presentation on 'Deep Learning: Evolution of ML from Statistical to Brain-like Computing'
Speaker: Dr. Vijay Srinivas Agneeswaran, Director, Big Data Labs, Impetus
The main objective of the presentation is to give an overview of our cutting-edge work on realizing distributed deep learning networks over GraphLab. The objectives can be summarized as follows:
- First-hand experience and insights into implementation of distributed deep learning networks.
- Thorough view of GraphLab (including descriptions of code) and the extensions required to implement these networks.
- Details of how the extensions were realized/implemented in GraphLab source – they have been submitted to the community for evaluation.
- Arrhythmia detection use case as an application of the large scale distributed deep learning network.
This is a one-hour presentation on neural networks, deep learning, computer vision, recurrent neural networks, and reinforcement learning. The later talks have links on how to run neural networks on
LAD GroundBreakers – Jul 2019 – Introduction to Machine Learning – From DBA's ... – Sandesh Rao
Come to this session for a deep dive into what you really need to know to make your DBA career thrive in an autonomous-driven world: create cloud-scale automation; assess, score, and remediate IT and business compliance violations; get real-time insights into log data to find anomalies and ensure early detection of potential problems; and enable rapid detection, investigation, and remediation of a range of security threats across databases. The session features the self-driving, self-securing, and self-repairing capabilities available across data management, application development, analytics, security, and management. Learn how DBAs can easily configure log collection and efficiently analyze logs from their database environment to rapidly troubleshoot problems, see how to use simple analysis to find errors across different log sources, and experience the power of machine learning techniques to rapidly identify anomalies that lead to a problem's root cause and eliminate finger-pointing, as well as to identify performance issues and steady windows for maintenance activities.
LAD GroundBreakers – Jul 2019 – The Machine Learning behind the Autonomous Dat... – Sandesh Rao
We are entering a new era in the database with the introduction of the Oracle Autonomous Database. AI and machine learning are center stage in most projects and assist in making complex decisions that were not possible before. Most data science projects don't get beyond the data scientist and rarely operationalize their predictive models, and new toolsets and methods appear every day, making this an extremely dynamic space. Different categories of users want to use these algorithms and toolsets but don't know where to start: you may be a data scientist who wants to play with data and build your own models, someone who wants to use the database's built-in models, or a user of specific AI services within a vertical such as insurance or healthcare. We will take a glimpse at Oracle's Machine Learning Zeppelin-based notebooks for Oracle Autonomous Data Warehouse Cloud, look at how Oracle uses AIOps and applied machine learning for its own operations, and cover the Oracle AI Platform Cloud Service, to provide an all-round view of what Oracle is up to in this space.
Chatbots have entered our lives unknowingly. Little do we realize that when that little window pops up asking if we need support or help, it could just be a chatbot we are talking to...
Big Data in the Cloud: How the RISElab Enables Computers to Make Intelligent ... – Amazon Web Services
Scientists, developers, and other technologists from many different industries are taking advantage of Amazon Web Services to perform big data workloads from analytics to using data lakes for better decision making to meet the challenges of the increasing volume, variety, and velocity of digital information. This session will feature UCB's RISELab (Real time Intelligent Secure Execution), a new lab recently created at UCB to enable computers to make intelligent, real-time decisions. You will hear how they are building on their earlier success with AMPLab to enable applications to interact intelligently and securely with their environment in real time, wherever computing decisions need to interact with the world. From cybersecurity to coordinating fleets of self-driving cars and drones to earthquake warning systems, you will come away with insight on how they are using AWS to develop and experiment with the systems for important research. Learn More: https://aws.amazon.com/government-education/
Spark and Deep Learning Frameworks at Scale 7.19.18 – Cloudera, Inc.
We'll outline approaches for preprocessing, training, inference, and deployment across datasets (time series, audio, video, text, etc.) that leverage Spark, along with its extended ecosystem of libraries and deep learning frameworks using Cloudera's Data Science Workbench.
Much of the world's data is sequential – think speech, text, DNA, stock prices, financial transactions, and customer action histories. Modern methods for modelling sequence data are often deep learning-based, composed of either recurrent neural networks (RNNs) or attention-based Transformers. A tremendous amount of research progress has recently been made in sequence modelling, particularly in its application to NLP problems. However, the inner workings of these sequence models can be difficult to dissect and intuitively understand.
This presentation/tutorial will start from the basics and gradually build upon concepts in order to impart an understanding of the inner mechanics of sequence models – why do we need specific architectures for sequences at all, when you could use standard feed-forward networks? How do RNNs actually handle sequential information, and why do LSTM units help longer-term remembering of information? How can Transformers do such a good job at modelling sequences without any recurrence or convolutions?
In the practical portion of this tutorial, attendees will learn how to build their own LSTM-based language model in Keras. A few other use cases of deep learning-based sequence modelling will be discussed – including sentiment analysis (prediction of the emotional valence of a piece of text) and machine translation (automatic translation between different languages).
The goals of this presentation are to provide an overview of popular sequence-based problems, impart an intuition for how the most commonly-used sequence models work under the hood, and show that quite similar architectures are used to solve sequence-based problems across many domains.
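The question posed above – how an RNN actually handles sequential information – can be shown in a few lines. This is a hedged NumPy sketch of a vanilla RNN forward pass with random, untrained weights (not the tutorial's Keras LSTM code); it demonstrates that the hidden state carries earlier inputs forward, so changing the first input changes every later state:

```python
import numpy as np

def rnn_forward(xs, Wxh, Whh, bh):
    """Vanilla RNN: each hidden state is computed from the PREVIOUS
    hidden state and the current input -- this recurrence is how
    sequence order is encoded."""
    h = np.zeros(Whh.shape[0])
    states = []
    for x in xs:
        h = np.tanh(Wxh @ x + Whh @ h + bh)
        states.append(h)
    return states

rng = np.random.default_rng(0)
Wxh = rng.normal(size=(4, 3))           # input-to-hidden weights
Whh = rng.normal(size=(4, 4)) * 0.5     # hidden-to-hidden (recurrent) weights
bh = np.zeros(4)

xs = [rng.normal(size=3) for _ in range(5)]
states = rnn_forward(xs, Wxh, Whh, bh)

# Perturb only the first input: the change propagates through the recurrence.
xs2 = [xs[0] + 1.0] + xs[1:]
states2 = rnn_forward(xs2, Wxh, Whh, bh)
```

An LSTM replaces the single `tanh` update with gated updates to a cell state, which is what lets information survive over longer spans, but the recurrence structure is the same.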
Synthetic dialogue generation with Deep Learning – S N
A walkthrough of a deep learning-based technique for generating TV scripts using a recurrent neural network. The model generates a completely new TV script for a scene after being trained on a dataset. You will learn the concepts around RNNs, NLP, and various deep learning techniques.
Technologies to be used:
Python 3, Jupyter, TensorFlow
Source code: https://github.com/syednasar/talks/tree/master/synthetic-dialog
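One small piece of script generation that such a model relies on is drawing the next token from the network's output scores. A minimal sketch, assuming temperature-scaled softmax sampling (the vocabulary and logits below are invented, not from the linked repo):

```python
import math
import random

def sample_next(logits, temperature=1.0, rng=random):
    """Softmax with temperature over the network's raw output scores,
    then sample a token index. Lower temperature -> safer, more
    repetitive text; higher temperature -> more surprising text."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1

vocab = ["hello", "world", "<eol>"]
logits = [2.0, 1.0, 0.5]   # pretend RNN output for the current step
random.seed(42)
token = vocab[sample_next(logits, temperature=0.8)]
```

During generation this is called once per step, feeding each sampled token back in as the next input.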
Deep learning (DL) is still one of the fastest-developing areas in machine learning. As models increase in complexity and data sets grow in size, model training can last hours or even days. In this session we explore some of the trends in deep neural networks for accelerating training by parallelizing/distributing deep learning.
We also present how to apply some of these strategies using Cloudera Data Science Workbench and popular open-source DL frameworks like Uber's Horovod, TensorFlow, and Keras.
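The core of the data-parallel strategy behind tools like Horovod is averaging per-worker gradients before each shared weight update. Here is a single-process NumPy sketch of that idea, not Horovod itself: the allreduce is simulated with a plain mean over shards, and the linear-regression task is invented:

```python
import numpy as np

def local_gradient(w, X, y):
    """Gradient of mean squared error for a linear model on one data shard."""
    return 2 * X.T @ (X @ w - y) / len(y)

def data_parallel_step(w, shards, lr=0.1):
    """Each 'worker' computes a gradient on its own shard; averaging the
    gradients (Horovod's allreduce) yields one shared update."""
    grads = [local_gradient(w, X, y) for X, y in shards]
    avg = np.mean(grads, axis=0)   # stands in for the allreduce
    return w - lr * avg

rng = np.random.default_rng(1)
true_w = np.array([2.0, -1.0])
X = rng.normal(size=(100, 2))
y = X @ true_w

# Split the data across two simulated workers.
shards = [(X[:50], y[:50]), (X[50:], y[50:])]
w = np.zeros(2)
for _ in range(200):
    w = data_parallel_step(w, shards)   # w converges toward true_w
```

Because the shards are equal-sized here, the averaged gradient equals the full-batch gradient, which is exactly why data parallelism preserves the training result while spreading the compute.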
Speakers
Rafael Arana, Senior Solutions Architect
Cloudera
Zuling Kang, Senior Solutions Architect
Cloudera Inc.
A talk given at VT Code Camp 2019 covering a variety of big data infrastructures. High level summary of distributed relational databases, NoSQL databases, ETL processes, high throughput computing, high performance computing, and hybrid systems.
AI for Manufacturing (Machine Vision, Edge AI, Federated Learning) – byteLAKE
This is the extended presentation about byteLAKE's and Lenovo's Artificial Intelligence solutions for Manufacturing.
Topics covered: AI strategy for manufacturing, Edge AI, Federated Learning and Machine Vision.
It's the first publication in the upcoming series: AI for Manufacturing. Highlights: AI-assisted quality monitoring automation, AI-assisted production line monitoring and issues detection, AI-assisted measurements, Intelligent Cameras and many more. Reach out to us to learn more: welcome@byteLAKE.com.
Presented during the world's first Federated Learning conference (Jun'20). Recording: https://youtu.be/IMqRIi45dDA
Related articles:
- Revolution in factories: Industry 4.0.
https://medium.com/@marcrojek/revolution-in-factories-industry-4-0-conference-made-in-wroclaw-2020-translation-ae96e5e14d55
- Cognitive Automation helps where RPAs fall short.
https://medium.com/@marcrojek/cognitive-automation-helps-where-rpas-fall-short-a1c5a01a66f8
- Machine Vision, how AI brings value to industries.
https://medium.com/@marcrojek/machine-vision-how-ai-brings-value-to-industries-e6a4f8e56f42
Learn more:
- https://www.bytelake.com/en/cognitive-services/
- https://www.lenovo.com/ai
- https://federatedlearningconference.com/
Self driving computers active learning workflows with human interpretable ve... – Adam Gibson
Human-in-the-loop learning workflows that leverage deep learning to group and cluster data, plus techniques for accounting for machine learning failures.
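A common selection rule in such human-in-the-loop workflows is margin-based uncertainty sampling: route the items the model is least sure about to the human annotator. A minimal sketch with an invented toy model (the real workflow would use the deep model's predicted class probabilities):

```python
def uncertainty_sample(pool, predict_proba, k=2):
    """Pick the k unlabeled items the model is least certain about --
    these are the ones sent to the human annotator in the loop."""
    def margin(item):
        probs = sorted(predict_proba(item), reverse=True)
        return probs[0] - probs[1]   # small margin between the top two
                                     # classes = high uncertainty
    return sorted(pool, key=margin)[:k]

# Invented toy 1-D "model": probability of class 1 rises with x,
# so the model is most uncertain near x = 0.5.
def predict_proba(x):
    p1 = min(max(x, 0.0), 1.0)
    return [1 - p1, p1]

pool = [0.05, 0.48, 0.52, 0.95]
queries = uncertainty_sample(pool, predict_proba, k=2)
# The borderline points (0.48, 0.52) are selected for human labeling.
```

The human's labels for the queried items are then added to the training set and the model is retrained, closing the loop.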
Top 5 In-demand Technologies to Learn in 2020 – Intellipaat
Intellipaat Online Courses on top trending IT technologies : https://intellipaat.com/course-cat/big-data-analytics-courses/
Expert written Tutorials : https://intellipaat.com/blog/tutorials/
Latest Blogs : https://intellipaat.com/blog/blog-category/
Top 5 In-demand Technologies to Learn in 2020 – Intellipaat
Youtube Link : https://www.youtube.com/watch?v=-tmEE2KsvzM
Explains how deep learning produces howlers with commonly used image annotation tools; we have identified several such howlers. Essentially, this presentation outlines the deficiencies of deep learning networks. We also explain the theoretical reasoning behind them, building on Bengio's recent paper. The presentation also covers approaches that address these gaps, such as capsule networks, transfer learning, meta-learning, and federated learning.
This contains the agenda of the Spark Meetup I organised in Bangalore on Friday, the 23rd of Jan 2014. It carries the slides for the talk I gave on distributed deep learning over Spark
This deck was presented at the Spark meetup at Bangalore. The key idea behind the presentation was to focus on limitations of Hadoop MapReduce and introduce both Hadoop YARN and Spark in this context. An overview of the other aspects of the Berkeley Data Analytics Stack was also provided.
This gives a characterization of the machine learning computations and brings out the deficiencies of Hadoop 1.0. It gives the motivation for Hadoop YARN and a brief view of YARN architecture. It illustrates the power of specialized processing frameworks over YARN, such as Spark and GraphLab. In short, Hadoop YARN allows your data to be stored in HDFS and specialized processing frameworks may be used to process the data in various ways.
This was the deck I used for the Hadoop Meetup talk at Bangalore on 18th of July 2013. The talk was titled "Big-data Analytics: Need to Look Beyond Hadoop?"
Big-data analytics beyond Hadoop – big data is not equal to Hadoop, especially for iterative algorithms! Lots of alternatives have emerged. Spark and GraphLab are the most interesting next-generation platforms for analytics.
Data Centers – Striving Within A Narrow Range – Research Report – MCG – May 2... – pchutichetpong
M Capital Group (“MCG”) expects demand to grow and the supply landscape to evolve, driven by institutional investment rotating out of offices and into work from home (“WFH”), alongside the ever-expanding need for data storage as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as advancing cloud services and edge sites, allowing the industry to expect strong annual growth of 13% over the next 4 years.
Competitive headwinds remain, exemplified by the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”. Even so, the industry has made key adjustments, and MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment, will drive market momentum forward. The continuous injection of capital by alternative investment firms, along with growing infrastructure investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x by value by 2026, will likely help propel data center provision and innovation. These factors paint a promising picture for industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Adjusting primitives for graph : SHORT REPORT / NOTES - Subhajit Sahu
Graph algorithms, like PageRank, often operate on Compressed Sparse Row (CSR), an adjacency-list based graph representation that stores each vertex's neighbor list contiguously in a single array, indexed by a per-vertex offset array.
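As a concrete illustration, CSR can be sketched in a few lines of Python. This is a minimal sketch for intuition, not the report's actual implementation; the example graph and names are invented.

```python
# Minimal CSR (Compressed Sparse Row) adjacency representation.
# Vertices are 0..n-1; offsets[v] marks where vertex v's neighbor
# list begins inside the flat 'targets' array.
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]  # example directed graph
n = 3

# Count out-degrees, then prefix-sum them into offsets.
degree = [0] * n
for u, _ in edges:
    degree[u] += 1
offsets = [0] * (n + 1)
for v in range(n):
    offsets[v + 1] = offsets[v] + degree[v]

# Scatter edge targets into their per-vertex slots.
targets = [0] * len(edges)
fill = offsets[:-1].copy()  # next write position per vertex
for u, v in edges:
    targets[fill[u]] = v
    fill[u] += 1

def neighbors(v):
    """Out-neighbors of v, read directly from the CSR arrays."""
    return targets[offsets[v]:offsets[v + 1]]

print(neighbors(0))  # [1, 2]
```

The two flat arrays make neighbor iteration cache-friendly, which is why CSR is the default layout in high-performance graph codes.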
Multiply with different modes (map)
1. Performance of sequential vs OpenMP-based vector multiply.
2. Comparing various launch configs for CUDA-based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential vs OpenMP-based vector element sum.
2. Performance of memcpy vs in-place CUDA-based vector element sum.
3. Comparing various launch configs for CUDA-based vector element sum (memcpy).
4. Comparing various launch configs for CUDA-based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA-based vector element sum (in-place).
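To see why the storage type matters for a reduction, here is a small Python sketch. Stock NumPy has no bfloat16, so float16 stands in for the low-precision type; that substitution, and the values used, are assumptions of this example, not the report's setup.

```python
import numpy as np

# Sum n copies of 0.001; the true sum is 100.0.
n = 100_000

# Naive left-to-right accumulation in each storage type.
acc32 = np.float32(0.0)
acc16 = np.float16(0.0)
for _ in range(n):
    acc32 += np.float32(0.001)
    acc16 += np.float16(0.001)

# float16 accumulation stalls far below the true sum: once the
# running sum is large, adding 0.001 is below half a float16 ulp
# and rounds away to nothing.
print(float(acc32), float(acc16))
```

This is the classic trade-off the benchmark probes: low-precision storage halves memory traffic but can destroy the accuracy of a long reduction unless a wider accumulator is used.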
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ... - Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It comes, however, with a precondition: the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by the submission of a large number of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
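For intuition, the standard (monolithic) PageRank that Levelwise PageRank is compared against can be sketched as power iteration, with dead-end rank mass redistributed uniformly. This is a minimal illustration on a small invented graph, not the report's code.

```python
# Monolithic PageRank by power iteration (sketch).
# Graph is a dict mapping each vertex to its out-neighbor list;
# a vertex with an empty list is a dead end, whose rank mass is
# spread uniformly over all vertices each iteration.
def pagerank(graph, damping=0.85, iters=50):
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    for _ in range(iters):
        leaked = sum(rank[v] for v in graph if not graph[v])  # dead-end mass
        new = {v: (1 - damping) / n + damping * leaked / n for v in graph}
        for v, outs in graph.items():
            if outs:
                share = damping * rank[v] / len(outs)
                for u in outs:
                    new[u] += share
        rank = new
    return rank

g = {0: [1], 1: [0, 2], 2: []}  # vertex 2 is a dead end
r = pagerank(g)
print(sum(r.values()))  # ≈ 1.0 (rank mass is conserved)
```

In the standard method every vertex participates in every iteration; Levelwise PageRank instead fixes ranks one strongly-connected-component level at a time, which is what removes the per-iteration communication.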
Reference : http://neuralnetworksanddeeplearning.com/chap1.html
Consider the problem of identifying individual digits from an input image.
Each image is a 28-by-28-pixel image. The network is then designed as follows:
Input layer (image) -> 28*28 = 784 neurons. Each neuron corresponds to a pixel.
The output layer size is determined by the number of digits to be identified, i.e. 10 (0 to 9).
The intermediate hidden layer can be experimented with using varying numbers of neurons. Let us fix it at 10 nodes in the hidden layer.
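The 784 -> 10 -> 10 network described above can be sketched as an untrained NumPy forward pass. This is a minimal illustration with random weights, not the code from the referenced book.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 784 input pixels -> 10 hidden neurons -> 10 output neurons (digits 0-9).
W1 = rng.normal(size=(10, 784)); b1 = rng.normal(size=10)
W2 = rng.normal(size=(10, 10));  b2 = rng.normal(size=10)

def forward(x):
    """Forward pass for one flattened 28x28 image (vector of 784 values)."""
    hidden = sigmoid(W1 @ x + b1)
    return sigmoid(W2 @ hidden + b2)

x = rng.random(784)        # stand-in for a flattened digit image
out = forward(x)
print(out.shape, int(out.argmax()))  # 10 outputs; argmax is the predicted digit
```

Training would adjust W1, b1, W2, b2 by gradient descent on a loss over labelled digit images; here only the shape of the computation is shown.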
Reference: http://neuralnetworksanddeeplearning.com/chap1.html
How about recognizing a human face from a given set of random images?
Attack this problem in the same fashion as explained earlier. Input -> image pixels; output -> is it a face or not? (a single node)
A face can be recognized by answering questions like “Is there an eye in the top left?”, “Is there a nose in the middle?”, etc.
Each question corresponds to a hidden layer.
http://deeplearning4j.org/convolutionalnets.html
Introduced in 1980 by Fukushima
Refined by LeCun in 1989 – mainly to apply CNNs to identify variability in 2D image data
An RBM is a type of Boltzmann machine in which communication is absent across nodes within the same layer.
In a CNN, nodes are not connected to every node of the next layer; the full symmetry of an RBM is absent.
Convolutional networks learn images by pieces rather than learning them as a whole (which is what an RBM does).
They are designed to use minimal amounts of preprocessing.
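"Learning images by pieces" corresponds to sliding a small filter over local patches of the image. Below is a minimal valid (no-padding) 2D convolution sketch in Python; the image and filter are invented for the example, and the loop form is for clarity, not speed.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation: apply the kernel to every patch.
    (Deep learning libraries typically implement cross-correlation
    and still call it 'convolution'.)"""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Each output value depends only on a small local patch.
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
edge = np.array([[1.0, -1.0]])        # simple horizontal-difference filter
result = conv2d(image, edge)
print(result.shape)                   # (4, 3)
```

Because the same small kernel is reused at every position, a CNN layer has far fewer parameters than a fully connected layer over the same image, which is exactly the asymmetry noted above.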