Overview of the structured data science domain, OSS machine learning platforms and algorithms. Using the JPMML family of libraries to implement a unified, production-oriented workflow.
As the complexity of choosing optimised and task specific steps and ML models is often beyond non-experts, the rapid growth of machine learning applications has created a demand for off-the-shelf machine learning methods that can be used easily and without expert knowledge. We call the resulting research area that targets progressive automation of machine learning AutoML.
Although it focuses on end users without expert knowledge, AutoML also offers new tools to machine learning experts, for example to:
1. Perform architecture search over deep representations
2. Analyse the importance of hyperparameters.
Using AI to build AI is a promising solution to give the power of AI to those who can't afford it as those multinational corporations. The technology is also known as Automatic Machine Learning (AutoML). OneClick.ai is the first deep learning AutoML platform that make the latest AI technology accessible to anyone with/without AI background. The deck gives a 30 minutes overview of the recent history of AutoML, and how OneClick.ai innovates on it. Check out our platform at http://www.oneclick.ai
The key challenge in making AI technology more accessible to the broader community is the scarcity of AI experts. Most businesses simply don’t have the much needed resources or skills for modeling and engineering. This is why automated machine learning and deep learning technologies (AutoML and AutoDL) are increasingly valued by academics and industry. The core of AI is the model design. Automated machine learning technology reduces the barriers to AI application, enabling developers with no AI expertise to independently and easily develop and deploy AI models. Automated machine learning is expected to completely overturn the AI industry in the next few years, making AI ubiquitous.
A tremendous backlog of predictive modeling problems in the industry and short supply of trained data scientists have spiked interest in automation over the last few years. A new academic field, AutoML, has emerged. However, there is a significant gap between the topics that are academically interesting and automation capabilities that are necessary to solve real-world industrial problems end-to-end. An even greater challenge is enabling a non-expert to build a robust and trustworthy AI solution for their company. In this talk, we’ll discuss what an industry-grade AutoML system consists of and the scientific and engineering challenges of building it.
Artificial Intelligence Layer: Mahout, MLLib, and other projectsVictor Sanchez Anguix
This is part of an introductory course on Big Data tools for Artificial Intelligence. These final slides gather some of the most important AI layers in Big Data.
As the complexity of choosing optimised and task specific steps and ML models is often beyond non-experts, the rapid growth of machine learning applications has created a demand for off-the-shelf machine learning methods that can be used easily and without expert knowledge. We call the resulting research area that targets progressive automation of machine learning AutoML.
Although it focuses on end users without expert knowledge, AutoML also offers new tools to machine learning experts, for example to:
1. Perform architecture search over deep representations
2. Analyse the importance of hyperparameters.
Using AI to build AI is a promising solution to give the power of AI to those who can't afford it as those multinational corporations. The technology is also known as Automatic Machine Learning (AutoML). OneClick.ai is the first deep learning AutoML platform that make the latest AI technology accessible to anyone with/without AI background. The deck gives a 30 minutes overview of the recent history of AutoML, and how OneClick.ai innovates on it. Check out our platform at http://www.oneclick.ai
The key challenge in making AI technology more accessible to the broader community is the scarcity of AI experts. Most businesses simply don’t have the much needed resources or skills for modeling and engineering. This is why automated machine learning and deep learning technologies (AutoML and AutoDL) are increasingly valued by academics and industry. The core of AI is the model design. Automated machine learning technology reduces the barriers to AI application, enabling developers with no AI expertise to independently and easily develop and deploy AI models. Automated machine learning is expected to completely overturn the AI industry in the next few years, making AI ubiquitous.
A tremendous backlog of predictive modeling problems in the industry and short supply of trained data scientists have spiked interest in automation over the last few years. A new academic field, AutoML, has emerged. However, there is a significant gap between the topics that are academically interesting and automation capabilities that are necessary to solve real-world industrial problems end-to-end. An even greater challenge is enabling a non-expert to build a robust and trustworthy AI solution for their company. In this talk, we’ll discuss what an industry-grade AutoML system consists of and the scientific and engineering challenges of building it.
Artificial Intelligence Layer: Mahout, MLLib, and other projectsVictor Sanchez Anguix
This is part of an introductory course on Big Data tools for Artificial Intelligence. These final slides gather some of the most important AI layers in Big Data.
BKK16-408B Data Analytics and Machine Learning From Node to ClusterLinaro
Linaro is building an OpenStack based Developer Cloud. Here we present what was required to bring OpenStack to 64-bit ARM, the pitfalls, successes and lessons learnt; what’s missing and what’s next.
BigDL webinar - Deep Learning Library for SparkDESMOND YUEN
BigDL is a distributed deep learning library for Apache Spark*
and a unified Big Data Platform Driving Analytics and Data Science.
If you like what you read be sure you ♥ it below. Thank you!
Joker'14 Java as a fundamental working tool of the Data ScientistAlexey Zinoviev
Alexey Zinoviev presented this paper on the Jocker conference http://jokerconf.com/#zinoviev.
This paper covers next topics: Data Mining, Machine Learning, Mahout, Spark, MLlib, Python, Octave, R language
The Data Science Process - Do we need it and how to apply?Ivo Andreev
Machine learning is not black magic but a discipline that involves statistics, data science, analysis and hard work. From searching patterns and data preparation through applying and optimizing algorithms to obtaining usable predictions, one would need background and appropriate tools.
But do we need it, when there is already available AI as a service solution out there? Do we need to try hard with artificial neural networks? And if we decide to do so, what tools would be a safe bet?
In this session we will go through real world examples, mention key tools from Microsoft and open source world to do data science and machine learning and most importantly - we will provide a workflow and some best practices.
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...Spark Summit
Scale-out big data processing frameworks like Apache Spark have been designed to use on off the shelf commodity machines where each machine has the modest amount of compute , memory and storage capacity. Recent advancement in the hardware technology motivates understanding Spark performance on novel hardware architectures. Our earlier work has shown that the performance of Spark based data analytics is bounded by the frequent accesses to the DRAM. In this talk, we argue in favor of Near Data Computing Architectures that enable processing the data where it resides (e.g Smart SSDs and Compute Memories) for Apache Spark. We envision a programmable logic based hybrid near-memory and near-storage compute architecture for Apache Spark. Furthermore we discuss the challenges involved to achieve 10x performance gain for Apache Spark on NDC architectures.
Mirabilis_Design AMD Versal System-Level IP LibraryDeepak Shankar
Mirabilis Design provides the VisualSim Versal Library that enable System Architect and Algorithm Designers to quickly map the signal processing algorithms onto the Versal FPGA and define the Fabric based on the performance. The Versal IP support all the heterogeneous resource.
A Hands-on Intro to Data Science and R Presentation.pptSanket Shikhar
Using popular data science tools such as Python and R, the book offers many examples of real-life applications, with practice ranging from small to big data.
ISC Frankfurt 2015: Good, bad and ugly of accelerators and a complementary pathJohn Holden
Accelerators Vs Adjoint Algorithmic Differentation (AAD).... NONSENSE. It is not a choice. The two can be combined to provide the ultimate accelerator. Accelerators such as NVIDIA GPUs, Intel Xeon Phis CAN be combined with AD. NAG has the software tools and expertise to deliver AD solutions for traditional architectures and accelerarors
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKzmhassan
As spark applications move to a containerized environment, there are many questions about how to best configure server systems in the container world. In this talk we will demonstrate a set of tools to better monitor performance and identify optimal configuration settings. We will demonstrate how Prometheus, a project that is now part of the Cloud Native Computing Foundation (CNCF), can be applied to monitor and archive system performance data in a containerized spark environment. In our examples, we will gather spark metric output through Prometheus and present the data with Grafana dashboards. We will use our examples to demonstrate how performance can be enhanced through different tuned configuration settings. Our demo will show how to configure settings across the cluster as well as within each node.
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedInGrokking VN
Bài techtalk của anh Khải Trần nói về hệ thống data pipeline của LinkedIn được dùng để thu thập hàng chục tỷ messages mỗi ngày, và cách họ chạy hệ thống real-time processing để thống kê lượng dữ liệu này cho mục đính metrics monitoring.
1 số điểm bài talk sẽ chia sẻ:
- Giới thiệu về hệ thống unified metrics platform của LinkedIn
- Cách LinkedIn setup hệ thống BigData pipeline dùng Kafka, HDFS, Apache Calcite và Apache Samza.
- Khái niệm nearline storage, và cách LinkedIn chuyển từ offline architecture sang nearline architecture.
Speaker: Khai Tran, Staff Software Engineer - LinkedIn.
- Hiện đang là staff software engineer ở LinkedIn, phụ trách hệ thống metrics monitoring system. Trước đây từng làm ở Amazon AWS và Oracle.
- PhD, University of Wisconsin-Madison, nghiên cứu về Database Systems.
Performance and Scale Options for R with Hadoop: A comparison of potential ar...Revolution Analytics
R and Hadoop go together. In fact, they go together so well, that the number of options available can be confusing to IT and data science teams seeking solutions under varying performance and operational requirements.
Which configuration is faster for big files? Which is faster for sharing data and servers among groups? Which eliminates data movement? Which is easiest to manage? Which works best with iterative and multistep algorithms? What are the hardware requirements of each alternative?
This webinar is intended to help new users of R with Hadoop select their best architecture for integrating Hadoop and R, by explaining the benefits of several popular configurations, their performance potential, workload handling and programming model and administrative characteristics.
Presenters from Revolution Analytics will describe the options for using Revolution R Open and Revolution R Enterprise with Hadoop including servers, edge nodes, rHadoop and ScaleR. We’ll then compare the characteristics of each configuration as regards performance but also programming model, administration, data movement, ease of scaling, mixed workload handling, and performance for large individual analyses vs. mixed workloads.
BKK16-408B Data Analytics and Machine Learning From Node to ClusterLinaro
Linaro is building an OpenStack based Developer Cloud. Here we present what was required to bring OpenStack to 64-bit ARM, the pitfalls, successes and lessons learnt; what’s missing and what’s next.
BigDL webinar - Deep Learning Library for SparkDESMOND YUEN
BigDL is a distributed deep learning library for Apache Spark*
and a unified Big Data Platform Driving Analytics and Data Science.
If you like what you read be sure you ♥ it below. Thank you!
Joker'14 Java as a fundamental working tool of the Data ScientistAlexey Zinoviev
Alexey Zinoviev presented this paper on the Jocker conference http://jokerconf.com/#zinoviev.
This paper covers next topics: Data Mining, Machine Learning, Mahout, Spark, MLlib, Python, Octave, R language
The Data Science Process - Do we need it and how to apply?Ivo Andreev
Machine learning is not black magic but a discipline that involves statistics, data science, analysis and hard work. From searching patterns and data preparation through applying and optimizing algorithms to obtaining usable predictions, one would need background and appropriate tools.
But do we need it, when there is already available AI as a service solution out there? Do we need to try hard with artificial neural networks? And if we decide to do so, what tools would be a safe bet?
In this session we will go through real world examples, mention key tools from Microsoft and open source world to do data science and machine learning and most importantly - we will provide a workflow and some best practices.
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...Spark Summit
Scale-out big data processing frameworks like Apache Spark have been designed to use on off the shelf commodity machines where each machine has the modest amount of compute , memory and storage capacity. Recent advancement in the hardware technology motivates understanding Spark performance on novel hardware architectures. Our earlier work has shown that the performance of Spark based data analytics is bounded by the frequent accesses to the DRAM. In this talk, we argue in favor of Near Data Computing Architectures that enable processing the data where it resides (e.g Smart SSDs and Compute Memories) for Apache Spark. We envision a programmable logic based hybrid near-memory and near-storage compute architecture for Apache Spark. Furthermore we discuss the challenges involved to achieve 10x performance gain for Apache Spark on NDC architectures.
Mirabilis_Design AMD Versal System-Level IP LibraryDeepak Shankar
Mirabilis Design provides the VisualSim Versal Library that enable System Architect and Algorithm Designers to quickly map the signal processing algorithms onto the Versal FPGA and define the Fabric based on the performance. The Versal IP support all the heterogeneous resource.
A Hands-on Intro to Data Science and R Presentation.pptSanket Shikhar
Using popular data science tools such as Python and R, the book offers many examples of real-life applications, with practice ranging from small to big data.
ISC Frankfurt 2015: Good, bad and ugly of accelerators and a complementary pathJohn Holden
Accelerators Vs Adjoint Algorithmic Differentation (AAD).... NONSENSE. It is not a choice. The two can be combined to provide the ultimate accelerator. Accelerators such as NVIDIA GPUs, Intel Xeon Phis CAN be combined with AD. NAG has the software tools and expertise to deliver AD solutions for traditional architectures and accelerarors
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKzmhassan
As spark applications move to a containerized environment, there are many questions about how to best configure server systems in the container world. In this talk we will demonstrate a set of tools to better monitor performance and identify optimal configuration settings. We will demonstrate how Prometheus, a project that is now part of the Cloud Native Computing Foundation (CNCF), can be applied to monitor and archive system performance data in a containerized spark environment. In our examples, we will gather spark metric output through Prometheus and present the data with Grafana dashboards. We will use our examples to demonstrate how performance can be enhanced through different tuned configuration settings. Our demo will show how to configure settings across the cluster as well as within each node.
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedInGrokking VN
Bài techtalk của anh Khải Trần nói về hệ thống data pipeline của LinkedIn được dùng để thu thập hàng chục tỷ messages mỗi ngày, và cách họ chạy hệ thống real-time processing để thống kê lượng dữ liệu này cho mục đính metrics monitoring.
1 số điểm bài talk sẽ chia sẻ:
- Giới thiệu về hệ thống unified metrics platform của LinkedIn
- Cách LinkedIn setup hệ thống BigData pipeline dùng Kafka, HDFS, Apache Calcite và Apache Samza.
- Khái niệm nearline storage, và cách LinkedIn chuyển từ offline architecture sang nearline architecture.
Speaker: Khai Tran, Staff Software Engineer - LinkedIn.
- Hiện đang là staff software engineer ở LinkedIn, phụ trách hệ thống metrics monitoring system. Trước đây từng làm ở Amazon AWS và Oracle.
- PhD, University of Wisconsin-Madison, nghiên cứu về Database Systems.
Performance and Scale Options for R with Hadoop: A comparison of potential ar...Revolution Analytics
R and Hadoop go together. In fact, they go together so well, that the number of options available can be confusing to IT and data science teams seeking solutions under varying performance and operational requirements.
Which configuration is faster for big files? Which is faster for sharing data and servers among groups? Which eliminates data movement? Which is easiest to manage? Which works best with iterative and multistep algorithms? What are the hardware requirements of each alternative?
This webinar is intended to help new users of R with Hadoop select their best architecture for integrating Hadoop and R, by explaining the benefits of several popular configurations, their performance potential, workload handling and programming model and administrative characteristics.
Presenters from Revolution Analytics will describe the options for using Revolution R Open and Revolution R Enterprise with Hadoop including servers, edge nodes, rHadoop and ScaleR. We’ll then compare the characteristics of each configuration as regards performance but also programming model, administration, data movement, ease of scaling, mixed workload handling, and performance for large individual analyses vs. mixed workloads.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...pchutichetpong
M Capital Group (“MCG”) expects to see demand and the changing evolution of supply, facilitated through institutional investment rotation out of offices and into work from home (“WFH”), while the ever-expanding need for data storage as global internet usage expands, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms, like PageRank Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
7. OSS ML frameworks
R Scikit-Learn Apache Spark
Scale Desktop Desktop, Server Server, Cluster
Data model Rich Poor Mixed
Main concept Model formula Pipeline Pipeline
Feature eng. Average Very good Good
Modeling Very good Good Average
Decision eng. N/A N/A Very good
8. OSS ML algorithm providers
H2O.ai XGBoost LightGBM
Scale Desktop, Server, Cluster
Data model Mixed
Algorithms Decision tree
ensembles,
GLM,
Naive Bayes,
Neural Networks,
Stacking
Decision tree
ensembles
Decision tree
ensembles
11. The challenge
Productionalization happens on Java, C# or SQL, which have
very limited interoperability with R and Python:
● Desktop vs. Server/Cluster
● Language-level threading, memory management issues
● Zero overlap between library sets
ML training and deployment are completely separate
concerns and shouldn't be (force-)blended into one.
12. Possible solutions
● Translation from R/Python application code to
Java/C#/SQL application code
● Translation from R/Python application code to some
standardized intermediate representation:
○ Higher abstraction level
○ Reorganized, refactored and simplified
○ Stable in time
● Containerization
13. Deployment options on Java/JVM
● R: N/A
● Scikit-Learn: nok/sklearn-porter
● Apache Spark: Java/Scala APIs, which are tightly coupled
to the Apache Spark runtime
● H2O.ai: POJO and MOJO mechanisms
● XGBoost: JNI wrapper,
komiya-atsushi/xgboost-predictor-java
● LightGBM: JNI wrapper(?)
14. The JPMML ecosystem
● "The API is the product"
○ Conversion APIs
○ Analysis and interpretation APIs
○ Scoring APIs
● Layered, role-based design
○ Data scientists, business analysts
○ Data engineers
○ Java/Scala developers
● Automated annotation, quality control
15. Converting R to PMML
library("r2pmml")
audit = read.csv("Audit.csv")
audit$Adjusted = as.factor(audit$Adjusted)
audit.glm = glm(Adjusted ~ . - Age + cut(Age, breaks = c(0, 18, 65, 100))
+ Gender:Education + I(Income / (Hours * 52)), data = audit, family =
"binomial")
audit.glm = r2pmml::verify(audit.glm, audit[sample(nrow(audit), 100), ])
r2pmml::r2pmml(audit.glm, "GLMAudit.pmml")
17. Scoring PMML
The JPMML-Evaluator®
library:
● Small (<1 MB), nearly self-contained (F)OSS Java library
● Developer API consists a single Java interface and a
couple of utility classes (model loading, optimization)
● Supports all PMML 3.X and 4.X schema versions
● Vendor extensions:
○ Accessing 3rd party Java libraries in PMML data flow
○ MathML prediction reports
18. Scoring application scenarios
● Synchronous vs. asynchronous
● Low-latency vs. high-latency
● Individual data records vs. batches of data records
● Embedded vs. over-the-network