These are the English slides of my presentation about a machine learning implementation for a model web application, with some advice for developers who decide to build a similar implementation in a real production environment.
"Fast detection of Android malware using machine learning... - Yandex
This talk covers the application of machine learning algorithms to detecting malicious Android applications. I will explain how a high-performance tool for this task was designed at Yandex on top of MatrixNet, and demonstrate the cases in which analytical malware-detection methods help block many simple samples of malicious code. We will then discuss how such methods can be improved to detect more sophisticated malware.
Ml based detection of users anomaly activities (20th OWASP Night Tokyo, Japan... - Yury Leonychev
These are the Japanese slides of my presentation about a machine learning implementation for a model web application, with some advice for developers who decide to build a similar implementation in a real production environment.
Developing Highly Instrumented Applications with Minimal Effort - Tim Hobson
Presentation from Silicon Valley Code Camp 2013. Related code on github:
* https://github.com/hoserdude/mvcmusicstore-instrumented
* https://github.com/hoserdude/spring-petclinic-instrumented
* https://github.com/hoserdude/nodecellar-instrumented
Reproducibility and automation of machine learning process - Denis Dus
A talk about organizing the machine learning process in practice, discussing conceptual and technical aspects, with an introduction to the Luigi framework and a short story about fitting neural networks at Flo, a top mobile women's health tracker.
Delivered at Pittsburgh Tech Fest - 6/10/2017
Knowledge is power, but is it if you're not using it? What if the application you delivered to your customers was extremely intelligent? It could retrieve, analyze and use the massive amounts of data that businesses are generating at an astronomical rate.
It could analyze business deals, predict potential issues, proactively recommend business decisions and estimate profit, loss and risks.
Those things provide direct benefits to your company. Churning through that data by hand doesn't. Enter Azure Machine Learning.
In this session you will learn how to integrate Azure Machine Learning into your existing applications and workflows with REST services. You will learn how to deliver a modular, maintainable solution to your customers that allows them to analyze their data.
You will learn how to:
* Abstract business rules, workflows, AI (machine learning), and more into your applications
* Integrate Azure Machine Learning into your existing applications and processes
* Create Azure Machine Learning experiments
* Retrieve the score from an Azure Machine Learning experiment and integrate it into your applications and processes
* Integrate machine learning experiments from the Azure Machine Learning Marketplace into your existing applications and processes
* Apply various concepts for abstracting and managing services and APIs
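The session's code isn't reproduced in this summary, but the REST integration it describes usually amounts to posting a JSON payload to the experiment's scoring endpoint. Below is a minimal sketch; the column names, values, and API key are placeholders, and the payload shape follows the classic Azure ML Studio request format, so check the sample code on your own service's API help page for the exact fields.

```python
import json

def build_scoring_request(api_key, column_names, rows):
    """Build the JSON body and headers for a classic Azure ML Studio
    web service call. Adjust field names to match the sample code
    published on your service's API help page."""
    payload = {
        "Inputs": {
            "input1": {
                "ColumnNames": column_names,
                "Values": rows,  # one inner list per record to score
            }
        },
        "GlobalParameters": {},
    }
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + api_key,  # key from the service dashboard
    }
    return json.dumps(payload), headers

# Placeholder key and columns; POST `body` to your service's scoring URL
# (e.g. with urllib.request) to retrieve the score.
body, headers = build_scoring_request(
    api_key="YOUR-API-KEY",
    column_names=["age", "deal_size", "segment"],
    rows=[["34", "150000", "existing-customer"]],
)
```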
Training and deploying machine learning models with Google Clo... - Sotrender
Okay, I already have my great model in a notebook; what next? Most machine learning courses and resources prepare us well for implementing machine learning algorithms and building more or less complicated models. In most cases, however, the model is only a small piece of a larger system, and deploying and maintaining it turns out in practice to be a time-consuming, error-prone process. The problem compounds when we have not one but several models to productionize. Although more and more tools and platforms for streamlining this process appear every year, it is an area that still receives relatively little attention.
In my presentation I will show which approaches, good practices, and Google Cloud Platform tools and services we use at Sotrender to efficiently train and productionize our ML models for analyzing social media data. I will discuss which DevOps aspects we pay attention to when building products based on ML models (MLOps), and how they can easily be adopted in your startup or company using Google Cloud Platform.
Presentation by Maciej Pieńkosz of Sotrender during Data Science Summit 2020
As data science workloads grow, so does their need for infrastructure. But, is it fair to ask data scientists to also become infrastructure experts? If not the data scientists, then, who is responsible for spinning up and managing data science infrastructure? This talk will address the context in which ML infrastructure is emerging, walk through two examples of ML infrastructure tools for launching hyperparameter optimization jobs, and end with some thoughts for building better tools in the future.
Originally given as a talk at the PyData Ann Arbor meetup (https://www.meetup.com/PyData-Ann-Arbor/events/260380989/)
We all know that REST services are almost everywhere now, and nearly all new projects use them.
But do we really know how to design proper interfaces? What are the pitfalls, and how do we avoid them?
I have designed many REST services and have a collection of tips and tricks you will definitely want to use.
They will save you and your team a lot of time in the future.
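The tips themselves aren't listed in this abstract, but one recurring set of REST design conventions (a version prefix, nouns for collections, IDs as path segments, filters as query parameters, no verbs in paths) can be sketched as a small helper. The URL scheme below is illustrative, not prescriptive:

```python
from urllib.parse import urlencode

def resource_url(base, version, *segments, **filters):
    """Compose a URL following common REST conventions: a version
    prefix, collection nouns, IDs as path segments, and filters
    as query parameters (sorted for a stable cache key)."""
    path = "/".join(str(s).strip("/") for s in segments)
    url = f"{base.rstrip('/')}/v{version}/{path}"
    if filters:
        url += "?" + urlencode(sorted(filters.items()))
    return url

url = resource_url("https://api.example.com", 1, "users", 42, "orders", status="open")
# -> https://api.example.com/v1/users/42/orders?status=open
```

Keeping filters in the query string (rather than inventing verb-like paths such as `/getOpenOrders`) is one of the simplest ways to keep an interface predictable.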
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio - Alluxio, Inc.
Alluxio Global Online Meetup
Apr 23, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speakers:
Jiao (Jennie) Wang, Intel
Tsai Louie, Intel
Bin Fan, Alluxio
Today, many people run deep learning applications with training data in separate storage such as object stores or remote data centers. This presentation will demo the Intel Analytics Zoo + Alluxio stack, an architecture that enables high performance while keeping cost and resource efficiency balanced, without the network becoming an I/O bottleneck.
Intel Analytics Zoo is a unified data analytics and AI platform open-sourced by Intel. It seamlessly unites TensorFlow, Keras, PyTorch, Spark, Flink, and Ray programs into an integrated pipeline, which can transparently scale from a laptop to large clusters to process production big data. Alluxio, as an open-source data orchestration layer, accelerates data loading and processing in Analytics Zoo deep learning applications.
In this talk, we will go over:
- What is Analytics Zoo and how it works
- How to run Analytics Zoo with Alluxio in deep learning applications
- Initial performance benchmark results using the Analytics Zoo + Alluxio stack
Delivered @ MusicCityCode 6/2/2017
Knowledge is power, but is it if you're not using it? What if the application you delivered to your customers was extremely intelligent? It could retrieve, analyze and use the massive amounts of data that businesses are generating at an astronomical rate.
It could analyze business deals, predict potential issues, proactively recommend business decisions and estimate profit, loss and risks.
Those things provide direct benefits to your company. Churning through that data by hand doesn't. Enter Azure Machine Learning.
In this session you will learn how to integrate Azure Machine Learning into your existing applications and workflows with REST services. You will learn how to deliver a modular, maintainable solution to your customers that allows them to analyze their data.
You will learn how to:
* Abstract business rules, workflows, AI (machine learning), and more into your applications
* Integrate Azure Machine Learning into your existing applications and processes
* Create Azure Machine Learning experiments
* Retrieve the score from an Azure Machine Learning experiment and integrate it into your applications and processes
* Integrate machine learning experiments from the Azure Machine Learning Marketplace into your existing applications and processes
* Apply various concepts for abstracting and managing services and APIs
Monitoring AI applications with AI
The best-performing offline algorithm can lose in production. The most accurate model does not always improve business metrics. Environment misconfiguration or upstream data pipeline inconsistency can silently kill model performance. Production ops, data science, and engineering teams are rarely equipped to detect, monitor, and debug these types of incidents.
Could Microsoft have tested the Tay chatbot in advance, then monitored and adjusted it continuously in production to prevent its unexpected behaviour? Truly mission-critical AI systems require an advanced monitoring and testing ecosystem that enables continuous and reliable delivery of machine learning models and data pipelines into production. Common production incidents include:
Data drift, new data, wrong features
Vulnerability issues, malicious users
Concept drift
Model degradation
Biased training set / training issues
Performance issues
In this demo-based talk we discuss a solution, tooling, and architecture that allow machine learning engineers to be involved in the delivery phase and take ownership of the deployment and monitoring of machine learning pipelines.
It allows data scientists to safely deploy early results as end-to-end AI applications in a self-serve mode, without assistance from engineering and operations teams. It shifts experimentation and even training phases from offline datasets to live production and closes the feedback loop between research and production.
Technical part of the talk will cover the following topics:
Automatic Data Profiling
Anomaly Detection
Clustering of inputs and outputs of the model
A/B Testing
Service Mesh, Envoy Proxy, traffic shadowing
Stateless and stateful models
Monitoring of regression, classification and prediction models
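As one concrete illustration of the drift-monitoring topics above, here is a sketch of the Population Stability Index, a common data-drift check; it is an illustrative choice, not necessarily the metric used in the talk's tooling:

```python
import math

def population_stability_index(baseline, live, bins=10):
    """Compare the distribution of a feature (or model score) in
    production against its training-time baseline. Common rule of
    thumb: < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 major drift."""
    lo, hi = min(baseline), max(baseline)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]
    def proportions(values):
        counts = [0] * bins
        for x in values:
            counts[sum(1 for e in edges if x > e)] += 1  # bin by baseline edges
        return [max(c / len(values), 1e-6) for c in counts]  # avoid log(0)
    p, q = proportions(baseline), proportions(live)
    return sum((b - a) * math.log(b / a) for a, b in zip(p, q))

baseline = [i / 100 for i in range(100)]  # training-time sample
drifted = [x + 0.5 for x in baseline]     # shifted production sample
```

A check like this can run on every batch of production inputs and raise an alert long before accuracy metrics (which need labels) catch the problem.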
Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...Provectus
In this demo-based talk we discuss a solution, tooling, and architecture that allow machine learning engineers to be involved in the delivery phase and take ownership of the deployment and monitoring of machine learning pipelines. It allows data scientists to safely deploy early results as end-to-end AI applications in a self-serve mode, without assistance from engineering and operations teams. It shifts experimentation and even training phases from offline datasets to live production and closes the feedback loop between research and production.
Integrating Splunk into your Spring Applications - Damien Dallimore
How much visibility do you really have into your Spring applications? How effectively are you capturing, harnessing, and correlating the logs, metrics, and messages from your Spring applications that can be used to deliver this visibility? What tools and techniques are you providing your Spring developers with to better create and utilize this mass of machine data? In this session I'll answer these questions and show how Splunk can be used not only to provide historical and real-time visibility into your Spring applications, but also as a platform that developers can use to become more "devops effective" and easily create custom big data integrations and standalone solutions. I'll discuss and demonstrate many of Splunk's Java apps, frameworks, and SDK, and also cover the Spring Integration Adaptors for Splunk.
Apache® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models - Anyscale
Apache Spark has rapidly become a key tool for data scientists to explore, understand, and transform massive datasets and to build and train advanced machine learning models. The question then becomes: how do I deploy these models to a production environment? How do I embed what I have learned into customer-facing data applications?
In this webinar, we will discuss best practices from Databricks on how our customers productionize machine learning models, do a deep dive with actual customer case studies, and show live tutorials of a few example architectures and code in Python, Scala, Java, and SQL.
This slide deck was used at the ISO/IEC JTC1 SC36 Plenary Meeting on June 22, 2015.
Its title is 'Proof of Concept for Learning Analytics Interoperability' and its subtitle is 'Reference Model based on open source SW'.
The iOS technical interview: get your dream job as an iOS developer - Juan C Catalan
So you have been doing tutorials, sample projects, and watching videos on iOS development for a while. You are trying to publish an app in the App Store, or maybe you already have one there. You dream of becoming a professional iOS developer.
Believe me, I was in the same situation six years ago. I started as an indie developer, self-employed, and landed a few short contracts, then a six-month contract, and finally, one day, I got a job as a full-time professional iOS developer with a corporation. I have interviewed at a few companies, and I have also interviewed some iOS candidates.
In this talk I will explain how to prepare yourself for the iOS technical interview. I will go through the most common questions, give my personal advice on how to succeed and pass the interview, and provide links to training material.
Constrained Optimization with Genetic Algorithms and Project Bonsai - Ivo Andreev
Traditional machine learning requires volumes of labelled data that can be time-consuming and expensive to produce. Machine teaching leverages the human capability to decompose and explain concepts in order to steer the training of machine learning models in the right direction (the correct answer is taught not by showing the data for it, but by having a person demonstrate the answer).
Project Bonsai is a low-code platform for intelligent solutions. With its different perspective on data, it allows a completely new approach to tasks, especially when the physical world is involved. Under the hood it combines machine teaching, calibration, and optimization to create intelligent control systems using simulations. The teaching curriculum is written in a new language concept, "Inkling", and training a model is easy and interactive.
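The talk's Bonsai and Inkling examples aren't reproduced here, but the general idea of constrained optimization with a genetic algorithm can be sketched with a penalty method. The toy problem, operators, and parameters below are illustrative choices, not Project Bonsai's internals:

```python
import random

def ga_minimize(objective, constraint, bounds, pop_size=60, gens=150,
                penalty=1000.0, seed=42):
    """Minimize objective(x) subject to constraint(x) <= 0 using a
    simple real-coded genetic algorithm with a quadratic penalty."""
    rng = random.Random(seed)

    def fitness(ind):
        return objective(ind) + penalty * max(0.0, constraint(ind)) ** 2

    def clip(ind):
        return [min(max(v, lo), hi) for v, (lo, hi) in zip(ind, bounds)]

    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    for g in range(gens):
        elite = sorted(pop, key=fitness)[: pop_size // 5]  # keep the best 20%
        children = list(elite)                             # elitism
        while len(children) < pop_size:
            a, b = rng.sample(elite, 2)                    # parents from the elite
            w = rng.random()
            child = [w * x + (1 - w) * y for x, y in zip(a, b)]  # blend crossover
            sigma = 0.5 * (1 - g / gens) + 0.01            # shrinking mutation step
            children.append(clip([v + rng.gauss(0, sigma) for v in child]))
        pop = children
    return min(pop, key=fitness)

# Toy constrained problem: minimize (x-3)^2 + (y-2)^2 subject to x + y <= 4.
best = ga_minimize(
    objective=lambda p: (p[0] - 3) ** 2 + (p[1] - 2) ** 2,
    constraint=lambda p: p[0] + p[1] - 4,
    bounds=[(-5, 5), (-5, 5)],
)
```

The penalty term steers infeasible candidates back toward the constraint boundary, which is where the optimum of this toy problem lies (near x = 2.5, y = 1.5).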
Similar to Ml based detection of users anomaly activities (20th OWASP Night Tokyo, English)
Techniques to optimize the PageRank algorithm usually fall into two categories. One tries to reduce the work per iteration; the other tries to reduce the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices (those with the same in-links) helps avoid duplicate computations and thus could also reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; the final ranks of chain nodes can be easily calculated afterwards. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It comes, however, with the precondition that the input graph contains no dead ends. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large number of small workload submissions, and is expected to be a non-issue when the computation is performed on massive graphs.
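The decomposition Levelwise PageRank relies on, strongly connected components processed in topological order, can be sketched with Kosaraju's algorithm. This toy version only produces the component ordering, not the distributed rank computation from the report:

```python
from collections import defaultdict

def sccs_in_topological_order(graph):
    """Kosaraju's algorithm: returns strongly connected components
    ordered so every cross-component edge points from an earlier
    component to a later one (the processing order Levelwise needs)."""
    order, seen = [], set()
    for start in graph:                       # pass 1: finish-order DFS on G
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, it = stack[-1]
            advanced = False
            for v in it:
                if v not in seen:
                    seen.add(v)
                    stack.append((v, iter(graph[v])))
                    advanced = True
                    break
            if not advanced:
                order.append(node)
                stack.pop()
    rgraph = defaultdict(list)                # pass 2: DFS on transpose
    for u in graph:
        for v in graph[u]:
            rgraph[v].append(u)
    comps, assigned = [], set()
    for u in reversed(order):
        if u in assigned:
            continue
        comp, todo = [], [u]
        assigned.add(u)
        while todo:
            x = todo.pop()
            comp.append(x)
            for y in rgraph[x]:
                if y not in assigned:
                    assigned.add(y)
                    todo.append(y)
        comps.append(comp)
    return comps

# {a,b} -> {c,d} -> {e}: components come out in topological order.
comps = sccs_in_topological_order(
    {"a": ["b", "c"], "b": ["a"], "c": ["d", "e"], "d": ["c"], "e": []}
)
```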
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...pchutichetpong
M Capital Group ("MCG") expects demand to grow and supply to evolve, driven by institutional investment rotating out of offices and into work-from-home ("WFH") infrastructure, while the need for data storage keeps expanding alongside global internet usage, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
ML-based detection of user anomaly activities (20th OWASP Night Tokyo, English)
1. ML-based detection of user anomaly activities
Yury Leonychev
ESG, Rakuten inc.
OWASP Night 9/3/2016
2. Agenda
• Case study presentation
• Workshop format
What | Where
IDE: Continuum Analytics Anaconda | https://www.continuum.io/downloads
Python 3 + NumPy + SciPy + scikit-learn | https://www.python.org/downloads/ and http://www.scipy.org/install.html
Model application | https://github.com/tracer0tong/buzzboard
3. Abstract problem definition
1. Browser based activity
a. Normal user interacts with browser
b. Web application generated activity
2. HTTP request activity
a. Normal UA
b. Headless browser or script/bot
3. Frontend/Backend data exchange
5. Model description
1. Business understanding: we want to classify “bad” and “good” users, where “bad” users cannot enter a CAPTCHA but “good” users can.
2. Data understanding: HTTP requests and the results of CAPTCHA checks.
3. Data preparation: collect requests, verify that the set is complete, and store the user data in a database.
4. Modeling: create the model; define and tune the settings of the Decision Tree.
5. Evaluation: measure errors and validate the model.
6. Deployment: push the model to production.
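The modeling, evaluation, and deployment steps above can be sketched in a few lines. This is a minimal illustration, assuming a hand-rolled decision stump (a depth-1 decision tree) instead of a library implementation; the UA-length values and labels are invented for this example.

```python
# Sketch of steps 4-6: fit a decision stump (depth-1 decision tree),
# validate it on held-out events, then "deploy" it to score new requests.
def fit_stump(xs, ys):
    """Pick the threshold t minimizing training errors for rule x < t -> 1."""
    best_t, best_err = None, None
    for t in sorted(set(xs)):
        errs = sum((1 if x < t else 0) != y for x, y in zip(xs, ys))
        if best_err is None or errs < best_err:
            best_t, best_err = t, errs
    return best_t

def predict(t, x):
    return 1 if x < t else 0

# Step 4: train on UA-string lengths (label 1 = failed CAPTCHA, i.e. bot).
train_x = [10, 12, 11, 70, 80, 75]
train_y = [1, 1, 1, 0, 0, 0]
t = fit_stump(train_x, train_y)

# Step 5: validate on held-out events before trusting the model.
val_x, val_y = [9, 72, 68, 90], [1, 0, 0, 0]
accuracy = sum(predict(t, x) == y for x, y in zip(val_x, val_y)) / len(val_y)

# Step 6: "deploy" -- score a new incoming request.
is_bot = predict(t, 8)
```

In a real setup the same loop would use scikit-learn's DecisionTreeClassifier, but the structure (fit, validate, then score) stays the same.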
6. Feature extraction
Direct | Indirect
Size of HTTP request | IP address reputation
Length of URI address | User reputation
User Agent | History-based features
Number of HTTP headers | Time-based features
Response code / response time | Business-logic-based features
… | …
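The direct features from the table above can be extracted in one pass over the request. A hedged sketch follows, modeling the request as a plain dict; in real code these values would come from the framework's request object (e.g. Flask's `request`), and the key names here are assumptions for illustration.

```python
# Extract the direct features listed above from a request-like dict.
def direct_features(req):
    return {
        "request_size": len(req.get("body", "")),     # size of HTTP request body
        "uri_length": len(req.get("uri", "")),        # length of URI address
        "ua_length": len(req.get("user_agent", "")),  # User Agent string length
        "header_count": len(req.get("headers", {})),  # number of HTTP headers
    }

sample = {
    "uri": "/board/post",
    "user_agent": "Mozilla/5.0",
    "headers": {"Host": "example.com", "Accept": "*/*"},
    "body": "hello",
}
features = direct_features(sample)
```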
11. Offline computations
• Offline: Hadoop, Spark (MLlib), Elasticsearch
• Real-time: Spark (Streaming and MLlib), Kafka
• The same technologies are available in AWS and Azure
13. Knowledge matters!
• You should understand what you are doing!
– Is it normal to have 1.0 accuracy?
– Can we measure Mean Squared Error for our model application?
– Have we chosen the correct algorithm and parameters?
– Is this a correct feature?
METHODS = ['GET', 'POST', 'PUT', 'DELETE', 'OPTIONS', 'HEAD']
def MethodFeature(request):
    return METHODS.index(request.method)
14. Conclusion
• Use decomposition (different levels of classification)
• Use a flexible feature collection
• Prefer offline computations
• Leave yourself room for experiments
• Don’t forget that ML integration is a continuous process
• Build up your knowledge of ML
Hi! My name is Yury. I’m a lead architect at Rakuten. Nice to meet you here.
Today I want to explain how to use machine learning methods to detect strange user behavior.
We will see that some mathematical algorithms are applicable to real tasks, and why using machine learning is so simple and so difficult at the same time.
I tried to keep my presentation as illustrative as possible, so you can download the additional utilities right now and check the results yourself.
I will answer questions after the presentation, so don’t hesitate to ask.
To illustrate my presentation I’ve made a special model web application.
It is written in Python 3.5 with additional libraries. It’s a bit complicated to install some of them on Windows machines (Linux and Mac users should have no problems). For development purposes I recommend the Anaconda distribution from Continuum Analytics.
It contains all the prerequisites and modules for an ML implementation. We will use one of the most popular ML libraries, scikit-learn.
If you want to reproduce the results, you can download these tools and the application from my GitHub repository.
Let’s move forward.
This is the abstract problem definition for our web application. We decided to make an anonymous message board and want to block spammers.
Basically this looks like a binary classification task, but let’s look at it technically.
We have a normal user, who will usually open our site in a normal full-stack user agent (a browser). And we have a malicious user (an attacker), which is usually a script or a headless browser, but sometimes a hacked laptop or a malicious module inside the user’s browser.
In any case, we should define such a model of our application. In real life this model should of course be more detailed.
How do we solve our task?
We shouldn’t reinvent the wheel, because there is a proven methodology for data mining. You can follow the link and read about it later.
Basically, CRISP-DM defines the lifecycle of a data mining task.
It defines a DM task as a continuous process of different steps.
You should start from a business understanding of the task. Sometimes you need to minimize expenses, sometimes to maximize clicks on advertisement banners.
The metrics of success should be defined at the beginning of the project: you should know how to measure your success.
In the second step you should understand which sources of data are available, how much garbage there is in the data, the semantics of the data, and many other things like that.
Third step: prepare your data and create a learning set. You may need to recover lost data or perform additional labeling of the data with human support.
The next step is modeling: choose and train an optimal model. It sounds simple, but no, this is a very difficult part.
You can change something in the business understanding during the evaluation phase.
At the end you can deploy the optimal model to production.
Everything changes, and that’s why your model in most cases is optimal only temporarily.
Let’s apply this methodology to our task.
We want to separate “bad” and “good” users. The main difference between these users is the ability to enter a CAPTCHA. We decided that bad users cannot enter reCAPTCHA.
We will extract all features from HTTP requests. Our target vector is the CAPTCHA inputs.
To store the requests (features) we will use a Redis database.
To classify users we will choose not the best possible classification algorithm, but the most illustrative and easily interpretable one.
What will we do next?
Calculate the error, tune the parameters, and push our trained classifier to production.
For a web application you can define two different types of features.
The first type is direct features, which can be extracted from HTTP requests and responses. They are mostly intuitive, but you can construct something more complex if you want. In our model application we will use direct features only.
The second type is indirect features, for example IP address reputation. These features are usually more difficult to construct, and you need additional services and systems to extract them from raw data. But they are also a very strong addition to the learning set.
*I will show the first screen of the web application*
Look at our web application. As I mentioned before, this is a message board for anonymous users, but we still want to block spam activity here.
We want to construct the learning set from user activity. Our web application is written in Flask, so we can get the data from the “request” object. We will ask users to enter a CAPTCHA after every message.
This is a good scheme to block spammers, but there are a couple of problems.
The CAPTCHA will annoy users.
And if we use CAPTCHAs every time, attackers will start to recognize them automatically.
Let’s improve our application and put ML inside.
We will choose one of the basic machine learning algorithms: the Decision Tree. In our model application we will retrain the classifier every time we get a message from a user.
This is not a very good scheme for production, but we made a model application.
Let’s imagine that after the project starts, evil spammers come to your site and start to spoil it with spam messages. The spammers cannot enter the CAPTCHA, so after a short period of time we will get a learning set.
And we will also get a trained classifier. Look at the “ML” page of our application.
The machine learning algorithm chose “UALengthFeature” to construct the rule, and the classifier can predict the next event with 100% accuracy. That’s perfect.
Let’s give our application the ability to use the trained classifier to predict spam messages and block them.
We will send malicious users an HTTP 400 error code. After enabling “Strict mode”, normal users can send messages without a CAPTCHA, while a malicious user will receive the HTTP error.
What happens if the attacker uses a normal User Agent string and sends more spam to us?
You can see that the ML algorithm automatically chose a new classification feature: the number of HTTP headers.
But what happens if the attacker becomes even smarter, with a normal UA and a huge number of HTTP headers? ML will solve this problem for us with a combination of different rules.
As I mentioned before, we made a model application, but what should we do in a real production environment?
One good idea is to decompose your difficult task into different layers. Trying to build one classifier for all available features is probably a huge mistake.
Basically, many different kinds of user activity can be collected and analyzed separately. Sometimes it’s better to use heuristic rules or very simple classifiers to block extremely strange requests on frontends or firewalls than to pass all of them to the backend.
The speed and even the size of the classifier depend on the number of features you use. Some of these features are heavy to extract and calculate, and it can be expensive from a performance point of view to calculate all of them for every event.
How do we train the classifier? In our model application we have only 5 features and fewer than 30 events. You shouldn’t fit a classifier in production this way. It is much better to use specialized, highly scalable and powerful tools such as Apache Storm, Spark, Hadoop, and Kafka.
Apache Spark has MLlib, which contains many well-known ML algorithms and methods. For Storm you can just create special “bolts” with ML algorithms. Storm is also applicable to real-time processing of the event flow.
Hadoop and Kafka provide high-performance storage and transport for the computations.
Smooth integration into a production environment needs a more flexible implementation of ML.
There are different difficulties in the machine learning process. In our model application we use a Redis database to store the learning set, because it is impossible to create one classifier and use it forever.
Features can change, and business requirements can change. For example, if you want to add a new feature column to the feature set, you can of course drop all the data and start calculations from scratch, but it is better to store the learning set (or several different learning sets) to be able to switch from one to another.
You should also be able to perform experiments, because all ML algorithms need some parameter tuning. That’s why you should be able to compare different classifiers before promoting them to production.
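Comparing candidate classifiers before switching production can be as simple as scoring them on the same hold-out set. A hedged sketch, where both rules and all events are made up for illustration:

```python
# Two hypothetical candidate classifiers for the "is this a bot?" task.
def rule_short_ua(event):          # candidate A: short UA string -> bot
    return event["ua_length"] < 20

def rule_many_headers(event):      # candidate B: many headers -> bot
    return event["header_count"] > 15

# Shared hold-out set: (features, true label) pairs, True = bot.
holdout = [
    ({"ua_length": 8,  "header_count": 18}, True),
    ({"ua_length": 90, "header_count": 9},  False),
    ({"ua_length": 85, "header_count": 20}, True),
    ({"ua_length": 70, "header_count": 7},  False),
]

def accuracy(rule):
    return sum(rule(e) == label for e, label in holdout) / len(holdout)

# Score both candidates on identical data, then promote the winner.
scores = {r.__name__: accuracy(r) for r in (rule_short_ua, rule_many_headers)}
best = max(scores, key=scores.get)
```

The key point is that both candidates see identical data, so the comparison is fair; in production the "rules" would be trained classifiers and the metric would likely be precision/recall rather than raw accuracy.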
You may ask me: if it’s so simple, why can’t we apply ML immediately everywhere?
The problem is that you need to understand the mathematical internals of these algorithms. To illustrate this, I deliberately made some mistakes in the model application.
Just ask yourself:
Is it OK to get 100% accuracy? No. In real life it means that you have a very poor learning set.
Is it OK to use Mean Squared Error as a measure of model quality in our case? No, because this is not a regression task.
Did we choose the right algorithm and features? No. A decision tree is not a particularly strong algorithm.
What about the features? MethodFeature is simply incorrect: we use an array index as the feature value, but how can we compare the POST and HEAD HTTP request methods as numbers?
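One way the flawed MethodFeature could be repaired is to one-hot encode the HTTP method, so no artificial ordering between methods is introduced. A sketch (the function name is mine; only METHODS mirrors the slide):

```python
METHODS = ['GET', 'POST', 'PUT', 'DELETE', 'OPTIONS', 'HEAD']

def method_feature_onehot(method):
    """One-hot encode the HTTP method: categories, not positions on a line."""
    return [1 if m == method else 0 for m in METHODS]

# 'POST' and 'HEAD' become equidistant categories instead of the
# meaningless numeric values 1 and 5 that METHODS.index() produced.
post = method_feature_onehot('POST')
head = method_feature_onehot('HEAD')
```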
Finally, to summarize my presentation, I want to briefly repeat all my suggestions.
--slide content—
Thank you!
ありがとうございます。
I am glad to hear your questions.