❑ The document discusses a meetup event about machine learning on document-based data using Apache Spark and MongoDB. It provides background on the speakers and their companies. It then summarizes the agenda which includes introductions to Apache Spark, MongoDB, the Mongo Spark connector, and a case study on predicting SYN-DOS attacks on IoT devices. Diagrams are presented on Spark clusters, RDDs, MongoDB replica sets and sharding. The case study architecture uses Databricks and MongoDB Atlas in the cloud.
The document discusses the rise of NoSQL databases as an alternative to traditional relational databases. It provides a brief history of NoSQL, noting that new types of applications and data led developers to look for databases that offer more flexibility and scalability. It also describes the main types of NoSQL databases - key-value stores, graph stores, column stores, and document stores - and discusses some of the advantages of NoSQL databases like flexibility, scalability, availability and lower costs.
APACHE SPARK PER IL MACHINE LEARNING: INTRODUZIONE ED UN CASO DI STUDIO_ Meet...Deep Learning Italia
Valerio Morfino will present on using Apache Spark for machine learning. He will provide an introduction to Apache Spark, describe its parallel programming model, and present a case study on predicting DNA splicing sites. For the case study, he will load and prepare bioinformatics data, train models using Python and Spark MLlib, and evaluate the results to classify splicing sites. The goal is to understand how to approach a bioinformatics problem using supervised machine learning with Apache Spark.
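The data-preparation step described above — turning raw DNA windows into numeric feature vectors a classifier can consume — can be sketched in plain Python. This is a hypothetical illustration of one-hot encoding only; the talk itself performs this kind of preparation at scale with Spark MLlib, and the sequence data here is invented.

```python
# One-hot encode fixed-length DNA windows into flat 0/1 feature vectors,
# the kind of numeric input a splicing-site classifier needs.
# The example sequence is made up for illustration.

BASES = "ACGT"

def one_hot(sequence):
    """Map a DNA string to a flat feature vector (4 slots per base)."""
    vec = []
    for base in sequence.upper():
        vec.extend(1 if base == b else 0 for b in BASES)
    return vec

window = "ACGTGGTA"
features = one_hot(window)
print(len(features))  # 32: 4 features per base over 8 bases
```

Each window of length n becomes a vector of length 4n, which can then be fed to any supervised learner, distributed or not.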
"Machine Learning with Apache Spark: from lab to scale"
Abstract: Apache Spark is a highly successful platform for Big Data processing, thanks to its high scalability, performance, and ease of use. This talk introduces the structure of Apache Spark, its main features, and the MLlib machine learning library. Spark's potential is illustrated through the analysis of a supervised machine learning case study on a bioinformatics problem. Finally, possible architectures for applying the model to large amounts of data are presented.
Delivered this talk as part of Spark & Kafka Summit 2017 organized by Unicom Learning Conference.
Big data processing is undoubtedly one of the most exciting areas in computing today, and remains an area of fast evolution and introduction of new ideas. Apache Spark is on the cusp of overtaking MapReduce to emerge as the de facto standard for big data processing. Thanks to its multi-functional capabilities (SQL, Structured Streaming, ML Pipelines and GraphX) under one unified platform, Spark is now a dominant compute technology across various industry use cases and real-time analytics applications. In the past few years, Apache Spark has seen successful production and commercial deployments across the e-commerce, healthcare and travel industries.
The session gave the audience an understanding of the latest and upcoming trends in big data analytics and the role of Spark in enabling future use cases of advanced analytics.
The session explored the latest concepts from Apache Spark 2.x and introduced various ML/DL frameworks that can run on Spark, along with real-life use cases and applications from the retail and IoT verticals.
Scality Launches Open Source Cloud Program with $100,000 Incentive Fund for S...Marc Villemade
Scality's Giorgio Regni and Bradley King launched Scality's Open Source Program (SCOP) at SNIA's SDC in Santa Clara in September 2010. This slideshow explains the concept of the first library released and the bounty program attached to it.
http://scop.scality.com
The document summarizes the SESAM4 project, which aims to lower barriers for small and medium companies to exploit semantic systems. The project developed open source software, best practices, and tools based on semantic technologies and linked open data. It had 10 partners and was funded for 3 years to work on topics like ontology development, content management integration, and demonstrator applications in tourism.
Spark and Hadoop Perfect Together by Arun MurthySpark Summit
Spark and Hadoop work perfectly together. Spark is a key tool in Hadoop's toolbox that provides elegant developer APIs and accelerates data science and machine learning. It can process streaming data in real-time for applications like web analytics and insurance claims processing. The future of Spark and Hadoop includes innovating the core technologies, providing seamless data access across data platforms, and further accelerating data science tools and libraries.
Different data types, operational efficiencies, and variable workloads are driving the convergence of data platforms. A converged data platform combines technologies like Hadoop, Spark, streaming, and databases on a single platform with centralized management. This reduces costs and improves reliability compared to separate data silos. Major vendors like MapR are offering converged data platforms that provide real-time processing, multi-model databases, and integration of streaming and batch workloads. Widespread adoption of converged data platforms is expected to continue as businesses seek improved data management and analytics capabilities.
The document describes a reference architecture for a linguistic linked data ecosystem. It proposes standards and best practices for publishing, linking, and accessing multilingual data as linked open data. The key components of the architecture include publishing and hosting linguistic linked data, metadata standards, vocabularies for describing different resource types, linking of open and closed data, discovery layers, and semantic web service composition. The architecture supports decentralization, interoperability, and the development of language technologies and analytics services over linked data.
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...Gezim Sejdiu
Over the past decade, vast amounts of machine-readable structured information have become available through the automation of research processes as well as the increasing popularity of knowledge graphs and semantic technologies.
A major and yet unsolved challenge that research faces today is to perform scalable analysis of large scale knowledge graphs in order to facilitate applications like link prediction, knowledge base completion, and question answering.
Most machine learning approaches, which scale horizontally (i.e. can be executed in a distributed environment) work on simpler feature vector based input rather than more expressive knowledge structures.
On the other hand, the learning methods which exploit the expressive structures, e.g. Statistical Relational Learning and Inductive Logic Programming approaches, usually do not scale well to very large knowledge bases owing to their working complexity.
This talk gives an overview of the ongoing project Semantic Analytics Stack (SANSA) which aims to bridge this research gap by creating an out of the box library for scalable, in-memory, structured learning.
PiLOD 2013: Is Linked Data the future of data integration in the enterprise?John Walker
NXP is exploring using Linked Data to improve data integration across its systems and better share product information with partners and customers. Currently, NXP uses RDF and Linked Data to create a unified, trusted source of product master data stored in a triplestore. This has helped answer previously unanswerable questions and make the data easier to integrate, query and publish through various formats and channels. Next steps include adding more full product data, integrating additional sources, and publishing Linked Open Data for broader online use.
This document provides information about a MongoDB class taught by Alexandre Bergere. The class covers topics including Big Data, NoSQL, MongoDB architecture and modeling, CRUD operations, replication, security, and aggregation. It includes Alexandre's background and credentials, as well as sources and use cases for MongoDB.
In the past, emerging technologies took years to mature. In the case of big data, while effective tools are still emerging, the analytics requirements are changing rapidly, forcing businesses to either keep up or be left behind.
EDINA is a national data center in the UK that delivers geospatial data and services using open standards and open source software. It provides access to collections like Digimap and OpenBoundaries through web mapping applications and data downloads. EDINA uses open standards like OGC and open source software from OSGeo projects to build interoperable and resilient systems while reducing costs. This hybrid approach provides flexible and innovative services to users while meeting the needs of funders.
Native Spark Executors on Kubernetes: Diving into the Data Lake - Chicago Clo...Mariano Gonzalez
Everybody wants to do big data on a data lake! However, implementing one and maintaining the infrastructure necessary to explore it, such as Spark, has historically been a challenging endeavor. Kubernetes is the tool of choice for cloud orchestration, and Spark continues to be the de facto framework for most data wrangling tasks. We’ve previously tried different data lake architectures, and suffered the pain that Hadoop carries with it. Finally, we decided to bring the best of the cloud and big data worlds together, and this session walks you through how to set up an endless data lake powered by native Spark executors on Kubernetes.
Geschäftliches Potential für System-Integratoren und Berater - Graphdatenban...Neo4j
This document provides an agenda for a Neo4j partner day event. The agenda includes sessions on the business potential of Neo4j for system integrators and consultants, the Neo4j partner program, and a case study on using Neo4j to analyze data from the Panama Papers leak. There are also sessions on networking breaks and lunch.
This document describes Schema.org and its potential uses beyond search engine optimization. Schema.org was created in 2011 by major search engines to provide a set of shared vocabularies for structured data on web pages. It has since grown to include over 2000 terms covering entities, relationships, and actions. The document discusses how Schema.org data can be used for analytics by extracting metadata from web pages and sending it to Google Analytics for additional dimensions and metrics. This enables analysis of user behavior at a more granular level than is normally possible from web analytics alone.
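The metadata-extraction step described above can be sketched with the Python standard library alone. This is a minimal illustration, assuming Schema.org data embedded as JSON-LD; the HTML snippet and its field values are invented, and a real pipeline would forward the parsed fields to an analytics backend rather than print them.

```python
import json
import re

# Pull Schema.org JSON-LD blocks out of a page so their fields can be
# forwarded to analytics as extra dimensions and metrics.
# The HTML below is a made-up example for illustration.

HTML = """
<script type="application/ld+json">
{"@type": "Article", "headline": "Schema.org beyond SEO", "wordCount": 1200}
</script>
"""

def extract_jsonld(html):
    """Return every JSON-LD object embedded in the given HTML."""
    pattern = r'<script type="application/ld\+json">(.*?)</script>'
    return [json.loads(m) for m in re.findall(pattern, html, re.DOTALL)]

data = extract_jsonld(HTML)
print(data[0]["headline"])
```

In production a tolerant HTML parser would be preferable to a regular expression, but the shape of the pipeline — locate the JSON-LD script tags, parse them, ship the fields onward — stays the same.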
Rajeev kumar apache_spark & scala developerRajeev Kumar
Rajeev Kumar is an experienced Apache Spark and Scala developer based in Amsterdam, NL. He has over 8 years of experience working with big data technologies like Apache Spark, Scala, Java, Hadoop, and data integration tools. He is proficient in processing large structured and unstructured datasets to identify patterns and gain insights. His experience includes designing and developing Spark applications using Scala, ETL processes, data warehousing, and working with technologies like Hive, HDFS, MapReduce, Sqoop, Kafka and more.
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | QuboleVasu S
This ebook dives deep into Apache Spark optimizations that improve performance, reduce costs and deliver unmatched scale.
https://www.qubole.com/resources/ebooks/accelerating-time-to-value-of-big-data-of-apache-spark
How google is using linked data today and vision for tomorrowVasu Jain
In this presentation, I will discuss how modern search engines, such as Google, make use of Linked Data embedded in Web pages to display Rich Snippets. I will also present an example of the technology and analyze its current uptake.
I will then sketch some ideas on how Rich Snippets could be extended in the future, in particular for multimedia documents.
Original Paper :
http://scholar.google.com/citations?view_op=view_citation&hl=en&user=K3TsGbgAAAAJ&authuser=1&citation_for_view=K3TsGbgAAAAJ:u-x6o8ySG0sC
Another Presentation by Author: https://docs.google.com/present/view?id=dgdcn6h3_185g8w2bdgv&pli=1
Detailed presentation on big data hadoop +Hadoop Project Near Duplicate Detec...Ashok Royal
Big Data Hadoop, its components, and a Hadoop project are described in detail.
Visit http://hadoop-beginners.blogspot.com to see Hadoop Tutorials.
Thanks for the visit. :)
This slide deck has been prepared for a workshop on Linked Data Publishing and Semantic Processing using the Redlink platform (http://redlink.co). The workshop, delivered at the Department of Information Engineering, Computer Science and Mathematics at Università degli Studi dell'Aquila, aimed to provide a general understanding of Semantic Web Technologies and how they can be used in real-world use cases such as Salzburgerland Tourismus.
A brief introduction is also included on MICO (Media in Context), a European Union part-funded research project providing cross-media analysis solutions for online multimedia producers.
Nagapandu Potti seeks a software engineering role that utilizes his technical skills. He has strong skills in Java, C, C++, Ruby, Scala, C#, databases like MySQL and MongoDB, web development technologies like JavaScript, AngularJS, and Ruby on Rails. He has work experience developing applications using these skills at Citrix and Cerner. Potti has a Master's degree in Computer Science from the University of Florida and a Bachelor's degree in Computer Science from Manipal University.
The document discusses using machine learning and quantum computing to optimize marketing campaigns. Specifically, it details using transformers to predict user appreciation and activation for different incentive levels. A quantum annealer is then used to solve the NP-hard problem of allocating incentives to users to maximize appreciation within a cost target. Three A/B tests are conducted, with the second hitting targets and the third having uncertain results. Overall the approach shows potential to optimize large marketing budgets.
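The allocation problem described above — choosing an incentive level per user to maximize predicted appreciation under a budget — can be illustrated with a classical greedy baseline. This is a hedged sketch, not the talk's method: the talk solves the problem with a quantum annealer, while the code below uses a simple upgrade-by-best-ratio heuristic, and every user, cost, and appreciation score is invented.

```python
# Classical greedy baseline for the incentive-allocation problem: pick a
# per-user incentive level maximizing predicted appreciation within a
# total cost budget. All numbers below are invented for illustration.

users = {  # user -> list of (incentive_level, cost, predicted_appreciation)
    "u1": [(0, 0.0, 0.1), (5, 5.0, 0.6), (10, 10.0, 0.7)],
    "u2": [(0, 0.0, 0.2), (5, 5.0, 0.3), (10, 10.0, 0.9)],
}

def greedy_allocate(users, budget):
    # Start everyone at the zero-incentive option, then repeatedly take
    # the upgrade with the best appreciation gain per unit of extra cost.
    alloc = {u: opts[0] for u, opts in users.items()}
    spent = 0.0
    improved = True
    while improved:
        improved = False
        best = None
        for u, opts in users.items():
            cur = alloc[u]
            for opt in opts:
                extra, gain = opt[1] - cur[1], opt[2] - cur[2]
                if extra > 0 and gain > 0 and spent + extra <= budget:
                    ratio = gain / extra
                    if best is None or ratio > best[0]:
                        best = (ratio, u, opt)
        if best:
            _, u, opt = best
            spent += opt[1] - alloc[u][1]
            alloc[u] = opt
            improved = True
    return alloc, spent

alloc, spent = greedy_allocate(users, budget=10.0)
```

The greedy heuristic respects the budget but can miss the optimum — which is exactly why the exact, NP-hard formulation motivates stronger solvers such as the annealer in the talk.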
This document provides an overview of transformers in computer vision. It discusses how transformers were originally developed for natural language processing, using attention mechanisms instead of recurrent connections. Vision transformers apply this approach to images by treating patches as tokens and using self-attention. Early vision transformers achieved strong results on image classification tasks. Recent developments include Swin transformers, which compute self-attention within shifted local windows for efficiency, and models that combine convolutional and transformer architectures. Transformers are also being applied to video understanding tasks. The document explores different transformer architectures and applications of vision transformers.
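The "patches as tokens" idea mentioned above can be shown in a few lines of plain Python: split an image into non-overlapping patches and flatten each into a token vector, the step a vision transformer performs before applying self-attention. The toy 4x4 "image" of nested lists is an illustrative stand-in for a real tensor.

```python
# Split a tiny "image" (nested lists) into non-overlapping p x p patches
# and flatten each patch into a token vector, as a vision transformer
# does before self-attention. Toy 4x4 image, 2x2 patches.

def to_patches(image, p):
    n = len(image)
    patches = []
    for i in range(0, n, p):
        for j in range(0, n, p):
            patch = [image[i + di][j + dj] for di in range(p) for dj in range(p)]
            patches.append(patch)
    return patches

image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = to_patches(image, 2)
print(len(tokens), len(tokens[0]))  # 4 tokens, each of dimension 4
```

In a real model each flattened patch is then linearly projected and given a positional embedding before entering the attention layers.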
1. The document discusses operations research, which uses mathematical modeling to help make better decisions.
2. Operations research tools like mathematical programming and decomposition methods can be used to solve large, complex problems and scale to practical applications.
3. Decomposition methods break large problems into smaller subproblems that can be solved independently to find good feasible solutions for the original problem.
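The decomposition idea in the list above can be shown in miniature: when an objective is separable, each subproblem is solved independently and the partial solutions recombine into a solution of the whole. The toy subproblems below are invented for illustration.

```python
# Decomposition in miniature: a separable objective splits into
# independent subproblems, each solved on its own, then recombined.
# Maximize the sum of score_i(x_i), each x_i from its own small set.

subproblems = [
    {"choices": [1, 3, 5], "score": lambda x: 10 - (x - 3) ** 2},
    {"choices": [2, 4],    "score": lambda x: x * 2},
]

def solve_independently(subproblems):
    solution, total = [], 0
    for sp in subproblems:
        best = max(sp["choices"], key=sp["score"])  # solve one subproblem
        solution.append(best)
        total += sp["score"](best)
    return solution, total

solution, total = solve_independently(subproblems)
print(solution, total)  # [3, 4] 18
```

Real decomposition methods (e.g. for problems with coupling constraints) iterate between subproblems and a coordinating master problem, but the payoff is the same: small independent pieces in place of one large monolith.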
This document describes a deep learning approach called c-ResUnet for counting cells in fluorescent microscopy images. It discusses fluorescent microscopy imaging techniques and applications in life sciences. It then introduces the Fluorescent Neuronal Cells dataset and challenges in counting cells, such as class imbalance, overcrowding, and noise. The c-ResUnet model is presented, which uses a convolutional neural network with residual blocks for semantic segmentation. Experiments show that c-ResUnet outperforms other architectures and achieves performance close to human experts on this dataset through the use of weight maps and oversampling artifacts during training. Both qualitative and quantitative evaluations demonstrate the effectiveness of c-ResUnet for automated cell counting.
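The oversampling idea mentioned above — countering class imbalance by repeating rare examples during training — can be sketched in plain Python. This is a generic illustration, not the paper's pipeline: the labels and counts are synthetic, and c-ResUnet applies the idea to rare artifact examples in image data.

```python
import random

# Counter class imbalance by duplicating rare-class samples until the
# two classes are balanced. Labels and counts here are synthetic.

random.seed(0)
samples = [("cell", i) for i in range(90)] + [("artifact", i) for i in range(10)]

def oversample(samples, minority_label):
    """Return a dataset where the minority class is resampled to parity."""
    minority = [s for s in samples if s[0] == minority_label]
    majority = [s for s in samples if s[0] != minority_label]
    extra = [random.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra

balanced = oversample(samples, "artifact")
```

After resampling, each training batch sees the rare class often enough for the loss signal not to be swamped by the majority class.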
In recent years, robotics is finally leaving the factories to populate the cities we live in. Self-driving cars, food-delivery drones and robots, quadrupeds patrolling the streets: these are just a few examples of what can already be found today in many neighborhoods around the world. The deep learning revolution that began in 2012 is only one element of this diffusion, which also rests on complex market dynamics and decades of prior research on robotic systems, from both the software and the hardware points of view. Where do we stand today? What challenges do researchers and companies face in this field? Which market mechanisms drive the development of these systems? This talk answers these questions, providing a complete overview of the state of the art in urban mobile robotics.
L’identificazione di anomalie è una tematica sempre più popolare che viene affrontata su più fronti. In generale, l’anomalia rappresenta un’entità, un evento o una caratteristica che non risulta conforme allo standard di normalità. Le anomalie sono un ostacolo, a volte anche pericoloso come per esempio nella sicurezza informatica, in cui l’intrusione di persone non fidate all’interno di sistemi informatici può diventare critico per un’azienda o un’istituzione; in industrie invece, le anomalie possono danneggiare la qualità dei prodotti, causando pesanti perdite in termini economici. Per questo motivo vengono ideate numerose tecniche che permettono di riconoscere le anomalie e ridurre i pericoli, i danni da esse causate o semplicemente per monitorare la qualità e gestire la manutenzione.
In un contesto di immagini, il riconoscimento di anomalie è un problema di Computer Vision. Esistono metodi di ricostruzione come gli Autoencoder o metodi generativi come le GAN che si occupano di risolvere tale problema. Tra i modelli che si basano sulle GAN, chiamati GAN-based, si distingue il modello Ganomaly: esso permette di rilevare se un’immagine sia anomala.
Sulla base di quest’ultimo, nascono Patch-Ganomaly, con cui si vuole migliorare il comportamento di Ganomaly, andando a localizzare la regione anomala di un’immagine, in termini di pixel, e migliorarne efficacia ed efficienza.
Mediante l’utilizzo di transfer learning basato sulla rete VGG16 è possibile ottenere un modello più preciso, TL-Ganomaly. Esso localizza la regione anomala in maniera precisa, in termini di pixel riconosciuti correttamente anomali.
In fase di post-processing inoltre è possibile dare un ulteriore apporto con il modello Conv-Processing, il quale apprende quale kernel convoluzionale riesca a migliorare la segmentazione delle anomalie in fase di post-processing.
This document outlines a presentation on exploiting graph theory for systems biology. It introduces concepts of graph theory including networks, connected graphs, representations like adjacency matrices, and biological network abstractions. It discusses analyzing biological networks for non-random organization using measures like node degree, power law distributions, and scale-free properties. Examples of protein interaction and metabolic networks are provided. Sources of interaction data and network analysis tools like Cytoscape and Genemania are also mentioned. The document outlines identifying key molecules and mechanisms through centrality measures and optimal diffusion/disruption of networks.
This document discusses machine learning security risks. It begins by explaining how machine learning works and its increasing applications. However, it notes that criminals are also exploiting machine learning tools. It then describes different types of machine learning attacks, including evasion attacks, adversarial attacks, data poisoning attacks, and backdoor poisoning attacks. Specifically, backdoor poisoning aims to force a model to predict an attacker's chosen class when presented with a specific trigger. The document argues that understanding how backdoor poisoning works is an open problem and presents a framework using learning curves to better understand a model's vulnerability to backdoors.
The document discusses deep learning applications for medical image analysis, including for diagnosis, surgical planning and guidance, and risk assessment. Specifically, it presents examples of using deep learning for tasks like classification, segmentation, detection, and pose estimation using medical images from modalities like ultrasound, X-ray, and video. Challenges in the field include limited datasets, variability in medical images, and privacy concerns, but deep learning methods are able to learn features directly from data to help with complex medical image analysis problems.
This document summarizes an AI training program hosted by Pi Campus. The program trains engineers from around the world in artificial intelligence skills and has them apply their new skills on industry projects provided by partner companies. It offers the training for free to top developers and provides grants to cover travel and accommodation for those transferring from abroad. The program focuses on learning by doing through hands-on projects rather than traditional teaching. It partners with companies like Google, Facebook, and Amazon to sponsor developers and solve real-world challenges.
LIME is a model-agnostic framework that provides local explanations for black box machine learning models. It works by generating new data points around the prediction being explained and training a simple interpretable model, such as linear regression, on those points. This local model approximates the more complex black box model and is used to provide feature importance values for explaining the prediction. The key steps in LIME are data point generation, weighting points by proximity to the prediction being explained, and training an interpretable local model on the weighted points. LIME aims to provide human-understandable explanations by approximating the black box model with an interpretable local model.
This document discusses explainable artificial intelligence (XAI) techniques. It begins with an introduction to XAI and defines interpretability, comprehensibility, and explainability. It then discusses the problems of "black box" models and the need for explanations. The document outlines several XAI techniques including LIME, LORE, and SHAP. LIME provides local explanations by learning an interpretable model on a perturbed dataset. LORE uses a genetic algorithm to sample the dataset and extracts rules. SHAP assigns feature importance values based on Shapley values from game theory.
Felipe Campos Kitamura is a medical doctor, radiologist, and AI practitioner whose research interests include medical imaging, computer vision, artificial intelligence, and machine learning. He is currently focused on using machine learning in healthcare applications such as medical imaging analysis and using AI to help summarize surgical events in real-time. Machine learning can be applied in healthcare for tasks like medical diagnosis, predictive analytics for disease screening and monitoring, and assisting with surgical procedures.
The document summarizes recent work in natural language generation (NLG), including common training and evaluation practices as well as efforts to address limitations. It discusses how teacher forcing can lead to exposure bias during inference and explores alternatives like reinforcement learning and generative adversarial networks. It also reviews work on multilingual datasets and metrics as well as efforts to develop more accurate evaluation methods for NLG like question-based metrics and SAFEval. The document concludes by discussing promising directions for future work such as leveraging discriminators during training and generating questions to evaluate NLG models.
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)Deep Learning Italia
This document provides an overview of transformer seq2seq models, including their concepts, trends, and limitations. It discusses how transformer models have replaced RNNs for seq2seq tasks due to being more parallelizable and effective at modeling long-term dependencies. Popular seq2seq models like T5, BART, and Pegasus are introduced. The document reviews common pretraining objectives for seq2seq models and current trends in larger model sizes, task-specific pretraining, and long-range modeling techniques. Limitations discussed include the need for grounded representations and efficient generation for seq2seq models.
Towards Quantum Machine Learning Hands-on
Machine Learning (ML) gained a lot of momentum in the last ten years, mostly thanks to the advancements in non-linear patterns discovery, and more specifically, in Deep Learning (DL). But those who think that DL is going to address all possible problems might be terribly wrong. DL and ML tasks, in general, are categorized as Non-Polynomial problems, which means that the number of possible solutions for a given problem can grow exponentially, making it intractable using the classical algorithmic approach. Here, Quantum Computing (QC) techniques have the potential to address these issues and help ML methods to solve problems faster and sometimes better than the classical counterpart. The conjunction of these two disciplines resulted in a new exciting research direction to explore: Quantum Machine Learning (QML).
towards Quantum Machine Learning
Machine Learning (ML) gained a lot of momentum in the last ten years, mostly thanks to the advancements in non-linear patterns discovery, and more specifically, in Deep Learning (DL). But those who think that DL is going to address all possible problems might be terribly wrong. DL and ML tasks, in general, are categorized as Non-Polynomial problems, which means that the number of possible solutions for a given problem can grow exponentially, making it intractable using the classical algorithmic approach. Here, Quantum Computing (QC) techniques have the potential to address these issues and help ML methods to solve problems faster and sometimes better than the classical counterpart. The conjunction of these two disciplines resulted in a new exciting research direction to explore: Quantum Machine Learning (QML).
Analysis insight about a Flyball dog competition team's performanceroli9797
Insight of my analysis about a Flyball dog competition team's last year performance. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
The Ipsos - AI - Monitor 2024 Report.pdfSocial Samosa
According to Ipsos AI Monitor's 2024 report, 65% Indians said that products and services using AI have profoundly changed their daily life in the past 3-5 years.
Learn SQL from basic queries to Advance queriesmanishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataKiwi Creative
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
End-to-end pipeline agility - Berlin Buzzwords 2024Lars Albertsson
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long time does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
1. Meetup Deep Learning Italia – 11/11/2019 - Roma
Machine Learning in the Big Data context on document-based data using Apache Spark & MongoDB
Speaker Valerio Morfino & Orlando Moroni
2. DB Services is a technology company operating throughout Italy, with offices in Rome and Milan.
The company's mission is to help its customers manage, enhance, and analyze their data in the best possible way, thanks to talent and technology.
"Around and inside data", because DBS works with every technology that touches data: Advanced Analytics, Visual Analytics, Business Intelligence, ETL, Data Ingestion, SQL, NoSQL and Big Data systems, Middleware, on-premise and cloud are all part of the company's DNA.
DBS brings professionalism and innovation to the main public and private big companies as well as to SMEs.
Through a careful partnership strategy, DBS brings its customers the most important and advanced technologies on the market, including MongoDB, Tableau, and Oracle.
We experiment with, support, and propose cutting-edge technologies thanks to strategic partnerships with universities and startups, such as Deep Learning Italia.
www.dbservices.it
Giorgia Butera, Sales Executive
giorgia.butera@dbservices.it
https://www.linkedin.com/in/giorgia-butera-693786122/
3. VALERIO MORFINO
Head of Big Data & Analytics
DB Services
A computer engineer, he is Head of Big Data & Analytics at DB Services.
Over the course of his career he has worked in consulting firms, universities, and companies, dealing with consulting, training, research, and project management.
He is the author of articles and a speaker at conferences on web, e-commerce, machine learning, and big data topics.
ORLANDO MORONI
COO & Principal MongoDB Architect
DB Services
A computer engineer, he is Chief Operating Officer & Principal MongoDB Architect at DB Services.
Over the course of his career he has worked with the most important relational and non-relational database technologies, on architecture, data modeling, and performance tuning.
He has taken part in building some of the most important MongoDB installations in Italy.
4. Summary
❑ Introduction
❑ Apache Spark
❑ MongoDB
❑ Mongo Spark Connector
❑ Case Study: SYN-DOS attack prediction on IoT
devices
Deep Learning Italia 11/11/2019 – Roma - Machine Learning in the Big Data context on document-based data using Apache Spark & MongoDB, Valerio Morfino & Orlando Moroni
5. Introduction
6. Toward a new era
Traditional data platforms are failing to meet new business
requirements that demand a no-compromises combination of:
❑ Real-time data
❑ Performance
❑ Scale
❑ Integrated data
❑ Security
7. The Translytical era
Translytical is a hot, emerging market that delivers a unified
data platform to support all kinds of workloads.
Translytical can support various use cases, including real-time
insights, machine learning (ML), streaming analytics, extreme
transactional processing, and operational reporting.
[The Forrester Wave™: Translytical Data Platforms, Q4 2019]
8. Why MongoDB & Spark?
❑ We are in a Big Data World!
❑ Store high Volume of Data
❑ Store and Analyze data with high Velocity
❑ Store data in a Variety of formats and locations
❑ Be aware of Vulnerability!
9. Why MongoDB & Spark? Examples
❑ Store and analyze data from IoT devices
❑ Store and analyze data in distributed environments
❑ Enable real-time analytics without ETL
❑ Advanced analytics and Machine Learning at scale
❑ Enrich BI report & dashboard with augmented
analytics features
10. Apache Spark
11. Apache Spark
❑ A distributed, cluster-based, general-purpose engine for big data processing
❑ Fully integrated with the Hadoop ecosystem
❑ Available both locally and in cloud environments
❑ Scales to clusters of hundreds or even thousands of nodes
❑ Up to 100x faster than Hadoop MapReduce
❑ Resilient thanks to lineage and a distributed storage system (e.g. HDFS or MongoDB)
❑ This matters because Big Data means long processing tasks on big clusters, where hardware, software, or network connections can fail!
12. Apache Spark
❑ High-level APIs accessible in Java, Scala, Python and R
❑ The MLlib library is rich in efficient parallel implementations of machine learning algorithms
13. Spark Cluster configurations
❑ Several cluster configurations:
❑ Standalone
❑ Hadoop YARN
❑ Mesos
❑ Kubernetes
14. RDDs to store Large datasets
❑ Resilient, i.e. fault-tolerant thanks to RDD lineage graph, able to recompute missing or
damaged partitions
❑ Distributed, with data residing on multiple nodes in a cluster
❑ Dataset: a collection of partitioned data, kept in memory as far as possible (spilling to disk otherwise)
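The "resilient" property above can be sketched in a few lines of plain Python (a toy model of the idea, not the Spark API): each derived dataset records its parent and the transformation that produced it, so a lost partition can be recomputed from that lineage rather than restored from a copy.

```python
class ToyRDD:
    """Toy model of an RDD: partitioned data plus the lineage needed to rebuild it."""

    def __init__(self, partitions, parent=None, fn=None):
        self.partitions = partitions   # list of lists, one per "node"
        self.parent = parent           # lineage: the dataset this one was derived from
        self.fn = fn                   # lineage: the transformation that derived it

    def map(self, fn):
        # A transformation produces a new dataset and records how it was made.
        return ToyRDD([[fn(x) for x in p] for p in self.partitions],
                      parent=self, fn=fn)

    def recompute(self, i):
        # Rebuild partition i from the parent, as Spark does after a node failure.
        return [self.fn(x) for x in self.parent.partitions[i]]


base = ToyRDD([[1, 2], [3, 4]])
doubled = base.map(lambda x: x * 2)

doubled.partitions[1] = None                   # simulate losing a partition
doubled.partitions[1] = doubled.recompute(1)   # recover it via lineage
print(doubled.partitions)                      # [[2, 4], [6, 8]]
```

Spark tracks exactly this kind of lineage graph per RDD, which is why it can tolerate node failures without replicating every intermediate result.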
15. MLlib - Spark’s machine learning library
❑ ML algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
❑ Featurization: feature extraction, transformation, dimensionality reduction, and selection
❑ Pipelines: tools for constructing, evaluating, and tuning ML pipelines
❑ Persistence: saving and loading algorithms, models, and pipelines
❑ Utilities: linear algebra, statistics, data handling, etc.
❑ Text manipulation: tokenization, common-word (stop-word) removal, word combinations, Word2Vec
Note: As of Spark 2.0, DataFrame-based API is primary API (package spark.ml). The MLlib RDD-based API is now in
maintenance mode (package spark.mllib)
16. MongoDB
17. MongoDB
❑ Document-oriented NoSQL database
❑ JSON-like documents with flexible schemas
❑ A distributed database at its core:
❑ high availability (replica set)
❑ horizontal scaling (sharding)
❑ geographic distribution
❑ Open source, cross-platform
18. MongoDB – Why?
Intelligent Operational Data Platform:
❑ Document Model – the best way to work with data
❑ Distributed Architecture – intelligently put data where you need it
❑ Run Anywhere – freedom to run anywhere
19. MongoDB
Versatile: multiple data models, rich query functionality
❑ Data models: JSON Documents | Tabular | Key-Value | Text | Graph | Geospatial
❑ Rich queries: Point | Range | Geospatial | Faceted Search | Aggregations | JOINs | Graph Traversals
20. MongoDB – Relational dictionary
{
  first_name: 'Paul',
  surname: 'Miller',
  city: 'London',
  profession: ['banking', 'finance'],
  location: [45.123, 47.232],
  cars: [
    { model: 'Bentley',
      year: 1973,
      value: 100000, … },
    { model: 'Rolls Royce',
      year: 1965,
      value: 330000, … }
  ]
}
(Shown side by side with the equivalent RDBMS tables in the original slide.)
21. MongoDB – Relational dictionary
MongoDB | SQL
database | database
collection | table
document | record (row)
field | column
linking/embedded documents | join
primary key (_id field) | primary key (user designated)
index | index
22. MongoDB – Replica Set
Replica Set
• Up to 50 replicas
• Distributed across racks, data centers, and regions
Self-healing
Data Center Aware
Addresses availability considerations:
• High Availability
• Disaster Recovery
• Maintenance
(Diagram: the application's driver connects to the Primary, which replicates to two Secondaries.)
23. MongoDB – Automatic Sharding
Application transparent
Multiple sharding policies: hashed, ranged, zoned
Increase or decrease capacity as you go
Automatic balancing for elasticity
Horizontally Scalable
(Diagram: Shard 1 | Shard 2 | Shard 3 | … | Shard N)
24. Mongo Spark Connector
25. Mongo Spark Connector: most important features
❑ Ability to read/write BSON documents directly from/to MongoDB
❑ Automatic conversion from a MongoDB collection to a Spark RDD (or DataFrame/Dataset)
❑ Predicate pushdown:
❑ Filters (e.g. where conditions) and selects are pushed down to the data source, so the actual filtering and projection are done on the MongoDB nodes before the data is returned to the Spark nodes.
❑ Integration with the MongoDB aggregation pipeline:
❑ A MongoRDD accepts a MongoDB aggregation pipeline, so aggregations can run on the MongoDB nodes instead of the Spark nodes. Most of this work is performed automatically by the connector.
❑ Data locality:
❑ If the Spark nodes and the MongoDB nodes (in a sharded cluster configuration) are deployed on the same servers, data is loaded according to its locality in the cluster, avoiding costly network transfers.
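From PySpark, the pushdown behaviour described above looks roughly like this. This is a sketch, not code from the talk: the connection string, database/collection, and column names are placeholders, and it needs a live MongoDB deployment plus the connector on the classpath, so treat it as a configuration fragment. With connector 2.x the short format name "mongo" resolves to com.mongodb.spark.sql.DefaultSource.

```python
from pyspark.sql import SparkSession

# Placeholder Atlas connection string, ending in <database>.<collection>.
uri = "mongodb+srv://user:password@cluster0.example.mongodb.net/iot.packets"

spark = (SparkSession.builder
         .appName("mongo-pushdown-sketch")
         .config("spark.mongodb.input.uri", uri)
         .getOrCreate())

df = spark.read.format("mongo").load()

# Both the projection (select) and the predicate (filter) are pushed down
# to MongoDB, so they run on the database nodes, not in Spark.
attacks = df.select("feature_1", "label").filter(df.label == "attack")
attacks.explain()   # the physical plan shows the pushed-down filters
```

Inspecting the plan with `explain()` is a quick way to confirm that a given filter was actually pushed down rather than evaluated in Spark.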
26. Reference Architecture for MongoDB & Spark
❑ Apache Spark
❑ MongoDB Connector for
Spark
❑ MongoDB nodes
❑ Data locality (Spark
Workers and MongoDB
nodes on the same node)
27. Case study Architecture
❑ Full Cloud architecture
❑ Databricks Community
❑ MongoDB Atlas
28. Case Study configuration
❑ Databricks Community Edition
❑ https://databricks.com/try-databricks
❑ 5.5 LTS (Spark 2.4.3 + Scala 2.11)
❑ Import the Maven library for the MongoDB Spark Connector:
❑ org.mongodb.spark:mongo-spark-connector_2.11:2.4.1
❑ MongoDB Atlas
❑ https://www.mongodb.com/cloud/atlas
29. CASE STUDY
SYN-DOS attack prediction
30. Cyber attacks
❑ They can undermine:
❑ Confidentiality
❑ Integrity
❑ Availability
❑ DOS (Denial of Service) attacks undermine Availability
❑ A SYN-DOS attack (also known as SYN flood) undermines availability by saturating the server's TCP/IP connections
31. SYN-DOS Attack
1. The client requests a connection by sending a SYN (synchronize) message to the server.
2. The server acknowledges by sending a SYN-ACK (synchronize-acknowledge) message back to the client.
3. The client responds with an ACK (acknowledge) message, and the connection is established.
https://www.imperva.com/learn/application-security/syn-flood/
32. Dataset & Reference
❑ Dataset Description
❑ 115 features (Double)
❑ 1 label (String)
❑ 11,000 total samples (10,000 normal + 1,000 attack)
❑ The features contain statistics that implicitly describe the current state of the channel
❑ The data comes from IP cameras
❑ The statistics are generated by a feature extractor
❑ Syn-DOS
❑ Paper: https://arxiv.org/pdf/1802.09089.pdf
❑ Full dataset: https://drive.google.com/drive/folders/1kmoWY4poGWfmmVSdSu-r_3Vo84Tu4PyE
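A side note of ours (not a claim from the paper): with 10,000 normal and 1,000 attack samples the classes are imbalanced roughly 10:1, so plain accuracy is a weak yardstick for this dataset. A quick back-of-the-envelope check:

```python
normal, attack = 10_000, 1_000
total = normal + attack

# A degenerate classifier that always answers "normal" is already ~91% accurate...
majority_accuracy = normal / total
print(f"{majority_accuracy:.3f}")   # 0.909

# ...yet it never flags a single attack: recall on the attack class is 0.
attack_recall = 0 / attack
print(attack_recall)                # 0.0
```

Metrics such as recall, precision, or F1 on the attack class are therefore more informative than raw accuracy when evaluating models on this data.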
33. Let’s code!
34. Demo walkthrough
❑ Access the MongoDB Atlas console:
❑ https://cloud.mongodb.com
❑ View the Collections
❑ Access Databricks
❑ Create the cluster
❑ Import the Maven library: org.mongodb.spark:mongo-spark-connector_2.11:2.4.1
❑ Notebook: create the Collection on MongoDB
❑ Notebook: training
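The notebooks themselves are not reproduced in the deck; the following is a minimal sketch of what the training notebook could look like, under stated assumptions: the connection string is a placeholder, and the choice of RandomForestClassifier is ours for illustration, not necessarily the model used in the talk. Since the dataset has 115 Double features plus a String label, a StringIndexer and a VectorAssembler are needed in front of the classifier.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("syn-dos-training").getOrCreate()

# Placeholder URI: point it at the collection created by the first notebook.
df = (spark.read.format("mongo")
      .option("uri",
              "mongodb+srv://user:password@cluster0.example.mongodb.net/syn_dos.samples")
      .load())

# Every column except MongoDB's _id and the label is a numeric feature.
feature_cols = [c for c in df.columns if c not in ("_id", "label")]

pipeline = Pipeline(stages=[
    StringIndexer(inputCol="label", outputCol="labelIndex"),   # String label -> numeric
    VectorAssembler(inputCols=feature_cols, outputCol="features"),
    RandomForestClassifier(featuresCol="features", labelCol="labelIndex"),
])

train, test = df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)

# F1 is preferable to accuracy here given the 10:1 class imbalance.
evaluator = MulticlassClassificationEvaluator(labelCol="labelIndex",
                                              predictionCol="prediction",
                                              metricName="f1")
print("F1:", evaluator.evaluate(model.transform(test)))
```

This sketch requires a Spark cluster with the connector library attached (e.g. the Databricks setup from slide 28) and a live Atlas collection, so it is a configuration-dependent fragment rather than something runnable standalone.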
35. Useful links
❑ https://spark.apache.org/docs/latest/
❑ https://spark.apache.org/docs/latest/ml-guide.html
❑ https://spark.apache.org/docs/latest/ml-classification-regression.html
❑ https://docs.databricks.com/getting-started/index.html
❑ https://www.mongodb.com/it
❑ https://databricks.com/try-databricks
❑ https://www.mongodb.com/cloud/atlas
❑ https://docs.mongodb.com/spark-connector/
36. Thank you for your attention
valerio.morfino@dbservices.it https://it.linkedin.com/in/valerio-morfino
orlando.moroni@dbservices.it https://www.linkedin.com/in/orlandomoroni