The document discusses distributed machine learning on the Java Virtual Machine (JVM), aimed at practitioners without advanced degrees. It introduces concepts like big data, machine learning, and distributed systems, then describes how projects like Spark and MLlib perform scalable machine learning on the JVM by distributing tasks across a cluster. Examples include similarity search, clustering, recommendation systems, and model evaluation, demonstrating machine learning algorithms in MLlib.
See 2020 update: https://derwen.ai/s/h88s
SF Python Meetup, 2017-02-08
https://www.meetup.com/sfpython/events/237153246/
PyTextRank is a pure Python open source implementation of *TextRank*, based on the [Mihalcea 2004 paper](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf) -- a graph algorithm which produces ranked keyphrases from texts. Keyphrases are generally more useful than simple keyword extraction. PyTextRank integrates `TextBlob` and `SpaCy` for NLP analysis of texts, including full parse, named entity extraction, etc. It also produces auto-summarization of texts, making use of an approximation algorithm, `MinHash`, for better performance at scale. Overall, the package is intended to complement machine learning approaches -- specifically deep learning used for custom search and recommendations -- by developing better feature vectors from raw texts. This package is in production use at O'Reilly Media for text analytics.
OrientDB: Unlock the Value of Document Data Relationships, by Fabrizio Fortino
a) A general introduction of graph databases and OrientDB,
b) Why connected data has more value than just data,
c) How to "have fun" with OrientDB combining documents with graphs via SQL,
d) A use case on how OrientDB has helped to raise standards in Irish Public Office.
On OrientDB: NoSQL document databases provide an elegant way to deal with data in different shapes, enabling developers to create better and faster products quickly. The main goal of these systems is to find the most efficient solution to manage the data itself. With the big data explosion we need to deal with a myriad of highly interconnected information. The challenge now is not only how to store data but how to manage, analyse, traverse and use your data within the context of relationships. Graph databases shine at maintaining highly connected data and are the fastest-growing category in database management systems: 2014 registered an increase of 250% in terms of adoption, and Forrester Research predicts that more than a quarter of enterprises will be using graphs by 2017. OrientDB combines more than one NoSQL model, offering the unique flexibility of modelling data in the form of either documents or graphs, while incorporating object-oriented programming as a way of encapsulating relationships.
Dmitry Kan, Principal AI Scientist at Silo AI and host of the Vector Podcast [1], will give an overview of the landscape of vector search databases and their role in NLP, along with the latest news and his view on the future of vector search. Further, he will share how he and his team participated in the Billion-Scale Approximate Nearest Neighbor Challenge and improved recall by 12% over a baseline FAISS.
Presented at https://www.meetup.com/open-nlp-meetup/events/282678520/
YouTube: https://www.youtube.com/watch?v=RM0uuMiqO8s&t=179s
Follow Vector Podcast to stay up to date on this topic: https://www.youtube.com/@VectorPodcast
The document discusses OrientDB, a document-graph database. It provides an overview of key OrientDB concepts like documents, vertices, edges, classes, clusters, and properties. It also compares the relational and graph data models. The presentation was given by Greg McCarvell and introduces Node.js integration with OrientDB through examples.
Haystack LIVE! - 5 ways to increase result diversity at web-scale, by Dmitry Kan
Promoting diversity among items in a search result has been shown to increase user satisfaction compared to relevancy-only ranking. In this talk, we'll present how we went about implementing search result diversification methods across different vertical search engines: starting from zero with no diversification at all, exploring simple heuristic-based methods, and moving on to more complex ones based on entropy and determinantal point processes. We'll also discuss evaluation methods and useful tooling around that.
Presented by Dmitry Kan, Principal AI Scientist at Silo AI and Daniel Wärnå, AI Engineer, Silo AI.
YouTube recording:
https://www.youtube.com/watch?v=bri0C28mfl8
Code demoed: https://github.com/DmitryKey/bert-solr-search/tree/master/src/diversify
This document discusses working with events and styles in JavaScript. It covers creating event handlers, using the event object, exploring object properties, working with mouse and keyboard events, and controlling event propagation. Specific topics include adding and removing event listeners, changing inline styles, creating object collections with CSS selectors, and changing the cursor style. The overall goal is to teach how to build interactive elements that respond to user input through events.
Chapter 7 - Data Mining: Concepts and Techniques, 2nd Ed slides, Han & Kamber (error007)
The document describes chapter 7 of the book "Data Mining: Concepts and Techniques" which covers cluster analysis. The chapter discusses what cluster analysis is, different types of data that can be analyzed, major clustering methods like partitioning, hierarchical, and density-based methods. It also covers measuring cluster quality, requirements for clustering in data mining, and how to calculate similarity and dissimilarity between data objects.
A Data Ecosystem to Support Machine Learning in Materials Science (Globus)
This presentation was given at the 2019 GlobusWorld Conference in Chicago, IL by Ben Blaiszik from University of Chicago and Argonne National Laboratory Data Science and Learning Division.
Neo4j-Databridge: Enterprise-scale ETL for Neo4j (GraphAware)
Neo4j - London User Group Meetup - 28th March, 2018
If your data ingestion requirements have grown beyond importing occasional CSV files then this talk is for you. Neo4j-Databridge from GraphAware is a comprehensive ETL tool specifically built for Neo4j. It has been designed for usability, expressive power and high performance to address the most common issues faced when importing data into Neo4j - multiple data sources and types, very large data sets, bespoke data conversions, non-tabular formats, filtering, merging and de-duplication, as well as bulk imports and incremental updates.
In this talk, we'll take a quick tour of some of the main features, loading data from Kafka, Redis, JDBC and various other data sources along the way, to understand how Neo4j-Databridge solves these problems and how it can help you import your data quickly and easily into Neo4j.
Vince Bickers is a Principal Consultant at GraphAware and the main author of Spring Data Neo4j (v4). He has been writing software and leading software development teams for over 30 years at organisations like Vodafone, Deutsche Bank, HSBC, Network Rail, UBS, VMWare, ConocoPhillips, Aviva and British Gas.
1) Entity-centric data management stores and integrates information at the entity level rather than keywords or structured schemas. This allows for more natural integration of heterogeneous data as entities can be interlinked.
2) Techniques presented include ZenCrowd for crowdsourcing entity extraction, hybrid search to combine keyword and graph searches for entities, and Diplodocus for efficiently storing and querying entity data through clustering and co-location.
3) The approaches were shown to improve entity extraction precision by 14%, entity search results by up to 25%, and entity query performance by up to 300x compared to traditional techniques.
A New Year in Data Science: ML Unpaused (Paco Nathan)
This document summarizes Paco Nathan's presentation at Data Day Texas in 2015. Some key points:
- Paco Nathan discussed observations and trends from the past year in machine learning, data science, big data, and open source technologies.
- He argued that the definitions of data science and statistics are flawed and ignore important areas like development, visualization, and modeling real-world business problems.
- The presentation covered topics like functional programming approaches, streaming approximations, and the importance of an interdisciplinary approach combining computer science, statistics, and other fields like physics.
- Paco Nathan advocated for newer probabilistic techniques for analyzing large datasets that provide approximations using less resources compared to traditional batch processing approaches.
This document provides an agenda for a training session on AI and data science. The session is divided into two units: data science and data visualization. Key Python libraries that will be covered for data science include NumPy, Pandas, and Matplotlib. NumPy will be used to create and manipulate multi-dimensional arrays. Pandas allows users to work with labeled and relational data. Matplotlib enables data visualization through graphs and plots. The session aims to provide knowledge of core data science libraries and demonstrate data exploration techniques using these packages.
- NASA has a large database of documents and lessons learned from past programs and projects dating back to the 1950s.
- Graph databases can be used to connect related information across different topics, enabling more efficient search and pattern recognition compared to isolated data silos.
- Natural language processing techniques like named entity recognition, parsing, and keyword extraction can be applied to NASA's text data and combined with a graph database to create a knowledge graph for exploring relationships in the data.
Konstantin Vorontsov - BigARTM: Open Source Library for Regularized Multimodal Topic Modeling (AIST)
The document describes BigARTM, an open source library for regularized multimodal topic modeling of large collections. It discusses probabilistic topic modeling and how additive regularization of topic models (ARTM) handles ill-posed inverse problems in topic modeling. ARTM allows various regularizers to be combined. BigARTM provides a parallel implementation for improved time and memory performance. Experiments show how ARTM can combine regularizers and be used for classification and multi-language topic modeling. Multimodal topic modeling binds topics to terms, authors, images, links and other modalities.
Toward Semantic Data Stream - Technologies and Applications (Raja Chiky)
Massive data stream processing is a scientific challenge and an industrial concern, but with the current volumes of data streams, their number and variety, current techniques are not able to meet the requirements of applications. Semantic Web tools, through RDF for example, make it possible to address the problem of heterogeneous data. Thus, data streams are converted to semantic data streams by using RDF triples extended with a timestamp. To be able to query, filter, or reason over semantic data streams, the query language SPARQL must be extended to include concepts such as windowing, based on what has been done in Data Stream Management Systems. In this talk, I will present recent work on semantic data stream management, particularly extensions made to the SPARQL language and associated benchmarks.
This document provides an overview of machine learning for Java Virtual Machine (JVM) developers. It begins with introductions to the speaker and topics to be covered. It then discusses the growth of data and opportunities for machine learning applications. Key machine learning concepts are defined, including observations, features, models, supervised vs. unsupervised learning, and common algorithms like classification, regression, and clustering. Popular JVM machine learning tools are listed, with Spark/MLlib highlighted for its community support and implementation of standard algorithms. Example machine learning demos on price prediction and spam classification are described. The document concludes with recommendations for further learning resources.
This document discusses data visualization for big data. It begins by explaining why visualization is important, as it can help users notice unexpected patterns in data. It then defines data visualization as using interactive visual representations to amplify cognition. The document outlines several steps to create a visualization: identifying relevant tasks; choosing a library; transforming data into a nested JSON format; binding the data; and creating a user-friendly experience with settings. It provides an example of visualizing network threat data to identify suspicious IP addresses and domains.
This document discusses developing analytics applications using machine learning on Azure Databricks and Apache Spark. It begins with an introduction to Richard Garris and the agenda. It then covers the data science lifecycle including data ingestion, understanding, modeling, and integrating models into applications. Finally, it demonstrates end-to-end examples of predicting power output, scoring leads, and predicting ratings from reviews.
Qualitative AI: Hoo-ha or Step-Change? CAQDAS webinar (Christina Silver)
Slides from the CAQDAS Networking Project's webinar on 1st September 2023: Artificial Intelligence in Qualitative Data Analysis - Hoo-ha or Step-Change?
During 2023 there’s been increasing discussion about the use of artificial intelligence (AI) in qualitative research, spurred by widespread access to generative-AI technologies such as ChatGPT developed by OpenAI.
In this webinar Christina first recounts the history of AI in qualitative data analysis, outlining developments that far pre-date the current upsurge; including Qualrus, Discovertext, WordStat and QDA Miner, and Leximancer.
She’ll then outline how generative-AI is being used in qualitative data analysis at the moment, discussing three uses: chat bots alongside other analytic tools; integrations of OpenAI technology into already established Qualitative Software; and the rise of new generative-AI applications designed specifically for qualitative data analysis tasks.
Christina will open discussion about the implications of these developments for the practice of qualitative research. When are these tools appropriate? What do we need to know about them? What are the ethics of using them? What should we be cautious and excited about? How can the qualitative community shape their development?
Whether you’re an advocate of the use of AI in qualitative data analysis or a sceptic, these technologies are here, they have already impacted the field of qualitative research and they will continue to do so. Join Christina to be part of the conversation, find out what’s happening, share your experiences and experimentations, your fears and hopes. Let the developers know how you want to see these technologies harnessed.
Course 3: Types of data and opportunities by Nikolaos Deligiannis (Betacowork)
This document discusses big data and opportunities related to different types of data. It covers challenges of big data including volume, velocity, variety and veracity. It also discusses value that can be extracted from data. The document outlines static versus real-time data and structured versus unstructured data. Examples of applying machine learning techniques like regression, classification, clustering and dimensionality reduction are provided. The introduction to cloud computing discusses public, private and hybrid clouds and features of cloud infrastructure.
Choosing the right software for your research study: an overview of leading CAQDAS packages (Merlien Institute)
Choosing the right software for your research study : an overview of leading CAQDAS packages by Christina Silver. This presentation is part of the proceedings of the International workshop on Computer-Aided Qualitative Research organised by Merlien Institute. This workshop was held on the 4-5 June in Utrecht, The Netherlands
Lightweight Collection and Storage of Software Repository Data with DataRover (Christoph Matthies)
The ease of setting up collaboration infrastructures for software engineering projects creates a challenge for researchers that aim to analyze the resulting data. As teams can choose from various available software-as-a-service solutions and can configure them with a few clicks, researchers have to create and maintain multiple implementations for collecting and aggregating the collaboration data in order to perform their analyses across different setups.
The DataRover system simplifies this task by only requiring custom source code for API authentication and querying. Data transformation and linkage is performed based on mappings, which users can define based on sample responses through a graphical front end. This allows storing the same input data in formats and databases most suitable for the intended analysis without requiring additional coding.
A screencast of DataRover is available at https://youtu.be/mt4ztff4SfU.
DataRover is available at: https://bitbucket.org/tkowark/data-rover
Mind the Gap - Data Science Meets Software Engineering (Bernhard Haslhofer)
This document summarizes a talk on combining data science and software engineering approaches. It discusses how the two fields approach problems differently, with software engineering focusing on implementing features and ensuring quality through testing, while data science focuses on evaluating models and metrics. The document proposes a solution of defining goals, collecting ground truth data, implementing models and functions, testing and evaluating them, analyzing errors, and deploying services based on metrics. This "metrics driven software engineering" approach aims to bridge the gaps between the two fields.
This document discusses tools and services for data intensive research in the cloud. It describes several initiatives by the eXtreme Computing Group at Microsoft Research related to cloud computing, multicore computing, quantum computing, security and cryptography, and engaging with research partners. It notes that the nature of scientific computing is changing to be more data-driven and exploratory. Commercial clouds are important for research as they allow researchers to start work quickly without lengthy installation and setup times. The document discusses how economics has driven improvements in computing technologies and how this will continue to impact research computing infrastructure. It also summarizes several Microsoft technologies for data intensive computing including Dryad, LINQ, and Complex Event Processing.
Real time analytics with Spark Streaming by Padma at Bangalore I & D meetup (https://www.meetup.com/Bengaluru-Insights-and-Data-Meetup/events/238459154)
Multiplatform Spark solution for Graph datasources by Javier Dominguez (Big Data Spain)
This document summarizes a presentation given by Javier Dominguez at Big Data Spain about Stratio's multiplatform solution for graph data sources. It discusses graph use cases, different data stores like Spark, GraphX, GraphFrames and Neo4j. It demonstrates the machine learning life cycle using a massive dataset from Freebase, running queries and algorithms. It shows notebooks and a business example of clustering bank data using Jaccard distance and connected components. The presentation concludes with future directions like a semantic search engine and applying more machine learning algorithms.
Data mining technique for classification and feature evaluation using stream ... (ranjit banshpal)
This document discusses data stream mining techniques for classification and feature evaluation. It introduces data stream mining and its applications, including network traffic analysis and sensor data. It describes decision trees and the VFDT algorithm for data stream classification. VFDT can classify high-dimensional data streams more efficiently than decision trees. The document also covers challenges in data stream mining like concept drift and feature evolution, and concludes by discussing applications and referencing related work.
Automating Machine Learning, Artificial Intelligence and Data Science Processes... (Ali Alkan)
The document summarizes an agenda for a presentation on machine learning and data science. It includes an introduction to CRISP-DM (Cross Industry Standard for Data Mining), guided analytics, and a KNIME demo. It also discusses the differences between machine learning, artificial intelligence, and data science. Machine learning produces predictions, artificial intelligence produces actions, and data science produces insights. It provides an overview of the CRISP-DM process for data mining projects including the business understanding, data understanding, data preparation, modeling, evaluation, and deployment phases. It also discusses guided analytics and interactive systems to assist business analysts in finding insights and predicting outcomes from data.
1) Entity-centric data management stores information at the entity level and integrates information by interlinking entities. This provides advantages over keyword-based and relational database approaches.
2) The XI Pipeline extracts mentions from text and performs named entity recognition, entity linking, and entity typing to associate entities with text.
3) Approaches like ZenCrowd and TRank leverage both algorithms and human computation through crowdsourcing to improve entity linking and fine-grained entity typing.
Metadata Quality Assurance Part II. The implementation begins (Péter Király)
This document outlines a metadata quality assurance framework. It discusses why data quality is important, what the framework can be used for, and its key principles. It then describes how metadata quality will be measured, including examining schema-independent structural features, use case scenarios, and cataloging known metadata problems. Specific discovery scenarios and their metadata requirements are provided as examples. The document concludes by outlining further steps to develop and implement the framework.
CodeOne 2018 - Microservices in action at the Dutch National Police (Bert Jan Schrijver)
The document discusses the use of microservices architecture at the Dutch National Police. It describes how they have transitioned to using microservices and DevOps practices across their frontend and backend systems. Key points include:
- They have 5 teams building data and web applications using microservices in a private cloud.
- Teams use continuous delivery, have short feedback loops, and focus on people over products.
- The architecture uses microservices, event streaming, and multiple data stores. Services are developed independently and deployed continuously.
- Their methodology focuses on usability testing, minimizing meetings, and using tools like Phabricator and Kubernetes in production.
The document discusses data warehousing, data mining, and business intelligence applications. It explains that data warehousing organizes and structures data for analysis, and that data mining involves preprocessing, characterization, comparison, classification, and forecasting of data to discover knowledge. The final stage is presenting discovered knowledge to end users through visualization and business intelligence applications.
How to Create the Google for Earth Data (XLDB 2015, Stanford), by Rainer Sternfeld
Rainer Sternfeld presented on creating a Google-like platform for earth data using Planet OS. He described the challenges NOAA faces in managing tens of terabytes of weather data per day across scattered systems. Planet OS could index NOAA's metadata and downsample remote datasets via APIs. It would store chunked array data in object stores like S3 and provide on-demand computing via cloud services. This would make NOAA's large-scale data easily discoverable and machine-readable while addressing issues like data volume, transport, and real-time dissemination.
The document discusses schema-less databases and how they differ from traditional databases. Schema-less databases like MongoDB, CouchDB, and Cassandra use documents rather than tables and fields. Documents can vary in structure and there are no enforced relationships between data like with schemas. This flexibility allows for easier development of certain types of applications, like a campaign management system, though it comes with some disadvantages compared to SQL databases.
Artificial Intelligence and XPath Extension Functions (Octavian Nadolu)
The purpose of this presentation is to provide an overview of how you can use AI from XSLT, XQuery, Schematron, or XML Refactoring operations, the potential benefits of using AI, and some of the challenges we face.
WhatsApp offers simple, reliable, and private messaging and calling services for free worldwide. With end-to-end encryption, your personal messages and calls are secure, ensuring only you and the recipient can access them. Enjoy voice and video calls to stay connected with loved ones or colleagues. Express yourself using stickers, GIFs, or by sharing moments on Status. WhatsApp Business enables global customer outreach, facilitating sales growth and relationship building through showcasing products and services. Stay connected effortlessly with group chats for planning outings with friends or staying updated on family conversations.
DDS Security Version 1.2 was adopted in 2024. This revision strengthens support for long runnings systems adding new cryptographic algorithms, certificate revocation, and hardness against DoS attacks.
Introducing Crescat - Event Management Software for Venues, Festivals and Eve... (Crescat)
Crescat is industry-trusted event management software, built by event professionals for event professionals. Founded in 2017, we have three key products tailored for the live event industry.
Crescat Event for concert promoters and event agencies. Crescat Venue for music venues, conference centers, wedding venues, concert halls and more. And Crescat Festival for festivals, conferences and complex events.
With a wide range of popular features such as event scheduling, shift management, volunteer and crew coordination, artist booking and much more, Crescat is designed for customisation and ease-of-use.
Over 125,000 events have been planned in Crescat and with hundreds of customers of all shapes and sizes, from boutique event agencies through to international concert promoters, Crescat is rigged for success. What's more, we highly value feedback from our users and we are constantly improving our software with updates, new features and improvements.
If you plan events, run a venue or produce festivals and you're looking for ways to make your life easier, then we have a solution for you. Try our software for free or schedule a no-obligation demo with one of our product specialists today at crescat.io
Why Mobile App Regression Testing is Critical for Sustained Success: A Detail... (kalichargn70th171)
A dynamic process unfolds in the intricate realm of software development, dedicated to crafting and sustaining products that effortlessly address user needs. Amidst vital stages like market analysis and requirement assessments, the heart of software development lies in the meticulous creation and upkeep of source code. Code alterations are inherent, challenging code quality, particularly under stringent deadlines.
Do you want Software for your Business? Visit Deuglo
Deuglo has top Software Developers in India. They are experts in software development and help design and create custom Software solutions.
Deuglo follows a seven-step method for delivering its services to customers, called the software development life cycle (SDLC) process.
Requirement — Collecting the requirements is the first phase in the SDLC process.
Feasibility Study — after completing the requirement process they move to the design phase.
Design — in this phase, they start designing the software.
Coding — when designing is completed, the developers start coding for the software.
Testing — in this phase when the coding of the software is done the testing team will start testing.
Installation — after completion of testing, the application opens to the live server and launches!
Maintenance — after completing the software development, customers start using the software.
Transform Your Communication with Cloud-Based IVR SolutionsTheSMSPoint
Discover the power of Cloud-Based IVR Solutions to streamline communication processes. Embrace scalability and cost-efficiency while enhancing customer experiences with features like automated call routing and voice recognition. Accessible from anywhere, these solutions integrate seamlessly with existing systems, providing real-time analytics for continuous improvement. Revolutionize your communication strategy today with Cloud-Based IVR Solutions. Learn more at: https://thesmspoint.com/channel/cloud-telephony
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris (Neo4j)
Dr. Jesús Barrasa, Head of Solutions Architecture for EMEA, Neo4j
Discover the latest innovations from Neo4j, including the latest cloud integrations and product improvements that make Neo4j an essential choice for developers building applications with interconnected data and generative AI.
Flutter is a popular open source, cross-platform framework developed by Google. In this webinar we'll explore Flutter and its architecture, delve into the Flutter Embedder and Flutter’s Dart language, discover how to leverage Flutter for embedded device development, learn about Automotive Grade Linux (AGL) and its consortium and understand the rationale behind AGL's choice of Flutter for next-gen IVI systems. Don’t miss this opportunity to discover whether Flutter is right for your project.
GraphSummit Paris - The art of the possible with Graph Technology (Neo4j)
Sudhir Hasbe, Chief Product Officer, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
OpenMetadata Community Meeting - 5th June 2024 (OpenMetadata)
The OpenMetadata Community Meeting was held on June 5th, 2024. In this meeting, we discussed the data quality capabilities that are integrated with the Incident Manager, providing a complete solution to handle your data observability needs. Watch the end-to-end demo of the data quality features.
* How to run your own data quality framework
* What is the performance impact of running data quality frameworks
* How to run the test cases in your own ETL pipelines
* How the Incident Manager is integrated
* Get notified with alerts when test cases fail
Watch the meeting recording here - https://www.youtube.com/watch?v=UbNOje0kf6E
Takashi Kobayashi and Hironori Washizaki, "SWEBOK Guide and Future of SE Education," First International Symposium on the Future of Software Engineering (FUSE), June 3-6, 2024, Okinawa, Japan
UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions (Peter Muessig)
The UI5 tooling is the development and build tooling of UI5. It is built in a modular and extensible way so that it can be easily extended by your needs. This session will showcase various tooling extensions which can boost your development experience by far so that you can really work offline, transpile your code in your project to use even newer versions of EcmaScript (than 2022 which is supported right now by the UI5 tooling), consume any npm package of your choice in your project, using different kind of proxies, and even stitching UI5 projects during development together to mimic your target environment.
Odoo ERP software
Odoo ERP software, a leading open-source software for Enterprise Resource Planning (ERP) and business management, has recently launched its latest version, Odoo 17 Community Edition. This update introduces a range of new features and enhancements designed to streamline business operations and support growth.
The Odoo Community serves as a cost-free edition within the Odoo suite of ERP systems. Tailored to accommodate the standard needs of business operations, it provides a robust platform suitable for organisations of different sizes and business sectors. Within the Odoo Community Edition, users can access a variety of essential features and services essential for managing day-to-day tasks efficiently.
This blog presents a detailed overview of the features available within the Odoo 17 Community edition, and the differences between Odoo 17 community and enterprise editions, aiming to equip you with the necessary information to make an informed decision about its suitability for your business.
4. “Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.” — Gartner
Mateusz Dymczyk, Prague, 23rd October 2015
The 3Vs. Estimated data processed per day, circa 2014: NSA and Baidu 10-100 PB, eBay 100 PB, Google 100 PB.
11. Machine learning
“The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience.” — Mitchell, Tom M., “Machine Learning”
ML is extremely broad and involves several domains:
• computer science
• probability and statistics
• optimisation
• linear algebra
12. Basic terminology
• Observation - object which is used for learning or evaluation (eg. a house)
• Features - representation of the observation (eg. square meters, number of rooms, location)
• Labels - a value assigned to an observation (not always used)
• System - set of related objects forming a complex whole (eg. set of observations)
• Model (math) - description of a system using mathematical concepts/language
• Data (see the sketch after this list):
• training gets us our candidate parameters =>
• validation (optional) gets us the optimal parameter set =>
• test checks how good the model is
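A minimal sketch (not from the slides) of how this terminology maps onto MLlib types, assuming the Spark 1.x RDD-based API and a SparkContext sc as created on the later "What is Spark?" slide: each observation becomes a LabeledPoint (label + feature vector), and randomSplit produces the training/validation/test sets. The concrete numbers are purely illustrative.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// One observation: a house described by its features, with its price as the label.
val house = LabeledPoint(
  label = 350000.0,                           // e.g. the sale price
  features = Vectors.dense(120.0, 4.0, 2.0)   // square meters, rooms, floor (illustrative)
)

// A system of observations, split into training / validation / test data.
val data = sc.parallelize(Seq(house /* ... more observations ... */))
val Array(training, validation, test) = data.randomSplit(Array(0.6, 0.2, 0.2), seed = 42L)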
13. Examples of ML task types:
• eg. regression, when you want to predict a real number
• eg. clustering, when you want to cluster or have too much data
• eg. classification, when you want to assign to a category
• eg. association analysis, when you want to find relations between data
16. So what’s the problem?
• Lack of distributed/scalable solutions
• Not enough data and/or computing power
• False conviction that we:
• Need to read hard research papers
• Use “weird” programming languages
19. Still not good enough…
• Not designed for big data
• Didn’t fit machine learning computation models
20. ML, JVM and a (iterative) distribution?
21. New (distributed) kids on the block
• MLlib (+Spark)
• TridentML (+Storm)
• Apache FlinkML (+Flink)
• Mahout Samsara
• Mahout R-like DSL
• Mahout on Spark
• H2O
• back-end agnostic (but with native APIs)
• open-source machine learning platform
22. What is Spark?
• Distributed, fast, in-memory computational framework
• Based on RDDs (Resilient Distributed Dataset: abstract, immutable, distributed, easily rebuilt data format)
• Support for Scala, Java, Python and R
• Focuses on well known methods (map(), flatMap(), filter(), reduce() …)
23. What is Spark?
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Classic word count: read a file, split each line into words, count occurrences per word.
val conf = new SparkConf().setAppName("Spark App")
val sc = new SparkContext(conf)
val textFile: RDD[String] = sc.textFile("hdfs://...")
val counts: RDD[(String, Int)] = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
println(s"Found ${counts.count()}")
counts.saveAsTextFile("hdfs://...")
24. What is Spark?
The Spark stack: SparkSQL, Spark Streaming, MLlib and GraphX sit on top of Apache Spark (core), which runs on Mesos/Yarn/Standalone (cluster management).
25. Mateusz Dymczyk Prague, 23rd October 2015
What is MLlib?
• Machine learning library for Spark (scalable by definition)
• Since September 2013, initially created at AMPLab (UC Berkeley)
• Contains common, well-established machine learning algorithms and utilities
26. Mateusz Dymczyk Prague, 23rd October 2015
Is it for me?
PROS
• extensive community, part of Spark (Databricks support)
• Java, Scala, Python, R APIs
• solid implementation of the most popular algorithms
• easy to use, well documented, multitude of examples
• fast and robust
CONS
• only Spark
• very young, still missing algorithms
• still pretty “low level”
27. Mateusz Dymczyk Prague, 23rd October 2015
Any problems left?
• Young projects that still require a lot of work
• Plenty of ML algorithms do not distribute well by definition
• Simply throwing more machines at the problem won’t always help (e.g. too much data movement, too many operations)
28. Mateusz Dymczyk Prague, 23rd October 2015
What can we do?
1. Go to Spark’s JIRA
2. Add a ticket to MLlib
3. Relax
29. Mateusz Dymczyk Prague, 23rd October 2015
Go smart(er)
• Compromise:
  • Approximate
  • Lambda architecture (data flows in, is processed by batch and speed layers, and lands in a serving layer)
• Compose algorithms:
  • e.g. clustering + an actual similarity check
• Use different algorithms:
  • for instance, use an iterative solution instead of a closed-form one
• Come up with new algorithms :-)
31. Mateusz Dymczyk Prague, 23rd October 2015
What we’ll see
•End to end example: similarity search
•Built-in algorithm/util examples:
•Clustering
•Recommender systems (collaborative filtering)
•Logistic regression
•Model evaluation
32. Mateusz Dymczyk Prague, 23rd October 2015
Similarity search
• Problem: given an object (document, image), find all objects similar to it within a given set.
• Solution: similarity is a well-researched topic in mathematics!
• Why:
  • Find the most popular objects.
  • Aggregate similar objects to declutter a view.
  • Find the k most similar objects.
34. Mateusz Dymczyk Prague, 23rd October 2015
Similarity search - pipeline
[Diagram: Input data → Data preprocessing (e.g. tokenization, text normalization) → Vectorization → Similarity check (similarity algorithm) → Result.
Example: “This’s a Short test” → [“short”, “test”], “This’s a not so long Test” → [“long”, “test”] → vectors [1,1,0], [1,0,1] …]
35. Mateusz Dymczyk Prague, 23rd October 2015
Similarity search - distributed pipeline
[Diagram: the same pipeline on a cluster — Input data is split across nodes for Data preprocessing, then Vectorization, then the Similarity check, and the per-node results are combined into the final Result.]
36. Mateusz Dymczyk Prague, 23rd October 2015
Similarity search I
• Brute-force solution:
  • pre-process the text
  • vectorize (in our case TF-IDF)
  • compute all possible pairs
  • compute the cosine similarity between each pair (a sketch of the cosine helper follows below)
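The cosine helper used by the code on the later slides is not shown in the deck; a minimal sketch for MLlib vectors, assuming non-zero norms, could look like this:
import org.apache.spark.mllib.linalg.Vector

// Hedged sketch: cosine similarity between two MLlib vectors (assumes non-zero norms).
def cosine(v1: Vector, v2: Vector): Double = {
  val a = v1.toArray
  val b = v2.toArray
  val dot   = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  dot / (normA * normB)
}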
37. Mateusz Dymczyk Prague, 23rd October 2015
Vectorization: TF-IDF
• Term Frequency-Inverse Document Frequency (formula below):
  • how important a word is for a document within a collection
  • higher when the word occurs often in the document
  • lower when the word is also common across the whole collection
Example: “This’s a Short test” → [“short”, “test”], “This’s a not so long Test” → [“long”, “test”] → [1/6, 1/3, 0], [1/6, 0, 1/3] …
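For reference (not spelled out on the slide), the usual formulation multiplies a term’s frequency in the document by its inverse document frequency; MLlib’s IDF additionally applies add-one smoothing:
tfidf(t, d) = tf(t, d) * idf(t)
idf(t) = log((N + 1) / (df(t) + 1))   // N = number of documents, df(t) = documents containing t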
38. Mateusz Dymczyk Prague, 23rd October 2015
TF-IDF
// TF-IDF with MLlib (imports added for completeness).
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector

val documents: RDD[Seq[String]] = sc.textFile("...")
  .map(_.split(" ").toSeq)
val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)
tf.cache()                      // IDF makes two passes over the term frequencies
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)
39. Mateusz Dymczyk Prague, 23rd October 2015
Similarity search I
// Brute-force all-pairs similarity. Normalizer, TfIdf and cosine are the speaker's
// helper utilities (normalization, TF-IDF vectorization and cosine similarity).
type DocSimilarity = (String, Seq[(String, Double)])
case class Document(id: Long, doc: String)

def similarity(docs: RDD[Document]): RDD[DocSimilarity] = {
  val normalized: RDD[(Document, Set[String])] = docs.map(Normalizer.normalize(_))
  val vectorized: RDD[(Document, Vector)] = TfIdf.vectorize(normalized).cache()

  vectorized
    .cartesian(vectorized)                                   // every possible pair
    .filter { case (doc1, doc2) => doc1._1.id < doc2._1.id } // keep each pair only once
    .flatMap { case ((d1, v1), (d2, v2)) =>
      val similarity: Double = cosine(v1, v2)
      Seq(
        (d1.doc, (d2.doc, similarity)),
        (d2.doc, (d1.doc, similarity))
      )
    }
    .combineByKey[Seq[(String, Double)]](                    // group similarities per document
      (x: (String, Double)) => Seq(x),
      (acc: Seq[(String, Double)], y: (String, Double)) => acc :+ y,
      (acc1: Seq[(String, Double)], acc2: Seq[(String, Double)]) => acc1 ++ acc2
    )
}
40. Mateusz Dymczyk Prague, 23rd October 2015
Similarity search I - problems
• Computing all-pairs similarity:
  • O(n^2) comparisons
  • 10^6 documents => ~5*10^11 comparisons
  • ~6 days at 10^3 comparisons/ms
• Data shuffle size: O(nL^2)
• Largest reduce key: O(n)
n — number of docs, L — number of unique words in a doc
41. Mateusz Dymczyk Prague, 23rd October 2015
Why is data shuffle so bad?
[Chart: indicative hardware bandwidths (≈50 GB/s for memory, 100-600 MB/s for SSD, ≈100 MB/s for spinning disk, ≈0.3-1 GB/s for the network) — moving data off a node is orders of magnitude slower than processing it in memory.]
42. Mateusz Dymczyk Prague, 23rd October 2015
Similarity search II
[Diagram: the distributed pipeline with one extra step — after Vectorization the data is grouped by feature(s), and the Similarity check then runs only within each group rather than across the whole collection.]
43. Mateusz Dymczyk Prague, 23rd October 2015
Similarity search II
• Problems:
  • What if there are no features to group by?
  • What if grouping produces clusters that are too big?
• Solution: cluster anyway, but be smart about it!
44. Mateusz Dymczyk Prague, 23rd October 2015
Locality sensitive hashing
• Similar objects land in the same bucket (maximizes the % of collisions)
• A group of algorithms (one per similarity measure):
  • random projection for cosine (sketched below)
  • min-hash for Jaccard
  • …
• Problems:
  • possibility of false positives and false negatives
  • double-check the former, minimize the latter
  • might produce duplicate pairs!
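A minimal, purely illustrative sketch of the random-projection idea (this is not the API of the libraries linked on the following slide): each random hyperplane contributes one sign bit, so vectors pointing in similar directions tend to share a signature and land in the same bucket.
import scala.util.Random
import org.apache.spark.mllib.linalg.Vector

// Illustrative only: bucket signatures for random-projection (cosine) LSH.
def randomHyperplanes(dim: Int, numBits: Int, seed: Long): Array[Array[Double]] = {
  val rnd = new Random(seed)
  Array.fill(numBits)(Array.fill(dim)(rnd.nextGaussian()))
}

def signature(v: Vector, planes: Array[Array[Double]]): String = {
  val arr = v.toArray
  planes.map { plane =>
    val dot = arr.zip(plane).map { case (x, y) => x * y }.sum
    if (dot >= 0) '1' else '0'
  }.mkString                     // documents are then grouped by this signature,
}                                // and exact cosine is computed only inside each bucket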
45. Mateusz Dymczyk Prague, 23rd October 2015
Similarity search III
// LSH-based similarity; the LSH class comes from the third-party spark-hash library linked below.
type DocSimilarity = (String, Seq[(String, Double)])
case class Document(id: Long, doc: String)

def similarity(docs: RDD[Document]): RDD[DocSimilarity] = {
  val normalized: RDD[(Document, Set[String])] = docs.map(Normalizer.normalize(_))
  val vectorized: RDD[(Document, Vector)] = TfIdf.extract(normalized).cache()
  val lsh = new LSH(data = vectorized, p = 65537, m = 1000, numRows = 1000, numBands = 25, minClusterSize = 2)
  val model = lsh.run
  val clusters: RDD[(Long, Iterable[SparseVector])] = model.clusters
  clusters.map { case (id, cluster) => cosines(cluster) }   // exact cosine only inside each bucket
}
• Sample implementations:
• https://github.com/mrsqueeze/spark-hash (min-hash)
• https://github.com/marufaytekin/lsh-spark (Charikar’s LSH for cosine)
46. Mateusz Dymczyk Prague, 23rd October 2015
Similarity search - results
INPUT
“パウダーファンデーションのパフがすぐに汚れてしまう。” (“Powder foundation’s puff gets dirty really fast”)
OUTPUT
0.80 “パウダーをつけるパフがすぐに汚れる。” (“The puff gets dirty really fast after applying the powder.”)
0.53 “パフがすぐに汚くなってしまう。” (“The puff gets dirty really fast.”)
0.30 “パウダリーファンデーションをつけるためのスポンジというかパフ、すぐに汚れて、ファンデをつける時にきれいに伸ばせなくなる。” (“The sponge for applying the powdery foundation gets dirty really fast, when using the foundation it doesn’t spread nicely.”)
48. Mateusz Dymczyk Prague, 23rd October 2015
Clustering
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("...")
// Each line holds space-separated doubles; KMeans expects an RDD[Vector]
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
val clusters = KMeans.train(parsedData, 2, 20)   // k = 2 clusters, 20 iterations
val prediction = clusters.predict(point)         // point: a Vector to assign to a cluster
• an unsupervised learning problem which tries to group subsets of objects with one another based on some notion of similarity
• supported algorithms: K-means, Gaussian mixture, Power iteration clustering (PIC), Latent Dirichlet allocation (LDA)
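Not on the slide, but a quick way to sanity-check the clustering (and to compare different values of k) is the within-set sum of squared errors exposed by KMeansModel:
// Within-Set Sum of Squared Errors: lower means tighter clusters; compare across values of k.
val wssse = clusters.computeCost(parsedData)
println(s"WSSSE = $wssse")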
49. Mateusz Dymczyk Prague, 23rd October 2015
Recommender systems
import org.apache.spark.mllib.recommendation.{ALS, Rating}

val data = sc.textFile("...")
val ratings = data.map(_.split(',') match {
  case Array(user, item, rate) =>
    Rating(user.toInt, item.toInt, rate.toDouble)
})
// ALS.train(ratings, rank, iterations, lambda)
val model = ALS.train(ratings, 1, 20, 0.01)
val usersProducts = ratings.map { case Rating(user, product, rate) =>
  (user, product)
}
val predictions = model.predict(usersProducts)
• Collaborative filtering
• User/product matrix predictions
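As a hedged follow-up (the usual MLlib example pattern, not shown on the slide), the predictions can be joined back to the known ratings to compute a mean squared error:
// predictions is an RDD[Rating]; key both sides by (user, product) and compare the rates.
val ratesAndPreds = ratings.map { case Rating(user, product, rate) => ((user, product), rate) }
  .join(predictions.map { case Rating(user, product, rate) => ((user, product), rate) })
val mse = ratesAndPreds.map { case (_, (actual, predicted)) => math.pow(actual - predicted, 2) }.mean()
println(s"Mean Squared Error = $mse")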
50. Mateusz Dymczyk Prague, 23rd October 2015
(Logistic) Regression
• an iterative algorithm - it greatly benefits from caching
• often used for binary classification (can be generalised to multiple classes)

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.util.MLUtils

// <label> <idx1>:<val1> <idx2>:<val2> ...
val data = MLUtils.loadLibSVMFile(sc, "...").cache()
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(10)
  .run(data)
model.predict(pointToPredict)   // pointToPredict: a feature Vector
52. Mateusz Dymczyk Prague, 23rd October 2015
Supervised learning workflow
[Diagram: Raw data → cleaned/scaled data → split into a training set and a validating set → model creation on the training set → validation → final model, which is then applied to incoming new data.]
53. Mateusz Dymczyk Prague, 23rd October 2015
Model evaluation
• Certain ML algorithms create models
• How do we know if the model we got is good (enough)?
• Different types of evaluation depending on the ML algorithm type:
  • classification: precision and recall (based on true/false positives/negatives)
  • regression: various error metrics based on the difference between predicted and actual values
54. Mateusz Dymczyk Prague, 23rd October 2015
Model evaluation
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "...")
val Array(training, test) = data.randomSplit(Array(0.6, 0.4), seed = 11L)
training.cache()
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(2)
  .run(training)
model.clearThreshold()   // output raw scores instead of 0/1 labels
val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
  val prediction = model.predict(features)
  (prediction, label)
}
val metrics = new BinaryClassificationMetrics(predictionAndLabels)
val precision = metrics.precisionByThreshold
precision.foreach { case (t, p) =>
  println(s"Threshold: $t, Precision: $p")
}
val recall = metrics.recallByThreshold
recall.foreach { case (t, r) =>
  println(s"Threshold: $t, Recall: $r")
}
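For a single-number summary (not shown on the slide), the same metrics object also exposes the areas under the ROC and precision-recall curves:
println(s"Area under ROC = ${metrics.areaUnderROC()}")
println(s"Area under PR = ${metrics.areaUnderPR()}")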
56. Mateusz Dymczyk Prague, 23rd October 2015
Common pitfalls
1. Try to avoid groupByKey()
   • instead try reduceByKey() (see the word-count sketch below)
2. Don’t collect all the data in the driver:
   • collect() will copy all the elements to the driver node
   • instead persist it (to a file or a DB)
3. Use cache()/persist() where necessary (check Spark’s Web UI)!
4. Code for failure and handle malformed input!
5. Remember that everything shipped to the executors must be Serializable!
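A hedged illustration of pitfall 1, reusing the word-count example from the earlier Spark slide: both variants below compute the same counts, but reduceByKey pre-aggregates on each partition before the shuffle, while groupByKey ships every single (word, 1) pair across the network.
// Assumes textFile: RDD[String] as in the earlier word-count example.
val pairs = textFile.flatMap(_.split(" ")).map(word => (word, 1))
val viaGroup = pairs.groupByKey().mapValues(_.sum)   // shuffles every value, then sums
val viaReduce = pairs.reduceByKey(_ + _)             // combines map-side, shuffles only partial sums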
57. Mateusz Dymczyk Prague, 23rd October 2015
Performance recap
1. Parallelising (not concurrency!) makes us faster
2. Network traffic makes us (really) slow
   1. keep data close to the processing units (stay local)
   2. take note of operation order
   3. don’t iterate more than necessary
3. In-memory computation/caching helps a lot (especially for iterative machine learning!)
58. Mateusz Dymczyk Prague, 23rd October 2015
Where to go from here
• Get ideas: https://www.kaggle.com/wiki/DataScienceUseCases
• Get started with Spark:
• http://spark.apache.org/docs/latest/quick-start.html
• https://www.edx.org/course/introduction-big-data-apache-spark-uc-berkeleyx-cs100-1x
• Get started with MLlib:
• http://spark.apache.org/docs/latest/mllib-guide.html
• https://www.edx.org/course/scalable-machine-learning-uc-berkeleyx-cs190-1x
• Try out other frameworks and courses:
• https://github.com/h2oai/sparkling-water
• https://www.coursera.org/course/mmds
• Learn the basics:
• https://www.coursera.org/learn/machine-learning
• Practical books:
• “Advanced Analytics with Spark” — Sandy Ryza et al. O’Reilly Media
• “Data Algorithms: Recipes for Scaling Up with Hadoop and Spark” — Mahmoud Parsian, O’Reilly Media
61. Mateusz Dymczyk Prague, 23rd October 2015
Can I has stream?
• Linear models (regression) can be trained in a streaming fashion (Spark 1.1+)
• Clustering can be done on streams (with k-means)
• what if the data changes over time? — MLlib supports “forgetfulness” (via a decay factor)
62. Mateusz Dymczyk Prague, 23rd October 2015
Can I has stream?
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val ssc = new StreamingContext(conf, Seconds(10))   // batch interval chosen for illustration
val trainingData = ssc.textFileStream("...").map(Vectors.parse)
val testData = ssc.textFileStream("...").map(LabeledPoint.parse)
val model = new StreamingKMeans()
  .setK(2)
  .setDecayFactor(1.0)        // 1.0 = no forgetting; < 1.0 discounts older data
  .setRandomCenters(3, 0.0)   // 3-dimensional data, zero initial weight
model.trainOn(trainingData)
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
ssc.start()
ssc.awaitTermination()
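The previous slide mentions that linear models can be trained on streams as well; a hedged sketch using MLlib’s StreamingLinearRegressionWithSGD (these calls would be registered before ssc.start()):
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

val regTrain = ssc.textFileStream("...").map(LabeledPoint.parse)
val regTest = ssc.textFileStream("...").map(LabeledPoint.parse)
val regModel = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(3))   // 3 features, purely illustrative
regModel.trainOn(regTrain)               // weights are updated with every mini-batch
regModel.predictOnValues(regTest.map(lp => (lp.label, lp.features))).print()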
64. Mateusz Dymczyk Prague, 23rd October 2015
Seldon.io
• open predictive platform
• provides content recommendation and predictive functionality
65. Mateusz Dymczyk Prague, 23rd October 2015
Prediction.io
• open source ML server for building predictive engines
• event collection, algorithms, evaluation and querying predictive results via REST
• uses Hadoop, HBase, Spark and Elasticsearch