Magellan is a geospatial analytics library for Apache Spark that lets users run spatial queries and analysis on large geospatial datasets in a scalable way. It provides custom data types and expressions to represent spatial objects such as points and polygons and to perform spatial operations on them. Magellan reads geospatial file formats like Shapefiles and GeoJSON, integrates with Spark SQL to enable spatial joins, and aims to simplify building geospatial applications at scale on Spark. The current version supports basic functionality; future versions will add more operators, optimizations, and support for additional formats and use cases.
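For orientation, a minimal usage sketch, assuming the Magellan 1.0.x Scala API shown in the slides later on this page; the file path is hypothetical:

// A minimal sketch, assuming Magellan 1.0.x on Spark 1.5 (Scala shell,
// where sqlContext is in scope). The path below is hypothetical.
import org.apache.spark.sql.magellan.dsl.expressions._
import sqlContext.implicits._

// Read a shapefile of city polygons as a DataFrame.
val cities = sqlContext.read.format("magellan").load("/data/cities/")

// "Which city contains this point?" expressed as a DataFrame filter.
val hit = cities.where(point(-122.412651, 37.777748) within $"polygon")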
The linked open data cloud is constantly evolving as datasets are continuously updated with newer versions. As a result, representing, querying, and visualizing the temporal dimension of linked data is crucial. This is especially important for geospatial datasets, which form the backbone of large-scale open data publication efforts in many sectors of the economy (e.g., the public sector and the Earth observation sector). Although there has been some work on representing and querying linked geospatial data that change over time, and although visualizing the temporal evolution of geospatial data is common practice in the GIS area, to the best of our knowledge there is currently no tool that handles linked geospatial data and visualizes both its spatial and temporal dimensions. In this demo paper, we present SexTant, a Web-based system for the visualization and exploration of time-evolving linked geospatial data and the creation, sharing, and collaborative editing of "temporally-enriched" thematic maps produced by combining different sources of such data.
Multi-Label Graph Analysis and Computations Using GraphX with Qiang Zhu and Q... – Databricks
In real-life applications, we often deal with situations where analysis needs to be conducted on graphs whose nodes and edges are associated with multiple labels. For example, in a graph that represents user activities in social networks, the labels associated with nodes may indicate membership in communities (e.g. group, school, company), and the labels associated with edges may denote types of activities (e.g. comment, like, share). The current GraphX library in Spark does not directly support efficient analysis and computation on such label-defined subgraphs.
In this session, the speakers propose a general API library that supports analysis on multi-label graphs and can be reused and extended to design more complicated algorithms. It includes methods to create multi-label graphs and to calculate basic statistics and metrics at both the global and the subgraph level. Common graph algorithms, such as PageRank, can also be implemented efficiently in a parallel scheme by reusing GraphX modules and algorithms, such as the Pregel API.
See how LinkedIn is able to leverage this tool to efficiently find top LinkedIn feed influencers in different communities and by different actions.
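GraphX already exposes a generic subgraph operator that such a multi-label API would build on; a minimal sketch of label-filtered PageRank using only stock GraphX (the tiny graph and its labels are made up for illustration):

import org.apache.spark.graphx._

// Vertices carry community labels, edges carry activity labels
// (hypothetical schema); sc is the application's SparkContext.
val vertices = sc.parallelize(Seq(
  (1L, Set("school")), (2L, Set("school", "company")), (3L, Set("company"))))
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, "share"), Edge(2L, 3L, "like"), Edge(3L, 1L, "share")))
val graph = Graph(vertices, edges)

// Restrict to the "school" community and "share" activity, then rank within it.
val sub = graph.subgraph(
  epred = t => t.attr == "share",
  vpred = (id, labels) => labels.contains("school"))
val ranks = sub.pageRank(0.001).vertices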
Large-Scale Geographically Weighted Regression on Spark – Viet-Trung TRAN
Geographically Weighted Regression (GWR) is a local version of spatial regression that captures spatial dependency in regression analysis. GWR has many practical applications as a visualization and prediction tool for spatial exploration (e.g., in climate, economics, and medicine). However, this local regression model becomes slow as the volume of calculations and the size of the spatial data grow. Improving the performance of GWR is therefore a critical issue, yet distributed implementations have not been studied. Recently, with the advent of Spark and the MapReduce framework, developing machine learning applications and parallel programs has become easier. In this article, we propose several large-scale implementations of distributed GWR that leverage the Spark framework. We implemented and evaluated these approaches on large datasets. To the best of our knowledge, this is the first work addressing GWR at large scale.
"Quantum clustering - physics inspired clustering algorithm", Sigalit Bechler...Dataconomy Media
"Quantum clustering - physics inspired clustering algorithm", Sigalit Bechler, Researcher, Similar Web
Watch more from Data Natives Tel Aviv 2016 here: http://bit.ly/2hw1MY0
Visit the conference website to learn more: http://telaviv.datanatives.io/
Follow Data Natives:
https://www.facebook.com/DataNatives
https://twitter.com/DataNativesConf
Stay Connected to Data Natives by Email: Subscribe to our newsletter to get the news first about Data Natives 2017: http://bit.ly/1WMJAqS
About the Author:
I am a data science researcher. I have a diverse academic background: a B.Sc. in electrical engineering and a B.Sc. in physics (cum laude) from Tel Aviv University's prestigious parallel B.Sc. program in Physics and Electrical Engineering, an M.Sc. in condensed matter (cum laude), and I have started my Ph.D. in bioinformatics. Before my M.Sc., I served as a captain in a technology unit of the IDF.
I am passionate about science and solving complex big data problems that require out of the box thinking, and like to dive deep into the details. I always take a positive, proactive approach, and put an emphasis on understanding the big picture as well.
Max-kernel search: How to search for just about anything?
Nearest neighbor search is a well studied and widely used task in computer science and is quite pervasive in everyday applications. While search is not synonymous with learning, search is a crucial tool for the most nonparametric form of learning. Nearest neighbor search can directly be used for all kinds of learning tasks — classification, regression, density estimation, outlier detection. Search is also the computational bottleneck in various other learning tasks such as clustering and dimensionality reduction. Key to nearest neighbor search is the notion of “near”-ness or similarity. Mercer kernels form a class of general nonlinear similarity functions and are widely used in machine learning. They can define a notion of similarity between pairs of objects of any arbitrary type and have been successfully applied to a wide variety of object types — fixed-length data, images, text, time series, graphs. I will present a technique to do nearest neighbor search with this class of similarity functions provably efficiently, hence facilitating faster learning for larger data.
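The talk is about doing this search provably faster than brute force; as a baseline for what "max-kernel search" computes, a naive O(n) sketch in Scala with an RBF kernel (one common Mercer kernel):

// Naive max-kernel search baseline: scan all references, keep the argmax.
type Vec = Array[Double]

def rbf(x: Vec, y: Vec, gamma: Double = 1.0): Double = {
  val d2 = x.zip(y).map { case (a, b) => (a - b) * (a - b) }.sum
  math.exp(-gamma * d2)  // Gaussian/RBF kernel, a standard Mercer kernel
}

def maxKernel(query: Vec, refs: Seq[Vec]): Vec =
  refs.maxBy(r => rbf(query, r))  // O(n) scan; the talk's method beats this

val refs = Seq(Array(0.0, 0.0), Array(1.0, 1.0), Array(0.2, 0.1))
val best = maxKernel(Array(0.15, 0.1), refs)  // returns Array(0.2, 0.1)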
Congressional PageRank: Graph Analytics of US Congress With Neo4j – William Lyon
Interactions among members of any large organization are naturally a graph, yet the tools we use to analyze data about these organizations often ignore the graphiness of the domain and instead map the data into structures (such as relational databases) that make taking advantage of the relationships in the data much more difficult when it comes time for analysis. Collaboration networks are a perfect example. This talk will focus on analyzing one of the most powerful collaboration networks in the world, the US Congress. We will show how to model US Congressional data (legislators, bills, committees and the interactions among them) as a graph, how to import the data into the Neo4j graph database and how to write ad-hoc queries to answer simple questions such as “What are the topics of bills referred to committees on which California House Representatives serve?”. We will then see how we can combine a graph processing engine (Apache Spark) with Neo4j to run graph algorithms like PageRank on our data stored in Neo4j. This will allow us to identify influential legislators in the network and the topics over which they exert influence. This talk will touch on topics related to graph data modeling, graph databases, graph processing, and social network analysis that can be applied to many different domains.
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer... – MLconf
Spark and GraphX in the Netflix Recommender System: We at Netflix strive to deliver maximum enjoyment and entertainment to our millions of members across the world. We do so by having great content and by constantly innovating on our product. A key strategy to optimize both is to follow a data-driven method. Data allows us to find optimal approaches to applications such as content buying or our renowned personalization algorithms. But, in order to learn from this data, we need to be smart about the algorithms we use, how we apply them, and how we can scale them to our volume of data (over 50 million members and 5 billion hours streamed over three months). In this talk we describe how Spark and GraphX can be leveraged to address some of our scale challenges. In particular, we share insights and lessons learned on how to run large probabilistic clustering and graph diffusion algorithms on top of GraphX, making it possible to apply them at Netflix scale.
This contains the agenda of the Spark Meetup I organised in Bangalore on Friday, the 23rd of Jan 2014, along with the slides for the talk I gave on distributed deep learning over Spark.
Online learning with structured streaming, Spark Summit Brussels 2016 – Ram Sriharsha
Structured Streaming is a new API in Spark 2.0 that simplifies the end-to-end development of continuous applications. One such continuous application is online model updates: online models are incrementally updated with new data and can be continuously queried while being updated. As a result, they can be fast to train and leverage new data faster than offline algorithms. In this talk, we give a brief introduction to the area of online learning and describe how online model updates can be built using structured streaming APIs. The end result is a robust pipeline for updating models that is scalable, fast and fault-tolerant.
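The talk's exact APIs aren't reproduced here; as a hedged sketch of the pattern only, an incrementally updated statistic standing in for a model, using foreachBatch (which arrived in Spark releases after the 2.0 API the talk describes):

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{avg, count, lit}

// Hedged sketch: foreachBatch is from Spark 2.4+, not the 2.0 API above.
val spark = SparkSession.builder.appName("online-mean").getOrCreate()
import spark.implicits._

// Toy "online model": a running mean, updated per micro-batch and
// queryable on the driver at any time while the stream runs.
var seen = 0L
var runningMean = 0.0

val stream = spark.readStream.format("rate").load()  // built-in test source

val query = stream.writeStream.foreachBatch { (batch: DataFrame, batchId: Long) =>
  val row = batch.agg(count(lit(1)), avg($"value")).head()
  val n = row.getLong(0)
  if (n > 0) {
    runningMean = (runningMean * seen + row.getDouble(1) * n) / (seen + n)
    seen += n
  }
}.start()
// query.awaitTermination() would block here in a real application.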
Distributed End-to-End Drug Similarity Analytics and Visualization Workflow w... – Databricks
The majority of a data scientist’s time is spent cleaning and organizing data before insights can be derived. Frequently, with large datasets, a lack of integration with visualization tools makes it hard to know what’s most interesting in the data and also creates challenges for validating numerical insights from models. Given the vast number of tools available in the ecosystem, it is hard to experiment with different tools to pick the most suitable one, especially given the complexity involved in integrating them with one’s solution.
The speakers will present an easy-to-use workflow that solves this integration challenge by combining various open source libraries, databases (e.g. Hive, Postgres, MySQL, HBase) and visualization with distributed analytics. Intel developed a highly scalable library built on Apache Spark with novel graph, statistical and machine learning algorithms that also enhances the user experience of Apache Spark via easier-to-use APIs.
This session will showcase how to address the above-mentioned issues for a drug similarity use case. We'll go from ETL operations on raw drug data, to deriving relevant features from each drug's chemical structure using statistical and graph algorithms, to techniques for identifying the best model and parameters for this data, and finally demonstrate the ease of connectivity to different databases and visualization tools.
Meetup MLDD: Machine Learning Dresden, 8th May 2018
Signals from outer space
How NASA Benefits from Graph-Powered NLP
Vlasta Kus talked about the advantages of graph-based natural language processing (NLP) using a public NASA dataset as an example. From his abstract: "[...] we are building a platform (from large part open-source) that integrates Neo4j and NLP (such as Named Entity Recognition, sentiment analysis, word embeddings, LDA topic extraction), and we test and develop further related features and tools, lately, for example, integrating Neo4j and Tensorflow for employing deep learning techniques (such as deep auto-encoders for automatic text summarisation)."
Vlasta holds a Ph.D. in Physics from the Charles University in Prague and has worked for SecureOps, as a freelance Data Scientist, and since 2017 as a Data Scientist at GraphAware (https://graphaware.com/), a London-based company that builds solutions around Neo4j.
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ... – Jose Quesada (hiring)
The machine learning libraries in Apache Spark are an impressive piece of software engineering, and are maturing rapidly. What advantages does Spark.ml offer over scikit-learn? At Data Science Retreat we've taken a real-world dataset and worked through the stages of building a predictive model -- exploration, data cleaning, feature engineering, and model fitting. Which would you use in production?
The machine learning libraries in Apache Spark are an impressive piece of software engineering, and are maturing rapidly. What advantages does Spark.ml offer over scikit-learn?
At Data Science Retreat we've taken a real-world dataset and worked through the stages of building a predictive model -- exploration, data cleaning, feature engineering, and model fitting -- in several different frameworks. We'll show what it's like to work with native Spark.ml, and compare it to scikit-learn along several dimensions: ease of use, productivity, feature set, and performance.
In some ways Spark.ml is still rather immature, but it also conveys new superpowers to those who know how to use it.
Global Empire-Building for Fun and Profit: Spark Summit East talk by Michelle... – Spark Summit
In order to establish a user base across the globe, a product needs to support a variety of locales. The challenge with supporting multiple locales is the maintenance and generation of localized strings, which are deeply integrated into many facets of a product. To address these challenges at Qordoba, we’re using highly scalable technologies and machine learning to automate the process. Specifically, we need to generate high-quality translations in many different languages and make them available in real-time across platforms, e.g. mobile, print, and web.
In this talk, we describe the techniques we’re using to provide:
* Continuous deployment of localized strings
* Live syncing across platforms (mobile, web, photoshop, sketch, help desk, etc.)
* Content generation for any locale
* Emotional response
We will also share our architecture for handling billions of localized strings in many different languages. We talk about our use of:
* Scala and Akka as an orchestration layer
* Apache Cassandra and MariaDB as a storage layer
* Apache Spark for natural language processing
* Apache Kafka as a message bus for reporting, billing, & notifications
* Docker, Marathon, & Apache Mesos for containerized deployment
We present our solution in the context of a platform that makes it feasible to build products that feel native to every user, regardless of language.
Tech-Talk at Bay Area Spark Meetup
Apache Spark™ has rapidly become a key tool for data scientists to explore, understand and transform massive datasets and to build and train advanced machine learning models. The question then becomes: how do I deploy these models to a production environment? How do I embed what I have learned into customer-facing data applications? Like all things in engineering, it depends.
In this meetup, we will discuss best practices from Databricks on how our customers productionize machine learning models and do a deep dive with actual customer case studies and live demos of a few example architectures and code in Python and Scala. We will also briefly touch on what is coming in Apache Spark 2.X with model serialization and scoring options.
Realtime Analytical Query Processing and Predictive Model Building on High Di... – Spark Summit
Spark SQL and MLlib are optimized for running feature extraction and machine learning algorithms on row-based columnar datasets through full scans, but they do not provide constructs for column indexing or time series analysis. For document datasets with timestamps, where features are represented as a variable number of columns per document and use cases demand searching over columns and time to retrieve documents and generate learning models in real time, a close integration between Spark and Lucene was needed. We introduced LuceneDAO at Spark Summit Europe 2016 to build distributed Lucene shards from a DataFrame, but time series attributes were not part of the data model. In this talk we present our extension to LuceneDAO that maintains timestamps in the document-term view for search and allows time filters. Lucene shards maintain the time-aware document-term view for search and a vector space representation for machine learning pipelines. We use Spark as our distributed query processing engine, where each query is represented as a boolean combination over terms with filters on time. LuceneDAO loads the shards onto Spark executors and powers sub-second distributed document retrieval for these queries.
Our synchronous API uses Spark-as-a-Service to power analytical queries, while our asynchronous API uses Kafka, Spark Streaming and HBase to power time series prediction algorithms. In this talk we will demonstrate LuceneDAO write and read performance on millions of documents with 1M+ terms and configurable timestamp aggregate columns, as well as the latency of the APIs on a suite of queries generated from terms. Key takeaways from the talk will be a thorough understanding of how to make Lucene-powered, time-aware search a first-class citizen in Spark for building interactive analytical query processing and time series prediction algorithms.
Accelerating Machine Learning and Deep Learning At Scale... With Apache Spark... – Spark Summit
Deep learning is a fast-growing subset of machine learning. There is an emerging trend to conduct deep learning in the same cluster as existing data processing pipelines, to support feature engineering and traditional machine learning. As the leading framework for distributed ML, we believe the addition of deep learning to the super-popular Spark framework is important, because it allows Spark developers to perform a range of data analysis tasks within a single framework and avoid the complexity inherent in using multiple frameworks and libraries. As one of the early and top contributors to Apache Spark, Intel is thrilled to share with the community a major open source contribution to Spark: BigDL, a distributed deep learning framework organically built on the Big Data (Apache Spark) platform. It combines the benefits of high-performance computing and Big Data architectures for rich deep learning support. With BigDL on Spark, customers can eliminate large volumes of unnecessary dataset transfer between separate systems, eliminate separate hardware clusters in favor of a single CPU cluster, and reduce both system complexity and end-to-end learning latency. Ultimately, customers can achieve better scale, higher resource utilization, ease of use and development, and better TCO. Feature parity with Caffe and Torch, a significant performance boost when combined with Intel's Math Kernel Library (MKL), scale-out, fault tolerance, elasticity and dynamic resource sharing are some of BigDL's prominent features.
The BigDL open source project will be launched at the 2017 Spark Summit East, and this keynote will spotlight the new contribution, highlight its benefits to the Spark developer community, and encourage wide contribution and collaboration. We will also showcase some real-world applications of BigDL from early adopters.
Building a Real-Time Fraud Prevention Engine Using Open Source (Big Data) Sof... – Spark Summit
Fraudsters attempt to pay for goods, flights, hotels – you name it – using stolen credit cards. This hurts both the trust of card holders and the business of vendors around the world. We built a Real-Time Fraud Prevention Engine using Open Source (Big Data) Software: Spark, Spark ML, H2O, Hive, Esper. In my talk I will highlight both the business and the technical challenges that we’ve faced and dealt with.
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta... – Spark Summit
Netflix is the world's largest streaming service, with 80 million members in over 190 countries. Netflix uses machine learning to inform nearly every aspect of the product, from the recommendations you get, to the box art you see, to the decisions made about which TV shows and movies are created.
Given this scale, we use Apache Spark as the engine of our recommendation pipeline. Apache Spark enables Netflix to use a single, unified framework/API for ETL, feature generation, model training, and validation. With the Pipeline framework in Spark ML, each step within the Netflix recommendation pipeline (e.g. label generation, feature encoding, model training, model evaluation) is encapsulated as Transformers, Estimators and Evaluators, enabling modularity, composability and testability. Thus, Netflix engineers can build their own feature engineering logic as Transformers, learning algorithms as Estimators, and customized metrics as Evaluators, and with these building blocks we can more easily experiment with new pipelines and rapidly deploy them to production.
In this talk, we will discuss how Apache Spark is used as the distributed framework on top of which we build our own algorithms to generate personalized recommendations for each of our 80+ million subscribers, the specific techniques we use at Netflix to scale, and the various pitfalls we've found along the way.
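Netflix's custom stages aren't public, but the modularity described above is the standard spark.ml Pipeline pattern; a minimal generic sketch with stock stages (the DataFrames and column names are hypothetical):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Each stage is a Transformer or Estimator; the Pipeline composes them.
// At Netflix, label generation, feature encoding, etc. would be custom stages.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(trainingDF)  // trainingDF: hypothetical DataFrame with text/label columns
val scored = model.transform(testDF)  // testDF: hypothetical held-out DataFrame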
Hail: SCALING GENETIC DATA ANALYSIS WITH APACHE SPARK: Keynote by Cotton Seed – Spark Summit
In 2001, it cost ~$100M to sequence a single human genome. In 2014, due to dramatic improvements in sequencing technology far outpacing Moore’s law, we entered the era of the $1,000 genome. At the same time, the power of genetics to impact medicine has become evident: for example, drugs with supporting genetic evidence have twice the clinical trial success rate. These factors have led to an explosion in the volume of genetic data, in the face of which existing analysis tools are breaking down.
Therefore, we began the open-source Hail project (https://hail.is), a scalable platform built on Apache Spark, to enable the worldwide genetics community to build, share, and apply new tools. Hail is focused on variant-level (post-read) data; querying genetic data, annotations and sample data; and performing rare and common variant association analyses. Hail has already been used to analyze datasets with hundreds of thousands of exomes and tens of thousands of whole genomes.
We will give an overview of the goals of the Hail project and its architecture. The challenge of efficiently manipulating genetic data in Spark has led to several innovations that may have wider applicability, including an RDD-like abstraction for representing multidimensional data and an OrderedRDD abstraction for ordered data (for example, data indexed by position in the genome). Finally, we will discuss Hail performance and future directions.
In this video from the ISC Big Data'14 Conference, Ted Willke from Intel presents: The Analytics Frontier of the Hadoop Eco-System.
"The Hadoop MapReduce framework grew out of an effort to make it easy to express and parallelize simple computations that were routinely performed at Google. It wasn’t long before libraries, like Apache Mahout, were developed to enable matrix factorization, clustering, regression, and other more complex analyses on Hadoop. Now, many of these libraries and their workloads are migrating to Apache Spark because it supports a wider class of applications than MapReduce and is more appropriate for iterative algorithms, interactive processing, and streaming applications. What’s next beyond Spark? Where is big data analytics processing headed? How will data scientists program these systems? In this talk, we will explore the current analytics frontier, the popular debates, and discuss some potentially clever additions. We will also share the emergent data science applications and collaborative university research that inform our thinking."
Learn more:
http://www.isc-events.com/bigdata14/schedule.html
and
http://www.intel.com/content/www/us/en/software/intel-graph-solutions.html
Watch the video presentation: https://www.youtube.com/watch?v=qlfx495Ekw0
Adding Location and Geospatial Analytics to Big Data Analytics (BDT210) | AWS... – Amazon Web Services
(Presented by Esri)
When people analyze a problem, they often include location at the core of the analysis. Location and spatial context, combined with geographical knowledge, can make the biggest difference in understanding a problem and analyzing it in a more meaningful way.
In this session, we show how Amazon EMR can be used with location and geospatial analytics, and how the Amazon EMR API and the Python SDK were used to build tools that integrate Big Data and geospatial analysis. We also show powerful visualization options for displaying your results, using maps which can be shared in reports or distributed online and to mobile apps.
Large Scale Geospatial Indexing and Analysis on Apache Spark – Databricks
SafeGraph is a data company — just a data company — that aims to be the source of truth for data on physical places. We are focused on creating high-precision geospatial datasets specifically about places where people spend time and money. We have business listings, building footprint data, and foot traffic insights for over 7 million places across multiple countries and regions.
In this talk, we will inspect the challenges with geospatial processing, running at a large scale. We will look at open-source frameworks like Apache Sedona (incubating) and its key improvements over conventional technology, including spatial indexing and partitioning. We will explore spatial data structure, data format, and open-source indexing like H3. We will illustrate how all of these fit together in a cloud-first architecture running on Databricks, Delta, MLFlow, and AWS. We will explore examples of geospatial analysis with complex geometries and practical use cases of spatial queries. Lastly, we will discuss how this is augmented by Machine Learning modeling, Human-in-the-loop (HITL) annotation, and quality validation.
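As a hedged sketch of the Sedona usage pattern the talk covers (assuming Apache Sedona 1.x; the view names and WKT columns are made up):

import org.apache.sedona.sql.utils.SedonaSQLRegistrator

// Register Sedona's ST_ functions on an existing SparkSession (Sedona 1.x).
SedonaSQLRegistrator.registerAll(spark)

// Hypothetical temp views: places(name, wkt) and fences(name, wkt).
val contained = spark.sql("""
  SELECT p.name, f.name AS fence
  FROM places p JOIN fences f
    ON ST_Contains(ST_GeomFromWKT(f.wkt), ST_GeomFromWKT(p.wkt))
""")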
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ... – Big Data Spain
http://www.bigdataspain.org/2014/conference/state-of-play-data-science-on-hadoop-in-2015-keynote
Machine Learning is not new. Big Machine Learning is qualitatively different: more data beats algorithmic improvement, scale trumps noise and sample-size effects, and manual tasks can be brute-forced.
Session presented at Big Data Spain 2014 Conference
18th Nov 2014
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Slides: https://speakerdeck.com/bigdataspain/state-of-play-data-science-on-hadoop-in-2015-by-sean-owen-at-big-data-spain-2014
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence... – Perficient, Inc.
Most organizations still rely on batch and offline processing of data streams to gain meaningful analysis and insight into their business. However, in our instant gratification world, real-time computation and analysis of streaming data is crucial in gaining insight into patterns and threats. A trend is emerging for real-time and instant analysis from live data streams, promoting the value of logs and a move toward functional programming.
This shift in technology is not about what and how to store the data, but what we can do with it to see emerging patterns and trends across multiple resources, applications, services and environments. Log data represents a wealth of information, yet is often sporadic, unstructured, scattered across the enterprise and difficult to track.
These slides provide insights into some of the most helpful Big Data tools used by the largest social media and data-centric organizations for competitive trends, instant analysis and feedback from large-volume data streams. We show how using the Big Data tools Storm and ElasticSearch with an elastic UI can turn application logs into real-time analytical views.
You will also learn how Big Data:
Contains data that is elastic, minimally structured, flexible and scalable
Helps process live streams into meaningful data
Promotes a move toward functional programming
Affects the enterprise data architecture
Works with real-time CEP tools like Storm for functional programming
How Are Graph Databases Used in Police Departments? – Samet KILICTAS
This presentation delivers the basics of the graph concept and graph databases to the audience. It explains how graph databases are used, with sample use cases from industry, and how they can be used by police departments. Questions like "When should I use a graph DB?" and "Should I solve this problem with a graph DB?" are answered.
Euro30 2019 - Benchmarking tree approaches on street data – Fabion Kauker
By examining the use of algorithms to solve the Prize Collecting Steiner Tree (PCST) problem, we consider the facets that determine effectiveness, specifically by measuring a number of solution approaches and comparing them on common metrics. To understand a solution approach, we must assess why it is useful. Our goal is to determine the effectiveness of Mixed Integer Programming (MIP) and heuristic methods. Using freely available street and address data, a base graph representation is created and computed on, such that a tree connects every address using the minimum total length of edges from the street network. This is the basis of many approaches used to solve infrastructure problems, including telecommunications network design and costing. The analysis covers methods developed by Hegde et al. 2015, Ljubić et al. 2006, and Teitz et al. 1963. We present a data processing architecture, a concise set of results, and a framework for assessing the facets and trade-offs of a given approach. The heuristic approaches prove advantageous in the simple case but fail when more complex requirements are added; this is where the MIP approach capitalizes, at the cost of flexibility due to the strictness and specificity of its modelling.
Scalable Machine Learning: The Role of Stratified Data Sharding – inside-BigData.com
In this deck from the 2019 Stanford HPC Conference, Srinivasan Parthasarathy from Ohio State University presents: Scalable Machine Learning: The Role of Stratified Data Sharding.
"With the increasing popularity of structured data stores, social networks and Web 2.0 and 3.0 applications, complex data formats, such as trees and graphs, are becoming ubiquitous. Managing and learning from such large and complex data stores, on modern computational eco-systems, to realize actionable information efficiently, is daunting. In this talk I will begin with discussing some of these challenges. Subsequently I will discuss a critical element at the heart of this challenge relates to the sharding, placement, storage and access of such tera- and peta- scale data. In this work we develop a novel distributed framework to ease the burden on the programmer and propose an agile and intelligent placement service layer as a flexible yet unified means to address this challenge. Central to our framework is the notion of stratification which seeks to initially group structurally (or semantically) similar entities into strata. Subsequently strata are partitioned within this eco-system according to the needs of the application to maximize locality, balance load, minimize data skew or even take into account energy consumption. Results on several real-world applications validate the efficacy and efficiency of our approach. (Notes: Joint work with Y. Wang (Airbnb) and A. Chakrabarti (MSR))."
Srinivasan Parthasarathy, Professor of Computer Science & Engineering, The Ohio State University
Srinivasan Parthasarathy is a Professor of Computer Science and Engineering and the director of the data mining research laboratory at Ohio State. His research interests span databases, data mining and high performance computing. He is among a handful of researchers nationwide to have won both the Department of Energy and National Science Foundation Career awards. He and his students have won multiple best paper awards or "best of" nominations from leading forums in the field, including SIAM Data Mining, ACM SIGKDD, VLDB, ISMB, WWW, ICDM, and ACM Bioinformatics. He chairs the SIAM data mining conference steering committee and serves on the action boards of ACM TKDD and ACM DMKD, leading journals in the field. Since 2012 he has also helped lead the creation of OSU's first-of-its-kind nationwide (USA) undergraduate major in data analytics and serves as one of its founding directors.
Watch the video: https://youtu.be/hOJI8e0p-UI
Learn more: http://web.cse.ohio-state.edu/~parthasarathy.2/
and
http://hpcadvisorycouncil.com/events/2019/stanford-workshop/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Geospatial Intelligence Middle East 2013 - Big Data – Steven Ramage
Some initial considerations and discussion points around geospatial big data. Location adds context and relevance. One needs to consider a number of V factors, including Value.
In 2013:
- 1.4 Trillion digital interactions happen per month.
- 2.9 million emails are sent every second.
- 72.9 products are ordered on Amazon per second.
That is a lot of connected data; graphs are truly everywhere. Companies are finding that graph database technology is helping them make sense of their big data.
Objectivity’s Nick Quinn, Chief Architect of InfiniteGraph, shows us just how popular graph databases have become and where they are being used, as well as showing us the ins and outs.
Do you want to build technology that does great things with big data? You might want to find out what your colleagues are Tweeting about, make recommendations for apps, music or other retail that result in higher purchase rates, discover hidden connections between new and recorded medical research data, or maybe even leverage intel across government agencies to catch the bad guys.
All this is possible with a graph database.
AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn... – Cambridge Semantics
Thomas Cook, director of sales, Cambridge Semantics, offers a primer on graph database technology and the rapid growth of knowledge graphs at Data Summit 2020 in his presentation titled "AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Connected World".
At Data-centric Architecture Forum 2020 Thomas Cook, our Sales Director of AnzoGraph DB, gave his presentation "Knowledge Graph for Machine Learning and Data Science". These are his slides.
Transforming AI with Graphs: Real World Examples using Spark and Neo4j – Databricks
Graphs – or information about the relationships, connections, and topology of data points – are transforming machine learning. We'll walk through real-world examples of how to transform your tabular data into a graph and how to get started with graph AI. This talk will provide an overview of how to incorporate graph-based features into traditional machine learning pipelines, create graph embeddings to better describe your graph topology, and give you a preview of approaches for graph-native learning using graph neural networks. We'll talk about relevant, real-world case studies in financial crime detection, recommendations, and drug discovery. This talk is intended to introduce the concept of graph-based AI to beginners, as well as help practitioners understand new techniques and applications. Key takeaways: how graph data can improve machine learning, when graphs are relevant to data science applications, what graph-native learning is and how to get started.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
As Europe's leading economic powerhouse and the fourth-largest #economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like #Russia and #China, #Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in #cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to #AdvancedPersistentThreats (#APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Adjusting primitives for graph: SHORT REPORT / NOTES – Subhajit Sahu
Graph algorithms, like PageRank …
Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is …
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... – John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
2. Page2
What is geospatial context?
•Given a point = (-122.412651, 37.777748), which city is it in?
•Does shape X intersect shape Y?
–Compute the intersection
•Given a sequence of points and a system of roads
–Compute best path representing points
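The first question reduces to a point-in-polygon test; a standard even-odd ray-casting sketch in plain Scala (a generic illustration, not Magellan's ESRI-backed implementation):

// Count crossings of a horizontal ray from the point: odd = inside.
def contains(polygon: Array[(Double, Double)], px: Double, py: Double): Boolean = {
  var inside = false
  var j = polygon.length - 1
  for (i <- polygon.indices) {
    val (xi, yi) = polygon(i)
    val (xj, yj) = polygon(j)
    val crosses = (yi > py) != (yj > py) &&
      px < (xj - xi) * (py - yi) / (yj - yi) + xi
    if (crosses) inside = !inside
    j = i
  }
  inside
}

// Rough bounding box around San Francisco (made-up coordinates):
val sf = Array((-122.52, 37.70), (-122.35, 37.70), (-122.35, 37.83), (-122.52, 37.83))
contains(sf, -122.412651, 37.777748)  // true for the point from this slide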
3. Page3
Geospatial context is useful
What neighborhoods do people go to on weekends?
Predict the drop-off neighborhood of a user
Predict the location where the next pick-up can be expected
How does usage pattern change with time?
Identify crime hotspot neighborhoods
How do these hotspots evolve with time?
Predict the likelihood of crime occurring in a given neighborhood
Predict climate at a fairly granular level
Climate insurance: do I need to buy insurance for my crops?
Climate as a factor in crime: join the climate dataset with the crimes dataset
7. Page7
Parsing!
•ESRI Shapefiles
–Spec for Shapes, no spec for metadata
–Worse, metadata = Dbase Format (really??)
•GeoJSON
–Verbose
–But at least parseable
–Unfortunately not common
•ESRI Format
–JSON but not GeoJSON!
9. Page9
Scalability (or the lack thereof)
•ESRI Hive (runs on Hadoop but lacks spatial joins)
•JTS, Geos, Shapely (no support for scalability)
•Other proprietary engines = black boxes
11. Page11
Feature Extractors
Language integration simplifies exploratory analytics
[Slide diagram: an ad-serving analytics pipeline. A batch data-prep flow parses and cleans logs, applies ad-category and query-category mappings, computes Q-Q and Q-A similarity with polynomial expansion of Q-A features, performs a train/test/validation split, fits a model via a convex solver, and reports metrics; a real-time flow scores the model at the ad server, backed by HDFS, with a feedback loop and spatial context feeding the features.]
12. Page12
Not all is lost!
• Local computations w/ ESRI Java API
• Scale-out computation w/ Spark
• Python + R support without compromising performance via PySpark, SparkR
• Catalyst + Data Sources + Data Frames = Flexibility + Simplicity + Performance
• Stitch it all together + allow extension points => Success!
13. Page13
Magellan: a complete story for geospatial?
Create geospatial analytics applications faster:
• Use your favorite language (Python/ Scala), even R
• Get best in class algorithms for common spatial analytics
• Write less code
• Read data efficiently
• Let the optimizer do the heavy lifting
14. Page14
How does it work?
Custom Data Types for Shapes:
• Point, Line, PolyLine, Polygon extend Shape
• Local Computations using ESRI Java API
• No need for Scala -> SQL serialization
Expressions for Operators:
• Literals, e.g. point(-122.4, 37.6)
• Boolean Expressions, e.g. Intersects, Contains
• Binary Expressions, e.g. Intersection
Custom Data Sources:
• Schema = [point, polyline, polygon, metadata]
• Metadata = Map[String, String]
• GeoJSON and Shapefile implementations
Custom Strategies for Spatial Join:
• Broadcast Cartesian Join
• Geohash Join (in progress)
• Plug into Catalyst as experimental strategies
15. Page15
Magellan in a nutshell
• Read Shapefiles/GeoJSON as Data Sources:
– sqlContext.read.format("magellan").load(path)
– sqlContext.read.format("magellan").option("type", "geojson").load(path)
• Spatial Queries using Expressions
– point(-122.5, 37.6) = Shape Literal
– $"point" within $"polygon" = Boolean Expression
– $"polygon1" intersection $"polygon2" = Binary Expression
• Joins using Catalyst + Spatial Optimizations
– points.join(polygons).where($"point" within $"polygon")
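Putting the bullets above together, a hypothetical end-to-end sketch (paths and column names are made up; trips is assumed to be a DataFrame with pickup longitude/latitude columns):

import org.apache.spark.sql.magellan.dsl.expressions._
import sqlContext.implicits._

val polygons = sqlContext.read.format("magellan").load("/data/sf-neighborhoods/")

val pickups = trips.withColumn("point", point($"pickup_lon", $"pickup_lat"))

// Catalyst plans this as a broadcast spatial join (the only join type
// currently supported, per the next slide).
val counts = pickups
  .join(polygons)
  .where($"point" within $"polygon")
  .groupBy($"metadata".getItem("neighborhood"))
  .count()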
16. Page16
Where are we at?
Magellan 1.0.3 is out on Spark Packages, go give it a try!
• Scala support, Python support will be functional in 1.0.4 (needs Spark 1.5)
• Github: https://github.com/harsha2010/magellan
• Spark Packages: http://spark-packages.org/package/harsha2010/magellan
• Data Formats: ESRI Shapefile + metadata, GeoJSON
• Operators: Intersects, Contains, Within, Intersection
• Joins: Broadcast
• Blog: http://hortonworks.com/blog/magellan-geospatial-analytics-in-spark/
• Zeppelin Notebook Example: http://bit.ly/1GwLyrV
17. Page17
What is next?
Magellan 1.0.4, expected to be released in December:
• Python support
• MultiPolygon (Polygon Collection), MultiLineString (PolyLine Collection)
• Spark 1.5, 1.6
• Spatial Join Optimization
• Map Matching Algorithms
• More Operators based on requirements
• Support for other common geospatial data formats (WKT, others?)