Graph Gurus Episode 37: Modeling for Kaggle COVID-19 Dataset | TigerGraph
Full Webinar: https://info.tigergraph.com/graph-gurus-37
In this Graph Gurus Episode, we:
-Learn how to process text and extract entities (words and phrases) as well as classes linking the entities using SciSpacy, a Natural Language Processing (NLP) tool.
-Import the output of NLP and semantically link it in TigerGraph
-Run advanced analytics queries with TigerGraph to analyze the relationships and deliver insights
Full Webinar: https://info.tigergraph.com/graph-gurus-21
In this Graph Gurus episode, we:
Explain the architecture and technical implementation for a TigerGraph + Spark graph-enhanced Machine Learning pipeline
Use TigerGraph both before training to extract (graph and non-graph) features and after training to apply the model on streaming data
Use Spark to train and tune machine learning models at scale
Present a solution in production at China Mobile that detects and prevents phone-based scams using machine learning with TigerGraph
Demo the data flow between Spark and TigerGraph via TigerGraph’s JDBC driver
Graph Databases and Machine Learning | November 2018 | TigerGraph
Graph Database and Machine Learning: Finding a Happy Marriage. Graph databases and machine learning both represent powerful tools for getting more value from data; learn how they can form a harmonious marriage to up-level machine learning.
Graph Gurus Episode 35: No Code Graph Analytics to Get Insights from Petabyte... | TigerGraph
Full Webinar: https://info.tigergraph.com/graph-gurus-35
By attending this webinar you will:
-Learn how to use TigerGraph’s no-code capabilities;
-Understand how TigerGraph is built for scale and performance;
-Get a deep dive into TigerGraph 3.0 feature enhancements.
Graph Gurus Episode 25: Unleash the Business Value of Your Data Lake with Gra... | TigerGraph
Full Webinar: https://info.tigergraph.com/graph-gurus-25
A new weapon is available for businesses wanting to accomplish more with Hadoop: native parallel graphs can reveal the connections across multiple domains and datasets in data lakes and provide powerful insights to deliver superior outcomes. In this webinar we will explain how native parallel graphs can analyze the information in data lakes to enable the following outcomes:
Recommending next best actions such as promoting a student loan to someone heading off to college, advocating life insurance to a newly married couple, and so on
Improving network utilization by analyzing petabytes of data collected from millions of IoT devices across a smart grid
Accelerating M&A activity by intelligently merging data lakes from multiple businesses.
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal... | Databricks
As data grows in size and connectedness dramatically in all dimensions, the potential for graph-enriched machine learning grows likewise, but scalable technologies are needed to both build models and apply them in real-time. Real-time deep-link graph pattern matching and analytics provides new opportunities for enriching your machine learning models with graph features.
In addition to the real-time deep-link aspect, the ability to process large datasets in a production pipeline provides a synergistic approach for the two distributed and performant platforms: Spark and TigerGraph. The TigerGraph graph database provides scalable real-time deep-link graph analytics and augments Spark with graph analytics and predictions for a wide range of Machine Learning use cases.
In this session, we will explain the architecture and technical implementation for a TigerGraph+Spark graph-enhanced Machine Learning pipeline: Use TigerGraph both before training to extract (graph and non-graph) features and after training to apply the model on streaming data; use Spark to train and tune machine learning models at scale. As an example, we will present a solution in production at China Mobile that detects and prevents phone-based scams using machine learning with TigerGraph.
Specifically, the solution generates 118 graph features for 600 million users, to feed a machine learning system which detects three types of unwanted phone calls. TigerGraph then helps to deploy the model by extracting these 118 features in real-time for up to 10,000 calls per second, to give customers a real-time diagnosis of their incoming calls.
Graph Gurus Episode 26: Using Graph Algorithms for Advanced Analytics Part 1 | TigerGraph
Full Webinar: https://info.tigergraph.com/graph-gurus-26
Have you ever wondered how routing apps like Google Maps find the best route from one place to another? Finding that route is solved by the Shortest Path graph algorithm. Today, graph algorithms are moving from the classroom to a host of important and valuable operational and analytical applications. This webinar will give you an overview of graph algorithms, how to use them, and the categories of problems they can solve, and then take a closer look at path algorithms. This webinar is the first part in a five-part series, each part examining a different type of problem to be solved.
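The Shortest Path algorithm the webinar references can be sketched in a few lines of Python. The following is a minimal Dijkstra over an adjacency-dict graph; the road network and its weights are illustrative, not from the webinar:

```python
import heapq

def dijkstra(graph, source):
    """Return shortest-path distances from source to every reachable node.

    graph: dict mapping node -> list of (neighbor, edge_weight) pairs.
    """
    dist = {source: 0}
    heap = [(0, source)]                    # (distance-so-far, node)
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue                        # stale entry, already improved
        for neighbor, weight in graph.get(node, []):
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist

# Toy road network: edge weights are travel minutes (made-up values).
roads = {
    "A": [("B", 5), ("C", 2)],
    "C": [("B", 1), ("D", 7)],
    "B": [("D", 3)],
}
print(dijkstra(roads, "A"))  # {'A': 0, 'B': 3, 'C': 2, 'D': 6}
```

Note how the detour A→C→B (cost 3) beats the direct edge A→B (cost 5), which is exactly the kind of improvement a routing app surfaces.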
Graph Gurus Episode 17: Seven Key Data Science Capabilities Powered by a Nati... | TigerGraph
This webinar will demonstrate seven key data science capabilities using TigerGraph’s intuitive GUI, GraphStudio and GSQL queries. In this episode, we:
-Share the capabilities and tie them to specific use cases across the healthcare, pharmaceutical, financial services, telecom, internet, and government industries.
-Walk you through a sample dataset, GraphStudio UI flow, and GSQL queries demonstrating the capabilities.
-Cover client case studies for Amgen, Intuit, China Mobile, Santa Clara County, and other enterprise customers
In the big data world, it's not always easy for Python users to move huge amounts of data around. Apache Arrow defines a common format for data interchange, while Arrow Flight, introduced in version 0.11.0, provides a means to move that data efficiently between systems. Arrow Flight is a framework for Arrow-based messaging built with gRPC. It enables data microservices where clients can produce and consume streams of Arrow data to share it over the wire. In this session, I'll give a brief overview of Arrow Flight from a Python perspective, and show that it's easy to build high performance connections when systems can talk Arrow. I'll also cover some ongoing work in using Arrow Flight to connect PySpark with TensorFlow - two systems with great Python APIs but very different underlying internal data representations.
A brief introduction to Cerved data, the role of the data scientist at Cerved, and how a data scientist can take advantage of a graph database.
Bio:
Stefano Gatti: Born in 1970, he has been involved for more than 15 years in several big-data and technology-driven projects at leading business information companies such as Lince and Cerved. He is very fond of agile methodologies and tries to apply them at all organizational levels. In recent years he has been strongly engaged in spreading innovation at Cerved and in taking advantage of new big and smart data technologies, especially from a business usage perspective. Datatelling, open innovation, and partnerships with smart actors in the worldwide data-driven innovation ecosystem are his current mantras. Nunzio Pellegrino: Data Scientist at Cerved, part of the Innovation team, focused on extracting value from data and solving problems with the latest available technologies. I have a degree in Statistics with a background in Machine Learning, and I worked primarily on Data Integration and Business Intelligence projects for 3 years. At the moment, I am product owner of a web application based on a graph database and am involved in Italian Open Data projects. I am an R enthusiast, a Python practitioner, and fascinated by the graph ecosystem.
Graph Gurus Episode 27: Using Graph Algorithms for Advanced Analytics Part 2 | TigerGraph
Full Webinar: https://info.tigergraph.com/graph-gurus-27
What does finding the best location for a warehouse/office/retail store have in common with finding the most influential person in a referral network? Answer: they are both Centrality problems and can be solved with graph algorithms. Join us for Part 2 of our five-part webinar series on using graph algorithms for advanced analytics.
By attending this webinar you will:
- Hear about use cases for centrality graph algorithms
- Learn how to select the right algorithm for your use case
- Be able to run and tailor GSQL graph algorithms
Using Graph Algorithms for Advanced Analytics - Part 2 Centrality | TigerGraph
What does finding the best location for a warehouse/office/retail store have in common with finding the most influential person in a referral network? Answer: they are both Centrality problems and can be solved with graph algorithms.
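As a concrete illustration of the centrality idea (a plain-Python sketch, not TigerGraph's GSQL implementation), closeness centrality can be computed with breadth-first search; the tiny "referral network" below is made up:

```python
from collections import deque

def closeness(graph, node):
    """Closeness centrality: (n-1) / sum of BFS distances to all other nodes.

    graph: dict mapping node -> set of neighbors (undirected, unweighted).
    Assumes the graph is connected.
    """
    dist = {node: 0}
    queue = deque([node])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return (len(graph) - 1) / sum(dist.values())

# A hub-and-spoke referral network: 'hub' should score highest.
g = {
    "hub": {"a", "b", "c"},
    "a": {"hub"},
    "b": {"hub"},
    "c": {"hub", "d"},
    "d": {"c"},
}
scores = {n: closeness(g, n) for n in g}
print(max(scores, key=scores.get))  # hub
```

The node with the smallest total distance to everyone else wins, which is the same intuition behind placing a warehouse centrally.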
Applying graph analytics on data stored in relational databases can provide tremendous value in many application domains. We discuss the importance of leveraging these analyses, and the challenges in enabling them. We present a tool, called GraphGen, that allows users to visually explore, and rapidly analyze (using NetworkX) different graph structures present in their databases.
Predicting Influence and Communities Using Graph Algorithms | Databricks
Relationships are one of the most predictive indicators of behavior and preferences. Community detection based on relationships is a powerful tool for inferring similar preferences in peer groups, anticipating future behavior, estimating group resiliency, finding hierarchies, and preparing data for other analysis. Centrality measures based on relationships identify the most important items in a network and help us understand group dynamics such as influence, accessibility, the speed at which things spread, and bridges between groups. Data scientists use graph algorithms to identify groups and estimate important entities based on their interactions. In this session, we'll cover the common uses of community detection and centrality measures and how some of the iconic graph algorithms compute values. We'll show examples of how to run community detection and centrality algorithms in Apache Spark, including using the AggregateMessages function to add your own algorithms. You'll learn best practices and tips for tricky situations. For those who want to run graph algorithms in a graph platform, we'll also illustrate a few examples in Neo4j. Some of the community detection and centrality algorithms included:
- Triangle Count and Clustering Coefficient to estimate network cohesiveness
- Strongly Connected Components and Connected Components to find clusters
- Label Propagation to quickly infer groups and clean data with semi-supervised learning
- Louvain Modularity to uncover group hierarchies
- Balanced Triad to identify unstable groups
- PageRank to reveal influencers
- Betweenness Centrality to predict bottlenecks and bridges
Authors: Amy Hodler, Sören Reichardt
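Of the algorithms listed above, Label Propagation lends itself to a compact sketch. Below is a toy semi-supervised variant in plain Python: seed labels spread to unlabeled neighbors by majority vote. The two-triangle graph and the seed labels are invented, and this is not the Spark AggregateMessages implementation the talk demonstrates:

```python
from collections import Counter

def propagate(graph, seeds, max_iters=10):
    """Spread seed labels through the graph: each unlabeled node adopts the
    most common label among its labeled neighbors. Ties break by label sort
    order so the toy example stays deterministic."""
    labels = dict(seeds)
    for _ in range(max_iters):
        changed = False
        for node in sorted(graph):
            if node in seeds:
                continue                      # seed labels stay fixed
            counts = Counter(labels[v] for v in graph[node] if v in labels)
            if not counts:
                continue                      # no labeled neighbor yet
            top = max(counts.values())
            best = min(l for l, c in counts.items() if c == top)
            if labels.get(node) != best:
                labels[node] = best
                changed = True
        if not changed:
            break                             # converged
    return labels

# Two triangles joined by a bridge edge, with one seed in each triangle.
g = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "x"},
    "x": {"y", "z", "c"}, "y": {"x", "z"}, "z": {"x", "y"},
}
labels = propagate(g, seeds={"a": "red", "y": "blue"})
print(labels)
```

Each triangle ends up uniformly labeled from its own seed, illustrating why the talk pairs Label Propagation with semi-supervised data cleaning.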
In this session you will learn how H&M has created a reference architecture for deploying their machine learning models on Azure using Databricks, following DevOps principles. The architecture is currently used in production and has been iterated on multiple times to solve some of the discovered pain points. The presenting team is currently responsible for ensuring that best practices are implemented across all H&M use cases, covering hundreds of models across the entire H&M group. This architecture not only lets data scientists use notebooks for exploration and modeling, but also gives engineers a way to build robust, production-grade code for deployment. The session will also cover topics such as lifecycle management, traceability, automation, scalability, and version control.
AI on Spark for Malware Analysis and Anomalous Threat Detection | Databricks
At Avast, we believe everyone has the right to be safe. We are dedicated to creating a world that provides safety and privacy for all, no matter where you are, who you are, or how you connect. With over 1.5 billion attacks stopped and 30 million new executable files monthly, big data pipelines are crucial for the security of our customers. At Avast we are leveraging Apache Spark machine learning libraries and TensorFlowOnSpark for a variety of tasks ranging from marketing and advertisement, through network security, to malware detection. This talk will cover our main cybersecurity use cases for Spark. After describing our cluster environment, we will first demonstrate anomaly detection on time series of threats. With thousands of types of attacks and malware, AI helps human analysts select and focus on the most urgent or dire threats. We will walk through our setup for distributed training of deep neural networks with TensorFlow, through to deploying and monitoring a streaming anomaly detection application with the trained model. Next we will show how we use Spark for analysis and clustering of malicious files and for large-scale experimentation to automatically process and handle changes in malware. Finally, we will give a comparison with other tools we used for solving these problems.
Swift Parallel Scripting for High-Performance Workflow | Daniel S. Katz
The Swift scripting language was created to provide a simple, compact way to write parallel scripts that run many copies of ordinary programs concurrently in various workflow patterns, reducing the need for complex parallel programming or arcane scripting to achieve this common high-level task. The result was a highly portable programming model based on implicitly parallel functional dataflow. The same Swift script runs on multi-core computers, clusters, grids, clouds, and supercomputers, and is thus a useful tool for moving workflow computations from laptop to distributed and/or high performance systems.
Swift has proven to be very general and is in use in domains ranging from earth systems to bioinformatics to molecular modeling. It has more recently been adapted to serve as a programming model for much finer-grained in-memory workflows on extreme-scale systems, where it can sustain task rates in the millions to billions per second.
In this talk, we describe the state of Swift's implementation, present several Swift applications, and discuss ideas for the future evolution of the programming model on which it's based.
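The "implicitly parallel functional dataflow" idea is not tied to Swift's syntax. As a rough stdlib-Python analogy (a sketch of the pattern, not Swift itself; the two stage functions are invented stand-ins), futures let many copies of an ordinary function run concurrently while ordering comes only from data dependencies:

```python
from concurrent.futures import ThreadPoolExecutor

def simulate(param):
    """Stand-in for an ordinary program a workflow script runs many times."""
    return param * param

def summarize(results):
    """Stand-in for a downstream stage that consumes all prior outputs."""
    return sum(results)

with ThreadPoolExecutor() as pool:
    # The map expresses a parallel foreach over independent inputs; the
    # only ordering constraint is the hand-off from stage 1 to stage 2,
    # exactly the dataflow style the talk describes.
    squares = list(pool.map(simulate, range(10)))
    total = summarize(squares)

print(total)  # 285
```

The same script shape works whether the pool is threads on a laptop or, in Swift's case, tasks spread across a cluster or supercomputer.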
Cloud Technologies for Microsoft Computational Biology Tools | ijait
Executing a large number of independent tasks, or tasks with minimal inter-task communication, in parallel is a common requirement in many domains. In this paper, we present our experience applying two new Microsoft technologies, Dryad and Azure, to three bioinformatics applications. We also compare with traditional MPI and Apache Hadoop MapReduce implementations in one example. The applications are an EST (Expressed Sequence Tag) sequence assembly program, the PhyloD statistical package to identify HLA-associated viral evolution, and a pairwise Alu gene alignment application. We give a detailed performance discussion on a 768-core Windows HPC Server cluster and an Azure cloud. All the applications start with a "doubly data parallel step" involving independent data chosen from two similar (EST, Alu) or two different databases (PhyloD). There are different structures for the final stages of each application.
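The "doubly data parallel step" pattern, independent computations over every pair drawn from two datasets, can be sketched generically in Python. The similarity function below is a toy placeholder, not the paper's actual alignment code:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def score(pair):
    """Placeholder pairwise computation: count positions where two
    sequences agree (a stand-in for a real alignment score)."""
    a, b = pair
    return (a, b, sum(1 for x, y in zip(a, b) if x == y))

left = ["GATTACA", "GATTTCA"]
right = ["GATCACA", "CATTACA"]

# Every (left, right) pair is independent, so the entire cross product
# can be farmed out at once -- this is what makes the step "doubly"
# data parallel.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(score, product(left, right)))

for a, b, s in results:
    print(a, b, s)
```

With |left| x |right| independent work items, frameworks like Dryad, MapReduce, or MPI differ mainly in how they schedule and collect this grid, which is what the paper's comparison measures.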
Abstract: The processing power of computing devices has increased with the number of available cores. This paper presents an approach to clustering categorical data on a multi-core platform. The k-modes algorithm is used for clustering categorical data; it uses a simple matching dissimilarity measure for distance computation. The multi-core approach aims to achieve a speedup in processing. OpenMP (Open Multi-Processing) is used to parallelize the k-modes algorithm; OpenMP is a shared-memory API that manages threads using the fork-join model. The dataset used for the experiment is the Congressional Voting dataset from the UCI repository, which contains members' votes in categorical form, provided as CSV. The experiment is performed for an increasing number of clusters and increasing dataset sizes.
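A minimal serial version of the k-modes step described above (simple matching dissimilarity plus per-attribute mode update) can be sketched in plain Python; the OpenMP speedup in the paper comes from parallelizing the assignment loop, which this sketch keeps serial, and the toy votes below are invented, not the UCI data:

```python
from collections import Counter

def dissimilarity(a, b):
    """Simple matching dissimilarity: number of attributes that differ."""
    return sum(x != y for x, y in zip(a, b))

def k_modes(records, modes, iters=10):
    """Assign each record to its nearest mode, then recompute each mode as
    the per-attribute majority of its cluster. `modes` seeds the clusters."""
    for _ in range(iters):
        clusters = [[] for _ in modes]
        for rec in records:
            best = min(range(len(modes)),
                       key=lambda i: dissimilarity(rec, modes[i]))
            clusters[best].append(rec)
        new_modes = []
        for cluster, old in zip(clusters, modes):
            if not cluster:
                new_modes.append(old)         # keep an empty cluster's mode
                continue
            new_modes.append(tuple(
                Counter(col).most_common(1)[0][0] for col in zip(*cluster)))
        if new_modes == modes:
            break                             # converged
        modes = new_modes
    return modes, clusters

# Toy categorical votes ("y"/"n" on three bills), loosely in the spirit
# of the Congressional Voting dataset.
votes = [("y", "y", "n"), ("y", "y", "y"), ("y", "y", "n"),
         ("n", "n", "n"), ("n", "n", "y")]
modes, clusters = k_modes(votes, modes=[("y", "y", "y"), ("n", "n", "n")])
```

The inner `for rec in records` loop is embarrassingly parallel, which is exactly where the paper applies an OpenMP parallel-for.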
Comparison of Cost Estimation Methods using Hybrid Artificial Intelligence on... | IJERA Editor
Cost estimating at the schematic design stage, as the basis of project evaluation, engineering design, and cost management, plays an important role in project decisions under a limited definition of scope, constraints on available information and time, and the presence of uncertainties. The purpose of this study is to compare the performance of cost estimation models built with two different hybrid artificial intelligence approaches: regression analysis with an adaptive neuro-fuzzy inference system (RANFIS) and case-based reasoning with a genetic algorithm (CBR-GA). The models were developed from the same 50 low-cost apartment project datasets in Indonesia. Tested on another five testing datasets, the models were proven to perform very well in terms of accuracy. The CBR-GA model was found to be the best performer but suffered from the disadvantage of needing 15 cost drivers, compared to only 4 cost drivers required by RANFIS for on-par performance.
This gives a characterization of the machine learning computations and brings out the deficiencies of Hadoop 1.0. It gives the motivation for Hadoop YARN and a brief view of YARN architecture. It illustrates the power of specialized processing frameworks over YARN, such as Spark and GraphLab. In short, Hadoop YARN allows your data to be stored in HDFS and specialized processing frameworks may be used to process the data in various ways.
Addresses streaming data challenges in sampling rates, cache maintenance, deductive reasoning, and the surrounding Semantic Web framework. Using a fixed-size cache, the challenge is to identify and preserve assertions within a stream. Deductive reasoning will continuously be performed over the cache to draw relevant conclusions as quickly as possible. The use of a cache differentiates our work from state-of-the-art works in deductive stream reasoning in that the cache enables us to temporarily store propositions that are no longer in the stream window.
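The fixed-size cache described above can be sketched with a least-recently-used eviction policy (a simplification; the work's actual heuristic for deciding which assertions to preserve may differ):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Fixed-size cache of streamed assertions with LRU eviction.
public class AssertionCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public AssertionCache(int capacity) {
        // accessOrder = true makes iteration order least-recently-used first
        super(16, 0.75f, true);
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evict the least-recently-used entry once the fixed size is exceeded
        return size() > capacity;
    }
}
```

Deductive reasoning would then run continuously over the cache's current entries rather than the raw stream window, so recently used assertions outlive their window.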
Synergy of Human and Artificial Intelligence in Software EngineeringTao Xie
Keynote Talk by Tao Xie at International NSF sponsored Workshop on Realizing Artificial Intelligence Synergies in Software Engineering (RAISE 2013) http://promisedata.org/raise/2013/
A Modified Technique For Performing Data Encryption & Data DecryptionIJERA Editor
In this age of universal electronic connectivity, of viruses and hackers, of electronic eavesdropping and electronic fraud, there is indeed a need to store information securely. This, in turn, has led to a heightened awareness of the need to protect data and resources from disclosure, to guarantee the authenticity of data and messages, and to protect systems from network-based attacks. Information security via encryption-decryption techniques has been a very popular research area for many years. This paper elaborates the basic concepts of cryptography, especially public-key and private-key cryptography. It also contains a review of some popular encryption-decryption algorithms, and a modified method is proposed that is fast in comparison to existing methods.
The Future is Big Graphs: A Community View on Graph Processing SystemsNeo4j
Alexandru Iosup, Full Professor, Vrije Universiteit Amsterdam (VU Amsterdam)
Angela Bonifati, Full Professor of Computer Science, Université de Lyon
Hannes Voigt, Software Engineer, Neo4j
This deck was presented at the Spark meetup at Bangalore. The key idea behind the presentation was to focus on limitations of Hadoop MapReduce and introduce both Hadoop YARN and Spark in this context. An overview of the other aspects of the Berkeley Data Analytics Stack was also provided.
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It comes, however, with the precondition that the input graph contain no dead ends. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph whose vertices were split by component. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
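The levelwise scheme can be sketched as follows: ranks are iterated one topological level at a time, with contributions from already-finalized earlier levels held constant. This is an illustrative simplification (helper names are hypothetical; the SCC decomposition and the report's dead-end handling strategy are omitted):

```java
import java.util.Arrays;
import java.util.List;

public class LevelwisePageRank {
    // out[u] lists u's out-neighbours; levels holds vertex ids grouped into
    // topologically ordered blocks (assumed precomputed from the SCC
    // decomposition). Damping d = 0.85; dead-end handling is omitted.
    public static double[] compute(int[][] out, List<int[]> levels, int iters) {
        int n = out.length;
        double d = 0.85;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);
        for (int[] level : levels) {            // process one level at a time
            for (int it = 0; it < iters; it++) {
                double[] next = rank.clone();
                for (int v : level) next[v] = (1 - d) / n;
                for (int u = 0; u < n; u++)
                    for (int v : out[u])
                        if (inLevel(level, v))  // only current-level ranks change
                            next[v] += d * rank[u] / out[u].length;
                for (int v : level) rank[v] = next[v];
            }
        }
        return rank;
    }

    private static boolean inLevel(int[] level, int v) {
        for (int x : level) if (x == v) return true;
        return false;
    }
}
```

Because each level only reads the (already converged) ranks of earlier levels, levels can in principle be iterated without per-iteration communication, which is the property the abstract highlights.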
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...pchutichetpong
M Capital Group (“MCG”) expects demand to grow and supply to evolve, facilitated by institutional investment rotating out of offices and into work-from-home (“WFH”) arrangements, while the need for data storage keeps expanding as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as advancing cloud services and edge sites, allowing the industry to expect strong annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, as illustrated by the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments; MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment, will drive market momentum forward. The continuous injection of capital by alternative investment firms, as well as growing infrastructure investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x by value in 2026, will likely help propel data-center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
As Europe's leading economic powerhouse and the fourth-largest #economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like #Russia and #China, #Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in #cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to #AdvancedPersistentThreats (#APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Plume - A Code Property Graph Extraction and Analysis Library
1. Plume
A Code Property Graph Extraction and
Analysis Library
S.D. Baker Effendi, A.B. van der Merwe, & W. Visser
Stellenbosch University
Using Code Property Graphs and Pushdown
Systems for Static Analysis
2. | GRAPHAIWORLD.COM | #GRAPHAIWORLD |
❏ Introduction to Plume
❏ Background
❏ Code Property Graph
❏ Data-Flow Analysis
❏ Pushdown Systems
❏ How Plume works
❏ The future of Plume
Overview
3.
❏ Plume is an open-source, static
analysis library
❏ A code property graph is extracted
from JVM bytecode
❏ This code property graph is stored in a
graph database backend
❏ Data-flow analysis is run on the graph
database by using graph queries
❏ Written using Kotlin which is
interoperable with Java
Introduction
5.
The Code Property Graph
F Yamaguchi, et al. introduced the code property graph (CPG) that merges the
❏ abstract syntax tree (AST),
❏ control flow graph (CFG), and
❏ program dependence graph (PDG)
into a joint data structure.
Illustration of a code property graph from the original paper “Modeling and Discovering Vulnerabilities with Code Property Graphs”
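The merged structure can be illustrated with a toy model in which a single vertex set carries three kinds of labeled edges, one per merged subgraph (a deliberate simplification of the real CPG schema; class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

public class Cpg {
    public enum EdgeKind { AST, CFG, PDG }     // the three merged views

    public record Edge(int from, int to, EdgeKind kind) {}

    private final List<String> vertices = new ArrayList<>(); // vertex labels
    private final List<Edge> edges = new ArrayList<>();

    public int addVertex(String label) {
        vertices.add(label);
        return vertices.size() - 1;            // vertex id
    }

    public void addEdge(int from, int to, EdgeKind kind) {
        edges.add(new Edge(from, to, kind));
    }

    // Successors of v restricted to one of the merged subgraphs
    public List<Integer> successors(int v, EdgeKind kind) {
        List<Integer> out = new ArrayList<>();
        for (Edge e : edges)
            if (e.from() == v && e.kind() == kind) out.add(e.to());
        return out;
    }
}
```

The point of the joint structure is exactly this: one query can mix syntax (AST), control flow (CFG), and dependence (PDG) edges over the same vertices.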
6.
The Code Property Graph
❏ The CPG is independent of the
programming language
❏ Software vulnerabilities can be
identified from the CPG
❏ Graph patterns of known
vulnerabilities are then matched
❏ ShiftLeft have commercialized the
CPG for DevSecOps
Illustration of a CPG projection from ShiftLeft.io
Yamaguchi, Fabian, et al. "Modeling and discovering vulnerabilities with code property graphs." 2014 IEEE Symposium on Security and Privacy. IEEE, 2014.
7.
Data-Flow Analysis
❏ Data-flow analysis is a technique for
gathering information about the
possible set of values calculated at
various points in a program
❏ The control flow graph is used to
determine where a particular value
might propagate
Sagiv, Mooly, Thomas Reps, and Susan Horwitz. "Precise interprocedural dataflow analysis with applications to constant propagation." Theoretical Computer Science 167.1-2 (1996): 131-170.
The supergraph is annotated with the dataflow functions for the “possibly-uninitialized variables” problem.
8.
Data-Flow Analysis
❏ A procedure is a small section of a program
that performs a specific task
❏ Intraprocedural analysis looks at analyzing a
single procedure
❏ Interprocedural analysis uses calling
relationships among multiple procedures
❏ Example analyses are:
❏ reaching definitions
❏ liveness analysis
❏ constant propagation
Reps, Thomas, Susan Horwitz, and Mooly Sagiv. "Precise interprocedural dataflow analysis via graph reachability." Proceedings of the 22nd ACM SIGPLAN-SIGACT symposium on Principles of
programming languages. 1995.
The exploded super-graph that corresponds to the instance of the
possibly-uninitialized variables problem shown in the last figure.
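As an illustration of one such analysis, here is a minimal iterative reaching-definitions solver over a toy CFG. It is a sketch only (at most one definition per node, names hypothetical), not the IFDS/graph-reachability formulation of the cited papers:

```java
import java.util.*;

public class ReachingDefinitions {
    // def[n] is the variable defined at node n ("" if none); succ[n] lists
    // n's CFG successors. Returns, for each node, the set of definition
    // sites (node ids) reaching that node's entry.
    public static List<Set<Integer>> solve(String[] def, int[][] succ) {
        int n = def.length;
        List<Set<Integer>> in = new ArrayList<>();
        List<Set<Integer>> out = new ArrayList<>();
        for (int i = 0; i < n; i++) { in.add(new HashSet<>()); out.add(new HashSet<>()); }
        boolean changed = true;
        while (changed) {                       // iterate to a fixed point
            changed = false;
            for (int node = 0; node < n; node++) {
                Set<Integer> newOut = new HashSet<>(in.get(node));
                if (!def[node].isEmpty()) {
                    int nd = node;
                    // kill earlier definitions of the same variable, then gen
                    newOut.removeIf(d -> def[d].equals(def[nd]));
                    newOut.add(node);
                }
                if (!newOut.equals(out.get(node))) { out.set(node, newOut); changed = true; }
                for (int s : succ[node])
                    if (in.get(s).addAll(out.get(node))) changed = true;
            }
        }
        return in;
    }
}
```

For the two-statement chain x=1; x=2; use x, only the second definition reaches the use, since the first is killed.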
9.
Data-Flow Analysis
❏ Reps, Horwitz and Sagiv introduced
frameworks for a general way of
solving these problems in polynomial
time
❏ E Bodden created a generic IFDS/IDE
solver on top of Soot
❏ This enabled a wider range of
analyses such as typestate
and information-flow
Bodden, Eric. "Inter-procedural data-flow analysis with IFDS/IDE and Soot." Proceedings of the ACM SIGPLAN International Workshop on State of the Art in Java Program analysis. 2012.
Exploded super-graph for an IFDS information-flow analysis.
10.
Soot
❏ Soot is a Java optimization framework
originally developed by the Sable Research
Group of McGill University
❏ Soot provides a range of analyses such as:
❏ call-graph construction
❏ points-to analysis
❏ data-flow analysis with IFDS/IDE
❏ Soot transforms programs into an intermediate
representation (IR) which is then analyzed
Soot - A framework for analyzing and transforming Java and Android applications https://soot-oss.github.io/soot
11.
IDE in Typestate Analysis
❏ Typestates define valid sequences of operations that can be performed upon
an instance of a given type
❏ Aliasing refers to the situation where the same memory location can be
accessed using different names
❏ Späth et al. presented an alias-aware extension of the IDE framework,
IDEal, which improved the efficiency and precision of typestate analysis
File a = new File();
File b = a;
b.open();
a.close();
Späth, J., Ali, K., & Bodden, E. (2017). IDEal: efficient and precise alias-aware dataflow analysis. Proc. ACM Program. Lang., 1(OOPSLA), 99-1.
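The aliasing problem in the snippet above can be made concrete with a toy alias-aware typestate tracker (illustrative only, not IDEal's actual algorithm): because a and b refer to the same abstract object, b.open() followed by a.close() is a valid open/close sequence, which a per-variable-name analysis would miss.

```java
import java.util.HashMap;
import java.util.Map;

public class TypestateCheck {
    public enum State { CLOSED, OPEN, ERROR }

    // env maps variable names to abstract objects; states maps each
    // abstract object to its current typestate.
    private final Map<String, Integer> env = new HashMap<>();
    private final Map<Integer, State> states = new HashMap<>();
    private int nextObj = 0;

    public void newFile(String var) {           // var = new File()
        env.put(var, nextObj);
        states.put(nextObj, State.CLOSED);
        nextObj++;
    }

    public void alias(String to, String from) { // to = from
        env.put(to, env.get(from));             // alias-aware: share the object
    }

    public void open(String var)  { step(var, State.CLOSED, State.OPEN); }
    public void close(String var) { step(var, State.OPEN, State.CLOSED); }

    private void step(String var, State expect, State next) {
        int obj = env.get(var);                 // resolve through aliases
        states.put(obj, states.get(obj) == expect ? next : State.ERROR);
    }

    public State stateOf(String var) { return states.get(env.get(var)); }
}
```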
12.
Data-Flow Analysis Limitations
Rice’s theorem
Any non-trivial, semantic property of a program is
undecidable.
A semantic property concerns a program’s behaviour
e.g. does a program terminate for all inputs?
To ensure an analysis terminates, we need to
bound the data-flow domain, which ultimately leads
to imprecision. One technique is to limit
field-access paths to length k.
If we have an algorithm that decides a non-trivial property, we can
construct a Turing machine that decides the halting problem.
By Booyabazooka - Own work, Public Domain,
https://commons.wikimedia.org/w/index.php?curid=5407483
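k-limiting itself is a one-line operation: any field-access path longer than k is truncated, so paths sharing a k-prefix become indistinguishable (class name hypothetical):

```java
import java.util.List;

public class KLimiting {
    // Truncate a field-access path such as [f, g, h] (i.e. x.f.g.h) to at
    // most k fields; the dropped suffix is the source of imprecision.
    public static List<String> limit(List<String> accessPath, int k) {
        return accessPath.size() <= k ? accessPath : accessPath.subList(0, k);
    }
}
```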
13.
Pushdown Systems
❏ A pushdown automaton (PDA) is a
finite-state automaton with extra memory
called a stack
❏ Each state is called a control location
❏ This class of automata recognizes
context-free languages (CFLs)
❏ A CFL is generated by a context-free
grammar (CFG)
A diagram of a pushdown automaton.
By Jochgem - Own work, CC BY-SA 3.0,
https://commons.wikimedia.org/w/index.php?curid=4983792
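A minimal example of the extra stack memory at work: a one-state PDA recognising the context-free language of balanced brackets, which no finite-state automaton can accept (names illustrative):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class BracketPda {
    // Push on '[', pop on ']', accept when the stack is empty at the end.
    public static boolean accepts(String input) {
        Deque<Character> stack = new ArrayDeque<>(); // the PDA's extra memory
        for (char c : input.toCharArray()) {
            if (c == '[') {
                stack.push(c);
            } else if (c == ']') {
                if (stack.isEmpty()) return false;   // no matching '['
                stack.pop();
            } else {
                return false;                        // not in the input alphabet
            }
        }
        return stack.isEmpty();
    }
}
```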
14.
Pushdown Systems
❏ Context- and field-sensitivity can be
expressed as CFL reachability problems
❏ Späth et al. introduced the notion of
synchronized pushdown systems (SPDS)
to efficiently solve any single
CFL-reachability problem
❏ An SPDS is a combination of two
flow-sensitive pushdown systems: a call-PDS
and a field-PDS
Späth, Johannes, Karim Ali, and Eric Bodden. "Context-, flow-, and field-sensitive data-flow analysis using synchronized pushdown systems." Proceedings of the ACM on
Programming Languages 3.POPL (2019): 1-29.
A points-to analysis can be formulated as a reachability problem under a Dyck language, i.e. a language of balanced parentheses.
Yuan, Hao, and Patrick Eugster. "An efficient algorithm for solving the
dyck-cfl reachability problem on trees." European Symposium on
Programming. Springer, Berlin, Heidelberg, 2009.
15.
Pushdown System of Calls
Späth, Johannes, Karim Ali, and Eric Bodden. "Context-, flow-, and field-sensitive data-flow analysis using synchronized pushdown systems." Proceedings of the ACM on
Programming Languages 3.POPL (2019): 1-29.
Data-flow example for a simple recursive program.
Automaton computed with the post* algorithm.
The structure of a call-PDS:
❏ Control locations are program
variables
❏ The stack alphabet is the set of
program statements
❏ The rule set models the data-flow
effect of a variable at a statement
This automaton provides
context-sensitivity.
16.
Pushdown System of Fields
The structure of a field-PDS:
❏ Control locations are pairs of a variable
and a statement
❏ The stack alphabet is the set of all fields
of a program
❏ The rule set models the data-flow
within the access paths
This automaton provides field-sensitivity.
Späth, Johannes, Karim Ali, and Eric Bodden. "Context-, flow-, and field-sensitive data-flow analysis using synchronized pushdown systems." Proceedings of the ACM on
Programming Languages 3.POPL (2019): 1-29.
Data-flow example for a simple if-else statement with field accesses.
Automaton computed with the post* algorithm.
17.
Synchronized Pushdown Systems
Property | Pushdown System of Calls | Pushdown System of Fields | SPDS
Flow-sensitive | ✔ | ✔ | ✔
Context-sensitive | ✔ | ✘ | ✔
Field-sensitive | ✘ | ✔ | ✔
Späth, Johannes, Karim Ali, and Eric Bodden. "Context-, flow-, and field-sensitive data-flow analysis using synchronized pushdown systems." Proceedings of the ACM on
Programming Languages 3.POPL (2019): 1-29.
Both pushdown systems can answer reachability queries and handle recursive
structures.
Each PDS has a precision advantage over the other so by combining them we
get the precision benefits of both.
18.
SPDS Advantages
Späth, Johannes, Karim Ali, and Eric Bodden. "Context-, flow-, and field-sensitive data-flow analysis using synchronized pushdown systems." Proceedings of the ACM on
Programming Languages 3.POPL (2019): 1-29.
❏ The PDA of fields is a concise and finite
representation of (potentially infinitely many)
access paths
❏ No need to resort to k-limiting - preserves
precision!
❏ In pointer-analysis, SPDS avoids exponential
growth of the abstract domain by using
PDS-based encoding
❏ Typestate information can be encoded as
weights to any of the PDAs
A PDA of fields and its finite representation of an infinite set of
access paths.
19.
SPDS Limitations
Späth, Johannes, Karim Ali, and Eric Bodden. "Context-, flow-, and field-sensitive data-flow analysis using synchronized pushdown systems." Proceedings of the ACM on
Programming Languages 3.POPL (2019): 1-29.
SPDS over-approximates in corner cases where a
context-insensitive data-flow path occurs at the
same time as a field-sensitive path or vice versa.
These typically arise only in synthetic examples
and, based on Späth et al.'s empirical evaluation,
such situations do not arise in practice.
Thus, an improperly matched call site does not
induce a properly matched field access.
21.
Features of Plume
Code Property Graph
+
Synchronized Pushdown Systems
+
Graph Database
=
❏ Language independent analysis on the CPG
❏ Provides flow-, context-, field- sensitive and
alias-aware dataflow analysis
❏ Provides the ability to perform static analysis
incrementally and store results in the graph
database
❏ Partial updates to the CPG when
source-code is updated
❏ Scales for large programs by leveraging a
graph database backend
22.
How does Plume work?
Plume is a Kotlin library divided into 3 parts
❏ Driver: connects to the database of choice
❏ Extractor: creates a CPG from bytecode
❏ Analyser: performs data-flow analysis on the CPG
The three parts represent the separation of concerns between the different
stages and requirements of the CPG-driven analysis pipeline.
Connect to Graph Database Extract Code Property Graph
Graph Icons from graph theory tree by Ecem Afacan from the Noun Project
Analyze Code Property Graph
.java
.py
.js
23.
How does Plume communicate?
Plume’s driver aims to be graph
database agnostic, so that all supported
graph databases can eventually be
benchmarked against each other in the
application of data-flow analysis.
The driver provides a generic interface
through which the extractor and
analyzer interact.
More graph databases will be
supported in the future.
<<interface>>
IDriver
+ exists(PlumeVertex): boolean
+ addVertex(PlumeVertex)
+ addEdge(PlumeVertex, PlumeVertex, EdgeType)
...
TinkerGraph JanusGraph TigerGraph Amazon Neptune
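The driver abstraction sketched on the slide might look as follows in Java (a simplification: Plume's real PlumeVertex and EdgeType carry more structure, and the concrete drivers talk to actual database backends such as TinkerGraph or JanusGraph):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified stand-ins for Plume's vertex and edge types (assumed shapes)
record PlumeVertex(String label) {}
enum EdgeType { AST, CFG, REF }
record PlumeEdge(PlumeVertex from, PlumeVertex to, EdgeType type) {}

// Generic driver interface, as sketched on the slide
interface IDriver {
    boolean exists(PlumeVertex v);
    void addVertex(PlumeVertex v);
    void addEdge(PlumeVertex from, PlumeVertex to, EdgeType type);
}

// In-memory driver standing in for a real database backend
class InMemoryDriver implements IDriver {
    private final Set<PlumeVertex> vertices = new HashSet<>();
    public final List<PlumeEdge> edges = new ArrayList<>();

    public boolean exists(PlumeVertex v) { return vertices.contains(v); }

    public void addVertex(PlumeVertex v) { vertices.add(v); }

    public void addEdge(PlumeVertex from, PlumeVertex to, EdgeType type) {
        addVertex(from);                 // ensure both endpoints exist
        addVertex(to);
        edges.add(new PlumeEdge(from, to, type));
    }
}
```

Because the extractor and analyzer see only IDriver, swapping backends is a matter of providing another implementation of the same interface.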
24.
Plume’s Extraction Process
❏ Soot is used to convert JVM bytecode to an IR called Jimple
❏ Jimple is based on three-address code and only uses 15 different
operations
❏ Jimple is then converted into Soot’s UnitGraph and CallGraph objects
❏ The extractor converts these two objects into a code property graph
❏ Plume supports compiling Python 2.7 and JavaScript 1.7 into JVM bytecode
using Jython and Mozilla Rhino respectively
Convert source code to class files
.java
.py
.js
.class .jimple
Extract Jimple and graphs using Soot
Graph Icons from graph theory tree by Ecem Afacan from the Noun Project
Store CPG in database
25.
Example
package intraprocedural.basic;

public class Basic1 {
    public static void main(String[] args) {
        int a = 3;
        int b = 2;
        int c = a + b;
    }
}
26.
Example
package intraprocedural.conditional;

public class Conditional1 {
    public static void main(String[] args) {
        int a = 1;
        int b = 2;
        if (a > b) {
            a -= b;
            b -= b;
        } else {
            b += a;
        }
    }
}
27.
What Plume can do
❏ Generate an intraprocedural
code property graph
❏ Connect to TinkerGraph,
JanusGraph, TigerGraph, and
Amazon Neptune
❏ Compile Java, Python 2.7 and
JavaScript 1.7 code
Plans for Plume
❏ Add interprocedural edges
❏ Include Neo4j
❏ Perform interprocedural
data-flow analysis algorithms
❏ Investigate soundness of
analysis for dynamic vs static
languages
❏ Investigate the use of GCNNs
for vulnerability detection
Plume Roadmap