This document discusses using a data model approach for pattern analysis. It begins with an agenda that includes defining a pattern analysis data model as an extension to an analytical star schema. It then provides examples of patterns that could be analyzed using different logical partitions of a dataset. Key points covered include defining global and partitioned patterns, logical partitions for pattern matching, and calculating different types of key performance indicators at the global, partition, and non-partition levels. The goal is to support flexible ad-hoc reporting and analysis of patterns and rules across datasets or subsets defined by attribute fields.
How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p... – Jean Ihm
2nd in the AskTOM Office Hours series on graph database technologies. https://devgym.oracle.com/pls/apex/dg/office_hours/3084
With property graphs in Oracle Database, you can perform powerful analysis on big data such as social networks, financial transactions, sensor networks, and more.
To use property graphs, first, you’ll need a graph model. For a new user, modeling and generating a suitable graph for an application domain can be a challenge. This month, we’ll describe key steps required to construct a meaningful graph, and offer a few tips on validating the generated graph.
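The modeling step described above (deciding which records become vertices and which relationships become edges) can be sketched in miniature. The account and transfer records below are hypothetical illustration data, not from the talk:

```python
# A minimal sketch of mapping relational-style records into a property graph:
# entities become vertices, foreign-key relationships become edges.

accounts = [
    {"id": 1, "owner": "Alice"},
    {"id": 2, "owner": "Bob"},
]
transfers = [
    {"from": 1, "to": 2, "amount": 250.0},
]

# Vertices keyed by id, each carrying its properties.
vertices = {a["id"]: {"label": "Account", "owner": a["owner"]} for a in accounts}

# Edges as (source, target, properties) triples.
edges = [(t["from"], t["to"], {"label": "TRANSFER", "amount": t["amount"]})
         for t in transfers]

print(vertices[1]["owner"])   # Alice
print(edges[0][2]["label"])   # TRANSFER
```

A quick validation pass (do all edge endpoints resolve to vertices?) is one of the sanity checks the session recommends for a generated graph.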
Albert Godfrind (EMEA Solutions Architect), Zhe Wu (Architect), and Jean Ihm (Product Manager) walk you through, and take your questions.
4th in the AskTOM Office Hours series on graph database technologies. https://devgym.oracle.com/pls/apex/dg/office_hours/3084
Learn how to visualize graphs – a powerful, intuitive way to interact with data. Using open source tools like Cytoscape or third party tools, you have several choices on how to visualize and interact with graphs from Oracle Database and big data platforms. Albert Godfrind (EMEA Solutions Architect) and Gabriela Montiel-Moreno (Software Development Manager) share all you need to get started, with detailed demos using a banking customer data set.
Oracle Spatial Studio: Fast and Easy Spatial Analytics and Maps – Jean Ihm
Learn about a new tool, Spatial Studio, that lets you quickly and easily do spatial analytics and create maps, even if you don't have GIS or Spatial knowledge. Now business users and non-GIS developers have a simple user interface to access the spatial features in Oracle Database.
Spatial Studio lets you prepare your data for spatial analysis, perform spatial analysis operations, and publish and share the results – as well as access spatial analysis results via REST and incorporate them into applications and workflows. Presented by Carol Palmer, Sr. Principal Product Manager, and David Lapp, Sr. Principal Product Manager, Oracle Spatial and Graph.
Presentation video including demo and resources available here: https://devgym.oracle.com/pls/apex/dg/office_hours/3084 .
5th in the AskTOM Office Hours series on graph database technologies. https://devgym.oracle.com/pls/apex/dg/office_hours/3084
PGQL: A Query Language for Graphs
Learn how to query graphs using PGQL, an expressive and intuitive graph query language that's a lot like SQL. With PGQL, it's easy to get going writing graph analysis queries to the database in a very short time. Albert and Oskar show what you can do with PGQL, and how to write and execute PGQL code.
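To give a flavor of the idea, the PGQL query in the comment below is a small illustrative example, and the toy Python evaluator beneath it mimics the query's pattern-matching semantics on an in-memory edge list (all names and data are made up):

```python
# Illustrative PGQL:
#
#   SELECT a.name
#   FROM MATCH (a:Person)-[:knows]->(b:Person)
#   WHERE b.name = 'Carol'

people = {1: "Alice", 2: "Bob", 3: "Carol"}
knows = [(1, 3), (2, 1)]  # (source, target) vertex-id pairs

# Find every person with a 'knows' edge pointing at Carol.
result = [people[src] for src, dst in knows if people[dst] == "Carol"]
print(result)  # ['Alice']
```

A real PGQL engine, of course, plans and executes such patterns over the stored graph rather than scanning Python lists; the point is only how closely the query text mirrors the SQL SELECT/FROM/WHERE shape.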
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ... – Spark Summit
This talk will cover the tools we used, the hurdles we faced, and the workarounds we developed with help from Databricks support in our attempt to build a custom machine learning model and use it to predict TV ratings for different networks and demographics.
The Apache Spark machine learning and dataframe APIs make it incredibly easy to produce a machine learning pipeline to solve an archetypal supervised learning problem. In our applications at Cadent, we face a challenge with high dimensional labels and relatively low dimensional features; at first pass such a problem is all but intractable but thanks to a large number of historical records and the tools available in Apache Spark, we were able to construct a multi-stage model capable of forecasting with sufficient accuracy to drive the business application.
Over the course of our work we came across many tools that made our lives easier, and others that forced workarounds. In this talk we will review our custom multi-stage methodology, review the challenges we faced, and walk through the key steps that made our project successful.
Build Knowledge Graphs with Oracle RDF to Extract More Value from Your Data – Jean Ihm
AnD Summit '19 slides - Souri Das, Matthew Perry, Melli Annamalai. This presentation covers knowledge graphs built using the RDF capabilities of Oracle Spatial and Graph. We will illustrate how to define a knowledge graph, create virtual or materialized graphs from existing data (relational tables, CSV files, etc.), derive new knowledge through logical inference, navigate and query graphs using W3C standards, analyze knowledge graphs with graph algorithms, and more. Real-world use cases from various industries will also be shared.
An Introduction to Graph: Database, Analytics, and Cloud Services – Jean Ihm
Graph analysis employs powerful algorithms to explore and discover relationships in social network, IoT, big data, and complex transaction data. Learn how graph technologies are used in applications such as fraud detection for banking, customer 360, public safety, and manufacturing. This session will provide an overview and demos of graph technologies for Oracle Cloud Services, Oracle Database, NoSQL, Spark and Hadoop, including PGX analytics and PGQL property graph query language.
Presented at Analytics and Data Summit, March 20, 2018
Introduction to Property Graph Features (AskTOM Office Hours part 1) – Jean Ihm
1st in the AskTOM Office Hours series on graph database technologies. https://devgym.oracle.com/pls/apex/dg/office_hours/3084
Xavier Lopez (PM Senior Director) and Zhe Wu (Graph Architect) will share a brief intro to what property graphs can do for you, and take your questions - on property graphs or any other aspect of Oracle Database Spatial and Graph features. With property graphs, you can analyze relationships in Big Data like social networks, financial transactions, or IoT sensor networks; identify influencers; discover patterns of fraudulent behavior; recommend products, and much more -- right inside Oracle Database.
3rd in the AskTOM Office Hours series on graph database technologies. https://devgym.oracle.com/pls/apex/dg/office_hours/3084
See the magic of graphs in this session. Graph analysis can answer questions like detecting patterns of fraud or identifying influential customers - and do it quickly and efficiently. We’ll show you the APIs for accessing graphs and running analytics such as finding influencers, communities, and anomalies, and how to use them from various languages including Groovy, Python, and JavaScript, with Jupyter and Zeppelin notebooks.
Albert Godfrind (EMEA Solutions Architect), Zhe Wu (Architect), and Jean Ihm (Product Manager) walk you through, and take your questions.
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra... – Databricks
This talk presents how we accelerated deep learning processing, from preprocessing to inference and training, on Apache Spark at SK Telecom. SK Telecom has roughly half of the Korean population as customers. To support them, we operate 400,000 cell towers, which generate logs with geospatial tags.
The slides give an overview of how Spark can be used to tackle Machine learning tasks, such as classification, regression, clustering, etc., at a Big Data scale.
Presented by David Taieb, Architect, IBM Cloud Data Services
Along with Spark Streaming, Spark SQL and GraphX, MLlib is one of the four key architectural components of Spark. It provides easy-to-use (even for beginners), powerful Machine Learning APIs that are designed to work in parallel using Spark RDDs. In this session, we’ll introduce the different algorithms available in MLlib, e.g. supervised learning with classification (binary and multi-class) and regression, but also unsupervised learning with clustering (K-means) and recommendation systems. We’ll conclude the presentation with a deep dive on a sample machine learning application built with Spark MLlib that predicts whether a scheduled flight will be delayed or not. This application trains a model using data from real flight information. The labeled flight data is combined with weather data from the “Insight for Weather” service available on the IBM Bluemix Cloud Platform to form the training, test and blind data. Even if you are not a black belt in machine learning, you will learn in this session how to leverage the powerful Machine Learning algorithms available in Spark to build interesting predictive and prescriptive applications.
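One of the algorithms mentioned above, K-means, is easy to illustrate with a toy single-machine version; MLlib's value is running the same assignment/update loop in parallel over RDDs. Everything below (the data, the function name) is a hypothetical sketch, not MLlib code:

```python
import random

def kmeans_1d(points, k, iters=20, seed=0):
    """Toy 1-D k-means: the same two-step loop MLlib parallelizes."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[idx].append(p)
        # Update step: each center moves to its cluster's mean.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

# Two obvious clumps around 1.0 and 9.5; the centers converge to their means.
print(kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 10.1], k=2))
```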
About the Speaker: For the last 4 years, David has been the lead architect for the Watson Core UI & Tooling team based in Littleton, Massachusetts. During that time, he led the design and development of a Unified Tooling Platform to support all the Watson Tools including accuracy analysis, test experiments, corpus ingestion, and training data generation. Before that, he was the lead architect for the Domino Server OSGi team responsible for integrating the eXpeditor J2EE Web Container in Domino and building first class APIs for the developer community. He started with IBM in 1996, working on various globalization technologies and products including Domino Global Workbench (used to develop multilingual Notes/Domino NSF applications) and a multilingual Content Management system for the Websphere Application Server. David enjoys sharing his experience by speaking at conferences. You’ll find him at various events like the Unicode conference, Eclipsecon, and Lotusphere. He’s also passionate about building tools that help improve developer productivity and overall experience.
In this webinar, Thomas Cook, Sales Director at AnzoGraph DB, provides a history lesson on the origins of SPARQL, including its roots in the Semantic Web and how linked open data is used to create knowledge graphs. He then dives into "What is RDF?", "What is a URI?" and "What is SPARQL?", wrapping up with a real-world demonstration via a Zeppelin notebook.
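The RDF and SPARQL ideas the webinar covers can be shown with a toy example. The triples and the commented query below are hypothetical, and a real SPARQL engine is of course far richer:

```python
# RDF models data as subject-predicate-object triples.
triples = [
    ("ex:Alice", "ex:worksFor", "ex:Acme"),
    ("ex:Bob",   "ex:worksFor", "ex:Acme"),
    ("ex:Acme",  "ex:locatedIn", "ex:Boston"),
]

# Roughly the pattern:  SELECT ?who WHERE { ?who ex:worksFor ex:Acme }
who = [s for s, p, o in triples if p == "ex:worksFor" and o == "ex:Acme"]
print(who)  # ['ex:Alice', 'ex:Bob']
```

In real RDF, `ex:` would expand to a full URI prefix; URIs are what let triples from different linked-data sources join into one knowledge graph.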
Large Scale Fuzzy Name Matching with a Custom ML Pipeline in Batch and Stream... – Databricks
ING bank is a Dutch multinational, multi-product bank that offers banking services to 33 million retail and commercial customers in over 40 countries. At this scale, ING naturally faces a multitude of data consolidation tasks across its disparate sources. A common consolidation problem is fuzzy name matching: given a name (streaming) or a list of names (batch), find out the most similar name(s) from a different list.
Popular methods such as Levenshtein distance are not appropriate because of their time complexity and the sheer volume of names involved. In this talk, we will introduce how we use a custom Spark ML pipeline and Structured Streaming to build fuzzy name matching products in batch and streaming. This approach can match 8,000 names per second against a 10-million-name list using a ten-node cluster. First, we will give an introduction to the name matching problem.
Second, we will explain why the Levenshtein distance approach is limited and demonstrate a faster one: token-based cosine similarity matching. Next, we will show how an ML pipeline helps build an elegant solution. Here, we will dive into the details of each stage, including customized preprocessing, tokenization, term frequency, customized inverse document frequency, customized cosine similarity with distributed sparse matrix multiplication, and a customized supervision stage.
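The core token-based cosine similarity idea can be sketched with raw token counts. The real pipeline adds IDF weighting and distributed sparse matrix multiplication, and the names below are made up:

```python
import math
from collections import Counter

def tokens(name):
    """Lowercase word tokens; production pipelines often use character n-grams."""
    return name.lower().split()

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    va, vb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

candidates = ["John A. Smith", "Jon Smit", "Maria Garcia"]
query = "John Smith"
best = max(candidates, key=lambda c: cosine(query, c))
print(best)  # John A. Smith
```

Unlike Levenshtein distance, which compares the query against each candidate character by character, the vectorized form lets one sparse matrix product score a query against millions of names at once.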
Finally, we will show how we deploy the ML pipeline within a batch data pipeline, and additionally as a fuzzy search engine in a streaming manner. The main conclusions are: (1) a custom Spark ML pipeline provides a powerful way to handle complicated data science problems, and (2) a uniform ML pipeline can easily serve both batch and streaming products from the same codebase.
Building A Hybrid Warehouse: Efficient Joins between Data Stored in HDFS and ... – Yuanyuan Tian
With the advent of big data, the enterprise analytics landscape has changed dramatically. HDFS has become an important data repository for business analytics. Enterprises use various big data technologies to process data and drive actionable insights. HDFS serves as the storage where distributed processing frameworks such as Hadoop and Spark access and operate on large volumes of data. At the same time, enterprise data warehouses (EDWs) continue to support critical business analytics. EDWs are usually shared-nothing parallel databases that support complex SQL processing, updates, and transactions. As a result, they manage up-to-date data and support business analytics tools such as reporting and dashboards. A new generation of applications has emerged, requiring access to and correlation of data stored in both HDFS and EDWs. This has created the need for a special federation between Hadoop-like big data platforms and EDWs, which we call the hybrid warehouse. In this talk, we identify the best hybrid warehouse architecture by studying various algorithms for joining database and HDFS tables.
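The join algorithms such systems study build on the classic hash join: hash the smaller (database) side, then stream the larger (HDFS) side through it. Below is a generic sketch with hypothetical tables, not the talk's actual algorithm:

```python
db_rows = [(1, "Alice"), (2, "Bob")]          # (customer_id, name), small EDW side
hdfs_rows = [(1, 30.0), (1, 12.5), (3, 9.9)]  # (customer_id, amount), large HDFS side

# Build phase: index the small side by its join key.
build = {}
for cid, name in db_rows:
    build.setdefault(cid, []).append(name)

# Probe phase: stream the big side, emitting a row per match.
joined = [(cid, name, amount)
          for cid, amount in hdfs_rows
          for name in build.get(cid, [])]
print(joined)  # [(1, 'Alice', 30.0), (1, 'Alice', 12.5)]
```

In a hybrid warehouse, the interesting question is where the build and probe phases run and which side gets shipped across the network, which is exactly the trade-off the studied algorithms explore.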
Data Science, Statistical Analysis and R... Learn what those mean, how they can help you find answers to your questions and complement the existing toolsets and processes you are currently using to make sense of data. We will explore R and the RStudio development environment, installing and using R packages, basic and essential data structures and data types, plotting graphics, manipulating data frames and how to connect R and SQL Server.
https://www.eventbrite.com/e/talk-by-paco-nathan-graph-analytics-in-spark-tickets-17173189472
Big Brains meetup hosted by BloomReach, 2015-06-04
Case study / demo of a large-scale graph analytics project, leveraging GraphX in Apache Spark to surface insights about open source developer communities — based on data mining of their email forums. The project works with any Apache email archive, applying NLP and machine learning techniques to analyze message threads, then constructs a large graph. Graph analytics, based on concise Scala coding examples in Spark, surface themes and interactions within the community. Results are used as feedback for respective developer communities, such as leaderboards, etc. As an example, we will examine analysis of the Spark developer community itself.
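The influence analysis described above can be miniaturized as a reply-graph leaderboard via in-degree; GraphX computes analogous measures at scale. The reply edges below are hypothetical:

```python
from collections import Counter

# (author, replied_to_author) edges mined from a mailing-list archive.
replies = [("ann", "bob"), ("cat", "bob"), ("bob", "ann"), ("dan", "bob")]

# In-degree (replies received) is a simple proxy for influence.
received = Counter(dst for _, dst in replies)
leaderboard = received.most_common()
print(leaderboard)  # [('bob', 3), ('ann', 1)]
```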
The next-generation enterprise-class architecture – Massimo Brignoli – Data Driven Innovation
The rise of data lakes: companies today are drowning in data, and the classic data warehouse struggles to process it, both in volume and in variety. Many have started looking at architectures called data lakes, with Hadoop as the reference technology. But is this solution right for everything? Come learn how to operationalize data lakes to build modern data management architectures.
What Is SAS | SAS Tutorial For Beginners | SAS Training | SAS Programming | E... – Edureka!
This Edureka "What Is SAS" tutorial will help you get started with SAS. This tutorial will also introduce you to Data Analytics and SAS Programming concepts. Below are the topics covered in this tutorial:
1. Data Analytics
2. Data Analytical Tools
3. Why SAS?
4. What Is SAS?
5. SAS Framework
6. SAS Programming Concepts
7. SAS Applications
ASHviz - Data visualization research experiments using ASH data – John Beresniewicz
RMOUG Training Days 2020 abstract:
The Active Session History (ASH) mechanism is a rich source of fine-grained data about database activity, and is the lynchpin for many database performance management features in the Diagnostic and Tuning packs. Many interesting stories about happenings in the database are buried in ASH waiting to be revealed, and data visualization is key to sifting these out from the high dimensionality and volume of ASH data. The session will cover a number of data visualization experiments conducted using a single ASH dump with an emphasis on the iterative process of discovering useful data visualizations.
Deploying Enterprise Scale Deep Learning in Actuarial Modeling at Nationwide – Databricks
The traditional approach to insurance pricing involves fitting a generalized linear model (GLM) to data collected on historical claims payments and premiums received. The explosive growth in data availability and increasing competitiveness in the marketplace are challenging actuaries to find new insights in their data and make predictions with more granularity, improved speed and efficiency, and with tighter integration among business units to support strategic decisions.
In this session we will share our experience implementing deep hierarchical neural networks using TensorFlow and PySpark on Databricks. We will discuss the benefits of the ML Runtime, our experience using the goofys mount, our process for hyperparameter tuning, specific considerations for the large dataset size and extreme volatility present in insurance data, among other topics.
Authors: Bryn Clark, Krish Rajaram
Apache CarbonData+Spark to realize data convergence and Unified high performa... – Tech Triveni
Challenges in Data Analytics:
Different application scenarios need different storage solutions: HBase is ideal for point-query scenarios but unsuitable for multi-dimensional queries. MPP is suitable for data warehouse scenarios, but engine and data are coupled together, which hampers scalability. OLAP stores used in BI applications perform best for aggregate queries, but full-scan queries perform sub-optimally; moreover, they are not suitable for real-time analysis. These distinct systems lead to low resource sharing and need separate pipelines for data and application management.
Large corporations have to master vast amounts of heterogeneous data in order to stay competitive. While existing approaches have attempted to consolidate and manage the data by forcing it into a single shared data model, data lakes have recently emerged that instead provide a central storage point for holding all data sets in their original form.
In this talk, we present eccenca CorporateMemory, which extends the data lake paradigm with a semantic integration layer for managing diverse, but semantically enriched data. eccenca CorporateMemory builds an extensible knowledge graph that employs RDF vocabularies for transforming and linking multiple datasets in order to generate an integrated semantic understanding of the data.
Robert Isele | Head of Data Integration Unit at eccenca GmbH
Presentation at Semantics 2016 in Leipzig in the context of the results of the LEDS project
Learn how graph technologies can be applied to real-world use cases, using medical, network security, and financial data. By combining graph models and machine learning techniques, we can discover relationships, classify information, and identify patterns and anomalies in data. We can answer questions such as “How did other investigators approach similar cases?” and “Do these symptoms seem similar to ones we’ve seen in other diseases?” Presented by Sungpack Hong, Research Director, Oracle Labs.
RISELab: Enabling Intelligent Real-Time Decisions keynote by Ion StoicaSpark Summit
A long-standing grand challenge in computing is to enable machines to act autonomously and intelligently: to rapidly and repeatedly take appropriate actions based on information in the world around them. To address this challenge, at UC Berkeley we are starting a new five year effort that focuses on the development of data-intensive systems that provide Real-Time Intelligence with Secure Execution (RISE). Following in the footsteps of AMPLab, RISELab is an interdisciplinary effort bringing together researchers across AI, robotics, security, and data systems. In this talk I’ll present our research vision and then discuss some of the applications that will be enabled by RISE technologies.
RISELab:Enabling Intelligent Real-Time DecisionsJen Aman
Spark Summit East Keynote by Ion Stoica
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...DataStax
Element Fleet has the largest benchmark database in our industry and we needed a robust and linearly scalable platform to turn this data into actionable insights for our customers. The platform needed to support advanced analytics, streaming data sets, and traditional business intelligence use cases.
In this presentation, we will discuss how we built a single, unified platform for both Advanced Analytics and traditional Business Intelligence using Cassandra on DSE. With Cassandra as our foundation, we are able to plug in the appropriate technology to meet varied use cases. The platform we've built supports real-time streaming (Spark Streaming/Kafka), batch and streaming analytics (PySpark, Spark Streaming), and traditional BI/data warehousing (C*/FiloDB). In this talk, we are going to explore the entire tech stack and the challenges we faced trying to support the above use cases. We will specifically discuss how we ingest and analyze IoT data (vehicle telematics) in real-time and batch, combine data from multiple data sources into a single data model, and support standardized and ad-hoc reporting requirements.
About the Speaker
Jim Peregord Vice President - Analytics, Business Intelligence, Data Management, Element Corp.
Benchmark Showdown: Which Relational Database is the Fastest on AWS?Clustrix
Do you have a high-value, high throughput application running on AWS? Are you moving part or all of your infrastructure to AWS? Do you have a high-transaction workload that is only expected to grow as your company grows? Choosing the right database for your move to AWS can make you a hero or a goat. Be a hero!
Databases are the mission-critical lifeline of most businesses. For years MySQL has been the easy choice -- but the popularity of the cloud and new products like Aurora, RDS MySQL and ClustrixDB have given customers choices and options that can help them work smarter and more efficiently.
Enterprise Strategy Group (ESG) presents their findings from a recent performance benchmark test configured for high-transaction, low-latency workloads running on AWS.
In this webinar, you will learn:
How high-transaction, high-value database workloads perform when run on three popular database solutions running on AWS.
How key metrics like transactions per second (tps) and database response time (latency) can affect performance and customer satisfaction.
How the ability to scale both database reads and writes is the key to unlocking performance on AWS
Insights into Real World Data Management ChallengesDataWorks Summit
Data is your most valuable business asset, and it's also your biggest challenge. This challenge and opportunity means we continually face significant roadblocks toward becoming a data-driven organisation. From the management of data and the proliferation of open source frameworks to limited industry skills and mounting time and cost pressures, our challenge in data is big.
We all want and need a "fit for purpose" approach to the management of data, especially Big Data, and overcoming the ongoing challenges around the '3Vs' means we get to focus on the most important V: 'Value'. Come along and join the discussion on how Oracle Big Data Cloud provides value in the management of data and supports your move toward becoming a data-driven organisation.
Speaker
Noble Raveendran, Principal Consultant, Oracle
[Sirius Day Eindhoven 2018] ASML's MDE Going SiriusObeo
Talk done by Wilbert Alberts (ASML) at Sirius Day Eindhoven:
ASML is the world's leading provider of lithography systems for the semiconductor industry. Such systems are controlled by more than 20 million lines of code. To improve the efficiency and quality of its software development process, ASML is using, amongst others, model-driven-engineering and associated tools and techniques.
Recently, subsystems are being developed according to an architecture pattern that separates Data, Control and Algorithms (DCA). To support this pattern, the ASML software architecture group is working towards a SW development environment (ASOME). This environment consists of a set of modeling languages and associated editors that allow specification of (sub)systems according to this DCA pattern. Furthermore, it contains model-to-model transformations to (COTS) analysis tools (e.g. model checkers) and model-to-text transformations to generate (parts of) the implementation.
In this presentation, I will briefly introduce ASML and the kind of (software) systems that we develop. Some aspects of the DCA architectural pattern, the languages that we are developing, and the associated Sirius-based editors will be presented. For the Data part, a DSL and editor have been developed allowing the definition of various kinds of datatypes, from which various kinds of repositories can be generated supporting clone-based or reference-based data, modifiable and read-only entities, etc. In order to support the Control aspect, a language and editor have been defined that allow specification of interfaces and their realization based on state machines. A system editor allows decomposition of a system into subsystems while allowing delegation of incoming requests to internal parts. The editors are mostly Sirius-based graphical editors, where the created models are persisted textually using Xtext.
The presentation will focus on sharing some of our experiences with both the development and deployment of products based on Sirius technology. Building the ASOME environment imposes many challenges and I would like to conclude with some that specifically target the development of the front ends of this environment.
Data Warehousing, Data Mining, Data Marts, Data Cube, OLAP Operations, Introduction to Common Messaging System, Web Tier Deployment, Application Servers & Clustered Deployment, IBM Notes and IBM Domino
From Pipelines to Refineries: scaling big data applications with Tim HunterDatabricks
Big data tools are challenging to combine into a larger application: ironically, big data applications themselves do not tend to scale very well. These issues of integration and data management are only magnified by increasingly large volumes of data. Apache Spark provides strong building blocks for batch processes, streams and ad-hoc interactive analysis. However, users face challenges when putting together a single coherent pipeline that could involve hundreds of transformation steps, especially when confronted by the need of rapid iterations. This talk explores these issues through the lens of functional programming. It presents an experimental framework that provides full-pipeline guarantees by introducing more laziness to Apache Spark. This framework allows transformations to be seamlessly composed and alleviates common issues, thanks to whole program checks, auto-caching, and aggressive computation parallelization and reuse.
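The experimental framework from this talk is not shown here, but its core idea, composing transformations lazily and auto-caching results so a repeated pipeline is computed only once, can be sketched in toy form. All names below are invented for illustration and are not the talk's actual API:

```python
from functools import reduce

class LazyPipeline:
    """Toy illustration of lazy, composable transformations with
    auto-caching: nothing executes until run(), and each distinct
    (input, steps) pair is computed only once."""
    _cache = {}

    def __init__(self, steps=()):
        self.steps = tuple(steps)

    def map(self, fn):
        # Composition just records the step; no work happens yet.
        return LazyPipeline(self.steps + (fn,))

    def run(self, data):
        key = (tuple(data), self.steps)
        if key not in LazyPipeline._cache:  # auto-caching on first run
            LazyPipeline._cache[key] = reduce(
                lambda acc, fn: [fn(x) for x in acc], self.steps, list(data))
        return LazyPipeline._cache[key]

p = LazyPipeline().map(lambda x: x + 1).map(lambda x: x * 2)
print(p.run([1, 2, 3]))  # [4, 6, 8]
```

Spark's own RDDs and DataFrames are already lazy in this sense; the talk's contribution is extending that laziness across the whole pipeline to enable whole-program checks and computation reuse.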
This introduction to graph databases is specifically designed for Enterprise Architects who need to map business requirements to architectural components like graph databases. It explains how and why graphs matter for Enterprise Architecture and reviews the architectural differences between relational and graph models.
Similar to AnDSummit2020 Session Pattern Analysis Data Model
Opendatabay - Open Data Marketplace.pptxOpendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. Marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
Adjusting primitives for graph: SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms like PageRank commonly operate on Compressed Sparse Row (CSR), an adjacency-list based graph representation.
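As a quick illustration of the CSR layout these notes refer to (the experiments themselves target C++/OpenMP/CUDA; this Python sketch shows only the data structure):

```python
def edges_to_csr(num_vertices, edges):
    """Build a CSR (offsets, targets) pair from a directed edge list.
    targets[offsets[v]:offsets[v+1]] are the out-neighbors of vertex v."""
    degree = [0] * num_vertices
    for src, _ in edges:
        degree[src] += 1
    # Prefix-sum the degrees to get per-vertex offsets.
    offsets = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        offsets[v + 1] = offsets[v] + degree[v]
    # Scatter edge targets into their vertex's slot.
    targets = [0] * len(edges)
    cursor = offsets[:-1].copy()  # next write position per vertex
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets

offsets, targets = edges_to_csr(4, [(0, 1), (0, 2), (1, 2), (3, 0)])
print(offsets)  # [0, 2, 3, 3, 4]
print(targets)  # [1, 2, 2, 0]
```

Two flat arrays like this give contiguous, cache-friendly neighbor scans, which is why CSR is the usual starting point for the map/reduce primitive benchmarks listed below.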
Multiply with different modes (map)
1. Performance of sequential vs OpenMP-based vector multiply.
2. Comparing various launch configs for CUDA-based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential vs OpenMP-based vector element sum.
2. Performance of memcpy-based vs in-place CUDA vector element sum.
3. Comparing various launch configs for CUDA-based vector element sum (memcpy).
4. Comparing various launch configs for CUDA-based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA-based vector element sum (in-place).