This document discusses using social media data, such as tweets, to predict commodity futures prices. It analyzes over 5 years of tweets related to corn, soybeans, and wheat to determine whether textual and sentiment analysis can provide a signal for predicting futures prices. The analysis was done using the Pivotal Greenplum Database and open source tools such as MADlib for machine learning and a Twitter part-of-speech tagger. The analysis found that tweet sentiment was negatively correlated with commodity prices and that a blended model combining text, sentiment, and other factors such as weather produced encouraging results for predicting commodity futures.
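As a rough illustration of the signal test described above, here is a minimal pandas sketch that correlates daily aggregate tweet sentiment with futures prices. The file and column names are invented for illustration; the original analysis ran in-database on Greenplum.

```python
# A rough pandas sketch of the signal test described above: correlate daily
# aggregate tweet sentiment with corn futures prices. File and column names
# are invented; the original analysis ran in-database on Greenplum.
import pandas as pd

sentiment = pd.read_csv("corn_tweet_sentiment.csv", parse_dates=["date"])
prices = pd.read_csv("corn_futures_close.csv", parse_dates=["date"])
daily = sentiment.merge(prices, on="date", how="inner")

# Contemporaneous correlation; a negative value matches the reported finding.
print(daily["mean_sentiment"].corr(daily["close"]))

# Correlation with next-day returns is closer to a true predictive signal.
daily["next_day_return"] = daily["close"].pct_change().shift(-1)
print(daily["mean_sentiment"].corr(daily["next_day_return"]))
```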
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ... (Srivatsan Ramanujam)
Unstructured data is everywhere - in the form of posts, status updates, bloglets, or news feeds in social media, or in the form of customer interactions in call center CRM systems. While many organizations study and monitor social media for tracking brand value and targeting specific customer segments, in our experience, blending unstructured data with structured data to supplement data science models has been far more effective than working with either independently.
In this talk we will showcase an end-to-end topic and sentiment analysis pipeline we've built on the Pivotal Greenplum Database platform for Twitter feeds from GNIP, using open source tools like MADlib and PL/Python. We've used this pipeline to build regression models to predict commodity futures from tweets and to enhance churn models for telecom through topic and sentiment analysis of call center transcripts. All of this was possible because of the flexibility and extensibility of the platform we worked with.
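For a flavor of the sentiment stage of such a pipeline, here is a toy lexicon-based scorer showing the shape of the per-tweet step; in the pipeline above this logic would live in a PL/Python UDF and run in parallel inside the database. The word lists are placeholders, not the model from the talk.

```python
# A toy lexicon-based scorer showing the shape of the per-tweet sentiment
# step; in the pipeline above this logic would live in a PL/Python UDF and
# run in parallel inside Greenplum. Word lists are placeholders.
POSITIVE = {"good", "great", "strong", "bullish", "up"}
NEGATIVE = {"bad", "poor", "weak", "bearish", "down", "drought"}

def sentiment_score(tweet):
    """Return a score in [-1, 1] from matched positive/negative tokens."""
    tokens = tweet.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    matched = pos + neg
    return 0.0 if matched == 0 else float(pos - neg) / matched

print(sentiment_score("Drought fears and a weak corn crop this year"))  # -1.0
```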
Pivotal: Data Scientists on the Front Line: Examples of Data Science in Action (EMC)
A close-up examination of how data science is being used today to drive company and sector-level transformations. Reviewing architecture, business goals, data science methodology and tool-usage, and the path to operationalization. Multiple case-studies reveal how Data Science has delivered lasting value to the organization, while also paving the way for data to become a new source of competitive differentiation.
After this session you will be able to:
Objective 1: Learn how companies can become predictive rather than reacting to the past.
Objective 2: Understand why companies that employ data science strategies will be able to develop a competitive advantage.
Objective 3: Understand how companies can get started on their journey to become data-driven organizations.
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn... (Srivatsan Ramanujam)
These are slides from my talk @ DataDay Texas, in Austin on 30 Mar 2013
(http://2013.datadaytexas.com/schedule)
Favorite and Fork PyMADlib on GitHub: https://github.com/gopivotal/pymadlib
MADlib: http://madlib.net
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal (Srivatsan Ramanujam)
These slides give an overview of the technology and the tools used by Data Scientists at Pivotal Data Labs. This includes Procedural Languages like PL/Python, PL/R, PL/Java, PL/Perl and the parallel, in-database machine learning library MADlib. The slides also highlight the power and flexibility of the Pivotal platform from embracing open source libraries in Python, R or Java to using new computing paradigms such as Spark on Pivotal HD.
Pivotal OSS meetup - MADlib and PivotalR (go-pivotal)
With the explosion of big data, the need for fast and inexpensive analytics solutions has become a key basis of competition in many industries. Extracting the value of big data with analytics can be complex, and requires advanced skills.
At Pivotal, we are building open-source solutions (MADlib, PivotalR, PyMadlib) to simplify this process for the user, while maintaining the efficiency necessary for big data analysis.
This talk will provide information about MADlib, an open source library of SQL-based algorithms for machine learning, data mining and statistics that run at large scale within a database engine, with no need for data import/export to other tools.
It provides an overview of the library’s architecture and compares various statistical methods with those available in Apache Mahout.
We also introduce PivotalR, an R-based wrapper for MADlib that gives data scientists and programmers access to the power of MADlib with the ease of use of R.
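Here is a minimal sketch of what the "no import/export" workflow looks like from Python. The table and column names (tweet_features, next_day_price, mean_sentiment, tweet_volume) are hypothetical; madlib.linregr_train is MADlib's linear regression entry point, and training runs entirely inside the database.

```python
# A sketch of MADlib's in-database linear regression called from Python.
# Table and column names are hypothetical; training runs inside the
# database, so no data is exported to the client.
import psycopg2

conn = psycopg2.connect("dbname=analytics")  # hypothetical database
with conn, conn.cursor() as cur:
    cur.execute("DROP TABLE IF EXISTS tweet_price_model, tweet_price_model_summary")
    cur.execute("""
        SELECT madlib.linregr_train(
            'tweet_features',                         -- source table
            'tweet_price_model',                      -- output model table
            'next_day_price',                         -- dependent variable
            'ARRAY[1, mean_sentiment, tweet_volume]'  -- intercept + features
        )
    """)
    cur.execute("SELECT coef, r2 FROM tweet_price_model")
    print(cur.fetchone())
```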
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc... (Sarah Aerni)
Slides from the Pivotal Open Source Hub Meetup
"Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science!"
As the need for data science as a key differentiator grows in all industries, from large corporations to startups, the need to get to results quickly is enabled by sharing ideas and methods in the community. The data science team at Pivotal leverages and contributes to this community of publicly available and open source technologies as part of their practice. We will share the resources we use by highlighting specific toolkits for building models (e.g. MADlib, R) and visualization (e.g. Gephi and Circos) along with their benefits and limitations by sharing examples from Pivotal's data science engagements. At the end of this session we hope to have answered the questions: Where can I get started with Data Science? Which toolkit is most appropriate for building a model with my dataset? How can I visualize my results to have the greatest impact?
Bio: Sarah Aerni is a member of the Pivotal Data Science team with a focus on healthcare and life science. She has a background in the field of Bioinformatics, developing tools to help biomedical researchers understand their data. She holds a B.S. in Biology with a specialization in Bioinformatics and a minor in French Literature from UCSD, and an M.S. and Ph.D. in Biomedical Informatics from Stanford University. During her time as a researcher she focused on the interface between machine learning and biology, building computational models enabling research for a broad range of fields in biomedicine. She also co-founded a start-up providing informatics services to researchers and small companies. At Pivotal she works with customers in life science and healthcare building models to derive insight and business value from their data.
In this paper we introduce the MADlib project, including the background that led to its beginnings and the motivation for its open source nature. We provide an overview of the library's architecture and design patterns, and describe various statistical methods in that context.
Sangchul Song and Thu Kyaw discuss machine learning at AOL, and the challenges and solutions they encountered when trying to train a large number of machine learning models using Hadoop. Algorithms including SVM and packages like Mahout are discussed. Finally, they discuss their analytics pipeline, which includes some custom components used to interoperate with a range of machine learning libraries, as well as integration with the query language Pig.
Graph Databases and Machine Learning | November 2018 (TigerGraph)
Graph Databases and Machine Learning: Finding a Happy Marriage. Graph databases and machine learning both represent powerful tools for getting more value from data; learn how they can form a harmonious marriage to up-level machine learning.
Full Webinar: https://info.tigergraph.com/graph-gurus-21
In this Graph Gurus episode, we:
-Explain the architecture and technical implementation for a TigerGraph + Spark graph-enhanced Machine Learning pipeline
-Use TigerGraph both before training to extract (graph and non-graph) features and after training to apply the model on streaming data
-Use Spark to train and tune machine learning models at scale (a PySpark sketch follows this list)
-Present a solution in production at China Mobile that detects and prevents phone-based scams using machine learning with TigerGraph
-Demo the data flow between Spark and TigerGraph via TigerGraph's JDBC driver
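A minimal PySpark sketch of the "train and tune at scale" step referenced in the list above. The parquet path, feature columns, and is_scam label are hypothetical graph-derived features, not China Mobile's actual schema.

```python
# A PySpark sketch of training and tuning a classifier on graph-derived
# features. Path, columns, and label are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.appName("graph-ml-sketch").getOrCreate()
df = spark.read.parquet("graph_features.parquet")  # features exported from the graph

assembler = VectorAssembler(
    inputCols=["pagerank", "degree", "community_size"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="is_scam")

# Cross-validated hyperparameter search over the regularization strength.
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
cv = CrossValidator(
    estimator=lr,
    estimatorParamMaps=grid,
    evaluator=BinaryClassificationEvaluator(labelCol="is_scam"),
    numFolds=3)
model = cv.fit(assembler.transform(df))
```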
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio... (MLconf)
Why Machine Learning Algorithms Fall Short (And What You Can Do About It): Many think that machine learning is all about the algorithms. Want a self-learning system? Get your data, start coding, or hire a PhD who will build you a model that will stand the test of time. Of course we know that this is not enough. Models degrade over time, algorithms that work great on yesterday's data may not be the best option, and new data sources and types are made available. In short, your self-learning system may not be learning anything at all. In this session, we will examine how to overcome challenges in creating self-learning systems that perform better and are built to stand the test of time. We will show how to apply mathematical optimization algorithms that often prove superior to the local optimization methods favored by typical machine learning applications and discuss why these methods can create better results. We will also examine the role of smart automation in the context of machine learning and how smart automation can create self-learning systems that are built to last.
Kaz Sato, Evangelist, Google at MLconf ATL 2016 (MLconf)
Machine Intelligence at Google Scale: TensorFlow and Cloud Machine Learning: The biggest challenge of deep learning technology is scalability. As long as you are using a single GPU server, you have to wait hours or days to get the results of your work. This doesn't scale for a production service, so you eventually need distributed training in the cloud. Google has been building infrastructure for training large-scale neural networks in the cloud for years, and has now started to share that technology with external developers. In this session, we will introduce new pre-trained ML services such as the Cloud Vision API and Speech API that work without any training. We will also look at how TensorFlow and Cloud Machine Learning can accelerate custom model training by 10x to 40x with Google's distributed training infrastructure.
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016 (MLconf)
Big Data Processing Above and Beyond Hadoop: Data-intensive computing represents a new computing paradigm to address Big Data processing requirements using high-performance architectures supporting scalable parallel processing to allow government, commercial organizations, and research environments to process massive amounts of data and implement new applications previously thought to be impractical or infeasible. The fundamental challenges of data-intensive computing are managing and processing exponentially growing data volumes, significantly reducing associated data analysis cycles to support practical, timely applications, and developing new algorithms which can scale to search and process massive amounts of data. The open source HPCC (High-Performance Computing Cluster) Systems platform offers a unified approach to Big Data processing requirements: (1) a scalable, integrated computer systems hardware and software architecture designed for parallel processing of data-intensive computing applications, and (2) a new programming paradigm in the form of a high-level, declarative, data-centric programming language designed specifically for big data processing. This presentation explores the challenges of data-intensive computing from a programming perspective, and describes the ECL programming language and the HPCC architecture designed for data-intensive computing applications. HPCC is an alternative to the Hadoop platform, and ECL is compared to Pig Latin, a high-level language developed for the Hadoop MapReduce architecture.
Graph Gurus Episode 25: Unleash the Business Value of Your Data Lake with Gra... (TigerGraph)
Full Webinar: https://info.tigergraph.com/graph-gurus-25
A new weapon is available for businesses wanting to accomplish more with Hadoop: native parallel graphs can reveal the connections across multiple domains and datasets in data lakes and provide powerful insights to deliver superior outcomes. In this webinar we will explain how native parallel graphs can analyze the information in data lakes to enable the following outcomes:
-Recommending next best actions such as promoting a student loan to someone heading off to college, advocating life insurance to a newly married couple, and so on
-Improving network utilization by analyzing petabytes of data collected from millions of IoT devices across a smart grid
-Accelerating M&A activity by intelligently merging data lakes from multiple businesses
Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ... (Databricks)
A long time ago, there was Caffe and Theano, then came Torch and CNTK and TensorFlow, Keras and MXNet and PyTorch and Caffe2… a sea of deep learning tools, but none for Spark developers to dip into. Finally, there was BigDL, a deep learning library for Apache Spark. While BigDL is integrated into Spark and extends its capabilities to address the challenges of Big Data developers, will a library alone be enough to simplify and accelerate the deployment of ML/DL workloads on production clusters? From high-level pipeline API support to feature transformers to pre-defined models and reference use cases, a rich repository of easy-to-use tools is now available with the 'Analytics Zoo'. We'll unpack the production challenges and opportunities with ML/DL on Spark and what the Zoo can do.
Data Science at Scale on MPP databases - Use Cases & Open Source Tools (Esther Vasiete)
Pivotal workshop slide deck for Structure Data 2016 held in San Francisco.
Abstract:
Learn how data scientists at Pivotal build machine learning models at massive scale on open source MPP databases like Greenplum and HAWQ (under Apache incubation) using in-database machine learning libraries like MADlib (under Apache incubation) and procedural languages like PL/Python and PL/R to take full advantage of the rich set of libraries in the open source community. This workshop will walk you through use cases in text analytics and image processing on MPP.
Data Science Amsterdam - Massively Parallel Processing with Procedural Languages (Ian Huston)
The goal of in-database analytics is to bring the calculations to the data, reducing transport costs and I/O bottlenecks. With Procedural Languages such as PL/Python and PL/R data parallel queries can be run across terabytes of data using not only pure SQL but also familiar Python and R packages. The Pivotal Data Science team have used this technique to create fraud behaviour models for each individual user in a large corporate network, to understand interception rates at customs checkpoints by accelerating natural language processing of package descriptions and to reduce customer churn by building a sentiment model using customer call centre records.
http://www.meetup.com/Data-Science-Amsterdam/events/178974942/
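Here is a minimal sketch of the bring-compute-to-data pattern described above: a PL/Python function using numpy is registered once, then applied in-database so only results travel to the client. The connection string, table, and data are assumptions for illustration.

```python
# A sketch of registering and calling a PL/Python function so numpy runs
# inside the database. Connection string and data are assumed.
import psycopg2

DDL = """
CREATE OR REPLACE FUNCTION zscore(vals float8[])
RETURNS float8[] AS $$
    import numpy as np
    a = np.array(vals, dtype=float)
    sd = a.std()
    return ((a - a.mean()) / sd).tolist() if sd > 0 else [0.0] * len(a)
$$ LANGUAGE plpythonu;
"""

conn = psycopg2.connect("dbname=analytics")  # hypothetical database
with conn, conn.cursor() as cur:
    cur.execute(DDL)
    # The standardization happens in-database; only results reach the client.
    cur.execute("SELECT zscore(ARRAY[1, 2, 3, 4, 100]::float8[])")
    print(cur.fetchone()[0])
```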
Objective of the Project
Tweet sentiment analysis gives businesses insights into customers and competitors. In this project, we combined several text preprocessing techniques with machine learning algorithms. Neural Network, Random Forest, and Logistic Regression models were trained on the Sentiment140 Twitter dataset, and we then predicted the sentiment of a held-out test set of tweets. We used both Python and PySpark (with a local Spark context) to program different parts of the preprocessing and modelling.
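A minimal non-Spark sketch of one modeling path described above, pairing TF-IDF features with logistic regression. The column layout follows the public Sentiment140 CSV (polarity first, tweet text last); the file path and preprocessing choices are illustrative.

```python
# A sketch of TF-IDF + logistic regression on Sentiment140-style data.
# The path and settings are illustrative.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

cols = ["polarity", "id", "date", "query", "user", "text"]
df = pd.read_csv("sentiment140.csv", names=cols, encoding="latin-1")

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["polarity"], test_size=0.2, random_state=0)

vec = TfidfVectorizer(lowercase=True, stop_words="english", max_features=50000)
clf = LogisticRegression(max_iter=1000)
clf.fit(vec.fit_transform(X_train), y_train)
print(accuracy_score(y_test, clf.predict(vec.transform(X_test))))
```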
Graph Gurus Episode 37: Modeling for Kaggle COVID-19 Dataset (TigerGraph)
Full Webinar: https://info.tigergraph.com/graph-gurus-37
In this Graph Gurus Episode, we:
-Learn how to process text and extract entities (words and phrases) as well as classes linking the entities using SciSpacy, a Natural Language Processing (NLP) tool.
-Import the output of NLP and semantically link it in TigerGraph
-Run advanced analytics queries with TigerGraph to analyze the relationships and deliver insights
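A minimal sketch of the NLP extraction step, assuming scispacy and its en_core_sci_sm model are installed; the sample sentence is made up. Each extracted entity would become a vertex in TigerGraph, with co-occurrence links as the semantic edges mentioned above.

```python
# A sketch of the SciSpacy extraction step. Assumes the en_core_sci_sm
# model is installed; the sample sentence is illustrative.
import spacy

nlp = spacy.load("en_core_sci_sm")
doc = nlp("Chloroquine inhibits SARS-CoV-2 replication in Vero E6 cells.")

# Each entity can become a vertex in TigerGraph; co-occurrence within a
# sentence or paper can become the semantic links mentioned above.
for ent in doc.ents:
    print(ent.text, ent.label_)
```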
Details regarding the workings of ChatGPT and basic use cases can be found in this presentation. The presentation also covers other OpenAI products and their usability, along with ways in which ChatGPT can be integrated into existing apps and websites.
Kamanja: Driving Business Value through Real-Time Decisioning Solutions (Greg Makowski)
This is a first presentation of Kamanja, a new open-source real-time software product, which integrates with other big-data systems. See also links: http://www.meetup.com/SF-Bay-ACM/events/223615901/ and http://Kamanja.org to download, for docs or community support. For the YouTube video, see https://www.youtube.com/watch?v=g9d87rvcSNk (you may want to start at minute 33).
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be... (PyData)
The Python data ecosystem has grown beyond the confines of single machines to embrace scalability. Here we describe one of our approaches to scaling, which is already being used in production systems. The goal of in-database analytics is to bring the calculations to the data, reducing transport costs and I/O bottlenecks. Using PL/Python we can run parallel queries across terabytes of data using not only pure SQL but also familiar PyData packages such as scikit-learn and nltk. This approach can also be used with PL/R to make use of a wide variety of R packages. We look at examples on Postgres compatible systems such as the Greenplum Database and on Hadoop through Pivotal HAWQ. We will also introduce MADlib, Pivotal’s open source library for scalable in-database machine learning, which uses Python to glue SQL queries to low level C++ functions and is also usable through the PyMADlib package.
In the last few years, deep learning has achieved significant success in a wide range of domains, including computer vision, artificial intelligence, speech, NLP, and reinforcement learning. However, deep learning in recommender systems has, until recently, received relatively little attention. This talk explores recent advances in this area in both research and practice. I will explain how deep learning can be applied to recommendation settings, architectures for handling contextual data, side information, and time-based models.
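A minimal Keras sketch of the embedding-based recommender idea: user and item embeddings combined by a dot product to predict a rating. All sizes and the random training data are placeholders, not the architectures from the talk.

```python
# A Keras sketch of a dot-product embedding recommender. Sizes and the
# random training data are placeholders.
import numpy as np
import tensorflow as tf

n_users, n_items, dim = 1000, 500, 32

user_in = tf.keras.Input(shape=(1,), dtype="int32")
item_in = tf.keras.Input(shape=(1,), dtype="int32")
u = tf.keras.layers.Flatten()(tf.keras.layers.Embedding(n_users, dim)(user_in))
v = tf.keras.layers.Flatten()(tf.keras.layers.Embedding(n_items, dim)(item_in))
score = tf.keras.layers.Dot(axes=1)([u, v])  # affinity between user and item

model = tf.keras.Model([user_in, item_in], score)
model.compile(optimizer="adam", loss="mse")

users = np.random.randint(0, n_users, size=(1024, 1))
items = np.random.randint(0, n_items, size=(1024, 1))
ratings = np.random.rand(1024, 1) * 5  # placeholder ratings
model.fit([users, items], ratings, epochs=2, verbose=0)
```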
Big Data and BI Tools - BI Reporting for Bay Area Startups User Group (Scott Mitchell)
This presentation was given at the July 8th, 2014 user group meeting for BI Reporting for Bay Area Startups.
Content creation: Infocepts/DWApplications
Presented by: Scott Mitchell - DWApplications
Implementing a highly scalable stock prediction system with R, Geode, SpringX... (William Markito Oliveira)
Finance market prediction has always been one of the hottest topics in Data Science and Machine Learning. However, the prediction algorithm is just a small piece of the puzzle. Building a data stream pipeline that constantly combines the latest price info with high-volume historical data is extremely challenging on traditional platforms, requiring a lot of code and thinking about how to scale or move into the cloud. This session walks through the architecture and implementation details of an application built on top of open-source tools that demonstrates how to easily build a stock prediction solution with no source code, except a few lines of R and the web interface that consumes data through a RESTful endpoint in real time. The solution leverages in-memory data grid technology for high-speed ingestion, combining streaming of real-time data and distributed processing for stock indicator algorithms.
ChatGPT is a state-of-the-art language model developed by OpenAI, based on the GPT-3 architecture. It's designed to generate human-like text responses, making it incredibly versatile and useful for various applications.
Using Graph Algorithms for Advanced Analytics - Part 2 Centrality (TigerGraph)
What does finding the best location for a warehouse/office/retail store have in common with finding the most influential person in a referral network? Answer: they are both Centrality problems and can be solved with graph algorithms.
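A minimal networkx sketch of both framings on a toy graph: closeness centrality answers the "best location" question, and PageRank the "most influential person" question. The karate club graph stands in for a real road or referral network.

```python
# A networkx sketch of two centrality framings on a toy graph.
import networkx as nx

G = nx.karate_club_graph()  # stand-in for a road or referral network

best_location = max(nx.closeness_centrality(G).items(), key=lambda kv: kv[1])
most_influential = max(nx.pagerank(G).items(), key=lambda kv: kv[1])

print("most central node (closeness):", best_location)
print("most influential node (PageRank):", most_influential)
```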
Graph Gurus Episode 29: Using Graph Algorithms for Advanced Analytics Part 3 (TigerGraph)
Full Webinar: https://info.tigergraph.com/graph-gurus-29
In this webinar you will:
-Hear about use cases for community detection graph algorithms
-Learn how to select the right algorithm for your use case
-Be able to run and tailor GSQL graph algorithms
Deploying Enterprise Scale Deep Learning in Actuarial Modeling at Nationwide (Databricks)
The traditional approach to insurance pricing involves fitting a generalized linear model (GLM) to data collected on historical claims payments and premiums received. The explosive growth in data availability and increasing competitiveness in the marketplace are challenging actuaries to find new insights in their data and make predictions with more granularity, improved speed and efficiency, and with tighter integration among business units to support strategic decisions.
In this session we will share our experience implementing deep hierarchical neural networks using TensorFlow and PySpark on Databricks. We will discuss the benefits of the ML Runtime, our experience using the goofys mount, our process for hyperparameter tuning, specific considerations for the large dataset size and extreme volatility present in insurance data, among other topics.
Authors: Bryn Clark, Krish Rajaram
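A minimal Keras sketch of a feed-forward network for tabular insurance features, in the spirit of the GLM replacement described above. The layer sizes, Poisson loss, and random placeholder data are illustrative assumptions, not Nationwide's model.

```python
# A Keras sketch of a feed-forward network for tabular insurance features.
# Layer sizes, loss, and placeholder data are illustrative.
import numpy as np
import tensorflow as tf

n_features = 20
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(n_features,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="exponential"),  # keeps predictions positive
])
# Poisson/Tweedie-style losses are common for claims; Poisson is shown here.
model.compile(optimizer="adam", loss="poisson")

X = np.random.rand(256, n_features).astype("float32")  # placeholder features
y = np.random.rand(256).astype("float32") * 1000.0     # placeholder claim amounts
model.fit(X, y, epochs=2, batch_size=32, verbose=0)
```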
Lambda architecture for real time big data (Trieu Nguyen)
Lambda Architecture in Real-Time Big Data Projects
Concepts & Techniques: "Thinking with Lambda"
Case studies from real projects
Why is Lambda Architecture the correct solution for big data?
Similar to Analyzing Power of Tweets in Predicting Commodity Futures (20)
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to reduce the work per iteration, and the other is to reduce the number of iterations; these goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, i.e. those with the same in-links, helps reduce duplicate computations and thus could also reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes can be easily calculated; this could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
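A minimal sketch of the first optimization named above: skipping rank updates for vertices that have already converged, on top of plain power-iteration PageRank. The graph and tolerance are illustrative, the graph is assumed to have no dangling nodes, and the skip is a heuristic since a frozen vertex's in-neighbors may still drift slightly.

```python
# A sketch of power-iteration PageRank that skips updates for vertices
# whose rank has already converged. Assumes no dangling nodes.
def pagerank_skip_converged(graph, d=0.85, tol=1e-8, max_iter=100):
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    out_deg = {v: len(nbrs) for v, nbrs in graph.items()}
    converged = set()
    for _ in range(max_iter):
        # Contributions still flow from converged vertices at their frozen rank.
        contrib = {v: 0.0 for v in graph}
        for v, nbrs in graph.items():
            for w in nbrs:
                contrib[w] += rank[v] / out_deg[v]
        changed = False
        for v in graph:
            if v in converged:
                continue  # skip work for already-converged vertices
            new = (1 - d) / n + d * contrib[v]
            if abs(new - rank[v]) < tol:
                converged.add(v)
            else:
                changed = True
            rank[v] = new
        if not changed:
            break
    return rank

g = {0: [1], 1: [2], 2: [0], 3: [0, 2]}  # toy graph
print(pagerank_skip_converged(g))
```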
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ... (Subhajit Sahu)
Abstract: Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition: the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.