AWS-hosted back-testing framework for algorithmic trading. The objective is to analyze a given data set of equities containing price-volume data and to quickly develop medium- to high-frequency equity trading strategies via cloud computing. A simple web interface will be constructed allowing users to control and tweak trading parameter settings and to see in-sample and out-of-sample trading performance (i.e., hypothetical profit and loss, trading costs, turnover, portfolio risk, and Sharpe ratio).
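The performance metrics listed above can be computed directly from a strategy's return series. A minimal sketch of the Sharpe ratio, assuming daily returns and 252 trading periods per year (the function name and sample data are illustrative, not part of the framework):

```python
import math

def sharpe_ratio(daily_returns, risk_free_rate=0.0, periods_per_year=252):
    """Annualized Sharpe ratio of a series of periodic returns."""
    excess = [r - risk_free_rate / periods_per_year for r in daily_returns]
    mean = sum(excess) / len(excess)
    # sample variance (n - 1 denominator), then annualize by sqrt(periods)
    var = sum((r - mean) ** 2 for r in excess) / (len(excess) - 1)
    return (mean / math.sqrt(var)) * math.sqrt(periods_per_year)

# hypothetical daily P&L expressed as fractional returns
returns = [0.001, -0.002, 0.003, 0.0005, 0.0012]
annualized_sharpe = sharpe_ratio(returns)
```

A real back-tester would also net out the trading costs and turnover mentioned above before computing the ratio.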
This report details the findings, implementation, and research method of the team's algorithmic trading model, developed using machine learning frameworks on AWS cloud computing infrastructure.
This document describes a web application for analyzing building energy management data using predictive modeling and machine learning techniques. The application contains years of sensor data from a CUNY building and allows users to visualize the data, perform statistical analysis, and generate forecasts using Python modules. Key features include interactive data visualization, filtering and selecting subsets of data, defining expressions of sensor variables, and applying machine learning models for prediction. The application provides a customizable platform for exploring time series data while allowing different users to share their work.
Implementation of query optimization for reducing run time (Alexander Decker)
This document discusses query optimization techniques to improve performance. It proposes performing query optimization at compile-time using histograms of data statistics rather than at run-time. Histograms are used to estimate selectivity of query joins and predicates at compile-time, allowing a query plan to be constructed in advance and executed without run-time optimization. The technique uses a split and merge algorithm to incrementally maintain histograms as data changes. Selectivity estimation with histograms allows join and predicate ordering to be determined at compile-time for query plan generation. Experimental results showed this compile-time optimization approach improved runtime performance over traditional run-time optimization.
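The paper's split-and-merge maintenance algorithm is not reproduced here; this sketch only shows the core idea of how an equi-width histogram yields a compile-time selectivity estimate under a uniform-within-bucket assumption (bucket layout and function names are illustrative):

```python
def build_histogram(values, num_buckets):
    """Equi-width histogram: a list of [low, high, count] buckets."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets or 1
    buckets = [[lo + i * width, lo + (i + 1) * width, 0] for i in range(num_buckets)]
    for v in values:
        i = min(int((v - lo) / width), num_buckets - 1)
        buckets[i][2] += 1
    return buckets

def estimate_selectivity(buckets, threshold):
    """Estimated fraction of rows satisfying `value < threshold`,
    assuming values are uniformly spread within each bucket."""
    total = sum(count for _, _, count in buckets)
    covered = 0.0
    for lo, hi, count in buckets:
        if hi <= threshold:
            covered += count            # bucket entirely below threshold
        elif lo < threshold:
            covered += count * (threshold - lo) / (hi - lo)  # partial bucket
    return covered / total
```

An optimizer can compare such estimates across candidate join and predicate orders at compile time, without touching the data at run time.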
This document describes a parallel implementation of the Apriori algorithm for frequent itemset mining using MapReduce. The key steps are: (1) The Apriori algorithm is broken down into independent mapping tasks to count candidate itemset occurrences in parallel; (2) A MapReduce job is used for each iteration where the map function counts occurrences and the reduce function sums the counts; (3) Experimental results on real datasets show the approach achieves good speedup, scaleup, and can efficiently process large datasets in a distributed manner.
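The map/reduce candidate-counting step in (1)–(2) can be sketched in plain Python; the MapReduce framework plumbing is elided and the function names and transactions are illustrative:

```python
from collections import defaultdict
from itertools import combinations

def map_count(transaction, candidates):
    """Map: emit (candidate, 1) for each candidate itemset contained
    in this transaction. Transactions can be processed independently."""
    t = set(transaction)
    return [(c, 1) for c in candidates if set(c) <= t]

def reduce_sum(pairs):
    """Reduce: sum the counts emitted for each candidate itemset."""
    counts = defaultdict(int)
    for itemset, n in pairs:
        counts[itemset] += n
    return dict(counts)

transactions = [["a", "b", "c"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
candidates = [c for c in combinations("abc", 2)]
emitted = [pair for t in transactions for pair in map_count(t, candidates)]
support = reduce_sum(emitted)
```

In a real cluster, each map task would handle a shard of the transaction file and the shuffle phase would group emitted pairs by itemset before the reducers sum them.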
Performance Analysis of Hashing Methods on the Employment of App (IJECEIAES)
Administrative processes carried out continuously produce large amounts of data, so searching it takes a long time. Hashing methods can speed up the search: hashing accesses data in a table directly by transforming a key into an address in that table. The performance analysis of the hashing methods was carried out on 18-digit character key values, using applications in which the methods had been implemented. The hashing algorithms analyzed are progressive overflow (PO) and linear quotient (LQ). The main purpose of the analysis is to measure how well each method performs. The results showed that, for the average collision count with 15 keys, 53.3% of the analyses yielded the same value for both methods, while 46.7% showed that linear quotient has better performance.
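The two collision-resolution schemes can be sketched as insertion routines that report how many probes a key needed; PO probes the next slot linearly, while LQ (a form of double hashing) steps by an increment derived from the key's quotient. The table size and keys below are illustrative, not the paper's 18-digit data:

```python
def insert_po(table, key):
    """Progressive overflow: on collision, try the next slot.
    Returns the number of collisions before an empty slot was found."""
    m = len(table)
    i, collisions = key % m, 0
    while table[i] is not None:
        collisions += 1
        i = (i + 1) % m
    table[i] = key
    return collisions

def insert_lq(table, key):
    """Linear quotient: on collision, step by an increment derived from
    the key's quotient. Table size should be prime so probes cycle fully."""
    m = len(table)
    i, collisions = key % m, 0
    step = (key // m) % m or 1
    while table[i] is not None:
        collisions += 1
        i = (i + step) % m
    table[i] = key
    return collisions
```

On keys that all hash to the same home slot, LQ tends to scatter the overflow and accumulate fewer collisions than PO, which is the pattern the abstract reports.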
The document describes developing a model to predict house prices using deep learning techniques. It proposes using a dataset with house features without labels and applying regression algorithms like K-nearest neighbors, support vector machine, and artificial neural networks. The models are trained and tested on split data, with the artificial neural network achieving the lowest mean absolute percentage error of 18.3%, indicating it is the most accurate model for predicting house prices based on the data.
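A minimal sketch of one of the candidate models, K-nearest-neighbour regression, together with the MAPE metric the comparison rests on (feature shapes and data are illustrative, not the paper's dataset):

```python
def knn_predict(train_X, train_y, query, k=3):
    """Predict a value as the mean target of the k nearest training
    points under squared Euclidean distance."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), y)
        for x, y in zip(train_X, train_y)
    )
    nearest = [y for _, y in dists[:k]]
    return sum(nearest) / k

def mape(actual, predicted):
    """Mean absolute percentage error, the metric used to rank models."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)
```

The same train/test split and MAPE comparison would be applied to the SVM and neural-network models to reproduce the ranking the abstract describes.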
This document describes research on using a Long Short-Term Memory (LSTM) neural network model to predict bitcoin prices over the next 5 days. It discusses collecting bitcoin price data from 2015-2021, cleaning the data, and using features like date, price, high, low to train and test the LSTM model. Lag plots show the data has positive correlation at daily intervals. The model is trained on recent data and tested on past data to predict future prices. Root mean square error is calculated between predicted and actual test prices. The model accurately predicts future prices but could be improved by adding more price-influencing features to the training data.
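The data preparation and evaluation steps around the LSTM (the network itself requires a framework such as TensorFlow/Keras) can be sketched as follows; the lookback length and series are illustrative:

```python
import math

def make_windows(series, lookback):
    """Turn a price series into (window, next_value) supervised pairs,
    the input shape an LSTM is trained on."""
    X, y = [], []
    for i in range(len(series) - lookback):
        X.append(series[i:i + lookback])
        y.append(series[i + lookback])
    return X, y

def rmse(actual, predicted):
    """Root mean square error between actual and predicted prices."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))
```

Each window of `lookback` past prices becomes one training example whose label is the next price; RMSE over the held-out windows is the error figure the abstract refers to.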
IRJET- Stock Market Prediction using Machine Learning (IRJET Journal)
This document discusses using machine learning techniques to predict stock market movements. Specifically, it uses a Support Vector Machine (SVM) algorithm with a Radial Basis Function (RBF) kernel to predict stock prices. It describes collecting stock price data, selecting features like price volatility and momentum, training the SVM model on historical data, and generating predictions of future stock prices. The results show the SVM model was able to accurately predict the movements of IBM stock prices based on historical data.
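In practice `sklearn.svm.SVC(kernel="rbf")` would be the tool for this; the kernel itself, which scores similarity between feature vectors such as volatility and momentum, is short enough to sketch directly (the `gamma` value is an illustrative choice):

```python
import math

def rbf_kernel(x, y, gamma=0.1):
    """RBF kernel: exp(-gamma * ||x - y||^2). Values near 1 mean the two
    feature vectors (e.g. volatility, momentum) are very similar."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)
```

The SVM's decision function is a weighted sum of these kernel evaluations against the support vectors learned from the historical data.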
This document summarizes a report on analyzing a stock prediction model using neural networks. The report presents a model that predicts stock prices by extracting stock data, dividing it into training and validation sets, and feeding it into a neural network. Experimental results showed the model could accurately predict stock prices after training on 90% of the data, but predictions on the remaining 10% of data sometimes differed from actual prices. The model allows users to choose different stock attributes or time periods for analysis and prediction.
IRJET- Classification of Pattern Storage System and Analysis of Online Shoppi... (IRJET Journal)
This document discusses classifying patterns from online shopping data using data mining techniques. It proposes using the Apriori algorithm to mine frequent patterns from transaction data stored in a data warehouse. Patterns mined from the data warehouse using Apriori would then be stored in a pattern warehouse. This would allow users to view product details and related patterns when browsing items online. The system aims to efficiently analyze large amounts of user data to discover useful patterns for improving the online shopping experience.
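The frequent-pattern mining step can be sketched as a compact Apriori in plain Python (candidate pruning is simplified; the transactions and threshold are illustrative):

```python
def apriori(transactions, min_support):
    """Return every itemset appearing in at least min_support transactions,
    mapped to its support count."""
    transactions = [set(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    frequent, k = {}, 1
    current = [frozenset([i]) for i in items]
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        k += 1
        # next-level candidates: unions of surviving itemsets of size k
        current = list({a | b for a in survivors for b in survivors
                        if len(a | b) == k})
    return frequent
```

Each level only extends itemsets that already met the support threshold, which is the Apriori pruning property; the resulting itemsets would populate the pattern warehouse described above.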
IRJET - Stock Market Analysis and Prediction (IRJET Journal)
This document discusses using machine learning algorithms to analyze stock market data and predict future stock prices. It proposes collecting historical stock price and Twitter sentiment data and using recurrent neural networks and long short-term memory models to analyze the data and generate predictions and visualizations. The models would allow investors to make informed decisions about buying and selling stocks to potentially achieve returns on their investments.
This document contains notes from a data structures and algorithms class. It defines data structures as organized ways to store and access data efficiently. Good data structures allow for easy retrieval, access, and manipulation of data. Different types of data structures are discussed, including static vs dynamic, linear vs non-linear, and homogeneous structures. Common operations on data like insertion, deletion, searching and sorting are also defined. The document then defines algorithms as step-by-step instructions to solve problems, and discusses analyzing algorithms to determine their time and space complexity as input size increases. Asymptotic notation for describing how complexity grows is also introduced.
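The complexity analysis in the notes can be made concrete by comparing linear and binary search, the classic O(n) versus O(log n) pair (the data set is illustrative):

```python
def linear_search(items, target):
    """O(n): in the worst case every element is examined."""
    for i, v in enumerate(items):
        if v == target:
            return i
    return -1

def binary_search(items, target):
    """O(log n): the sorted input is halved at each step."""
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid
        if items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1
```

Doubling the input adds roughly one extra step to binary search but doubles the worst case for linear search, which is exactly what the asymptotic notation in the notes expresses.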
IRJET- Empower Syntactic Exploration Based on Conceptual Graph using Searchab... (IRJET Journal)
This document discusses a proposed system for empowering syntactic exploration based on conceptual graphs using searchable symmetric encryption. It begins with an abstract that outlines using conceptual graphs and related natural language processing techniques to perform semantic search over encrypted cloud data. It then describes the system modules, including data owners who can upload and authorize access to encrypted files, data users who can search for files, and a cloud server that stores the outsourced encrypted data and indexes. Key algorithms discussed include named entity recognition, term frequency-inverse document frequency (TF-IDF) calculation, data encryption standard (DES) encryption, and hashed message authentication codes (HMACs) to identify duplicate documents. The proposed system architecture involves data owners encrypting and outsourcing documents to the cloud server.
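Of the key algorithms listed, the TF-IDF calculation is compact enough to sketch directly (the tokenized documents are illustrative; a real index would be built over the plaintext before encryption):

```python
import math

def tf_idf(docs):
    """Per-document term scores: high when a term is frequent in the
    document but rare across the corpus."""
    n = len(docs)
    df = {}  # document frequency of each term
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    scores = []
    for doc in docs:
        tf = {t: doc.count(t) / len(doc) for t in set(doc)}
        scores.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return scores
```

Terms that appear in every document score zero, so the ranking naturally favours discriminative keywords when building the searchable index.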
This document presents research on using machine learning and deep learning models to predict stock prices. The researcher collected stock price data for two companies, Tata Steel and Hero Motocorp, at 5-minute intervals over two years. Eight classification and eight regression models were tested on this data to predict opening stock prices. The results showed that deep learning models like LSTM outperformed other regression models, while ANN performed best among classification models on average. Analyzing individual errors in the best-performing LSTM model is identified as a potential next step.
The document describes a data science project conducted on streaming log data from Cloudera Movies, an online streaming video service. The goals of the project were to understand which user accounts are used most by younger viewers, segment user sessions to improve site usability, and build a recommendation engine. Key steps included exploring and cleaning the data, classifying users as children or adults using a SimRank approach, clustering user sessions to identify behavior patterns, and predicting user ratings through user-user and item-item similarity models to build a recommendation system. Accuracy of 99.64% was achieved in classifying users.
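The item-item similarity step of the recommendation engine can be sketched with cosine similarity over rating vectors (item names and ratings are illustrative, not the Cloudera Movies data):

```python
import math

def cosine(u, v):
    """Cosine similarity between two rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def predict_rating(target_item, user_ratings, item_vectors):
    """Item-item prediction: similarity-weighted average of the user's
    ratings on the other items they have rated."""
    num = den = 0.0
    for item, rating in user_ratings.items():
        sim = cosine(item_vectors[target_item], item_vectors[item])
        num += sim * rating
        den += abs(sim)
    return num / den if den else 0.0
```

The user-user variant is symmetric: similarities are computed between users' rating vectors instead of items', and the weighted average runs over similar users' ratings of the target item.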
The hidden engineering behind machine learning products at Helixa (Alluxio, Inc.)
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
The hidden engineering behind machine learning products at Helixa
Gianmario Spacagna, (Helixa)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Creating a scalable & cost efficient BI infrastructure for a startup in the A... (vcrisan)
Presentation for Bucharest Big Data Meetup - October 14th 2021
How we created an efficient BI solution that can easily be used by a startup, using the AWS cloud environment. Using Python we can easily import, process, and store data in Amazon S3 from different data sources including RabbitMQ, BigQuery, MySQL, etc. From there, taking advantage of the power of Dremio as a query engine and the scalability of S3, you can quickly create beautiful dashboards in Tableau to kickstart a data journey in a startup.
IRJET- Text based Deep Learning for Stock Prediction (IRJET Journal)
This document presents a proposed system for text-based deep learning stock prediction and interpretation. It discusses using a neural network model to extract relevant predictive factors from news texts and financial tweets that influence stock prices. An interactive visualization interface is explored to effectively communicate the interpreted model predictions to users. The system aims to help with stock market investment and analysis tasks. It evaluates the approach on two case studies predicting stock prices from online news and tweets, finding the proposed neural network architecture outperforms other models.
IRJET- Towards Efficient Framework for Semantic Query Search Engine in Large-... (IRJET Journal)
The document proposes a new framework for efficient semantic search in large datasets. It aims to improve understanding of short texts by enriching them with concepts and related terms from a probabilistic knowledge base. A deep learning model using stacked autoencoders is designed to learn features from the enriched short texts and encode them into binary codes, allowing similarity searches. Experiments show the new approach captures semantics better than existing methods and enables applications like short text retrieval and classification.
IRJET- Foster Hashtag from Image and Text (IRJET Journal)
This document proposes a hybrid model for generating hashtags from images and text by analyzing both the content and trending topics. It discusses using object detection and named entity recognition techniques like Single Shot Multibox Detector and bidirectional LSTM to extract information from images and text. This extracted information is then used to automatically propose relevant hashtags that could help promote and tag the content. The intended users are social media users who could benefit from suggested hashtags to better share and organize their posts.
Improving Association Rule Mining by Defining a Novel Data Structure (IRJET Journal)
This document proposes a new data structure to improve the performance of association rule mining algorithms on large datasets. It involves compressing the original data in three steps: shuffling records based on similarity, creating an inverted index to map attribute values to transactions, and applying run length encoding. This compressed data structure reduces data size, loading time, and memory requirements compared to the original data. Experimental results on two datasets show the proposed structure improves the speed of an association rule mining algorithm while maintaining data validity. The structure is designed to work with existing algorithms without changing them.
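Two of the three compression steps, the inverted index and run-length encoding, can be sketched as follows (the similarity-based record shuffle is omitted; data is illustrative):

```python
def run_length_encode(values):
    """Compress a sequence into (value, run_length) pairs. Works best
    after similar records have been shuffled together."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [tuple(r) for r in runs]

def inverted_index(records):
    """Map each attribute value to the ids of transactions containing it,
    so a mining algorithm can look up occurrences without a full scan."""
    index = {}
    for tid, record in enumerate(records):
        for value in record:
            index.setdefault(value, []).append(tid)
    return index
```

Because the transaction id lists come out sorted, they themselves compress well under run-length or delta encoding, which is where the size reduction comes from.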
Sachin Jain worked on a project to develop a Window Management Framework to add windowing capabilities to an existing Data Stream Management System called STORM. He served in various roles such as client liaison, requirements manager, project manager, and developer. As developer, he designed and developed the core functionality of the framework, including the emitting logic to read data from disk and populate buffers, as well as send end of window markers. Project management involved using planning poker to estimate tasks and tracking progress with earned-value charts.
IRJET- Key-Aggregate Cryptosystem for Scalable Data Sharing in Cloud Storage (IRJET Journal)
This document proposes a key-aggregate searchable encryption (KASE) scheme to enable scalable and flexible data sharing in cloud storage. KASE allows a data owner to generate a single aggregate encryption key that can be used by authorized users to search over any number of encrypted files, avoiding the need to distribute multiple keys. The scheme consists of seven algorithms for system setup, key generation, encryption, key extraction, trapdoor generation, trapdoor adjustment, and trapdoor testing. It provides a secure and efficient way for flexible data access delegation in cloud storage.
This document summarizes a research paper that proposes methods to improve the efficiency of ranked keyword search over encrypted cloud data. It discusses using order preserving symmetric encryption to encrypt cloud data for security. It proposes using both keyword search and concept-based search to retrieve more relevant results. It describes calculating relevance scores based on term and document frequency to rank results. The paper evaluates the proposed methods and finds they significantly improve search efficiency over existing techniques through experimental analysis.
Intelligent Systems Project: Bike sharing service modeling (Alessio Villardita)
A university project on data analysis and modeling based on a bike sharing system data set. Models developed: MLP and RBF fitting, Fuzzy Inference System, ANFIS and time series forecasting.
This document describes a proposed intelligent supermarket system that uses a distributed implementation of the Apriori algorithm (called Dipriori) to analyze transaction data from multiple supermarket locations and identify frequent item sets and association rules. The system would have client machines at each location running Apriori locally on their transaction data and sending the frequent item sets to a central global server. The server would calculate average support counts and identify globally frequent item sets. Association rules would then be generated and used to provide insights into customer purchasing behaviors and inform targeted marketing strategies. The authors implemented this approach and found it helped improve the time efficiency of analyzing large, distributed transaction datasets.
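The global aggregation step of the server can be sketched as follows. The abstract says the server averages local support counts; this sketch instead sums the raw local counts against the combined transaction total, which amounts to the same global support ratio when each site reports raw counts (function names and threshold are illustrative):

```python
def merge_local_supports(local_counts, local_sizes, min_support_ratio):
    """Combine per-site itemset counts into globally frequent itemsets.
    local_counts: one {itemset: count} dict per client site.
    local_sizes:  transaction count at each site, same order."""
    total_txns = sum(local_sizes)
    global_counts = {}
    for counts in local_counts:
        for itemset, n in counts.items():
            global_counts[itemset] = global_counts.get(itemset, 0) + n
    # keep itemsets whose combined support meets the global threshold
    return {s: n for s, n in global_counts.items()
            if n / total_txns >= min_support_ratio}
```

Each client only ships its locally frequent itemsets and their counts, so the network cost is small compared to shipping raw transactions.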
IRJET- Sentimental Analysis for Online Reviews using Machine Learning Algorithms (IRJET Journal)
The document discusses sentiment analysis of online product reviews using machine learning algorithms. It first provides background on sentiment analysis and its uses. It then describes preprocessing customer review data and extracting features using count and TF-IDF vectorization. Three machine learning algorithms are tested - support vector machine (SVM), random forest, and XGBoost classifier. The results show that XGBoost achieved higher accuracy than SVM and random forest for sentiment classification of the product review data.
The Ipsos - AI - Monitor 2024 Report.pdf (Social Samosa)
According to Ipsos AI Monitor's 2024 report, 65% of Indians said that products and services using AI have profoundly changed their daily lives in the past 3-5 years.
Similar to Algorithmic Trading Deutsche Borse Public Dataset (20)
The Ipsos - AI - Monitor 2024 Report.pdfSocial Samosa
According to Ipsos AI Monitor's 2024 report, 65% Indians said that products and services using AI have profoundly changed their daily life in the past 3-5 years.
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Kaxil Naik
Navigating today's data landscape isn't just about managing workflows; it's about strategically propelling your business forward. Apache Airflow has stood out as the benchmark in this arena, driving data orchestration forward since its early days. As we dive into the complexities of our current data-rich environment, where the sheer volume of information and its timely, accurate processing are crucial for AI and ML applications, the role of Airflow has never been more critical.
In my journey as the Senior Engineering Director and a pivotal member of Apache Airflow's Project Management Committee (PMC), I've witnessed Airflow transform data handling, making agility and insight the norm in an ever-evolving digital space. At Astronomer, our collaboration with leading AI & ML teams worldwide has not only tested but also proven Airflow's mettle in delivering data reliably and efficiently—data that now powers not just insights but core business functions.
This session is a deep dive into the essence of Airflow's success. We'll trace its evolution from a budding project to the backbone of data orchestration it is today, constantly adapting to meet the next wave of data challenges, including those brought on by Generative AI. It's this forward-thinking adaptability that keeps Airflow at the forefront of innovation, ready for whatever comes next.
The ever-growing demands of AI and ML applications have ushered in an era where sophisticated data management isn't a luxury—it's a necessity. Airflow's innate flexibility and scalability are what makes it indispensable in managing the intricate workflows of today, especially those involving Large Language Models (LLMs).
This talk isn't just a rundown of Airflow's features; it's about harnessing these capabilities to turn your data workflows into a strategic asset. Together, we'll explore how Airflow remains at the cutting edge of data orchestration, ensuring your organization is not just keeping pace but setting the pace in a data-driven future.
Session in https://budapestdata.hu/2024/04/kaxil-naik-astronomer-io/ | https://dataml24.sessionize.com/session/667627
Learn SQL from basic queries to Advance queriesmanishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
Global Situational Awareness of A.I. and where its headedvikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be un-leashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
Analysis insight about a Flyball dog competition team's performanceroli9797
Insight of my analysis about a Flyball dog competition team's last year performance. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
Algorithmic Trading using the Deutsche Börse Public Dataset
Marjan Ahmed
Dept of Computer Science
University of Illinois
at Urbana-Champaign
Vancouver, BC, Canada
marjana2@illinois.edu
Fan Yang
Dept of Computer Science
University of Illinois
at Urbana-Champaign
Palo Alto, CA, United States
fanyang3@illinois.edu
Dilruba Hawk
Dept of Computer Science
University of Illinois
at Urbana-Champaign
New York, NY, United States
dilruba2@illinois.edu
Wang Chun Wei
Dept of Computer Science
University of Illinois
at Urbana-Champaign
Brisbane, QLD, Australia
wcwei2@illinois.edu
Abstract—This report details the findings, implementation and research method of the team's algorithmic trading model, developed using machine learning frameworks on AWS cloud computing infrastructure. Github project website: https://github.com/weiwangchun/cs498cca
I. INTRODUCTION
We intend to build a cloud hosted backtesting framework for
algorithmic trading. The objective is to analyze a given data set
of equities containing price-volume data and develop medium
to high frequency strategies for trading equities quickly via
cloud computing. A simple web interface will be constructed
allowing users to control and tweak trading parameter settings,
and see in-sample and out-of-sample trading performance (i.e.,
hypothetical profit and loss, trading costs, turnover, portfolio
risk and Sharpe ratio).
Given the limitations of our dataset, we focus on using a
feature set that includes (i) stock returns, (ii) implied stock
volatility, (iii) order imbalance, (iv) trading volumes and (v)
number of trades to predict future stock prices. This information is all publicly available; hence, if the German market were at least weakly efficient, trading profitably using this subset of information would be difficult. For a given
stock at a given time, we compute its most recent feature
set, its long run average feature set and also its feature set
percentile compared to its historical feature sets (i.e., how
does it compare to history) and lastly compute its feature set
percentile compared to its peers’ feature sets (i.e., how does it
compare to its peers). We attempt to use both a simple logistic
regression model and a neural network to predict future stock
price.
Our results are mildly successful, but after accounting for transaction costs, we conclude that we are unable to construct a commercially viable trading strategy.
II. DATASET
The chosen dataset is the Deutsche Börse public dataset available on Amazon Web Services (AWS). It contains trading data in 2 minute intervals for every tradeable security listed on the Eurex and Xetra trading platforms located in Frankfurt, Germany. The dataset is circa 3 GB in size with over 18,000 files and is located at: https://s3.eu-central-1.amazonaws.com/deutsche-boerse-xetra-pds/. In aggregate, we collected data from the 2nd of January 2018 to the 1st of April 2019 (314 unique trading days). In total, this comprised 18,854,026 rows. Each row represented a 2 minute trading interval for a particular stock. There were 2,837 unique stocks in the dataset. A snapshot of the dataset is shown below.
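Each row of the dataset is one 2-minute bar for one security. As a minimal illustration of the row layout, the following sketch parses a single bar; the column names follow the dataset's public description and the sample values are invented, so treat both as assumptions rather than actual file contents:

```python
import csv
import io

# Columns as publicly documented for the Xetra dataset (an assumption on our
# part); the sample row values below are made up for illustration only.
HEADER = ("ISIN,Mnemonic,SecurityDesc,SecurityType,Currency,SecurityID,"
          "Date,Time,StartPrice,MaxPrice,MinPrice,EndPrice,"
          "TradedVolume,NumberOfTrades")
sample = (HEADER + "\n"
          "DE0005557508,DTE,DEUTSCHE TELEKOM,Common stock,EUR,2504159,"
          "2018-01-02,08:00,14.75,14.80,14.70,14.78,50000,120")

# Each parsed row is one 2-minute bar for one stock.
rows = list(csv.DictReader(io.StringIO(sample)))
bar = rows[0]
print(bar["Mnemonic"], bar["Time"], bar["EndPrice"], bar["NumberOfTrades"])
# DTE 08:00 14.78 120
```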
III. TECHNOLOGIES AND TOOLS
To implement the objective of this project, we have chosen
to leverage the vast computing and storage resources offered
by AWS. The technologies used in this project are as follows:
• Simple Storage Service (S3) for storing the dataset and
outputs.
• Elastic Cloud Compute (EC2) for initial exploratory
analysis and model development.
• Elastic MapReduce (EMR) for deployment in production.
• Route53 for front end web interface.
A step-by-step methodical approach was implemented, starting from creating an AWS account, creating users (team members), creating roles and permissions for team members and applications using Identity Access Management (IAM), downloading the data into the created S3 bucket using Python helper functions, and then using PySpark, built on AWS EMR clusters, for exploratory analysis of the data. The EMR cluster creation process involved creating a 3-node cluster. Selection of PySpark for model development led to inclusion of the relevant applications, in this case, PySpark and JupyterHub. JupyterHub supports line-by-line execution of code blocks and displays results inline, helping us explore the data and keep track of every outcome. This process helped us test different predictive models and algorithms and fine tune the process of developing the final model before production deployment.
After the model is finalized, we intend to develop a front end web interface using Amazon Route 53. A domain named datainsight.guru was chosen, where users can choose tickers and control and tweak trading parameters to see the outcome of the predictive model.
After running the dataset through the EMR cluster on AWS, using the dataset uploaded into S3 buckets, we met to discuss the limitations imposed on us due to the cost of using AWS. The free tier account and the credit issued to us were exhausted. As a result, we looked into other platforms that would help us complete the project without incurring cost.
Progressing through the Cloud Computing course helped us gain more knowledge of PySpark, and after research we found Databricks Community Edition to be a viable platform for performing the additional distributed computing required to complete the project. The Databricks platform provides Jupyter-style notebooks with Python, Scala and Java environments, in addition to the convenience of running SQL queries on uploaded data and creating Spark DataFrames directly from SQL queries.
IV. METHOD DESIGN
In terms of the overall methodology, we chose S3 to store the stock price data and PySpark built on AWS EMR clusters¹ to analyze the data. We use the boto3 package in Python to upload stock price data from the Deutsche Börse public dataset to S3². Lastly, we use EMR Notebook to run PySpark jobs. Announced in November 2018, EMR Notebook is essentially a Python Jupyter notebook pre-configured to connect to EMR and S3.
The general process of the project is as follows:
1) Stock price-volume data is uploaded on S3
2) User (via web interface) selects the time horizon and
trading parameters (i.e., trading costs, trading frequen-
cies, lookback period, size of trading portfolio, ...)
3) We iterate through the time period provided by the user,
at each point t:
• A training set is constructed based on a lookback
period (k), i.e. from t − k to t − 1
• A ‘trading model’ (described in the next section)
will be fitted / applied to the training set
• Stocks will be bought and sold in the test set (time t)
• Profitability will be recorded.
4) Results will be displayed back to the user via the web
interface
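The rolling-window loop in step 3 can be sketched in plain Python as follows; this is a minimal illustration with function names and a toy prediction rule of our own, not the project's actual code:

```python
def backtest(prices, k, fit, trade):
    """Walk forward through time: at each t, train on [t-k, t-1], trade at t.

    prices: list of per-period price observations (index = time t)
    k:      lookback period length
    fit:    callable building a 'trading model' from the training slice
    trade:  callable returning the period's profit given (model, observation)
    """
    pnl = []
    for t in range(k, len(prices)):
        train = prices[t - k:t]              # lookback window t-k .. t-1
        model = fit(train)                   # fit model on the training set
        pnl.append(trade(model, prices[t]))  # trade at time t, record result
    return pnl

# Toy usage: the 'model' is the window mean; we score +1 when the price is
# below it (betting on reversion), -1 otherwise.
fit = lambda window: sum(window) / len(window)
trade = lambda mean, price: 1 if price < mean else -1
print(backtest([10, 11, 12, 11, 10, 13], k=3, fit=fit, trade=trade))
# [-1, 1, -1]
```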
¹We created a 3-node cluster following the AWS cluster creation process.
²See our code on Github.
A. Trading Model
Initially, we opted to examine momentum strategies (Jegadeesh and Titman, 1993) on the German market. However, given the Deutsche Börse database has a relatively short history (circa 2 years), we decided to instead focus on higher frequency trading strategies. In particular, we focus on implementing pairs trading (Vidyamurthy, 2004; Zhang and Zhang, 2008) on relatively high frequency equities data.
Over user selected trading frequencies (highest frequency =
2 minutes, lowest frequency = 1 day), we systematically
search for cointegrated pairs of stocks on the Xetra and Eurex
stock exchanges using both the Engle-Granger and Johansen
methods. Our trading strategy involves simultaneously buying
and selling top cointegrated pairs whose spread has diverged,
i.e., betting on convergence.
Our trading model incorporated both time-series and cross-
sectional data. We decided to focus on a daily frequency
trading model. Since our data was at 2-minute frequency, we convert the raw data into the following key features:
1) Total Volume: an aggregation of 2-minute frequency
volume for each stock on a daily basis
2) Total Trades: the aggregate number of trades made on
a given stock per day
3) VWAP: daily volume weighted average price (granular-
ity at the 2-minute frequency)
4) Daily Return: the total return made on the stock per
day
5) Realized Volatility: the absolute difference between the
maximum price and the minimum price on a stock on a
given day
6) Order Imbalance: a daily measure of trade imbalance
denoted as,
OI = |B − S| / (B + S)
where B is the total number of buy initiated trades
and S is the total number of sell initiated trades. We
approximate buy/sell initiation as follows: if in a given
2 minute interval, the end price is greater than the start
price, the trades within that interval are classified as buy
(and vice versa).
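The buy/sell initiation rule and the OI formula above can be sketched as follows; the bar representation is our own, and treating flat bars (end price equal to start price) as neither buy nor sell is an assumption the paper does not spell out:

```python
def order_imbalance(bars):
    """Daily order imbalance OI = |B - S| / (B + S) from 2-minute bars.

    bars: list of (start_price, end_price, num_trades) tuples.
    A bar's trades count as buy-initiated if end > start and sell-initiated
    if end < start; flat bars are ignored (our assumption).
    """
    buys = sum(n for start, end, n in bars if end > start)
    sells = sum(n for start, end, n in bars if end < start)
    if buys + sells == 0:
        return 0.0
    return abs(buys - sells) / (buys + sells)

# Three bars: 30 buy-initiated trades, 10 sell-initiated -> |30-10|/40 = 0.5
bars = [(100.0, 100.5, 20), (100.5, 100.2, 10), (100.2, 100.4, 10)]
print(order_imbalance(bars))  # 0.5
```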
The features above alone are not enough to build a pre-
dictive model. We extend by looking at 4 dimensions of
these features. To predict today’s returns of a given stock,
we analyze the following:
1) Yesterday’s feature set of given stock
2) Historical long run average feature set of given stock
3) Yesterday’s feature set as a percentile rank compared
to the given stocks’ historical feature set (gauge for
anything abnormal)
4) Yesterday’s feature set as a percentile rank compared
to yesterday’s peer feature set (compare this stock’s
features against all other stocks)
We excluded VWAP as a predictive feature, leaving us with 5
× 4 (dimensions) = 20 features for classification modelling.
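The four dimensions of a single feature can be sketched as follows; the function names, the strictly-less-than rank convention, and the toy numbers are our own illustration, not the project's code:

```python
def pct_rank(value, population):
    """Percentile rank of value in a population (fraction strictly below it)."""
    if not population:
        return 0.0
    return sum(1 for x in population if x < value) / len(population)

def feature_dimensions(history, peers_yesterday):
    """The four views of one feature for one stock: yesterday's value, the
    long-run average, yesterday's rank vs the stock's own history, and
    yesterday's rank vs the cross-section of peers."""
    yesterday = history[-1]
    return {
        "yesterday": yesterday,
        "long_run_avg": sum(history) / len(history),
        "rank_vs_history": pct_rank(yesterday, history[:-1]),
        "rank_vs_peers": pct_rank(yesterday, peers_yesterday),
    }

# Toy volume history for one stock, plus peers' volumes yesterday:
f = feature_dimensions([100, 120, 90, 200], peers_yesterday=[150, 180, 210, 90])
print(f["rank_vs_history"], f["rank_vs_peers"])  # 1.0 0.75
```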
We classify stock returns into three categories:
1) 0 Negative: returns < -0.001
2) 1 Neutral: -0.001 ≤ returns ≤ +0.001
3) 2 Positive: returns > +0.001
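The three-way labelling rule maps directly onto a small function; this is a sketch using the ±0.001 band given above:

```python
def label(ret, band=0.001):
    """Classify a daily return: 0 negative, 1 neutral, 2 positive."""
    if ret < -band:
        return 0
    if ret > band:
        return 2
    return 1  # within the +/- band, including the endpoints

print([label(r) for r in (-0.005, -0.001, 0.0, 0.001, 0.02)])  # [0, 1, 1, 1, 2]
```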
We run a logistic regression model to train the features on
the classification labels. We also tried implementing a neural
network model (and a simpler 2-layer fully connected model); however, these did not outperform the simpler logistic model.
Our trading strategy is simple. At the beginning of a given
trading day, we use historical features to try and predict the
total returns for all the stocks at the end of the day. For stocks
where we predict a negative outcome, we short (short sell) them with a k percent weight. For stocks where we predict a positive outcome, we long (buy) them with a k percent weight. For stocks where we predict a neutral outcome, we refrain from investing. We define k = 1/(number of stocks). Hence, each
position is equal weighted. For our benchmark strategy, we
will invest (long) k percent weight to all of the stocks, i.e., a
classic equal weighted portfolio. We will attempt to beat this
equal weighted portfolio with our algorithmic trading strategy.
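The equal-weighted long/short book described above can be sketched as follows; the ticker symbols are placeholders, not stocks from the dataset:

```python
def weights(predictions):
    """Equal-weight long/short portfolio from per-stock class predictions.

    predictions: dict of ticker -> class (0 negative, 1 neutral, 2 positive).
    Each active position gets weight k = 1 / (number of stocks);
    neutral predictions get zero weight.
    """
    k = 1 / len(predictions)
    side = {0: -1, 1: 0, 2: +1}  # short negatives, skip neutrals, long positives
    return {ticker: side[cls] * k for ticker, cls in predictions.items()}

# Placeholder tickers for illustration:
w = weights({"AAA": 2, "BBB": 0, "CCC": 1, "DDD": 2})
print(w)  # {'AAA': 0.25, 'BBB': -0.25, 'CCC': 0.0, 'DDD': 0.25}
```

The benchmark equal-weighted portfolio is simply the case where every stock receives weight +k.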
V. EVALUATION
A. EMR PySpark Setup
Since the main objective of CS 498 Cloud Computing
Application is to understand cloud computing applications and
their successful deployment, we focused our preliminary evaluation on demonstrating our ability to successfully create and deploy the cloud computing technologies listed in the Method Design section of this report.
The various stages of deployment that we have completed so far:
1) Using IAM to create users and apply roles and permissions for accessing the applications
2) Creating roles for enabling application and user access to applications
3) Creating an S3 bucket for data storage and access
4) Creating an IAM role for S3 bucket access by other applications used in this project
5) Creating the EMR cluster (see Figure 1)
6) Configuring the Python Jupyter notebook to connect to EMR and S3
7) Running a PySpark job on the Jupyter notebook (see Figure 2)
B. DataBricks Model Setup
Following the team discussion on using Databricks described in the Technologies and Tools section of this report, a Databricks Community Edition account was created to build a prediction model (logistic regression) for the stock data. Databricks provided a fully loaded environment of Python and SQL along with its easy to use internal DBFS (Databricks File System), where the data was uploaded from S3 using secure access keys.
The predictive logistic regression model was setup in the
following steps:
1) We import the high frequency trade data from S3. The display and initial analysis of the data are illustrated in Figure 3.
2) Data analysis and wrangling can be done using Spark SQL
3) Variable selection for the logistic model and vectoriz-
ing the features and labels can be done using Spark’s
VectorAssembler. This is illustrated in Figure 4.
4) An illustration of how we split training and testing data, and fit the training data into Spark's LinearRegression framework, is given in Figure 5. (For our final results, we implement a logistic model rather than a linear regression model.)
VI. RESULTS
A. Random Batch Backtests
We randomly sample 100 stocks from our database to run our backtest across the full history. We found our logistic regression model to be lacklustre in performance. In our training sample, it yielded an accuracy of 52.5%. The confusion matrix is as follows:
TABLE I
IN-SAMPLE CONFUSION MATRIX (TRAINING) - RANDOM BATCH

                         Actual
               Negative  Neutral  Positive
Predicted
  Negative        603      105       551
  Neutral         137     9622       164
  Positive       4884     7641      4693
This degraded to 43.2% in the test set. Overall, we did not see a significant difference in performance between our model and the benchmark long-only trading strategy. Therefore training the logistic model on a small random sample of 100 stocks is suboptimal. In the next section, we train our model on all 2,837 stocks.

Fig. 1. EMR Cluster
Fig. 2. Running PySpark on Jupyter Notebooks
TABLE II
OUT-OF-SAMPLE CONFUSION MATRIX (TEST) - RANDOM BATCH

                         Actual
               Negative  Neutral  Positive
Predicted
  Negative         35        7        34
  Neutral          14      420        10
  Positive        401      670       409
B. Full Sample Backtests
We ran our prediction model across all the stocks (2,837) over all the dates (2nd Jan, 2018 to 1st Apr, 2019) in the public dataset. Our logistic regression model performed marginally better: in our training sample, it yielded an accuracy of 67.5%, and in our test sample, it yielded an accuracy of 60.0%. We find that whilst it is good at predicting neutral states and reasonable at predicting negative states, it has a poor record at predicting positive states.
We plot the performance of our model vs the benchmark, which is an equal weighted portfolio of all the stocks listed on the Deutsche Börse. We find reasonably good performance of our model in-sample (on the training set), despite not being very good at predicting positive movements. This is
Fig. 3. Data Import from Amazon S3 storage
TABLE III
IN-SAMPLE CONFUSION MATRIX (TRAINING) - FULL BACKTEST

                         Actual
               Negative  Neutral  Positive
Predicted
  Negative      26315     4557     23716
  Neutral      115279   516430    116790
  Positive       1220      254      1147
in part due to the fact that the German DAX had a poor run during 2018. Our trading strategy also outperformed in the test set, yielding a portfolio with lower volatility and higher returns than the benchmark equal weighted portfolio. However, these
TABLE IV
OUT-OF-SAMPLE CONFUSION MATRIX (TEST) - FULL BACKTEST

                         Actual
               Negative  Neutral  Positive
Predicted
  Negative       1680      322      1545
  Neutral        9783    32312     10951
  Positive         78       12        57
results do not include trading costs. Since we are trading on a
daily basis, it is most likely that these results would become
negative after applying broker transaction costs.
Fig. 4. An example of Spark's VectorAssembler generating features. Here we construct only 2 features.
Fig. 5. An illustration of fitting a model in PySpark
Fig. 6. Training and Test - Profit and Loss (Random Batch of 100 stocks)
TABLE V
PORTFOLIO PERFORMANCE

                          Training Set              Test Set
                      Portfolio  Benchmark    Portfolio  Benchmark
Mean                    1.79%     -2.47%        1.47%      0.08%
Standard Deviation      0.73%      2.62%        0.56%      2.01%
Sharpe Ratio            2.45      -0.94         2.63       0.04
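The Sharpe ratios in Table V are the mean return divided by its standard deviation; the sketch below assumes a zero risk-free rate and per-period (non-annualized) figures, which is consistent with the table but not stated explicitly in the report:

```python
def sharpe(returns):
    """Sharpe ratio = mean / standard deviation of per-period returns
    (risk-free rate assumed zero; population standard deviation)."""
    n = len(returns)
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / n
    return mean / var ** 0.5

# Table V's training-set portfolio: mean 1.79%, std dev 0.73% -> Sharpe ~2.45
print(round(0.0179 / 0.0073, 2))  # 2.45

# And on a toy return series:
print(round(sharpe([0.01, 0.02, -0.005, 0.015]), 2))
```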
Fig. 7. Training and Test - Profit and Loss (Full Sample)
VII. DISCUSSION
We find moderate success in predicting stock price move-
ments using a feature set that included returns, volatility,
volume and order imbalance. However, our backtests are only
profitable before transaction costs, and are unlikely to work
in real-life trading. Generally speaking, the Deutsche Börse is a relatively efficient market, and therefore it is unreasonable to expect to trade profitably by simply examining past trading statistics as features.
Much of the team discussions focused on selecting the appropriate technologies that could help us achieve our objective of developing a successful application that is ready for real world usage. Cost and the computing power of the technologies were also items of discussion. AWS can be quite expensive if we do not select the proper technologies that can efficiently execute the tasks needed. For example, the PySpark framework built on EMR was chosen to achieve efficiency and speed, which would not only help us reduce cost but also deliver fast results to the users. As the project progressed, the team discussions focused on coming up with new strategies to build an effective model for the stock dataset and figuring out what other technologies or platforms could be used to keep the project cost in check. The team collaboratively agreed upon using the Databricks Spark platform, which would help in preventing further AWS costs. Subsequent discussions and meetings also resulted in dropping Amazon's Route 53 for cost-prohibitive reasons.
VIII. FUTURE WORK
The project's initial set up and testing of the applications' seamless integration with each other has been completed. Now that the framework, or the infrastructure, is in place, next steps include creating or selecting the appropriate predictive algorithms to help us build a successful model. After the model is successfully created, the project will move into the production phase, with interconnected applications checked for seamless integration, and the front end user interface developed using Amazon's Route 53 application. As mentioned in the Technologies and Tools and Discussion sections of this report, selecting new technologies and additional models led to Linear Regression and Logistic Regression model building and developing PySpark code in a Databricks PySpark notebook.
IX. DIVISION OF WORK
Fan Yang and Wang Chun Wei contributed to developing the Python helper code for connecting to and downloading the data into S3. Drawing on their prior machine learning knowledge, and in collaboration with Marjan Ahmed and Dilruba Hawk, the dataset was chosen and the initial model development strategy, such as selecting the appropriate machine learning model (for example, regression), was decided. Marjan Ahmed and Dilruba, along with inputs from Fan Yang and Wang Chun Wei, created the Amazon account, created users, roles and permissions using IAM, and created EC2 and S3. Fan Yang also tested the Jupyter notebook and ran a computing job using the EMR cluster. Dilruba will help with developing the front end web application using Route 53 for this project.
Selecting new technologies and additional models led to redistributing the workload based on team members' familiarity with the platform. Wang Chun and Fan Yang contributed significantly to developing and testing the machine learning models of Logistic Regression and Neural Networks. Marjan Ahmed, with the help of Fan Yang and Dilruba Hawk, developed a Linear Regression and a Logistic Regression model on the Databricks platform. In the end we chose the Logistic model for the final backtests. Fan Yang's expert knowledge of AWS helped Marjan Ahmed connect the Databricks platform to S3 using secure credentials. Marjan's prior experience in Apache Spark helped build the predictive model.
REFERENCES
[1] Jegadeesh, N., and S. Titman (1993) Returns to buying winners and selling losers: Implications for stock market efficiency, Journal of Finance, 48, 65-91.
[2] Vidyamurthy, G. (2004) Pairs Trading: Quantitative Methods and Analysis, New Jersey: John Wiley & Sons.
[3] Zhang, H. and Zhang, Q. (2008) Trading a mean-reverting asset: Buy low and sell high. Automatica, Vol. 44, 1511-1518.