This project studies the factors that affect the landing distance of a flight. The input data has 7 predictor variables and 1 response variable. First, a descriptive study is performed to understand the data and the relationships among the variables. Data cleaning is then done to handle duplicate and missing data. A linear model is fitted to predict the landing distance of the flight by selecting the useful predictors. Residual diagnostics are also run to validate the fitted model.
The objective of this analysis is to quantify the factors that impact the landing distance of a commercial flight and to build a linear regression model to predict the risk of landing overrun.
This document describes a statistical analysis project to determine the factors that affect commercial flight landing distance. The author analyzes 950 flight observations to build a linear regression model with landing distance as the target variable. Key factors found to impact landing distance are aircraft type, ground speed, and height of the aircraft. The final regression equation found is: Distance = -2554.47 + 501.57(Aircraft_Cat) + 42.79(Ground Speed) + 12.52(Height), where Aircraft_Cat is a dummy variable for aircraft type. 832 observations were used to fit the final model after removing outliers.
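As a worked illustration of the regression equation quoted above, the sketch below evaluates it for one set of inputs. The function name `predict_landing_distance` is hypothetical, and the 0/1 coding of Aircraft_Cat is an assumption: the summary only says it is a dummy variable and does not state which aircraft type maps to 1, nor the units of each variable.

```cpp
#include <cassert>
#include <cmath>

// Evaluates the fitted equation from the summary:
//   Distance = -2554.47 + 501.57*Aircraft_Cat + 42.79*GroundSpeed + 12.52*Height
// aircraft_cat is ASSUMED to be coded 0 or 1 (the source does not give the coding).
double predict_landing_distance(int aircraft_cat, double ground_speed, double height) {
    assert(aircraft_cat == 0 || aircraft_cat == 1);
    return -2554.47
         + 501.57 * aircraft_cat
         + 42.79 * ground_speed
         + 12.52 * height;
}
```

For example, with Aircraft_Cat = 1, a ground speed of 100 and a height of 30, the terms are -2554.47 + 501.57 + 4279.00 + 375.60, which gives a predicted distance of 2601.70 in whatever unit the response variable uses.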
This document provides an overview of the C++ Data Structures lab manual. It covers topics like C++ review, implementation of various data structures like stack, queue, linked list, binary tree, graph. It also discusses sorting and searching techniques, file input/output, functions, classes, templates and exercises for students to practice implementing different data structures and algorithms. The instructor's contact details are provided at the beginning.
The document provides practices for coding PL/SQL that are worth considering. It discusses 11 practices:
1) Using UNION instead of mixing SELECT MIN and MAX to get faster performance.
2) COUNT(*), COUNT(1) or COUNT(PK) have the same performance.
3) Whether to use NOT IN or MINUS depends on table sizes - MINUS is generally faster but NOT IN may be faster for larger tables.
4) Some hints like parallel are ignored or incompatible with others like index.
5) Nested loops can sometimes be improved by rewriting the query.
6) Full table scans with parallel hint can utilize multiple CPUs.
7) Rewriting NOT IN
A general coverage of the C language is presented. These slides are useful for students attending C courses at universities and other institutions as well as others following C tutorials or learning the language by themselves. Comments are welcome for creating better future.
Faculty of Eng. & Computer Sciences of IEU, Izmir-Turkey,
Assoc. Prof. Dr. S. Kondakci
This document provides information about stacks, queues, and hashing. It describes stacks as data structures where the last item inserted is the first item accessed. Common stack operations like push(), pop(), and peek() are discussed. Array and linked list implementations of stacks are presented. Examples of stack applications in compiler design, expression evaluation, and spell checking are given. Queues are defined as structures where the first item inserted is the first item accessed (FIFO). Circular queue implementations are described to avoid overflow issues. Deques, which allow insertion and removal from both ends, are also introduced. Implementation of queues using arrays is demonstrated with methods like enqueue(), dequeue(), and isEmpty().
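The circular queue idea mentioned above can be sketched as follows. This is a minimal illustration, not the code from the summarized slides: the class name `CircularQueue`, the `isFull()` method and the fixed `Capacity` template parameter are assumptions, while `enqueue()`, `dequeue()` and `isEmpty()` follow the method names in the summary. The index wraps around with the modulo operator, so slots freed by `dequeue()` are reused instead of the array running out of room at its end.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Fixed-capacity FIFO queue stored in an array used as a ring.
template <typename T, std::size_t Capacity>
class CircularQueue {
public:
    bool isEmpty() const { return count_ == 0; }
    bool isFull() const { return count_ == Capacity; }

    // Adds an item at the tail; returns false when the queue is full.
    bool enqueue(const T& value) {
        if (isFull()) return false;
        data_[(head_ + count_) % Capacity] = value;  // tail index wraps around
        ++count_;
        return true;
    }

    // Removes and returns the oldest item; the queue must not be empty.
    T dequeue() {
        assert(!isEmpty());
        T value = data_[head_];
        head_ = (head_ + 1) % Capacity;  // head index wraps around
        --count_;
        return value;
    }

private:
    std::array<T, Capacity> data_{};
    std::size_t head_ = 0;   // index of the oldest item
    std::size_t count_ = 0;  // number of stored items
};
```

Tracking an item count rather than a separate tail index is one design choice among several; it keeps the full and empty states easy to tell apart.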
Explanations to the article on Copy-Paste - PVS-Studio
Many readers liked my article "Consequences of using the Copy-Paste method in C++ programming and how to deal with it" [1]. Scott Meyers [2] noticed it too and asked me how static analysis proper helped us to detect the errors described in the article.
Predicting landing distance: Adrian Valles - Adrián Vallés
Conducted data preparation/cleaning and statistical modeling in a project using SAS to consider factors affecting flight landing overrun and predicting landing distance of commercial flights to reduce overrun. This was the final project for the statistical computing class (BANA 6043)
This document contains a laboratory record of a student from the MCA department of Muthayammal Engineering College in Rasipuram, Tamil Nadu, India. It includes programs written by the student to illustrate concepts like enumerated data types, function overloading, scope of variables, implementation of stacks, queues, constructors, destructors, static members and methods, and bit fields. The programs were run and the outputs were included to verify the concepts.
The project looks at the factors that impact the landing distance of a commercial flight, so as to minimize the risk of overrun. SAS is used to perform exploratory data analysis and fit a regression model.
The objective of the project is to predict whether a flight will be delayed or not by studying the various features of the flight and calculating the probability for delay using Bayesian Classification.
3- In the program figurespointers we have a base class location and va.pdf - atozshoppe
3. In the program figurespointers we have a base class location and various derived classes:
circle, triangle, rectangle. Complete the implementation of the derived classes.
For marking purposes, run the program entering: circle(1,2,3), triangle(3,4,1,2,1) and
rectangle(5,6,3,4). Explain why this program does not work properly.
Change the classes (not main) so that the program runs properly and run the program with the
same data as before.
Declaration of Classes FigurePointers:
/* File: figurepointers.h
*/
#ifndef FIGUREPOINTERS_H
#define FIGUREPOINTERS_H
#include <cmath>
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
class location {
private :
float x; // position of the figure
float y;
public :
void read(istream & in ) ;
void write(ostream & out ) const ;
virtual float area( void ) const ;
};
class circle : public location {
private :
float radius;
public :
void read(istream & in );
void write(ostream & out ) const ;
float area( void ) const ; // area of the figure;
};
class rectangle : public location {
private :
float side1, side2;
public :
void read(istream & in );
void write(ostream & out ) const ;
float area( void ) const ; // area of the figure;
};
class triangle : public rectangle {
private :
float angle;
public :
void read(istream & in );
void write(ostream & out ) const ;
float area( void ) const ; // area of the figure;
};
#endif /* FIGUREPOINTERS_H */
Implementation of Class FigurePointers:
/* File: figurepointers.cpp
Implementation of class figurepointers */
#include "figurepointers.h"
////////////////// implementation of location /////// ///////////
void location::read(istream & in)
{
cout <<"x coordinate: ";
in >> x;
cout <<"y coordinate: ";
in >> y;
}
float location::area( void ) const
{
return 0.0;
}
void location::write(ostream & out) const
{
out << "x coordinate: " << x << endl;
out << "y coordinate: " << y << endl;
out << "area = " << area() << endl;
}
////////////////// implementation of circle //////////////////
////////////////// implementation of triangle //////////////////
////////////////// implementation of rectangle //////////////////
Main Program Using Classes FigurePointers:
/* File: testfigurepointers.cpp
Application program using classes figurepointers
Programmer: your name Date: */
#include "figurepointers.h"
int main( void )
{
string type; // figure type
ofstream fout ("testfigurepointersout.txt");
location * p;
while ( true ) { // loop until break occurs
cout << endl << "Type of figure: "; cin >> type;
if (type == "circle") {
p = new circle;
p->read(cin);
fout << endl << "object is a circle" << endl;
p->write(fout);
delete p;
} else if (type == "triangle") {
p = new triangle;
p->read(cin);
fout << endl << "object is a triangle" << endl;
p->write(fout);
delete p;
} else if (type == "rectangle") {
p = new rectangle;
p->read(cin);
fout << endl << "object is a rectangle" << endl;
p->write(fout);
} el.
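One likely reason the program above misbehaves: `read` and `write` are not declared `virtual` in `location`, so calls made through a `location *` run the base-class versions and the derived data (for example the radius) is never read. Deleting a derived object through a base pointer without a virtual destructor is also undefined behavior. The sketch below shows one possible fix for `location` and `circle` only. It is not the assignment's reference solution: the console prompts are omitted, and the helper `circle_area_via_base` is a hypothetical function added purely to exercise the classes.

```cpp
#include <cassert>
#include <iostream>
#include <memory>
#include <sstream>
#include <string>

class location {
public:
    virtual ~location() = default;  // makes delete through a base pointer safe
    virtual void read(std::istream& in) { in >> x >> y; }
    virtual void write(std::ostream& out) const {
        out << "x coordinate: " << x << '\n'
            << "y coordinate: " << y << '\n'
            << "area = " << area() << '\n';
    }
    virtual float area() const { return 0.0f; }

private:
    float x = 0.0f;  // position of the figure
    float y = 0.0f;
};

class circle : public location {
public:
    void read(std::istream& in) override {
        location::read(in);  // x and y are private, so reuse the base reader
        in >> radius;
    }
    void write(std::ostream& out) const override {
        location::write(out);
        out << "radius: " << radius << '\n';
    }
    float area() const override { return 3.14159265f * radius * radius; }

private:
    float radius = 0.0f;
};

// Hypothetical helper: reads "x y radius" through a base-class pointer
// and returns the area computed by the dynamically dispatched override.
float circle_area_via_base(const std::string& input) {
    std::istringstream in(input);
    std::unique_ptr<location> p = std::make_unique<circle>();
    p->read(in);         // dispatches to circle::read because read is virtual
    assert(!in.fail());  // all three values were parsed
    return p->area();
}
```

With the input `1 2 3` (the circle data from the exercise), the pointer-based call now reads the radius and reports an area of about 28.27 instead of working with an uninitialized radius. The same pattern would apply to `rectangle` and `triangle`.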
Supporting Flight Test And Flight Matching - j2aircraft
This document describes how the j2 Universal Tool-Kit can be used to support a complete flight test program and flight matching through:
The Development of an A Priori Model
Flight Test Planning and Rehearsal
Flight Test Data Analysis
Flight Matching and Model Updates
Model Qualification Certification
Simulator Certification and Qualification
Mission Planning
SF Big Analytics 20191112: How to performance-tune Spark applications in larg... - Chester Chen
Uber developed a new Spark ingestion system, Marmaray, for data ingestion from various sources. It’s designed to ingest billions of Kafka messages every 30 minutes. The amount of data handled by the pipeline is on the order of hundreds of TBs. Omkar details how to tackle such scale and offers insights into the optimization techniques. Some key highlights are how to understand bottlenecks in Spark applications, whether to cache your Spark DAG to avoid rereading your input data, how to effectively use accumulators to avoid unnecessary Spark actions, how to inspect your heap and nonheap memory usage across hundreds of executors, how to change the layout of data to save long-term storage cost, how to effectively use serializers and compression to save network and disk traffic, how to amortize the cost of your application by multiplexing your jobs, and different techniques for reducing memory footprint, runtime, and on-disk usage. CGI was able to significantly (~10%–40%) reduce memory footprint, runtime, and disk usage.
Speaker: Omkar Joshi (Uber)
Omkar Joshi is a senior software engineer on Uber’s Hadoop platform team, where he’s architecting Marmaray. Previously, he led object store and NFS solutions at Hedvig and was an initial contributor to Hadoop’s YARN scheduler.
This document discusses configuring communications monitoring by implementing features and signatures from network traffic and learning a white-box model. It describes extracting feature values from packet fields using Python expressions and gathering them in a feature file. Signatures are defined as Python boolean expressions mapped to alert IDs. A white-box model is learned from a training set and stored in a histograms file, which can be tuned by adjusting likelihood values and bins. The steps are demonstrated on a bottle filling plant use case monitoring Modbus traffic.
IRJET - Comparative Study of Flight Delay Prediction using Back Propagati... - IRJET Journal
This document describes a study that compares using backpropagation and radial basis function neural networks to predict flight delays. The study uses historical flight data to train models to predict delays. Both algorithms are described and their methodology is outlined. The data will be preprocessed, split for training and testing models, and then the models will be evaluated and compared to determine the most accurate for predicting flight delays.
The document discusses runtime environments and memory management techniques for programming languages. It covers stack-based vs dynamic environments, parameter passing mechanisms like pass by value and reference, and garbage collection algorithms. Dynamic memory allocation uses a heap structure with malloc and free functions. Object-oriented languages require special runtime support for objects, inheritance etc. Fully dynamic environments are needed for functional languages that allow nested functions.
The document outlines a chapter on methods in C#. It discusses key concepts like defining methods, passing arguments by value vs reference, and using built-in classes like Math. It provides examples of methods that square integers, find the maximum of 3 numbers, and demonstrate passing by reference and out parameters.
The document provides steps to develop various types of procedures, reports, interfaces, and loads in Oracle Applications. It outlines the key steps as: 1) develop the code/logic; 2) move files to server; 3) create concurrent executable; 4) create concurrent program; 5) attach program to request group; and 6) submit the program. The document also summarizes how to develop inbound and outbound interfaces, and provides examples of common tables and queries.
FAA Flight Landing Distance Forecasting and Analysis - Quynh Tran
The overall goal of this project is to find a suitable model to forecast landing distance based on the variables given in the dataset. To come up with a model that fits the dataset well, we need to go through certain steps to explore, clean, visualize, and analyze the values in the dataset.
This document provides an introduction to pointers in C programming. It discusses what pointers are, how they store memory addresses, and how they can be used to access and modify variables. Some key points:
- Pointers store the address of a variable rather than its value. They allow access to variables outside of the current scope.
- The & operator returns the address of a variable. This address can be assigned to a pointer variable.
- Pointer variables must be declared with a * before the name. They point to data of a specified type.
- Variables can be accessed through their pointers using the * operator on the pointer.
- Pointers can be passed to functions to allow call-by
Java Airline Reservation System – Travel Smarter, Not Harder.pdf - SudhanshiBakre1
- The document describes the steps to create an airline reservation system using Java and SQLite. It will include four panels (reservations, customers, flights, airports) with JTables to display database data.
- The Database class is implemented to initialize the database, connect to it, and perform operations like retrieving, adding, updating and deleting data from the tables.
- The AirlineReservation class implements the user interface with JTextFields, JComboBoxes, JTables and buttons to display and interact with the database data.
The document outlines the schedule and objectives for an operating systems lab course over 10 weeks. The first few weeks focus on writing programs using Unix system calls like fork, exec, wait. Later weeks involve implementing I/O system calls, simulating commands like ls and grep, and scheduling algorithms like FCFS, SJF, priority and round robin. Students are asked to display Gantt charts, compute waiting times and turnaround times for each algorithm. The final weeks cover inter-process communication, the producer-consumer problem, and memory management techniques.
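The waiting-time and turnaround-time computation mentioned above can be sketched for the simplest of the listed algorithms, FCFS. This is an illustrative sketch, not code from the summarized lab course: the names `Times` and `fcfs` are hypothetical, and all processes are assumed to arrive at time 0 and to run in the order given.

```cpp
#include <cassert>
#include <vector>

// Per-process results, in the same order as the input burst times.
struct Times {
    std::vector<int> waiting;     // time spent waiting before the process starts
    std::vector<int> turnaround;  // time from arrival (assumed 0) to completion
};

// First-come, first-served: each process runs to completion in arrival order.
Times fcfs(const std::vector<int>& burst) {
    Times t;
    int clock = 0;  // current time on the Gantt chart
    for (int b : burst) {
        assert(b >= 0);
        t.waiting.push_back(clock);     // waits for everything scheduled before it
        clock += b;                     // runs for its whole burst
        t.turnaround.push_back(clock);  // completion time equals turnaround here
    }
    return t;
}
```

For burst times 24, 3 and 3, the waiting times are 0, 24 and 27 and the turnaround times are 24, 27 and 30, so the average waiting time is 17.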
The document discusses a case study on customer churn for a telecommunications company called Mobicom. It aims to identify the top five factors driving churn and recommend proactive retention programs. Key factors analyzed include usage, billing issues, network quality, data usage and rate plans. The analysis identifies age, retention calls, equipment age and length of relationship as top predictors of churn. It recommends rate plan migration and prioritizing high-churn customers for retention campaigns.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will present on related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
More Related Content
Similar to Flight Landing Distance Study Using SAS
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake - Walaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world, where data privacy and compliance are a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences. (3) They are context-aware, encoding a different set of transformations for different use cases. (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
Learn SQL from basic queries to Advance queriesmanishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...Social Samosa
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
Global Situational Awareness of A.I. and where its headedvikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be un-leashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
The Building Blocks of QuestDB, a Time Series Databasejavier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, or faster batch ingestion.
2. Flight Landing Distance Study by Sarita Maharia
Index
Contents
Summary
Chapter 1 – Data exploration and data cleaning
Chapter 2 – Descriptive Study
Chapter 3 – Statistical Modeling
Chapter 4 – Model Validation
Chapter 5 – Remodeling and model validation
Questions from project pdf
Appendix
Summary
This project report details the steps taken to fit a linear model that predicts flight landing distance from the
given input data. The combined input data contains 950 observations (850 after removing duplicates) of 8 variables.
The variable dictionary is provided in the appendix. Below is a summary of the steps and the corresponding observations:
1. Data Cleaning –
a. Duplicates across the 2 input datasets are removed.
b. Observations with negative values of the height variable are deleted, as these are likely recording errors.
2. Descriptive Study – Analyze plots and correlation coefficients
a. Distance has strong positive correlation with both speed_air and speed_ground, but the
plots show a slight curve, so transformations might be required.
b. Independent variables speed_air and speed_ground are strongly correlated with each other,
hence can distort the model if both are included.
3. Statistical Modeling – Fit model with all variables and cleaned data from step 1
a. The regression coefficients change sign when regression is run with individual variables
and all variables together with distance.
b. Speed_ground is removed to solve issue in step a.
c. The significant variables are – aircraft type, height and speed_air.
d. MAPE(Mean absolute Percentage Error) is approx. 4% for this base model.
4. Model Validation – Validate model created in step 3
a. Residuals show a curved pattern and are not symmetric.
b. Residuals fail the normality test, although their mean is not significantly different from zero.
c. This means the independent variables as is don't have a linear relationship with distance and
a transformation is required.
5. Remodeling and re-validation –
a. An alternate model is fit using transformed (squared) speed_air.
b. The alternate model has a better Adjusted R square and passes the residual validation criteria.
c. The alternate model also explains variability better than the base model and has a lower
MAPE.
d. The final model is listed below:
$$\text{landing distance} = -2031.73 + 405.21 \cdot \text{aircraft\_type} + 0.38 \cdot \text{speed\_air}^2 + 13.94 \cdot \text{height}$$
Chapter 1 – Data exploration and data cleaning
Goals
Merge given datasets after understanding variables and eliminate duplicates.
Identify outliers and variables with missing values and treat the variables.
Check whether the minimum and maximum observations for a variable are logical.
SAS Code and Output
Both given datasets are concatenated and the result is saved in the combined_flights dataset. The combined
dataset has 950 rows in total.
A backup copy of the combined dataset is also taken.
/* import first dataset */
FILENAME REFFILE '/folders/myfolders/sasuser.v94/FAA1.xls';
PROC IMPORT DATAFILE=REFFILE DBMS=XLS OUT=project.data1;
GETNAMES=YES;
RUN;
/* remove extra rows that might be created because of spreadsheet import */
data project.data1_required;
set project.data1;
if not(cmiss(of aircraft distance duration height no_pasg pitch speed_air
speed_ground) eq 8);
run;
/* import FAA2 sheet */
FILENAME REFFILE '/folders/myfolders/sasuser.v94/FAA2.xls';
PROC IMPORT DATAFILE=REFFILE DBMS=XLS OUT=project.data2;
GETNAMES=YES;
RUN;
/* remove extra rows that might be created because of spreadsheet import */
data project.data2_required;
set project.data2;
if not(cmiss(of aircraft distance height no_pasg pitch speed_air speed_ground) eq 7);
run;
/* concatenate both datasets */
data project.combined_flights;
set project.data1_required project.data2_required;
run;
/* create copy of main dataset */
data project.combined_flights_copy;
set project.combined_flights;
run;
Check for duplicates in the combined dataset and remove duplicates if there are any
/* build frequency table to check for duplicates */
proc freq data=project.combined_flights;
tables aircraft*distance*height*no_pasg*pitch*speed_air*speed_ground / noprint
out=keylist;
run;
/* print duplicate rows */
proc print;
where count ge 2;
run;
/* sort data on all variables so that duplicates can be deleted */
proc sort data=project.combined_flights out=project.combined_flights_sort;
by aircraft descending duration distance height no_pasg pitch speed_air speed_ground;
run;
/* dataset with unique values */
proc sort data=project.combined_flights_sort out=project.combined_flights_unique
nodupkey;
by aircraft distance height no_pasg pitch speed_air speed_ground;
run;
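As a cross-check on the de-duplication step above, the discarded rows can be captured rather than silently dropped. This is a minimal sketch using the DUPOUT= option of PROC SORT; the dataset names work.flights_unique_check and project.combined_flights_dups are assumptions introduced here for illustration and do not appear in the report.

```sas
/* Hypothetical cross-check: keep the rows that NODUPKEY discards.          */
/* Output dataset names are assumed for illustration only.                  */
proc sort data=project.combined_flights_sort
          out=work.flights_unique_check
          dupout=project.combined_flights_dups
          nodupkey;
  by aircraft distance height no_pasg pitch speed_air speed_ground;
run;

/* list the discarded duplicate rows for inspection */
proc print data=project.combined_flights_dups;
run;
```

If the counts reported in this chapter hold, the DUPOUT dataset would be expected to contain the rows removed as duplicates, which makes the removal auditable.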
The frequency table has 850 rows against 950 rows in the combined dataset, hence there are duplicates in the
dataset. Below are sample duplicate rows from the frequency table:
Code to find number of missing, mean, min and max of all variables:
proc means data=project.combined_flights_unique n nmiss min max mean;
run;
Speed_air variable has almost 75% missing values. This variable will be retained as it’s an important variable.
Similarly duration variable will be retained even though it has null values.
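The share of missing values can also be computed directly instead of being read off the PROC MEANS output. The sketch below assumes the de-duplicated dataset from the previous step; the column aliases are illustrative names, not taken from the report.

```sas
/* Sketch: count and percentage of missing values for the two variables    */
/* that the report flags as incomplete. Aliases are illustrative.          */
proc sql;
  select count(*)                          as n_rows,
         nmiss(speed_air)                  as n_missing_speed_air,
         100 * nmiss(speed_air) / count(*) as pct_missing_speed_air,
         nmiss(duration)                   as n_missing_duration
  from project.combined_flights_unique;
quit;
```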
Code to find negative height values:
/* find rows of data that have negative heights. This looks to be wrong recording of data */
proc print data=project.combined_flights_unique;
where height < 0;
run;
Output:
Code to delete negative height values observations:
data project.combined_flights_updated;
set project.combined_flights_unique;
if height < 0 then delete;
run;
Now there are 845 observations:
Code to find levels of categorical variables:
/* find types of aircraft and their count in combined datatset */
proc freq data = project.combined_flights_updated nlevels ;
table aircraft;
run;
Output:
Find outliers and distribution for all variables using below code. Outliers are maintained in data as they represent
extreme conditions
/* plot for outliers */
proc univariate data=project.combined_flights_updated plot;
run;
Observations
1. Input datasets had 950 observations. After data cleaning, the output dataset has 845 observations. Below are
the removed observations:
a. 100 duplicate observations across the 2 datasets
b. 5 rows are deleted because height has negative values. These might be recorded incorrectly.
2. Missing values in the output dataset – these are retained as is
a. About 75% of values are missing for the speed_air variable
b. 50 observations have the duration variable missing.
3. There are many outliers for the distance variable, but these are retained as they represent chances of overrun.
4. There are almost the same number of rows for both values of the categorical variable (aircraft).
Conclusion
1. 845 observations are present in the cleaned dataset after deleting duplicate observations and negative
height rows.
2. Null values and the outliers in variables are retained.
Chapter 2 – Descriptive Study
Goals
Understand correlation between different variables and analyze plots
SAS Code and Output
First create copy of input dataset and sort it. Also code the aircraft type so that it can be used in regression.
Code:
/* create copy of input dataset */
data project.flights_input;
set project.combined_flights_updated;
run;
/* sort input dataset by aircraft type */
Proc sort data=project.flights_input;
by aircraft;
run;
/* code aircraft type to dummy variables. airbus=0 and boeing=1 */
data project.flight_coded;
set project.flights_input;
if aircraft = "airbus" then aircraft_type=0;
else aircraft_type=1;
drop aircraft;
run;
Output: 845 observations in output dataset with aircraft coded as 0 for airbus and 1 for boeing.
Generate plots for all variables with distance variable to understand direction and shape of relation
proc plot data=project.flight_coded;
plot distance*duration;
run;
proc plot data=project.flight_coded;
plot distance*height;
run;
proc plot data=project.flight_coded;
plot distance*no_pasg;
run;
proc plot data=project.flight_coded;
plot distance*pitch;
run;
proc plot data=project.flight_coded;
plot distance*speed_air;
run;
proc plot data=project.flight_coded;
plot distance*speed_ground;
run;
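The six separate PROC PLOT steps above can be condensed into one panel of scatter plots. This is a sketch using PROC SGSCATTER with a COMPARE statement; it is an alternative presentation that assumes ODS Graphics is available, not the procedure used in the report.

```sas
/* Sketch: one row of scatter plots sharing the distance axis.             */
/* Uses ODS Graphics instead of the line-printer output of PROC PLOT.      */
proc sgscatter data=project.flight_coded;
  compare y=distance
          x=(duration height no_pasg pitch speed_air speed_ground);
run;
```

Seeing all predictors against distance side by side makes it easier to spot which relationships are curved and which show no visible pattern.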
There is no recognizable pattern between distance and the other variables, except for the following:
1. Distance and speed_air have a positive relation with a slight curve. Also, there are no values below 90 for
speed_air, which means that the data is truncated.
2. Distance and speed_ground have a positive relation with a curve.
Find strength of correlation using below code:
proc corr data=project.flight_coded;
var _all_;
run;
Output: Strong correlations are highlighted in yellow:
Since the plots of distance with speed_air and speed_ground have a slight curve, these variables are transformed to
obtain a linear relation and a higher correlation coefficient. Out of all transformations, the cubes of speed_air and
speed_ground give the maximum correlation coefficient.
Code:
/*possible transformations*/
data project.flights_coded_t;
set project.flight_coded;
speed_air2=speed_air**2;
speed_air3=speed_air**3;
speed_air12=sqrt(speed_air);
speed_airlog=log(speed_air);
speed_ground2=speed_ground**2;
speed_ground3=speed_ground**3;
speed_ground12=sqrt(speed_ground);
speed_groundlog=log(speed_ground);
run;
/*find correlations in transformed data*/
proc corr data=project.flights_coded_t;
var distance speed_air speed_air2 speed_air3 speed_air12 speed_airlog
speed_ground speed_ground2 speed_ground3 speed_ground12 speed_groundlog;
run;
/*verify plots for transformed data */
proc plot data=project.flights_coded_t;
plot distance*speed_air3;
run;
proc plot data=project.flights_coded_t;
plot distance*speed_ground3;
run;
proc plot data=project.flights_coded_t;
plot speed_ground3*speed_air3;
run;
Increased correlation coefficients are highlighted in yellow
Plots after doing transformations look linear:
Square of speed_ground and speed_air also have linear relationship
Observations
Distance has strong positive correlation with speed_air and speed_ground, but the plots have a slight curve. The
relationship looks linear after applying the square transformation.
The correlation coefficient of distance with speed_air and speed_ground increases after both variables are
squared. The squared variables also show a strong positive linear relation with each other.
Conclusion
Used as is, the variables might not fit a linear model: the plots of speed_ground and speed_air against distance are
curved, so transformed speed variables might be needed for the model to pass validation.
Speed_air and speed_ground have high collinearity that could impact the linear model.
Chapter 3 – Statistical Modeling
Goals
Fit a linear model to predict landing distance
SAS Code and Output
First try to identify parameters for regression between distance and individual independent variables using below
code:
proc reg data=project.flight_coded;
model distance=aircraft_type;
run;
proc reg data=project.flight_coded;
model distance=duration;
run;
proc reg data=project.flight_coded;
model distance=height;
run;
proc reg data=project.flight_coded;
model distance=no_pasg;
run;
proc reg data=project.flight_coded;
model distance=pitch;
run;
proc reg data=project.flight_coded;
model distance=speed_air;
run;
proc reg data=project.flight_coded;
model distance=speed_ground;
run;
Now identify parameters for regression between distance and all other variables
proc reg data=project.flight_coded;
model distance=aircraft_type duration height no_pasg pitch speed_air speed_ground;
run;
The table further below summarizes the output of the correlation and regression models run above.
The regression coefficients of duration, pitch and speed_ground (highlighted in yellow in the original table) change
sign when all variables are considered together. From the Chapter 2 conclusion, we know that there is strong
correlation between speed_air and speed_ground. So, we need to remove the impact of collinearity among the
independent variables to fit the model properly.
Out of speed_air and speed_ground, we need to select one to remove from the model. Speed_air has truncated
data, which means low speed_air observations are missing. The main purpose of this project is to identify scenarios
for overrun. Since there is a strong positive relation between speed_air and distance, chances of overrun are higher
in high-speed scenarios, which is where speed_air is recorded. Speed_air is also too important a variable to drop.
Hence, we will keep the speed_air variable and drop speed_ground from our model.
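The collinearity argument above can be backed by a numeric diagnostic. The sketch below requests variance inflation factors through the VIF option of the MODEL statement in PROC REG and prints the pairwise correlation of the two speed variables; it is an illustrative check, not a step taken from the report. A common rule of thumb treats a VIF well above 10 as a sign of problematic collinearity, though that threshold is a convention rather than a strict test.

```sas
/* Sketch: pairwise correlation between the two speed variables only */
proc corr data=project.flight_coded;
  var speed_air;
  with speed_ground;
run;

/* Sketch: variance inflation factors for the model with all predictors */
proc reg data=project.flight_coded;
  model distance=aircraft_type duration height no_pasg pitch
                 speed_air speed_ground / vif;
run;
quit;
```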
Model without speed_ground:
proc reg data=project.flight_coded;
model distance=aircraft_type duration height no_pasg pitch speed_air;
run;
Now insignificant variables (with p-value > 0.05) are removed from the model one by one and below are the final
variables:
proc reg data=project.flight_coded;
model distance=aircraft_type height speed_air;
run;
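The one-by-one removal of insignificant variables described above corresponds to backward elimination, which PROC REG can automate. This is a sketch under the assumption that the same 0.05 significance threshold is used as the stay criterion; the automated procedure may not reproduce the exact manual steps, but it applies the same idea.

```sas
/* Sketch: backward elimination starting from the model without            */
/* speed_ground; variables stay only if significant at the 0.05 level.     */
proc reg data=project.flight_coded;
  model distance=aircraft_type duration height no_pasg pitch speed_air
        / selection=backward slstay=0.05;
run;
quit;
```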
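Independent variable | Direction | Correlation coefficient | P-value (corr coeff) | Regression coeff, distance vs individual var | P-value (reg coeff), distance vs individual var | Regression coeff, distance vs all var | P-value (reg coeff), distance vs all var
-------------------- | --------- | ----------------------- | -------------------- | -------------------------------------------- | ----------------------------------------------- | ------------------------------------- | ----------------------------------------
aircraft type | | 0.238 | <.0001 | 442.765 | <.0001 | 440.47015 | <.0001
duration | no visible relation | -0.06197 | 0.0808 | -1.17686 | 0.0808 | 0.09881 | 0.6258
height | no visible relation | 0.12306 | 0.0003 | 11.40984 | 0.0003 | 13.93222 | <.0001
no_psg | no visible relation | -0.02778 | 0.42 | -3.4422 | 0.42 | -2.05743 | 0.1545
pitch | no visible relation | 0.10294 | 0.0027 | 180.88083 | 0.0027 | -3.60074 | 0.8528
speed_air | strong positive, little curve | 0.94728 | <.0001 | 82.17473 | <.0001 | 87.61587 | <.0001
speed_ground | strong positive, little curve | 0.862 | <.0001 | 41.96801 | <.0001 | -3.96633 | 0.5562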
Output: all variables and the model are significant. Almost 97% of the variability in the data is explained by the model.
Fit diagnostics show that the residuals are not random; they show a pattern.
Observations
The signs of some regression parameters change when the regression is run with all independent variables together.
Speed_ground is removed from the model as its collinearity with speed_air was affecting the regression
parameters of other variables.
No_pasg, duration and pitch are not significant, hence removed from the model.
Fit diagnostics show that residuals are not symmetric.
Conclusion
The linear model fits the data after removing non-significant variables, but the residual plots show a curved pattern.
The model has an R square of approximately 97%. We need to run diagnostics to understand the residual behavior.
Chapter 4 – Model Validation
Goals
Analyze residual plot to check if it’s random and check if mean of residuals is zero
SAS Code and Output
Copy residuals in a separate dataset using below code
proc reg data=project.flight_coded;
model distance=aircraft_type height speed_air / r;
output out=project.model1_residuals r=residual;
run;
Code to check distribution and hypothesis for mean=0 of residuals
/* residuals are not symmetric and fail the Shapiro-Wilk normality test */
proc univariate data=project.model1_residuals normal plot;
var residual;
run;
/* null hypothesis of mean=0 is not rejected as p value is 1 */
proc ttest data=project.model1_residuals;
var residual;
run;
Distribution is not normal for residuals:
Residuals also fail normality test as highlighted p-value is less than 0.05
MAPE (mean absolute percentage error) is calculated using the code below; the value is 4.22%.
data project.model1_mape;
set project.model1_residuals;
err=abs(residual)/distance;
keep err;
run;
proc sql;
create table project.model1_mape_t as
select avg(err) as mape from project.model1_mape;
quit;
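The quantity computed by the code above can be written out as a formula. The notation is an assumption introduced here: $y_i$ is the observed landing distance, $\hat{y}_i$ the fitted value, and $n$ the number of observations used in the fit, so that the residual is $y_i - \hat{y}_i$.

$$\text{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \frac{\lvert y_i - \hat{y}_i \rvert}{y_i}$$

The code averages abs(residual)/distance without the factor of 100, so its output is a fraction; a result of about 0.042 would correspond to the roughly 4% reported for the base model.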
Observations
Residuals are not symmetric and also fail the normality test. So, the linear model is not a good fit. MAPE is 4.22%.
Conclusion
Linear model generated in previous chapter is not a good fit. Transformations are required on data as residuals
have pattern in form of curve.
Chapter 5 – Remodeling and model validation
Goals
Transform independent variables so that residuals are random and have normal distribution
Create alternative models and compare against base model to find best fit.
SAS Code and Output
Considering model created in Chapter 3 as base model, we will now create alternative model using transformed
speed_air variable.
As per Chapter 2 observations, transformed speed_air variable (after applying square) has linear plot with
distance. Transformed speed_air and speed_ground variables (after applying square) also have strong positive
linear relation. Below is the code to fit linear model using transformed variable:
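proc reg data=project.flights_coded_t;
model distance=aircraft_type height speed_air2 / r;
/* save residuals for the validation and MAPE steps that follow */
output out=project.model2_residuals r=residual;
run;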
The model has an Adjusted R square of 98.24%, which is a little better than the base model. Below is the output:
Fit diagnostics show that residuals look random:
Here the residuals pass normality test too as highlighted in below figure:
Residuals have zero mean based on below hypothesis test. p-value is 1, so we can’t reject null hypothesis that
mean of residuals is 0.
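The checks described above mirror the Chapter 4 validation steps. A minimal sketch of how they might be run for the alternate model is shown here; it assumes the residuals were saved to project.model2_residuals with a variable named residual, which is the dataset the MAPE step for this model reads.

```sas
/* Assumes project.model2_residuals holds the alternate model's residuals  */
/* in a variable named residual.                                           */

/* normality tests (including Shapiro-Wilk) and distribution plots */
proc univariate data=project.model2_residuals normal plot;
  var residual;
run;

/* test of the null hypothesis that the mean residual is 0 */
proc ttest data=project.model2_residuals h0=0;
  var residual;
run;
```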
MAPE is calculated as below – it comes as 3.65%. It’s good estimate to understand error in data and is lower than
the base model.
data project.model2_mape;
set project.model2_residuals;
err=abs(residual)/distance;
keep err;
run;
proc sql;
create table project.model2_mape_t as
select avg(err) as mape from project.model2_mape;
quit;
Conclusion
Base model created without any transformation is not a good model based on model diagnostics and fit test.
Alternate model created using speed_air square transformation gives best fit in terms of R square, MAPE, zero
mean of residuals and normal distributions of residuals.
The significant variables for the alternate model are - Aircraft_type, speed_air**2 and height. Approx 98% of
variability in data is explained using this model. It has MAPE of 3.65%.
Questions from project pdf
How many observations (flights) do you use to fit your final model? If not all 950 flights, why?
831 observations used after removing below rows:
• Duplicate 100 rows
• Rows with negative values of height – 5
• Rows with abnormal observations for each variable defined for the project – 14
However, final model is fit using speed_air variable that has missing values. So, model finally used 208 observations.
What factors and how they impact the landing distance of a flight?
Aircraft_type, speed_air (squared) and height affect landing distance as per the equation below:
$$\text{landing distance} = -2031.73 + 405.21 \cdot \text{aircraft\_type} + 0.38 \cdot \text{speed\_air}^2 + 13.94 \cdot \text{height}$$
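To illustrate how the equation is applied, consider a hypothetical Boeing flight (aircraft_type = 1) with speed_air = 100 and height = 30. These input values are made up for illustration and are expressed in the dataset's own units; they are not taken from the data.

$$\text{landing distance} = -2031.73 + 405.21 \cdot 1 + 0.38 \cdot 100^2 + 13.94 \cdot 30 = -2031.73 + 405.21 + 3800 + 418.2 = 2591.68$$

For an Airbus (aircraft_type = 0) with the same speed_air and height, the aircraft term drops out and the prediction is $2591.68 - 405.21 = 2186.47$, so under this model the aircraft type alone shifts the predicted distance by 405.21.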
Is there any difference between the two makes Boeing and Airbus?
Mean landing distance of Boeing is more than mean landing distance of Airbus as per the TTEST results below. The 2
scenarios of overrun are from Boeing only, with speed_ground more than 135 mph. There is a difference in landing
distance means for each aircraft type when the mean of speed_ground is the same.
Lower tail TTEST for landing distance: the null hypothesis that the mean landing distance of Airbus is greater than or
equal to that of Boeing is rejected. Hence, Boeing has the greater mean landing distance at the 95% confidence level.
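A lower-tail two-sample test of this kind might be run as sketched below. The sketch assumes the cleaned dataset from Chapter 1 and that the aircraft values sort with airbus before boeing, so that the tested difference is Airbus minus Boeing; the report does not show the exact code used.

```sas
/* Sketch: one-sided two-sample t-test on landing distance.                */
/* With classes ordered airbus, boeing the difference is airbus - boeing;  */
/* SIDES=L makes the alternative "difference < 0", i.e. Airbus lands       */
/* shorter than Boeing on average.                                         */
proc ttest data=project.combined_flights_updated sides=L alpha=0.05;
  class aircraft;
  var distance;
run;
```

A small p-value here would reject the null hypothesis that the Airbus mean is greater than or equal to the Boeing mean.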