Lab 1: Data cleaning, exploration, removal of outliers, Correlation of Continuous Variables and Log Error (Target Variable), scatterplot analysis, adding new data features, Categorical and Continuous Feature Importance
https://github.com/yaowser/data_mining_group_project
https://www.kaggle.com/c/zillow-prize-1/data
From the Zillow real estate data set of properties in the southern California area, conduct the following data cleaning, data analysis, predictive analysis, and machine learning algorithms:
Lab 1: Data cleaning, exploration, removal of outliers, Correlation of Continuous Variables and Log Error (Target Variable), scatterplot analysis, adding new data features, Categorical and Continuous Feature Importance
Zillow Dataset Analysis and Visualization
MSDS 7331 Data Mining - Section 403 - Lab 1
Team: Ivelin Angelov, Yao Yao, Kaitlin Kirasich, Albert Asuncion
Business Understanding
10 points
Description:
Describe the purpose of the data set you selected (i.e., why was this data collected in the first place?). Describe how you would define and measure the outcomes from the dataset. That is, why is this data important and how do you know if you have mined useful knowledge from the dataset? How would you measure the effectiveness of a good prediction algorithm? Be specific.
Answer:
Origin and purpose of dataset
This is a dataset from a Kaggle competition: "Zillow Prize: Zillow’s Home Value Prediction (Zestimate)". To download the accompanying data files, refer to this link: https://www.kaggle.com/c/zillow-prize-1/data
Note: The dataset has 2985217 rows and 58 columns and requires at least 2 GB of free RAM to load.
Zillow, a leading real estate and rental marketplace platform, developed a model to estimate property prices based on property features, which they call the "Zestimate". As with every real-world model, the Zestimate has some error associated with it. Zestimates are estimated home values based on 7.5 million statistical and machine learning models that analyze hundreds of data points on each property.
The purpose of this dataset and Kaggle competition is to minimize the error between the Zestimate (what we will predict) and the actual sale price, given certain features of a home.
Description of dataset
We are provided with a full dataset of real estate properties in three counties in California: Los Angeles, Orange, and Ventura in 2016. The dataset contains:
ID for the listing
57 variables describing the property features, such as the number of bedrooms and various measurements in square feet
Two resulting variables: logerror and transactiondate
The dataset has two parts:
Training data (90275 rows), which contains logerror and transactiondate and has all the
transactions before October 15, 2016, plus some of the transactions after October 15,
2016.
Testing data (2895067 rows), which contains the rest of the transactions between October
15 and December 31, 2016.
A successful measure of how well we predict log error will be how well we can clean and train our
data, measured by our placement in the Kaggle competition. Kaggle measures the effectiveness of a
good prediction algorithm by comparing the Zestimate to the actual sale price. The log error is
defined as:

logerror = log(Zestimate) - log(SalePrice)

where logerror < 0 represents Zestimates lower than the actual sale price and logerror > 0 represents
Zestimates higher than the actual sale price.
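As a quick illustration of this sign convention, here is a small computation with made-up numbers (a hypothetical Zestimate and sale price, not values from the dataset):

import numpy as np

# Hypothetical example: a Zestimate of $330,000 for a home that actually sold for $300,000
zestimate = 330000.0
sale_price = 300000.0

logerror = np.log(zestimate) - np.log(sale_price)
print(round(logerror, 4))  # roughly 0.0953: positive, so the Zestimate overshot the sale price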
Our notebook
This notebook is an exploratory analysis for the dataset described above. Our study is organized as
follows:
Data Meaning
Data Quality (EDA)
Review of variables
Identification of missing values and outliers
Data cleansing
Visualizations
Simple Statistics
Visualize Attributes
Explore Joint Attributes
Explore Attributes and Classes
New Features
Exceptional Work
References/Citations
Conclusion
From the correlation table and the random forest and linear regression feature importances, we found
that regionidzip, calculatedfinishedsquarefeet, bedroomcnt, censustractandblock,
regionidneighborhood, and taxdelinquencyyear are the most important variables for building
our prediction model.
Future work
In future lab notebooks, we will predict the logerror with a regression model. To measure the
effectiveness of a prediction algorithm, we will first apply cross-validation by splitting the
labeled data into training, validation, and testing sets to estimate our prediction error. A final
prediction error will be given by Kaggle when we submit our predictions to the competition.
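A minimal sketch of how that split could look with scikit-learn (the split ratios, random seed, and variable names here are placeholder assumptions, not the notebook's final modeling code):

from sklearn.model_selection import train_test_split

# Use only the labeled rows (those with a logerror) for supervised modeling
labeled = data[~data['logerror'].isnull()]
X = labeled.drop(['logerror', 'transactiondate'], axis=1)
y = labeled['logerror']

# Hold out a test set first, then carve a validation set out of the remaining training rows
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)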
In [1]:
Data Meaning
10 points
Description:
Describe the meaning and type of data (scale, values, etc.) for each attribute in the data file.
Below is a table of all of the variables in the dataset. We list the variable name, type of data, scale,
and a description.
/usr/local/lib/python2.7/site-packages/matplotlib/__init__.py:878: UserWarning:
axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use th
e latter.
warnings.warn(self.msg_depr % (key, alt_key))
Out[1]: 'The dataset has 2985342 rows and 60 columns'
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
# load datasets here:
train_data = pd.read_csv('../input/train_2016_v2.csv')
data = pd.read_csv('../input/properties_2016.csv', low_memory=False)
data = pd.merge(data, train_data, how='left', on='parcelid')
'The dataset has %d rows and %d columns' % data.shape
In [2]: from IPython.display import display, HTML
variables_description = [
['airconditioningtypeid', 'nominal', 'TBD', 'Type of cooling system present in the home (if any)']
,['architecturalstyletypeid', 'nominal', 'TBD', 'Architectural style of the home (i.e. ranch, colonial, split-level, etc...)']
,['assessmentyear', 'interval', 'TBD', 'The year of the property tax assessment']
,['basementsqft', 'ratio', 'TBD', 'Finished living area below or partially below ground level']
,['bathroomcnt', 'ordinal', 'TBD', 'Number of bathrooms in home including fractional bathrooms']
,['bedroomcnt', 'ordinal', 'TBD', 'Number of bedrooms in home']
,['buildingclasstypeid', 'nominal', 'TBD', 'The building framing type (steel frame, wood frame, concrete/brick)']
,['buildingqualitytypeid', 'ordinal', 'TBD', 'Overall assessment of condition of the building from best (lowest) to worst (highest)']
,['calculatedbathnbr', 'ordinal', 'TBD', 'Number of bathrooms in home including fractional bathroom']
,['calculatedfinishedsquarefeet', 'ratio', 'TBD', 'Calculated total finished living area of the home']
,['censustractandblock', 'nominal', 'TBD', 'Census tract and block ID combined - also contains blockgroup assignment by extension']
,['decktypeid', 'nominal', 'TBD', 'Type of deck (if any) present on parcel']
,['finishedfloor1squarefeet', 'ratio', 'TBD', 'Size of the finished living area on the first (entry) floor of the home']
,['finishedsquarefeet12', 'ratio', 'TBD', 'Finished living area']
,['finishedsquarefeet13', 'ratio', 'TBD', 'Perimeter living area']
,['finishedsquarefeet15', 'ratio', 'TBD', 'Total area']
,['finishedsquarefeet50', 'ratio', 'TBD', 'Size of the finished living area on the first (entry) floor of the home']
,['finishedsquarefeet6', 'ratio', 'TBD', 'Base unfinished and finished area']
,['fips', 'nominal', 'TBD', 'Federal Information Processing Standard code - see https://en.wikipedia.org/wiki/FIPS_county_code for more details']
,['fireplacecnt', 'ordinal', 'TBD', 'Number of fireplaces in a home (if any)']
,['fireplaceflag', 'ordinal', 'TBD', 'Is a fireplace present in this home']
,['fullbathcnt', 'ordinal', 'TBD', 'Number of full bathrooms (sink, shower + bathtub, and toilet) present in home']
,['garagecarcnt', 'ordinal', 'TBD', 'Total number of garages on the lot including an attached garage']
,['garagetotalsqft', 'ratio', 'TBD', 'Total number of square feet of all garages on lot including an attached garage']
,['hashottuborspa', 'ordinal', 'TBD', 'Does the home have a hot tub or spa']
,['heatingorsystemtypeid', 'nominal', 'TBD', 'Type of home heating system']
,['landtaxvaluedollarcnt', 'ratio', 'TBD', 'The assessed value of the land area of the parcel']
,['latitude', 'interval', 'TBD', 'Latitude of the middle of the parcel multiplied by 10e6']
,['logerror', 'interval', 'TBD', 'Error or the Zillow model response variable']
,['longitude', 'interval', 'TBD', 'Longitude of the middle of the parcel multiplied by 10e6']
,['lotsizesquarefeet', 'ratio', 'TBD', 'Area of the lot in square feet']
,['numberofstories', 'ordinal', 'TBD', 'Number of stories or levels the home has']
,['parcelid', 'nominal', 'TBD', 'Unique identifier for parcels (lots)']
,['poolcnt', 'ordinal', 'TBD', 'Number of pools on the lot (if any)']
,['poolsizesum', 'ratio', 'TBD', 'Total square footage of all pools on property']
,['pooltypeid10', 'nominal', 'TBD', 'Spa or Hot Tub']
,['pooltypeid2', 'nominal', 'TBD', 'Pool with Spa/Hot Tub']
,['pooltypeid7', 'nominal', 'TBD', 'Pool without hot tub']
,['propertycountylandusecode', 'nominal', 'TBD', "County land use code i.e. it's zoning at the county level"]
,['propertylandusetypeid', 'nominal', 'TBD', 'Type of land use the property is zoned for']
,['propertyzoningdesc', 'nominal', 'TBD', 'Description of the allowed land uses (zoning) for that property']
,['rawcensustractandblock', 'nominal', 'TBD', 'Census tract and block ID combined - also contains blockgroup assignment by extension']
,['regionidcity', 'nominal', 'TBD', 'City in which the property is located (if any)']
,['regionidcounty', 'nominal', 'TBD', 'County in which the property is located']
,['regionidneighborhood', 'nominal', 'TBD', 'Neighborhood in which the property is located']
,['regionidzip', 'nominal', 'TBD', 'Zip code in which the property is located']
,['roomcnt', 'ordinal', 'TBD', 'Total number of rooms in the principal residence']
,['storytypeid', 'nominal', 'TBD', 'Type of floors in a multi-story house (i.e. basement and main level, split-level, attic, etc.). See tab for details.']
,['structuretaxvaluedollarcnt', 'ratio', 'TBD', 'The assessed value of the built structure on the parcel']
,['taxamount', 'ratio', 'TBD', 'The total property tax assessed for that assessment year']
,['taxdelinquencyflag', 'nominal', 'TBD', 'Property taxes for this parcel are past due as of 2015']
,['taxdelinquencyyear', 'interval', 'TBD', 'Year']
,['taxvaluedollarcnt', 'ratio', 'TBD', 'The total tax assessed value of the parcel']
,['threequarterbathnbr', 'ordinal', 'TBD', 'Number of 3/4 bathrooms in house (shower + sink + toilet)']
,['transactiondate', 'nominal', 'TBD', 'Date of the transaction response variable']
,['typeconstructiontypeid', 'nominal', 'TBD', 'What type of construction material was used to construct the home']
,['unitcnt', 'ordinal', 'TBD', 'Number of units the structure is built into (i.e. 2 = duplex, 3 = triplex, etc...)']
,['yardbuildingsqft17', 'interval', 'TBD', 'Patio in yard']
,['yardbuildingsqft26', 'interval', 'TBD', 'Storage shed/building in yard']
,['yearbuilt', 'interval', 'TBD', 'The Year the principal residence was built']
]
variables = pd.DataFrame(variables_description, columns=['name', 'type', 'scale', 'description'])
variables = variables.set_index('name')
variables = variables.loc[data.columns]
def output_variables_table(variables):
    variables = variables.sort_index()
    rows = ['<tr><th>Variable</th><th>Type</th><th>Scale</th><th>Description</th></tr>']
    for vname, atts in variables.iterrows():
        atts = atts.to_dict()
        # add scale if TBD
        if atts['scale'] == 'TBD':
            if atts['type'] in ['nominal', 'ordinal']:
                uniques = data[vname].unique()
                uniques = list(uniques.astype(str))
                if len(uniques) < 10:
                    atts['scale'] = '[%s]' % ', '.join(uniques)
                else:
                    atts['scale'] = '[%s]' % (', '.join(uniques[:5]) + ', ... (%d More)' % (len(uniques) - 5))
            if atts['type'] in ['ratio', 'interval']:
                atts['scale'] = '(%d, %d)' % (data[vname].min(), data[vname].max())
        row = (vname, atts['type'], atts['scale'], atts['description'])
        rows.append('<tr><td>%s</td><td>%s</td><td>%s</td><td>%s</td></tr>' % row)
    return HTML('<table>%s</table>' % ''.join(rows))
output_variables_table(variables)
Out[2]:
Variable | Type | Scale | Description
airconditioningtypeid | nominal | [nan, 1.0, 13.0, 5.0, 11.0, 9.0, 12.0, 3.0] | Type of cooling system present in the home (if any)
architecturalstyletypeid | nominal | [nan, 7.0, 21.0, 8.0, 2.0, 3.0, 5.0, 10.0, 27.0] | Architectural style of the home (i.e. ranch, colonial, split-level, etc...)
assessmentyear | interval | (2000, 2016) | The year of the property tax assessment
basementsqft | ratio | (20, 8516) | Finished living area below or partially below ground level
bathroomcnt | ordinal | [0.0, 2.0, 4.0, 3.0, 1.0, ... (38 More)] | Number of bathrooms in home including fractional bathrooms
bedroomcnt | ordinal | [0.0, 4.0, 5.0, 2.0, 3.0, ... (22 More)] | Number of bedrooms in home
buildingclasstypeid | nominal | [nan, 3.0, 4.0, 5.0, 2.0, 1.0] | The building framing type (steel frame, wood frame, concrete/brick)
buildingqualitytypeid | ordinal | [nan, 7.0, 4.0, 10.0, 1.0, ... (13 More)] | Overall assessment of condition of the building from best (lowest) to worst (highest)
calculatedbathnbr | ordinal | [nan, 2.0, 4.0, 3.0, 1.0, ... (35 More)] | Number of bathrooms in home including fractional bathroom
calculatedfinishedsquarefeet | ratio | (1, 952576) | Calculated total finished living area of the home
censustractandblock | nominal | [nan, 6.1110010011e+13, 6.1110009032e+13, 6.1110010024e+13, 6.1110010023e+13, ... (96772 More)] | Census tract and block ID combined - also contains blockgroup assignment by extension
decktypeid | nominal | [nan, 66.0] | Type of deck (if any) present on parcel
finishedfloor1squarefeet | ratio | (3, 31303) | Size of the finished living area on the first (entry) floor of the home
finishedsquarefeet12 | ratio | (1, 290345) | Finished living area
finishedsquarefeet13 | ratio | (120, 2688) | Perimeter living area
finishedsquarefeet15 | ratio | (112, 820242) | Total area
finishedsquarefeet50 | ratio | (3, 31303) | Size of the finished living area on the first (entry) floor of the home
finishedsquarefeet6 | ratio | (117, 952576) | Base unfinished and finished area
fips | nominal | [6037.0, 6059.0, 6111.0, nan] | Federal Information Processing Standard code - see https://en.wikipedia.org/wiki/FIPS_county_code for more details
fireplacecnt | ordinal | [nan, 3.0, 1.0, 2.0, 4.0, ... (10 More)] | Number of fireplaces in a home (if any)
fireplaceflag | ordinal | [nan, True] | Is a fireplace present in this home
fullbathcnt | ordinal | [nan, 2.0, 4.0, 3.0, 1.0, ... (21 More)] | Number of full bathrooms (sink, shower + bathtub, and toilet) present in home
garagecarcnt | ordinal | [nan, 2.0, 4.0, 1.0, 3.0, ... (25 More)] | Total number of garages on the lot including an attached garage
garagetotalsqft | ratio | (0, 7749) | Total number of square feet of all garages on lot including an attached garage
hashottuborspa | ordinal | [nan, True] | Does the home have a hot tub or spa
heatingorsystemtypeid | nominal | [nan, 2.0, 7.0, 20.0, 6.0, ... (15 More)] | Type of home heating system
landtaxvaluedollarcnt | ratio | (1, 90246219) | The assessed value of the land area of the parcel
latitude | interval | (33324388, 34819650) | Latitude of the middle of the parcel multiplied by 10e6
logerror | interval | (-4, 4) | Error or the Zillow model response variable
longitude | interval | (-119475780, -117554316) | Longitude of the middle of the parcel multiplied by 10e6
lotsizesquarefeet | ratio | (100, 328263808) | Area of the lot in square feet
numberofstories | ordinal | [nan, 1.0, 4.0, 2.0, 3.0, ... (13 More)] | Number of stories or levels the home has
parcelid | nominal | [10754147, 10759547, 10843547, 10859147, 10879947, ... (2985217 More)] | Unique identifier for parcels (lots)
poolcnt | ordinal | [nan, 1.0] | Number of pools on the lot (if any)
poolsizesum | ratio | (19, 17410) | Total square footage of all pools on property
pooltypeid10 | nominal | [nan, 1.0] | Spa or Hot Tub
pooltypeid2 | nominal | [nan, 1.0] | Pool with Spa/Hot Tub
pooltypeid7 | nominal | [nan, 1.0] | Pool without hot tub
propertycountylandusecode | nominal | [010D, 0109, 1200, 1210, 010V, ... (241 More)] | County land use code i.e. its zoning at the county level
propertylandusetypeid | nominal | [269.0, 261.0, 47.0, 31.0, 260.0, ... (16 More)] | Type of land use the property is zoned for
propertyzoningdesc | nominal | [nan, LCA11*, LAC2, LAM1, LAC4, ... (5639 More)] | Description of the allowed land uses (zoning) for that property
rawcensustractandblock | nominal | [60378002.041, 60378001.011, 60377030.012, 60371412.023, 60371232.052, ... (99394 More)] | Census tract and block ID combined - also contains blockgroup assignment by extension
regionidcity | nominal | [37688.0, 51617.0, 12447.0, 396054.0, 47547.0, ... (187 More)] | City in which the property is located (if any)
regionidcounty | nominal | [3101.0, 1286.0, 2061.0, nan] | County in which the property is located
regionidneighborhood | nominal | [nan, 27080.0, 46795.0, 274049.0, 31817.0, ... (529 More)] | Neighborhood in which the property is located
regionidzip | nominal | [96337.0, 96095.0, 96424.0, 96450.0, 96446.0, ... (406 More)] | Zip code in which the property is located
roomcnt | ordinal | [0.0, 8.0, 4.0, 5.0, 7.0, ... (37 More)] | Total number of rooms in the principal residence
storytypeid | nominal | [nan, 7.0] | Type of floors in a multi-story house (i.e. basement and main level, split-level, attic, etc.). See tab for details.
structuretaxvaluedollarcnt | ratio | (1, 251486000) | The assessed value of the built structure on the parcel
taxamount | ratio | (1, 3458861) | The total property tax assessed for that assessment year
taxdelinquencyflag | nominal | [nan, Y] | Property taxes for this parcel are past due as of 2015
taxdelinquencyyear | interval | (0, 99) | Year
taxvaluedollarcnt | ratio | (1, 282786000) | The total tax assessed value of the parcel
threequarterbathnbr | ordinal | [nan, 1.0, 2.0, 4.0, 3.0, 6.0, 5.0, 7.0] | Number of 3/4 bathrooms in house (shower + sink + toilet)
transactiondate | nominal | [nan, 2016-01-27, 2016-03-30, 2016-05-27, 2016-06-07, ... (353 More)] | Date of the transaction response variable
typeconstructiontypeid | nominal | [nan, 6.0, 4.0, 10.0, 13.0, 11.0] | What type of construction material was used to construct the home
unitcnt | ordinal | [nan, 2.0, 1.0, 3.0, 5.0, ... (147 More)] | Number of units the structure is built into (i.e. 2 = duplex, 3 = triplex, etc...)
yardbuildingsqft17 | interval | (10, 7983) | Patio in yard
yardbuildingsqft26 | interval | (10, 6141) | Storage shed/building in yard
yearbuilt | interval | (1801, 2015) | The Year the principal residence was built
Data Quality
15 points
Description:
Verify data quality: Explain any missing values, duplicate data, and outliers. Are those mistakes?
How do you deal with these problems? Give justifications for your methods.
Examining Distribution of Missing Values
Most of the rows have about 30 missing values. Observations with 57 missing values are missing
nearly every feature, so we chose to remove those rows (sketched below). For the remaining missing
values, we impute appropriate values below.
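A sketch of how those nearly-empty rows could be dropped (the cutoff of 57 comes from the observation above; the exact code the authors used may differ):

# Count missing values per row and keep only rows where at least one
# property feature besides parcelid is present (fewer than 57 missing values)
number_missing_per_row = data.isnull().sum(axis=1)
data = data[number_missing_per_row < 57].copy()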
In [3]:
All observations have a value for parcelid
In [4]:
0.38 percent of the data has only parcelid present and all other variables
Out[4]: 0
plt.rcParams['figure.figsize'] = [10, 7]
number_missing_per_row = data.isnull().sum(axis=1)
sns.distplot(number_missing_per_row, color="#34495e", kde=False);
plt.title('Distribution of Missing Values', fontsize=15)
plt.xlabel('Number of Missing Values', fontsize=15)
plt.ylabel('Number of Rows', fontsize=15);
data['parcelid'].isnull().sum()
Examining Variables for Missing Values and Outliers
For ratio and interval variables (and nominal ones where appropriate), we wrote a function that
finds values more than 5 standard deviations from the mean and caps them at 5 standard deviations
above or below the mean, respectively.
In [10]:
Variable: airconditioningtypeid - Type of cooling system present in the
home (if any)
Has datatype: nominal and 72.710860 percent of values missing
For this variable, missing values indicate the absence of a cooling system. We replace all missing
values with 0 to represent no cooling system. We changed the column datatype to integer.
In [11]:
Variable: architecturalstyletypeid - Architectural style of the home (i.e.
ranch, colonial, split-level, etc…)
Has datatype: nominal and 99.796185 percent of values missing
Architectural style describes the home design. As such, it is not something we can extrapolate a
value for. With over 99% of values missing, we decided to eliminate this variable.
('Before', array([ nan, 1., 13., 5., 11., 9., 12., 3.]))
('After', array([ 0, 1, 13, 5, 11, 9, 12, 3]))
def fix_outliers(data, column):
    mean = data[column].mean()
    std = data[column].std()
    max_value = mean + std * 5
    min_value = mean - std * 5
    if data[column].max() < max_value and data[column].min() > min_value:
        print('No outliers found')
        return
    print('Outliers found!')
    f, ((ax0, ax1), (ax2, ax3)) = plt.subplots(nrows=2, ncols=2, figsize=[15, 7])
    f.subplots_adjust(hspace=.4)
    sns.boxplot(data[column].dropna(), ax=ax0, color="#34495e").set_title('Before')
    sns.distplot(data[column].dropna(), ax=ax2, color="#34495e").set_title('Before')
    data.loc[data[column] > max_value, column] = max_value
    data.loc[data[column] < min_value, column] = min_value
    sns.boxplot(data[column].dropna(), ax=ax1, color="#34495e").set_title('After')
    sns.distplot(data[column].dropna(), ax=ax3, color="#34495e").set_title('After')
print('Before', data['airconditioningtypeid'].unique())
data['airconditioningtypeid'] = data['airconditioningtypeid'].fillna(0).astype(np.int32)
print('After', data['airconditioningtypeid'].unique())
In [12]:
Variable: assessmentyear - year of the property tax assessment
Has datatype: interval and has 2 values missing
We replaced the missing values with the latest tax year which also happens to be the median tax
year. We changed the column datatype to integer.
In [13]:
Variable: basementsqft - Finished living area below or partially below
ground level
Has datatype: ratio and 99.945255 percent of values missing
Basements are not standard home features. Whenever a basement is not a feature of the home,
the value for area was entered as a missing value. With over 99% of values missing, we decided to
eliminate this variable.
In [14]:
Variable: bathroomcnt - Number of bathrooms in home including
fractional bathrooms
Has datatype: ordinal and 0.000841 percent of values missing
It is plausible for a property to have no bathroom, so we replaced the few missing values with
zero. We changed the column datatype to a float.
('Before', array([ 2015., 2014., 2003., 2012., 2001., 2011., 2013., 201
6.,
2010., nan, 2004., 2005., 2002., 2000., 2009.]))
('After', array([2015, 2014, 2003, 2012, 2001, 2011, 2013, 2016, 2010, 2004, 20
05,
2002, 2000, 2009]))
del data['architecturalstyletypeid']
print('Before', data['assessmentyear'].unique())
median_value = data['assessmentyear'].median()
data['assessmentyear'] = data['assessmentyear'].fillna(median_value).astype(np.int32)
print('After', data['assessmentyear'].unique())
del data['basementsqft']
In [15]:
Variable: bedroomcnt - Number of bedrooms in home
Has datatype: ordinal and 0.000437 percent of values missing
Since there are only very few missing values, we decided to replace them with zero, which can
represent a studio apartment. We changed the column datatype to integer.
In [16]:
Variable: buildingclasstypeid - The building framing type (steel frame,
wood frame, concrete/brick)
Has datatype: nominal and 99.576949 percent of values missing
With this many missing values and the difficulty of assigning a building framing type, we decided to
remove this variable.
In [17]:
Variable: buildingqualitytypeid - Overall assessment of condition of the
building from best (lowest) to worst (highest)
Has datatype: ordinal and 34.81 percent of values missing
We chose to replace the missing values with the median of the condition assessment instead of
giving the missing values the best or worst value. We changed the column datatype to integer.
('Before', array([ 0. , 2. , 4. , 3. , 1. , 2.5 , 3.5 , 5.
,
1.5 , 4.5 , 7.5 , 5.5 , 6. , 7. , 10. , 8. ,
9. , 12. , 11. , 8.5 , 6.5 , 13. , 9.5 , 14. ,
20. , 19.5 , 15. , 10.5 , nan, 18. , 16. , 1.75,
17. , 19. , 0.5 , 12.5 , 11.5 , 14.5 ]))
('After', array([ 0. , 2. , 4. , 3. , 1. , 2.5 , 3.5 , 5.
,
1.5 , 4.5 , 7.5 , 5.5 , 6. , 7. , 10. , 8. ,
9. , 12. , 11. , 8.5 , 6.5 , 13. , 9.5 , 14. ,
20. , 19.5 , 15. , 10.5 , 18. , 16. , 1.75, 17. ,
19. , 0.5 , 12.5 , 11.5 , 14.5 ]))
('Before', array([ 0., 4., 5., 2., 3., 1., 6., 7., 8., 12.,
11.,
9., 10., 14., 16., 13., nan, 15., 17., 18., 20., 19.]))
('After', array([ 0, 4, 5, 2, 3, 1, 6, 7, 8, 12, 11, 9, 10, 14, 16, 1
3, 15,
17, 18, 20, 19]))
print('Before', data['bathroomcnt'].unique())
data['bathroomcnt'] = data['bathroomcnt'].fillna(0).astype(np.float32)
print('After', data['bathroomcnt'].unique())
print('Before', data['bedroomcnt'].unique())
data['bedroomcnt'] = data['bedroomcnt'].fillna(0).astype(np.int32)
print('After', data['bedroomcnt'].unique())
del data['buildingclasstypeid']
In [18]:
Variable: calculatedbathnbr - Number of bathrooms in home including
fractional bathroom
Has datatype: ordinal and 3.95 percent of values missing
With a low number of missing values, we decided to assign 0 to all missing values since we decided
above it is possible that a property could have 0 bathrooms. We changed the column datatype to a
float.
In [19]:
Variable: calculatedfinishedsquarefeet - Calculated total finished living
area of the home
Has datatype: ratio and 1.48 percent of values missing
These missing values appear to be consistent with 0 or missing values for variables associated with
a building or structure on the property such as bathroomcnt, bedroomcnt, or architecturalstyletypeid.
We can assume that no structures exist on these properties and we decided to impute zeros to
these. We changed the column datatype to integer. We then replaced all outliers with a maximum
and minimum value of (mean ± 5 * std), respectively.
('Before', array([ nan, 7., 4., 10., 1., 12., 8., 3., 6., 9.,
5.,
11., 2.]))
('After', array([ 7, 4, 10, 1, 12, 8, 3, 6, 9, 5, 11, 2]))
('Before', array([ nan, 2. , 4. , 3. , 1. , 2.5, 3.5, 5. , 1.
5,
4.5, 7.5, 5.5, 6. , 7. , 10. , 8. , 9. , 12. ,
11. , 8.5, 6.5, 13. , 9.5, 14. , 20. , 19.5, 15. ,
10.5, 18. , 16. , 17. , 19. , 12.5, 11.5, 14.5]))
('After', array([ 0. , 2. , 4. , 3. , 1. , 2.5, 3.5, 5. , 1.5,
4.5, 7.5, 5.5, 6. , 7. , 10. , 8. , 9. , 12. ,
11. , 8.5, 6.5, 13. , 9.5, 14. , 20. , 19.5, 15. ,
10.5, 18. , 16. , 17. , 19. , 12.5, 11.5, 14.5]))
print('Before', data['buildingqualitytypeid'].unique())
medianQuality = data['buildingqualitytypeid'].median()
data['buildingqualitytypeid'] = data['buildingqualitytypeid'].fillna(medianQuality).astype(np.int32)
print('After', data['buildingqualitytypeid'].unique())
print('Before', data['calculatedbathnbr'].unique())
data['calculatedbathnbr'] = data['calculatedbathnbr'].fillna(0).astype(np.float32)
print('After', data['calculatedbathnbr'].unique())
In [20]:
Variable: censustractandblock - census tract and census block ID
Has datatype: nominal and 2.14 percent of values missing
With such a small number of missing values, we decided to replace them with the median. A better
approach in the future could be to impute the median within each zip code, as sketched below. We
changed the column datatype to a float.
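A hedged sketch of that zip-code-aware imputation (hypothetical helper code, not part of the original notebook):

# Impute missing censustractandblock with the median value within the same zip code,
# falling back to the overall median when the zip code has no known values
zip_median = data.groupby('regionidzip')['censustractandblock'].transform('median')
data['censustractandblock'] = data['censustractandblock'].fillna(zip_median)
data['censustractandblock'] = data['censustractandblock'].fillna(data['censustractandblock'].median())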
In [21]:
Variable: decktypeid - Type of deck (if any) present on parcel
Has datatype: nominal and 99.427311 percent of values missing
Outliers found!
('Before', [nan, 10925.92657277406, 5068.0, 1776.0, 2400.0, 3611.0, 3754.0, 247
0.0, '...'])
('After', [0, 10925, 5068, 1776, 2400, 3611, 3754, 2470, '...'])
('Before', [nan, 61110010011023.0, 61110009032019.0, 61110010024015.0, 61110010
023002.0, 61110010024021.0, 61110010021029.0, 61110010022038.0, '...'])
('After', [60375714234368.0, 61110011035648.0, 61110006841344.0, 6111000264704
0.0, 61110015229952.0, 61110019424256.0, 61110023618560.0, 61110027812864.0,
'...'])
fix_outliers(data, 'calculatedfinishedsquarefeet')
print('Before', data['calculatedfinishedsquarefeet'].unique()[:8].tolist() + ['...'])
data['calculatedfinishedsquarefeet'] = data['calculatedfinishedsquarefeet'].fillna(0).astype(np.int32)
print('After', data['calculatedfinishedsquarefeet'].unique()[:8].tolist() + ['...'])
print('Before', data['censustractandblock'].unique()[:8].tolist() + ['...'])
median_value = data['censustractandblock'].median()
data['censustractandblock'] = data['censustractandblock'].fillna(median_value)
data['censustractandblock'] = data['censustractandblock'].astype(np.float32)
print('After', data['censustractandblock'].unique()[:8].tolist() + ['...'])
A missing value most likely indicates the absence of this feature on the property. With 99% of
values missing, we will remove this column.
In [22]:
Variable: finishedfloor1squarefeet - Size of the finished living area on
the first (entry) floor of the home
Has datatype: ratio and 93.18 percent of values missing
Given how many values are missing and the availability of an alternative variable -
calculatedfinishedsquarefeet - with very few missing values, we decided to eliminate this variable.
In [23]:
Variable: finishedsquarefeet12 - Finished living area
Has datatype: ratio and 8.89 percent of values missing
The finishedsquarefeet fields add up to the calculatedfinishedsquarefeet. Missing values are
therefore zeros. We changed the column datatype to integer. We then replaced all outliers with a
maximum and minimum value of (mean ± 5 * std), respectively.
In [24]:
Outliers found!
('Before', array([ nan, 4000., 3633., ..., 317., 268., 161.]))
('After', array([ 0, 4000, 3633, ..., 317, 268, 161]))
del data['decktypeid']
del data['finishedfloor1squarefeet']
fix_outliers(data, 'finishedsquarefeet12')
print('Before', data['finishedsquarefeet12'].unique())
data['finishedsquarefeet12'] = data['finishedsquarefeet12'].fillna(0).astype(np.int32)
print('After', data['finishedsquarefeet12'].unique())
Variable: finishedsquarefeet13 - Finished living area
Has datatype: ratio and 99.743000 percent of values missing
The finishedsquarefeet fields add up to the calculatedfinishedsquarefeet. Since there are 99%
missing values we will remove this from the dataset.
In [25]:
Variable: finishedsquarefeet15 - Total area
Has datatype: ratio and 93.58 percent of values missing
The finishedsquarefeet fields add up to the calculatedfinishedsquarefeet. Since there are 93%
missing values we will remove this from the dataset.
In [26]:
Variable: finishedsquarefeet50 - Size of the finished living area on the
first (entry) floor of the home
Has datatype: ratio and 93.18 percent of values missing
The finishedsquarefeet fields add up to the calculatedfinishedsquarefeet. Since there are 93%
missing values we will replace the missing values with 0. We changed the column datatype to float.
In [27]:
Variable: finishedsquarefeet6 - Base unfinished and finished area
Has datatype: ratio and 99.26 percent of values missing
With 99% missing values, we decided to delete this variable.
In [28]:
Variable: fips - Federal Information Processing Standard code - see
https://en.wikipedia.org/wiki/FIPS_county_code
(https://en.wikipedia.org/wiki/FIPS_county_code) for more details
Has datatype: nominal with values [6037.0, 6059.0, 6111.0] and no missing values
We changed the column datatype to integer.
In [29]:
Variable: fireplacecnt - Number of fireplaces in a home (if any)
del data['finishedsquarefeet13']
del data['finishedsquarefeet15']
data['finishedsquarefeet50'] = data['finishedsquarefeet50'].fillna(0).astype(np.float32)
del data['finishedsquarefeet6']
data['fips'] = data['fips'].astype(np.int32)
Has datatype: ordinal and 89.486882 percent of values missing
In this dataset, a missing value represents 0 fireplaces. We replaced all missing values with zero
and changed the column datatype to integer.
In [30]:
Variable: fireplaceflag - does the home have a fireplace
Has datatype: ordinal and 99.82 percent of values missing
With 99% missing values, we decided to delete the variable.
In [31]:
Variable: fullbathcnt - Number of full bathrooms (sink, shower +
bathtub, and toilet) present in home
Has datatype: ordinal and 3.95 percent of values missing
We first replaced its missing values with the values of bathroomcnt which is a similar measure. After
that, we have 25 observations missing and we replace them with 0. We changed the column
datatype to a float.
In [32]:
Variable: garagecarcnt - Total number of garages on the lot including an
attached garage
Has datatype: ordinal and 70.298173 percent of values missing
We assume that missing values will represent no garage and replace all missing values with zero.
We changed the column datatype to integer.
('Before', array([ nan, 3., 1., 2., 4., 9., 5., 7., 6., 8.]))
('After', array([0, 3, 1, 2, 4, 9, 5, 7, 6, 8]))
('Before', array([ nan, 2., 4., 3., 1., 5., 7., 6., 10., 8.,
9.,
12., 11., 13., 14., 20., 19., 15., 18., 16., 17.]))
('After', array([ 0. , 2. , 4. , 3. , 1. , 5. , 7. , 6.
,
10. , 8. , 9. , 12. , 11. , 7.5 , 2.5 , 4.5 ,
1.5 , 13. , 14. , 20. , 3.5 , 19. , 5.5 , 15. ,
18. , 16. , 1.75, 6.5 , 17. , 0.5 , 8.5 ]))
print('Before', data['fireplacecnt'].unique())
data['fireplacecnt'] = data['fireplacecnt'].fillna(0).astype(np.int32)
print('After', data['fireplacecnt'].unique())
del data['fireplaceflag']
print('Before', data['fullbathcnt'].unique())
missing_fullbathcnt = data['fullbathcnt'].isnull()
data.loc[missing_fullbathcnt, 'fullbathcnt'] = data['bathroomcnt'][missing_fullbathcnt]
data['fullbathcnt'] = data['fullbathcnt'].astype(np.float32)
print('After', data['fullbathcnt'].unique())
In [33]:
Variable: garagetotalsqft - Total number of square feet of all garages on lot
including an attached garage
Has datatype: ratio and 70.298173 percent of values missing
We first replaced missing garagetotalsqft values with 0 wherever garagecarcnt is 0. We changed the
column datatype to a float. We then replaced all outliers with a maximum and minimum value of
(mean ± 5 * std), respectively.
In [34]:
Variable: hashottuborspa - Does the home have a hot tub or spa
Has datatype: ordinal and 97.679250 percent of values missing
In this dataset, a missing value means the home does not have a hot tub or spa. We replaced all
missing values with 0 and all True values with 1. We changed the column datatype to integer.
[ 0 2 4 1 3 5 7 6 8 9 12 11 10 13 14 15 25 21 18 17 24 19 16 20]
Outliers found!
data['garagecarcnt'] = data['garagecarcnt'].fillna(0).astype(np.int32)
print(data['garagecarcnt'].unique())
fix_outliers(data, 'garagetotalsqft')
data.loc[data['garagecarcnt'] == 0, 'garagetotalsqft'] = 0
data['garagecarcnt'] = data['garagecarcnt'].astype(np.float32)
assert data['garagetotalsqft'].isnull().sum() == 0
In [35]:
Variable: heatingorsystemtypeid - Type of home heating system
Has datatype: nominal and 39.255728 percent of values missing
We replaced all missing values with 0 which will represent a missing heating system type id. We
changed the column datatype to integer.
In [36]:
Variable: landtaxvaluedollarcnt - the assessed value of the land
Has datatype: ratio and 1.89 percent of values missing
We replaced all missing values with the median assessed land values. We changed the column
datatype to integer. We then replaced all outliers with a maximum and minimum value of (mean ± 5
* std), respectively.
('Before', array([nan, True], dtype=object))
('After', array([0, 1]))
('Before', array([ nan, 2., 7., 20., 6., 13., 18., 24., 12., 10.,
1.,
14., 21., 11., 19.]))
('After', array([ 0, 2, 7, 20, 6, 13, 18, 24, 12, 10, 1, 14, 21, 11, 19]))
print('Before', data['hashottuborspa'].unique())
data['hashottuborspa'] = data['hashottuborspa'].fillna(0).replace('True', 1).astype(np.int32)
print('After', data['hashottuborspa'].unique())
print('Before', data['heatingorsystemtypeid'].unique())
data['heatingorsystemtypeid'] = data['heatingorsystemtypeid'].fillna(0).astype(np.int32)
print('After', data['heatingorsystemtypeid'].unique())
In [37]:
Variables: latitude and longitude
Has datatype: interval and no missing values. We changed the column datatype to float.
In [38]:
Variable: logerror - Error or the Zillow model response variable
Has datatype: interval and 96.964429 percent of values missing
We will not fill any missing values because they represent the test part of the dataset. We changed
the column datatype to float.
In [39]:
Variable: lotsizesquarefeet - Area of the lot in square feet
Has datatype: ratio and 8.9 percent of values missing
Outliers found!
('Before', array([ 9.00000000e+00, 2.75160000e+04, 7.62631000e+05, ...,
1.28007500e+06, 3.61063000e+05, 9.54574000e+05]))
('After', array([ 9, 27516, 762631, ..., 1280075, 361063, 954574]))
fix_outliers(data, 'landtaxvaluedollarcnt')
print('Before', data['landtaxvaluedollarcnt'].unique())
median_value = data['landtaxvaluedollarcnt'].median()
data['landtaxvaluedollarcnt'] = data['landtaxvaluedollarcnt'].fillna(median_value).astype(np.int32)
print('After', data['landtaxvaluedollarcnt'].unique())
data[['latitude', 'longitude']] = data[['latitude', 'longitude']].astype(np.float32)
data['logerror'] = data['logerror'].astype(np.float32)
We replace all missing values with 0 which will represent no lot. We changed the column datatype
to integer. We then replaced all outliers with a maximum and minimum value of (mean ± 5 * std),
respectively.
In [40]:
Variable: numberofstories - number of stories or levels the home
has
Has datatype: ordinal and 77.06 percent of values missing
After replacing all outliers with a maximum and minimum value of (mean ± 5 * std), respectively,
we replaced all missing values with 1 to represent a single-story home. We changed the column
datatype to integer.
Outliers found!
fix_outliers(data, 'lotsizesquarefeet')
data['lotsizesquarefeet'] = data['lotsizesquarefeet'].fillna(0).astype(np.float32)
In [41]:
Variable: parcelid - Unique identifier for parcels (lots)
Has datatype: nominal and no values missing. We changed the column datatype to integer.
In [42]:
Variable: poolcnt - Number of pools on the lot (if any)
Has datatype: ordinal and 82.6 percent of values missing
We replaced all missing values with 0 which will represent no pools. We changed the column
datatype to integer.
In [43]:
Variable: poolsizesum - Total square footage of all pools on
Outliers found!
('Before', array([ nan, 1. , 4. , 2. , 3.
,
4.09684575]))
('After', array([1, 4, 2, 3]))
('Before', array([ nan, 1.]))
('After', array([0, 1]))
fix_outliers(data, 'numberofstories')
print('Before', data['numberofstories'].unique())
data['numberofstories'] = data['numberofstories'].fillna(1).astype(np.int32)
print('After', data['numberofstories'].unique())
data['parcelid'] = data['parcelid'].astype(np.int32)
print('Before', data['poolcnt'].unique())
data['poolcnt'] = data['poolcnt'].fillna(0).astype(np.int32)
print('After', data['poolcnt'].unique())
property
Has datatype: ratio and 99 percent of values missing
We replaced all missing values with 0 if number of pools is 0 or with the average poolsizesum
otherwise. We changed the column datatype to a float. We then replaced all outliers with a
maximum and minimum value of (mean ± 5 * std), respectively.
In [44]:
Variable: pooltypeid10 - Spa or Hot Tub
Has datatype: nominal and 98.8 percent of values missing
We replaced all missing values with 0 which will represent no Spa or Hot Tub. We changed the
column datatype to integer.
In [45]:
Variable: pooltypeid2 - Pool with Spa/Hot Tub
Has datatype: nominal and 98.9 percent of values missing
Outliers found!
('Before', array([ nan, 1.]))
('After', array([0, 1]))
fix_outliers(data, 'poolsizesum')
data.loc[data['poolsizesum'].isnull(), 'poolsizesum'] = int(data['poolsizesum'].mean())
data.loc[data['poolcnt'] == 0, 'poolsizesum'] = 0
data['poolcnt'] = data['poolcnt'].astype(np.float32)
print('Before', data['pooltypeid10'].unique())
data['pooltypeid10'] = data['pooltypeid10'].fillna(0).astype(np.int32)
print('After', data['pooltypeid10'].unique())
We replaced all missing values with 0 which will represent no Pool with Spa/Hot Tub. We changed
the column datatype to integer.
In [46]:
Variable: pooltypeid7 - Pool without hot tub
Has datatype: nominal and 83.6 percent of values missing
We replaced all missing values with 0 which will represent no pool without hot tub. We changed the
column datatype to integer.
In [47]:
Variable: propertycountylandusecode - County land use code i.e. it's
zoning at the county level
Has datatype: nominal and 0.02 percent of values missing
We replaced all missing values with 0 which will represent no county land use code. We changed
the column datatype to string.
In [48]:
Variable: propertylandusetypeid - Type of land use the property is zoned
for
Has datatype: nominal and 0 percent of values missing.
We are just changing the datatype to integer
In [49]:
('Before', array([ nan, 1.]))
('After', array([0, 1]))
('Before', array([ nan, 1.]))
('After', array([0, 1]))
('Before', ['010D', '0109', '1200', '1210', '010V', '300V', '0100', '0200',
'...'])
('After', ['010D', '0109', '1200', '1210', '010V', '300V', '0100', '0200',
'...'])
print('Before', data['pooltypeid2'].unique())
data['pooltypeid2'] = data['pooltypeid2'].fillna(0).astype(np.int32)
print('After', data['pooltypeid2'].unique())
print('Before', data['pooltypeid7'].unique())
data['pooltypeid7'] = data['pooltypeid7'].fillna(0).astype(np.int32)
print('After', data['pooltypeid7'].unique())
print('Before', data['propertycountylandusecode'].unique()[:8].tolist() + ['...'])
data['propertycountylandusecode'] = data['propertycountylandusecode'].fillna(0).astype(np.str)
print('After', data['propertycountylandusecode'].unique()[:8].tolist() + ['...'])
data['propertylandusetypeid'] = data['propertylandusetypeid'].astype(np.int32)
Variable: propertyzoningdesc - Description of the allowed land uses
(zoning) for that property
Has datatype: nominal and 33.4 percent of values missing
We replaced all missing values with 0 which will represent no description of the allowed land uses.
We changed the column datatype to string.
In [50]:
Variable: rawcensustractandblock - Census tract and block ID combined
- also contains blockgroup assignment by extension
Has datatype: nominal and 0 percent of values missing
We are just changing the datatype to integer
In [51]:
Variable: regionidcity - City in which the property is located (if
any)
Has datatype: nominal and 1.72 percent of values missing
We replaced any missing values with 0 to represent no city ID and changed the column datatype to
integer.
In [52]:
Variable: regionidcounty - County in which the property is located
('Before', array([nan, 'LCA11*', 'LAC2', ..., 'WCR1400000', 'EMPYYY', 'RMM2*'],
dtype=object))
('After', array(['0', 'LCA11*', 'LAC2', ..., 'WCR1400000', 'EMPYYY', 'RMM2*'],
dtype=object))
('Before', array([ 60378002.041 , 60378001.011002, 60377030.012017, ...,
60590878.032022, 60590626.211013, 60379012.091563]))
('After', array([60378002, 60378001, 60377030, ..., 61110057, 60375324, 6037599
1]))
('Before', [37688.0, 51617.0, 12447.0, 396054.0, 47547.0, nan, 54311.0, 40227.
0, '...'])
('After', [37688, 51617, 12447, 396054, 47547, 0, 54311, 40227, '...'])
print('Before', data['propertyzoningdesc'].unique())
data['propertyzoningdesc'] = data['propertyzoningdesc'].fillna(0).astype(np.str)
print('After', data['propertyzoningdesc'].unique())
print('Before', data['rawcensustractandblock'].unique())
data['rawcensustractandblock'] = data['rawcensustractandblock'].fillna(0).astype(np.int32)
print('After', data['rawcensustractandblock'].unique())
print('Before', data['regionidcity'].unique()[:8].tolist() + ['...'])
data['regionidcity'] = data['regionidcity'].fillna(0).astype(np.int32)
print('After', data['regionidcity'].unique()[:8].tolist() + ['...'])
Has datatype: nominal and 0 percent of values missing. We changed the column datatype to
integer.
In [53]:
Variable: regionidneighborhood - Neighborhood in which the property is
located
Has datatype: nominal and 61.1 percent of values missing
We replaced all missing values with 0 which will represent no region ID neighborhood. We changed
the column datatype to integer.
In [54]:
Variable: regionidzip - Zip code in which the property is located
Has datatype: nominal and 0.08 percent of values missing
We replaced all missing values with 0 which will represent no zip code. We changed the column
datatype to integer.
In [55]:
Variable: roomcnt - Total number of rooms in the principal
residence
Has datatype: ordinal and 0.001 percent of values missing
We replaced all missing values with 1 to represent cases where the total number of rooms in the
principal residence was not reported. We changed the column datatype to integer. We then replaced
all outliers with a maximum and minimum value of (mean ± 5 * std), respectively.
('Before', array([ 3101., 1286., 2061.]))
('After', array([3101, 1286, 2061]))
('Before', [nan, 27080.0, 46795.0, 274049.0, 31817.0, 37739.0, 115729.0, 7877.
0, '...'])
('After', [0, 27080, 46795, 274049, 31817, 37739, 115729, 7877, '...'])
('Before', [96337.0, 96095.0, 96424.0, 96450.0, 96446.0, 96049.0, 96434.0, 9643
6.0, '...'])
('After', [96337, 96095, 96424, 96450, 96446, 96049, 96434, 96436, '...'])
print('Before', data['regionidcounty'].unique())
data['regionidcounty'] = data['regionidcounty'].astype(np.int32)
print('After', data['regionidcounty'].unique())
print('Before', data['regionidneighborhood'].unique()[:8].tolist() + ['...'])
data['regionidneighborhood'] = data['regionidneighborhood'].fillna(0).astype(np.int32)
print('After', data['regionidneighborhood'].unique()[:8].tolist() + ['...'])
print('Before', data['regionidzip'].unique()[:8].tolist() + ['...'])
data['regionidzip'] = data['regionidzip'].fillna(0).astype(np.int32)
print('After', data['regionidzip'].unique()[:8].tolist() + ['...'])
In [56]:
Variable: storytypeid - Type of floors in a multi-story house (i.e.
basement and main level, split-level, attic, etc.). See tab for
details.
Has datatype: nominal and 99.9 percent of values missing
With 99% missing values, we decided to remove this variable.
In [57]:
Variable: structuretaxvaluedollarcnt - the assessed value of the
building
Has datatype: ratio and 1.46 percent of values missing
We replaced all missing values with the median assessed building tax. We changed the column
datatype to integer. We then replaced all outliers with a maximum and minimum value of (mean ± 5
* std), respectively.
Outliers found!
('Before', array([ 0. , 8. , 4. , 5. ,
7. , 6. , 11. , 3. ,
10. , 9. , 2. , 12. ,
15.67699991, 13. , 15. , 14. ,
1. , nan]))
('After', array([ 0, 8, 4, 5, 7, 6, 11, 3, 10, 9, 2, 12, 15, 13, 14,
1]))
fix_outliers(data, 'roomcnt')
print('Before', data['roomcnt'].unique())
data['roomcnt'] = data['roomcnt'].fillna(1).astype(np.int32)
print('After', data['roomcnt'].unique())
del data['storytypeid']
In [58]:
Variable: taxamount - property tax for the assessment year
Has datatype: ratio and 0.66 percent of values missing
We replaced all missing values with the median property tax for the assessment year. We
changed the column datatype to a float. We then replaced all outliers with a maximum and
minimum value of (mean ± 5 * std), respectively.
Outliers found!
('Before', array([ nan, 650756., 571346., ..., 409940., 463704., 43776
5.]))
('After', array([122590, 650756, 571346, ..., 409940, 463704, 437765]))
fix_outliers(data, 'structuretaxvaluedollarcnt')
print('Before', data['structuretaxvaluedollarcnt'].unique())
medTax = np.nanmedian(data['structuretaxvaluedollarcnt'])
data['structuretaxvaluedollarcnt'] = data['structuretaxvaluedollarcnt'].fillna(medTax).astype(np.int32)
print('After', data['structuretaxvaluedollarcnt'].unique())
In [59]:
Variable: taxdelinquencyflag - property taxes from 2015 that are past
due
Has datatype: nominal and 98.10 percent of values missing
We replaced all missing values with 0 representing no past due property taxes and all Y values with
1 representing that there are past due property taxes. We changed the column datatype to integer.
In [60]:
Variable: taxdelinquencyyear - years of delinquency
Has datatype: interval and 98.10 percent of values missing
We replaced all missing values with 0 representing no years of property tax delinquencies. We
changed the column datatype to integer. We then replaced all outliers with a maximum and
minimum value of (mean ± 5 * std), respectively.
Outliers found!
('Before', array([ nan, 20800.37, 14557.57, ..., 33604.04, 12627.18,
15546.14]))
('After', array([ 3991.7800293 , 20800.36914062, 14557.5703125 , ...,
33604.0390625 , 12627.1796875 , 15546.13964844]))
('Before', array([nan, 'Y'], dtype=object))
('After', array([0, 1]))
fix_outliers(data, 'taxamount')
print('Before', data['taxamount'].unique())
median_value = data['taxamount'].median()
data['taxamount'] = data['taxamount'].fillna(median_value).astype(np.float32)
print('After', data['taxamount'].unique())
print('Before', data['taxdelinquencyflag'].unique())
data['taxdelinquencyflag'] = data['taxdelinquencyflag'].fillna(0).replace('Y', 1).astype(np.int32)
print('After', data['taxdelinquencyflag'].unique())
In [61]:
Variable: taxvaluedollarcnt - total tax
Has datatype: ratio and 1.04 percent of values missing
We replaced all missing values with the median total tax amount. We changed the column datatype
to integer. We then replaced all outliers with a maximum and minimum value of (mean ± 5 * std),
respectively.
Outliers found!
('Before', array([ nan, 13. , 15. , 11. ,
14. , 9. , 10. , 8. ,
12. , 7. , 6. , 2. ,
26.79676804, 5. , 3. , 4. ,
0.98797484, 1. ]))
('After', array([ 0, 13, 15, 11, 14, 9, 10, 8, 12, 7, 6, 2, 26, 5, 3,
4, 1]))
fix_outliers(data, 'taxdelinquencyyear')
print('Before', data['taxdelinquencyyear'].unique())
data['taxdelinquencyyear'] = data['taxdelinquencyyear'].fillna(0).astype(np.int32)
print('After', data['taxdelinquencyyear'].unique())
In [62]:
Variable: threequarterbathnbr - Number of 3/4 bathrooms in house
(shower + sink + toilet)
Has datatype: ordinal and 89.5 percent of values missing
We replaced all missing values with 0 which will represent no Number of 3/4 bathrooms in the
property. We changed the column datatype to integer.
In [63]:
Variable: transactiondate - Date of the transaction response
variable
Has datatype: interval and 96.964429 percent of values missing
Will not fill any missing values because they represent the test part of the dataset
Outliers found!
('Before', array([ 9.00000000e+00, 2.75160000e+04, 1.41338700e+06, ...,
4.70248000e+05, 6.43794000e+05, 5.30550000e+05]))
('After', array([ 9, 27516, 1413387, ..., 470248, 643794, 530550]))
('Before', array([ nan, 1., 2., 4., 3., 6., 5., 7.]))
('After', array([0, 1, 2, 4, 3, 6, 5, 7]))
fix_outliers(data, 'taxvaluedollarcnt')
print('Before', data['taxvaluedollarcnt'].unique())
median_value = data['taxvaluedollarcnt'].median()
data['taxvaluedollarcnt'] = data['taxvaluedollarcnt'].fillna(median_value).astype(np.int32)
print('After', data['taxvaluedollarcnt'].unique())
print('Before', data['threequarterbathnbr'].unique())
data['threequarterbathnbr'] = data['threequarterbathnbr'].fillna(0).astype(np.int32)
print('After', data['threequarterbathnbr'].unique())
In [64]:
Variable: typeconstructiontypeid - What type of construction material
was used to construct the home
Has datatype: nominal and 99.7 percent of values missing
With 99% missing values, we decided to remove this variable.
In [65]:
Variable: unitcnt - number of units in the building
Has datatype: ordinal and 33.5 percent of values missing
We replaced all missing values with 1 to represent a single-family home. We changed the column
datatype to integer. We then replaced all outliers with a maximum and minimum value of
(mean ± 5 * std), respectively.
In [66]:
Variable: yardbuildingsqft17 - sq feet of patio in yard
Has datatype: interval and 97.29 percent of values missing
Outliers found!
('Before', [nan, 2.0, 1.0, 3.0, 5.0, 4.0, 9.0, 13.420418204007635, '...'])
('After', array([ 1, 2, 3, 5, 4, 9, 13, 12, 6, 7, 8, 10, 11]))
data['transactiondate'] = pd.to_datetime(data['transactiondate'])
del data['typeconstructiontypeid']
fix_outliers(data, 'unitcnt')
print('Before', data['unitcnt'].unique()[:8].tolist() + ['...'])
data['unitcnt'] = data['unitcnt'].fillna(1).astype(np.int32)
print('After', data['unitcnt'].unique())
We replaced all missing values with 0 representing no patio. We changed the column datatype to
integer. We then replaced all outliers with a maximum and minimum value of (mean ± 5 * std),
respectively.
In [67]:
Variable: yardbuildingsqft26 - storage shed/building in yard
Has datatype: interval and 99.91 percent of values missing
We replaced all missing values with 0 to represent no storage shed or building in the yard. We
changed the column datatype to a float. We then replaced all outliers with a maximum and minimum
value of (mean ± 5 * std), respectively.
Outliers found!
('Before', array([ nan, 450., 94., ..., 969., 1359., 1079.]))
('After', array([ 0, 450, 94, ..., 969, 1359, 1079]))
fix_outliers(data, 'yardbuildingsqft17')
print('Before', data['yardbuildingsqft17'].unique())
data['yardbuildingsqft17'] = data['yardbuildingsqft17'].fillna(0).astype(np.int32)
print('After', data['yardbuildingsqft17'].unique())
In [68]:
Variable: yearbuilt - The Year the residence was built
Has datatype: interval and 1.63 percent of values missing
We replaced all missing values with the median year built of 1963 until we have a better method to
impute. We changed the column datatype to integer.
In [69]:
End of data cleaning
We went through every variable, and the next cell confirms that the dataset has no missing values in
the explanatory variables.
In [70]:
Simple Statistics
Outliers found!
('Before', [nan, 1948.0, 1947.0, 1943.0, 1946.0, 1978.0, 1958.0, 1949.0,
'...'])
('After', [1963, 1948, 1947, 1943, 1946, 1978, 1958, 1949, '...'])
fix_outliers(data, 'yardbuildingsqft26')
data['yardbuildingsqft26'] = data['yardbuildingsqft26'].fillna(0).astype(np.float32)
#there's too many values to print, before and after data redacted
print('Before', data['yearbuilt'].unique()[:8].tolist() + ['...'])
medYear = data['yearbuilt'].median()
data['yearbuilt'] = data['yearbuilt'].fillna(medYear).astype(np.int32)
print('After', data['yearbuilt'].unique()[:8].tolist() + ['...'])
# 'logerror' and 'transactiondate' are future variables and only exist in the training part of the dataset
explanatory_vars = data.columns[~data.columns.isin(['logerror', 'transactiondate'])]
assert np.all(~data[explanatory_vars].isnull())
10 points
Description:
Visualize appropriate statistics (e.g., range, mode, mean, median, variance, counts) for a subset of
attributes. Describe anything meaningful you found from this or if you found something potentially
interesting. Note: You can also use data from other sources for comparison. Explain why the
statistics run are meaningful.
Table of Binary Variables (0 or 1)
We standardized all Yes/No and True/False variables to 1 or 0, respectively. The table below shows
that all binary flags in this dataset represent rare features such as a pool, hot tub, tax delinquency
flag, and three-quarter bathroom.
In [71]:
Summary Statistics of All Continuous Variables
To make the table more readable, we converted all simple statistics of continuous variables to
integers. We lose some precision but we get a better overview. For each variable, we have already
accounted for outliers and standardized missing values. We can immediately see that 0 is the most
common value for many of the variables. To explore further, we chose to visualize each variable that
had non-zero 25% to 75% values in the form of a boxplot and histogram.
Out[71]: Percent with value equal to 1
hashottuborspa 2.320720
poolcnt 17.403347
pooltypeid2 1.078548
pooltypeid7 16.324799
pooltypeid10 1.242172
taxdelinquencyflag 1.898850
threequarterbathnbr 10.584165
bin_vars = ['hashottuborspa', 'poolcnt', 'pooltypeid2', 'pooltypeid7', 'pooltypeid10', 'taxdelinquencyflag', 'threequarterbathnbr']
bin_data = data[bin_vars]
result_table = bin_data.mean() * 100
pd.DataFrame(result_table, columns=['Percent with value equal to 1'])
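The Yes/No and True/False standardization itself happened in the cleaning steps earlier in the notebook. Purely as an illustration (the marker values such as 'Y' or True are assumptions here, not the notebook's exact code), the 0/1 flags could be produced with a mapping like:
# Illustrative sketch: treat 'Y'/True/1 as 1 and everything else, including NaN, as 0
for col in bin_vars:
    data[col] = data[col].isin(['Y', 'y', True, 1]).astype(np.int32)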
In [72]:
Calculated Finished Square Feet
For calculated square feet, the most common value was 0, with a range from 0 to 10898 sqft. Note that we
removed outliers earlier while cleaning the data. The median of 1561 was a little smaller than the
mean of 1784, so we expect to see a slight right skew, which we do below. What is interesting here
is the peak at 0 and then another peak around 1600 to 1800. We continue to have a few properties
with very large square footage (above the 75th percentile of 2124), which is fairly normal for any
area: mostly middle-class homes with a few larger homes mixed in.
Out[72]: count mean std min 25% 50% 75% max
calculatedfinishedsquarefeet 2973905 1784 984 0 1199 1561 2124 1092
finishedsquarefeet12 2973905 1596 958 0 1092 1466 1996 6615
finishedsquarefeet50 2973905 94 390 0 0 0 0 3130
garagetotalsqft 2973905 113 217 0 0 0 0 1610
lotsizesquarefeet 2973905 19810 73796 0 5200 6700 9243 1710
poolsizesum 2973905 90 196 0 0 0 0 1476
yardbuildingsqft17 2973905 8 61 0 0 0 0 1485
yardbuildingsqft26 2973905 0 12 0 0 0 0 2126
yearbuilt 2973905 1964 23 1801 1950 1963 1981 2015
structuretaxvaluedollarcnt 2973905 166367 179850 1 75440 122590 195143 2181
taxvaluedollarcnt 2973905 407695 429374 1 181179 306086 485000 4052
assessmentyear 2973905 2014 0 2000 2015 2015 2015 2016
landtaxvaluedollarcnt 2973905 242391 287722 1 76724 167043 303002 2477
taxamount 2973905 5229 5284 1 2471 3991 6178 5129
taxdelinquencyyear 2973905 0 1 0 0 0 0 26
logerror 90275 0 0 -4 0 0 0 4
train_data = data[~data['logerror'].isnull()]
continous_vars = variables[variables['type'].isin(['ratio', 'interval'])].index
continous_vars = continous_vars[continous_vars.isin(data.columns)]
continous_vars = continous_vars[~continous_vars.isin(['longitude', 'latitude'])]
output_table = data[continous_vars].describe().T
mode_range = data[continous_vars].mode().T
mode_range.columns = ['mode']
mode_range['range'] = data[continous_vars].max() - data[continous_vars].min()
output_table = output_table.join(mode_range)
output_table.astype(int)
In [73]:
Finished Living Area
Similar to calculated finished square feet, finished living area had outliers which we already fixed
above. The range for finished living area is 0 to 6871, with 0 being the mode of the data. The mean
(1596) is about 130 sqft larger than the median (1466); relative to the standard deviation of roughly
960, the two are close.
This variable is bimodal, with a large spike at 0 and another peak, roughly normal with a long right tail,
at around 1400.
We also see a slight spike at the very end of the tail. This suggests that a number of outliers were
clipped to the maximum allowed value (mean + 5 * std).
f, (ax0,ax2) = plt.subplots(nrows=2, ncols=1, figsize=[15, 7])
sns.boxplot(data['calculatedfinishedsquarefeet'], ax=ax0, color="#34495e").set_title('Calculated finished square feet')
sns.distplot(data['calculatedfinishedsquarefeet'], ax=ax2, color="#34495e");
In [74]:
Lot Size Square Feet
Lot size square feet has the largest range from 0 to 1,710,750 even after removing all outliers
(mean + std * 5). The mode for this variable is 0 so we see below a spike at 0 and a very long right
tail.
What is interesting with this variable is the large variance of 73796. The 25th to 75th percentile
values are 5200 and 9243 respectively so we will skipped over the box plot and plotted the
histogram below.
In the histogram, we see a right skewed distribution which makes sense considering the mean is
19810 and the median is 6700 - again, with such a large variance it is difficult for the eye to see the
difference. The main takeaway here is the large number of 0s.
f, (ax0,ax2) = plt.subplots(nrows=2, ncols=1, figsize=[15, 7])
sns.boxplot(data['finishedsquarefeet12'], ax=ax0, color="#34495e").set_title('Finished living area')
sns.distplot(data['finishedsquarefeet12'], ax=ax2, color="#34495e");
In [75]:
Year Built
The year the properties were built ranges from 1801 to 2015. The mode and median of 1963 are only
one year away from the mean of 1964. The distribution seems fairly normal, with a peak in the early
1960s and a drop-off on both sides. We see a number of homes built before 1905 (the low whisker of
the boxplot), which gives us a long left tail.
We see a few other spikes in home building which could correlate with a number of other factors such as
healthy economic growth, political backing of mortgages, or rises in population. The baby boom
around the early 1960s coincides with many houses being built, and around the time that generation
turned 18 more houses appear to have been built. We see an apparent fall right before 2000, which
could reflect the dot-com bust, and another drop around the housing bust of 2007. Because our data
was collected in 2016, we expect to see fewer homes built in the previous year.
What will be interesting with this variable is how old a home has to be before it begins to "fall apart" or
needs major renovations to the piping or foundation. Were homes built in a certain year constructed
from a faulty material that causes damage later on? Will the Zestimate take into account the
disclosures about a home that each sale price typically does?
f, (ax0) = plt.subplots(nrows=1, ncols=1)
sns.distplot(data['lotsizesquarefeet'], ax=ax0, color="#34495e").set_title('Lot size square feet');
In [76]:
Total Tax Value
The total tax value of the property ranges from 1 to 4,052,186. The median of 306,086 is the same
as the mode and a little smaller than the mean of 407,695, which is evident in the right-skewed
distribution below. These values have already been adjusted for outliers, which is why we see a
slight spike at the maximum value from larger developments and unique mansions.
The distribution is fairly similar to the square footage distributions above because the tax is calculated
from the assessed value and the square footage. What is interesting to note here is that the missing tax
values were replaced with the median (hence the median and mode being the same), whereas the missing
square footage values were replaced with 0s (hence 0 being the mode and the second peak in those distributions).
f, (ax0,ax2) = plt.subplots(nrows=2, ncols=1)
sns.boxplot(data['yearbuilt'].dropna(), ax=ax0, color="#34495e").set_title('Year built')
sns.distplot(data['yearbuilt'].dropna(), ax=ax2, color="#34495e");
In [77]:
Building and Land Tax
The building or structure tax has a right-skewed distribution similar to total tax. The values range
from 1 to 2,165,929, already adjusted for outliers and cleaned up with missing values set to the median.
As a result, the median and mode are the same at 122,590, which is lower than the mean of
166,344.
The land tax values range from 1 to 2,477,536, also adjusted for outliers and cleaned up with
missing values set to the median. Because of this, the median and mode are the same at 167,043,
which is lower than the mean of 242,391.
Land tax has a wider 25th-to-75th percentile range than the building tax, and the land values also
vary more (standard deviation of roughly 287k) than the building values (standard deviation of
roughly 179k). We think this could be due to location itself, as better neighborhoods, safer
areas, or better schools could result in a higher assessment than other locations, thus widening the
spread.
f, (ax0,ax2) = plt.subplots(nrows=2, ncols=1)
sns.boxplot(data['taxvaluedollarcnt'], ax=ax0, color="#34495e").set_title('Total tax value')
sns.distplot(data['taxvaluedollarcnt'], ax=ax2, color="#34495e");
In [78]:
Assessment Year
Assessment year is the year the property was assessed. The 25th through 75th percentile values all
fall in 2015, so a box plot is not very helpful here. Instead, we list the unique values of assessment
year along with our histogram.
In the state of California, the base year value is set when you originally purchase the property,
based on the sales price listed on the deed. However, there are exceptions, which is why we see a
few assessment years between 2000 and 2016 thrown in.
For assessment year to be useful for our predictions, we should find out what each exception is and
why the property was not assessed at the point of sale. This could affect the predicted log error.
f, (ax0,ax2,ax3,ax4) = plt.subplots(nrows=4, ncols=1, figsize=[15, 14])
sns.boxplot(data['structuretaxvaluedollarcnt'], ax=ax0, color="#34495e").set_title('Structure tax value')
sns.distplot(data['structuretaxvaluedollarcnt'], ax=ax2, color="#34495e");
sns.boxplot(data['landtaxvaluedollarcnt'], ax=ax3, color="#34495e").set_title('Land tax value')
sns.distplot(data['landtaxvaluedollarcnt'], ax=ax4, color="#34495e");
In [79]:
Visualize Attributes
15 points
Description:
Visualize the most interesting attributes (at least 5 attributes, your opinion on what is interesting).
Important: Interpret the implications for each visualization. Explain for each attribute why the chosen
visualization is appropriate.
Distribution of Target Variable: Logerror
In the training dataset, logerror is the response variable so we are interested in seeing the
distribution of log error that we are training on. We visualize this using a boxplot and histogram to
get a general picture of the overall distribution. It is roughly symmetric around zero, which suggests that the
model generating the logerror has little bias and is accurate in most instances.
('Unique years:', array([2015, 2014, 2003, 2012, 2001, 2011, 2013, 2016, 2010,
2004, 2005,
2002, 2000, 2009]))
print('Unique years:', data['assessmentyear'].unique())
f, (ax2) = plt.subplots(nrows=1, ncols=1, figsize=[15, 4])
sns.distplot(data['assessmentyear'], ax=ax2, color="#34495e")
plt.title('Assessment year distribution');
In [80]:
Count of Bathrooms
We think that the number of bathrooms in a home could be interesting because our data was
collected in California, where rent is very high. It is common to buy a rental property and rent to tenants
who do not know one another, and such tenants may each want their own bathroom. In our case,
most homes have 2 bathrooms. Notably, there are outliers with no bathrooms or suspiciously high
train_data = data[~data['logerror'].isnull()]
x = train_data['logerror']
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True,
gridspec_kw={"height_ratios": (.15, .85)}, figsize=(10, 10))
sns.boxplot(train_data['logerror'][train_data['logerror'].abs()<1], ax=ax_box, color="#34495e")
sns.distplot(
train_data['logerror'][train_data['logerror'].abs()<1],
ax=ax_hist, bins=400, kde=False, color="#34495e");
counts. We see records in the dataset with no bathroom, which we justified above as being possible.
Because we are looking at frequency, we chose to visualize the count of each number of bathrooms
(treated as a category) in a bar chart.
In [81]:
Count of Bedrooms
For the same reasons we were interested in the number of bathrooms, we are also interested in the
number of bedrooms. In our dataset, most properties have 3 bedrooms and we see fewer instances
as we go up or down one bedroom in the data. Here we still see records without any bedrooms
which we justified as studios above. We chose the same visualization (using number of bedrooms
as a category and counting the frequency of each category) displayed in a bar chart below.
sns.countplot(data['bathroomcnt'], color="#34495e")
plt.ylabel('Count', fontsize=12)
plt.xlabel('Bathrooms', fontsize=12)
plt.title("Frequency of Bathroom count", fontsize=15);
In [82]:
Bed to Bath Ratio
After visualizing the distributions of bathroom and bedroom counts, we also thought it would be
interesting to see whether the number of bathrooms depends on the number of bedrooms.
We stuck with a bar chart, this time using the ratio of bedrooms to bathrooms as the quantity whose
frequency we count. We found that most homes have a ratio of roughly 1.5 bedrooms per bathroom.
plt.ylabel('Count', fontsize=12)
plt.xlabel('Bedrooms', fontsize=12)
plt.title("Frequency of Bedrooms count", fontsize=15)
sns.countplot(data['bedroomcnt'], color="#34495e");
In [83]:
Average Tax Per Square Feet
For our last attribute, we calculated the tax per square foot to see if we could find any trends. We
again used a bar chart to plot the frequency of each ratio value. Plotting this exposes extreme
outliers as candidates for elimination. Most properties are under a few dollars per square foot, but as
the visualization reveals, there are suspicious records. However, because this is southern California,
where land for continued growth is limited, some places may legitimately have a high tax per square
foot because they sit in more desirable real estate areas.
non_zero_mask = data['bathroomcnt'] > 0
bedroom = data[non_zero_mask]['bedroomcnt']
bathroom = data[non_zero_mask]['bathroomcnt']
bedroom_to_bath_ratio = bedroom / bathroom
bedroom_to_bath_ratio = bedroom_to_bath_ratio[bedroom_to_bath_ratio<6]
sns.distplot(bedroom_to_bath_ratio, color="#34495e", kde=False)
plt.title('Bed to Bath ratio', fontsize=15)
plt.xlabel('Ratio', fontsize=15)
plt.ylabel('Count', fontsize=15);
In [84]:
Explore Joint Attributes
15 points
Description:
Visualize relationships between attributes: Look at the attributes via scatter plots, correlation, cross-
tabulation, group-wise averages, etc. as appropriate. Explain any interesting relationships.
Absolute Log Error and Number of Occurrences Per
Month
We compared the monthly average of absolute log error and found that the error could be cyclical
over the year: it dips during the spring and summer months and rises during the winter months.
non_zero_mask = data['calculatedfinishedsquarefeet'] > 0
tax = data[non_zero_mask]['taxamount']
sqft = data[non_zero_mask]['calculatedfinishedsquarefeet']
tax_per_sqft = tax / sqft
tax_per_sqft = tax_per_sqft[tax_per_sqft<10]
sns.distplot(tax_per_sqft, color="#34495e", kde=False)
plt.title('Tax Per Square Feet', fontsize=15)
plt.xlabel('Ratio', fontsize=15)
plt.ylabel('Count', fontsize=15);
We also compared the number of transactions per month: transactions are highest during the spring,
summer, and fall, possibly because these are optimal times to sell property, and lowest during the winter.
Cross-comparing the two, we have a high number of transactions during the spring and summer while
the log error is relatively low, and a low number of transactions during the winter while the log error
is relatively high.
In [85]:
Number of Transactions and Mean Absolute Log Error Per
Day of the Week
Saturdays and Sundays are non-work days, hence the dip in absolute log error and number of
transactions.
Among the workdays, Friday has the most transactions while Monday has the least.
months = train_data['transactiondate'].dt.month
month_names = ['January','February','March','April','May','June','July','August','September','October','November','December']
train_data['abs_logerror'] = train_data['logerror'].abs()
f, (ax0, ax1) = plt.subplots(nrows=1, ncols=2, figsize=[17, 7])
per_month = train_data.groupby(months)["abs_logerror"].mean()
per_month.index = month_names
ax0.set_title('Average Log Error Across Month Of 2016')
ax0.set_xlabel('Month Of The Year', fontsize=15)
ax0.set_ylabel('Log Error', fontsize=15)
sns.pointplot(x=per_month.index, y=per_month, color="#34495e", ax=ax0)
per_month = train_data.groupby(months)["logerror"].count()
per_month.index = month_names
ax1.set_title('Number of Occurrences per Month in 2016')
ax1.set_xlabel('Month Of The Year', fontsize=15)
ax1.set_ylabel('Number of Occurrences', fontsize=15)
sns.barplot(x=per_month.index, y=per_month, color="#34495e", ax=ax1);
Among the workdays, Monday has the highest mean absolute log error while Friday has the lowest.
Cross-comparing, Monday has the fewest transactions and the most error, while Friday has the most
transactions and the least error. Saturday and Sunday are special cases and do not provide
substantial evidence of any trend.
In [86]:
Continuous Variable Correlation Heatmap
In the heatmap of correlations, warmer colors indicate positive correlation, white indicates no
correlation, and colder colors indicate negative correlation. We see that calculated finished square
feet is highly correlated with finished square feet 12, due to collinearity. Tax amounts and year built
are also highly correlated with finished square feet as well as with one another.
Latitude and longitude are negatively correlated with each other, possibly because the properties
cluster along a coastline that runs from northwest to southeast (with the more expensive beachfront
properties concentrated along it).
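The cell that renders this heatmap falls on a notebook page not reproduced here. A minimal sketch of such a heatmap, assuming the train_data frame and the continous_vars index built earlier (the real cell likely also included latitude and longitude), could be:
# Illustrative sketch only; the notebook's actual column selection may differ
heat_cols = list(continous_vars) + ['latitude', 'longitude']
corr = train_data[heat_cols].corr()
f, ax = plt.subplots(figsize=[15, 12])
sns.heatmap(corr, cmap='coolwarm', center=0, ax=ax)
ax.set_title('Continuous Variable Correlation Heatmap');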
weekday = train_data['transactiondate'].dt.weekday
weekdays = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
abs_logerror = train_data['logerror'].abs()
f, (ax0, ax1) = plt.subplots(nrows=1, ncols=2, figsize=[17, 7])
to_plot = abs_logerror.groupby(weekday).count()
to_plot.index = weekdays
to_plot.plot(color="#34495e", linewidth=4, ax=ax0)
ax0.set_title('Number of Transactions Per Day')
ax0.set_ylabel('Number of Transactions', fontsize=15)
ax0.set_xlabel('Day', fontsize=15)
to_plot = abs_logerror.groupby(weekday).mean()
to_plot.index = weekdays
to_plot.plot(color="#34495e", linewidth=4, ax=ax1)
ax1.set_title('Mean Absolute Log Error Per Day')
ax1.set_ylabel('Mean Absolute Log Error', fontsize=15)
ax1.set_xlabel('Day', fontsize=15);
From a simple scatter of the coordinates, we can see the shoreline of California as well as possible
areas of obstruction, such as mountains, that prevent property growth in those areas. The majority of
properties lie in the center to upper left of the graph.
In [88]:
Number of Stories vs Year Built
As architectural techniques improved, we started to see more properties with 2 or more stories by 1950.
The number of one-story properties also increased during that time. The baby boom, the end of
WWII with readily available steel, and mortgage incentives may explain both the increase in the
number of properties built and the increase in stories per property. Note: because we filled in missing
values with the median year built, the spike at the median (1963) is artificial until we use other
methods to impute year built.
plt.figure(figsize=(12,12));
sns.jointplot(x=data.latitude.values, y=data.longitude.values, size=10, color="#34495e")
plt.ylabel('Longitude', fontsize=15)
plt.xlabel('Latitude', fontsize=15)
plt.title('Longitude and Latitude Data Points', fontsize=15);
In [89]:
Explore Attributes and Class
10 points
Description:
Identify and explain interesting relationships between features and the class you are trying to
predict (i.e., relationships with variables and the target classification).
Correlation of Continuous Variables and Log
Error (Target Variable)
We see that calculatedfinishedsquarefeet has the highest correlation with log error (0.04) while price
per square feet has the highest negative correlation with log error (-0.02). taxvaluedollarcnt has
relatively low correlation with log error. We chose to further explore calculatedfinishedsquarefeet and its
relationship with log error (a sketch of the correlation computation appears after the next code cell).
fig,ax1= plt.subplots()
fig.set_size_inches(20,10)
yearMerged = data.groupby(['yearbuilt', 'numberofstories'])["parcelid"].count().unstack()
yearMerged = yearMerged.loc[1900:]
yearMerged.index.name = 'Year Built'
plt.title('Number of Stories Per Year Built', fontsize=15)
plt.ylabel('Count', fontsize=15);
yearMerged.plot(ax=ax1, linewidth=4);
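The cell that computes and plots the correlations against logerror sits on a page that is not shown here. A minimal sketch, assuming the train_data frame and continous_vars index defined earlier, might be:
# Illustrative sketch: rank continuous variables by correlation with the target
cols = continous_vars[~continous_vars.isin(['logerror'])]
corr_with_target = train_data[cols].corrwith(train_data['logerror']).sort_values()
corr_with_target.plot(kind='barh', color="#34495e")
plt.title('Correlation of Continuous Variables with Log Error')
plt.xlabel('Correlation', fontsize=15);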
Scatterplot of Log Error and Calculated Finished Square
Feet
We plot our most strongly correlated variable, calculatedfinishedsquarefeet, against logerror. We
don't see any linear relationship in the scatter plot below, even though the points are fairly evenly distributed.
In [91]:
New Features
5 points
Description:
Are there other features that could be added to the data or created from existing features? Which
ones?
column = "calculatedfinishedsquarefeet"
train_data = data[~data['logerror'].isnull()]
sns.jointplot(train_data[column], train_data['logerror'], size=10, color="#34495e")
plt.ylabel('Log Error', fontsize=12)
plt.xlabel('Calculated Finished Square Feet', fontsize=15)
plt.title("Calculated Finished Square Feet Vs Log Error", fontsize=15);
Tax Per Square Feet
We created a tax per square feet feature. It is negatively correlated with log error and we hope that
it will add value to a predictive model.
In [92]:
City zip code details
The Zillow dataset has a variable: 'regionidcity' which is a numerical ID, representing the city in
which the property is located (if any). We don't have a string variable showing the city name.
We found a government dataset publicly available on the internet, containing all zip codes as well
as other information associated with each zip code. We have downloaded the dataset from here:
http://federalgovernmentzipcodes.us and joined it with our
dataset with the cell below.
This will give us the actual name of the cities, zip code type and location type.
New Variables Joined:
zipcode_type: Standard, PO BOX Only, Unique, Military (implies APO or FPO) - the zip code type may provide useful insight for prediction
city: USPS official city name(s) - this distinguishes one city from another, which was lacking in the original dataset
location_type: Primary, Acceptable, Not Acceptable - because these are all valid location properties, they will most likely be acceptable
In [93]:
Out[92]: ('Correlation with log error:', -0.014065552662672554)
The zips dataset has 81831 rows and 4 columns
The merged dataset has 3857451 rows and 53 columns
non_zero_mask = data['calculatedfinishedsquarefeet'] > 0
tax = data[non_zero_mask]['taxamount']
sqft = data[non_zero_mask]['calculatedfinishedsquarefeet']
data['price_per_sqft'] = tax / sqft
'Correlation with log error:', data['price_per_sqft'].corr(data['logerror'])
# data from http://federalgovernmentzipcodes.us
zips = pd.read_csv('../input/free-zipcode-database.csv', low_memory=False)
zips = zips[['Zipcode','ZipCodeType','City','LocationType']]
zips.columns = ['zipcode', 'zipcode_type', 'city', 'location_type']
assert np.all(~zips.isnull())
zips = zips.rename(columns={'zipcode':'regionidzip'})
data = pd.merge(data, zips, how='left', on='regionidzip')
print('The zips dataset has %d rows and %d columns' % zips.shape)
print('The merged dataset has %d rows and %d columns' % data.shape)
Table of New Variables
Just focusing on the new features added to the dataset, here are the value types and descriptions.
In [94]:
Other Ideas For New Features
Other features that we thought about that we could include in the future are last remodel date of
kitchen or bathroom, key words in the description of overpriced Zestimates or underpriced
Zestimates, and how close a home is to a grocery store, Starbucks, a mall, or a place of interest.
A recently remodeled home could raise the actual sale price much higher than the Zestimate.
Certain words in the listing description could be associated with lower sale prices or with buyers who bid
a higher sale price. Lastly, walkability, or how close a home is to a grocery store, Starbucks, a mall, or
another place of interest, could increase the final sale price as well.
Exceptional Work
10 points
Description:
You have free reign to provide additional analyses. One idea: implement dimensionality reduction,
then visualize and interpret the results.
Categorical Feature Importance
Out[94]:
Variable | Type | Scale | Description
city | nominal | [APO, WHISKEYTOWN, nan, REDDING, FPO, ... (239 More)] | USPS offical city name(s)
location_type | nominal | [PRIMARY, nan, ACCEPTABLE, NOT ACCEPTABLE] | Primary, Acceptable, Not Acceptable
price_per_sqft | ratio | (0, 11911) | Tax per SQFT
zipcode_type | nominal | [MILITARY, PO BOX, nan, STANDARD, UNIQUE] | Standard, PO BOX Only, Unique, Military (implies APO or FPO)
variables_description = [
['price_per_sqft', 'ratio', 'TBD', 'Tax per SQFT']
,['zipcode_type', 'nominal', 'TBD', 'Standard, PO BOX Only, Unique, Military (implies APO or FPO)']
,['city', 'nominal', 'TBD', 'USPS offical city name(s)']
,['location_type', 'nominal', 'TBD', 'Primary, Acceptable, Not Acceptable']
]
new_variables = pd.DataFrame(variables_description, columns=['name', 'type', 'scale', 'description'])
new_variables = new_variables.set_index('name')
new_variables = new_variables.loc[new_variables.index.isin(data.columns)]
variables = variables.append(new_variables)
output_variables_table(new_variables)
According to a randomized tree-ensemble model (ExtraTreesRegressor) with seed 0, region id zip,
bedroom count, census tract and block, and region id neighborhood explain the most variance in log
error. Even though the importances of the other variables are relatively low, those variables could
become more important if we add interaction terms or use a different nonlinear model.
In [95]:
Continuous Feature Importance
from sklearn import ensemble
train_data = data[~data['logerror'].isnull()]
categorical_vars = variables[variables['type'].isin(['ordinal', 'nominal'])].index
categorical_vars = categorical_vars[categorical_vars.isin(data.columns)]
categorical_vars = categorical_vars[~categorical_vars.isin(['parcelid', 'logerror'])]
X = train_data[categorical_vars]
# remove string types
categorical_vars = categorical_vars[X.dtypes != object]
X = X[categorical_vars]
y = train_data['logerror']
model = ensemble.ExtraTreesRegressor(random_state=0)
model.fit(X.fillna(0), y)
index = pd.Index(categorical_vars, name='Variable Name')
importance = pd.Series(model.feature_importances_, index=index)
importance = importance.sort_values()
importance.plot(kind='barh', color="#34495e")
plt.title('Categorical Feature Importance')
plt.xlabel('Importance', fontsize=15);
According to the linear regression model, tax delinquency year explains the most variance in log
error. Even though the importances of the other variables are relatively low, those variables could
become more important if we add interaction terms or higher-order polynomial terms.
In [96]:
Exporting the cleaned datasets
from sklearn.linear_model import LinearRegression
train_data = data[~data['logerror'].isnull()]
continuous_vars = variables[variables['type'].isin(['ratio', 'interval'])].index
continuous_vars = continuous_vars[continuous_vars.isin(data.columns)]
continuous_vars = continuous_vars[~continuous_vars.isin(['parcelid', 'logerror', 'transactiondate'])]
X = train_data[continuous_vars]
y = train_data['logerror']
model = LinearRegression()
model.fit(X.fillna(0), y)
index = pd.Index(continuous_vars, name='Variable Name')
importance = pd.Series(np.abs(model.coef_), index=index)
importance = importance.sort_values()
importance.plot(kind='barh', color="#34495e")
plt.title('Continuous Feature Importance')
plt.xlabel('Importance', fontsize=15);
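The export cell itself appears on the following page of the notebook. As a minimal sketch of what writing out the cleaned tables could look like (the file paths here are illustrative assumptions, not the notebook's actual ones):
# Illustrative sketch: persist the cleaned property table and the training subset
data.to_csv('../output/properties_2016_cleaned.csv', index=False)
train_data.to_csv('../output/train_2016_cleaned.csv', index=False)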