The document discusses data wrangling and manipulation techniques in machine learning. It covers topics like data exploration, data wrangling, data acquisition, and data manipulation in Python. It demonstrates techniques like loading CSV and Excel files, exploring data through dimensionality checks, slicing, and correlation analysis. The objectives are to perform data wrangling and understand its significance, manipulate data in Python using coercion and merging, and explore data using Python.
3. Learning Objectives

By the end of this lesson, you will be able to:
▪ Demonstrate different data wrangling techniques and their significance
▪ Perform data manipulation in Python using coercion, merging, concatenation, and joins
▪ Demonstrate data import and exploration using Python
5. Loading .csv File in Python

Before starting with a dataset, the first step is to load the dataset. Below is the code for the same:

Code
import pandas
df = pandas.read_csv("/home/simpy/Datasets/BostonHousing.csv")
6. Loading Data to .csv File

Below is the code for writing the data frame out to an existing CSV file:

Code
df.to_csv("/home/simpy/Datasets/BostonHousing.csv")
7. Loading .xlsx File in Python

Below is the code for loading an .xlsx file within Python:

Code
df = pandas.read_excel("/home/simpy/Datasets/BostonHousing.xlsx")
8. Loading Data to .xlsx File

Below is the code for writing program data to an existing .xlsx file:

Code
df.to_excel("/home/simpy/Datasets/BostonHousing.xlsx")
9. Assisted Practice
Data Exploration

Problem Statement: Extract data from the given SalaryGender CSV file and store the data from each column in a separate NumPy array.

Objective: Import the dataset (CSV) from your local system into your Python notebook.

Access: Click on the Labs tab on the left side panel of the LMS. Copy or note the username and password that are generated. Click on the Launch Lab button. On the page that appears, enter the username and password in the respective fields, and click Login.

Duration: 5 mins.
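A minimal sketch of what this practice asks for, assuming the file is named SalaryGender.csv and has columns such as Salary, Gender, Age, and PhD (verify the actual column names in the lab copy):

Code
import numpy as np
import pandas as pd

df = pd.read_csv("SalaryGender.csv")

# One NumPy array per column (column names assumed; adjust to the real file)
salary = np.array(df["Salary"])
gender = np.array(df["Gender"])
age = np.array(df["Age"])
phd = np.array(df["PhD"])
print(salary.shape, gender.shape, age.shape, phd.shape)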
10. Data Exploration Techniques

Techniques: Dimensionality Check, Type of Dataset, Slicing and Indexing, Identifying Unique Elements, Value Extraction, Feature Mean, Feature Median, Feature Mode

Dimensionality Check

The shape attribute returns a two-item tuple (number of rows and number of columns) for a data frame. For a Series, it returns a one-item tuple.

Code
df.shape
11. Data Exploration Techniques (Contd.)

Type of Dataset

You can use type() in Python to return the type of an object.

Checking the type of the data frame:

Code
type(df)

Checking the type of a column ('chas') within the data frame:

Code
df['chas'].dtype
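Beyond checking a single column, df.dtypes lists the type of every column in one call; a small supplementary example:

Code
print(type(df))          # type of the container, i.e., a pandas DataFrame
print(df['chas'].dtype)  # dtype of one column
print(df.dtypes)         # dtypes of all columns at once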
12. Data Exploration Techniques (Contd.)

Slicing and Indexing

You can use the : operator, with the start index on its left and the end index on its right, to output the corresponding slice.

Slicing a list:

Code
lst = [1, 2, 3, 4, 5]
lst[1:3]

Slicing a data frame (df) using the iloc indexer:

Code
df.iloc[:, 1:3]
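iloc slices by integer position; pandas also offers the label-based loc indexer. A small supplementary sketch (the column labels 'zn' and 'indus' are assumed from the BostonHousing.csv file):

Code
df.iloc[:, 1:3]          # columns at positions 1 and 2 (end position excluded)
df.loc[:, 'zn':'indus']  # columns selected by label (end label included)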
13. Data Exploration Techniques (Contd.)

Identifying Unique Elements

Using unique() on the column of interest will return a NumPy array with the unique values of the column.

Extracting all unique values out of the 'crim' column:

Code
df['crim'].unique()
14. Data Exploration Techniques (Contd.)

Value Extraction

Using the values attribute on the column of interest will return a NumPy array with all the values of the column.

Extracting all values out of the 'crim' column:

Code
df['crim'].values
15. Data Exploration Techniques (Contd.)

Feature Mean

Using mean() on the data frame will return the mean of each of its columns.

Code
df.mean()
16. Data Exploration Techniques (Contd.)

Feature Median

Using median() on the data frame will return the median of each of its columns.

Code
df.median()
17. Data Exploration Techniques (Contd.)

Feature Mode

Using mode() on the data frame will return the mode across all columns with axis=0 and across all rows with axis=1.

Code
df.mode(axis=0)
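The mean, median, and mode above are computed one statistic at a time. As a supplementary note, describe() summarizes count, mean, standard deviation, min, quartiles, and max for every numeric column in a single call:

Code
print(df.describe())   # the 50% row is the median; mode still needs df.mode()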
20. Plotting a Heatmap with Seaborn

Below is the code for plotting a heatmap within Python:

Code
import matplotlib.pyplot as plt
import seaborn as sns
correlations = df.corr()
sns.heatmap(data=correlations, square=True, cmap="bwr")
plt.yticks(rotation=0)
plt.xticks(rotation=90)

Parameters:
▪ data: rectangular dataset (a 2D dataset that can be coerced into an ndarray)
▪ square: if True, set the Axes aspect to "equal" so each cell will be square-shaped
▪ cmap: Matplotlib colormap name or object, or list of colors
21. Plotting a Heatmap with Seaborn (Contd.)

Below is the heatmap obtained, where colors approaching red mean maximum correlation and colors approaching blue mean minimum correlation.
22. Assisted Practice
Data Exploration

Problem Statement: Suppose you are a public school administrator. Some schools in your state of Tennessee are performing below average academically. Your superintendent, under pressure from frustrated parents and voters, approached you with the task of understanding why these schools are under-performing. To improve school performance, you need to learn more about these schools and their students, just as a business needs to understand its own strengths and weaknesses and its customers. The data includes various demographic, school faculty, and income variables.

Objective: Perform exploratory data analysis, which includes determining the type of the data and correlation analysis over it. You need to convert the data into useful information:
▪ Read the data in a pandas data frame
▪ Describe the data to find more details
▪ Find the correlation between 'reduced_lunch' and 'school_rating'

Access: Click on the Labs tab on the left side panel of the LMS. Copy or note the username and password that are generated. Click on the Launch Lab button. On the page that appears, enter the username and password in the respective fields, and click Login.

Duration: 15 mins.
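A minimal sketch of the three steps above, assuming the lab file is named middle_tn_schools.csv (the actual file name may differ):

Code
import pandas as pd

df = pd.read_csv("middle_tn_schools.csv")   # read the data into a data frame

print(df.dtypes)      # determine the type of the data
print(df.describe())  # describe the data to find more details

# Correlation between 'reduced_lunch' and 'school_rating'
print(df['reduced_lunch'].corr(df['school_rating']))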
23. Unassisted Practice
Data Exploration
Duration: 15 mins.

Problem Statement: Mtcars, an automobile company in Chambersburg, United States, has recorded the production of its cars within a dataset. With respect to some of the feedback given by their customers, they are coming up with a new model. As a result, they have to explore the current dataset to derive further insights out of it.

Objective: Import the dataset; explore it for dimensionality, type, and the average value of horsepower across all the cars. Also, identify a few of the most correlated features, which would help in the modification.

Note: This practice is not graded. It is only intended for you to apply the knowledge you have gained to solve real-world problems.

Access: Click on the Labs tab on the left side panel of the LMS. Copy or note the username and password that are generated. Click on the Launch Lab button. On the page that appears, enter the username and password in the respective fields, and click Login.
24. Data Import

The first step is to import the data as a part of exploration.

Code
import pandas
df1 = pandas.read_csv("mtcars.csv")
26. Data Exploration

Type of Dataset

type() returns the type of the given object.

Code
type(df1)
27. Data Exploration (Contd.)

Identifying Mean Value

The mean() function can be used to calculate the mean/average of a given list of numbers.

Code
df1['hp'].mean()
28. Identifying Correlation Using a Heatmap

The heatmap function in Seaborn is used to plot the correlation matrix.

Code
import matplotlib.pyplot as plt
import seaborn as sns
# numeric_only avoids errors from any non-numeric columns in the dataset
correlations = df1.corr(numeric_only=True)
sns.heatmap(data=correlations, square=True, cmap="viridis")
plt.yticks(rotation=0)
plt.xticks(rotation=90)
29. Identifying Correlation Using a Heatmap (Contd.)

A heatmap is a graphical representation of data where the individual values contained in a matrix are represented as colors. From the map obtained, you can clearly see that cylinders (cyl) and displacement (disp) are the most correlated features.
32. Need of Data Wrangling

Data wrangling addresses the following common problems:
▪ Missing data
▪ Presence of noisy data (erroneous data and outliers)
▪ Inconsistent data

Wrangled data also helps you develop a more accurate model and prevent data leakage.
33. Missing Values in a Dataset

Consider the random dataset given below, illustrating missing values.
34. Missing Value Detection

Consider the dataset below, imported as df1 within Python, having some missing values. Detecting missing values:

Code
df1.isna().any()
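isna().any() only flags whether each column contains a missing value; counting and locating them is often the next step. A small supplementary sketch:

Code
print(df1.isna().sum())             # number of missing values per column
print(df1[df1.isna().any(axis=1)])  # rows with at least one missing value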
35. Missing Value Treatment

Mean Imputation: Replace the missing value with the variable's mean.

Code
import numpy as np
import pandas as pd
# sklearn.preprocessing.Imputer has been removed from scikit-learn;
# SimpleImputer is its replacement
from sklearn.impute import SimpleImputer

cols = df1.columns
mean_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
mean_imputer = mean_imputer.fit(df1)
imputed_df = mean_imputer.transform(df1.values)
df1 = pd.DataFrame(data=imputed_df, columns=cols)
df1
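As a side note, the same column-wise mean imputation can be done in plain pandas without scikit-learn; a minimal equivalent sketch:

Code
# Fill each column's missing values with that column's mean
df1 = df1.fillna(df1.mean(numeric_only=True))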
36. Missing Value Treatment (Contd.)

Median Imputation: Replace the missing value with the variable's median.

Code
from sklearn.impute import SimpleImputer

median_imputer = SimpleImputer(missing_values=np.nan, strategy='median')
median_imputer = median_imputer.fit(df1)
imputed_df = median_imputer.transform(df1.values)
df1 = pd.DataFrame(data=imputed_df, columns=cols)
df1
Note: Mean imputation/Median imputation is again model dependent and is valid only on numerical data.
37. Outlier Values in a Dataset

An outlier is a value that lies outside the usual observation of values.

Note: Outliers skew the data when you are trying to do any type of average.

[Histogram of frequency against midpoint, with one bar lying far from the rest and flagged "OUTLIER?"]
38. Dealing with an Outlier

Outlier Detection

Detect any outlier in the first column of df1:

Code
import seaborn as sns
sns.boxplot(x=df1['Assignment'])

The boxplot shows values below 60 as outliers.
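The boxplot flags outliers visually; numerically, its whiskers follow the 1.5 × IQR rule. A minimal sketch of the same check applied to the 'Assignment' column:

Code
q1 = df1['Assignment'].quantile(0.25)
q3 = df1['Assignment'].quantile(0.75)
iqr = q3 - q1                                  # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # whisker limits
outliers = df1[(df1['Assignment'] < lower) | (df1['Assignment'] > upper)]
print(outliers)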
39. Dealing with an Outlier (Contd.)

Outlier Treatment

Create a filter based on the boxplot obtained and apply the filter to the data frame:

Code
mask = df1['Assignment'].values > 60
df1_outlier_rem = df1[mask]
df1_outlier_rem
40. Assisted Practice
Data Wrangling

Problem Statement: Load the load_diabetes dataset internally from sklearn and check for any missing values or outlier data in the 'data' column. If any irregularities are found, treat them accordingly.

Objective: Perform missing value and outlier data treatment.

Access: Click on the Labs tab on the left side panel of the LMS. Copy or note the username and password that are generated. Click on the Launch Lab button. On the page that appears, enter the username and password in the respective fields, and click Login.

Duration: 15 mins.
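A minimal starting point for this practice, assuming scikit-learn's bundled copy of the dataset:

Code
import pandas as pd
from sklearn.datasets import load_diabetes

diabetes = load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

print(df.isna().any())   # check for missing values
# Box plots per column (e.g., sns.boxplot) can then reveal outliers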
41. Unassisted Practice
Data Wrangling
Duration: 5 mins.

Problem Statement: Mtcars, the automobile company in the United States, has planned to rework the optimization of the horsepower of its cars, as most of the customer feedback centred around horsepower. However, while developing an ML model with respect to horsepower, the efficiency of the model was compromised. Irregularity in the data might be one of the causes.

Objective: Check for missing values and outliers within the horsepower column and remove them.

Note: This practice is not graded. It is only intended for you to apply the knowledge you have gained to solve real-world problems.

Access: Click on the Labs tab on the left side panel of the LMS. Copy or note the username and password that are generated. Click on the Launch Lab button. On the page that appears, enter the username and password in the respective fields, and click Login.
42. Check for Irregularities
Check for missing values
Code
df1['hp'].isna().any()
Check for Outliers
Code
sns.boxplot(x=df1['hp'])
43. Outlier Treatment
Data with hp > 250 is the outlier data; therefore, you can filter it out accordingly.
Code
mask = df1['hp'] < 250
df1_out_rem = df1[mask]
sns.boxplot(x=df1_out_rem['hp'])  # boxplot of the outlier-filtered data
45. Functionalities of Data Object in Python
A data object is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns.
head( )
tail( )
values( )
groupby( )
Concatenation
Merging
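For instance, a minimal illustration of such a data object (the columns and values here are made up):
Code
import pandas as pd
df = pd.DataFrame({'Team': ['India', 'Australia', 'Pakistan'],
'Rank': [2, 1, 6]})
print(df)  # data aligned in rows and columns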
46. Functionalities of Data Object in Python (Contd.)
head( ) returns the first n rows of the data structure
Code
import pandas as pd
import numpy as np
df=pd.Series(np.arange(1,51))
print(df.head(6))
47. Functionalities of Data Object in Python (Contd.)
tail( ) returns the last n rows of the data structure
Code
import pandas as pd
import numpy as np
df=pd.Series(np.arange(1,51))
print(df.tail(6))
48. Functionalities of Data Object in Python (Contd.)
values( ) returns the actual data in the series as an array
Code
import pandas as pd
import numpy as np
df=pd.Series(np.arange(1,51))
print(df.values)
49. Functionalities of Data Object in Python (Contd.)
The data frame is grouped according to the 'Team' and 'Rank' columns
Code
import pandas as pd
world_cup={'Team':['West Indies','West Indies','India','Australia','Pakistan','Sri Lanka','Australia','Australia','Australia','India','Australia'],
'Rank':[7,7,2,1,6,4,1,1,1,2,1],
'Year':[1975,1979,1983,1987,1992,1996,1999,2003,2007,2011,2015]}
df=pd.DataFrame(world_cup)
print(df.groupby(['Team','Rank']).groups)
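In practice, groupby( ) is usually followed by an aggregation. A short sketch using the same world_cup data frame (the aggregations chosen here are just examples):
Code
# count how many times each team finished at each rank
print(df.groupby(['Team','Rank']).size())
# first World Cup year in which each team held each rank
print(df.groupby(['Team','Rank'])['Year'].min())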
50. Functionalities of Data Object in Python (Contd.)
Concatenation combines two or more data structures.
Code
import pandas
world_champions={'Team':['India','Australia','West Indies','Pakistan','Sri Lanka'],
'ICC_rank':[2,3,7,8,4],
'World_champions_Year':[2011,2015,1979,1992,1996],
'Points':[874,787,753,673,855]}
chokers={'Team':['South Africa','New Zealand','Zimbabwe'],
'ICC_rank':[1,5,9],
'Points':[895,764,656]}
df1=pandas.DataFrame(world_champions)
df2=pandas.DataFrame(chokers)
print(pandas.concat([df1,df2],axis=1))
51. Functionalities of Data Object in Python (Contd.)
The concatenated output:
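A note on the axis argument (an addition for clarity, not from the original slide): axis=1 places the two frames side by side, padding the shorter frame with NaN, while the default axis=0 stacks their rows into one longer frame:
Code
# stack the rows instead: one frame containing all eight teams
print(pandas.concat([df1, df2], axis=0, ignore_index=True))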
52. Functionalities of Data Object in Python (Contd.)
Merging is the Pandas operation that performs database-style joins on objects
Code
import pandas
champion_stats={'Team':['India','Australia','West Indies','Pakistan','Sri Lanka'],
'ICC_rank':[2,3,7,8,4],
'World_champions_Year':[2011,2015,1979,1992,1996],
'Points':[874,787,753,673,855]}
match_stats={'Team':['India','Australia','West Indies','Pakistan','Sri Lanka'],
'World_cup_played':[11,10,11,9,8],
'ODIs_played':[733,988,712,679,662]}
df1=pandas.DataFrame(champion_stats)
df2=pandas.DataFrame(match_stats)
print(df1)
print(df2)
print(pandas.merge(df1,df2,on='Team'))
53. Functionalities of Data Object in Python (Contd.)
The merged object contains all the columns of both merged data frames.
54. Different Types of Joins
Left Join Right Join Inner Join Full Outer Join
Joins are used to combine records from two or more tables in a database. Below
are the four most commonly used joins:
55. Left Join
Left Join: Returns all rows from the left table, even if there are no matches in the right table
Code
import pandas
world_champions={'Team':['India','Australia','West Indies','Pakistan','Sri Lanka'],
'ICC_rank':[2,3,7,8,4],
'World_champions_Year':[2011,2015,1979,1992,1996],
'Points':[874,787,753,673,855]}
chokers={'Team':['South Africa','New Zealand','Zimbabwe'],
'ICC_rank':[1,5,9],
'Points':[895,764,656]}
df1=pandas.DataFrame(world_champions)
df2=pandas.DataFrame(chokers)
print(pandas.merge(df1,df2,on='Team',how='left'))
56. Right Join
Right Join: Preserves the unmatched rows from the second (right) table, joining them with NULLs for the columns of the first (left) table
Code
import pandas
world_champions={'Team':['India','Australia','West Indies','Pakistan','Sri Lanka'],
'ICC_rank':[2,3,7,8,4],
'World_champions_Year':[2011,2015,1979,1992,1996],
'Points':[874,787,753,673,855]}
chokers={'Team':['South Africa','New Zealand','Zimbabwe'],
'ICC_rank':[1,5,9],
'Points':[895,764,656]}
df1=pandas.DataFrame(world_champions)
df2=pandas.DataFrame(chokers)
print(pandas.merge(df1,df2,on='Team',how='right'))
57. Inner Join
Inner Join: Selects all rows from both participating tables as long as there is a match between the join columns
Code
import pandas
world_champions={'Team':['India','Australia','West Indies','Pakistan','Sri Lanka'],
'ICC_rank':[2,3,7,8,4],
'World_champions_Year':[2011,2015,1979,1992,1996],
'Points':[874,787,753,673,855]}
chokers={'Team':['South Africa','New Zealand','Zimbabwe'],
'ICC_rank':[1,5,9],
'Points':[895,764,656]}
df1=pandas.DataFrame(world_champions)
df2=pandas.DataFrame(chokers)
print(pandas.merge(df1,df2,on='Team',how='inner'))
58. Full Outer Join
Full Outer Join: Returns all records when there is a match in either the left (table1) or the right (table2) table
Code
import pandas
world_champions={'Team':['India','Australia','West Indies','Pakistan','Sri Lanka'],
'ICC_rank':[2,3,7,8,4],
'World_champions_Year':[2011,2015,1979,1992,1996],
'Points':[874,787,753,673,855]}
chokers={'Team':['South Africa','New Zealand','Zimbabwe'],
'ICC_rank':[1,5,9],
'Points':[895,764,656]}
df1=pandas.DataFrame(world_champions)
df2=pandas.DataFrame(chokers)
print(pandas.merge(df1,df2,on='Team',how='outer'))
59. Typecasting
It converts the data type of an object to the required data type.
int( ): Returns an integer object from any number or string
float( ): Returns a floating-point number from a number or a string
str( ): Returns a string from any numeric object, converting the number to its text form
60. Typecasting Using int( ), float( ), and str( )
Code
int(12.32)
Code
int('43')
Code
float(23)
Code
float('21.43')
Code
str(12.32)
A few typecast examples
61. Assisted Practice
Data Manipulation
Problem Statement: As a macroeconomic analyst at the Organization for Economic Cooperation and Development (OECD), your job is to collect relevant data for analysis. You have three countries in the north_america data frame and one country in the south_america data frame. As these are in two separate plots, it is hard to compare the average labor hours between North America and South America. If all the countries were in the same data frame, the comparison would be much easier.
Objective: Demonstrate concatenation.
Access: Click on the Labs tab on the left side panel of the LMS. Copy or note the username and password that are
generated. Click on the Launch Lab button. On the page that appears, enter the username and password in the
respective fields, and click Login.
Duration: 10 mins.
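A minimal sketch of the intended concatenation; the frame names come from the problem statement, while the columns and values are made-up placeholders:
Code
import pandas as pd
north_america = pd.DataFrame({'Country': ['USA', 'Canada', 'Mexico'],
'Avg_Labor_Hours': [1780, 1700, 2140]})  # hypothetical values
south_america = pd.DataFrame({'Country': ['Chile'],
'Avg_Labor_Hours': [1940]})  # hypothetical values
americas = pd.concat([north_america, south_america], ignore_index=True)
print(americas)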
62. Unassisted Practice
Data Manipulation
Duration: 10 mins.
Problem Statement: The SFO Public Department, referred to as SFO, has captured all the salary data of its employees from 2011 to 2014. Now, in 2018, the organization is facing a financial crisis. As a first step, HR wants to rationalize employee cost to save payroll budget. You have to perform data manipulation and answer the questions below:
1. How much total salary cost has increased from year 2011 to 2014?
2. Who was the top earning employee across all the years?
Objective: Apply data manipulation and visualization techniques.
Note: This practice is not graded. It is only intended for you to apply the knowledge you have gained to solve real-world problems.
Access: Click on the Labs tab on the left side panel of the LMS. Copy or note the username and password that are
generated. Click on the Launch Lab button. On the page that appears, enter the username and password in the
respective fields, and click Login.
63. Answer 1
Check the mean salary cost per year and see how it has increased year over year.
Code
import pandas as pd
salary = pd.read_csv('Salaries.csv')
mean_year = salary.groupby('Year')['TotalPayBenefits'].mean()
print(mean_year)
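For the second question, a minimal sketch, assuming the file has EmployeeName and TotalPayBenefits columns as in the public SF Salaries dataset:
Code
# employee with the highest total pay plus benefits across all years
top_idx = salary['TotalPayBenefits'].idxmax()
print(salary.loc[top_idx, ['EmployeeName', 'TotalPayBenefits']])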
65. Key Takeaways
Now, you are able to:
Demonstrate data import and exploration using Python
Demonstrate different data wrangling techniques and their significance
Perform data manipulation in Python using coercion, merging, concatenation, and joins
68. Knowledge Check
1. Which of the following plots can be used to detect an outlier?
a. Boxplot
b. Histogram
c. Scatter plot
d. All of the above
The correct answer is d. All of the above: all the above plots can be used to detect an outlier.
69. Knowledge Check
2. What is the output of the below Python code?
import numpy as np
percentiles = [98, 76.37, 55.55, 69, 88]
first_subject = np.array(percentiles)
print(first_subject.dtype)
a. float32
b. float
c. int32
d. float64
70. Knowledge Check
2. What is the output of the below Python code?
import numpy as np
percentiles = [98, 76.37, 55.55, 69, 88]
first_subject = np.array(percentiles)
print(first_subject.dtype)
a. float32
b. float
c. int32
d. float64
The correct answer is d. float64: NumPy stores a mixed list of integers and floats as float64, its default floating-point dtype, which represents these values more accurately than narrower float types.
71. Lesson-End Project
Problem Statement: From the raw data below create a data frame:
'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 'last_name': ['Miller', 'Jacobson', ".", 'Milner', 'Cooze'],
'age': [42, 52, 36, 24, 73], 'preTestScore': [4, 24, 31, ".", "."],'postTestScore': ["25,000", "94,000", 57, 62, 70]
Objective: Perform data processing on raw data:
▪ Save the data frame into a CSV file as project.csv
▪ Read project.csv and print the data frame
▪ Read project.csv without column headings
▪ Read project.csv and make 'First Name' and 'Last Name' the index columns
▪ Print the data frame in Boolean form: True for null/NaN values and False for non-null values
▪ Read the data frame, skipping the first 3 rows, and print it
Access: Click the Labs tab in the left side panel of the LMS. Copy or note the username and password that are
generated. Click the Launch Lab button. On the page that appears, enter the username and password in the
respective fields and click Login.
Duration: 20 mins.
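A minimal sketch of the first two steps; the remaining steps follow the same pandas.read_csv pattern with its header, index_col, and skiprows arguments:
Code
import pandas as pd
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'last_name': ['Miller', 'Jacobson', ".", 'Milner', 'Cooze'],
'age': [42, 52, 36, 24, 73],
'preTestScore': [4, 24, 31, ".", "."],
'postTestScore': ["25,000", "94,000", 57, 62, 70]}
df = pd.DataFrame(raw_data)
df.to_csv('project.csv', index=False)  # save the data frame as project.csv
print(pd.read_csv('project.csv'))  # read it back and print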