The document provides a cheat sheet summarizing key concepts in Python, Pandas, NumPy, Scikit-Learn, and data visualization with Matplotlib and Seaborn. It includes sections on Python basics like data types, variables, lists, dictionaries, functions, modules, and control flow. Sections on Pandas cover data structures, loading/saving data, selection, merging, grouping, pivoting and visualization. NumPy sections cover arrays, array math, manipulation, aggregation and subsetting. Scikit-Learn sections cover the machine learning workflow and common algorithms.
The document is a cheat sheet for data wrangling with pandas, providing syntax and methods for creating and manipulating DataFrames, reshaping and subsetting data, summarizing data, combining datasets, filtering and joining data, grouping data, handling missing values, and plotting data. Key methods described include pd.melt() to gather columns into rows, pd.pivot() to spread rows into columns, pd.concat() to append DataFrames, df.sort_values() to order rows by column values, and df.groupby() to group data.
1. Inheritance is a mechanism where a new class is derived from an existing class, known as the base or parent class. The derived class inherits properties and methods from the parent class.
2. There are 5 types of inheritance: single, multilevel, multiple, hierarchical, and hybrid. Multiple inheritance allows a class to inherit from more than one parent class.
3. Overriding allows a subclass to replace or extend a method defined in the parent class, while still calling the parent method using the super() function or parent class name. This allows the subclass to provide a specific implementation of a method.
This document provides an introduction to object-oriented programming in Python. It discusses key concepts like classes, instances, inheritance, and modules. Classes group state and behavior together, and instances are created from classes. Methods defined inside a class have a self parameter. The __init__ method is called when an instance is created. Inheritance allows classes to extend existing classes. Modules package reusable code and data, and the import statement establishes dependencies between modules. The __name__ variable is used to determine if a file is being run directly or imported.
A class is a code template for creating objects. Objects have member variables and have behaviour associated with them. In python a class is created by the keyword class.
An object is created using the constructor of the class. This object will then be called the instance of the class.
This document provides an overview of the Python programming language. It discusses Python's history and evolution, its key features like being object-oriented, open source, portable, having dynamic typing and built-in types/tools. It also covers Python's use for numeric processing with libraries like NumPy and SciPy. The document explains how to use Python interactively from the command line and as scripts. It describes Python's basic data types like integers, floats, strings, lists, tuples and dictionaries as well as common operations on these types.
Matplotlib is a Python library used for 2D plotting. It can create publication-quality figures in both hardcopy and interactive formats across platforms. The basic steps for creating plots with Matplotlib are: 1) prepare data, 2) create a plot, 3) add elements to the plot, 4) customize the plot, 5) save the plot, and 6) show the plot. Matplotlib provides various functions for plotting different types of data, customizing figures, axes, and layouts.
This document provides an introduction to object oriented programming in Python. It discusses key OOP concepts like classes, methods, encapsulation, abstraction, inheritance, polymorphism, and more. Each concept is explained in 1-2 paragraphs with examples provided in Python code snippets. The document is presented as a slideshow that is meant to be shared and provide instruction on OOP in Python.
This document discusses object-oriented programming concepts in Python including multiple inheritance, method resolution order, method overriding, and static and class methods. It provides examples of multiple inheritance where a class inherits from more than one parent class. It also explains method resolution order which determines the search order for methods and attributes in cases of multiple inheritance. The document demonstrates method overriding where a subclass redefines a method from its parent class. It describes static and class methods in Python, noting how static methods work on class data instead of instance data and can be called through both the class and instances, while class methods always receive the class as the first argument.
The document is a cheat sheet for data wrangling with pandas, providing syntax and methods for creating and manipulating DataFrames, reshaping and subsetting data, summarizing data, combining datasets, filtering and joining data, grouping data, handling missing values, and plotting data. Key methods described include pd.melt() to gather columns into rows, pd.pivot() to spread rows into columns, pd.concat() to append DataFrames, df.sort_values() to order rows by column values, and df.groupby() to group data.
1. Inheritance is a mechanism where a new class is derived from an existing class, known as the base or parent class. The derived class inherits properties and methods from the parent class.
2. There are 5 types of inheritance: single, multilevel, multiple, hierarchical, and hybrid. Multiple inheritance allows a class to inherit from more than one parent class.
3. Overriding allows a subclass to replace or extend a method defined in the parent class, while still calling the parent method using the super() function or parent class name. This allows the subclass to provide a specific implementation of a method.
This document provides an introduction to object-oriented programming in Python. It discusses key concepts like classes, instances, inheritance, and modules. Classes group state and behavior together, and instances are created from classes. Methods defined inside a class have a self parameter. The __init__ method is called when an instance is created. Inheritance allows classes to extend existing classes. Modules package reusable code and data, and the import statement establishes dependencies between modules. The __name__ variable is used to determine if a file is being run directly or imported.
A class is a code template for creating objects. Objects have member variables and have behaviour associated with them. In python a class is created by the keyword class.
An object is created using the constructor of the class. This object will then be called the instance of the class.
This document provides an overview of the Python programming language. It discusses Python's history and evolution, its key features like being object-oriented, open source, portable, having dynamic typing and built-in types/tools. It also covers Python's use for numeric processing with libraries like NumPy and SciPy. The document explains how to use Python interactively from the command line and as scripts. It describes Python's basic data types like integers, floats, strings, lists, tuples and dictionaries as well as common operations on these types.
Matplotlib is a Python library used for 2D plotting. It can create publication-quality figures in both hardcopy and interactive formats across platforms. The basic steps for creating plots with Matplotlib are: 1) prepare data, 2) create a plot, 3) add elements to the plot, 4) customize the plot, 5) save the plot, and 6) show the plot. Matplotlib provides various functions for plotting different types of data, customizing figures, axes, and layouts.
This document provides an introduction to object oriented programming in Python. It discusses key OOP concepts like classes, methods, encapsulation, abstraction, inheritance, polymorphism, and more. Each concept is explained in 1-2 paragraphs with examples provided in Python code snippets. The document is presented as a slideshow that is meant to be shared and provide instruction on OOP in Python.
This document discusses object-oriented programming concepts in Python including multiple inheritance, method resolution order, method overriding, and static and class methods. It provides examples of multiple inheritance where a class inherits from more than one parent class. It also explains method resolution order which determines the search order for methods and attributes in cases of multiple inheritance. The document demonstrates method overriding where a subclass redefines a method from its parent class. It describes static and class methods in Python, noting how static methods work on class data instead of instance data and can be called through both the class and instances, while class methods always receive the class as the first argument.
This document discusses object-oriented programming concepts in Python including classes, objects, attributes, methods, inheritance, and method overriding. It defines a MyVector class with x and y attributes and methods to add vectors and display their count. Access specifiers like private and built-in are covered. Python uses garbage collection and has no destructors. Inheritance allows a SubVector class to extend MyVector, and a method can be overridden to change a class's behavior.
Modules in Python allow organizing classes into files to make them available and easy to find. Modules are simply Python files that can import classes from other modules in the same folder. Packages allow further organizing modules into subfolders, with an __init__.py file making each subfolder a package. Modules can import classes from other modules or packages using either absolute or relative imports, and the __init__.py can simplify imports from its package. Modules can also contain global variables and classes to share resources across a program.
The Pandas library provides easy-to-use data structures and analysis tools for Python. It uses NumPy and allows import of data into Series (one-dimensional arrays) and DataFrames (two-dimensional labeled data structures). Data can be accessed, filtered, and manipulated using indexing, booleans, and arithmetic operations. Pandas supports reading and writing data to common formats like CSV, Excel, SQL, and can help with data cleaning, manipulation, and analysis tasks.
The document discusses lists in Python, including how to create, access, modify, loop through, slice, sort, and perform other operations on list elements. Lists can contain elements of different data types, are indexed starting at 0, and support methods like append(), insert(), pop(), and more to manipulate the list. Examples are provided to demonstrate common list operations and functions.
The document discusses various Python flow control statements including if/else, for loops, while loops, break and continue. It provides examples of using if/else statements for decision making and checking conditions. It also demonstrates how to use for and while loops for iteration, including using the range function. It explains how break and continue can be used to terminate or skip iterations. Finally, it briefly mentions pass, for, while loops with else blocks, and nested loops.
Object oriented programming with pythonArslan Arshad
A short intro to how Object Oriented Paradigm work in Python Programming language. This presentation created for beginner like bachelor student of Computer Science.
The document discusses recursive functions and provides examples of recursive algorithms for calculating factorial, greatest common divisor (GCD), Fibonacci numbers, power functions, and solving the Towers of Hanoi problem. Recursive functions are functions that call themselves during their execution. They break down problems into subproblems of the same type until reaching a base case. This recursive breakdown allows problems to be solved in a top-down, step-by-step manner.
Python modules allow for code reusability and organization. There are three main components of a Python program: libraries/packages, modules, and functions/sub-modules. Modules can be imported using import, from, or from * statements. Namespaces and name resolution determine how names are looked up and associated with objects. Packages are collections of related modules and use an __init__.py file to be recognized as packages by Python.
This document provides an overview of Pandas, a Python library used for data analysis and manipulation. Pandas allows users to manage, clean, analyze and model data. It organizes data in a form suitable for plotting or displaying tables. Key data structures in Pandas include Series for 1D data and DataFrame for 2D (tabular) data. DataFrames can be created from various inputs and Pandas includes input/output tools to read data from files into DataFrames.
Structured Query Language (SQL) is a query language that allows users to specify conditions to retrieve data from a database. SQL queries select rows from database tables that satisfy specified conditions. The results are output in a table format. Common SQL clauses include SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY, and INTO to output results to a table, cursor, file or printer. SQL can perform queries on single or multiple related tables through joins.
Here is a Python class with the specifications provided in the question:
class PICTURE:
def __init__(self, pno, category, location):
self.pno = pno
self.category = category
self.location = location
def FixLocation(self, new_location):
self.location = new_location
This defines a PICTURE class with three instance attributes - pno, category and location as specified in the question. It also defines a FixLocation method to assign a new location as required.
This document provides information about priority queues and binary heaps. It defines a binary heap as a nearly complete binary tree where the root node has the maximum/minimum value. It describes heap operations like insertion, deletion of max/min, and increasing/decreasing keys. The time complexity of these operations is O(log n). Heapsort, which uses a heap data structure, is also covered and has overall time complexity of O(n log n). Binary heaps are often used to implement priority queues and for algorithms like Dijkstra's and Prim's.
Regular expressions are a powerful tool for searching, matching, and parsing text patterns. They allow complex text patterns to be matched with a standardized syntax. All modern programming languages include regular expression libraries. Regular expressions can be used to search strings, replace parts of strings, split strings, and find all occurrences of a pattern in a string. They are useful for tasks like validating formats, parsing text, and finding/replacing text. This document provides examples of common regular expression patterns and methods for using regular expressions in Python.
Pandas is an open source Python library that provides data structures and data analysis tools for working with tabular data. It allows users to easily perform operations on different types of data such as tabular, time series, and matrix data. Pandas provides data structures like Series for 1D data and DataFrame for 2D data. It has tools for data cleaning, transformation, manipulation, and visualization of data.
This document provides an introduction to the Python programming language. It describes Python as a multi-purpose, object-oriented language that is interpreted, dynamically typed and focuses on readability. It lists several major organizations that use Python. It then provides examples of basic Python programs and covers key Python concepts like variables, data types, strings, comments, functions and more in under 3 sentences each.
The document discusses association rule mining to discover relationships between data items in large datasets. It describes how association rules have the form of X → Y, showing items that frequently occur together. The key steps are: (1) generating frequent itemsets whose support is above a minimum threshold; (2) extracting high-confidence rules from each frequent itemset. It proposes using the Apriori algorithm to efficiently find frequent itemsets by pruning the search space based on the antimonotonicity of support.
This document provides an overview of Python for data analysis using the pandas library. It discusses key pandas concepts like Series and DataFrames for working with one-dimensional and multi-dimensional labeled data structures. It also covers common data analysis tasks in pandas such as data loading, aggregation, grouping, pivoting, filtering, handling time series data, and plotting.
Kickstart your data science journey with this Python cheat sheet that contains code examples for strings, lists, importing libraries and NumPy arrays.
Find more cheat sheets and learn data science with Python at www.datacamp.com.
The document discusses frequent itemset mining methods. It describes the Apriori algorithm which uses a candidate generation-and-test approach involving joining and pruning steps. It also describes the FP-Growth method which mines frequent itemsets without candidate generation by building a frequent-pattern tree. The advantages of each method are provided, such as Apriori being easily parallelized but requiring multiple database scans.
This document discusses functions and methods in Python. It defines functions and methods, and explains the differences between them. It provides examples of defining and calling functions, returning values from functions, and passing arguments to functions. It also covers topics like local and global variables, function decorators, generators, modules, and lambda functions.
Getting started with Pandas Cheatsheet.pdfSudhakarVenkey
Pandas is a popular Python library for data wrangling and analysis that can handle various data formats like CSV, Excel, JSON, HTML, and SQL. It allows users to create DataFrames from dictionaries, lists, and by importing data from text, Excel files, websites, databases and JSON. DataFrames can be exported, inspected by viewing rows and columns, subsetted by selecting rows and columns, queried using logical conditions, and reshaped by changing layout, renaming columns, appending rows/columns, and sorting values.
This document discusses object-oriented programming concepts in Python including classes, objects, attributes, methods, inheritance, and method overriding. It defines a MyVector class with x and y attributes and methods to add vectors and display their count. Access specifiers like private and built-in are covered. Python uses garbage collection and has no destructors. Inheritance allows a SubVector class to extend MyVector, and a method can be overridden to change a class's behavior.
Modules in Python allow organizing classes into files to make them available and easy to find. Modules are simply Python files that can import classes from other modules in the same folder. Packages allow further organizing modules into subfolders, with an __init__.py file making each subfolder a package. Modules can import classes from other modules or packages using either absolute or relative imports, and the __init__.py can simplify imports from its package. Modules can also contain global variables and classes to share resources across a program.
The Pandas library provides easy-to-use data structures and analysis tools for Python. It uses NumPy and allows import of data into Series (one-dimensional arrays) and DataFrames (two-dimensional labeled data structures). Data can be accessed, filtered, and manipulated using indexing, booleans, and arithmetic operations. Pandas supports reading and writing data to common formats like CSV, Excel, SQL, and can help with data cleaning, manipulation, and analysis tasks.
The document discusses lists in Python, including how to create, access, modify, loop through, slice, sort, and perform other operations on list elements. Lists can contain elements of different data types, are indexed starting at 0, and support methods like append(), insert(), pop(), and more to manipulate the list. Examples are provided to demonstrate common list operations and functions.
The document discusses various Python flow control statements including if/else, for loops, while loops, break and continue. It provides examples of using if/else statements for decision making and checking conditions. It also demonstrates how to use for and while loops for iteration, including using the range function. It explains how break and continue can be used to terminate or skip iterations. Finally, it briefly mentions pass, for, while loops with else blocks, and nested loops.
Object oriented programming with pythonArslan Arshad
A short intro to how Object Oriented Paradigm work in Python Programming language. This presentation created for beginner like bachelor student of Computer Science.
The document discusses recursive functions and provides examples of recursive algorithms for calculating factorial, greatest common divisor (GCD), Fibonacci numbers, power functions, and solving the Towers of Hanoi problem. Recursive functions are functions that call themselves during their execution. They break down problems into subproblems of the same type until reaching a base case. This recursive breakdown allows problems to be solved in a top-down, step-by-step manner.
Python modules allow for code reusability and organization. There are three main components of a Python program: libraries/packages, modules, and functions/sub-modules. Modules can be imported using import, from, or from * statements. Namespaces and name resolution determine how names are looked up and associated with objects. Packages are collections of related modules and use an __init__.py file to be recognized as packages by Python.
This document provides an overview of Pandas, a Python library used for data analysis and manipulation. Pandas allows users to manage, clean, analyze and model data. It organizes data in a form suitable for plotting or displaying tables. Key data structures in Pandas include Series for 1D data and DataFrame for 2D (tabular) data. DataFrames can be created from various inputs and Pandas includes input/output tools to read data from files into DataFrames.
Structured Query Language (SQL) is a query language that allows users to specify conditions to retrieve data from a database. SQL queries select rows from database tables that satisfy specified conditions. The results are output in a table format. Common SQL clauses include SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY, and INTO to output results to a table, cursor, file or printer. SQL can perform queries on single or multiple related tables through joins.
Here is a Python class with the specifications provided in the question:
class PICTURE:
def __init__(self, pno, category, location):
self.pno = pno
self.category = category
self.location = location
def FixLocation(self, new_location):
self.location = new_location
This defines a PICTURE class with three instance attributes - pno, category and location as specified in the question. It also defines a FixLocation method to assign a new location as required.
This document provides information about priority queues and binary heaps. It defines a binary heap as a nearly complete binary tree where the root node has the maximum/minimum value. It describes heap operations like insertion, deletion of max/min, and increasing/decreasing keys. The time complexity of these operations is O(log n). Heapsort, which uses a heap data structure, is also covered and has overall time complexity of O(n log n). Binary heaps are often used to implement priority queues and for algorithms like Dijkstra's and Prim's.
Regular expressions are a powerful tool for searching, matching, and parsing text patterns. They allow complex text patterns to be matched with a standardized syntax. All modern programming languages include regular expression libraries. Regular expressions can be used to search strings, replace parts of strings, split strings, and find all occurrences of a pattern in a string. They are useful for tasks like validating formats, parsing text, and finding/replacing text. This document provides examples of common regular expression patterns and methods for using regular expressions in Python.
Pandas is an open source Python library that provides data structures and data analysis tools for working with tabular data. It allows users to easily perform operations on different types of data such as tabular, time series, and matrix data. Pandas provides data structures like Series for 1D data and DataFrame for 2D data. It has tools for data cleaning, transformation, manipulation, and visualization of data.
This document provides an introduction to the Python programming language. It describes Python as a multi-purpose, object-oriented language that is interpreted, dynamically typed and focuses on readability. It lists several major organizations that use Python. It then provides examples of basic Python programs and covers key Python concepts like variables, data types, strings, comments, functions and more in under 3 sentences each.
The document discusses association rule mining to discover relationships between data items in large datasets. It describes how association rules have the form of X → Y, showing items that frequently occur together. The key steps are: (1) generating frequent itemsets whose support is above a minimum threshold; (2) extracting high-confidence rules from each frequent itemset. It proposes using the Apriori algorithm to efficiently find frequent itemsets by pruning the search space based on the antimonotonicity of support.
This document provides an overview of Python for data analysis using the pandas library. It discusses key pandas concepts like Series and DataFrames for working with one-dimensional and multi-dimensional labeled data structures. It also covers common data analysis tasks in pandas such as data loading, aggregation, grouping, pivoting, filtering, handling time series data, and plotting.
Kickstart your data science journey with this Python cheat sheet that contains code examples for strings, lists, importing libraries and NumPy arrays.
Find more cheat sheets and learn data science with Python at www.datacamp.com.
The document discusses frequent itemset mining methods. It describes the Apriori algorithm which uses a candidate generation-and-test approach involving joining and pruning steps. It also describes the FP-Growth method which mines frequent itemsets without candidate generation by building a frequent-pattern tree. The advantages of each method are provided, such as Apriori being easily parallelized but requiring multiple database scans.
This document discusses functions and methods in Python. It defines functions and methods, and explains the differences between them. It provides examples of defining and calling functions, returning values from functions, and passing arguments to functions. It also covers topics like local and global variables, function decorators, generators, modules, and lambda functions.
Getting started with Pandas Cheatsheet.pdfSudhakarVenkey
Pandas is a popular Python library for data wrangling and analysis that can handle various data formats like CSV, Excel, JSON, HTML, and SQL. It allows users to create DataFrames from dictionaries, lists, and by importing data from text, Excel files, websites, databases and JSON. DataFrames can be exported, inspected by viewing rows and columns, subsetted by selecting rows and columns, queried using logical conditions, and reshaped by changing layout, renaming columns, appending rows/columns, and sorting values.
This document provides a summary of a seminar presentation on robotic process automation and virtual internships. It introduces popular Python libraries for data science like NumPy, SciPy, Pandas, matplotlib and Seaborn. It covers reading, exploring and manipulating data frames; filtering and selecting data; grouping; descriptive statistics. It also discusses missing value handling and aggregation functions. The goal is to provide an overview of key Python tools and techniques for data analysis.
import os import matplotlib-pyplot as plt import pandas as pd import r.docxBlake0FxCampbelld
import os
import matplotlib.pyplot as plt
import pandas as pd
import random
import string
from numpy.random import default_rng
import datetime
def run(low=None, high=None, number_values=None):
"""
'main' function for hw1 lecture
creates some data then saves it in a folder
"""
# User enters low, high, n integer
print("Creating random data.")
if type(low) != int:
while(True):
low = input("Please enter lowest possible value as integer:\n")
low_int = confirm_integer(low)
if low_int:
break
else:
continue
else:
low_int = low
if type(high) != int:
while(True):
high = input("Please enter highest possible value as integer:\n")
high_int = confirm_integer(high)
if high_int:
break
else:
continue
else:
high_int = high
if type(number_values) != int:
while(True):
n = input("Please enter number of values to create:\n")
n_int = confirm_integer(n)
if n_int:
break
else:
continue
else:
n_int = number_values
random_df = make_random_df(low_int, high_int, n_int)
# make output dir
local_dir = os.getcwd()
output_dir = os.path.join(local_dir, "Random_Files") # complete path to new folder
try:
os.mkdir(output_dir) # make folder
except FileExistsError:
pass # Do nothing if folder exists
output_df_os_mod_examples(random_df, output_dir)
def confirm_integer(input_val):
"""
Converts input val to integer, returns False if exception
:param input_val:
:return: integer or False
"""
try:
convert_int = int(input_val)
except Exception as e:
print("invalid integer")
return False
return convert_int
def make_random_df(lower_int, upper_int, number_idx):
"""
Creates Random DataFrame from bounds [lower_int, upper_int].
Creates 5 column DataFrame [Integers, Floads, Random Ascii (Ul#), Random Ascii (l), Random Ascii (U#)] number_idx rows
:param lower_int: lower bounds for random integers
:param upper_int: upper bounds for random integers
:param number_idx: number of random_df index
:return: rand_df: Pandas Dataframe (5 cols) number_idx rows
"""
# create n random data
rng = default_rng()
ints = rng.integers(low=lower_int, high=upper_int, size=number_idx)
floats = ints * rng.random() # multiply integers by random float [0,1]
# create some characters
rand_upper_lower_digit = random.choices(string.ascii_letters + string.digits, k=number_idx)
rand_lower = random.choices(string.ascii_lowercase, k=number_idx)
rand_upper_digit = random.choices(string.ascii_uppercase + string.digits, k=number_idx)
# make dataframe
rand_df_dict = {
"Integers": ints,
"Floats": floats,
"Random Ascii (Ul#)": rand_upper_lower_digit,
"Random Ascii (l)": rand_lower,
"Random Ascii (U#)": rand_upper_digit
}
rand_df = pd.DataFrame.from_dict(rand_df_dict, orient='columns')
return rand_df
def output_df_os_mod_examples(random_df, output_dir):
"""
Output dataframe in many file types.
Examples of os and os.path
:param random_df: pandasDataFrame cols:[Integers, Floads, Random Ascii (Ul#), Random Ascii (l), Random Ascii (U#)]
:param output_dir: path of output folder, exists==True
:return: None
"""
cwd = os.getcwd()
files = os..
This document provides an overview of Python libraries for data analysis and data science. It discusses popular Python libraries such as NumPy, Pandas, SciPy, Scikit-Learn and visualization libraries like matplotlib and Seaborn. It describes the functionality of these libraries for tasks like reading and manipulating data, descriptive statistics, inferential statistics, machine learning and data visualization. It also provides examples of using these libraries to explore a sample dataset and perform operations like data filtering, aggregation, grouping and missing value handling.
Best Data Science Ppt using Python
Data science is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from many structural and unstructured data. Data science is related to data mining, machine learning and big data.
The document discusses various Python libraries used for data science tasks. It describes NumPy for numerical computing, SciPy for algorithms, Pandas for data structures and analysis, Scikit-Learn for machine learning, Matplotlib for visualization, and Seaborn which builds on Matplotlib. It also provides examples of loading data frames in Pandas, exploring and manipulating data, grouping and aggregating data, filtering, sorting, and handling missing values.
Pandas is a popular Python library for data analysis and manipulation. It allows users to import data from various sources like CSV, Excel, JSON, databases into DataFrames. DataFrames can then be queried, filtered, reshaped, inspected, and exported back into different formats. Some key operations include importing/exporting data, subsetting rows and columns, reshaping the layout, and aggregating/summarizing the data.
The document provides a cheat sheet on the pandas DataFrame object. It discusses importing pandas, creating DataFrames from various data sources like CSVs, Excel, and dictionaries. It covers common operations on DataFrames like selecting, filtering, and transforming columns; handling indexes; and saving DataFrames. The DataFrame is a two-dimensional data structure with labeled columns that can be manipulated using various methods.
This document discusses building machine learning models to predict if Titanic passengers survived using a dataset from Kaggle. It outlines the steps of exploratory data analysis, data processing including encoding, standardization, and imputation, feature selection, splitting the data into training and test sets, building models like logistic regression, random forest and evaluating them on the test set using metrics like accuracy, confusion matrix and classification report. Random forest is found to have the best performance with an accuracy of 77% on the test set.
COMPUTER SCIENCE CLASS 12 PRACTICAL FILEAnushka Rai
Here's my Computer Science Board Practical File. I hope you find it as useful as it was to me.This file is however of CBSE class 12th 2020-2021 syllabus.
Python is an interpreted high-level general-purpose programming language. Its design philosophy emphasizes code readability with its use of significant indentation. Its language constructs as well as its object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects
This document analyzes income qualification data to predict poverty levels. It explores the data types and features, cleans missing and mixed values, and builds a random forest classifier model. Cross-validation shows the model achieves over 93% accuracy. The most important features are found to be monthly rent paid, number of rooms, and responses to questions about household members. The model is applied to test data to predict poverty levels.
Good morning Salma Hayek you have to do is your purpose of the best time to plant grass seed in the morning Salma Hayek you have to do is your purpose of the best time to plant grass seed in the morning Salma Hayek you have to do is your purpose of the best time to plant grass seed in the morning Salma Hayek you want me potter to plant in spring I will be there in the morning Salma Hayek you have a nice weekend with someone legally allowed in spring a contract for misunderstanding and tomorrow I hope it was about the best msg you want me potter you want me potter you want to do is your purpose of the best time to plant grass seed in the morning Salma Hayek you have to do it up but what do you think about the pros of the morning Salma good mornings are you doing well and tomorrow I hope it goes well and I hope you to do it goes well and tomorrow I have to be there at both locations in spring a nice day service and I hope it goes away soon as I can you have to be to get a I hope it goes away soon I hope it goes away soon I hope it goes away soon as I can you have to be to work at a time I can do is
This document provides an overview of topics covered on Day 1 of a Python training, including strings, control flow, and data structures. Strings topics include methods, formatting, and Unicode. Control flow covers conditions, loops (for and while), and range. Data structures discussed are tuples, lists, dictionaries, and sorting. The document concludes with an overview of topics for the next session, including functions, object-oriented programming, and Python packaging.
GE8151 Problem Solving and Python ProgrammingMuthu Vinayagam
The document provides information about various Python concepts like print statement, variables, data types, operators, conditional statements, loops, functions, modules, exceptions, files and packages. It explains print statement syntax, how variables work in Python, built-in data types like numbers, strings, lists, dictionaries and tuples. It also discusses conditional statements like if-else, loops like while and for, functions, modules, exceptions, file handling operations and packages in Python.
The document contains a Python program to demonstrate various data types, list methods, tuple methods, dictionary methods, arithmetic operations using functions, filtering even numbers from a list, date and time functions, adding days to a date, counting character frequencies in a string, NumPy array operations, concatenating dataframes, reading a CSV file using Pandas, and calculating area and circumference of a circle using the math module. The program contains examples and outputs for each concept.
Is it easier to add functional programming features to a query language, or to add query capabilities to a functional language? In Morel, we have done the latter.
Functional and query languages have much in common, and yet much to learn from each other. Functional languages have a rich type system that includes polymorphism and functions-as-values and Turing-complete expressiveness; query languages have optimization techniques that can make programs several orders of magnitude faster, and runtimes that can use thousands of nodes to execute queries over terabytes of data.
Morel is an implementation of Standard ML on the JVM, with language extensions to allow relational expressions. Its compiler can translate programs to relational algebra and, via Apache Calcite’s query optimizer, run those programs on relational backends.
In this talk, we describe the principles that drove Morel’s design, the problems that we had to solve in order to implement a hybrid functional/relational language, and how Morel can be applied to implement data-intensive systems.
(A talk given by Julian Hyde at Strange Loop 2021, St. Louis, MO, on October 1st, 2021.)
fINAL Lesson_5_Data_Manipulation_using_R_v1.pptxdataKarthik
Anna is a junior data scientist working on a customer retention strategy. She needs to analyze data from different sources to understand customer value. To efficiently perform her job, she needs to learn techniques for reading, merging, summarizing and preparing data for analysis in R. These include reading data from files and databases, merging tables, summarizing data using functions like mean, median, and aggregate, and exporting cleaned data.
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
UiPath Test Automation using UiPath Test Suite series, part 6DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
UiPath Test Automation with generative AI and Open AI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with Open AI advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Building Production Ready Search Pipelines with Spark and MilvusZilliz
Spark is the widely used ETL tool for processing, indexing and ingesting data to serving stack for search. Milvus is the production-ready open-source vector database. In this talk we will show how to use Spark to process unstructured data to extract vector representations, and push the vectors to Milvus vector database for search serving.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/building-and-scaling-ai-applications-with-the-nx-ai-manager-a-presentation-from-network-optix/
Robin van Emden, Senior Director of Data Science at Network Optix, presents the “Building and Scaling AI Applications with the Nx AI Manager,” tutorial at the May 2024 Embedded Vision Summit.
In this presentation, van Emden covers the basics of scaling edge AI solutions using the Nx tool kit. He emphasizes the process of developing AI models and deploying them globally. He also showcases the conversion of AI models and the creation of effective edge AI pipelines, with a focus on pre-processing, model conversion, selecting the appropriate inference engine for the target hardware and post-processing.
van Emden shows how Nx can simplify the developer’s life and facilitate a rapid transition from concept to production-ready applications.He provides valuable insights into developing scalable and efficient edge AI solutions, with a strong focus on practical implementation.
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Your One-Stop Shop for Python Success: Top 10 US Python Development Providersakankshawande
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
Generating privacy-protected synthetic data using Secludy and MilvusZilliz
During this demo, the founders of Secludy will demonstrate how their system utilizes Milvus to store and manipulate embeddings for generating privacy-protected synthetic data. Their approach not only maintains the confidentiality of the original data but also enhances the utility and scalability of LLMs under privacy constraints. Attendees, including machine learning engineers, data scientists, and data managers, will witness first-hand how Secludy's integration with Milvus empowers organizations to harness the power of LLMs securely and efficiently.
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU und die Lizenzen nach dem CCB- und CCX-Modell sind für viele in der HCL-Community seit letztem Jahr ein heißes Thema. Als Notes- oder Domino-Kunde haben Sie vielleicht mit unerwartet hohen Benutzerzahlen und Lizenzgebühren zu kämpfen. Sie fragen sich vielleicht, wie diese neue Art der Lizenzierung funktioniert und welchen Nutzen sie Ihnen bringt. Vor allem wollen Sie sicherlich Ihr Budget einhalten und Kosten sparen, wo immer möglich. Das verstehen wir und wir möchten Ihnen dabei helfen!
Wir erklären Ihnen, wie Sie häufige Konfigurationsprobleme lösen können, die dazu führen können, dass mehr Benutzer gezählt werden als nötig, und wie Sie überflüssige oder ungenutzte Konten identifizieren und entfernen können, um Geld zu sparen. Es gibt auch einige Ansätze, die zu unnötigen Ausgaben führen können, z. B. wenn ein Personendokument anstelle eines Mail-Ins für geteilte Mailboxen verwendet wird. Wir zeigen Ihnen solche Fälle und deren Lösungen. Und natürlich erklären wir Ihnen das neue Lizenzmodell.
Nehmen Sie an diesem Webinar teil, bei dem HCL-Ambassador Marc Thomas und Gastredner Franz Walder Ihnen diese neue Welt näherbringen. Es vermittelt Ihnen die Tools und das Know-how, um den Überblick zu bewahren. Sie werden in der Lage sein, Ihre Kosten durch eine optimierte Domino-Konfiguration zu reduzieren und auch in Zukunft gering zu halten.
Diese Themen werden behandelt
- Reduzierung der Lizenzkosten durch Auffinden und Beheben von Fehlkonfigurationen und überflüssigen Konten
- Wie funktionieren CCB- und CCX-Lizenzen wirklich?
- Verstehen des DLAU-Tools und wie man es am besten nutzt
- Tipps für häufige Problembereiche, wie z. B. Team-Postfächer, Funktions-/Testbenutzer usw.
- Praxisbeispiele und Best Practices zum sofortigen Umsetzen
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxSitimaJohn
Ocean Lotus cyber threat actors represent a sophisticated, persistent, and politically motivated group that poses a significant risk to organizations and individuals in the Southeast Asian region. Their continuous evolution and adaptability underscore the need for robust cybersecurity measures and international cooperation to identify and mitigate the threats posed by such advanced persistent threat groups.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Speck&Tech
ABSTRACT: A prima vista, un mattoncino Lego e la backdoor XZ potrebbero avere in comune il fatto di essere entrambi blocchi di costruzione, o dipendenze di progetti creativi e software. La realtà è che un mattoncino Lego e il caso della backdoor XZ hanno molto di più di tutto ciò in comune.
Partecipate alla presentazione per immergervi in una storia di interoperabilità, standard e formati aperti, per poi discutere del ruolo importante che i contributori hanno in una comunità open source sostenibile.
BIO: Sostenitrice del software libero e dei formati standard e aperti. È stata un membro attivo dei progetti Fedora e openSUSE e ha co-fondato l'Associazione LibreItalia dove è stata coinvolta in diversi eventi, migrazioni e formazione relativi a LibreOffice. In precedenza ha lavorato a migrazioni e corsi di formazione su LibreOffice per diverse amministrazioni pubbliche e privati. Da gennaio 2020 lavora in SUSE come Software Release Engineer per Uyuni e SUSE Manager e quando non segue la sua passione per i computer e per Geeko coltiva la sua curiosità per l'astronomia (da cui deriva il suo nickname deneb_alpha).
1. Pandas | Numpy | Sklearn
Matplotlib | Seaborn
BS4 | Selenium | Scrapy
Python
Cheat Sheet
by Frank Andrade
2. Python Basics
Cheat Sheet
Here you will find all the Python core concepts you need to
know before learning any third-party library.
Data Types
Integers (int): 1
Float (float): 1.2
String (str): "Hello World"
Boolean: True/False
List: [value1, value2]
Dictionary: {key1:value1, key2:value2, ...}
Variables
Variable assignment:
message_1 = "I'm learning Python"
message_2 = "and it's fun!"
String concatenation (+ operator):
message_1 + ' ' + message_2
String concatenation (f-string):
f'{message_1} {message_2}'
List
Creating a list:
countries = ['United States', 'India',
'China', 'Brazil']
Create an empty list:
my_list = []
Indexing:
>>> countries[0]
United States
>>> countries[3]
Brazil
>>> countries[-1]
Brazil
Slicing:
>>>countries[0:3]
['United States', 'India', 'China']
>>>countries[1:]
['India', 'China', 'Brazil']
>>>countries[:2]
['United States', 'India']
Adding elements to a list:
countries.append('Canada')
countries.insert(0,'Canada')
Nested list:
nested_list = [countries, countries_2]
Remove element:
countries.remove('United States')
countries.pop(0)#removes and returns value
del countries[0]
Creating a new list:
numbers = [4, 3, 10, 7, 1, 2]
Sorting a list:
>>> numbers.sort()
[1, 2, 3, 4, 7, 10]
>>> numbers.sort(reverse=True)
[10, 7, 4, 3, 2, 1]
Update value on a list:
>>> numbers[0] = 1000
>>> numbers
[1000, 7, 4, 3, 2, 1]
Copying a list:
new_list = countries[:]
new_list_2 = countries.copy()
Built-in Functions
Print an object:
print("Hello World")
Return the length of x:
len(x)
Return the minimum value:
min(x)
Return the maximum value:
max(x)
Returns a sequence of numbers:
range(x1,x2,n) # from x1 to x2
(increments by n)
Convert x to a string:
str(x)
Convert x to an integer/float:
int(x)
float(x)
Convert x to a list:
list(x)
Addition
Subtraction
Multiplication
Division
Exponent
Modulus
Floor division
+
-
*
/
**
%
//
Equal to
Different
Greater than
Less than
Greater than or equal to
Less than or equal to
==
!=
>
<
>=
<=
Numeric Operators Comparison Operators
String methods
string.upper(): converts to uppercase
string.lower(): converts to lowercase
string.title(): converts to title case
string.count('l'): counts how many times "l"
appears
string.find('h'): position of the "h" first
ocurrance
string.replace('o', 'u'): replaces "o" with "u"
3. If Statement
Conditional test:
if <condition>:
<code>
elif <condition>:
<code>
...
else:
<code>
Example:
if age>=18:
print("You're an adult!")
Conditional test with list:
if <value> in <list>:
<code>
Loops
For loop:
for <variable> in <list>:
<code>
For loop and enumerate list elements:
for i, element in enumerate(<list>):
<code>
For loop and obtain dictionary elements:
for key, value in my_dict.items():
<code>
While loop:
while <condition>:
<code>
Data Validation
Try-except:
try:
<code>
except <error>:
<code>
Loop control statement:
break: stops loop execution
continue: jumps to next iteration
pass: does nothing
Functions
Create a function:
def function(<params>):
<code>
return <data>
Modules
Import module:
import module
module.method()
OS module:
import os
os.getcwd()
os.listdir()
os.makedirs(<path>)
Dictionary
Creating a dictionary:
my_data = {'name':'Frank', 'age':26}
Create an empty dictionary:
my_dict = {}
Get value of key "name":
>>> my_data["name"]
'Frank'
Get the keys:
>>> my_data.keys()
dict_keys(['name', 'age'])
Get the values:
>>> my_data.values()
dict_values(['Frank', 26])
Get the pair key-value:
>>> my_data.items()
dict_items([('name', 'Frank'), ('age', 26)])
Adding/updating items in a dictionary:
my_data['height']=1.7
my_data.update({'height':1.8,
'languages':['English', 'Spanish']})
>>> my_data
{'name': 'Frank',
'age': 26,
'height': 1.8,
'languages': ['English', 'Spanish']}
Remove an item:
my_data.pop('height')
del my_data['languages']
my_data.clear()
Copying a dictionary:
new_dict = my_data.copy()
logical AND
logical OR
logical NOT
and
or
not
Boolean Operators
logical AND
logical OR
logical NOT
&
|
~
Boolean Operators
(Pandas)
Below there are my guides, tutorials
and complete Data Science course:
- Medium Guides
- YouTube Tutorials
- Data Science Course (Udemy)
Made by Frank Andrade frank-andrade.medium.com
Comment
New Line
#
n
Special Characters
4. Pandas
Cheat Sheet
Pandas provides data analysis tools for Python. All of the
following code examples refer to the dataframe below.
Getting Started
Import pandas:
import pandas as pd
Create a series:
s = pd.Series([1, 2, 3],
index=['A', 'B', 'C'],
name='col1')
Create a dataframe:
data = [[1, 4], [2, 5], [3, 6]]
index = ['A', 'B', 'C']
df = pd.DataFrame(data, index=index,
columns=['col1', 'col2'])
Read a csv file with pandas:
df = pd.read_csv('filename.csv')
Advanced parameters:
df = pd.read_csv('filename.csv', sep=',',
names=['col1', 'col2'],
index_col=0,
encoding='utf-8',
nrows=3)
Selecting rows and columns
Select single column:
df['col1']
Select multiple columns:
df[['col1', 'col2']]
Show first n rows:
df.head(2)
Show last n rows:
df.tail(2)
Select rows by index values:
df.loc['A'] df.loc[['A', 'B']]
Select rows by position:
df.loc[1] df.loc[1:]
Data wrangling
Filter by value:
df[df['col1'] > 1]
Sort by one column:
df.sort_values('col1')
Sort by columns:
df.sort_values(['col1', 'col2'],
ascending=[False, True])
Identify duplicate rows:
df.duplicated()
Identify unique rows:
df['col1'].unique()
Swap rows and columns:
df = df.transpose()
df = df.T
Drop a column:
df = df.drop('col1', axis=1)
Clone a data frame:
clone = df.copy()
Connect multiple data frames vertically:
df2 = df + 5 #new dataframe
pd.concat([df,df2])
Merge multiple data frames horizontally:
df3 = pd.DataFrame([[1, 7],[8,9]],
index=['B', 'D'],
columns=['col1', 'col3'])
#df3: new dataframe
Only merge complete rows (INNER JOIN):
df.merge(df3)
Left column stays complete (LEFT OUTER JOIN):
df.merge(df3, how='left')
Right column stays complete (RIGHT OUTER JOIN):
df.merge(df3, how='right')
Preserve all values (OUTER JOIN):
df.merge(df3, how='outer')
Merge rows by index:
df.merge(df3,left_index=True,
right_index=True)
Fill NaN values:
df.fillna(0)
Apply your own function:
def func(x):
return 2**x
df.apply(func)
Arithmetics and statistics
Add to all values:
df + 10
Sum over columns:
df.sum()
Cumulative sum over columns:
df.cumsum()
Mean over columns:
df.mean()
Standard deviation over columns:
df.std()
Count unique values:
df['col1'].value_counts()
Summarize descriptive statistics:
df.describe()
A
B
C
col1 col2
1
2
3
4
5
6
df =
axis 0
axis 1
5. Hierarchical indexing
Create hierarchical index:
df.stack()
Dissolve hierarchical index:
df.unstack()
Aggregation
Create group object:
g = df.groupby('col1')
Iterate over groups:
for i, group in g:
print(i, group)
Aggregate groups:
g.sum()
g.prod()
g.mean()
g.std()
g.describe()
Select columns from groups:
g['col2'].sum()
g[['col2', 'col3']].sum()
Transform values:
import math
g.transform(math.log)
Apply a list function on each group:
def strsum(group):
return ''.join([str(x) for x in group.value])
g['col2'].apply(strsum)
Boxplot:
df['col1'].plot(kind='box')
Histogram over one column:
df['col1'].plot(kind='hist',
bins=3)
Piechart:
df.plot(kind='pie',
y='col1',
title='Population')
Set tick marks:
labels = ['A', 'B', 'C', 'D']
positions = [1, 2, 3, 4]
plt.xticks(positions, labels)
plt.yticks(positions, labels)
Label diagram and axes:
plt.title('Correlation')
plt.xlabel('Nunstück')
plt.ylabel('Slotermeyer')
Save most recent diagram:
plt.savefig('plot.png')
plt.savefig('plot.png',dpi=300)
plt.savefig('plot.svg')
Data export
Data as NumPy array:
df.values
Save data as CSV file:
df.to_csv('output.csv', sep=",")
Format a dataframe as tabular string:
df.to_string()
Convert a dataframe to a dictionary:
df.to_dict()
Save a dataframe as an Excel table:
df.to_excel('output.xlsx')
Pivot and Pivot Table
Read csv file 1:
df_gdp = pd.read_csv('gdp.csv')
The pivot() method:
df_gdp.pivot(index="year",
columns="country",
values="gdppc")
Read csv file 2:
df_sales=pd.read_excel(
'supermarket_sales.xlsx')
Make pivot table:
df_sales.pivot_table(index='Gender',
aggfunc='sum')
Make a pivot tables that says how much male and
female spend in each category:
df_sales.pivot_table(index='Gender',
columns='Product line',
values='Total',
aggfunc='sum')
Visualization
The plots below are made with a dataframe
with the shape of df_gdp (pivot() method)
Import matplotlib:
import matplotlib.pyplot as plt
Start a new diagram:
plt.figure()
Scatter plot:
df.plot(kind='scatter')
Bar plot:
df.plot(kind='bar',
xlabel='data1',
ylabel='data2')
Lineplot:
df.plot(kind='line',
figsize=(8,4))
Below there are my guides, tutorials
and complete Pandas course:
- Medium Guides
- YouTube Tutorials
- Pandas Course (Udemy)
Made by Frank Andrade frank-andrade.medium.com
6. NumPy
Cheat Sheet
Getting Started
Import numpy:
import numpy as np
Create arrays:
a = np.array([1,2,3])
b = np.array([(1.5,2,3), (4,5,6)], dtype=float)
c = np.array([[(1.5,2,3), (4,5,6)],
[(3,2,1), (4,5,6)]],
dtype = float)
Initial placeholders:
np.zeros((3,4)) #Create an array of zeros
np.ones((2,3,4),dtype=np.int16)
d = np.arange(10,25,5)
np.linspace( 0,2, 9)
e = np.full((2,2), 7)
f = np.eye(2)
np.random.random((2,2))
np.empty((3,2))
Saving & Loading On Disk:
np.save('my_array', a)
np.savez('array.npz', a, b)
np.load('my_array.npy')
Saving & Loading Text Files
np.loadtxt('my_file.txt')
np.genfromtxt('my_file.csv',
delimiter=',')
np.savetxt('myarray.txt', a,
delimiter= ' ')
Inspecting Your Array
a.shape
len(a)
b.ndim
e.size
b.dtype #data type
b.dtype.name
b.astype(int) #change data type
Data Types
np.int64
np.float32
np.complex
np.bool
np.object
np.string_
np.unicode_
Array Mathematics
Arithmetic Operations
>>> g = a-b
array([[-0.5, 0. , 0. ],
[-3. , 3. , 3. ]])
>>> np.subtract(a,b)
>>> b+a
array([[2.5, 4. , 6. ],
[ 5. , 7. , 9. ]])
>>> np.add(b,a)
>>> a/b
array([[ 0.66666667, 1. , 1. ],
[ 0.2 5 , 0.4 , 0 . 5 ]])
>>> np.divide(a,b)
>>> a*b
array([[ 1 . 5, 4. , 9. ],
[ 4. , 10. , 18. ]])
>>> np.multiply(a,b)
>>> np.exp(b)
>>> np.sqrt(b)
>>> np.sin(a)
>>> np.log(a)
>>> e.dot(f)
Aggregate functions:
a.sum()
a.min()
b.max(axis= 0)
b.cumsum(axis= 1) #Cumulative sum
a.mean()
b.median()
a.corrcoef() #Correlation coefficient
np.std(b) #Standard deviation
Copying arrays:
h = a.view() #Create a view
np.copy(a)
h = a.copy() #Create a deep copy
Sorting arrays:
a.sort() #Sort an array
c.sort(axis=0)
Array Manipulation
Transposing Array:
i = np.transpose(b)
i.T
Changing Array Shape:
b.ravel()
g.reshape(3,-2)
Adding/removing elements:
h.resize((2,6))
np.append(h,g)
np.insert(a, 1, 5)
np.delete(a,[1])
Combining arrays:
np.concatenate((a,d),axis=0)
np.vstack((a,b)) #stack vertically
np.hstack((e,f)) #stack horizontally
Splitting arrays:
np.hsplit(a,3) #Split horizontally
np.vsplit(c,2) #Split vertically
Subsetting
b[1,2]
Slicing:
a[0:2]
Boolean Indexing:
a[a<2]
NumPy provides tools for working with arrays. All of the
following code examples refer to the arrays below.
NumPy Arrays
1D Array 2D Array
1 2 3 2 3
1.5
4 5 6
axis 0
axis 1
2 3
1.5
4 5 6
2 3
1
2 3
1
7. Scikit-Learn
Cheat Sheet
Sklearn is a free machine learning library for Python. It features various
classification, regression and clustering algorithms.
Getting Started
The code below demonstrates the basic steps of using sklearn to create and run a model
on a set of data.
The steps in the code include loading the data, splitting into train and test sets, scaling
the sets, creating the model, fitting the model on the data using the trained model to
make predictions on the test set, and finally evaluating the performance of the model.
from sklearn import neighbors,datasets,preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
iris = datasets.load_iris()
X,y = iris.data[:,:2], iris.target
X_train, X_test, y_train, y_test=train_test_split(X,y)
scaler = preprocessing_StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
knn = neighbors.KNeighborsClassifier(n_neighbors = 5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
accuracy_score(y_test, y_pred)
Loading the Data
The data needs to be numeric and stored as NumPy arrays or SciPy spare matrix
(numeric arrays, such as Pandas DataFrame’s are also ok)
>>> import numpy as np
>>> X = np.random.random((10,5))
array([[0.21,0.33],
[0.23, 0.60],
[0.48, 0.62]])
>>> y = np.array(['A','B','A'])
array(['A', 'B', 'A'])
Training and Test Data
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,
random_state = 0)#Splits data into training and test set
Preprocessing The Data
Standardization
Standardizes the features by removing the mean and scaling to unit variance.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_train)
standarized_X = scaler.transform(X_train)
standarized_X_test = scaler.transform(X_test)
Normalization
Each sample (row of the data matrix) with at least one non-zero component is
rescaled independently of other samples so that its norm equals one.
from sklearn.preprocessing import Normalizer
scaler = Normalizer().fit(X_train)
normalized_X = scaler.transform(X_train)
normalized_X_test = scaler.transform(X_test)
Binarization
Binarize data (set feature values to 0 or 1) according to a threshold.
from sklearn.preprocessing import Binarizer
binarizer = Binarizer(threshold = 0.0).fit(X)
binary_X = binarizer.transform(X_test)
Encoding Categorical Features
Imputation transformer for completing missing values.
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit_transform(X_train)
Imputing Missing Values
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=0, strategy ='mean')
imp.fit_transform(X_train)
Generating Polynomial Features
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(5)
poly.fit_transform(X)
8. Create Your Model
Supervised Learning Models
Linear Regression
from sklearn.linear_model import LinearRegression
lr = LinearRegression(normalize = True)
Support Vector Machines (SVM)
from sklearn.svm import SVC
svc = SVC(kernel = 'linear')
Naive Bayes
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
KNN
from sklearn import neighbors
knn = neighbors.KNeighborsClassifier(n_neighbors = 5)
Unsupervised Learning Models
Principal Component Analysis (PCA)
from sklearn.decomposition import PCA
pca = PCA(n_components = 0.95)
K means
from sklearn.cluster import KMeans
k_means = KMeans(n_clusters = 3, random_state = 0)
Model Fitting
Fitting supervised and unsupervised learning models onto data.
Supervised Learning
lr.fit(X, y) #Fit the model to the data
knn.fit(X_train,y_train)
svc.fit(X_train,y_train)
Unsupervised Learning
k_means.fit(X_train) #Fit the model to the data
pca_model = pca.fit_transform(X_train)#Fit to data,then transform
Prediction
Predict Labels
y_pred = lr.predict(X_test) #Supervised Estimators
y_pred = k_means.predict(X_test) #Unsupervised Estimators
Estimate probability of a label
y_pred = knn.predict_proba(X_test)
Evaluate Your Model’s Performance
Classification Metrics
Accuracy Score
knn.score(X_test,y_test)
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)
Classification Report
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))
Confusion Matrix
from sklearn .metrics import confusion_matrix
print(confusion_matrix(y_test,y_pred))
Regression Metrics
Mean Absolute Error
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test,y_pred)
Mean Squared Error
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test,y_pred)
R² Score
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)
Clustering Metrics
Adjusted Rand Index
from sklearn.metrics import adjusted_rand_score
adjusted_rand_score(y_test,y_pred)
Homogeneity
from sklearn.metrics import homogeneity_score
homogeneity_score(y_test,y_pred)
V-measure
from sklearn.metrics import v_measure_score
v_measure_score(y_test,y_pred)
Tune Your Model
Grid Search
from sklearn.model_selection import GridSearchCV
params = {'n_neighbors':np.arange(1,3),
'metric':['euclidean','cityblock']}
grid = GridSearchCV(estimator = knn, param_grid = params)
grid.fit(X_train, y_train)
print(grid.best_score_)
print(grid.best_estimator_)
9. Matplotlib
Workflow
The basic steps to creating plots with matplotlib are Prepare
Data, Plot, Customize Plot, Save Plot and Show Plot.
import matplotlib.pyplot as plt
Example with lineplot
Prepare data
x = [2017, 2018, 2019, 2020, 2021]
y = [43, 45, 47, 48, 50]
Plot & Customize Plot
plt.plot(x,y,marker='o',linestyle='--',
color='g', label='USA')
plt.xlabel('Years')
plt.ylabel('Population (M)')
plt.title('Years vs Population')
plt.legend(loc='lower right')
plt.yticks([41, 45, 48, 51])
Save Plot
plt.savefig('example.png')
Show Plot
plt.show()
Markers: '.', 'o', 'v', '<', '>'
Line Styles: '-', '--', '-.', ':'
Colors: 'b', 'g', 'r', 'y' #blue, green, red, yellow
Barplot
x = ['USA', 'UK', 'Australia']
y = [40, 50, 33]
plt.bar(x, y)
plt.show()
Piechart
plt.pie(y, labels=x, autopct='%.0f %%')
plt.show()
Histogram
ages = [15, 16, 17, 30, 31, 32, 35]
bins = [15, 20, 25, 30, 35]
plt.hist(ages, bins, edgecolor='black')
plt.show()
Boxplots
ages = [15, 16, 17, 30, 31, 32, 35]
plt.boxplot(ages)
plt.show()
Scatterplot
a = [1, 2, 3, 4, 5, 4, 3 ,2, 5, 6, 7]
b = [7, 2, 3, 5, 5, 7, 3, 2, 6, 3, 2]
plt.scatter(a, b)
plt.show()
Subplots
Add the code below to make multple plots with 'n'
number of rows and columns.
fig, ax = plt.subplots(nrows=1,
ncols=2,
sharey=True,
figsize=(12, 4))
Plot & Customize Each Graph
ax[0].plot(x, y, color='g')
ax[0].legend()
ax[1].plot(a, b, color='r')
ax[1].legend()
plt.show()
Data Viz
Cheat Sheet
Seaborn
Workflow
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
Lineplot
plt.figure(figsize=(10, 5))
flights = sns.load_dataset("flights")
may_flights=flights.query("month=='May'")
ax = sns.lineplot(data=may_flights,
x="year",
y="passengers")
ax.set(xlabel='x', ylabel='y',
title='my_title, xticks=[1,2,3])
ax.legend(title='my_legend,
title_fontsize=13)
plt.show()
Barplot
tips = sns.load_dataset("tips")
ax = sns.barplot(x="day",
y="total_bill,
data=tips)
Histogram
penguins = sns.load_dataset("penguins")
sns.histplot(data=penguins,
x="flipper_length_mm")
Boxplot
tips = sns.load_dataset("tips")
ax = sns.boxplot(x=tips["total_bill"])
Scatterplot
tips = sns.load_dataset("tips")
sns.scatterplot(data=tips,
x="total_bill",
y="tip")
Figure aesthetics
sns.set_style('darkgrid') #stlyes
sns.set_palette('husl', 3) #palettes
sns.color_palette('husl') #colors
Fontsize of the axes title, x and y labels, tick labels
and legend:
plt.rc('axes', titlesize=18)
plt.rc('axes', labelsize=14)
plt.rc('xtick', labelsize=13)
plt.rc('ytick', labelsize=13)
plt.rc('legend', fontsize=13)
plt.rc('font', size=13)
Matplotlib is a Python 2D plotting library that produces
figures in a variety of formats.
X-axis
Y-axis
Figure
10. HTML for Web Scraping
Let's take a look at the HTML element syntax.
This is a single HTML element, but the HTML code behind a
website has hundreds of them.
The HTML code is structured with “nodes”. Each rectangle below
represents a node (element, attribute and text nodes)
<h1 class="title"> Titanic (1997) </h1>
Element Element Element
Text
13 meters. You ...
XPath
We need to learn XPath to scrape with Selenium or
Scrapy.
XPath Syntax
An XPath usually contains a tag name, attribute
name, and attribute value.
Let’s check some examples to locate the article,
title, and transcript elements of the HTML code we
used before.
//article[@class="main-article"]
//h1
//div[@class="full-script"]
XPath Functions and Operators
XPath functions
XPath Operators: and, or
XPath Special Characters
“Siblings” are nodes with the same parent.
A node’s children and its children’s children are
called its “descendants”. Similarly, a node’s parent
and its parent’s parent are called its “ancestors”.
it’s recommended to find element in this order.
ID
Class name
Tag name
Xpath
a.
b.
c.
d.
Beautiful Soup
Workflow
Importing the libraries
from bs4 import BeautifulSoup
import requests
Fetch the pages
result=requests.get("www.google.com")
result.status_code #get status code
result.headers #get the headers
Page content
content = result.text
Create soup
soup = BeautifulSoup(content,"lxml")
HTML in a readable format
print(soup.prettify())
Find an element
soup.find(id="specific_id")
Find elements
soup.find_all("a")
soup.find_all("a","css_class")
soup.find_all("a",class_="my_class")
soup.find_all("a",attrs={"class":
"my_class"})
Get inner text
sample = element.get_text()
sample = element.get_text(strip=True,
separator= ' ')
Get specific attributes
sample = element.get('href')
Web Scraping
Cheat Sheet
Web Scraping is the process of extracting data from a
website. Before studying Beautiful Soup and Selenium, it's
good to review some HTML basics first.
Tag
name End tag
Attribute
name
Attribute
value
Attribute Affected content
HTML Element
HTML code example
<article class="main-article">
<h1> Titanic (1997) </h1>
<p class="plot"> 84 years later ... </p>
<div class="full-script"> 13 meters. You ... </div>
</article>
Root Element
Attribute
Attribute Text
Text
<article>
class="main-article"
<h1> <p> <div>
Titanic (1997) class="plot" 84 years later ...
Attribute
class="full-script""
Parent Node
Siblings
//tagName[@AttributeName="Value"]
//tag[contains(@AttributeName, "Value")]
//tag[(expression 1) and (expression 2)]
Selects the children from the node set on the
left side of this character
Specifies that the matching node set should
be located at any level within the document
Specifies the current context should be used
(refers to present node)
Refers to a parent node
A wildcard character that selects all
elements or attributes regardless of names
Select an attribute
Grouping an XPath expression
Indicates that a node with index "n" should
be selected
/
//
.
..
*
@
()
[n]
11. Selenium
Workflow
from selenium import webdriver
web="www.google.com"
path='introduce chromedriver path'
driver = webdriver.Chrome(path)
driver.get(web)
Find an element
driver.find_element_by_id('name')
Find elements
driver.find_elements_by_class_name()
driver.find_elements_by_css_selector
driver.find_elements_by_xpath()
driver.find_elements_by_tag_name()
driver.find_elements_by_name()
Quit driver
driver.quit()
Getting the text
data = element.text
Implicit Waits
import time
time.sleep(2)
Explicit Waits
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.ID,
'id_name'))) #Wait 5 seconds until an element is clickable
Options: Headless mode, change window size
from selenium.webdriver.chrome.options import Options
options = Options()
options.headless = True
options.add_argument('window-size=1920x1080')
driver=webdriver.Chrome(path,options=options)
Scrapy
Scrapy is the most powerful web scraping framework in Python, but it's a bit
complicated to set up, so check my guide or its documentation to set it up.
Creating a Project and Spider
To create a new project, run the following command in the terminal.
scrapy startproject my_first_spider
To create a new spider, first change the directory.
cd my_first_spider
Create an spider
scrapy genspider example example.com
The Basic Template
When you create a spider, you obtain a template with the following content.
The class is built with the data we introduced in the previous command, but the
parse method needs to be built by us. To build it, use the functions below.
Finding elements
To find elements in Scrapy, use the response argument from the parse method
Getting the text
To obtain the text element we use text() and either .get() or .getall(). For example:
response.xpath(‘//h1/text()’).get()
response.xpath(‘//tag[@Attribute=”Value”]/text()’).getall()
Return data extracted
To see the data extracted we have to use the yield keyword
def parse(self, response):
title = response.xpath(‘//h1/text()’).get()
# Return data extracted
yield {'titles': title}
Run the spider and export data to CSV or JSON
scrapy crawl example
scrapy crawl example -o name_of_file.csv
scrapy crawl example -o name_of_file.json
import scrapy
class ExampleSpider(scrapy.Spider):
name = 'example'
allowed_domains = ['example.com']
start_urls = ['http://example.com/']
def parse(self, response):
pass
Class
Parse method
response.xpath('//tag[@AttributeName="Value"]')
Below there are my guides, tutorials
and complete web scraping course:
- Medium Guides
- YouTube Tutorials
- Web Scraping Course (Udemy)
Made by Frank Andrade frank-andrade.medium.com