Introduction to pandas
Srashti Deshmukh
Pandas
But, why?
Pandas Data Structures
A Series is like a single column in a table; a DataFrame is a table of multiple columns sharing one index.

Series (index, values):
A     6
B  3.14
C    -4
D     0

DataFrame (index, columns):
   foo  bar    baz
A    x    6   True
B    y   10   True
C    z  NaN  False
Creating Series
import pandas as pd
s1 = pd.Series([1, 2, 3, 4])
0 1
1 2
2 3
3 4
s2 = pd.Series([1, 2, 3, 4], index=['A', 'B', 'C', 'D'])
A 1
B 2
C 3
D 4
Creating DataFrame
df = pd.DataFrame({'foo': ['x', 'y', 'z'],
                   'bar': [6, 10, None],
                   'baz': [True, True, False]})
foo bar baz
0 x 6 True
1 y 10 True
2 z NaN False
Importing Data
• pd.read_csv(filename) | From a CSV file
• pd.read_table(filename) | From a delimited text file (like TSV)
• pd.read_excel(filename) | From an Excel file
• pd.read_sql(query, connection_object) | Read from a SQL
table/database
• pd.read_json(json_string) | Read from a JSON formatted
string, URL or file
• pd.DataFrame(dict) | From a dict: keys become column names,
list values become the column data
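As a quick runnable sketch of two of the functions above — here io.StringIO stands in for a file on disk, so no real filename is needed:

```python
import io
import pandas as pd

# A small CSV held in memory; io.StringIO lets pd.read_csv treat it like a file
csv_text = "name,score\nAlice,90\nBob,85\n"
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (2, 2)

# pd.DataFrame(dict): keys become column names, lists become column data
df2 = pd.DataFrame({"name": ["Alice", "Bob"], "score": [90, 85]})
print(df.equals(df2))  # True
```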
Viewing/Inspecting Data
• df.head(n) | First n rows of the DataFrame
• df.tail(n) | Last n rows of the DataFrame
• df.shape | Number of rows and columns (an attribute, not a method)
• df.info() | Index, Datatype and Memory information
• df.describe() | Summary statistics for numerical
columns
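The inspection methods above, applied to the small workout DataFrame used later in these slides:

```python
import pandas as pd

df = pd.DataFrame({"Calories": [420, 300, 150], "Duration": [6, 10, 5]})

print(df.head(2))     # first 2 rows
print(df.tail(1))     # last row
print(df.shape)       # (3, 2) -- shape is an attribute, not a method
df.info()             # index, dtypes and memory usage
print(df.describe())  # count, mean, std, min, quartiles, max per numeric column
```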
Creating DataFrame from list
calories=[420,300,150]
duration=[6,10,5]
df = pd.DataFrame(list(zip(calories, duration)), columns=['Calories', 'Duration'])
Calories Duration
0 420 6
1 300 10
2 150 5
Column Selection
df['Calories']
Calories
0 420
1 300
2 150
Calories Duration
0 420 6
1 300 10
2 150 5
Column Selection
df[['Calories', 'Duration']]
Calories Duration
0 420 6
1 300 10
2 150 5
Calories Duration
0 420 6
1 300 10
2 150 5
Row Selection
df.loc[0]
Calories 420
Duration 6
Calories Duration
0 420 6
1 300 10
2 150 5
Row Selection
df.loc[0:1] OR
df.loc[[0, 1]]
(.loc slices by label and includes both endpoints, so df.loc[0:2] would return all three rows)
Calories Duration
0 420 6
1 300 10
Calories Duration
0 420 6
1 300 10
2 150 5
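A runnable sketch contrasting label-based .loc (endpoints inclusive) with position-based .iloc (stop exclusive), on the same DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"Calories": [420, 300, 150], "Duration": [6, 10, 5]})

# .loc slices by label and includes BOTH endpoints
print(df.loc[0:1])     # rows labelled 0 and 1
print(df.loc[[0, 1]])  # the same two rows, via an explicit list of labels

# .iloc slices by position and excludes the stop, like plain Python slicing
print(df.iloc[0:2])    # also rows 0 and 1
```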
Cleaning wrong data
df.loc[2, 'Duration'] = 5
Calories Duration
0 420 6
1 300 10
2 150 5
Calories Duration
0 420 6
1 300 10
2 150 500
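The fix above targets one known-bad cell; a boolean mask can repair every out-of-range value at once. The threshold of 100 below is an illustrative assumption, not something from the slides:

```python
import pandas as pd

df = pd.DataFrame({"Calories": [420, 300, 150], "Duration": [6, 10, 500]})
# Fix a single known-bad cell by row label and column name
df.loc[2, "Duration"] = 5

df2 = pd.DataFrame({"Calories": [420, 300, 150], "Duration": [6, 10, 500]})
# Or replace every out-of-range value at once (threshold of 100 is an assumption)
df2.loc[df2["Duration"] > 100, "Duration"] = 5

print(df.equals(df2))  # True
```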
Removing Duplicates
df.duplicated(): Returns True for every row that is a duplicate, otherwise False
df.drop_duplicates(inplace=True)
Calories Duration
0 420 6
1 300 10
Calories Duration
0 420 6
1 300 10
2 300 10
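The duplicate-handling calls above as a self-contained example:

```python
import pandas as pd

df = pd.DataFrame({"Calories": [420, 300, 300], "Duration": [6, 10, 10]})

print(df.duplicated().tolist())  # [False, False, True] -- row 2 repeats row 1

df.drop_duplicates(inplace=True)
print(len(df))                   # 2
```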
Conditional Filtering
foo bar baz
0 x 6 True
1 y 10 True
2 z NaN False
df[df['baz']]
foo bar baz
0 x 6 True
1 y 10 True
Conditional Filtering
foo bar baz
0 x 6 True
1 y 10 True
2 z NaN False
df[ (df['foo'] == 'x') |
(df['foo'] == 'z') ]
foo bar baz
0 x 6 True
2 z NaN False
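Both filters above in one runnable sketch, plus .isin() as a compact alternative to chaining == comparisons with |:

```python
import pandas as pd

df = pd.DataFrame({"foo": ["x", "y", "z"],
                   "bar": [6, 10, None],
                   "baz": [True, True, False]})

print(df[df["baz"]])                                # rows where baz is True
print(df[(df["foo"] == "x") | (df["foo"] == "z")])  # combine tests with | and &
print(df[df["foo"].isin(["x", "z"])])               # isin(): compact "one of" test
```

When combining conditions, each one needs its own parentheses because | and & bind tighter than ==.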
Handling Missing Values
new_df = df.dropna()
Full signature: df.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
foo bar baz
0 x 6 True
1 y 10 True
2 z NaN False
3 NaN NaN NaN
foo bar baz
0 x 6 True
1 y 10 True
Handling Missing Values
new_df = df.dropna(how='all')
foo bar baz
0 x 6 True
1 y 10 True
2 z NaN False
foo bar baz
0 x 6 True
1 y 10 True
2 z NaN False
3 NaN NaN NaN
Handling Missing Values
new_df = df.fillna(0)
foo bar baz
0 x 6 True
1 y 10 True
2 z NaN False
3 NaN NaN NaN
foo bar baz
0 x 6 True
1 y 10 True
2 z 0 False
3 0 0 0
Handling Missing Values
new_df = df.ffill()  # forward fill; the older spelling df.fillna(method='ffill') is deprecated
foo bar baz
0 x 6 True
1 y 10 True
2 z NaN False
3 NaN NaN NaN
foo bar baz
0 x 6 True
1 y 10 True
2 z 10 False
3 z 10 False
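All four missing-value strategies from these slides, side by side on the same frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"foo": ["x", "y", "z", np.nan],
                   "bar": [6, 10, np.nan, np.nan],
                   "baz": [True, True, False, np.nan]})

print(len(df.dropna()))            # 2: rows containing any NaN are dropped
print(len(df.dropna(how="all")))   # 3: only the all-NaN row is dropped
print(df.fillna(0).loc[2, "bar"])  # 0.0: NaN replaced by a constant
print(df.ffill().loc[3, "foo"])    # 'z': forward fill copies the previous value
```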
Indexing
ix = df.index
foo bar baz
0 a 6 True
1 b 10 True
2 c -2 False
3 d 1 True
0
1
2
3
df.index returns a pandas Index object containing the row labels.
Indexing
df = df.set_index('foo')
bar baz
foo
a 6 True
b 10 True
c -2 False
d 1 True
foo bar baz
0 a 6 True
1 b 10 True
2 c -2 False
3 d 1 True
Indexing
df.loc['a']
bar baz
foo
a 6 True
b 10 True
c -2 False
d 1 True
bar 6
baz True
df.iloc[0]
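A runnable sketch of the indexing flow above: set_index() promotes a column to row labels, after which .loc selects by label and .iloc by position:

```python
import pandas as pd

df = pd.DataFrame({"foo": ["a", "b", "c", "d"],
                   "bar": [6, 10, -2, 1],
                   "baz": [True, True, False, True]})

df = df.set_index("foo")  # the 'foo' values become the row labels

print(df.loc["a"])  # select by label
print(df.iloc[0])   # select by integer position -- the same row here
```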
Statistics
df.describe()
df.cov()
df.corr()
df.rank()
df.cumsum()
describe()
Calories Duration
0 420 6
1 300 10
2 150 5
cumsum()
Returns the cumulative sum
Calories Duration
0 420 6
1 300 10
2 150 5
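The statistics methods listed above, applied to the workout DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"Calories": [420, 300, 150], "Duration": [6, 10, 5]})

print(df.describe())  # count, mean, std, min, quartiles, max
print(df.corr())      # pairwise correlation between columns
print(df.rank())      # rank of each value within its column
print(df.cumsum())    # running total down each column
```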
Difference between Spark and Pandas
DataFrame
Spark DataFrame | Pandas DataFrame
Distributed across multiple nodes. | Lives on a single node.
Immutable. | Mutable.
Complex operations are harder to perform than in a Pandas DataFrame. | Complex operations are easier to perform than in a Spark DataFrame.
sparkDataFrame.count() returns the number of rows. | pandasDataFrame.count() returns the number of non-NA/null observations for each column.
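The count() difference in the table above is easy to trip over; on the pandas side it looks like this:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, np.nan], "b": [4, 5, 6]})

print(df.count())  # per column, NaN excluded: a -> 2, b -> 3
print(len(df))     # 3 -- the pandas way to get what Spark's count() returns
```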
Below are a few considerations for choosing PySpark over Pandas:
• Our data is huge and grows significantly over the years, and we want to improve processing time.
• We want fault tolerance.
• Language choice (Spark supports Python, Scala, Java & R).
• We want machine-learning capability.
• We would like to read Parquet, Avro, Hive, Cassandra, Snowflake, etc.
• We want to stream the data and process it in real time.
How to Decide Between Pandas and PySpark?
Thank you!
