2. OVERVIEW
➢ Series
➢ DataFrame
➢ Pandas for Time Series
➢ Merging, Joining, Concatenating
➢ Importing data
➢ A simple example
> the python commands will be written here
# this is a comment
3. SET IT UP!
➢ Open a Terminal
➢ Start ipython notebook
➢ Open ipython notebook web-page (localhost:8888)
➢ Open ‘tutorial_pandas.ipynb’
$ ipython notebook
4. PANDAS LIBRARY
The Pandas library provides useful functions to:
➢ Represent and manage data structures
➢ Ease the data processing
➢ Manage (Time) Series with built-in functions
It builds on numpy and can use scipy and matplotlib functions
Manual: PDF, online
> import pandas as pd
# to import the pandas library
> pd.__version__
# get the version of the library (0.16)
5. SERIES: DATA STRUCTURE
➢ Unidimensional data structure
➢ Indexing
· automatic
· manual
· ! labels are not necessarily unique !
> data = [1,2,3,4,5]
> s = pd.Series(data)
> s
> s.index
> s = pd.Series(data, index = ['a','b','c','d','d'])
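> s['d']
# the label 'd' is repeated: both entries are returned
> s.iloc[[4]]
# selection by position; a plain s[[4]] is ambiguous between label and position
# try with: s = pd.Series(data, index = [1,2,3,4,4]) and compare s[[4]] with s.iloc[[4]]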
> s.index = [1,2,3,4,5]
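The indexing behaviour above can be sketched as follows. This is a minimal example, assuming the same toy data as the slide; the names `s_auto`, `by_label` and `by_position` are introduced here only for illustration.

```python
import pandas as pd

data = [1, 2, 3, 4, 5]

# Automatic index: 0, 1, 2, ...
s_auto = pd.Series(data)

# Manual index in which the label 'd' appears twice
s = pd.Series(data, index=['a', 'b', 'c', 'd', 'd'])

by_label = s['d']        # label lookup: a Series holding both 'd' entries
by_position = s.iloc[4]  # position lookup: a single value

print(s.index.is_unique)   # False
print(by_label.tolist())   # [4, 5]
print(by_position)         # 5
```

Checking `s.index.is_unique` before selecting by label is one way to avoid being surprised by a Series where a scalar was expected.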
6. SERIES: BASIC OPERATIONS
➢ Mathematically, Series are vectors
➢ Compatible with numpy functions
➢ Some basic functions available as pandas methods
➢ Plotting (based on matplotlib)
> import numpy as np
# import numpy to get some mathematical functions
> random_data = np.random.uniform(size=10)
> s = pd.Series(random_data)
> s+1
# try other mathematical functions: **2, *2, exp(s), …
> s.apply(np.log)
> s.mean()
# try other built-in functions. Use 'tab' to discover …
> s.plot()
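A short sketch of the vector-like behaviour described above. The seeded generator and the variable names are assumptions made here for repeatability; plotting is left out so the example runs without matplotlib.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)        # seeded generator (assumption, for repeatability)
s = pd.Series(rng.uniform(size=10))

shifted = s + 1                       # element-wise arithmetic
squared = s ** 2
logged = np.log(s)                    # numpy functions accept a Series and return a Series
logged_apply = s.apply(np.log)        # same values, computed through apply

print(type(logged).__name__)
print(s.mean(), s.std())              # some built-in methods
```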
7. DATAFRAME: DATA STRUCTURE
➢ Bidimensional data structure
➢ A dictionary of Series, with shared index
→ each column is a Series
➢ Both columns and rows are indexed (labels are not necessarily unique)
> s1 = pd.Series([1,2,3,4,5], index = list('abcde'))
> data = {'one':s1**s1, 'two':s1+1}
> df = pd.DataFrame(data)
> df.columns
> df.index
# the index and columns arguments of pd.DataFrame assign labels (if the data has none) or select existing ones
> s2 = pd.Series([1,2,3,4,10], index = list('edcbh'))
> df['three'] = s2
# try changing the index of s2 and check how the values are aligned
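The alignment that happens when a Series is assigned as a new column can be sketched as below, reusing the data from the slide. Values are matched by label, so a label missing from `s2` yields NaN and a label missing from `df` is dropped.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4, 5], index=list('abcde'))
df = pd.DataFrame({'one': s1 ** s1, 'two': s1 + 1})

s2 = pd.Series([1, 2, 3, 4, 10], index=list('edcbh'))
df['three'] = s2   # matched by label, not by position

print(df)
# row 'a' has no match in s2, so 'three' is NaN there;
# label 'h' does not exist in df, so its value (10) is not used
```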
8. DATAFRAME: ACCESSING VALUES - 1
➢ keep calm
➢ select columns and rows to obtain Series
➢ query function to select rows
> data = np.random.randn(5,2)
> df = pd.DataFrame(data, index = list('abcde'),
                    columns = ['one','two'])
> col = df.one
> row = df.xs('b')
# type(col) and type(row) are both Series: you already know how to manage them
> df.query('one > 0')
> df.index = [1,2,3,4,5]
> df.query('1 < index < 4')
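A minimal sketch of the selections above, using fixed values instead of random data so the result is predictable. The names `positive`, `df_num` and `middle` are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# Fixed values (assumption): rows a..e hold [-4,-3], [-2,-1], [0,1], [2,3], [4,5]
df = pd.DataFrame(np.arange(10).reshape(5, 2) - 4,
                  index=list('abcde'), columns=['one', 'two'])

col = df['one']      # a column is a Series (same as df.one)
row = df.xs('b')     # a row is a Series too

positive = df.query('one > 0')         # rows where column 'one' is positive

df_num = df.copy()
df_num.index = [1, 2, 3, 4, 5]
middle = df_num.query('1 < index < 4') # 'index' refers to the row labels

print(positive)
print(middle)
```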
9. DATAFRAME: ACCESSING VALUES - 2
➢ … madness continues
➢ loc access by label:
works on rows AND on columns
➢ iloc access by position
➢ you can extract Series
➢ ! define a strategy, and be careful with indexes !
> data = np.random.randn(5,2)
> df = pd.DataFrame(data, index = list('abcde'),
                    columns = ['one','two'])
> df.loc['a']
# try df.loc[['a', 'b'], 'one'], types?
> df.iloc[1,1]
# try df.iloc[1:,1], types?
> df.loc['b':, 'one']
# label slices work as well, and include both endpoints
# (older pandas offered df.ix, which mixed labels and positions; it was removed in pandas 1.0)
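The difference between label-based and position-based access can be sketched as follows. This example assumes `loc` and `iloc` and fixed values so that the results can be checked by hand; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Fixed values (assumption): rows a..e hold [0,1], [2,3], [4,5], [6,7], [8,9]
df = pd.DataFrame(np.arange(10).reshape(5, 2),
                  index=list('abcde'), columns=['one', 'two'])

row_a = df.loc['a']                 # one row, by label: a Series
sub = df.loc[['a', 'b'], 'one']     # two rows of one column: a Series
cell = df.iloc[1, 1]                # second row, second column: a scalar
tail_pos = df.iloc[1:, 0]           # rows b..e of the first column, by position
tail_lab = df.loc['b':, 'one']      # the same selection, by label

# Label slices include the end label, position slices exclude the end position
by_label = df.loc['b':'c']          # 2 rows
by_position = df.iloc[1:2]          # 1 row

print(cell)                         # 3
print(tail_lab.tolist())            # [2, 4, 6, 8]
```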
10. DATAFRAME: BASIC OPERATIONS
➢ DataFrames can be treated as matrices
➢ Compatible with numpy functions
➢ Some basic functions available as pandas methods
· axis = 0: computed along the rows, one result per column
· axis = 1: computed along the columns, one result per row
➢ df.apply() method
➢ Plotting (based on matplotlib)
> df_copy = df
# this is a reference to the same object, not a copy! Use df_copy = df.copy()
> df * df
> np.exp(df)
> df.mean()
# try df.mean(axis = 1)
# try type(df.mean())
> df.apply(np.mean)
> df.plot()
# try df.transpose().plot()
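A small sketch of the operations above, with fixed values so the axis convention and the reference-versus-copy behaviour can be verified by hand. The lambda passed to `apply` and the variable names are choices made here for illustration, not part of the original slide.

```python
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0, 3.0], 'two': [10.0, 20.0, 30.0]},
                  index=list('abc'))

product = df * df                 # element-wise product, not a matrix product
col_means = df.mean()             # axis = 0 (default): one value per column
row_means = df.mean(axis=1)       # axis = 1: one value per row

# apply calls the function once per column (each column arrives as a Series)
col_range = df.apply(lambda col: col.max() - col.min())

alias = df                        # another name for the same object
independent = df.copy()           # a real copy
alias.loc['a', 'one'] = 100.0     # also changes df, but not the copy

print(col_means.tolist())         # [2.0, 20.0]
print(row_means.tolist())         # [5.5, 11.0, 16.5]
print(df.loc['a', 'one'], independent.loc['a', 'one'])  # 100.0 1.0
```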
11. PANDAS FOR TIME SERIES
➢ Widely used in financial data analysis; we will use it for signals
➢ Time series: a Series whose index is a timestamp
➢ Pandas provides dedicated functions for time series
➢ Useful to select a portion of signal (windowing)
· query method: not available on Series → convert to a DataFrame
> times = np.arange(0, 60, 0.5)
> data = np.random.randn(len(times))
> ts = pd.Series(data, index = times)
> ts.plot()
> epoch = ts[(ts.index > 10) & (ts.index <= 20)]
# ts.plot()
# epoch.plot()
> ts_df = pd.DataFrame(ts)
> ts_df.query('10 < index <= 20')
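The two windowing approaches above can be compared in a short sketch. A sine wave stands in for the random signal (an assumption, so the run is repeatable), and the column name `signal` is hypothetical.

```python
import numpy as np
import pandas as pd

times = np.arange(0, 60, 0.5)               # 120 samples, one every 0.5 s
ts = pd.Series(np.sin(times), index=times)  # deterministic stand-in signal

# Windowing with a boolean mask on the index
mask = (ts.index > 10) & (ts.index <= 20)
epoch = ts[mask]

# Windowing with query, after converting the Series to a DataFrame
ts_df = ts.to_frame(name='signal')
epoch_df = ts_df.query('10 < index <= 20')

print(len(epoch))                            # 20
print(epoch.index.min(), epoch.index.max())  # 10.5 20.0
```

Both selections return the same 20 samples; the mask works directly on the Series, while `query` needs the DataFrame conversion.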
12. FEW NOTES ABOUT TIMESTAMPS
➢ Absolute timestamps VS Relative timestamps
· Absolute timestamp is important for synchronization
➢ Unix Timestamps VS date/time representation
· Unix Timestamp: reference for signal processing
· 0000000000.000000 = 1 January 1970, 00:00:00.000000 UTC
· date/time: easier to understand
· unix timestamp: easier to select/manage
➢ Pandas functions to manage Timestamps
> import datetime
> import time
> now_dt = datetime.datetime.now()
# now_dt = time.ctime()
> now_ut = time.time()
# find out how to convert datetime <--> timestamp
> ts.index = ts.index + now_ut
> ts.index = pd.to_datetime(ts.index, unit = 's')
# ts[(ts.index > -write date time here-)]
> ts.plot()
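One possible way to convert between datetime objects and Unix timestamps, and to select data with a date/time index, is sketched below. The fixed start time `start_ut` replaces `time.time()` and is a hypothetical value chosen so the example is repeatable; the sine wave again stands in for the signal.

```python
import datetime

import numpy as np
import pandas as pd

start_ut = 1_600_000_000.0   # hypothetical Unix timestamp, in seconds

# datetime <--> Unix timestamp (UTC is set explicitly to avoid local-time surprises)
start_dt = datetime.datetime.fromtimestamp(start_ut, tz=datetime.timezone.utc)
back_to_ut = start_dt.timestamp()

# Relative timestamps (seconds from the start) become absolute date/times
times = np.arange(0, 60, 0.5)
ts = pd.Series(np.sin(times), index=times)
ts.index = pd.to_datetime(ts.index + start_ut, unit='s')

# Select everything later than 30.25 s after the first sample
# (the cutoff is placed between two samples on purpose)
cutoff = ts.index[0] + pd.Timedelta(seconds=30.25)
late = ts[ts.index > cutoff]

print(start_dt.isoformat())
print(len(late))             # 59
```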