Python for Data Analysis
Wes McKinney
Beijing • Cambridge • Farnham • Köln • Sebastopol • Tokyo
Python for Data Analysis
by Wes McKinney
Copyright © 2013 Wes McKinney. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions
are also available for most titles (http://my.safaribooksonline.com). For more information, contact our
corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
Editors: Julie Steele and Meghan Blanchette
Production Editor: Melanie Yarbrough
Copyeditor: Teresa Exley
Proofreader: BIM Publishing Services
Indexer: BIM Publishing Services
Cover Designer: Karen Montgomery
Interior Designer: David Futato
Illustrator: Rebecca Demarest
October 2012: First Edition.
Revision History for the First Edition:
2012-10-05 First release
See http://oreilly.com/catalog/errata.csp?isbn=9781449319793 for release details.
Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of
O’Reilly Media, Inc. Python for Data Analysis, the cover image of a golden-tailed tree shrew, and related
trade dress are trademarks of O’Reilly Media, Inc.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as
trademarks. Where those designations appear in this book, and O’Reilly Media, Inc., was aware of a
trademark claim, the designations have been printed in caps or initial caps.
While every precaution has been taken in the preparation of this book, the publisher and author assume
no responsibility for errors or omissions, or for damages resulting from the use of the information con-
tained herein.
ISBN: 978-1-449-31979-3
[LSI]
Table of Contents
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
1. Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
What Is This Book About? 1
Why Python for Data Analysis? 2
Python as Glue 2
Solving the “Two-Language” Problem 2
Why Not Python? 3
Essential Python Libraries 3
NumPy 4
pandas 4
matplotlib 5
IPython 5
SciPy 6
Installation and Setup 6
Windows 7
Apple OS X 9
GNU/Linux 10
Python 2 and Python 3 11
Integrated Development Environments (IDEs) 11
Community and Conferences 12
Navigating This Book 12
Code Examples 13
Data for Examples 13
Import Conventions 13
Jargon 13
Acknowledgements 14
2. Introductory Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.usa.gov data from bit.ly 17
Counting Time Zones in Pure Python 19
Counting Time Zones with pandas 21
MovieLens 1M Data Set 26
Measuring rating disagreement 30
US Baby Names 1880-2010 32
Analyzing Naming Trends 36
Conclusions and The Path Ahead 43
3. IPython: An Interactive Computing and Development Environment . . . . . . . . . . . . 45
IPython Basics 46
Tab Completion 47
Introspection 48
The %run Command 49
Executing Code from the Clipboard 50
Keyboard Shortcuts 52
Exceptions and Tracebacks 53
Magic Commands 54
Qt-based Rich GUI Console 55
Matplotlib Integration and Pylab Mode 56
Using the Command History 58
Searching and Reusing the Command History 58
Input and Output Variables 58
Logging the Input and Output 59
Interacting with the Operating System 60
Shell Commands and Aliases 60
Directory Bookmark System 62
Software Development Tools 62
Interactive Debugger 62
Timing Code: %time and %timeit 67
Basic Profiling: %prun and %run -p 68
Profiling a Function Line-by-Line 70
IPython HTML Notebook 72
Tips for Productive Code Development Using IPython 72
Reloading Module Dependencies 74
Code Design Tips 74
Advanced IPython Features 76
Making Your Own Classes IPython-friendly 76
Profiles and Configuration 77
Credits 78
4. NumPy Basics: Arrays and Vectorized Computation . . . . . . . . . . . . . . . . . . . . . . . . . . 79
The NumPy ndarray: A Multidimensional Array Object 80
Creating ndarrays 81
Data Types for ndarrays 83
Operations between Arrays and Scalars 85
Basic Indexing and Slicing 86
Boolean Indexing 89
Fancy Indexing 92
Transposing Arrays and Swapping Axes 93
Universal Functions: Fast Element-wise Array Functions 95
Data Processing Using Arrays 97
Expressing Conditional Logic as Array Operations 98
Mathematical and Statistical Methods 100
Methods for Boolean Arrays 101
Sorting 101
Unique and Other Set Logic 102
File Input and Output with Arrays 103
Storing Arrays on Disk in Binary Format 103
Saving and Loading Text Files 104
Linear Algebra 105
Random Number Generation 106
Example: Random Walks 108
Simulating Many Random Walks at Once 109
5. Getting Started with pandas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Introduction to pandas Data Structures 112
Series 112
DataFrame 115
Index Objects 120
Essential Functionality 122
Reindexing 122
Dropping entries from an axis 125
Indexing, selection, and filtering 125
Arithmetic and data alignment 128
Function application and mapping 132
Sorting and ranking 133
Axis indexes with duplicate values 136
Summarizing and Computing Descriptive Statistics 137
Correlation and Covariance 139
Unique Values, Value Counts, and Membership 141
Handling Missing Data 142
Filtering Out Missing Data 143
Filling in Missing Data 145
Hierarchical Indexing 147
Reordering and Sorting Levels 149
Summary Statistics by Level 150
Using a DataFrame’s Columns 150
Other pandas Topics 151
Integer Indexing 151
Panel Data 152
6. Data Loading, Storage, and File Formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
Reading and Writing Data in Text Format 155
Reading Text Files in Pieces 160
Writing Data Out to Text Format 162
Manually Working with Delimited Formats 163
JSON Data 165
XML and HTML: Web Scraping 166
Binary Data Formats 171
Using HDF5 Format 171
Reading Microsoft Excel Files 172
Interacting with HTML and Web APIs 173
Interacting with Databases 174
Storing and Loading Data in MongoDB 176
7. Data Wrangling: Clean, Transform, Merge, Reshape . . . . . . . . . . . . . . . . . . . . . . . . 177
Combining and Merging Data Sets 177
Database-style DataFrame Merges 178
Merging on Index 182
Concatenating Along an Axis 185
Combining Data with Overlap 188
Reshaping and Pivoting 189
Reshaping with Hierarchical Indexing 190
Pivoting “long” to “wide” Format 192
Data Transformation 194
Removing Duplicates 194
Transforming Data Using a Function or Mapping 195
Replacing Values 196
Renaming Axis Indexes 197
Discretization and Binning 199
Detecting and Filtering Outliers 201
Permutation and Random Sampling 202
Computing Indicator/Dummy Variables 203
String Manipulation 205
String Object Methods 206
Regular expressions 207
Vectorized string functions in pandas 210
Example: USDA Food Database 212
8. Plotting and Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
A Brief matplotlib API Primer 219
Figures and Subplots 220
Colors, Markers, and Line Styles 224
Ticks, Labels, and Legends 225
Annotations and Drawing on a Subplot 228
Saving Plots to File 231
matplotlib Configuration 231
Plotting Functions in pandas 232
Line Plots 232
Bar Plots 235
Histograms and Density Plots 238
Scatter Plots 239
Plotting Maps: Visualizing Haiti Earthquake Crisis Data 241
Python Visualization Tool Ecosystem 247
Chaco 248
mayavi 248
Other Packages 248
The Future of Visualization Tools? 249
9. Data Aggregation and Group Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
GroupBy Mechanics 252
Iterating Over Groups 255
Selecting a Column or Subset of Columns 256
Grouping with Dicts and Series 257
Grouping with Functions 258
Grouping by Index Levels 259
Data Aggregation 259
Column-wise and Multiple Function Application 262
Returning Aggregated Data in “unindexed” Form 264
Group-wise Operations and Transformations 264
Apply: General split-apply-combine 266
Quantile and Bucket Analysis 268
Example: Filling Missing Values with Group-specific Values 270
Example: Random Sampling and Permutation 271
Example: Group Weighted Average and Correlation 273
Example: Group-wise Linear Regression 274
Pivot Tables and Cross-Tabulation 275
Cross-Tabulations: Crosstab 277
Example: 2012 Federal Election Commission Database 278
Donation Statistics by Occupation and Employer 280
Bucketing Donation Amounts 283
Donation Statistics by State 285
10. Time Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
Date and Time Data Types and Tools 290
Converting between string and datetime 291
Time Series Basics 293
Indexing, Selection, Subsetting 294
Time Series with Duplicate Indices 296
Date Ranges, Frequencies, and Shifting 297
Generating Date Ranges 298
Frequencies and Date Offsets 299
Shifting (Leading and Lagging) Data 301
Time Zone Handling 303
Localization and Conversion 304
Operations with Time Zone−aware Timestamp Objects 305
Operations between Different Time Zones 306
Periods and Period Arithmetic 307
Period Frequency Conversion 308
Quarterly Period Frequencies 309
Converting Timestamps to Periods (and Back) 311
Creating a PeriodIndex from Arrays 312
Resampling and Frequency Conversion 312
Downsampling 314
Upsampling and Interpolation 316
Resampling with Periods 318
Time Series Plotting 319
Moving Window Functions 320
Exponentially-weighted functions 324
Binary Moving Window Functions 324
User-Defined Moving Window Functions 326
Performance and Memory Usage Notes 327
11. Financial and Economic Data Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
Data Munging Topics 329
Time Series and Cross-Section Alignment 330
Operations with Time Series of Different Frequencies 332
Time of Day and “as of” Data Selection 334
Splicing Together Data Sources 336
Return Indexes and Cumulative Returns 338
Group Transforms and Analysis 340
Group Factor Exposures 342
Decile and Quartile Analysis 343
More Example Applications 345
Signal Frontier Analysis 345
Future Contract Rolling 347
Rolling Correlation and Linear Regression 350
12. Advanced NumPy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
ndarray Object Internals 353
NumPy dtype Hierarchy 354
Advanced Array Manipulation 355
Reshaping Arrays 355
C versus Fortran Order 356
Concatenating and Splitting Arrays 357
Repeating Elements: Tile and Repeat 360
Fancy Indexing Equivalents: Take and Put 361
Broadcasting 362
Broadcasting Over Other Axes 364
Setting Array Values by Broadcasting 367
Advanced ufunc Usage 367
ufunc Instance Methods 368
Custom ufuncs 370
Structured and Record Arrays 370
Nested dtypes and Multidimensional Fields 371
Why Use Structured Arrays? 372
Structured Array Manipulations: numpy.lib.recfunctions 372
More About Sorting 373
Indirect Sorts: argsort and lexsort 374
Alternate Sort Algorithms 375
numpy.searchsorted: Finding elements in a Sorted Array 376
NumPy Matrix Class 377
Advanced Array Input and Output 379
Memory-mapped Files 379
HDF5 and Other Array Storage Options 380
Performance Tips 380
The Importance of Contiguous Memory 381
Other Speed Options: Cython, f2py, C 382
Appendix: Python Language Essentials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433
Preface
The scientific Python ecosystem of open source libraries has grown substantially over
the last 10 years. By late 2011, I had long felt that the lack of centralized learning
resources for data analysis and statistical applications was a stumbling block for new
Python programmers engaged in such work. Key projects for data analysis (especially
NumPy, IPython, matplotlib, and pandas) had also matured enough that a book written
about them would likely not go out-of-date very quickly. Thus, I mustered the nerve
to embark on this writing project. This is the book that I wish existed when I started
using Python for data analysis in 2007. I hope you find it useful and are able to apply
these tools productively in your work.
Conventions Used in This Book
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements
such as variable or function names, databases, data types, environment variables,
statements, and keywords.
Constant width bold
Shows commands or other text that should be typed literally by the user.
Constant width italic
Shows text that should be replaced with user-supplied values or by values deter-
mined by context.
This icon signifies a tip, suggestion, or general note.
This icon indicates a warning or caution.
Using Code Examples
This book is here to help you get your job done. In general, you may use the code in
this book in your programs and documentation. You do not need to contact us for
permission unless you’re reproducing a significant portion of the code. For example,
writing a program that uses several chunks of code from this book does not require
permission. Selling or distributing a CD-ROM of examples from O’Reilly books does
require permission. Answering a question by citing this book and quoting example
code does not require permission. Incorporating a significant amount of example code
from this book into your product’s documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title,
author, publisher, and ISBN. For example: “Python for Data Analysis by William Wes-
ley McKinney (O’Reilly). Copyright 2012 William McKinney, 978-1-449-31979-3.”
If you feel your use of code examples falls outside fair use or the permission given above,
feel free to contact us at permissions@oreilly.com.
Safari® Books Online
Safari Books Online (www.safaribooksonline.com) is an on-demand digital
library that delivers expert content in both book and video form from the
world’s leading authors in technology and business.
Technology professionals, software developers, web designers, and business and cre-
ative professionals use Safari Books Online as their primary resource for research,
problem solving, learning, and certification training.
Safari Books Online offers a range of product mixes and pricing programs for organi-
zations, government agencies, and individuals. Subscribers have access to thousands
of books, training videos, and prepublication manuscripts in one fully searchable da-
tabase from publishers like O’Reilly Media, Prentice Hall Professional, Addison-Wesley
Professional, Microsoft Press, Sams, Que, Peachpit Press, Focal Press, Cisco Press, John
Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT
Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, Course Tech-
nology, and dozens more. For more information about Safari Books Online, please visit
us online.
How to Contact Us
Please address comments and questions concerning this book to the publisher:
O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)
We have a web page for this book, where we list errata, examples, and any additional
information. You can access this page at http://oreil.ly/python_for_data_analysis.
To comment or ask technical questions about this book, send email to
bookquestions@oreilly.com.
For more information about our books, courses, conferences, and news, see our website
at http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia
CHAPTER 1
Preliminaries
What Is This Book About?
This book is concerned with the nuts and bolts of manipulating, processing, cleaning,
and crunching data in Python. It is also a practical, modern introduction to scientific
computing in Python, tailored for data-intensive applications. This is a book about the
parts of the Python language and libraries you’ll need to effectively solve a broad set of
data analysis problems. This book is not an exposition on analytical methods using
Python as the implementation language.
When I say “data”, what am I referring to exactly? The primary focus is on structured
data, a deliberately vague term that encompasses many different common forms of
data, such as
• Multidimensional arrays (matrices)
• Tabular or spreadsheet-like data in which each column may be a different type
(string, numeric, date, or otherwise). This includes most kinds of data commonly
stored in relational databases or tab- or comma-delimited text files
• Multiple tables of data interrelated by key columns (what would be primary or
foreign keys for a SQL user)
• Evenly or unevenly spaced time series
This is by no means a complete list. Even though it may not always be obvious, a large
percentage of data sets can be transformed into a structured form that is more suitable
for analysis and modeling. If not, it may be possible to extract features from a data set
into a structured form. As an example, a collection of news articles could be processed
into a word frequency table which could then be used to perform sentiment analysis.
Most users of spreadsheet programs like Microsoft Excel, perhaps the most widely used
data analysis tool in the world, will not be strangers to these kinds of data.
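To make the word frequency example just mentioned concrete, here is a minimal sketch (not
from the book) using only the standard library; the two sample "articles" are made up for
illustration:

from collections import Counter

# Hypothetical articles; in practice these would come from files or a feed
articles = [
    "stocks rally as markets rebound",
    "markets slide on weak earnings as stocks fall",
]

# Split each article into lowercase words and count occurrences across the collection
word_counts = Counter(word for text in articles for word in text.lower().split())

print(word_counts.most_common(3))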
Why Python for Data Analysis?
For many people (myself among them), the Python language is easy to fall in love with.
Since its first appearance in 1991, Python has become one of the most popular dynamic
programming languages, along with Perl, Ruby, and others. Python and Ruby have
become especially popular in recent years for building websites using their numerous
web frameworks, like Rails (Ruby) and Django (Python). Such languages are often
called scripting languages as they can be used to write quick-and-dirty small programs,
or scripts. I don’t like the term “scripting language” as it carries a connotation that they
cannot be used for building mission-critical software. Among interpreted languages
Python is distinguished by its large and active scientific computing community. Adop-
tion of Python for scientific computing in both industry applications and academic
research has increased significantly since the early 2000s.
For data analysis and interactive, exploratory computing and data visualization, Python
will inevitably draw comparisons with the many other domain-specific open source
and commercial programming languages and tools in wide use, such as R, MATLAB,
SAS, Stata, and others. In recent years, Python’s improved library support (primarily
pandas) has made it a strong alternative for data manipulation tasks. Combined with
Python’s strength in general purpose programming, it is an excellent choice as a single
language for building data-centric applications.
Python as Glue
Part of Python’s success as a scientific computing platform is the ease of integrating C,
C++, and FORTRAN code. Most modern computing environments share a similar set
of legacy FORTRAN and C libraries for doing linear algebra, optimization, integration,
fast Fourier transforms, and other such algorithms. The same story has held true for
many companies and national labs that have used Python to glue together 30 years’
worth of legacy software.
Most programs consist of small portions of code where most of the time is spent, with
large amounts of “glue code” that doesn’t run often. In many cases, the execution time
of the glue code is insignificant; effort is most fruitfully invested in optimizing the
computational bottlenecks, sometimes by moving the code to a lower-level language
like C.
In the last few years, the Cython project (http://cython.org) has become one of the
preferred ways of both creating fast compiled extensions for Python and also interfacing
with C and C++ code.
Solving the “Two-Language” Problem
In many organizations, it is common to research, prototype, and test new ideas using
a more domain-specific computing language like MATLAB or R, and then later port those
ideas to be part of a larger production system written in, say, Java, C#, or C++. What
people are increasingly finding is that Python is a suitable language not only for doing
research and prototyping but also building the production systems, too. I believe that
more and more companies will go down this path as there are often significant organ-
izational benefits to having both scientists and technologists using the same set of pro-
grammatic tools.
Why Not Python?
While Python is an excellent environment for building computationally-intensive sci-
entific applications and building most kinds of general purpose systems, there are a
number of uses for which Python may be less suitable.
As Python is an interpreted programming language, in general most Python code will
run substantially slower than code written in a compiled language like Java or C++. As
programmer time is typically more valuable than CPU time, many are happy to make
this tradeoff. However, in an application with very low latency requirements (for ex-
ample, a high frequency trading system), the time spent programming in a lower-level,
lower-productivity language like C++ to achieve the maximum possible performance
might be time well spent.
Python is not an ideal language for highly concurrent, multithreaded applications, par-
ticularly applications with many CPU-bound threads. The reason for this is that it has
what is known as the global interpreter lock (GIL), a mechanism which prevents the
interpreter from executing more than one Python bytecode instruction at a time. The
technical reasons for why the GIL exists are beyond the scope of this book, but as of
this writing it does not seem likely that the GIL will disappear anytime soon. While it
is true that in many big data processing applications, a cluster of computers may be
required to process a data set in a reasonable amount of time, there are still situations
where a single-process, multithreaded system is desirable.
This is not to say that Python cannot execute truly multithreaded, parallel code; that
code just cannot be executed in a single Python process. As an example, the Cython
project features easy integration with OpenMP, a C framework for parallel computing,
in order to parallelize loops and thus significantly speed up numerical algorithms.
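As a concrete illustration of the multiple-process route, here is a minimal sketch (not from
the book) using the standard library's multiprocessing module; the function and data are made
up for illustration:

from multiprocessing import Pool

def slow_square(x):
    # Stand-in for a CPU-bound computation
    return x * x

if __name__ == '__main__':
    # Each worker is a separate Python process with its own GIL,
    # so the work can genuinely run in parallel on multiple cores
    pool = Pool(processes=4)
    print(pool.map(slow_square, range(10)))
    pool.close()
    pool.join()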
Essential Python Libraries
For those who are less familiar with the scientific Python ecosystem and the libraries
used throughout the book, I present the following overview of each library.
NumPy
NumPy, short for Numerical Python, is the foundational package for scientific com-
puting in Python. The majority of this book will be based on NumPy and libraries built
on top of NumPy. It provides, among other things:
• A fast and efficient multidimensional array object ndarray
• Functions for performing element-wise computations with arrays or mathematical
operations between arrays
• Tools for reading and writing array-based data sets to disk
• Linear algebra operations, Fourier transform, and random number generation
• Tools for connecting C, C++, and Fortran code to Python
Beyond the fast array-processing capabilities that NumPy adds to Python, one of its
primary purposes with regards to data analysis is as the primary container for data to
be passed between algorithms. For numerical data, NumPy arrays are a much more
efficient way of storing and manipulating data than the other built-in Python data
structures. Also, libraries written in a lower-level language, such as C or Fortran, can
operate on the data stored in a NumPy array without copying any data.
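A minimal sketch (not from the book) of the kind of whole-array, loop-free computation NumPy
makes convenient:

import numpy as np

# A 2 x 3 array of floating point values
data = np.array([[1.5, -0.1, 3.0],
                 [0.0, -3.0, 6.5]])

print(data * 10)     # element-wise multiplication, no Python loop needed
print(data + data)   # element-wise addition
print(data.shape)    # (2, 3)
print(data.dtype)    # float64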
pandas
pandas provides rich data structures and functions designed to make working with
structured data fast, easy, and expressive. It is, as you will see, one of the critical in-
gredients enabling Python to be a powerful and productive data analysis environment.
The primary object in pandas that will be used in this book is the DataFrame, a two-
dimensional tabular, column-oriented data structure with both row and column labels:
>>> frame
total_bill tip sex smoker day time size
1 16.99 1.01 Female No Sun Dinner 2
2 10.34 1.66 Male No Sun Dinner 3
3 21.01 3.5 Male No Sun Dinner 3
4 23.68 3.31 Male No Sun Dinner 2
5 24.59 3.61 Female No Sun Dinner 4
6 25.29 4.71 Male No Sun Dinner 4
7 8.77 2 Male No Sun Dinner 2
8 26.88 3.12 Male No Sun Dinner 4
9 15.04 1.96 Male No Sun Dinner 2
10 14.78 3.23 Male No Sun Dinner 2
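A DataFrame like the one above can be built directly from ordinary Python data structures;
here is a minimal sketch (not from the book, with made-up values):

from pandas import DataFrame

# Construct a DataFrame from a dict of equal-length lists (one list per column)
frame = DataFrame({'total_bill': [16.99, 10.34, 21.01],
                   'tip': [1.01, 1.66, 3.50],
                   'sex': ['Female', 'Male', 'Male']})

print(frame)
print(frame['tip'].mean())   # selecting a column returns a Series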
pandas combines the high performance array-computing features of NumPy with the
flexible data manipulation capabilities of spreadsheets and relational databases (such
as SQL). It provides sophisticated indexing functionality to make it easy to reshape,
slice and dice, perform aggregations, and select subsets of data. pandas is the primary
tool that we will use in this book.
For financial users, pandas features rich, high-performance time series functionality
and tools well-suited for working with financial data. In fact, I initially designed pandas
as an ideal tool for financial data analysis applications.
For users of the R language for statistical computing, the DataFrame name will be
familiar, as the object was named after the similar R data.frame object. They are not
the same, however; the functionality provided by data.frame in R is essentially a strict
subset of that provided by the pandas DataFrame. While this is a book about Python, I
will occasionally draw comparisons with R as it is one of the most widely-used open
source data analysis environments and will be familiar to many readers.
The pandas name itself is derived from panel data, an econometrics term for multidi-
mensional structured data sets, and Python data analysis itself.
matplotlib
matplotlib is the most popular Python library for producing plots and other 2D data
visualizations. It was originally created by John D. Hunter (JDH) and is now maintained
by a large team of developers. It is well-suited for creating plots suitable for publication.
It integrates well with IPython (see below), thus providing a comfortable interactive
environment for plotting and exploring data. The plots are also interactive; you can
zoom in on a section of the plot and pan around the plot using the toolbar in the plot
window.
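To give a taste of the API, here is a minimal sketch (not from the book) that draws a simple
line plot of a random walk:

import numpy as np
import matplotlib.pyplot as plt

# The cumulative sum of random draws makes a simple series to plot
walk = np.random.randn(200).cumsum()

plt.plot(walk)
plt.title('A simple matplotlib line plot')
plt.xlabel('Step')
plt.ylabel('Value')
plt.show()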
IPython
IPython is the component in the standard scientific Python toolset that ties everything
together. It provides a robust and productive environment for interactive and explor-
atory computing. It is an enhanced Python shell designed to accelerate the writing,
testing, and debugging of Python code. It is particularly useful for interactively working
with data and visualizing data with matplotlib. IPython is usually involved with the
majority of my Python work, including running, debugging, and testing code.
Aside from the standard terminal-based IPython shell, the project also provides
• A Mathematica-like HTML notebook for connecting to IPython through a web
browser (more on this later).
• A Qt framework-based GUI console with inline plotting, multiline editing, and
syntax highlighting
• An infrastructure for interactive parallel and distributed computing
I will devote a chapter to IPython and how to get the most out of its features. I strongly
recommend using it while working through this book.
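A short, hypothetical IPython session illustrating the interactive style (introspection with ?
and the %timeit magic are both covered in detail in the IPython chapter):

In [1]: import numpy as np

In [2]: data = np.random.randn(1000)

In [3]: data.mean?           # bring up documentation for the mean method

In [4]: %timeit data.sum()   # quick micro-benchmark of an expression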
SciPy
SciPy is a collection of packages addressing a number of different standard problem
domains in scientific computing. Here is a sampling of the packages included:
• scipy.integrate: numerical integration routines and differential equation solvers
• scipy.linalg: linear algebra routines and matrix decompositions extending be-
yond those provided in numpy.linalg.
• scipy.optimize: function optimizers (minimizers) and root finding algorithms
• scipy.signal: signal processing tools
• scipy.sparse: sparse matrices and sparse linear system solvers
• scipy.special: wrapper around SPECFUN, a Fortran library implementing many
common mathematical functions, such as the gamma function
• scipy.stats: standard continuous and discrete probability distributions (density
functions, samplers, continuous distribution functions), various statistical tests,
and more descriptive statistics
• scipy.weave: tool for using inline C++ code to accelerate array computations
Together NumPy and SciPy form a reasonably complete computational replacement
for much of MATLAB along with some of its add-on toolboxes.
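A minimal sketch (not from the book) touching two of these packages, numerical integration and
a standard probability distribution:

import numpy as np
from scipy import integrate, stats

# Integrate sin(x) from 0 to pi; the exact answer is 2
value, error_estimate = integrate.quad(np.sin, 0, np.pi)
print(value)

# Evaluate the standard normal density at 0 and draw a few random samples
print(stats.norm.pdf(0.0))
print(stats.norm.rvs(size=3))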
Installation and Setup
Since everyone uses Python for different applications, there is no single solution for
setting up Python and required add-on packages. Many readers will not have a complete
scientific Python environment suitable for following along with this book, so here I will
give detailed instructions to get set up on each operating system. I recommend using
one of the following base Python distributions:
• Enthought Python Distribution: a scientific-oriented Python distribution from En-
thought (http://www.enthought.com). This includes EPDFree, a free base scientific
distribution (with NumPy, SciPy, matplotlib, Chaco, and IPython) and EPD Full,
a comprehensive suite of more than 100 scientific packages across many domains.
EPD Full is free for academic use but has an annual subscription for non-academic
users.
• Python(x,y) (http://pythonxy.googlecode.com): A free scientific-oriented Python
distribution for Windows.
I will be using EPDFree for the installation guides, though you are welcome to take
another approach depending on your needs. At the time of this writing, EPD includes
Python 2.7, though this might change at some point in the future. After installing, you
will have the following packages installed and importable:
• Scientific Python base: NumPy, SciPy, matplotlib, and IPython. These are all in-
cluded in EPDFree.
• IPython Notebook dependencies: tornado and pyzmq. These are included in EPD-
Free.
• pandas (version 0.8.2 or higher).
At some point while reading you may wish to install one or more of the following
packages: statsmodels, PyTables, PyQt (or equivalently, PySide), xlrd, lxml, basemap,
pymongo, and requests. These are used in various examples. Installing these optional
libraries is not necessary, and I would suggest waiting until you need them. For
example, installing PyQt or PyTables from source on OS X or Linux can be rather
arduous. For now, it’s most important to get up and running with the bare minimum:
EPDFree and pandas.
For information on each Python package and links to binary installers or other help,
see the Python Package Index (PyPI, http://pypi.python.org). This is also an excellent
resource for finding new Python packages.
To avoid confusion and to keep things simple, I am avoiding discussion
of more complex environment management tools like pip and virtua-
lenv. There are many excellent guides available for these tools on the
Internet.
Some users may be interested in alternate Python implementations, such
as IronPython, Jython, or PyPy. To make use of the tools presented in
this book, it is (currently) necessary to use the standard C-based Python
interpreter, known as CPython.
Windows
To get started on Windows, download the EPDFree installer from http://www.enthought.com,
which should be an MSI installer named like epd_free-7.3-1-win-x86.msi. Run the
installer and accept the default installation location C:\Python27. If
you had previously installed Python in this location, you may want to delete it manually
first (or using Add/Remove Programs).
Next, you need to verify that Python has been successfully added to the system path
and that there are no conflicts with any prior-installed Python versions. First, open a
command prompt by going to the Start Menu and starting the Command Prompt ap-
plication, also known as cmd.exe. Try starting the Python interpreter by typing
python. You should see a message that matches the version of EPDFree you installed:
C:\Users\Wes>python
Python 2.7.3 |EPD_free 7.3-1 (32-bit)| (default, Apr 12 2012, 14:30:37) on win32
Type "credits", "demo" or "enthought" for more information.
>>>
If you see a message for a different version of EPD or it doesn’t work at all, you will
need to clean up your Windows environment variables. On Windows 7 you can start
typing “environment variables” in the programs search field and select Edit environ
ment variables for your account. On Windows XP, you will have to go to Control
Panel > System > Advanced > Environment Variables. On the window that pops up,
you are looking for the Path variable. It needs to contain the following two directory
paths, separated by semicolons:
C:\Python27;C:\Python27\Scripts
If you installed other versions of Python, be sure to delete any other Python-related
directories from both the system and user Path variables. After making a path altera-
tion, you have to restart the command prompt for the changes to take effect.
Once you can launch Python successfully from the command prompt, you need to
install pandas. The easiest way is to download the appropriate binary installer from
http://pypi.python.org/pypi/pandas. For EPDFree, this should be pandas-0.9.0.win32-
py2.7.exe. After you run this, let’s launch IPython and check that things are installed
correctly by importing pandas and making a simple matplotlib plot:
C:\Users\Wes>ipython --pylab
Python 2.7.3 |EPD_free 7.3-1 (32-bit)|
Type "copyright", "credits" or "license" for more information.
IPython 0.12.1 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
Welcome to pylab, a matplotlib-based Python environment [backend: WXAgg].
For more information, type 'help(pylab)'.
In [1]: import pandas
In [2]: plot(arange(10))
If successful, there should be no error messages and a plot window will appear. You
can also check that the IPython HTML notebook can be successfully run by typing:
$ ipython notebook --pylab=inline
If you use the IPython notebook application on Windows and normally
use Internet Explorer, you will likely need to install and run Mozilla
Firefox or Google Chrome instead.
EPDFree on Windows contains only 32-bit executables. If you want or need a 64-bit
setup on Windows, using EPD Full is the most painless way to accomplish that. If you
would rather install from scratch and not pay for an EPD subscription, Christoph
Gohlke at the University of California, Irvine, publishes unofficial binary installers for
all of the book’s necessary packages (http://www.lfd.uci.edu/~gohlke/pythonlibs/) for 32-
and 64-bit Windows.
Apple OS X
To get started on OS X, you must first install Xcode, which includes Apple’s suite of
software development tools. The necessary component for our purposes is the gcc C
and C++ compiler suite. The Xcode installer can be found on the OS X install DVD
that came with your computer or downloaded from Apple directly.
Once you’ve installed Xcode, launch the terminal (Terminal.app) by navigating to
Applications > Utilities. Type gcc and press enter. You should hopefully see some-
thing like:
$ gcc
i686-apple-darwin10-gcc-4.2.1: no input files
Now you need to install EPDFree. Download the installer which should be a disk image
named something like epd_free-7.3-1-macosx-i386.dmg. Double-click the .dmg file to
mount it, then double-click the .mpkg file inside to run the installer.
When the installer runs, it automatically appends the EPDFree executable path to
your .bash_profile file. This is located at /Users/your_uname/.bash_profile:
# Setting PATH for EPD_free-7.3-1
PATH="/Library/Frameworks/Python.framework/Versions/Current/bin:${PATH}"
export PATH
Should you encounter any problems in the following steps, you’ll want to inspect
your .bash_profile and potentially add the above directory to your path.
Now, it’s time to install pandas. Execute this command in the terminal:
$ sudo easy_install pandas
Searching for pandas
Reading http://pypi.python.org/simple/pandas/
Reading http://pandas.pydata.org
Reading http://pandas.sourceforge.net
Best match: pandas 0.9.0
Downloading http://pypi.python.org/packages/source/p/pandas/pandas-0.9.0.zip
Processing pandas-0.9.0.zip
Writing /tmp/easy_install-H5mIX6/pandas-0.9.0/setup.cfg
Running pandas-0.9.0/setup.py -q bdist_egg --dist-dir /tmp/easy_install-H5mIX6/
pandas-0.9.0/egg-dist-tmp-RhLG0z
Adding pandas 0.9.0 to easy-install.pth file
Installed /Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/
site-packages/pandas-0.9.0-py2.7-macosx-10.5-i386.egg
Processing dependencies for pandas
Finished processing dependencies for pandas
To verify everything is working, launch IPython in Pylab mode and test importing pan-
das then making a plot interactively:
$ ipython --pylab
Python 2.7.3 |EPD_free 7.3-1 (32-bit)| (default, Apr 12 2012, 11:28:34)
Type "copyright", "credits" or "license" for more information.
IPython 0.12.1 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
Welcome to pylab, a matplotlib-based Python environment [backend: WXAgg].
For more information, type 'help(pylab)'.
In [1]: import pandas
In [2]: plot(arange(10))
If this succeeds, a plot window with a straight line should pop up.
GNU/Linux
Some, but not all, Linux distributions include sufficiently up-to-date
versions of all the required Python packages and can be installed using
the built-in package management tool like apt. I detail setup using EPD-
Free as it's easily reproducible across distributions.
Linux details will vary a bit depending on your Linux flavor, but here I give details for
Debian-based GNU/Linux systems like Ubuntu and Mint. Setup is similar to OS X with
the exception of how EPDFree is installed. The installer is a shell script that must be
executed in the terminal. Depending on whether you have a 32-bit or 64-bit system,
you will either need to install the x86 (32-bit) or x86_64 (64-bit) installer. You will then
have a file named something similar to epd_free-7.3-1-rh5-x86_64.sh. To install it,
execute this script with bash:
$ bash epd_free-7.3-1-rh5-x86_64.sh
After accepting the license, you will be presented with a choice of where to put the
EPDFree files. I recommend installing the files in your home directory, say /home/wesm/
epd (substituting your own username for wesm).
Once the installer has finished, you need to add EPDFree’s bin directory to your
$PATH variable. If you are using the bash shell (the default in Ubuntu, for example), this
means adding the following path addition in your .bashrc:
export PATH=/home/wesm/epd/bin:$PATH
Obviously, substitute the installation directory you used for /home/wesm/epd/. After
doing this you can either start a new terminal process or execute your .bashrc again
with source ~/.bashrc.
You need a C compiler such as gcc to move forward; many Linux distributions include
gcc, but others may not. On Debian systems, you can install gcc by executing:
sudo apt-get install gcc
If you type gcc on the command line it should say something like:
$ gcc
gcc: no input files
Now, time to install pandas:
$ easy_install pandas
If you installed EPDFree as root, you may need to add sudo to the command and enter
the sudo or root password. To verify things are working, perform the same checks as
in the OS X section.
Python 2 and Python 3
The Python community is currently undergoing a drawn-out transition from the Python
2 series of interpreters to the Python 3 series. Until the appearance of Python 3.0, all
Python code was backwards compatible. The community decided that in order to move
the language forward, certain backwards incompatible changes were necessary.
I am writing this book with Python 2.7 as its basis, as the majority of the scientific
Python community has not yet transitioned to Python 3. The good news is that, with
a few exceptions, you should have no trouble following along with the book if you
happen to be using Python 3.2.
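For example (not drawn from the book), one visible backwards-incompatible change is that the
print statement became a function; written as below, the code runs under both series:

# In Python 2, print was a statement:   print "hello, world"
# In Python 3 it is a function; the __future__ import enables the function
# form in Python 2 as well, so this file runs under either interpreter.
from __future__ import print_function

print("hello, world")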
Integrated Development Environments (IDEs)
When asked about my standard development environment, I almost always say “IPy-
thon plus a text editor”. I typically write a program and iteratively test and debug each
piece of it in IPython. It is also useful to be able to play around with data interactively
and visually verify that a particular set of data manipulations are doing the right thing.
Libraries like pandas and NumPy are designed to be easy-to-use in the shell.
However, some will still prefer to work in an IDE instead of a text editor. They do
provide many nice “code intelligence” features like completion or quickly pulling up
the documentation associated with functions and classes. Here are some that you can
explore:
• Eclipse with PyDev Plugin
• Python Tools for Visual Studio (for Windows users)
• PyCharm
• Spyder
• Komodo IDE
Community and Conferences
Outside of an Internet search, the scientific Python mailing lists are generally helpful
and responsive to questions. Some to take a look at are:
• pydata: a Google Group list for questions related to Python for data analysis and
pandas
• pystatsmodels: for statsmodels or pandas-related questions
• numpy-discussion: for NumPy-related questions
• scipy-user: for general SciPy or scientific Python questions
I deliberately did not post URLs for these in case they change. They can be easily located
via Internet search.
Each year many conferences are held all over the world for Python programmers. PyCon
and EuroPython are the two main general Python conferences in the United States and
Europe, respectively. SciPy and EuroSciPy are scientific-oriented Python conferences
where you will likely find many “birds of a feather” if you become more involved with
using Python for data analysis after reading this book.
Navigating This Book
If you have never programmed in Python before, you may actually want to start at the
end of the book, where I have placed a condensed tutorial on Python syntax, language
features, and built-in data structures like tuples, lists, and dicts. These things are con-
sidered prerequisite knowledge for the remainder of the book.
The book starts by introducing you to the IPython environment. Next, I give a short
introduction to the key features of NumPy, leaving more advanced NumPy use for
another chapter at the end of the book. Then, I introduce pandas and devote the rest
of the book to data analysis topics applying pandas, NumPy, and matplotlib (for vis-
ualization). I have structured the material in the most incremental way possible, though
there is occasionally some minor cross-over between chapters.
Data files and related material for each chapter are hosted as a git repository on GitHub:
http://github.com/pydata/pydata-book
I encourage you to download the data and use it to replicate the book’s code examples
and experiment with the tools presented in each chapter. I will happily accept contri-
butions, scripts, IPython notebooks, or any other materials you wish to contribute to
the book's repository for all to enjoy.
Code Examples
Most of the code examples in the book are shown with input and output as it would
appear executed in the IPython shell.
In [5]: code
Out[5]: output
At times, for clarity, multiple code examples will be shown side by side. These should
be read left to right and executed separately.
In [5]: code In [6]: code2
Out[5]: output Out[6]: output2
Data for Examples
Data sets for the examples in each chapter are hosted in a repository on GitHub: http:
//github.com/pydata/pydata-book. You can download this data either by using the git
revision control command-line program or by downloading a zip file of the repository
from the website.
I have made every effort to ensure that it contains everything necessary to reproduce
the examples, but I may have made some mistakes or omissions. If so, please send me
an e-mail: wesmckinn@gmail.com.
Import Conventions
The Python community has adopted a number of naming conventions for commonly-
used modules:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
This means that when you see np.arange, this is a reference to the arange function in
NumPy. This is done as it’s considered bad practice in Python software development
to import everything (from numpy import *) from a large package like NumPy.
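For instance, a short sketch (not from the book) showing the aliased style in use:

import numpy as np

# np.arange refers unambiguously to NumPy's arange; nothing is dumped into
# the global namespace the way "from numpy import *" would
arr = np.arange(5)
print(arr.mean())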
Jargon
I’ll use some terms common both to programming and data science that you may not
be familiar with. Thus, here are some brief definitions:
Munge/Munging/Wrangling
Describes the overall process of manipulating unstructured and/or messy data into
a structured or clean form. The word has snuck its way into the jargon of many
modern day data hackers. Munge rhymes with “lunge”.
Pseudocode
A description of an algorithm or process that takes a code-like form while likely
not being actual valid source code.
Syntactic sugar
Programming syntax which does not add new features, but makes something more
convenient or easier to type.
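A small example of syntactic sugar (not from the book): augmented assignment is a more
convenient spelling of an ordinary assignment combined with an operation.

count = 0
count = count + 1   # the plain form
count += 1          # syntactic sugar for the same operation
print(count)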
Acknowledgements
It would have been difficult for me to write this book without the support of a large
number of people.
On the O’Reilly staff, I’m very grateful for my editors Meghan Blanchette and Julie
Steele who guided me through the process. Mike Loukides also worked with me in the
proposal stages and helped make the book a reality.
I received a wealth of technical review from a large cast of characters. In particular,
Martin Blais and Hugh White were incredibly helpful in improving the book’s exam-
ples, clarity, and organization from cover to cover. James Long, Drew Conway, Fer-
nando Pérez, Brian Granger, Thomas Kluyver, Adam Klein, Josh Klein, Chang She, and
Stéfan van der Walt each reviewed one or more chapters, providing pointed feedback
from many different perspectives.
I got many great ideas for examples and data sets from friends and colleagues in the
data community, among them: Mike Dewar, Jeff Hammerbacher, James Johndrow,
Kristian Lum, Adam Klein, Hilary Mason, Chang She, and Ashley Williams.
I am of course indebted to the many leaders in the open source scientific Python com-
munity who’ve built the foundation for my development work and gave encouragement
while I was writing this book: the IPython core team (Fernando Pérez, Brian Granger,
Min Ragan-Kelly, Thomas Kluyver, and others), John Hunter, Skipper Seabold, Travis
Oliphant, Peter Wang, Eric Jones, Robert Kern, Josef Perktold, Francesc Alted, Chris
Fonnesbeck, and too many others to mention. Several other people provided a great
deal of support, ideas, and encouragement along the way: Drew Conway, Sean Taylor,
Giuseppe Paleologo, Jared Lander, David Epstein, John Krowas, Joshua Bloom, Den
Pilsworth, John Myles-White, and many others I’ve forgotten.
I’d also like to thank a number of people from my formative years. First, my former
AQR colleagues who’ve cheered me on in my pandas work over the years: Alex Reyf-
man, Michael Wong, Tim Sargen, Oktay Kurbanov, Matthew Tschantz, Roni Israelov,
Michael Katz, Chris Uga, Prasad Ramanan, Ted Square, and Hoon Kim. Lastly, my
academic advisors Haynes Miller (MIT) and Mike West (Duke).
On the personal side, Casey Dinkin provided invaluable day-to-day support during the
writing process, tolerating my highs and lows as I hacked together the final draft on
top of an already overcommitted schedule. Lastly, my parents, Bill and Kim, taught me
to always follow my dreams and to never settle for less.
CHAPTER 2
Introductory Examples
This book teaches you the Python tools to work productively with data. While readers
may have many different end goals for their work, the tasks required generally fall into
a number of different broad groups:
Interacting with the outside world
Reading and writing with a variety of file formats and databases.
Preparation
Cleaning, munging, combining, normalizing, reshaping, slicing and dicing, and
transforming data for analysis.
Transformation
Applying mathematical and statistical operations to groups of data sets to derive
new data sets. For example, aggregating a large table by group variables.
Modeling and computation
Connecting your data to statistical models, machine learning algorithms, or other
computational tools
Presentation
Creating interactive or static graphical visualizations or textual summaries
In this chapter I will show you a few data sets and some things we can do with them.
These examples are just intended to pique your interest and thus will only be explained
at a high level. Don’t worry if you have no experience with any of these tools; they will
be discussed in great detail throughout the rest of the book. In the code examples you’ll
see input and output prompts like In [15]:; these are from the IPython shell.
1.usa.gov data from bit.ly
In 2011, URL shortening service bit.ly partnered with the United States government
website usa.gov to provide a feed of anonymous data gathered from users who shorten
links ending with .gov or .mil. As of this writing, in addition to providing a live feed,
hourly snapshots are available as downloadable text files.1
In the case of the hourly snapshots, each line in each file contains a common form of
web data known as JSON, which stands for JavaScript Object Notation. For example,
if we read just the first line of a file you may see something like
In [15]: path = 'ch02/usagov_bitly_data2012-03-16-1331923249.txt'
In [16]: open(path).readline()
Out[16]: '{ "a": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11
(KHTML, like Gecko) Chrome/17.0.963.78 Safari/535.11", "c": "US", "nk": 1,
"tz": "America/New_York", "gr": "MA", "g": "A6qOVH", "h": "wfLQtf", "l":
"orofrog", "al": "en-US,en;q=0.8", "hh": "1.usa.gov", "r":
"http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/wfLQtf", "u":
"http://www.ncbi.nlm.nih.gov/pubmed/22415991", "t": 1331923247, "hc":
1331822918, "cy": "Danvers", "ll": [ 42.576698, -70.954903 ] }\n'
Python has numerous built-in and 3rd party modules for converting a JSON string into
a Python dictionary object. Here I’ll use the json module and its loads function invoked
on each line in the sample file I downloaded:
import json
path = 'ch02/usagov_bitly_data2012-03-16-1331923249.txt'
records = [json.loads(line) for line in open(path)]
If you’ve never programmed in Python before, the last expression here is called a list
comprehension, which is a concise way of applying an operation (like json.loads) to a
collection of strings or other objects. Conveniently, iterating over an open file handle
gives you a sequence of its lines. The resulting object records is now a list of Python
dicts:
In [18]: records[0]
Out[18]:
{u'a': u'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like
Gecko) Chrome/17.0.963.78 Safari/535.11',
u'al': u'en-US,en;q=0.8',
u'c': u'US',
u'cy': u'Danvers',
u'g': u'A6qOVH',
u'gr': u'MA',
u'h': u'wfLQtf',
u'hc': 1331822918,
u'hh': u'1.usa.gov',
u'l': u'orofrog',
u'll': [42.576698, -70.954903],
u'nk': 1,
u'r': u'http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/wfLQtf',
u't': 1331923247,
u'tz': u'America/New_York',
u'u': u'http://www.ncbi.nlm.nih.gov/pubmed/22415991'}
1. http://www.usa.gov/About/developer-resources/1usagov.shtml
Note that Python indices start at 0 and not 1 like some other languages (like R). It’s
now easy to access individual values within records by passing a string for the key you
wish to access:
In [19]: records[0]['tz']
Out[19]: u'America/New_York'
The u here in front of the quotation stands for unicode, a standard form of string en-
coding. Note that IPython shows the time zone string object representation here rather
than its print equivalent:
In [20]: print records[0]['tz']
America/New_York
Counting Time Zones in Pure Python
Suppose we were interested in the most often-occurring time zones in the data set (the
tz field). There are many ways we could do this. First, let’s extract a list of time zones
again using a list comprehension:
In [25]: time_zones = [rec['tz'] for rec in records]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/home/wesm/book_scripts/whetting/<ipython> in <module>()
----> 1 time_zones = [rec['tz'] for rec in records]
KeyError: 'tz'
Oops! Turns out that not all of the records have a time zone field. This is easy to handle
as we can add the check if 'tz' in rec at the end of the list comprehension:
In [26]: time_zones = [rec['tz'] for rec in records if 'tz' in rec]
In [27]: time_zones[:10]
Out[27]:
[u'America/New_York',
u'America/Denver',
u'America/New_York',
u'America/Sao_Paulo',
u'America/New_York',
u'America/New_York',
u'Europe/Warsaw',
u'',
u'',
u'']
Just looking at the first 10 time zones we see that some of them are unknown (empty).
You can filter these out also but I’ll leave them in for now. Now, to produce counts by
time zone I’ll show two approaches: the harder way (using just the Python standard
library) and the easier way (using pandas). One way to do the counting is to use a dict
to store counts while we iterate through the time zones:
def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts
If you know a bit more about the Python standard library, you might prefer to write
the same thing more briefly:
from collections import defaultdict
def get_counts2(sequence):
    counts = defaultdict(int)  # values will initialize to 0
    for x in sequence:
        counts[x] += 1
    return counts
I put this logic in a function just to make it more reusable. To use it on the time zones,
just pass the time_zones list:
In [31]: counts = get_counts(time_zones)
In [32]: counts['America/New_York']
Out[32]: 1251
In [33]: len(time_zones)
Out[33]: 3440
If we wanted the top 10 time zones and their counts, we have to do a little bit of dic-
tionary acrobatics:
def top_counts(count_dict, n=10):
    value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
    value_key_pairs.sort()
    return value_key_pairs[-n:]
We have then:
In [35]: top_counts(counts)
Out[35]:
[(33, u'America/Sao_Paulo'),
(35, u'Europe/Madrid'),
(36, u'Pacific/Honolulu'),
(37, u'Asia/Tokyo'),
(74, u'Europe/London'),
(191, u'America/Denver'),
(382, u'America/Los_Angeles'),
(400, u'America/Chicago'),
(521, u''),
(1251, u'America/New_York')]
If you search the Python standard library, you may find the collections.Counter class,
which makes this task a lot easier:
In [49]: from collections import Counter
In [50]: counts = Counter(time_zones)
In [51]: counts.most_common(10)
Out[51]:
[(u'America/New_York', 1251),
(u'', 521),
(u'America/Chicago', 400),
(u'America/Los_Angeles', 382),
(u'America/Denver', 191),
(u'Europe/London', 74),
(u'Asia/Tokyo', 37),
(u'Pacific/Honolulu', 36),
(u'Europe/Madrid', 35),
(u'America/Sao_Paulo', 33)]
Counting Time Zones with pandas
The main pandas data structure is the DataFrame, which you can think of as repre-
senting a table or spreadsheet of data. Creating a DataFrame from the original set of
records is simple:
In [289]: from pandas import DataFrame, Series
In [290]: import pandas as pd
In [291]: frame = DataFrame(records)
In [292]: frame
Out[292]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3560 entries, 0 to 3559
Data columns:
_heartbeat_ 120 non-null values
a 3440 non-null values
al 3094 non-null values
c 2919 non-null values
cy 2919 non-null values
g 3440 non-null values
gr 2919 non-null values
h 3440 non-null values
hc 3440 non-null values
hh 3440 non-null values
kw 93 non-null values
l 3440 non-null values
ll 2919 non-null values
nk 3440 non-null values
r 3440 non-null values
t 3440 non-null values
tz 3440 non-null values
u 3440 non-null values
dtypes: float64(4), object(14)
In [293]: frame['tz'][:10]
Out[293]:
0 America/New_York
1 America/Denver
2 America/New_York
3 America/Sao_Paulo
4 America/New_York
5 America/New_York
6 Europe/Warsaw
7
8
9
Name: tz
The output shown for the frame is the summary view, shown for large DataFrame ob-
jects. The Series object returned by frame['tz'] has a method value_counts that gives
us what we’re looking for:
In [294]: tz_counts = frame['tz'].value_counts()
In [295]: tz_counts[:10]
Out[295]:
America/New_York 1251
521
America/Chicago 400
America/Los_Angeles 382
America/Denver 191
Europe/London 74
Asia/Tokyo 37
Pacific/Honolulu 36
Europe/Madrid 35
America/Sao_Paulo 33
Then, we might want to make a plot of this data using the plotting library matplotlib. You
can do a bit of munging to fill in a substitute value for unknown and missing time zone
data in the records. The fillna function can replace missing (NA) values, while unknown
(empty string) values can be replaced by boolean array indexing:
In [296]: clean_tz = frame['tz'].fillna('Missing')
In [297]: clean_tz[clean_tz == ''] = 'Unknown'
In [298]: tz_counts = clean_tz.value_counts()
In [299]: tz_counts[:10]
Out[299]:
America/New_York 1251
Unknown 521
America/Chicago 400
America/Los_Angeles 382
America/Denver 191
Missing 120
Europe/London 74
Asia/Tokyo 37
Pacific/Honolulu 36
Europe/Madrid 35
Making a horizontal bar plot can be accomplished using the plot method on the
counts objects:
In [301]: tz_counts[:10].plot(kind='barh', rot=0)
See Figure 2-1 for the resulting figure. We’ll explore more tools for working with this
kind of data. For example, the a field contains information about the browser, device,
or application used to perform the URL shortening:
In [302]: frame['a'][1]
Out[302]: u'GoogleMaps/RochesterNY'
In [303]: frame['a'][50]
Out[303]: u'Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2'
In [304]: frame['a'][51]
Out[304]: u'Mozilla/5.0 (Linux; U; Android 2.2.2; en-us; LG-P925/V10e Build/FRG83G) AppleWebKit/533.1 (K
Figure 2-1. Top time zones in the 1.usa.gov sample data
Parsing all of the interesting information in these “agent” strings may seem like a
daunting task. Luckily, once you have mastered Python’s built-in string functions and
regular expression capabilities, it is really not so bad. For example, we could split off
the first token in the string (corresponding roughly to the browser capability) and make
another summary of the user behavior:
In [305]: results = Series([x.split()[0] for x in frame.a.dropna()])
In [306]: results[:5]
Out[306]:
0 Mozilla/5.0
1 GoogleMaps/RochesterNY
2 Mozilla/4.0
3 Mozilla/5.0
4 Mozilla/5.0
In [307]: results.value_counts()[:8]
Out[307]:
Mozilla/5.0 2594
Mozilla/4.0 601
GoogleMaps/RochesterNY 121
Opera/9.80 34
TEST_INTERNET_AGENT 24
GoogleProducer 21
Mozilla/6.0 5
BlackBerry8520/5.0.0.681 4
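As an aside, since regular expressions were mentioned above, the same first-token extraction could also be written with the re module; the snippet below is only a rough sketch (it assumes every non-null agent string is non-empty), not code used elsewhere in this chapter:

import re

# Match the first run of non-whitespace characters, mirroring x.split()[0]
token_re = re.compile(r'\S+')
results_re = Series([token_re.match(x).group() for x in frame.a.dropna()])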
Now, suppose you wanted to decompose the top time zones into Windows and non-
Windows users. As a simplification, let’s say that a user is on Windows if the string
'Windows' is in the agent string. Since some of the agents are missing, I’ll exclude these
from the data:
In [308]: cframe = frame[frame.a.notnull()]
We then want to compute a value indicating whether each row is Windows or not:
In [309]: operating_system = np.where(cframe['a'].str.contains('Windows'),
.....: 'Windows', 'Not Windows')
In [310]: operating_system[:5]
Out[310]:
0 Windows
1 Not Windows
2 Windows
3 Not Windows
4 Windows
Name: a
Then, you can group the data by its time zone column and this new list of operating
systems:
In [311]: by_tz_os = cframe.groupby(['tz', operating_system])
The group counts, analogous to the value_counts function above, can be computed
using size. This result is then reshaped into a table with unstack:
In [312]: agg_counts = by_tz_os.size().unstack().fillna(0)
In [313]: agg_counts[:10]
Out[313]:
a Not Windows Windows
tz
245 276
Africa/Cairo 0 3
Africa/Casablanca 0 1
Africa/Ceuta 0 2
Africa/Johannesburg 0 1
Africa/Lusaka 0 1
America/Anchorage 4 1
America/Argentina/Buenos_Aires 1 0
America/Argentina/Cordoba 0 1
America/Argentina/Mendoza 0 1
Finally, let’s select the top overall time zones. To do so, I construct an indirect index
array from the row counts in agg_counts:
# Use to sort in ascending order
In [314]: indexer = agg_counts.sum(1).argsort()
In [315]: indexer[:10]
Out[315]:
tz
24
Africa/Cairo 20
Africa/Casablanca 21
Africa/Ceuta 92
Africa/Johannesburg 87
Africa/Lusaka 53
America/Anchorage 54
America/Argentina/Buenos_Aires 57
America/Argentina/Cordoba 26
America/Argentina/Mendoza 55
I then use take to select the rows in that order, then slice off the last 10 rows:
In [316]: count_subset = agg_counts.take(indexer)[-10:]
In [317]: count_subset
Out[317]:
a Not Windows Windows
tz
America/Sao_Paulo 13 20
Europe/Madrid 16 19
Pacific/Honolulu 0 36
Asia/Tokyo 2 35
Europe/London 43 31
America/Denver 132 59
America/Los_Angeles 130 252
America/Chicago 115 285
245 276
America/New_York 339 912
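As an aside, more recent versions of pandas than the one used in this book offer a more direct route to the same subset; the following is a hedged sketch assuming such a version (with Series.nlargest and DataFrame.loc available), not the path taken here:

# Assumes a newer pandas: grab the ten time zones with the largest totals directly
top_zones = agg_counts.sum(1).nlargest(10).index
count_subset_alt = agg_counts.loc[top_zones]

Note that nlargest orders the rows from largest to smallest total, so the plotting order would differ from the argsort/take approach above.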
count_subset can then be plotted in a bar plot; I'll make it a stacked bar plot by passing
stacked=True (see Figure 2-2):
In [319]: count_subset.plot(kind='barh', stacked=True)
The plot doesn’t make it easy to see the relative percentage of Windows users in the
smaller groups, but the rows can easily be normalized to sum to 1 then plotted again
(see Figure 2-3):
In [321]: normed_subset = count_subset.div(count_subset.sum(1), axis=0)
In [322]: normed_subset.plot(kind='barh', stacked=True)
All of the methods employed here will be examined in great detail throughout the rest
of the book.
Figure 2-2. Top time zones by Windows and non-Windows users
Figure 2-3. Percentage Windows and non-Windows users in top-occurring time zones
MovieLens 1M Data Set
GroupLens Research (http://www.grouplens.org/node/73) provides a number of collections
of movie ratings data collected from users of MovieLens in the late 1990s and
early 2000s. The data provide movie ratings, movie metadata (genres and year), and
demographic data about the users (age, zip code, gender, and occupation). Such data
is often of interest in the development of recommendation systems based on machine
learning algorithms. While I will not be exploring machine learning techniques in great
detail in this book, I will show you how to slice and dice data sets like these into the
exact form you need.
The MovieLens 1M data set contains 1 million ratings collected from 6000 users on
4000 movies. It's spread across three tables: ratings, user information, and movie
information. After extracting the data from the zip file, each table can be loaded into a
pandas DataFrame object using pandas.read_table:
import pandas as pd

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('ml-1m/users.dat', sep='::', header=None,
                      names=unames)

rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ml-1m/ratings.dat', sep='::', header=None,
                        names=rnames)

mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('ml-1m/movies.dat', sep='::', header=None,
                       names=mnames)
You can verify that everything succeeded by looking at the first few rows of each Da-
taFrame with Python's slice syntax:
In [334]: users[:5]
Out[334]:
user_id gender age occupation zip
0 1 F 1 10 48067
1 2 M 56 16 70072
2 3 M 25 15 55117
3 4 M 45 7 02460
4 5 M 25 20 55455
In [335]: ratings[:5]
Out[335]:
user_id movie_id rating timestamp
0 1 1193 5 978300760
1 1 661 3 978302109
2 1 914 3 978301968
3 1 3408 4 978300275
4 1 2355 5 978824291
In [336]: movies[:5]
Out[336]:
movie_id title genres
0 1 Toy Story (1995) Animation|Children's|Comedy
1 2 Jumanji (1995) Adventure|Children's|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama
4 5 Father of the Bride Part II (1995) Comedy
In [337]: ratings
Out[337]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000209 entries, 0 to 1000208
Data columns:
user_id 1000209 non-null values
movie_id 1000209 non-null values
rating 1000209 non-null values
timestamp 1000209 non-null values
dtypes: int64(4)
Note that ages and occupations are coded as integers indicating groups described in
the data set’s README file. Analyzing the data spread across three tables is not a simple
task; for example, suppose you wanted to compute mean ratings for a particular movie
by sex and age. As you will see, this is much easier to do with all of the data merged
together into a single table. Using pandas's merge function, we first merge ratings with
users and then merge that result with the movies data. pandas infers which columns to
use as the merge (or join) keys based on overlapping names:
In [338]: data = pd.merge(pd.merge(ratings, users), movies)
In [339]: data
Out[339]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000209 entries, 0 to 1000208
Data columns:
user_id 1000209 non-null values
movie_id 1000209 non-null values
rating 1000209 non-null values
timestamp 1000209 non-null values
gender 1000209 non-null values
age 1000209 non-null values
occupation 1000209 non-null values
zip 1000209 non-null values
title 1000209 non-null values
genres 1000209 non-null values
dtypes: int64(6), object(4)
In [340]: data.ix[0]
Out[340]:
user_id 1
movie_id 1
rating 5
timestamp 978824268
gender F
age 1
occupation 10
zip 48067
title Toy Story (1995)
genres Animation|Children's|Comedy
Name: 0
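If the overlapping-name inference ever picks up the wrong columns, the join keys can also be named explicitly. The equivalent call below is just an illustrative sketch; the key names follow from the column lists defined above:

# Same merge with the join keys spelled out rather than inferred
data_explicit = pd.merge(pd.merge(ratings, users, on='user_id'),
                         movies, on='movie_id')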
In this form, aggregating the ratings grouped by one or more user or movie attributes
is straightforward once you build some familiarity with pandas. To get mean movie
ratings for each film grouped by gender, we can use the pivot_table method:
In [341]: mean_ratings = data.pivot_table('rating', rows='title',
.....: cols='gender', aggfunc='mean')
In [342]: mean_ratings[:5]
Out[342]:
gender F M
title
$1,000,000 Duck (1971) 3.375000 2.761905
'Night Mother (1986) 3.388889 3.352941
'Til There Was You (1997) 2.675676 2.733333
'burbs, The (1989) 2.793478 2.962085
...And Justice for All (1979) 3.828571 3.689024
This produced another DataFrame containing mean ratings with movie titles as row
labels and gender as column labels. Next, I'm going to filter down to movies that re-
ceived at least 250 ratings (a completely arbitrary number); to do this, I group the data
by title and use size() to get a Series of group sizes for each title:
In [343]: ratings_by_title = data.groupby('title').size()
In [344]: ratings_by_title[:10]
Out[344]:
title
$1,000,000 Duck (1971) 37
'Night Mother (1986) 70
'Til There Was You (1997) 52
'burbs, The (1989) 303
...And Justice for All (1979) 199
1-900 (1994) 2
10 Things I Hate About You (1999) 700
101 Dalmatians (1961) 565
101 Dalmatians (1996) 364
12 Angry Men (1957) 616
In [345]: active_titles = ratings_by_title.index[ratings_by_title >= 250]
In [346]: active_titles
Out[346]:
Index(['burbs, The (1989), 10 Things I Hate About You (1999),
101 Dalmatians (1961), ..., Young Sherlock Holmes (1985),
Zero Effect (1998), eXistenZ (1999)], dtype=object)
The index of titles receiving at least 250 ratings can then be used to select rows from
mean_ratings above:
In [347]: mean_ratings = mean_ratings.ix[active_titles]
In [348]: mean_ratings
Out[348]:
<class 'pandas.core.frame.DataFrame'>
Index: 1216 entries, 'burbs, The (1989) to eXistenZ (1999)
Data columns:
F 1216 non-null values
M 1216 non-null values
dtypes: float64(2)
To see the top films among female viewers, we can sort by the F column in descending
order:
In [350]: top_female_ratings = mean_ratings.sort_index(by='F', ascending=False)
In [351]: top_female_ratings[:10]
Out[351]:
gender F M
Close Shave, A (1995) 4.644444 4.473795
Wrong Trousers, The (1993) 4.588235 4.478261
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950) 4.572650 4.464589
Wallace & Gromit: The Best of Aardman Animation (1996) 4.563107 4.385075
Schindler's List (1993) 4.562602 4.491415
Shawshank Redemption, The (1994) 4.539075 4.560625
Grand Day Out, A (1992) 4.537879 4.293255
To Kill a Mockingbird (1962) 4.536667 4.372611
Creature Comforts (1990) 4.513889 4.272277
Usual Suspects, The (1995) 4.513317 4.518248
Measuring rating disagreement
Suppose you wanted to find the movies that are most divisive between male and female
viewers. One way is to add a column to mean_ratings containing the difference in
means, then sort by that:
In [352]: mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']
Sorting by 'diff' gives us the movies with the greatest rating difference that were
preferred by women:
In [353]: sorted_by_diff = mean_ratings.sort_index(by='diff')
In [354]: sorted_by_diff[:15]
Out[354]:
gender F M diff
Dirty Dancing (1987) 3.790378 2.959596 -0.830782
Jumpin' Jack Flash (1986) 3.254717 2.578358 -0.676359
Grease (1978) 3.975265 3.367041 -0.608224
Little Women (1994) 3.870588 3.321739 -0.548849
Steel Magnolias (1989) 3.901734 3.365957 -0.535777
Anastasia (1997) 3.800000 3.281609 -0.518391
Rocky Horror Picture Show, The (1975) 3.673016 3.160131 -0.512885
Color Purple, The (1985) 4.158192 3.659341 -0.498851
Age of Innocence, The (1993) 3.827068 3.339506 -0.487561
Free Willy (1993) 2.921348 2.438776 -0.482573
French Kiss (1995) 3.535714 3.056962 -0.478752
Little Shop of Horrors, The (1960) 3.650000 3.179688 -0.470312
Guys and Dolls (1955) 4.051724 3.583333 -0.468391
Mary Poppins (1964) 4.197740 3.730594 -0.467147
Patch Adams (1998) 3.473282 3.008746 -0.464536
Reversing the order of the rows and again slicing off the top 15 rows, we get the movies
preferred by men that women didn’t rate as highly:
# Reverse order of rows, take first 15 rows
In [355]: sorted_by_diff[::-1][:15]
Out[355]:
gender F M diff
Good, The Bad and The Ugly, The (1966) 3.494949 4.221300 0.726351
Kentucky Fried Movie, The (1977) 2.878788 3.555147 0.676359
Dumb & Dumber (1994) 2.697987 3.336595 0.638608
Longest Day, The (1962) 3.411765 4.031447 0.619682
Cable Guy, The (1996) 2.250000 2.863787 0.613787
Evil Dead II (Dead By Dawn) (1987) 3.297297 3.909283 0.611985
Hidden, The (1987) 3.137931 3.745098 0.607167
Rocky III (1982) 2.361702 2.943503 0.581801
Caddyshack (1980) 3.396135 3.969737 0.573602
For a Few Dollars More (1965) 3.409091 3.953795 0.544704
Porky's (1981) 2.296875 2.836364 0.539489
Animal House (1978) 3.628906 4.167192 0.538286
Exorcist, The (1973) 3.537634 4.067239 0.529605
Fright Night (1985) 2.973684 3.500000 0.526316
Barb Wire (1996) 1.585366 2.100386 0.515020
Suppose instead you wanted the movies that elicited the most disagreement among
viewers, independent of gender. Disagreement can be measured by the variance or
standard deviation of the ratings:
# Standard deviation of rating grouped by title
In [356]: rating_std_by_title = data.groupby('title')['rating'].std()
# Filter down to active_titles
In [357]: rating_std_by_title = rating_std_by_title.ix[active_titles]
# Order Series by value in descending order
In [358]: rating_std_by_title.order(ascending=False)[:10]
Out[358]:
title
Dumb & Dumber (1994) 1.321333
Blair Witch Project, The (1999) 1.316368
Natural Born Killers (1994) 1.307198
Tank Girl (1995) 1.277695
Rocky Horror Picture Show, The (1975) 1.260177
Eyes Wide Shut (1999) 1.259624
Evita (1996) 1.253631
Billy Madison (1995) 1.249970
Fear and Loathing in Las Vegas (1998) 1.246408
Bicentennial Man (1999) 1.245533
Name: rating
You may have noticed that movie genres are given as a pipe-separated (|) string. If you
wanted to do some analysis by genre, more work would be required to transform the
genre information into a more usable form. I will revisit this data later in the book to
illustrate such a transformation.
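For the curious, one rough sketch of such a transformation is shown below: it builds a long table with one row per (movie, genre) pair. This is only an illustration of one simple approach, not the transformation used later in the book:

# One possible (illustrative) way to expand the pipe-separated genre strings
pairs = []
for movie_id, genre_str in zip(movies.movie_id, movies.genres):
    for genre in genre_str.split('|'):
        pairs.append((movie_id, genre))
movie_genres = pd.DataFrame(pairs, columns=['movie_id', 'genre'])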
US Baby Names 1880-2010
The United States Social Security Administration (SSA) has made available data on the
frequency of baby names from 1880 through the present. Hadley Wickham, an author
of several popular R packages, has often made use of this data set in illustrating data
manipulation in R.
In [4]: names.head(10)
Out[4]:
name sex births year
0 Mary F 7065 1880
1 Anna F 2604 1880
2 Emma F 2003 1880
3 Elizabeth F 1939 1880
4 Minnie F 1746 1880
5 Margaret F 1578 1880
6 Ida F 1472 1880
7 Alice F 1414 1880
8 Bertha F 1320 1880
9 Sarah F 1288 1880
There are many things you might want to do with the data set:
• Visualize the proportion of babies given a particular name (your own, or another
name) over time.
• Determine the relative rank of a name.
• Determine the most popular names in each year or the names with largest increases
or decreases.
• Analyze trends in names: vowels, consonants, length, overall diversity, changes in
spelling, first and last letters
• Analyze external sources of trends: biblical names, celebrities, demographic
changes
Using the tools we’ve looked at so far, most of these kinds of analyses are very straight-
forward, so I will walk you through many of them. I encourage you to download and
explore the data yourself. If you find an interesting pattern in the data, I would love to
hear about it.
As of this writing, the US Social Security Administration makes available data files, one
per year, containing the total number of births for each sex/name combination. The
raw archive of these files can be obtained here:
http://www.ssa.gov/oact/babynames/limits.html
In the event that this page has been moved by the time you’re reading this, it can most
likely be located again by Internet search. After downloading the “National data” file
names.zip and unzipping it, you will have a directory containing a series of files like
yob1880.txt. I use the UNIX head command to look at the first 10 lines of one of the
files (on Windows, you can use the more command or open it in a text editor):
In [367]: !head -n 10 names/yob1880.txt
Mary,F,7065
Anna,F,2604
Emma,F,2003
Elizabeth,F,1939
Minnie,F,1746
Margaret,F,1578
Ida,F,1472
Alice,F,1414
Bertha,F,1320
Sarah,F,1288
As this is a nicely comma-separated form, it can be loaded into a DataFrame with
pandas.read_csv:
In [368]: import pandas as pd
In [369]: names1880 = pd.read_csv('names/yob1880.txt', names=['name', 'sex', 'births'])
In [370]: names1880
Out[370]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2000 entries, 0 to 1999
Data columns:
name 2000 non-null values
sex 2000 non-null values
births 2000 non-null values
dtypes: int64(1), object(2)
These files only contain names with at least 5 occurrences in each year, so for simplic-
ity’s sake we can use the sum of the births column by sex as the total number of births
in that year:
In [371]: names1880.groupby('sex').births.sum()
Out[371]:
sex
F 90993
M 110493
Name: births
Since the data set is split into files by year, one of the first things to do is to assemble
all of the data into a single DataFrame and further to add a year field. This is easy to
do using pandas.concat:
# 2010 is the last available year right now
years = range(1880, 2011)
pieces = []
columns = ['name', 'sex', 'births']
for year in years:
    path = 'names/yob%d.txt' % year
    frame = pd.read_csv(path, names=columns)
    frame['year'] = year
    pieces.append(frame)
# Concatenate everything into a single DataFrame
names = pd.concat(pieces, ignore_index=True)
There are a couple of things to note here. First, remember that concat glues the DataFrame
objects together row-wise by default. Second, you have to pass ignore_index=True
because we're not interested in preserving the original row numbers returned from
read_csv. We now have a very large DataFrame containing all of the names data:
In [373]: names
Out[373]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1690784 entries, 0 to 1690783
Data columns:
name 1690784 non-null values
sex 1690784 non-null values
births 1690784 non-null values
year 1690784 non-null values
dtypes: int64(2), object(2)
With this data in hand, we can already start aggregating the data at the year and sex
level using groupby or pivot_table (see Figure 2-4):
In [374]: total_births = names.pivot_table('births', rows='year',
.....: cols='sex', aggfunc=sum)
In [375]: total_births.tail()
Out[375]:
sex F M
year
2006 1896468 2050234
2007 1916888 2069242
2008 1883645 2032310
2009 1827643 1973359
2010 1759010 1898382
In [376]: total_births.plot(title='Total births by sex and year')
Next, let’s insert a column prop with the fraction of babies given each name relative to
the total number of births. A prop value of 0.02 would indicate that 2 out of every 100
babies were given a particular name. Thus, we group the data by year and sex, then add
the new column to each group:
def add_prop(group):
    # Integer division floors
    births = group.births.astype(float)
    group['prop'] = births / births.sum()
    return group
names = names.groupby(['year', 'sex']).apply(add_prop)
Remember that because births is of integer type, we have to cast either
the numerator or denominator to floating point to compute a fraction
(unless you are using Python 3!).
The resulting complete data set now has the following columns:
In [378]: names
Out[378]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1690784 entries, 0 to 1690783
Data columns:
name 1690784 non-null values
sex 1690784 non-null values
births 1690784 non-null values
year 1690784 non-null values
prop 1690784 non-null values
dtypes: float64(1), int64(2), object(2)
When performing a group operation like this, it's often valuable to do a sanity check,
like verifying that the prop column sums to 1 within all the groups. Since this is floating
point data, use np.allclose to check that the group sums are sufficiently close to (but
perhaps not exactly equal to) 1:
In [379]: np.allclose(names.groupby(['year', 'sex']).prop.sum(), 1)
Out[379]: True
Now that this is done, I’m going to extract a subset of the data to facilitate further
analysis: the top 1000 names for each sex/year combination. This is yet another group
operation:
def get_top1000(group):
    return group.sort_index(by='births', ascending=False)[:1000]

grouped = names.groupby(['year', 'sex'])
top1000 = grouped.apply(get_top1000)

Figure 2-4. Total births by sex and year
If you prefer a do-it-yourself approach, you could also do:
pieces = []
for year, group in names.groupby(['year', 'sex']):
    pieces.append(group.sort_index(by='births', ascending=False)[:1000])
top1000 = pd.concat(pieces, ignore_index=True)
The resulting data set is now quite a bit smaller:
In [382]: top1000
Out[382]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 261877 entries, 0 to 261876
Data columns:
name 261877 non-null values
sex 261877 non-null values
births 261877 non-null values
year 261877 non-null values
prop 261877 non-null values
dtypes: float64(1), int64(2), object(2)
We’ll use this Top 1,000 data set in the following investigations into the data.
Analyzing Naming Trends
With the full data set and Top 1,000 data set in hand, we can start analyzing various
naming trends of interest. An easy first step is splitting the Top 1,000 names into the
boy and girl portions:
In [383]: boys = top1000[top1000.sex == 'M']
In [384]: girls = top1000[top1000.sex == 'F']
Simple time series, like the number of Johns or Marys for each year, can be plotted but
require a bit of munging to be more useful. Let's form a pivot table of the total
number of births by year and name:
In [385]: total_births = top1000.pivot_table('births', rows='year', cols='name',
.....: aggfunc=sum)
Now, this can be plotted for a handful of names using DataFrame’s plot method:
In [386]: total_births
Out[386]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 131 entries, 1880 to 2010
Columns: 6865 entries, Aaden to Zuri
dtypes: float64(6865)
In [387]: subset = total_births[['John', 'Harry', 'Mary', 'Marilyn']]
In [388]: subset.plot(subplots=True, figsize=(12, 10), grid=False,
.....: title="Number of births per year")
See Figure 2-5 for the result. On looking at this, you might conclude that these names
have grown out of favor with the American population. But the story is actually more
complicated than that, as will be explored in the next section.
Figure 2-5. A few boy and girl names over time
Measuring the increase in naming diversity
One explanation for the decrease in the plots above is that fewer parents are choosing
common names for their children. This hypothesis can be explored and confirmed in
the data. One measure is the proportion of births represented by the top 1000 most
popular names, which I aggregate and plot by year and sex:
In [390]: table = top1000.pivot_table('prop', rows='year',
.....: cols='sex', aggfunc=sum)
In [391]: table.plot(title='Sum of table1000.prop by year and sex',
.....: yticks=np.linspace(0, 1.2, 13), xticks=range(1880, 2020, 10))
See Figure 2-6 for this plot. So you can see that, indeed, there appears to be increasing
name diversity (decreasing total proportion in the top 1,000). Another interesting met-
ric is the number of distinct names, taken in order of popularity from highest to lowest,
in the top 50% of births. This number is a bit more tricky to compute. Let’s consider
just the boy names from 2010:
In [392]: df = boys[boys.year == 2010]
In [393]: df
Out[393]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 260877 to 261876
Data columns:
name 1000 non-null values
sex 1000 non-null values
births 1000 non-null values
year 1000 non-null values
prop 1000 non-null values
dtypes: float64(1), int64(2), object(2)
Figure 2-6. Proportion of births represented in top 1000 names by sex
After sorting prop in descending order, we want to know how many of the most popular
names it takes to reach 50%. You could write a for loop to do this, but a vectorized
NumPy way is a bit more clever. Taking the cumulative sum, cumsum, of prop and then calling
the method searchsorted returns the position in the cumulative sum at which 0.5 would
need to be inserted to keep it in sorted order:
In [394]: prop_cumsum = df.sort_index(by='prop', ascending=False).prop.cumsum()
In [395]: prop_cumsum[:10]
Out[395]:
260877 0.011523
260878 0.020934
260879 0.029959
260880 0.038930
260881 0.047817
260882 0.056579
260883 0.065155
260884 0.073414
260885 0.081528
260886 0.089621
In [396]: prop_cumsum.searchsorted(0.5)
Out[396]: 116
Since arrays are zero-indexed, adding 1 to this result gives you a result of 117. By con-
trast, in 1900 this number was much smaller:
In [397]: df = boys[boys.year == 1900]
In [398]: in1900 = df.sort_index(by='prop', ascending=False).prop.cumsum()
In [399]: in1900.searchsorted(0.5) + 1
Out[399]: 25
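To make the searchsorted trick concrete, the same count could also be computed with an ordinary loop; the helper below is only a sketch to show the equivalence, not code used elsewhere:

# Plain-Python version of the same idea: count how many of the most popular
# names are needed before their combined proportion reaches 50%
def count_names_for_half(props):
    running_total = 0.0
    count = 0
    for p in sorted(props, reverse=True):
        running_total += p
        count += 1
        if running_total >= 0.5:
            break
    return count

# count_names_for_half(df.prop) gives 25 for the 1900 boys subset above
# (and 117 for the 2010 subset)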
It should now be fairly straightforward to apply this operation to each year/sex com-
bination; groupby those fields and apply a function returning the count for each group:
def get_quantile_count(group, q=0.5):
    group = group.sort_index(by='prop', ascending=False)
    return group.prop.cumsum().searchsorted(q) + 1
diversity = top1000.groupby(['year', 'sex']).apply(get_quantile_count)
diversity = diversity.unstack('sex')
This resulting DataFrame diversity now has two time series, one for each sex, indexed
by year. This can be inspected in IPython and plotted as before (see Figure 2-7):
In [401]: diversity.head()
Out[401]:
sex F M
year
1880 38 14
1881 38 14
1882 38 15
1883 39 15
1884 39 16
In [402]: diversity.plot(title="Number of popular names in top 50%")
Figure 2-7. Plot of diversity metric by year
As you can see, girl names have always been more diverse than boy names, and they
have only become more so over time. Further analysis of what exactly is driving the
diversity, like the increase of alternate spellings, is left to the reader.
The “Last letter” Revolution
In 2007, baby name researcher Laura Wattenberg pointed out on her website
(http://www.babynamewizard.com) that the distribution of boy names by final letter has
changed significantly over the last 100 years. To see this, I first aggregate all of the births
in the full data set by year, sex, and final letter:
# extract last letter from name column
get_last_letter = lambda x: x[-1]
last_letters = names.name.map(get_last_letter)
last_letters.name = 'last_letter'
table = names.pivot_table('births', rows=last_letters,
                          cols=['sex', 'year'], aggfunc=sum)
Then, I select out three representative years spanning the history and print the first few
rows:
In [404]: subtable = table.reindex(columns=[1910, 1960, 2010], level='year')
In [405]: subtable.head()
Out[405]:
sex F M
year 1910 1960 2010 1910 1960 2010
last_letter
a 108376 691247 670605 977 5204 28438
b NaN 694 450 411 3912 38859
c 5 49 946 482 15476 23125
d 6750 3729 2607 22111 262112 44398
e 133569 435013 313833 28655 178823 129012
Next, normalize the table by total births to compute a new table containing proportion
of total births for each sex ending in each letter:
In [406]: subtable.sum()
Out[406]:
sex year
F 1910 396416
1960 2022062
2010 1759010
M 1910 194198
1960 2132588
2010 1898382
In [407]: letter_prop = subtable / subtable.sum().astype(float)
With the letter proportions now in hand, I can make bar plots for each sex broken
down by year. See Figure 2-8:
import matplotlib.pyplot as plt
fig, axes = plt.subplots(2, 1, figsize=(10, 8))
letter_prop['M'].plot(kind='bar', rot=0, ax=axes[0], title='Male')
letter_prop['F'].plot(kind='bar', rot=0, ax=axes[1], title='Female',
                      legend=False)
Figure 2-8. Proportion of boy and girl names ending in each letter
As you can see, boy names ending in “n” have experienced significant growth since the
1960s. Going back to the full table created above, I again normalize by year and sex
and select a subset of letters for the boy names, finally transposing to make each column
a time series:
In [410]: letter_prop = table / table.sum().astype(float)
In [411]: dny_ts = letter_prop.ix[['d', 'n', 'y'], 'M'].T
In [412]: dny_ts.head()
Out[412]:
d n y
year
1880 0.083055 0.153213 0.075760
1881 0.083247 0.153214 0.077451
1882 0.085340 0.149560 0.077537
1883 0.084066 0.151646 0.079144
1884 0.086120 0.149915 0.080405
With this DataFrame of time series in hand, I can make a plot of the trends over time
again with its plot method (see Figure 2-9):
In [414]: dny_ts.plot()
Figure 2-9. Proportion of boys born with names ending in d/n/y over time
Boy names that became girl names (and vice versa)
Another fun trend is looking at names that were more popular with one sex earlier
in the sample but have "changed sexes" over time. One example is the name Lesley
or Leslie. Going back to the top1000 dataset, I compute a list of names occurring in the
dataset starting with 'lesl':
In [415]: all_names = top1000.name.unique()
In [416]: mask = np.array(['lesl' in x.lower() for x in all_names])
In [417]: lesley_like = all_names[mask]
In [418]: lesley_like
Out[418]: array([Leslie, Lesley, Leslee, Lesli, Lesly], dtype=object)
From there, we can filter down to just those names and sum births grouped by name
to see the relative frequencies:
In [419]: filtered = top1000[top1000.name.isin(lesley_like)]
In [420]: filtered.groupby('name').births.sum()
Out[420]:
name
Leslee 1082
Lesley 35022
Lesli 929
Leslie 370429
Lesly 10067
Name: births
Next, let’s aggregate by sex and year and normalize within year:
In [421]: table = filtered.pivot_table('births', rows='year',
.....: cols='sex', aggfunc='sum')
In [422]: table = table.div(table.sum(1), axis=0)
In [423]: table.tail()
Out[423]:
sex F M
year
2006 1 NaN
2007 1 NaN
2008 1 NaN
2009 1 NaN
2010 1 NaN
Lastly, it’s now easy to make a plot of the breakdown by sex over time (Figure 2-10):
In [425]: table.plot(style={'M': 'k-', 'F': 'k--'})
Figure 2-10. Proportion of male/female Lesley-like names over time
Conclusions and The Path Ahead
The examples in this chapter are rather simple, but they’re here to give you a bit of a
flavor of what sorts of things you can expect in the upcoming chapters. The focus of
this book is on tools as opposed to presenting more sophisticated analytical methods.
Mastering the techniques in this book will enable you to implement your own analyses
(assuming you know what you want to do!) in short order.
CHAPTER 3
IPython: An Interactive Computing and Development Environment
Act without doing; work without effort. Think of the small as large and the few as many.
Confront the difficult while it is still easy; accomplish the great task by a series of small
acts.
—Laozi
People often ask me, “What is your Python development environment?” My answer is
almost always the same, “IPython and a text editor”. You may choose to substitute an
Integrated Development Environment (IDE) for a text editor in order to take advantage
of more advanced graphical tools and code completion capabilities. Even if so, I strongly
recommend making IPython an important part of your workflow. Some IDEs even
provide IPython integration, so it’s possible to get the best of both worlds.
The IPython project began in 2001 as Fernando Pérez’s side project to make a better
interactive Python interpreter. In the subsequent 11 years it has grown into what’s
widely considered one of the most important tools in the modern scientific Python
computing stack. While it does not provide any computational or data analytical tools
by itself, IPython is designed from the ground up to maximize your productivity in both
interactive computing and software development. It encourages an execute-explore
workflow instead of the typical edit-compile-run workflow of many other programming
languages. It also provides very tight integration with the operating system’s shell and
file system. Since much of data analysis coding involves exploration, trial and error,
and iteration, IPython will, in almost all cases, help you get the job done faster.
Of course, the IPython project now encompasses a great deal more than just an en-
hanced, interactive Python shell. It also includes a rich GUI console with inline plotting,
a web-based interactive notebook format, and a lightweight, fast parallel computing
engine. And, as with so many other tools designed for and by programmers, it is highly
customizable. I’ll discuss some of these features later in the chapter.
Since IPython has interactivity at its core, some of the features in this chapter are dif-
ficult to fully illustrate without a live console. If this is your first time learning about
IPython, I recommend that you follow along with the examples to get a feel for how
things work. As with any keyboard-driven console-like environment, developing mus-
cle-memory for the common commands is part of the learning curve.
Many parts of this chapter (for example: profiling and debugging) can
be safely omitted on a first reading as they are not necessary for under-
standing the rest of the book. This chapter is intended to provide a
standalone, rich overview of the functionality provided by IPython.
IPython Basics
You can launch IPython on the command line just like launching the regular Python
interpreter except with the ipython command:
$ ipython
Python 2.7.2 (default, May 27 2012, 21:26:12)
Type "copyright", "credits" or "license" for more information.
IPython 0.12 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
In [1]: a = 5
In [2]: a
Out[2]: 5
You can execute arbitrary Python statements by typing them in and pressing
<return>. When typing just a variable into IPython, it renders a string representation
of the object:
In [542]: data = {i : randn() for i in range(7)}
In [543]: data
Out[543]:
{0: 0.6900018528091594,
1: 1.0015434424937888,
2: -0.5030873913603446,
3: -0.6222742250596455,
4: -0.9211686080130108,
5: -0.726213492660829,
6: 0.2228955458351768}
Many kinds of Python objects are formatted to be more readable, or pretty-printed,
which is distinct from normal printing with print. If you printed a dict like the above
in the standard Python interpreter, it would be much less readable:
>>> from numpy.random import randn
>>> data = {i : randn() for i in range(7)}
>>> print data
{0: -1.5948255432744511, 1: 0.10569006472787983, 2: 1.972367135977295,
3: 0.15455217573074576, 4: -0.24058577449429575, 5: -1.2904897053651216,
6: 0.3308507317325902}
IPython also provides facilities to make it easy to execute arbitrary blocks of code (via
somewhat glorified copy-and-pasting) and whole Python scripts. These will be dis-
cussed shortly.
Tab Completion
On the surface, the IPython shell looks like a cosmetically slightly-different interactive
Python interpreter. Users of Mathematica may find the enumerated input and output
prompts familiar. One of the major improvements over the standard Python shell is
tab completion, a feature common to most interactive data analysis environments.
While entering expressions in the shell, pressing <Tab> will search the namespace for
any variables (objects, functions, etc.) matching the characters you have typed so far:
In [1]: an_apple = 27
In [2]: an_example = 42
In [3]: an<Tab>
an_apple and an_example any
In this example, note that IPython displayed the two variables I defined as well as
the Python keyword and and the built-in function any. Naturally, you can also complete
methods and attributes on any object after typing a period:
In [3]: b = [1, 2, 3]
In [4]: b.<Tab>
b.append b.extend b.insert b.remove b.sort
b.count b.index b.pop b.reverse
The same goes for modules:
In [1]: import datetime
In [2]: datetime.<Tab>
datetime.date datetime.MAXYEAR datetime.timedelta
datetime.datetime datetime.MINYEAR datetime.tzinfo
datetime.datetime_CAPI datetime.time
Note that IPython by default hides methods and attributes starting with
underscores, such as magic methods and internal “private” methods
and attributes, in order to avoid cluttering the display (and confusing
new Python users!). These, too, can be tab-completed but you must first
type an underscore to see them. If you prefer to always see such methods
in tab completion, you can change this setting in the IPython configu-
ration.
Tab completion works in many contexts outside of searching the interactive namespace
and completing object or module attributes. When typing anything that looks like a file
path (even in a Python string), pressing <Tab> will complete anything on your com-
puter’s file system matching what you’ve typed:
In [3]: book_scripts/<Tab>
book_scripts/cprof_example.py book_scripts/ipython_script_test.py
book_scripts/ipython_bug.py book_scripts/prof_mod.py
In [3]: path = 'book_scripts/<Tab>
book_scripts/cprof_example.py book_scripts/ipython_script_test.py
book_scripts/ipython_bug.py book_scripts/prof_mod.py
Combined with the %run command (see later section), this functionality will undoubt-
edly save you many keystrokes.
Another area where tab completion saves time is in the completion of function keyword
arguments (including the = sign!).
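For example, given a function that takes keyword arguments (the function below is made up purely to illustrate this), typing the first letters of an argument name and pressing <Tab> fills in the rest, equals sign included:

def plot_data(data, smooth=False, line_color='k'):
    """A hypothetical function, used only to illustrate keyword completion."""
    pass

# In the IPython shell, typing:
#     plot_data(data, sm<Tab>
# completes to:
#     plot_data(data, smooth=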
Introspection
Using a question mark (?) before or after a variable will display some general informa-
tion about the object:
In [545]: b?
Type: list
String Form:[1, 2, 3]
Length: 3
Docstring:
list() -> new empty list
list(iterable) -> new list initialized from iterable's items
This is referred to as object introspection. If the object is a function or instance method,
the docstring, if defined, will also be shown. Suppose we’d written the following func-
tion:
def add_numbers(a, b):
    """
    Add two numbers together

    Returns
    -------
    the_sum : type of arguments
    """
    return a + b
Then using ? shows us the docstring:
In [547]: add_numbers?
Type: function
String Form:<function add_numbers at 0x5fad848>
File: book_scripts/<ipython-input-546-5473012eeb65>
Definition: add_numbers(a, b)
Docstring:
Add two numbers together
Returns
-------
the_sum : type of arguments
Using ?? will also show the function’s source code if possible:
In [548]: add_numbers??
Type: function
String Form:<function add_numbers at 0x5fad848>
File: book_scripts/<ipython-input-546-5473012eeb65>
Definition: add_numbers(a, b)
Source:
def add_numbers(a, b):
    """
    Add two numbers together

    Returns
    -------
    the_sum : type of arguments
    """
    return a + b
? has a final usage, which is for searching the IPython namespace in a manner similar
to the standard UNIX or Windows command line. A number of characters combined
with the wildcard (*) will show all names matching the wildcard expression. For ex-
ample, we could get a list of all functions in the top level NumPy namespace containing
load:
In [549]: np.*load*?
np.load
np.loads
np.loadtxt
np.pkgload
The %run Command
Any file can be run as a Python program inside the environment of your IPython session
using the %run command. Suppose you had the following simple script stored in ipy
thon_script_test.py:
def f(x, y, z):
    return (x + y) / z

a = 5
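Running such a script with %run executes it in an otherwise empty namespace, after which the variables it defines become available in the interactive session. A rough illustration (the prompt numbers are arbitrary):

In [1]: %run ipython_script_test.py

In [2]: a
Out[2]: 5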
IPython Basics | 49
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy
Python for Data Analysis Made Easy

Interacting with the Operating System 60
Shell Commands and Aliases 60
Directory Bookmark System 62
Software Development Tools 62
Interactive Debugger 62
Timing Code: %time and %timeit 67
Basic Profiling: %prun and %run -p 68
Profiling a Function Line-by-Line 70
IPython HTML Notebook 72
Tips for Productive Code Development Using IPython 72
Reloading Module Dependencies 74
Code Design Tips 74
Advanced IPython Features 76
Making Your Own Classes IPython-friendly 76
Profiles and Configuration 77
Credits 78
4. NumPy Basics: Arrays and Vectorized Computation . . . . . . . . . . . . . . . . . . . . . . . . . . 79
The NumPy ndarray: A Multidimensional Array Object 80
Creating ndarrays 81
Data Types for ndarrays 83
Operations between Arrays and Scalars 85
Basic Indexing and Slicing 86
Boolean Indexing 89
Fancy Indexing 92
Transposing Arrays and Swapping Axes 93
Universal Functions: Fast Element-wise Array Functions 95
Data Processing Using Arrays 97
Expressing Conditional Logic as Array Operations 98
Mathematical and Statistical Methods 100
Methods for Boolean Arrays 101
Sorting 101
Unique and Other Set Logic 102
File Input and Output with Arrays 103
Storing Arrays on Disk in Binary Format 103
Saving and Loading Text Files 104
Linear Algebra 105
Random Number Generation 106
Example: Random Walks 108
Simulating Many Random Walks at Once 109
5. Getting Started with pandas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Introduction to pandas Data Structures 112
Series 112
DataFrame 115
Index Objects 120
Essential Functionality 122
Reindexing 122
Dropping entries from an axis 125
Indexing, selection, and filtering 125
Arithmetic and data alignment 128
Function application and mapping 132
Sorting and ranking 133
Axis indexes with duplicate values 136
Summarizing and Computing Descriptive Statistics 137
Correlation and Covariance 139
Unique Values, Value Counts, and Membership 141
Handling Missing Data 142
Filtering Out Missing Data 143
Filling in Missing Data 145
Hierarchical Indexing 147
Reordering and Sorting Levels 149
Summary Statistics by Level 150
Using a DataFrame’s Columns 150
Other pandas Topics 151
Integer Indexing 151
Panel Data 152
6. Data Loading, Storage, and File Formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
Reading and Writing Data in Text Format 155
Reading Text Files in Pieces 160
Writing Data Out to Text Format 162
Manually Working with Delimited Formats 163
JSON Data 165
XML and HTML: Web Scraping 166
Binary Data Formats 171
Using HDF5 Format 171
Reading Microsoft Excel Files 172
Interacting with HTML and Web APIs 173
Interacting with Databases 174
Storing and Loading Data in MongoDB 176
7. Data Wrangling: Clean, Transform, Merge, Reshape . . . . . . . . . . . . . . . . . . . . . . . . . 177
Combining and Merging Data Sets 177
Database-style DataFrame Merges 178
Merging on Index 182
Concatenating Along an Axis 185
Combining Data with Overlap 188
Reshaping and Pivoting 189
Reshaping with Hierarchical Indexing 190
Pivoting “long” to “wide” Format 192
Data Transformation 194
Removing Duplicates 194
Transforming Data Using a Function or Mapping 195
Replacing Values 196
Renaming Axis Indexes 197
Discretization and Binning 199
Detecting and Filtering Outliers 201
Permutation and Random Sampling 202
Computing Indicator/Dummy Variables 203
String Manipulation 205
String Object Methods 206
Regular expressions 207
Vectorized string functions in pandas 210
Example: USDA Food Database 212
8. Plotting and Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
A Brief matplotlib API Primer 219
Figures and Subplots 220
Colors, Markers, and Line Styles 224
Ticks, Labels, and Legends 225
Annotations and Drawing on a Subplot 228
Saving Plots to File 231
matplotlib Configuration 231
Plotting Functions in pandas 232
Line Plots 232
Bar Plots 235
Histograms and Density Plots 238
Scatter Plots 239
Plotting Maps: Visualizing Haiti Earthquake Crisis Data 241
Python Visualization Tool Ecosystem 247
Chaco 248
mayavi 248
Other Packages 248
The Future of Visualization Tools? 249
9. Data Aggregation and Group Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
GroupBy Mechanics 252
Iterating Over Groups 255
Selecting a Column or Subset of Columns 256
Grouping with Dicts and Series 257
Grouping with Functions 258
Grouping by Index Levels 259
Data Aggregation 259
Column-wise and Multiple Function Application 262
Returning Aggregated Data in “unindexed” Form 264
Group-wise Operations and Transformations 264
Apply: General split-apply-combine 266
Quantile and Bucket Analysis 268
Example: Filling Missing Values with Group-specific Values 270
Example: Random Sampling and Permutation 271
Example: Group Weighted Average and Correlation 273
Example: Group-wise Linear Regression 274
Pivot Tables and Cross-Tabulation 275
Cross-Tabulations: Crosstab 277
Example: 2012 Federal Election Commission Database 278
Donation Statistics by Occupation and Employer 280
Bucketing Donation Amounts 283
Donation Statistics by State 285
10. Time Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
Date and Time Data Types and Tools 290
Converting between string and datetime 291
Time Series Basics 293
Indexing, Selection, Subsetting 294
Time Series with Duplicate Indices 296
Date Ranges, Frequencies, and Shifting 297
Generating Date Ranges 298
Frequencies and Date Offsets 299
Shifting (Leading and Lagging) Data 301
Time Zone Handling 303
Localization and Conversion 304
Operations with Time Zone−aware Timestamp Objects 305
Operations between Different Time Zones 306
Periods and Period Arithmetic 307
Period Frequency Conversion 308
Quarterly Period Frequencies 309
Converting Timestamps to Periods (and Back) 311
Creating a PeriodIndex from Arrays 312
Resampling and Frequency Conversion 312
Downsampling 314
Upsampling and Interpolation 316
Resampling with Periods 318
Time Series Plotting 319
Moving Window Functions 320
Exponentially-weighted functions 324
Binary Moving Window Functions 324
User-Defined Moving Window Functions 326
Performance and Memory Usage Notes 327
11. Financial and Economic Data Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
Data Munging Topics 329
Time Series and Cross-Section Alignment 330
Operations with Time Series of Different Frequencies 332
Time of Day and “as of” Data Selection 334
Splicing Together Data Sources 336
Return Indexes and Cumulative Returns 338
Group Transforms and Analysis 340
Group Factor Exposures 342
Decile and Quartile Analysis 343
More Example Applications 345
Signal Frontier Analysis 345
Future Contract Rolling 347
Rolling Correlation and Linear Regression 350
12. Advanced NumPy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
ndarray Object Internals 353
NumPy dtype Hierarchy 354
Advanced Array Manipulation 355
Reshaping Arrays 355
C versus Fortran Order 356
Concatenating and Splitting Arrays 357
Repeating Elements: Tile and Repeat 360
Fancy Indexing Equivalents: Take and Put 361
Broadcasting 362
Broadcasting Over Other Axes 364
Setting Array Values by Broadcasting 367
Advanced ufunc Usage 367
ufunc Instance Methods 368
Custom ufuncs 370
Structured and Record Arrays 370
Nested dtypes and Multidimensional Fields 371
Why Use Structured Arrays? 372
Structured Array Manipulations: numpy.lib.recfunctions 372
More About Sorting 373
Indirect Sorts: argsort and lexsort 374
Alternate Sort Algorithms 375
numpy.searchsorted: Finding elements in a Sorted Array 376
NumPy Matrix Class 377
Advanced Array Input and Output 379
Memory-mapped Files 379
HDF5 and Other Array Storage Options 380
Performance Tips 380
The Importance of Contiguous Memory 381
Other Speed Options: Cython, f2py, C 382
Appendix: Python Language Essentials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433
Preface
The scientific Python ecosystem of open source libraries has grown substantially over the last 10 years. By late 2011, I had long felt that the lack of centralized learning resources for data analysis and statistical applications was a stumbling block for new Python programmers engaged in such work. Key projects for data analysis (especially NumPy, IPython, matplotlib, and pandas) had also matured enough that a book written about them would likely not go out-of-date very quickly. Thus, I mustered the nerve to embark on this writing project. This is the book that I wish existed when I started using Python for data analysis in 2007. I hope you find it useful and are able to apply these tools productively in your work.

Conventions Used in This Book
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.
Constant width bold
Shows commands or other text that should be typed literally by the user.
Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.
This icon signifies a tip, suggestion, or general note.
This icon indicates a warning or caution.

Using Code Examples
This book is here to help you get your job done. In general, you may use the code in this book in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Python for Data Analysis by William Wesley McKinney (O’Reilly). Copyright 2012 William McKinney, 978-1-449-31979-3.”
If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

Safari® Books Online
Safari Books Online (www.safaribooksonline.com) is an on-demand digital library that delivers expert content in both book and video form from the world’s leading authors in technology and business.
Technology professionals, software developers, web designers, and business and creative professionals use Safari Books Online as their primary resource for research, problem solving, learning, and certification training.
Safari Books Online offers a range of product mixes and pricing programs for organizations, government agencies, and individuals. Subscribers have access to thousands of books, training videos, and prepublication manuscripts in one fully searchable database from publishers like O’Reilly Media, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, Course Technology, and dozens more.
For more information about Safari Books Online, please visit us online.

How to Contact Us
Please address comments and questions concerning this book to the publisher:
O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)
We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://oreil.ly/python_for_data_analysis.
To comment or ask technical questions about this book, send email to bookquestions@oreilly.com.
For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia
CHAPTER 1
Preliminaries

What Is This Book About?
This book is concerned with the nuts and bolts of manipulating, processing, cleaning, and crunching data in Python. It is also a practical, modern introduction to scientific computing in Python, tailored for data-intensive applications. This is a book about the parts of the Python language and libraries you’ll need to effectively solve a broad set of data analysis problems. This book is not an exposition on analytical methods using Python as the implementation language.
When I say “data”, what am I referring to exactly? The primary focus is on structured data, a deliberately vague term that encompasses many different common forms of data, such as:
• Multidimensional arrays (matrices)
• Tabular or spreadsheet-like data in which each column may be a different type (string, numeric, date, or otherwise). This includes most kinds of data commonly stored in relational databases or tab- or comma-delimited text files
• Multiple tables of data interrelated by key columns (what would be primary or foreign keys for a SQL user)
• Evenly or unevenly spaced time series
This is by no means a complete list. Even though it may not always be obvious, a large percentage of data sets can be transformed into a structured form that is more suitable for analysis and modeling. If not, it may be possible to extract features from a data set into a structured form. As an example, a collection of news articles could be processed into a word frequency table, which could then be used to perform sentiment analysis.
Most users of spreadsheet programs like Microsoft Excel, perhaps the most widely used data analysis tool in the world, will not be strangers to these kinds of data.
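To make the news-articles example above concrete, here is a minimal sketch (not one of the book's own examples) of turning raw text into a word frequency table using only the Python standard library; the articles list is a hypothetical stand-in for real data:

from collections import Counter

# Hypothetical stand-in for a collection of news articles
articles = [
    "Stocks rally as markets shrug off weak data",
    "Weak data weighs on markets as stocks stumble",
]

# Normalize to lowercase words and count occurrences across all articles
counts = Counter(word for text in articles for word in text.lower().split())

print(counts.most_common(5))
# e.g. [('as', 2), ('stocks', 2), ('markets', 2), ('weak', 2), ('data', 2)]

The resulting counts behave like a plain table of (word, frequency) rows, which is exactly the kind of structured form the paragraph above describes.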
Why Python for Data Analysis?
For many people (myself among them), the Python language is easy to fall in love with. Since its first appearance in 1991, Python has become one of the most popular dynamic programming languages, along with Perl, Ruby, and others. Python and Ruby have become especially popular in recent years for building websites using their numerous web frameworks, like Rails (Ruby) and Django (Python). Such languages are often called scripting languages as they can be used to write quick-and-dirty small programs, or scripts. I don’t like the term “scripting language” as it carries a connotation that they cannot be used for building mission-critical software. Among interpreted languages Python is distinguished by its large and active scientific computing community. Adoption of Python for scientific computing in both industry applications and academic research has increased significantly since the early 2000s.
For data analysis and interactive, exploratory computing and data visualization, Python will inevitably draw comparisons with the many other domain-specific open source and commercial programming languages and tools in wide use, such as R, MATLAB, SAS, Stata, and others. In recent years, Python’s improved library support (primarily pandas) has made it a strong alternative for data manipulation tasks. Combined with Python’s strength in general purpose programming, it is an excellent choice as a single language for building data-centric applications.

Python as Glue
Part of Python’s success as a scientific computing platform is the ease of integrating C, C++, and FORTRAN code. Most modern computing environments share a similar set of legacy FORTRAN and C libraries for doing linear algebra, optimization, integration, fast Fourier transforms, and other such algorithms. The same story has held true for many companies and national labs that have used Python to glue together 30 years’ worth of legacy software.
Most programs consist of small portions of code where most of the time is spent, with large amounts of “glue code” that doesn’t run often. In many cases, the execution time of the glue code is insignificant; effort is most fruitfully invested in optimizing the computational bottlenecks, sometimes by moving the code to a lower-level language like C. In the last few years, the Cython project (http://cython.org) has become one of the preferred ways of both creating fast compiled extensions for Python and also interfacing with C and C++ code.

Solving the “Two-Language” Problem
In many organizations, it is common to research, prototype, and test new ideas using a more domain-specific computing language like MATLAB or R and then later port those ideas to be part of a larger production system written in, say, Java, C#, or C++. What people are increasingly finding is that Python is a suitable language not only for doing research and prototyping but also for building the production systems. I believe that more and more companies will go down this path, as there are often significant organizational benefits to having both scientists and technologists using the same set of programmatic tools.

Why Not Python?
While Python is an excellent environment for building computationally-intensive scientific applications and building most kinds of general purpose systems, there are a number of uses for which Python may be less suitable.
As Python is an interpreted programming language, in general most Python code will run substantially slower than code written in a compiled language like Java or C++. As programmer time is typically more valuable than CPU time, many are happy to make this tradeoff. However, in an application with very low latency requirements (for example, a high frequency trading system), the time spent programming in a lower-level, lower-productivity language like C++ to achieve the maximum possible performance might be time well spent.
Python is not an ideal language for highly concurrent, multithreaded applications, particularly applications with many CPU-bound threads. The reason for this is that it has what is known as the global interpreter lock (GIL), a mechanism which prevents the interpreter from executing more than one Python bytecode instruction at a time. The technical reasons for why the GIL exists are beyond the scope of this book, but as of this writing it does not seem likely that the GIL will disappear anytime soon. While it is true that in many big data processing applications, a cluster of computers may be required to process a data set in a reasonable amount of time, there are still situations where a single-process, multithreaded system is desirable. This is not to say that Python cannot execute truly multithreaded, parallel code; that code just cannot be executed in a single Python process. As an example, the Cython project features easy integration with OpenMP, a C framework for parallel computing, in order to parallelize loops and thus significantly speed up numerical algorithms.

Essential Python Libraries
For those who are less familiar with the scientific Python ecosystem and the libraries used throughout the book, I present the following overview of each library.
NumPy
NumPy, short for Numerical Python, is the foundational package for scientific computing in Python. The majority of this book will be based on NumPy and libraries built on top of NumPy. It provides, among other things:
• A fast and efficient multidimensional array object ndarray
• Functions for performing element-wise computations with arrays or mathematical operations between arrays
• Tools for reading and writing array-based data sets to disk
• Linear algebra operations, Fourier transform, and random number generation
• Tools for connecting C, C++, and Fortran code to Python
Beyond the fast array-processing capabilities that NumPy adds to Python, one of its primary purposes with regards to data analysis is as the primary container for data to be passed between algorithms. For numerical data, NumPy arrays are a much more efficient way of storing and manipulating data than the other built-in Python data structures. Also, libraries written in a lower-level language, such as C or Fortran, can operate on the data stored in a NumPy array without copying any data.
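As a quick, hedged illustration of those bullet points (an editor's sketch rather than an excerpt from the book's examples), the following lines show an ndarray, an element-wise operation, and a summary statistic; only NumPy itself is assumed to be installed:

import numpy as np

# A 2 x 3 ndarray of floating-point values
data = np.array([[1.5, -0.1, 3.0],
                 [0.0, -3.0, 6.5]])

print(data * 10)          # element-wise multiplication, no explicit Python loop
print(data.mean(axis=0))  # column means: array([ 0.75, -1.55,  4.75])

The same array could be written to disk with np.save or fed to any of the libraries described below without copying the underlying data.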
pandas
pandas provides rich data structures and functions designed to make working with structured data fast, easy, and expressive. It is, as you will see, one of the critical ingredients enabling Python to be a powerful and productive data analysis environment. The primary object in pandas that will be used in this book is the DataFrame, a two-dimensional tabular, column-oriented data structure with both row and column labels:

>>> frame
   total_bill   tip     sex smoker  day    time size
1       16.99  1.01  Female     No  Sun  Dinner    2
2       10.34  1.66    Male     No  Sun  Dinner    3
3       21.01   3.5    Male     No  Sun  Dinner    3
4       23.68  3.31    Male     No  Sun  Dinner    2
5       24.59  3.61  Female     No  Sun  Dinner    4
6       25.29  4.71    Male     No  Sun  Dinner    4
7        8.77     2    Male     No  Sun  Dinner    2
8       26.88  3.12    Male     No  Sun  Dinner    4
9       15.04  1.96    Male     No  Sun  Dinner    2
10      14.78  3.23    Male     No  Sun  Dinner    2

pandas combines the high performance array-computing features of NumPy with the flexible data manipulation capabilities of spreadsheets and relational databases (such as SQL). It provides sophisticated indexing functionality to make it easy to reshape, slice and dice, perform aggregations, and select subsets of data. pandas is the primary tool that we will use in this book.
For financial users, pandas features rich, high-performance time series functionality and tools well-suited for working with financial data. In fact, I initially designed pandas as an ideal tool for financial data analysis applications.
For users of the R language for statistical computing, the DataFrame name will be familiar, as the object was named after the similar R data.frame object. They are not the same, however; the functionality provided by data.frame in R is essentially a strict subset of that provided by the pandas DataFrame. While this is a book about Python, I will occasionally draw comparisons with R as it is one of the most widely-used open source data analysis environments and will be familiar to many readers.
The pandas name itself is derived from panel data, an econometrics term for multidimensional structured data sets, and Python data analysis itself.

matplotlib
matplotlib is the most popular Python library for producing plots and other 2D data visualizations. It was originally created by John D. Hunter (JDH) and is now maintained by a large team of developers. It is well-suited for creating plots suitable for publication. It integrates well with IPython (see below), thus providing a comfortable interactive environment for plotting and exploring data. The plots are also interactive; you can zoom in on a section of the plot and pan around the plot using the toolbar in the plot window.

IPython
IPython is the component in the standard scientific Python toolset that ties everything together. It provides a robust and productive environment for interactive and exploratory computing. It is an enhanced Python shell designed to accelerate the writing, testing, and debugging of Python code. It is particularly useful for interactively working with data and visualizing data with matplotlib. IPython is usually involved with the majority of my Python work, including running, debugging, and testing code.
Aside from the standard terminal-based IPython shell, the project also provides:
• A Mathematica-like HTML notebook for connecting to IPython through a web browser (more on this later)
• A Qt framework-based GUI console with inline plotting, multiline editing, and syntax highlighting
• An infrastructure for interactive parallel and distributed computing
I will devote a chapter to IPython and how to get the most out of its features. I strongly recommend using it while working through this book.
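To tie these libraries together before moving on, here is a minimal, hypothetical sketch (not taken from the book) of the interactive workflow they enable: build a small DataFrame by hand, compute a grouped summary, and plot it. It assumes pandas and matplotlib are installed and importable:

import pandas as pd
import matplotlib.pyplot as plt

# A tiny, made-up stand-in for the tips-style frame shown above
frame = pd.DataFrame({'day': ['Sun', 'Sun', 'Mon', 'Mon'],
                      'total_bill': [16.99, 10.34, 21.01, 23.68],
                      'tip': [1.01, 1.66, 3.50, 3.31]})

avg_tip = frame.groupby('day')['tip'].mean()  # average tip per day
avg_tip.plot(kind='bar')                      # pandas hands the drawing off to matplotlib
plt.show()

In an IPython session started in pylab mode, the plt.show() call is unnecessary; the figure appears interactively as soon as it is drawn.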
SciPy
SciPy is a collection of packages addressing a number of different standard problem domains in scientific computing. Here is a sampling of the packages included:
• scipy.integrate: numerical integration routines and differential equation solvers
• scipy.linalg: linear algebra routines and matrix decompositions extending beyond those provided in numpy.linalg
• scipy.optimize: function optimizers (minimizers) and root finding algorithms
• scipy.signal: signal processing tools
• scipy.sparse: sparse matrices and sparse linear system solvers
• scipy.special: wrapper around SPECFUN, a Fortran library implementing many common mathematical functions, such as the gamma function
• scipy.stats: standard continuous and discrete probability distributions (density functions, samplers, continuous distribution functions), various statistical tests, and more descriptive statistics
• scipy.weave: tool for using inline C++ code to accelerate array computations
Together NumPy and SciPy form a reasonably complete computational replacement for much of MATLAB along with some of its add-on toolboxes.
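As a small, hedged taste of two of those subpackages (an illustration of the kind of thing SciPy does, not an example from the book), the following computes a definite integral and a p-value; the input observations are made up for the purpose of the sketch:

from scipy.integrate import quad
from scipy import stats

# Numerically integrate x**2 from 0 to 1 (exact answer is 1/3)
value, abs_error = quad(lambda x: x ** 2, 0, 1)
print(value)

# One-sample t-test of some made-up observations against a population mean of 0
t_stat, p_value = stats.ttest_1samp([2.1, 2.5, 1.8, 2.2], popmean=0)
print(p_value)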
Installation and Setup
Since everyone uses Python for different applications, there is no single solution for setting up Python and required add-on packages. Many readers will not have a complete scientific Python environment suitable for following along with this book, so here I will give detailed instructions to get set up on each operating system. I recommend using one of the following base Python distributions:
• Enthought Python Distribution: a scientific-oriented Python distribution from Enthought (http://www.enthought.com). This includes EPDFree, a free base scientific distribution (with NumPy, SciPy, matplotlib, Chaco, and IPython) and EPD Full, a comprehensive suite of more than 100 scientific packages across many domains. EPD Full is free for academic use but has an annual subscription for non-academic users.
• Python(x,y) (http://pythonxy.googlecode.com): A free scientific-oriented Python distribution for Windows.
I will be using EPDFree for the installation guides, though you are welcome to take another approach depending on your needs. At the time of this writing, EPD includes Python 2.7, though this might change at some point in the future. After installing, you will have the following packages installed and importable:
• Scientific Python base: NumPy, SciPy, matplotlib, and IPython. These are all included in EPDFree.
• IPython Notebook dependencies: tornado and pyzmq. These are included in EPDFree.
• pandas (version 0.8.2 or higher).
At some point while reading you may wish to install one or more of the following packages: statsmodels, PyTables, PyQt (or equivalently, PySide), xlrd, lxml, basemap, pymongo, and requests. These are used in various examples. Installing these optional libraries is not necessary, and I would suggest waiting until you need them. For example, installing PyQt or PyTables from source on OS X or Linux can be rather arduous. For now, it’s most important to get up and running with the bare minimum: EPDFree and pandas.
For information on each Python package and links to binary installers or other help, see the Python Package Index (PyPI, http://pypi.python.org). This is also an excellent resource for finding new Python packages.
To avoid confusion and to keep things simple, I am avoiding discussion of more complex environment management tools like pip and virtualenv. There are many excellent guides available for these tools on the Internet.
Some users may be interested in alternate Python implementations, such as IronPython, Jython, or PyPy. To make use of the tools presented in this book, it is (currently) necessary to use the standard C-based Python interpreter, known as CPython.

Windows
To get started on Windows, download the EPDFree installer from http://www.enthought.com, which should be an MSI installer named like epd_free-7.3-1-win-x86.msi. Run the installer and accept the default installation location C:\Python27. If you had previously installed Python in this location, you may want to delete it manually first (or using Add/Remove Programs).
Next, you need to verify that Python has been successfully added to the system path and that there are no conflicts with any prior-installed Python versions. First, open a command prompt by going to the Start Menu and starting the Command Prompt application, also known as cmd.exe. Try starting the Python interpreter by typing python. You should see a message that matches the version of EPDFree you installed:

C:\Users\Wes>python
Python 2.7.3 |EPD_free 7.3-1 (32-bit)| (default, Apr 12 2012, 14:30:37) on win32
Type "credits", "demo" or "enthought" for more information.
>>>

If you see a message for a different version of EPD or it doesn’t work at all, you will need to clean up your Windows environment variables. On Windows 7 you can start typing “environment variables” in the programs search field and select Edit environment variables for your account. On Windows XP, you will have to go to Control Panel > System > Advanced > Environment Variables. On the window that pops up, you are looking for the Path variable. It needs to contain the following two directory paths, separated by semicolons:

C:\Python27;C:\Python27\Scripts

If you installed other versions of Python, be sure to delete any other Python-related directories from both the system and user Path variables. After making a path alteration, you have to restart the command prompt for the changes to take effect.
Once you can launch Python successfully from the command prompt, you need to install pandas. The easiest way is to download the appropriate binary installer from http://pypi.python.org/pypi/pandas. For EPDFree, this should be pandas-0.9.0.win32-py2.7.exe. After you run this, let’s launch IPython and check that things are installed correctly by importing pandas and making a simple matplotlib plot:

C:\Users\Wes>ipython --pylab
Python 2.7.3 |EPD_free 7.3-1 (32-bit)|
Type "copyright", "credits" or "license" for more information.
IPython 0.12.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
Welcome to pylab, a matplotlib-based Python environment [backend: WXAgg].
For more information, type 'help(pylab)'.

In [1]: import pandas

In [2]: plot(arange(10))

If successful, there should be no error messages and a plot window will appear. You can also check that the IPython HTML notebook can be successfully run by typing:

$ ipython notebook --pylab=inline

If you use the IPython notebook application on Windows and normally use Internet Explorer, you will likely need to install and run Mozilla Firefox or Google Chrome instead.
EPDFree on Windows contains only 32-bit executables. If you want or need a 64-bit setup on Windows, using EPD Full is the most painless way to accomplish that. If you would rather install from scratch and not pay for an EPD subscription, Christoph Gohlke at the University of California, Irvine, publishes unofficial binary installers for all of the book’s necessary packages (http://www.lfd.uci.edu/~gohlke/pythonlibs/) for 32- and 64-bit Windows.

Apple OS X
To get started on OS X, you must first install Xcode, which includes Apple’s suite of software development tools. The necessary component for our purposes is the gcc C and C++ compiler suite. The Xcode installer can be found on the OS X install DVD that came with your computer or downloaded from Apple directly.
Once you’ve installed Xcode, launch the terminal (Terminal.app) by navigating to Applications > Utilities. Type gcc and press enter. You should hopefully see something like:

$ gcc
i686-apple-darwin10-gcc-4.2.1: no input files

Now you need to install EPDFree. Download the installer, which should be a disk image named something like epd_free-7.3-1-macosx-i386.dmg. Double-click the .dmg file to mount it, then double-click the .mpkg file inside to run the installer.
When the installer runs, it automatically appends the EPDFree executable path to your .bash_profile file. This is located at /Users/your_uname/.bash_profile:

# Setting PATH for EPD_free-7.3-1
PATH="/Library/Frameworks/Python.framework/Versions/Current/bin:${PATH}"
export PATH

Should you encounter any problems in the following steps, you’ll want to inspect your .bash_profile and potentially add the above directory to your path.
Now, it’s time to install pandas. Execute this command in the terminal:

$ sudo easy_install pandas
Searching for pandas
Reading http://pypi.python.org/simple/pandas/
Reading http://pandas.pydata.org
Reading http://pandas.sourceforge.net
Best match: pandas 0.9.0
Downloading http://pypi.python.org/packages/source/p/pandas/pandas-0.9.0.zip
Processing pandas-0.9.0.zip
Writing /tmp/easy_install-H5mIX6/pandas-0.9.0/setup.cfg
Running pandas-0.9.0/setup.py -q bdist_egg --dist-dir /tmp/easy_install-H5mIX6/pandas-0.9.0/egg-dist-tmp-RhLG0z
Adding pandas 0.9.0 to easy-install.pth file
Installed /Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/pandas-0.9.0-py2.7-macosx-10.5-i386.egg
Processing dependencies for pandas
Finished processing dependencies for pandas

To verify everything is working, launch IPython in Pylab mode and test importing pandas then making a plot interactively:

$ ipython --pylab
Python 2.7.3 |EPD_free 7.3-1 (32-bit)| (default, Apr 12 2012, 11:28:34)
Type "copyright", "credits" or "license" for more information.
IPython 0.12.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
Welcome to pylab, a matplotlib-based Python environment [backend: WXAgg].
For more information, type 'help(pylab)'.

In [1]: import pandas

In [2]: plot(arange(10))

If this succeeds, a plot window with a straight line should pop up.

GNU/Linux
Some, but not all, Linux distributions include sufficiently up-to-date versions of all the required Python packages and can be installed using the built-in package management tool like apt. I detail setup using EPDFree as it’s easily reproducible across distributions.
Linux details will vary a bit depending on your Linux flavor, but here I give details for Debian-based GNU/Linux systems like Ubuntu and Mint. Setup is similar to OS X with the exception of how EPDFree is installed. The installer is a shell script that must be executed in the terminal. Depending on whether you have a 32-bit or 64-bit system, you will either need to install the x86 (32-bit) or x86_64 (64-bit) installer. You will then have a file named something similar to epd_free-7.3-1-rh5-x86_64.sh. To install it, execute this script with bash:

$ bash epd_free-7.3-1-rh5-x86_64.sh

After accepting the license, you will be presented with a choice of where to put the EPDFree files. I recommend installing the files in your home directory, say /home/wesm/epd (substituting your own username for wesm).
Once the installer has finished, you need to add EPDFree’s bin directory to your $PATH variable. If you are using the bash shell (the default in Ubuntu, for example), this means adding the following path addition in your .bashrc:

export PATH=/home/wesm/epd/bin:$PATH

Obviously, substitute the installation directory you used for /home/wesm/epd/. After doing this you can either start a new terminal process or execute your .bashrc again with source ~/.bashrc.
You need a C compiler such as gcc to move forward; many Linux distributions include gcc, but others may not. On Debian systems, you can install gcc by executing:

sudo apt-get install gcc

If you type gcc on the command line it should say something like:

$ gcc
gcc: no input files

Now, time to install pandas:

$ easy_install pandas

If you installed EPDFree as root, you may need to add sudo to the command and enter the sudo or root password. To verify things are working, perform the same checks as in the OS X section.

Python 2 and Python 3
The Python community is currently undergoing a drawn-out transition from the Python 2 series of interpreters to the Python 3 series. Until the appearance of Python 3.0, all Python code was backwards compatible. The community decided that in order to move the language forward, certain backwards incompatible changes were necessary.
I am writing this book with Python 2.7 as its basis, as the majority of the scientific Python community has not yet transitioned to Python 3. The good news is that, with a few exceptions, you should have no trouble following along with the book if you happen to be using Python 3.2.
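If you do follow along under Python 3, a small, hedged illustration (an editorial aside, not from the book) of the kind of difference you will run into is integer division and the print statement; the __future__ imports below make Python 2.7 code behave like Python 3 in both respects:

from __future__ import division, print_function

# Under plain Python 2.7, 3 / 2 is 1; with the import above (and on Python 3) it is 1.5
print(3 / 2)

# print is now a function, so this line runs identically on 2.7 and 3.x
print("ratio:", 3 / 2)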
Integrated Development Environments (IDEs)
When asked about my standard development environment, I almost always say “IPython plus a text editor”. I typically write a program and iteratively test and debug each piece of it in IPython. It is also useful to be able to play around with data interactively and visually verify that a particular set of data manipulations are doing the right thing. Libraries like pandas and NumPy are designed to be easy-to-use in the shell.
However, some will still prefer to work in an IDE instead of a text editor. IDEs do provide many nice “code intelligence” features, like completion or quickly pulling up the documentation associated with functions and classes. Here are some that you can explore:
• Eclipse with PyDev Plugin
• Python Tools for Visual Studio (for Windows users)
• PyCharm
• Spyder
• Komodo IDE

Community and Conferences
Outside of an Internet search, the scientific Python mailing lists are generally helpful and responsive to questions. Some to take a look at are:
• pydata: a Google Group list for questions related to Python for data analysis and pandas
• pystatsmodels: for statsmodels or pandas-related questions
• numpy-discussion: for NumPy-related questions
• scipy-user: for general SciPy or scientific Python questions
I deliberately did not post URLs for these in case they change. They can be easily located via Internet search.
Each year many conferences are held all over the world for Python programmers. PyCon and EuroPython are the two main general Python conferences in the United States and Europe, respectively. SciPy and EuroSciPy are scientific-oriented Python conferences where you will likely find many “birds of a feather” if you become more involved with using Python for data analysis after reading this book.

Navigating This Book
If you have never programmed in Python before, you may actually want to start at the end of the book, where I have placed a condensed tutorial on Python syntax, language features, and built-in data structures like tuples, lists, and dicts. These things are considered prerequisite knowledge for the remainder of the book.
The book starts by introducing you to the IPython environment. Next, I give a short introduction to the key features of NumPy, leaving more advanced NumPy use for another chapter at the end of the book. Then, I introduce pandas and devote the rest of the book to data analysis topics applying pandas, NumPy, and matplotlib (for visualization). I have structured the material in the most incremental way possible, though there is occasionally some minor cross-over between chapters.
Data files and related material for each chapter are hosted as a git repository on GitHub: http://github.com/pydata/pydata-book
I encourage you to download the data and use it to replicate the book’s code examples and experiment with the tools presented in each chapter. I will happily accept contributions, scripts, IPython notebooks, or any other materials you wish to contribute to the book’s repository for all to enjoy.
Code Examples

Most of the code examples in the book are shown with input and output as it would
appear executed in the IPython shell.

    In [5]: code
    Out[5]: output

At times, for clarity, multiple code examples will be shown side by side. These should
be read left to right and executed separately.

    In [5]: code          In [6]: code2
    Out[5]: output        Out[6]: output2

Data for Examples

Data sets for the examples in each chapter are hosted in a repository on GitHub:
http://github.com/pydata/pydata-book. You can download this data either by using the
git revision control command-line program or by downloading a zip file of the
repository from the website. I have made every effort to ensure that it contains
everything necessary to reproduce the examples, but I may have made some mistakes or
omissions. If so, please send me an e-mail: wesmckinn@gmail.com.

Import Conventions

The Python community has adopted a number of naming conventions for commonly-used
modules:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

This means that when you see np.arange, this is a reference to the arange function in
NumPy. This is done as it's considered bad practice in Python software development to
import everything (from numpy import *) from a large package like NumPy.

Jargon

I'll use some terms common both to programming and data science that you may not be
familiar with. Thus, here are some brief definitions:

Munge/Munging/Wrangling
    Describes the overall process of manipulating unstructured and/or messy data into
    a structured or clean form. The word has snuck its way into the jargon of many
    modern-day data hackers. Munge rhymes with "lunge".
Pseudocode
    A description of an algorithm or process that takes a code-like form while likely
    not being actual valid source code.

Syntactic sugar
    Programming syntax which does not add new features, but makes something more
    convenient or easier to type.

Acknowledgements

It would have been difficult for me to write this book without the support of a large
number of people.

On the O'Reilly staff, I'm very grateful for my editors Meghan Blanchette and Julie
Steele, who guided me through the process. Mike Loukides also worked with me in the
proposal stages and helped make the book a reality.

I received a wealth of technical review from a large cast of characters. In
particular, Martin Blais and Hugh White were incredibly helpful in improving the
book's examples, clarity, and organization from cover to cover. James Long, Drew
Conway, Fernando Pérez, Brian Granger, Thomas Kluyver, Adam Klein, Josh Klein, Chang
She, and Stéfan van der Walt each reviewed one or more chapters, providing pointed
feedback from many different perspectives.

I got many great ideas for examples and data sets from friends and colleagues in the
data community, among them: Mike Dewar, Jeff Hammerbacher, James Johndrow, Kristian
Lum, Adam Klein, Hilary Mason, Chang She, and Ashley Williams.

I am of course indebted to the many leaders in the open source scientific Python
community who've built the foundation for my development work and gave encouragement
while I was writing this book: the IPython core team (Fernando Pérez, Brian Granger,
Min Ragan-Kelly, Thomas Kluyver, and others), John Hunter, Skipper Seabold, Travis
Oliphant, Peter Wang, Eric Jones, Robert Kern, Josef Perktold, Francesc Alted, Chris
Fonnesbeck, and too many others to mention. Several other people provided a great deal
of support, ideas, and encouragement along the way: Drew Conway, Sean Taylor, Giuseppe
Paleologo, Jared Lander, David Epstein, John Krowas, Joshua Bloom, Den Pilsworth, John
Myles-White, and many others I've forgotten.

I'd also like to thank a number of people from my formative years. First, my former
AQR colleagues who've cheered me on in my pandas work over the years: Alex Reyfman,
Michael Wong, Tim Sargen, Oktay Kurbanov, Matthew Tschantz, Roni Israelov, Michael
Katz, Chris Uga, Prasad Ramanan, Ted Square, and Hoon Kim. Lastly, my academic
advisors Haynes Miller (MIT) and Mike West (Duke).

On the personal side, Casey Dinkin provided invaluable day-to-day support during the
writing process, tolerating my highs and lows as I hacked together the final draft on
top of an already overcommitted schedule. Lastly, my parents, Bill and Kim, taught me
to always follow my dreams and to never settle for less.
CHAPTER 2
Introductory Examples

This book teaches you the Python tools to work productively with data. While readers
may have many different end goals for their work, the tasks required generally fall
into a number of different broad groups:

Interacting with the outside world
    Reading and writing with a variety of file formats and databases.

Preparation
    Cleaning, munging, combining, normalizing, reshaping, slicing and dicing, and
    transforming data for analysis.

Transformation
    Applying mathematical and statistical operations to groups of data sets to derive
    new data sets. For example, aggregating a large table by group variables.

Modeling and computation
    Connecting your data to statistical models, machine learning algorithms, or other
    computational tools.

Presentation
    Creating interactive or static graphical visualizations or textual summaries.

In this chapter I will show you a few data sets and some things we can do with them.
These examples are just intended to pique your interest and thus will only be
explained at a high level. Don't worry if you have no experience with any of these
tools; they will be discussed in great detail throughout the rest of the book. In the
code examples you'll see input and output prompts like In [15]:; these are from the
IPython shell.

1.usa.gov data from bit.ly

In 2011, URL shortening service bit.ly partnered with the United States government
website usa.gov to provide a feed of anonymous data gathered from users who shorten
links ending with .gov or .mil. As of this writing, in addition to providing a live
feed, hourly snapshots are available as downloadable text files.[1]

[1] http://www.usa.gov/About/developer-resources/1usagov.shtml
In the case of the hourly snapshots, each line in each file contains a common form of
web data known as JSON, which stands for JavaScript Object Notation. For example, if
you read just the first line of a file, you may see something like:

    In [15]: path = 'ch02/usagov_bitly_data2012-03-16-1331923249.txt'

    In [16]: open(path).readline()
    Out[16]: '{ "a": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML,
    like Gecko) Chrome/17.0.963.78 Safari/535.11", "c": "US", "nk": 1, "tz":
    "America/New_York", "gr": "MA", "g": "A6qOVH", "h": "wfLQtf", "l": "orofrog",
    "al": "en-US,en;q=0.8", "hh": "1.usa.gov", "r":
    "http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/wfLQtf", "u":
    "http://www.ncbi.nlm.nih.gov/pubmed/22415991", "t": 1331923247,
    "hc": 1331822918, "cy": "Danvers", "ll": [ 42.576698, -70.954903 ] }\n'

Python has numerous built-in and 3rd party modules for converting a JSON string into a
Python dictionary object. Here I'll use the json module and its loads function invoked
on each line in the sample file I downloaded:

    import json
    path = 'ch02/usagov_bitly_data2012-03-16-1331923249.txt'
    records = [json.loads(line) for line in open(path)]

If you've never programmed in Python before, the last expression here is called a list
comprehension, which is a concise way of applying an operation (like json.loads) to a
collection of strings or other objects. Conveniently, iterating over an open file
handle gives you a sequence of its lines. The resulting object records is now a list
of Python dicts:

    In [18]: records[0]
    Out[18]:
    {u'a': u'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like
    Gecko) Chrome/17.0.963.78 Safari/535.11',
     u'al': u'en-US,en;q=0.8',
     u'c': u'US',
     u'cy': u'Danvers',
     u'g': u'A6qOVH',
     u'gr': u'MA',
     u'h': u'wfLQtf',
     u'hc': 1331822918,
     u'hh': u'1.usa.gov',
     u'l': u'orofrog',
     u'll': [42.576698, -70.954903],
     u'nk': 1,
     u'r': u'http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/wfLQtf',
     u't': 1331923247,
     u'tz': u'America/New_York',
     u'u': u'http://www.ncbi.nlm.nih.gov/pubmed/22415991'}
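Real-world log feeds sometimes contain blank or malformed lines, and a single bad line
would make the list comprehension above raise an exception partway through the file.
The loader below is a slightly more defensive sketch of my own, not code from the
book; it simply skips anything it cannot parse.

    import json

    def load_records(path):
        # Parse one JSON record per line, skipping blank or unparseable lines
        # instead of letting one bad line abort the whole load.
        records = []
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                try:
                    records.append(json.loads(line))
                except ValueError:
                    pass  # not valid JSON; ignore this line
        return records

    records = load_records('ch02/usagov_bitly_data2012-03-16-1331923249.txt')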
  • 35. Note that Python indices start at 0 and not 1 like some other languages (like R). It’s now easy to access individual values within records by passing a string for the key you wish to access: In [19]: records[0]['tz'] Out[19]: u'America/New_York' The u here in front of the quotation stands for unicode, a standard form of string en- coding. Note that IPython shows the time zone string object representation here rather than its print equivalent: In [20]: print records[0]['tz'] America/New_York Counting Time Zones in Pure Python Suppose we were interested in the most often-occurring time zones in the data set (the tz field). There are many ways we could do this. First, let’s extract a list of time zones again using a list comprehension: In [25]: time_zones = [rec['tz'] for rec in records] --------------------------------------------------------------------------- KeyError Traceback (most recent call last) /home/wesm/book_scripts/whetting/<ipython> in <module>() ----> 1 time_zones = [rec['tz'] for rec in records] KeyError: 'tz' Oops! Turns out that not all of the records have a time zone field. This is easy to handle as we can add the check if 'tz' in rec at the end of the list comprehension: In [26]: time_zones = [rec['tz'] for rec in records if 'tz' in rec] In [27]: time_zones[:10] Out[27]: [u'America/New_York', u'America/Denver', u'America/New_York', u'America/Sao_Paulo', u'America/New_York', u'America/New_York', u'Europe/Warsaw', u'', u'', u''] Just looking at the first 10 time zones we see that some of them are unknown (empty). You can filter these out also but I’ll leave them in for now. Now, to produce counts by time zone I’ll show two approaches: the harder way (using just the Python standard library) and the easier way (using pandas). One way to do the counting is to use a dict to store counts while we iterate through the time zones: def get_counts(sequence): counts = {} 1.usa.gov data from bit.ly | 19
  • 36. for x in sequence: if x in counts: counts[x] += 1 else: counts[x] = 1 return counts If you know a bit more about the Python standard library, you might prefer to write the same thing more briefly: from collections import defaultdict def get_counts2(sequence): counts = defaultdict(int) # values will initialize to 0 for x in sequence: counts[x] += 1 return counts I put this logic in a function just to make it more reusable. To use it on the time zones, just pass the time_zones list: In [31]: counts = get_counts(time_zones) In [32]: counts['America/New_York'] Out[32]: 1251 In [33]: len(time_zones) Out[33]: 3440 If we wanted the top 10 time zones and their counts, we have to do a little bit of dic- tionary acrobatics: def top_counts(count_dict, n=10): value_key_pairs = [(count, tz) for tz, count in count_dict.items()] value_key_pairs.sort() return value_key_pairs[-n:] We have then: In [35]: top_counts(counts) Out[35]: [(33, u'America/Sao_Paulo'), (35, u'Europe/Madrid'), (36, u'Pacific/Honolulu'), (37, u'Asia/Tokyo'), (74, u'Europe/London'), (191, u'America/Denver'), (382, u'America/Los_Angeles'), (400, u'America/Chicago'), (521, u''), (1251, u'America/New_York')] 20 | Chapter 2: Introductory Examples
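If you keep digging in the standard library, the heapq module offers yet another route
to the same answer: nlargest picks out the n biggest items without fully sorting
everything. This is an optional aside rather than the book's approach, and note that
it returns results from largest to smallest, the reverse of top_counts.

    import heapq

    def top_counts_heap(count_dict, n=10):
        # Build (count, time zone) pairs and let heapq pick the n largest.
        pairs = ((count, tz) for tz, count in count_dict.items())
        return heapq.nlargest(n, pairs)

    top_counts_heap(counts)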
  • 37. If you search the Python standard library, you may find the collections.Counter class, which makes this task a lot easier: In [49]: from collections import Counter In [50]: counts = Counter(time_zones) In [51]: counts.most_common(10) Out[51]: [(u'America/New_York', 1251), (u'', 521), (u'America/Chicago', 400), (u'America/Los_Angeles', 382), (u'America/Denver', 191), (u'Europe/London', 74), (u'Asia/Tokyo', 37), (u'Pacific/Honolulu', 36), (u'Europe/Madrid', 35), (u'America/Sao_Paulo', 33)] Counting Time Zones with pandas The main pandas data structure is the DataFrame, which you can think of as repre- senting a table or spreadsheet of data. Creating a DataFrame from the original set of records is simple: In [289]: from pandas import DataFrame, Series In [290]: import pandas as pd In [291]: frame = DataFrame(records) In [292]: frame Out[292]: <class 'pandas.core.frame.DataFrame'> Int64Index: 3560 entries, 0 to 3559 Data columns: _heartbeat_ 120 non-null values a 3440 non-null values al 3094 non-null values c 2919 non-null values cy 2919 non-null values g 3440 non-null values gr 2919 non-null values h 3440 non-null values hc 3440 non-null values hh 3440 non-null values kw 93 non-null values l 3440 non-null values ll 2919 non-null values nk 3440 non-null values r 3440 non-null values t 3440 non-null values tz 3440 non-null values 1.usa.gov data from bit.ly | 21
  • 38. u 3440 non-null values dtypes: float64(4), object(14) In [293]: frame['tz'][:10] Out[293]: 0 America/New_York 1 America/Denver 2 America/New_York 3 America/Sao_Paulo 4 America/New_York 5 America/New_York 6 Europe/Warsaw 7 8 9 Name: tz The output shown for the frame is the summary view, shown for large DataFrame ob- jects. The Series object returned by frame['tz'] has a method value_counts that gives us what we’re looking for: In [294]: tz_counts = frame['tz'].value_counts() In [295]: tz_counts[:10] Out[295]: America/New_York 1251 521 America/Chicago 400 America/Los_Angeles 382 America/Denver 191 Europe/London 74 Asia/Tokyo 37 Pacific/Honolulu 36 Europe/Madrid 35 America/Sao_Paulo 33 Then, we might want to make a plot of this data using plotting library, matplotlib. You can do a bit of munging to fill in a substitute value for unknown and missing time zone data in the records. The fillna function can replace missing (NA) values and unknown (empty strings) values can be replaced by boolean array indexing: In [296]: clean_tz = frame['tz'].fillna('Missing') In [297]: clean_tz[clean_tz == ''] = 'Unknown' In [298]: tz_counts = clean_tz.value_counts() In [299]: tz_counts[:10] Out[299]: America/New_York 1251 Unknown 521 America/Chicago 400 America/Los_Angeles 382 America/Denver 191 Missing 120 22 | Chapter 2: Introductory Examples
  • 39. Europe/London 74 Asia/Tokyo 37 Pacific/Honolulu 36 Europe/Madrid 35 Making a horizontal bar plot can be accomplished using the plot method on the counts objects: In [301]: tz_counts[:10].plot(kind='barh', rot=0) See Figure 2-1 for the resulting figure. We’ll explore more tools for working with this kind of data. For example, the a field contains information about the browser, device, or application used to perform the URL shortening: In [302]: frame['a'][1] Out[302]: u'GoogleMaps/RochesterNY' In [303]: frame['a'][50] Out[303]: u'Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2' In [304]: frame['a'][51] Out[304]: u'Mozilla/5.0 (Linux; U; Android 2.2.2; en-us; LG-P925/V10e Build/FRG83G) AppleWebKit/533.1 (K Figure 2-1. Top time zones in the 1.usa.gov sample data Parsing all of the interesting information in these “agent” strings may seem like a daunting task. Luckily, once you have mastered Python’s built-in string functions and regular expression capabilities, it is really not so bad. For example, we could split off the first token in the string (corresponding roughly to the browser capability) and make another summary of the user behavior: In [305]: results = Series([x.split()[0] for x in frame.a.dropna()]) In [306]: results[:5] Out[306]: 0 Mozilla/5.0 1 GoogleMaps/RochesterNY 2 Mozilla/4.0 3 Mozilla/5.0 4 Mozilla/5.0 1.usa.gov data from bit.ly | 23
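Splitting on whitespace only keeps the leading token. If you wanted the product name
and version as separate fields, a small regular expression is one possibility; the
pattern and column names below are my own illustration, not something defined in the
book.

    import re
    from pandas import DataFrame

    # Capture a leading "name/version" pair, e.g. "Mozilla/5.0" or "Opera/9.80".
    agent_re = re.compile(r'^(?P<product>[^/\s]+)/(?P<version>\S+)')

    def parse_agent(agent):
        m = agent_re.match(agent)
        if m is None:
            # Agents like "TEST_INTERNET_AGENT" have no version component.
            return {'product': agent, 'version': None}
        return m.groupdict()

    agents = DataFrame([parse_agent(a) for a in frame.a.dropna()])
    agents['product'].value_counts()[:8]

Returning to the simpler first-token summary in results, value_counts again gives a
quick tally: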
  • 40. In [307]: results.value_counts()[:8] Out[307]: Mozilla/5.0 2594 Mozilla/4.0 601 GoogleMaps/RochesterNY 121 Opera/9.80 34 TEST_INTERNET_AGENT 24 GoogleProducer 21 Mozilla/6.0 5 BlackBerry8520/5.0.0.681 4 Now, suppose you wanted to decompose the top time zones into Windows and non- Windows users. As a simplification, let’s say that a user is on Windows if the string 'Windows' is in the agent string. Since some of the agents are missing, I’ll exclude these from the data: In [308]: cframe = frame[frame.a.notnull()] We want to then compute a value whether each row is Windows or not: In [309]: operating_system = np.where(cframe['a'].str.contains('Windows'), .....: 'Windows', 'Not Windows') In [310]: operating_system[:5] Out[310]: 0 Windows 1 Not Windows 2 Windows 3 Not Windows 4 Windows Name: a Then, you can group the data by its time zone column and this new list of operating systems: In [311]: by_tz_os = cframe.groupby(['tz', operating_system]) The group counts, analogous to the value_counts function above, can be computed using size. This result is then reshaped into a table with unstack: In [312]: agg_counts = by_tz_os.size().unstack().fillna(0) In [313]: agg_counts[:10] Out[313]: a Not Windows Windows tz 245 276 Africa/Cairo 0 3 Africa/Casablanca 0 1 Africa/Ceuta 0 2 Africa/Johannesburg 0 1 Africa/Lusaka 0 1 America/Anchorage 4 1 America/Argentina/Buenos_Aires 1 0 24 | Chapter 2: Introductory Examples
  • 41. America/Argentina/Cordoba 0 1 America/Argentina/Mendoza 0 1 Finally, let’s select the top overall time zones. To do so, I construct an indirect index array from the row counts in agg_counts: # Use to sort in ascending order In [314]: indexer = agg_counts.sum(1).argsort() In [315]: indexer[:10] Out[315]: tz 24 Africa/Cairo 20 Africa/Casablanca 21 Africa/Ceuta 92 Africa/Johannesburg 87 Africa/Lusaka 53 America/Anchorage 54 America/Argentina/Buenos_Aires 57 America/Argentina/Cordoba 26 America/Argentina/Mendoza 55 I then use take to select the rows in that order, then slice off the last 10 rows: In [316]: count_subset = agg_counts.take(indexer)[-10:] In [317]: count_subset Out[317]: a Not Windows Windows tz America/Sao_Paulo 13 20 Europe/Madrid 16 19 Pacific/Honolulu 0 36 Asia/Tokyo 2 35 Europe/London 43 31 America/Denver 132 59 America/Los_Angeles 130 252 America/Chicago 115 285 245 276 America/New_York 339 912 Then, as shown in the preceding code block, this can be plotted in a bar plot; I’ll make it a stacked bar plot by passing stacked=True (see Figure 2-2) : In [319]: count_subset.plot(kind='barh', stacked=True) The plot doesn’t make it easy to see the relative percentage of Windows users in the smaller groups, but the rows can easily be normalized to sum to 1 then plotted again (see Figure 2-3): In [321]: normed_subset = count_subset.div(count_subset.sum(1), axis=0) In [322]: normed_subset.plot(kind='barh', stacked=True) 1.usa.gov data from bit.ly | 25
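The argsort and take combination above is one idiomatic way to pull out the ten
largest time zones. If you find it hard to read, an equivalent route is to sort the
row totals directly and index agg_counts with the result. The sketch below assumes a
reasonably recent pandas (sort_values and .loc); the pandas version used in this book
spelled these order and .ix.

    # Rank time zones by their total count and keep the ten busiest.
    totals = agg_counts.sum(1)
    top10 = totals.sort_values(ascending=False)[:10]
    count_subset = agg_counts.loc[top10.index]

    # Normalize each row to sum to 1, as above, before making the stacked bar plot.
    normed_subset = count_subset.div(count_subset.sum(1), axis=0)
    normed_subset.plot(kind='barh', stacked=True)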
  • 42. All of the methods employed here will be examined in great detail throughout the rest of the book. MovieLens 1M Data Set GroupLens Research (http://www.grouplens.org/node/73) provides a number of collec- tions of movie ratings data collected from users of MovieLens in the late 1990s and Figure 2-2. Top time zones by Windows and non-Windows users Figure 2-3. Percentage Windows and non-Windows users in top-occurring time zones 26 | Chapter 2: Introductory Examples
  • 43. early 2000s. The data provide movie ratings, movie metadata (genres and year), and demographic data about the users (age, zip code, gender, and occupation). Such data is often of interest in the development of recommendation systems based on machine learning algorithms. While I will not be exploring machine learning techniques in great detail in this book, I will show you how to slice and dice data sets like these into the exact form you need. The MovieLens 1M data set contains 1 million ratings collected from 6000 users on 4000 movies. It’s spread across 3 tables: ratings, user information, and movie infor- mation.Afterextractingthedatafromthezipfile,eachtablecanbeloadedintoapandas DataFrame object using pandas.read_table: import pandas as pd unames = ['user_id', 'gender', 'age', 'occupation', 'zip'] users = pd.read_table('ml-1m/users.dat', sep='::', header=None, names=unames) rnames = ['user_id', 'movie_id', 'rating', 'timestamp'] ratings = pd.read_table('ml-1m/ratings.dat', sep='::', header=None, names=rnames) mnames = ['movie_id', 'title', 'genres'] movies = pd.read_table('ml-1m/movies.dat', sep='::', header=None, names=mnames) You can verify that everything succeeded by looking at the first few rows of each Da- taFrame with Python's slice syntax: In [334]: users[:5] Out[334]: user_id gender age occupation zip 0 1 F 1 10 48067 1 2 M 56 16 70072 2 3 M 25 15 55117 3 4 M 45 7 02460 4 5 M 25 20 55455 In [335]: ratings[:5] Out[335]: user_id movie_id rating timestamp 0 1 1193 5 978300760 1 1 661 3 978302109 2 1 914 3 978301968 3 1 3408 4 978300275 4 1 2355 5 978824291 In [336]: movies[:5] Out[336]: movie_id title genres 0 1 Toy Story (1995) Animation|Children's|Comedy 1 2 Jumanji (1995) Adventure|Children's|Fantasy 2 3 Grumpier Old Men (1995) Comedy|Romance 3 4 Waiting to Exhale (1995) Comedy|Drama MovieLens 1M Data Set | 27
  • 44. 4 5 Father of the Bride Part II (1995) Comedy In [337]: ratings Out[337]: <class 'pandas.core.frame.DataFrame'> Int64Index: 1000209 entries, 0 to 1000208 Data columns: user_id 1000209 non-null values movie_id 1000209 non-null values rating 1000209 non-null values timestamp 1000209 non-null values dtypes: int64(4) Note that ages and occupations are coded as integers indicating groups described in the data set’s README file. Analyzing the data spread across three tables is not a simple task; for example, suppose you wanted to compute mean ratings for a particular movie by sex and age. As you will see, this is much easier to do with all of the data merged together into a single table. Using pandas’s merge function, we first merge ratings with users then merging that result with the movies data. pandas infers which columns to use as the merge (or join) keys based on overlapping names: In [338]: data = pd.merge(pd.merge(ratings, users), movies) In [339]: data Out[339]: <class 'pandas.core.frame.DataFrame'> Int64Index: 1000209 entries, 0 to 1000208 Data columns: user_id 1000209 non-null values movie_id 1000209 non-null values rating 1000209 non-null values timestamp 1000209 non-null values gender 1000209 non-null values age 1000209 non-null values occupation 1000209 non-null values zip 1000209 non-null values title 1000209 non-null values genres 1000209 non-null values dtypes: int64(6), object(4) In [340]: data.ix[0] Out[340]: user_id 1 movie_id 1 rating 5 timestamp 978824268 gender F age 1 occupation 10 zip 48067 title Toy Story (1995) genres Animation|Children's|Comedy Name: 0 28 | Chapter 2: Introductory Examples
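Chained merges like this can silently drop rows when a join key is missing from one of
the lookup tables, so it is worth a quick sanity check afterwards. This is a habit of
mine rather than a step from the book; the merged row count should match the ratings
table if every rating's user and movie appear in users and movies.

    # Every rating should survive the merge and carry user and movie attributes.
    print len(ratings), len(data)        # expect these two numbers to be equal
    print data['title'].isnull().sum()   # expect 0 missing titles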
  • 45. In this form, aggregating the ratings grouped by one or more user or movie attributes is straightforward once you build some familiarity with pandas. To get mean movie ratings for each film grouped by gender, we can use the pivot_table method: In [341]: mean_ratings = data.pivot_table('rating', rows='title', .....: cols='gender', aggfunc='mean') In [342]: mean_ratings[:5] Out[342]: gender F M title $1,000,000 Duck (1971) 3.375000 2.761905 'Night Mother (1986) 3.388889 3.352941 'Til There Was You (1997) 2.675676 2.733333 'burbs, The (1989) 2.793478 2.962085 ...And Justice for All (1979) 3.828571 3.689024 This produced another DataFrame containing mean ratings with movie totals as row labels and gender as column labels. First, I’m going to filter down to movies that re- ceived at least 250 ratings (a completely arbitrary number); to do this, I group the data by title and use size() to get a Series of group sizes for each title: In [343]: ratings_by_title = data.groupby('title').size() In [344]: ratings_by_title[:10] Out[344]: title $1,000,000 Duck (1971) 37 'Night Mother (1986) 70 'Til There Was You (1997) 52 'burbs, The (1989) 303 ...And Justice for All (1979) 199 1-900 (1994) 2 10 Things I Hate About You (1999) 700 101 Dalmatians (1961) 565 101 Dalmatians (1996) 364 12 Angry Men (1957) 616 In [345]: active_titles = ratings_by_title.index[ratings_by_title >= 250] In [346]: active_titles Out[346]: Index(['burbs, The (1989), 10 Things I Hate About You (1999), 101 Dalmatians (1961), ..., Young Sherlock Holmes (1985), Zero Effect (1998), eXistenZ (1999)], dtype=object) The index of titles receiving at least 250 ratings can then be used to select rows from mean_ratings above: In [347]: mean_ratings = mean_ratings.ix[active_titles] In [348]: mean_ratings Out[348]: <class 'pandas.core.frame.DataFrame'> Index: 1216 entries, 'burbs, The (1989) to eXistenZ (1999) MovieLens 1M Data Set | 29
  • 46. Data columns: F 1216 non-null values M 1216 non-null values dtypes: float64(2) To see the top films among female viewers, we can sort by the F column in descending order: In [350]: top_female_ratings = mean_ratings.sort_index(by='F', ascending=False) In [351]: top_female_ratings[:10] Out[351]: gender F M Close Shave, A (1995) 4.644444 4.473795 Wrong Trousers, The (1993) 4.588235 4.478261 Sunset Blvd. (a.k.a. Sunset Boulevard) (1950) 4.572650 4.464589 Wallace & Gromit: The Best of Aardman Animation (1996) 4.563107 4.385075 Schindler's List (1993) 4.562602 4.491415 Shawshank Redemption, The (1994) 4.539075 4.560625 Grand Day Out, A (1992) 4.537879 4.293255 To Kill a Mockingbird (1962) 4.536667 4.372611 Creature Comforts (1990) 4.513889 4.272277 Usual Suspects, The (1995) 4.513317 4.518248 Measuring rating disagreement Suppose you wanted to find the movies that are most divisive between male and female viewers. One way is to add a column to mean_ratings containing the difference in means, then sort by that: In [352]: mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F'] Sorting by 'diff' gives us the movies with the greatest rating difference and which were preferred by women: In [353]: sorted_by_diff = mean_ratings.sort_index(by='diff') In [354]: sorted_by_diff[:15] Out[354]: gender F M diff Dirty Dancing (1987) 3.790378 2.959596 -0.830782 Jumpin' Jack Flash (1986) 3.254717 2.578358 -0.676359 Grease (1978) 3.975265 3.367041 -0.608224 Little Women (1994) 3.870588 3.321739 -0.548849 Steel Magnolias (1989) 3.901734 3.365957 -0.535777 Anastasia (1997) 3.800000 3.281609 -0.518391 Rocky Horror Picture Show, The (1975) 3.673016 3.160131 -0.512885 Color Purple, The (1985) 4.158192 3.659341 -0.498851 Age of Innocence, The (1993) 3.827068 3.339506 -0.487561 Free Willy (1993) 2.921348 2.438776 -0.482573 French Kiss (1995) 3.535714 3.056962 -0.478752 Little Shop of Horrors, The (1960) 3.650000 3.179688 -0.470312 Guys and Dolls (1955) 4.051724 3.583333 -0.468391 Mary Poppins (1964) 4.197740 3.730594 -0.467147 Patch Adams (1998) 3.473282 3.008746 -0.464536 30 | Chapter 2: Introductory Examples
  • 47. Reversing the order of the rows and again slicing off the top 15 rows, we get the movies preferred by men that women didn’t rate as highly: # Reverse order of rows, take first 15 rows In [355]: sorted_by_diff[::-1][:15] Out[355]: gender F M diff Good, The Bad and The Ugly, The (1966) 3.494949 4.221300 0.726351 Kentucky Fried Movie, The (1977) 2.878788 3.555147 0.676359 Dumb & Dumber (1994) 2.697987 3.336595 0.638608 Longest Day, The (1962) 3.411765 4.031447 0.619682 Cable Guy, The (1996) 2.250000 2.863787 0.613787 Evil Dead II (Dead By Dawn) (1987) 3.297297 3.909283 0.611985 Hidden, The (1987) 3.137931 3.745098 0.607167 Rocky III (1982) 2.361702 2.943503 0.581801 Caddyshack (1980) 3.396135 3.969737 0.573602 For a Few Dollars More (1965) 3.409091 3.953795 0.544704 Porky's (1981) 2.296875 2.836364 0.539489 Animal House (1978) 3.628906 4.167192 0.538286 Exorcist, The (1973) 3.537634 4.067239 0.529605 Fright Night (1985) 2.973684 3.500000 0.526316 Barb Wire (1996) 1.585366 2.100386 0.515020 Suppose instead you wanted the movies that elicited the most disagreement among viewers, independent of gender. Disagreement can be measured by the variance or standard deviation of the ratings: # Standard deviation of rating grouped by title In [356]: rating_std_by_title = data.groupby('title')['rating'].std() # Filter down to active_titles In [357]: rating_std_by_title = rating_std_by_title.ix[active_titles] # Order Series by value in descending order In [358]: rating_std_by_title.order(ascending=False)[:10] Out[358]: title Dumb & Dumber (1994) 1.321333 Blair Witch Project, The (1999) 1.316368 Natural Born Killers (1994) 1.307198 Tank Girl (1995) 1.277695 Rocky Horror Picture Show, The (1975) 1.260177 Eyes Wide Shut (1999) 1.259624 Evita (1996) 1.253631 Billy Madison (1995) 1.249970 Fear and Loathing in Las Vegas (1998) 1.246408 Bicentennial Man (1999) 1.245533 Name: rating You may have noticed that movie genres are given as a pipe-separated (|) string. If you wanted to do some analysis by genre, more work would be required to transform the genre information into a more usable form. I will revisit this data later in the book to illustrate such a transformation. MovieLens 1M Data Set | 31
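As a preview of the kind of transformation that would take, here is one way the
pipe-separated genre strings can be expanded into indicator (0/1) columns. It relies
on the Series.str.get_dummies method found in newer pandas releases, so treat it as a
hedged sketch rather than the approach used later in the book.

    # Turn "Animation|Children's|Comedy" into one 0/1 column per genre.
    genre_dummies = movies['genres'].str.get_dummies(sep='|')
    movies_with_genres = movies.join(genre_dummies)

    # For example, the number of movies carrying each genre tag:
    genre_dummies.sum()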
  • 48. US Baby Names 1880-2010 The United States Social Security Administration (SSA) has made available data on the frequency of baby names from 1880 through the present. Hadley Wickham, an author of several popular R packages, has often made use of this data set in illustrating data manipulation in R. In [4]: names.head(10) Out[4]: name sex births year 0 Mary F 7065 1880 1 Anna F 2604 1880 2 Emma F 2003 1880 3 Elizabeth F 1939 1880 4 Minnie F 1746 1880 5 Margaret F 1578 1880 6 Ida F 1472 1880 7 Alice F 1414 1880 8 Bertha F 1320 1880 9 Sarah F 1288 1880 There are many things you might want to do with the data set: • Visualize the proportion of babies given a particular name (your own, or another name) over time. • Determine the relative rank of a name. • Determine the most popular names in each year or the names with largest increases or decreases. • Analyze trends in names: vowels, consonants, length, overall diversity, changes in spelling, first and last letters • Analyze external sources of trends: biblical names, celebrities, demographic changes Using the tools we’ve looked at so far, most of these kinds of analyses are very straight- forward, so I will walk you through many of them. I encourage you to download and explore the data yourself. If you find an interesting pattern in the data, I would love to hear about it. As of this writing, the US Social Security Administration makes available data files, one per year, containing the total number of births for each sex/name combination. The raw archive of these files can be obtained here: http://www.ssa.gov/oact/babynames/limits.html In the event that this page has been moved by the time you’re reading this, it can most likely be located again by Internet search. After downloading the “National data” file names.zip and unzipping it, you will have a directory containing a series of files like yob1880.txt. I use the UNIX head command to look at the first 10 lines of one of the files (on Windows, you can use the more command or open it in a text editor): 32 | Chapter 2: Introductory Examples
  • 49. In [367]: !head -n 10 names/yob1880.txt Mary,F,7065 Anna,F,2604 Emma,F,2003 Elizabeth,F,1939 Minnie,F,1746 Margaret,F,1578 Ida,F,1472 Alice,F,1414 Bertha,F,1320 Sarah,F,1288 As this is a nicely comma-separated form, it can be loaded into a DataFrame with pandas.read_csv: In [368]: import pandas as pd In [369]: names1880 = pd.read_csv('names/yob1880.txt', names=['name', 'sex', 'births']) In [370]: names1880 Out[370]: <class 'pandas.core.frame.DataFrame'> Int64Index: 2000 entries, 0 to 1999 Data columns: name 2000 non-null values sex 2000 non-null values births 2000 non-null values dtypes: int64(1), object(2) These files only contain names with at least 5 occurrences in each year, so for simplic- ity’s sake we can use the sum of the births column by sex as the total number of births in that year: In [371]: names1880.groupby('sex').births.sum() Out[371]: sex F 90993 M 110493 Name: births Since the data set is split into files by year, one of the first things to do is to assemble all of the data into a single DataFrame and further to add a year field. This is easy to do using pandas.concat: # 2010 is the last available year right now years = range(1880, 2011) pieces = [] columns = ['name', 'sex', 'births'] for year in years: path = 'names/yob%d.txt' % year frame = pd.read_csv(path, names=columns) frame['year'] = year pieces.append(frame) US Baby Names 1880-2010 | 33
  • 50. # Concatenate everything into a single DataFrame names = pd.concat(pieces, ignore_index=True) There are a couple things to note here. First, remember that concat glues the DataFrame objects together row-wise by default. Secondly, you have to pass ignore_index=True because we’re not interested in preserving the original row numbers returned from read_csv. So we now have a very large DataFrame containing all of the names data: Now the names DataFrame looks like: In [373]: names Out[373]: <class 'pandas.core.frame.DataFrame'> Int64Index: 1690784 entries, 0 to 1690783 Data columns: name 1690784 non-null values sex 1690784 non-null values births 1690784 non-null values year 1690784 non-null values dtypes: int64(2), object(2) With this data in hand, we can already start aggregating the data at the year and sex level using groupby or pivot_table, see Figure 2-4: In [374]: total_births = names.pivot_table('births', rows='year', .....: cols='sex', aggfunc=sum) In [375]: total_births.tail() Out[375]: sex F M year 2006 1896468 2050234 2007 1916888 2069242 2008 1883645 2032310 2009 1827643 1973359 2010 1759010 1898382 In [376]: total_births.plot(title='Total births by sex and year') Next, let’s insert a column prop with the fraction of babies given each name relative to the total number of births. A prop value of 0.02 would indicate that 2 out of every 100 babies was given a particular name. Thus, we group the data by year and sex, then add the new column to each group: def add_prop(group): # Integer division floors births = group.births.astype(float) group['prop'] = births / births.sum() return group names = names.groupby(['year', 'sex']).apply(add_prop) 34 | Chapter 2: Introductory Examples
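An equivalent way to build the prop column, without writing an apply function at all,
is to divide each birth count by a grouped total computed with transform. This is only
an alternative sketch and assumes a reasonably recent pandas that accepts the 'sum'
string here.

    # Same result as add_prop: births divided by the total births in each
    # (year, sex) group.
    group_totals = names.groupby(['year', 'sex']).births.transform('sum')
    names['prop'] = names.births / group_totals.astype(float)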
  • 51. Remember that because births is of integer type, we have to cast either the numerator or denominator to floating point to compute a fraction (unless you are using Python 3!). The resulting complete data set now has the following columns: In [378]: names Out[378]: <class 'pandas.core.frame.DataFrame'> Int64Index: 1690784 entries, 0 to 1690783 Data columns: name 1690784 non-null values sex 1690784 non-null values births 1690784 non-null values year 1690784 non-null values prop 1690784 non-null values dtypes: float64(1), int64(2), object(2) When performing a group operation like this, it's often valuable to do a sanity check, like verifying that the prop column sums to 1 within all the groups. Since this is floating point data, use np.allclose to check that the group sums are sufficiently close to (but perhaps not exactly equal to) 1: In [379]: np.allclose(names.groupby(['year', 'sex']).prop.sum(), 1) Out[379]: True Now that this is done, I’m going to extract a subset of the data to facilitate further analysis: the top 1000 names for each sex/year combination. This is yet another group operation: def get_top1000(group): return group.sort_index(by='births', ascending=False)[:1000] Figure 2-4. Total births by sex and year US Baby Names 1880-2010 | 35
  • 52. grouped = names.groupby(['year', 'sex']) top1000 = grouped.apply(get_top1000) If you prefer a do-it-yourself approach, you could also do: pieces = [] for year, group in names.groupby(['year', 'sex']): pieces.append(group.sort_index(by='births', ascending=False)[:1000]) top1000 = pd.concat(pieces, ignore_index=True) The resulting data set is now quite a bit smaller: In [382]: top1000 Out[382]: <class 'pandas.core.frame.DataFrame'> Int64Index: 261877 entries, 0 to 261876 Data columns: name 261877 non-null values sex 261877 non-null values births 261877 non-null values year 261877 non-null values prop 261877 non-null values dtypes: float64(1), int64(2), object(2) We’ll use this Top 1,000 data set in the following investigations into the data. Analyzing Naming Trends With the full data set and Top 1,000 data set in hand, we can start analyzing various naming trends of interest. Splitting the Top 1,000 names into the boy and girl portions is easy to do first: In [383]: boys = top1000[top1000.sex == 'M'] In [384]: girls = top1000[top1000.sex == 'F'] Simple time series, like the number of Johns or Marys for each year can be plotted but require a bit of munging to be a bit more useful. Let’s form a pivot table of the total number of births by year and name: In [385]: total_births = top1000.pivot_table('births', rows='year', cols='name', .....: aggfunc=sum) Now, this can be plotted for a handful of names using DataFrame’s plot method: In [386]: total_births Out[386]: <class 'pandas.core.frame.DataFrame'> Int64Index: 131 entries, 1880 to 2010 Columns: 6865 entries, Aaden to Zuri dtypes: float64(6865) In [387]: subset = total_births[['John', 'Harry', 'Mary', 'Marilyn']] In [388]: subset.plot(subplots=True, figsize=(12, 10), grid=False, .....: title="Number of births per year") 36 | Chapter 2: Introductory Examples
  • 53. See Figure 2-5 for the result. On looking at this, you might conclude that these names have grown out of favor with the American population. But the story is actually more complicated than that, as will be explored in the next section. Figure 2-5. A few boy and girl names over time Measuring the increase in naming diversity One explanation for the decrease in plots above is that fewer parents are choosing common names for their children. This hypothesis can be explored and confirmed in the data. One measure is the proportion of births represented by the top 1000 most popular names, which I aggregate and plot by year and sex: In [390]: table = top1000.pivot_table('prop', rows='year', .....: cols='sex', aggfunc=sum) In [391]: table.plot(title='Sum of table1000.prop by year and sex', .....: yticks=np.linspace(0, 1.2, 13), xticks=range(1880, 2020, 10)) See Figure 2-6 for this plot. So you can see that, indeed, there appears to be increasing name diversity (decreasing total proportion in the top 1,000). Another interesting met- ric is the number of distinct names, taken in order of popularity from highest to lowest, in the top 50% of births. This number is a bit more tricky to compute. Let’s consider just the boy names from 2010: In [392]: df = boys[boys.year == 2010] In [393]: df Out[393]: <class 'pandas.core.frame.DataFrame'> Int64Index: 1000 entries, 260877 to 261876 Data columns: US Baby Names 1880-2010 | 37
    name      1000  non-null values
    sex       1000  non-null values
    births    1000  non-null values
    year      1000  non-null values
    prop      1000  non-null values
    dtypes: float64(1), int64(2), object(2)

Figure 2-6. Proportion of births represented in top 1000 names by sex

After sorting prop in descending order, we want to know how many of the most popular
names it takes to reach 50%. You could write a for loop to do this, but a vectorized
NumPy way is a bit more clever. Taking the cumulative sum, cumsum, of prop, then
calling the method searchsorted returns the position in the cumulative sum at which
0.5 would need to be inserted to keep it in sorted order:

    In [394]: prop_cumsum = df.sort_index(by='prop', ascending=False).prop.cumsum()

    In [395]: prop_cumsum[:10]
    Out[395]:
    260877    0.011523
    260878    0.020934
    260879    0.029959
    260880    0.038930
    260881    0.047817
    260882    0.056579
    260883    0.065155
    260884    0.073414
    260885    0.081528
    260886    0.089621

    In [396]: prop_cumsum.searchsorted(0.5)
    Out[396]: 116
  • 55. Since arrays are zero-indexed, adding 1 to this result gives you a result of 117. By con- trast, in 1900 this number was much smaller: In [397]: df = boys[boys.year == 1900] In [398]: in1900 = df.sort_index(by='prop', ascending=False).prop.cumsum() In [399]: in1900.searchsorted(0.5) + 1 Out[399]: 25 It should now be fairly straightforward to apply this operation to each year/sex com- bination; groupby those fields and apply a function returning the count for each group: def get_quantile_count(group, q=0.5): group = group.sort_index(by='prop', ascending=False) return group.prop.cumsum().searchsorted(q) + 1 diversity = top1000.groupby(['year', 'sex']).apply(get_quantile_count) diversity = diversity.unstack('sex') This resulting DataFrame diversity now has two time series, one for each sex, indexed by year. This can be inspected in IPython and plotted as before (see Figure 2-7): In [401]: diversity.head() Out[401]: sex F M year 1880 38 14 1881 38 14 1882 38 15 1883 39 15 1884 39 16 In [402]: diversity.plot(title="Number of popular names in top 50%") Figure 2-7. Plot of diversity metric by year US Baby Names 1880-2010 | 39
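If searchsorted feels too clever, the same count can be computed with ordinary boolean
operations: count how many cumulative proportions fall strictly below the quantile and
add one. The sketch below is an equivalent of my own (sort_values is the newer
spelling of the sort used above).

    def get_quantile_count2(group, q=0.5):
        # Number of most popular names needed before the cumulative prop reaches q.
        prop = group.sort_values('prop', ascending=False).prop
        return int((prop.cumsum() < q).sum()) + 1

    get_quantile_count2(boys[boys.year == 2010])   # should again give 117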
  • 56. As you can see, girl names have always been more diverse than boy names, and they have only become more so over time. Further analysis of what exactly is driving the diversity, like the increase of alternate spellings, is left to the reader. The “Last letter” Revolution In 2007, a baby name researcher Laura Wattenberg pointed out on her website (http: //www.babynamewizard.com) that the distribution of boy names by final letter has changed significantly over the last 100 years. To see this, I first aggregate all of the births in the full data set by year, sex, and final letter: # extract last letter from name column get_last_letter = lambda x: x[-1] last_letters = names.name.map(get_last_letter) last_letters.name = 'last_letter' table = names.pivot_table('births', rows=last_letters, cols=['sex', 'year'], aggfunc=sum) Then, I select out three representative years spanning the history and print the first few rows: In [404]: subtable = table.reindex(columns=[1910, 1960, 2010], level='year') In [405]: subtable.head() Out[405]: sex F M year 1910 1960 2010 1910 1960 2010 last_letter a 108376 691247 670605 977 5204 28438 b NaN 694 450 411 3912 38859 c 5 49 946 482 15476 23125 d 6750 3729 2607 22111 262112 44398 e 133569 435013 313833 28655 178823 129012 Next, normalize the table by total births to compute a new table containing proportion of total births for each sex ending in each letter: In [406]: subtable.sum() Out[406]: sex year F 1910 396416 1960 2022062 2010 1759010 M 1910 194198 1960 2132588 2010 1898382 In [407]: letter_prop = subtable / subtable.sum().astype(float) With the letter proportions now in hand, I can make bar plots for each sex broken down by year. See Figure 2-8: import matplotlib.pyplot as plt 40 | Chapter 2: Introductory Examples
  • 57. fig, axes = plt.subplots(2, 1, figsize=(10, 8)) letter_prop['M'].plot(kind='bar', rot=0, ax=axes[0], title='Male') letter_prop['F'].plot(kind='bar', rot=0, ax=axes[1], title='Female', legend=False) Figure 2-8. Proportion of boy and girl names ending in each letter As you can see, boy names ending in “n” have experienced significant growth since the 1960s. Going back to the full table created above, I again normalize by year and sex and select a subset of letters for the boy names, finally transposing to make each column a time series: In [410]: letter_prop = table / table.sum().astype(float) In [411]: dny_ts = letter_prop.ix[['d', 'n', 'y'], 'M'].T In [412]: dny_ts.head() Out[412]: d n y year 1880 0.083055 0.153213 0.075760 1881 0.083247 0.153214 0.077451 1882 0.085340 0.149560 0.077537 1883 0.084066 0.151646 0.079144 1884 0.086120 0.149915 0.080405 With this DataFrame of time series in hand, I can make a plot of the trends over time again with its plot method (see Figure 2-9): In [414]: dny_ts.plot() US Baby Names 1880-2010 | 41
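Incidentally, the last-letter extraction at the start of this section, written with a
lambda passed to map, has a vectorized equivalent using pandas string methods. This is
only an aside and assumes a pandas version that supports indexing through the .str
accessor.

    # Equivalent to names.name.map(lambda x: x[-1]): the final character of each name.
    last_letters = names.name.str[-1]
    last_letters.name = 'last_letter'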
  • 58. Figure 2-9. Proportion of boys born with names ending in d/n/y over time Boy names that became girl names (and vice versa) Another fun trend is looking at boy names that were more popular with one sex earlier in the sample but have “changed sexes” in the present. One example is the name Lesley or Leslie. Going back to the top1000 dataset, I compute a list of names occurring in the dataset starting with 'lesl': In [415]: all_names = top1000.name.unique() In [416]: mask = np.array(['lesl' in x.lower() for x in all_names]) In [417]: lesley_like = all_names[mask] In [418]: lesley_like Out[418]: array([Leslie, Lesley, Leslee, Lesli, Lesly], dtype=object) From there, we can filter down to just those names and sum births grouped by name to see the relative frequencies: In [419]: filtered = top1000[top1000.name.isin(lesley_like)] In [420]: filtered.groupby('name').births.sum() Out[420]: name Leslee 1082 Lesley 35022 Lesli 929 Leslie 370429 Lesly 10067 Name: births Next, let’s aggregate by sex and year and normalize within year: 42 | Chapter 2: Introductory Examples
  • 59. In [421]: table = filtered.pivot_table('births', rows='year', .....: cols='sex', aggfunc='sum') In [422]: table = table.div(table.sum(1), axis=0) In [423]: table.tail() Out[423]: sex F M year 2006 1 NaN 2007 1 NaN 2008 1 NaN 2009 1 NaN 2010 1 NaN Lastly, it’s now easy to make a plot of the breakdown by sex over time (Figure 2-10): In [425]: table.plot(style={'M': 'k-', 'F': 'k--'}) Figure 2-10. Proportion of male/female Lesley-like names over time Conclusions and The Path Ahead The examples in this chapter are rather simple, but they’re here to give you a bit of a flavor of what sorts of things you can expect in the upcoming chapters. The focus of this book is on tools as opposed to presenting more sophisticated analytical methods. Mastering the techniques in this book will enable you to implement your own analyses (assuming you know what you want to do!) in short order. Conclusions and The Path Ahead | 43
CHAPTER 3
IPython: An Interactive Computing and Development Environment

    Act without doing; work without effort. Think of the small as large and the few
    as many. Confront the difficult while it is still easy; accomplish the great task
    by a series of small acts.
    —Laozi

People often ask me, "What is your Python development environment?" My answer is
almost always the same, "IPython and a text editor". You may choose to substitute an
Integrated Development Environment (IDE) for a text editor in order to take advantage
of more advanced graphical tools and code completion capabilities. Even if so, I
strongly recommend making IPython an important part of your workflow. Some IDEs even
provide IPython integration, so it's possible to get the best of both worlds.

The IPython project began in 2001 as Fernando Pérez's side project to make a better
interactive Python interpreter. In the subsequent 11 years it has grown into what's
widely considered one of the most important tools in the modern scientific Python
computing stack. While it does not provide any computational or data analytical tools
by itself, IPython is designed from the ground up to maximize your productivity in
both interactive computing and software development. It encourages an execute-explore
workflow instead of the typical edit-compile-run workflow of many other programming
languages. It also provides very tight integration with the operating system's shell
and file system. Since much of data analysis coding involves exploration, trial and
error, and iteration, IPython will, in almost all cases, help you get the job done
faster.

Of course, the IPython project now encompasses a great deal more than just an
enhanced, interactive Python shell. It also includes a rich GUI console with inline
plotting, a web-based interactive notebook format, and a lightweight, fast parallel
computing engine. And, as with so many other tools designed for and by programmers, it
is highly customizable. I'll discuss some of these features later in the chapter.
  • 62. Since IPython has interactivity at its core, some of the features in this chapter are dif- ficult to fully illustrate without a live console. If this is your first time learning about IPython, I recommend that you follow along with the examples to get a feel for how things work. As with any keyboard-driven console-like environment, developing mus- cle-memory for the common commands is part of the learning curve. Many parts of this chapter (for example: profiling and debugging) can be safely omitted on a first reading as they are not necessary for under- standing the rest of the book. This chapter is intended to provide a standalone, rich overview of the functionality provided by IPython. IPython Basics You can launch IPython on the command line just like launching the regular Python interpreter except with the ipython command: $ ipython Python 2.7.2 (default, May 27 2012, 21:26:12) Type "copyright", "credits" or "license" for more information. IPython 0.12 -- An enhanced Interactive Python. ? -> Introduction and overview of IPython's features. %quickref -> Quick reference. help -> Python's own help system. object? -> Details about 'object', use 'object??' for extra details. In [1]: a = 5 In [2]: a Out[2]: 5 You can execute arbitrary Python statements by typing them in and pressing <return>. When typing just a variable into IPython, it renders a string representation of the object: In [542]: data = {i : randn() for i in range(7)} In [543]: data Out[543]: {0: 0.6900018528091594, 1: 1.0015434424937888, 2: -0.5030873913603446, 3: -0.6222742250596455, 4: -0.9211686080130108, 5: -0.726213492660829, 6: 0.2228955458351768} 46 | Chapter 3: IPython: An Interactive Computing and Development Environment
  • 63. Many kinds of Python objects are formatted to be more readable, or pretty-printed, which is distinct from normal printing with print. If you printed a dict like the above in the standard Python interpreter, it would be much less readable: >>> from numpy.random import randn >>> data = {i : randn() for i in range(7)} >>> print data {0: -1.5948255432744511, 1: 0.10569006472787983, 2: 1.972367135977295, 3: 0.15455217573074576, 4: -0.24058577449429575, 5: -1.2904897053651216, 6: 0.3308507317325902} IPython also provides facilities to make it easy to execute arbitrary blocks of code (via somewhat glorified copy-and-pasting) and whole Python scripts. These will be dis- cussed shortly. Tab Completion On the surface, the IPython shell looks like a cosmetically slightly-different interactive Python interpreter. Users of Mathematica may find the enumerated input and output prompts familiar. One of the major improvements over the standard Python shell is tab completion, a feature common to most interactive data analysis environments. While entering expressions in the shell, pressing <Tab> will search the namespace for any variables (objects, functions, etc.) matching the characters you have typed so far: In [1]: an_apple = 27 In [2]: an_example = 42 In [3]: an<Tab> an_apple and an_example any In this example, note that IPython displayed both the two variables I defined as well as the Python keyword and and built-in function any. Naturally, you can also complete methods and attributes on any object after typing a period: In [3]: b = [1, 2, 3] In [4]: b.<Tab> b.append b.extend b.insert b.remove b.sort b.count b.index b.pop b.reverse The same goes for modules: In [1]: import datetime In [2]: datetime.<Tab> datetime.date datetime.MAXYEAR datetime.timedelta datetime.datetime datetime.MINYEAR datetime.tzinfo datetime.datetime_CAPI datetime.time IPython Basics | 47
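Returning briefly to the pretty-printing point above: if you do find yourself in the
plain Python interpreter, the standard library's pprint module gets you part of the
way toward IPython's readable output. A minimal illustration of my own, not from the
book:

    from pprint import pprint
    from numpy.random import randn

    data = {i: randn() for i in range(7)}
    pprint(data)   # wraps the dict across lines, one key per line, keys sorted

Back to tab completion.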
  • 64. Note that IPython by default hides methods and attributes starting with underscores, such as magic methods and internal “private” methods and attributes, in order to avoid cluttering the display (and confusing new Python users!). These, too, can be tab-completed but you must first type an underscore to see them. If you prefer to always see such methods in tab completion, you can change this setting in the IPython configu- ration. Tab completion works in many contexts outside of searching the interactive namespace and completing object or module attributes.When typing anything that looks like a file path (even in a Python string), pressing <Tab> will complete anything on your com- puter’s file system matching what you’ve typed: In [3]: book_scripts/<Tab> book_scripts/cprof_example.py book_scripts/ipython_script_test.py book_scripts/ipython_bug.py book_scripts/prof_mod.py In [3]: path = 'book_scripts/<Tab> book_scripts/cprof_example.py book_scripts/ipython_script_test.py book_scripts/ipython_bug.py book_scripts/prof_mod.py Combined with the %run command (see later section), this functionality will undoubt- edly save you many keystrokes. Another area where tab completion saves time is in the completion of function keyword arguments (including the = sign!). Introspection Using a question mark (?) before or after a variable will display some general informa- tion about the object: In [545]: b? Type: list String Form:[1, 2, 3] Length: 3 Docstring: list() -> new empty list list(iterable) -> new list initialized from iterable's items This is referred to as object introspection. If the object is a function or instance method, the docstring, if defined, will also be shown. Suppose we’d written the following func- tion: def add_numbers(a, b): """ Add two numbers together Returns ------- the_sum : type of arguments 48 | Chapter 3: IPython: An Interactive Computing and Development Environment
  • 65. """ return a + b Then using ? shows us the docstring: In [547]: add_numbers? Type: function String Form:<function add_numbers at 0x5fad848> File: book_scripts/<ipython-input-546-5473012eeb65> Definition: add_numbers(a, b) Docstring: Add two numbers together Returns ------- the_sum : type of arguments Using ?? will also show the function’s source code if possible: In [548]: add_numbers?? Type: function String Form:<function add_numbers at 0x5fad848> File: book_scripts/<ipython-input-546-5473012eeb65> Definition: add_numbers(a, b) Source: def add_numbers(a, b): """ Add two numbers together Returns ------- the_sum : type of arguments """ return a + b ? has a final usage, which is for searching the IPython namespace in a manner similar to the standard UNIX or Windows command line. A number of characters combined with the wildcard (*) will show all names matching the wildcard expression. For ex- ample, we could get a list of all functions in the top level NumPy namespace containing load: In [549]: np.*load*? np.load np.loads np.loadtxt np.pkgload The %run Command Any file can be run as a Python program inside the environment of your IPython session using the %run command. Suppose you had the following simple script stored in ipy thon_script_test.py: def f(x, y, z): return (x + y) / z a = 5 IPython Basics | 49