Pandas is a Python library for data analysis and manipulation. It provides high-performance tools for structured data, including DataFrame objects for tabular data with row and column indexes. Pandas aims for a clean, consistent API that is both performant and easy to use for tasks such as data cleaning, aggregation, reshaping, and merging.
pandas: a Foundational Python Library for Data Analysis and Statistics
1. pandas: a Foundational Python library for Data Analysis and Statistics
Wes McKinney
PyHPC 2011, 18 November 2011
Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 1 / 25
2. An alternate title
High Performance Structured Data Manipulation in Python
3. My background
Former quant hacker at AQR Capital, now entrepreneur
Background: math, statistics, computer science, quant finance.
Shaken, not stirred
Active in scientific Python community
My blog: http://blog.wesmckinney.com
Twitter: @wesmckinn
Book! “Python for Data Analysis”, to hit the shelves later next year from O’Reilly
4. Structured data
   cname      year  agefrom  ageto  ls    lsc   pop  ccode
0  Australia  1950  15       19     64.3  15.4  558  AUS
1  Australia  1950  20       24     48.4  26.4  645  AUS
2  Australia  1950  25       29     47.9  26.2  681  AUS
3  Australia  1950  30       34     44    23.8  614  AUS
4  Australia  1950  35       39     42.1  21.9  625  AUS
5  Australia  1950  40       44     38.9  20.1  555  AUS
6  Australia  1950  45       49     34    16.9  491  AUS
7  Australia  1950  50       54     29.6  14.6  439  AUS
8  Australia  1950  55       59     28    12.9  408  AUS
9  Australia  1950  60       64     26.3  12.1  356  AUS
5. Structured data
A familiar data model
Heterogeneous columns or hyperslabs
Each column/hyperslab is homogeneously typed
Relational databases (SQL, etc.) are just a special case
Need good performance in row- and column-oriented operations
Support for axis metadata
Data alignment is critical
Seamless integration with Python data structures and NumPy
6. Structured data challenges
Table modification: column insertion/deletion
Axis indexing and data alignment
Aggregation and transformation by group (“group by”)
Missing data handling
Pivoting and reshaping
Merging and joining
Time series-specific manipulations
Fast IO: flat files, databases, HDF5, ...
7. Not all fun and games
We care nearly equally about
Performance
Ease-of-use (syntax / API fits your mental model)
Expressiveness
Clean, consistent API design is hard and underappreciated
8. The big picture
Build a foundation for data analysis and statistical computing
Craft the most expressive / flexible in-memory data manipulation tool in any language
Preferably also one of the fastest, too
Vastly simplify the data preparation, munging, and integration process
Comfortable abstractions: master data-fu without needing to be a computer scientist
Later: extend API with distributed computing backend for larger-than-memory datasets
9. pandas: a brief history
Started building in April 2008, back at AQR
Open-sourced (BSD license) mid-2009
29075 lines of Python/Cython code as of yesterday, and growing fast
Heavily tested, being used by many companies (inc. lots of financial firms) in production
10. Cython: getting good performance
My tool of choice for writing performant code
High level access to NumPy C API internals
Buffer syntax/protocol abstracts away striding details of non-contiguous arrays, very low overhead vs. working with raw C pointers
Reduce/remove interpreter overhead associated with working with Python data structures
Interface directly with C/C++ code when necessary
11. Axis indexing
Key pandas feature
The axis index is a data structure itself, which can be customized to support things like:
1-1 O(1) indexing with hashable Python objects
Datetime indexing for time series data
Hierarchical (multi-level) indexing
Use Python dict to support O(1) lookups and O(n) realignment ops.
Can specialize to get better performance and memory usage
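The dict-backed index described on this slide can be sketched in a few lines of plain Python. This is only an illustration of the idea, not pandas's actual implementation: `DictIndex`, `get_loc`, `indexer_for` and `reindex` are hypothetical names chosen for this example.

```python
class DictIndex:
    """Toy axis index: maps hashable labels to integer positions."""

    def __init__(self, labels):
        self.labels = list(labels)
        # Building the dict is O(n); each later label lookup is O(1) on average.
        self.positions = {label: i for i, label in enumerate(self.labels)}

    def get_loc(self, label):
        """Return the integer position of a single label."""
        return self.positions[label]

    def indexer_for(self, target_labels):
        """O(n) realignment: one dict lookup per target label, -1 if missing."""
        return [self.positions.get(label, -1) for label in target_labels]


def reindex(values, index, target_labels, fill=None):
    """Rearrange `values` (aligned with `index`) to follow `target_labels`."""
    return [values[i] if i != -1 else fill for i in index.indexer_for(target_labels)]


index = DictIndex(["a", "b", "c"])
values = [1.0, 2.0, 3.0]
aligned = reindex(values, index, ["c", "a", "z"])
print(aligned)
```

Labels that are absent from the source index ("z" here) come back as the fill value, which is the role missing-data markers play in realignment.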
12. Axis indexing
Every axis has an index
Automatic alignment between differently-indexed objects: makes it nearly impossible to accidentally combine misaligned data
Hierarchical indexing provides an intuitive way of structuring and working with higher-dimensional data
Natural way of expressing “group by” and join-type operations
As good as, or in many cases much more integrated/flexible than, commercial or open-source alternatives to pandas/Python
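Automatic alignment is easiest to see with two small Series that share only some labels. The example below is a sketch using made-up values and assumes a reasonably recent pandas release, whose API differs from the 2011 version the slides describe.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["c", "b", "d"])

# Addition matches values by label, not by position. The result carries the
# union of both indexes; labels present on only one side become NaN.
total = s1 + s2
print(total)
```

Because "c" sits at different positions in the two inputs, a purely positional addition would silently have combined unrelated values.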
13. The trouble with Python dicts...
Python dict memory footprint can be quite large
1MM key-value pairs: something like 70 MB on a 64-bit system
Even though sizeof(PyObject*) == 8
Python dict is great, but should use a faster, threadsafe hash table for primitive C types (like 64-bit integer)
BUT: using a hash table is only necessary in the general case. With monotonic indexes you don’t need one for realignment ops
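The point about monotonic indexes can be illustrated with a merge-style outer join, which walks two sorted label lists in lockstep and never builds a hash table. This is a plain-Python sketch under the assumption that both inputs are strictly increasing; `outer_join_sorted` is a hypothetical helper, not a pandas function.

```python
def outer_join_sorted(left, right):
    """Union of two strictly increasing label lists, plus position indexers.

    Returns (joined, left_pos, right_pos). A position of -1 means the label
    is absent from that side. Runs in O(len(left) + len(right)), no hashing.
    """
    joined, left_pos, right_pos = [], [], []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            # Label present on both sides.
            joined.append(left[i])
            left_pos.append(i)
            right_pos.append(j)
            i += 1
            j += 1
        elif left[i] < right[j]:
            # Label only on the left side.
            joined.append(left[i])
            left_pos.append(i)
            right_pos.append(-1)
            i += 1
        else:
            # Label only on the right side.
            joined.append(right[j])
            left_pos.append(-1)
            right_pos.append(j)
            j += 1
    # Drain whichever side still has labels left.
    while i < len(left):
        joined.append(left[i])
        left_pos.append(i)
        right_pos.append(-1)
        i += 1
    while j < len(right):
        joined.append(right[j])
        left_pos.append(-1)
        right_pos.append(j)
        j += 1
    return joined, left_pos, right_pos


joined, left_pos, right_pos = outer_join_sorted([1, 3, 5], [2, 3, 6])
print(joined, left_pos, right_pos)
```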
14. Some alignment numbers
Hardware: Macbook Pro Core i7 laptop, Python 2.7.2
Outer-join 500k-length indexes chosen from 1MM elements
Dict-based with random strings: 2.2 seconds
Sorted strings: 400ms (5.5x faster)
Sorted int64: 19ms (115x faster)
Fortunately, time series data falls into this last category
Alignment ops with C primitives could be fairly easily parallelized with OpenMP in Cython
15. DataFrame, the pandas workhorse
A 2D tabular data structure with row and column indexes
Hierarchical indexing is one way to support higher-dimensional data in a lower-dimensional structure
Simplified NumPy type system: float, int, boolean, object
Rich indexing operations, SQL-like join/merges, etc.
Support heterogeneous columns WITHOUT sacrificing performance in the homogeneous (e.g. floating point only) case
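As a rough illustration of heterogeneous columns and SQL-like merges, the sketch below builds a small DataFrame from the first two rows of the table on slide 4 and joins it against a lookup table. It assumes a current pandas release; the `codes` lookup table is an illustrative construct, not part of the original talk.

```python
import pandas as pd

# Two rows taken from the structured-data table shown earlier.
df = pd.DataFrame(
    {
        "cname": ["Australia", "Australia"],
        "year": [1950, 1950],
        "ls": [64.3, 48.4],
        "pop": [558, 645],
    }
)

# Each column keeps its own type: integers and floats stay numeric
# even though they sit next to a column of strings.
print(df.dtypes)

# SQL-like left join against a small lookup table (hypothetical example).
codes = pd.DataFrame({"cname": ["Australia"], "ccode": ["AUS"]})
merged = df.merge(codes, on="cname", how="left")
print(merged)
```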
16. DataFrame, under the hood
17. Supporting size mutability
In order to have good row-oriented performance, need to store like-typed columns in a single ndarray
“Column” insertion: accumulate 1 × N × ... homogeneous columns, later consolidate with other like-typed into a single block
I.e. avoid reallocate-copy or array concatenation steps as long as possible
Column deletions can be no-copy events (since ndarrays support views)
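The insert-then-consolidate strategy can be sketched with a toy container. This is a simplified illustration of the idea only: `ToyBlockFrame` is a hypothetical class, and pandas's real internals are considerably more involved.

```python
import numpy as np


class ToyBlockFrame:
    """Hypothetical block store: each block is (column_names, 2D ndarray)."""

    def __init__(self):
        self.blocks = []

    def insert(self, name, values):
        # Cheap insertion: wrap the new column as its own 1 x N block
        # instead of reallocating and copying an existing array.
        arr = np.asarray(values)
        self.blocks.append(([name], arr.reshape(1, -1)))

    def consolidate(self):
        # Deferred cost: stack all blocks sharing a dtype into one array.
        by_dtype = {}
        for names, arr in self.blocks:
            by_dtype.setdefault(arr.dtype, []).append((names, arr))
        merged = []
        for group in by_dtype.values():
            names = [name for block_names, _ in group for name in block_names]
            merged.append((names, np.vstack([arr for _, arr in group])))
        self.blocks = merged


frame = ToyBlockFrame()
frame.insert("a", [1.0, 2.0])
frame.insert("b", [1, 2])
frame.insert("c", [3.0, 4.0])
n_before = len(frame.blocks)  # three separate 1 x N blocks
frame.consolidate()
n_after = len(frame.blocks)   # one float block and one integer block
print(n_before, n_after)
```

After consolidation, a row-oriented operation over the float columns touches a single contiguous array rather than one array per column.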
18. Hierarchical indexing
New this year, but really should have done long ago
Natural result of multi-key groupby
An intuitive way to work with higher-dimensional data
Much less ad hoc way of expressing reshaping operations
Once you have it, things like Excel-style pivot tables just “fall out”
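A short sketch of how a pivot table "falls out" of hierarchical indexing: a multi-key groupby yields a two-level index, and unstacking one level moves it to the columns. The data are made up for illustration, and the calls assume a current pandas release.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "region": ["east", "east", "west", "west"],
        "year": [2010, 2011, 2010, 2011],
        "sales": [1.0, 2.0, 3.0, 4.0],
    }
)

# A multi-key groupby naturally produces a hierarchical (two-level) index.
totals = df.groupby(["region", "year"])["sales"].sum()

# Moving the inner level to the columns gives an Excel-style pivot table.
table = totals.unstack("year")
print(table)
```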
21. Reshaping implementation nuances
Must deal with unbalanced group sizes / missing data
Play vectorization tricks with the NumPy C-contiguous memory layout: no Python for loops allowed
Care must be taken to handle heterogeneous and homogeneous data cases
22. GroupBy
High level process
split data set into groups
apply function to each group (an aggregation or a transformation)
combine results intelligently into a result data structure
Can be used to emulate SQL GROUP BY operations
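The three steps above can be sketched in plain Python. `group_apply` is a hypothetical helper written for this illustration; with `sum` as the function it behaves roughly like `SELECT key, SUM(value) ... GROUP BY key`.

```python
from collections import defaultdict


def group_apply(keys, values, func):
    """Split values by key, apply func to each group, combine into a dict."""
    groups = defaultdict(list)
    for key, value in zip(keys, values):
        groups[key].append(value)  # split
    # apply + combine: one result per group, keyed by group label
    return {key: func(group) for key, group in groups.items()}


keys = ["a", "b", "a", "b", "a"]
values = [1, 2, 3, 4, 5]
sums = group_apply(keys, values, sum)
print(sums)
```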
23. GroupBy
Grouping closely related to indexing
Create correspondence between axis labels and group labels using one of:
Array of group labels (like a DataFrame column)
Python function to be applied to each axis tick
Can group by multiple keys
For a hierarchically indexed axis, can select a level and group by that (or some transformation thereof)
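The grouping options listed on this slide map onto pandas calls roughly as follows. The data and labels are made up, and the sketch assumes a current pandas release rather than the 2011 API.

```python
import pandas as pd

df = pd.DataFrame(
    {"key": ["x", "y", "x", "y"], "value": [1.0, 2.0, 3.0, 4.0]},
    index=["apple", "avocado", "banana", "blueberry"],
)

# 1. Array of group labels: here, a DataFrame column.
by_column = df.groupby("key")["value"].sum()

# 2. Python function applied to each axis label (first letter of the label).
by_function = df["value"].groupby(lambda label: label[0]).sum()

# 3. Level of a hierarchical index: append "key" as a second index level,
#    then group by that level.
multi = df.set_index("key", append=True)["value"]
by_level = multi.groupby(level="key").sum()

print(by_column, by_function, by_level, sep="\n")
```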
24. GroupBy implementation challenges
Computing the group labels from arbitrary Python objects is very expensive
77ms for 1MM strings with 1K groups
107ms for 1MM strings with 10K groups
350ms for 1MM strings with 100K groups
To sort or not to sort (for iteration)?
Once you have the labels, can reorder the data set in O(n) (with a much smaller constant than computing the labels)
Roughly 35ms to reorder 1MM float64 data points given the labels
(By contrast, computing the mean of 1MM elements takes 1.4ms)
Python function call overhead is significant in cases with lots of small groups; much better (orders of magnitude speedup) to write specialized Cython routines
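The two phases discussed on this slide, computing integer group labels and then reordering in O(n), can be sketched in plain Python. `factorize` and `group_order` are hypothetical helpers for illustration; the talk's own routines are written in Cython and are not shown in the slides.

```python
def factorize(keys):
    """Map hashable keys to integer labels 0..n_groups-1 in first-seen order.

    This is the expensive phase: one hash lookup per element.
    """
    table, labels = {}, []
    for key in keys:
        labels.append(table.setdefault(key, len(table)))
    return labels, list(table)


def group_order(labels, n_groups):
    """Counting sort: positions that make each group contiguous, in O(n)."""
    counts = [0] * n_groups
    for label in labels:
        counts[label] += 1
    # Starting offset of each group in the reordered output.
    starts, total = [], 0
    for count in counts:
        starts.append(total)
        total += count
    order = [0] * len(labels)
    for position, label in enumerate(labels):
        order[starts[label]] = position
        starts[label] += 1
    return order


keys = ["b", "a", "b", "c", "a"]
values = [10, 20, 30, 40, 50]
labels, uniques = factorize(keys)
order = group_order(labels, len(uniques))
reordered = [values[i] for i in order]
print(labels, uniques, order, reordered)
```

Once the rows are contiguous by group, each aggregation becomes a pass over a slice, which is where specialized compiled routines avoid per-group Python call overhead.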