This document summarizes a technical talk, "Pandas: a high-level, data-centric, Python extension and plotting library," given by Coby Viner. The talk gave an overview of Pandas and its capabilities for data manipulation and analysis. Specific topics included the core Pandas data structures, Series and DataFrame; basic usage of Pandas; plotting data with Pandas; an example use of Pandas for machine learning tasks; and comparisons of Pandas with R and SQL.
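A minimal sketch of the two data structures the talk covers (the labels and values here are illustrative, not taken from the talk):

```python
import pandas as pd

# A Series is a labelled one-dimensional array.
s = pd.Series([1.0, 2.5, 4.0], index=["a", "b", "c"])

# A DataFrame is a labelled two-dimensional table whose columns are Series.
df = pd.DataFrame({"x": [1, 2, 3], "y": [10.0, 20.0, 30.0]})

# Basic usage: label-based selection and a vectorized computation.
print(s["b"])          # 2.5
print(df["y"].mean())  # 20.0
```

For plotting, a DataFrame exposes a `.plot()` method that delegates to Matplotlib, which is the kind of integration such introductory talks typically demonstrate.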
Slides of my talk at OSLCfest in Stockholm Nov 6, 2019
Video recording of the talk is available here:
https://www.facebook.com/oslcfest/videos/2261640397437958/
Describing Scholarly Contributions semantically with the Open Research Knowle... (Sören Auer)
1) Prof. Dr. Sören Auer discusses challenges with current scholarly communication and proposes using knowledge graphs and the Open Research Knowledge Graph to better represent research contributions.
2) The presentation outlines how research contributions could be semantically captured and organized in the knowledge graph, including publications, data, and other artifacts.
3) Features like intuitive exploration, question answering, and automatic generation of comparisons are demonstrated as possible applications of the semantic representations in the knowledge graph.
This document provides an introduction and overview of the INF2190 - Data Analytics course. It introduces the instructor, Attila Barta, and gives details on where and when the course will take place. It then defines data analytics and sketches its history, discusses how the field has evolved with big data, and references enterprise data analytics architectures. It contrasts traditional and big-data-era data analytics approaches and tools. The stated objective of the course is to provide students with the foundation to become data scientists.
Towards an Open Research Knowledge Graph (Sören Auer)
Document-oriented workflows in science have reached (or already exceeded) the limits of adequacy, as highlighted for example by recent discussions on the increasing proliferation of scientific literature and the reproducibility crisis. It is now possible to rethink this dominant paradigm of document-centered knowledge exchange and transform it into knowledge-based information flows by representing and expressing knowledge through semantically rich, interlinked knowledge graphs. At the core of establishing knowledge-based information flows are the creation and evolution of information models that build a common understanding of data and information among the various stakeholders, as well as the integration of these technologies into the infrastructure and processes of search and knowledge exchange in the research library of the future. By integrating these information models into existing and new research infrastructure services, the information structures that are currently still implicit and deeply hidden in documents can be made explicit and directly usable. This has the potential to revolutionize scientific work, because information and research results can be seamlessly interlinked with each other and better mapped to complex information needs. Research results also become directly comparable and easier to reuse.
Moving forward data centric sciences weaving AI, Big Data & HPC (Genoveva Vargas-Solar)
This novel, multidisciplinary, data-centric scientific movement promises new and not yet imagined applications that rely on massive amounts of evolving data, which need to be cleaned, integrated and analysed for modelling purposes. Yet data management issues are not usually perceived as central. In this keynote I will explore the key challenges and opportunities for data management in this new scientific world, and discuss how a data-centric artificial intelligence supported by high performance computing (HPC) can best contribute to these exciting domains. Even if the motivation is not academic, the huge sums being devoted to related applications are moving industry and academia to pursue these directions.
Towards Knowledge Graph based Representation, Augmentation and Exploration of... (Sören Auer)
This document discusses improving scholarly communication through knowledge graphs. It describes some current issues with scholarly communication like lack of structure, integration, and machine-readability. Knowledge graphs are proposed as a solution to represent scholarly concepts, publications, and data in a structured and linked manner. This would help address issues like reproducibility, duplication, and enable new ways of exploring and querying scholarly knowledge. The document outlines a ScienceGRAPH approach using cognitive knowledge graphs to represent scholarly knowledge at different levels of granularity and allow for intuitive exploration and question answering over semantic representations.
11. Challenging issues of spatio-temporal data mining (Alexander Decker)
This document discusses the challenging issues of spatio-temporal data mining. It begins with an introduction to spatio-temporal databases and how they differ from traditional databases by managing moving objects and their locations over time. It then provides an overview of spatial data mining and temporal data mining before focusing on spatio-temporal data mining, which aims to analyze large databases containing both spatial and temporal information. The document outlines some of the key challenges in applying traditional data mining techniques to spatio-temporal data due to its continuous and correlated nature.
This document provides an introduction to a course on Python for Data Science. It discusses key concepts related to data, information, databases, data warehouses, big data, and data science. It outlines the course objectives, which are to train students to solve computational problems using Python and build different types of models. The syllabus covers topics like introduction to data science, NumPy, data manipulation with Python, data cleaning/preparation/visualization, and machine learning using Python. Textbooks and reference materials are also listed.
The document provides information about a course on Big Data Analytics taught at Malla Reddy College of Engineering & Technology. It includes 5 units that will be covered: Introduction to Big Data and Analytics, Introduction to Technology Landscape, Introduction to MongoDB and MapReduce Programming, Introduction to Hive and Pig, and Introduction to Data Analytics with R. The course aims to introduce students to big data tools and information standard formats. It will cover topics such as structured and unstructured data, Hadoop, MongoDB, MapReduce, Hive, Pig, and machine learning algorithms.
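The MapReduce programming unit mentioned above can be illustrated with a toy word count in pure Python (a stand-in for a Hadoop job; the function names and data are invented for illustration):

```python
from itertools import groupby
from operator import itemgetter

# Map phase: emit a (word, 1) pair for every word in every input line.
def mapper(line):
    for word in line.split():
        yield (word.lower(), 1)

# Reduce phase: sum the counts for each distinct word.
def reducer(word, counts):
    return (word, sum(counts))

lines = ["big data tools", "big data analytics"]

# Shuffle/sort step: group all pairs by key before reducing.
pairs = sorted((kv for line in lines for kv in mapper(line)), key=itemgetter(0))
result = dict(reducer(w, (c for _, c in grp)) for w, grp in groupby(pairs, key=itemgetter(0)))
print(result)  # {'analytics': 1, 'big': 2, 'data': 2, 'tools': 1}
```

In a real Hadoop job the shuffle/sort step is performed by the framework across machines; the map and reduce functions are all the programmer supplies.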
This document outlines the topics that will be covered in a course on data science with Python. It includes sections on Python basics, data science basics, statistics, algorithms and analysis, and contact information. Some of the main topics that will be covered include data types in Python, conditional statements and loops, descriptive statistics, linear and logistic regression, decision trees, support vector machines, and principal component analysis. The course is intended to teach students the skills needed for a career in data science.
Introduction to question answering for linked data & big data (Andre Freitas)
This document discusses question answering (QA) systems in the context of big data and heterogeneous data scenarios. It outlines the motivation and challenges for developing natural language interfaces for databases. The document covers the basic concepts and taxonomy of QA systems, including question types, answer types, data sources, and domains. It also discusses the anatomy and components of a typical QA system.
Keynote at the Open Data Science Conference, San Francisco, Nov 2015. It outlines the evolution of data science as akin to the evolution of alchemy into chemistry, and Intel's motivations for releasing the Trusted Analytics Platform to open source.
COVID-19 Data Analysis Using Python and Introduction to Data Science (Vibhuti Mandral)
This document contains summaries of multiple MOOCs and projects related to data science using Python. The first MOOC introduces data science concepts using Python like NumPy and Pandas libraries. The second project focuses on intermediate Pandas functions for data processing. The third guided project analyzes the relationship between COVID-19 spread and country happiness using Python libraries like Pandas and Matplotlib for data preparation, calculations, joining datasets, and visualizing results.
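The dataset-joining step described above can be sketched with pandas (the column names and values below are hypothetical stand-ins, not the project's actual data):

```python
import pandas as pd

# Hypothetical stand-ins for the two datasets described above.
covid = pd.DataFrame({"country": ["A", "B", "C"],
                      "max_infection_rate": [120.0, 45.0, 80.0]})
happiness = pd.DataFrame({"country": ["A", "B", "C"],
                          "happiness_score": [7.2, 5.1, 6.3]})

# Join the two datasets on their shared key, then inspect the relationship.
merged = covid.merge(happiness, on="country")
print(merged[["max_infection_rate", "happiness_score"]].corr())
```

The resulting correlation matrix is the kind of summary such a project would then visualize with Matplotlib, e.g. as a scatter plot.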
VerticaPy allows users to perform machine learning and data science tasks using Python directly in the Vertica database. It provides tools for data exploration, preparation, modeling, evaluation and visualization. Models can be built and stored within Vertica for scalable deployment and management. VerticaPy aims to bring analytics to the next level by allowing users to leverage Vertica's in-database capabilities while working with Python.
A Comprehensive Guide to Data Science Technologies.pdf (GeethaPratyusha)
In the fast-paced realm of data science, staying ahead requires a deep understanding of the tools and technologies that drive insights from data. From programming languages to advanced frameworks, the world of data science technologies is vast and dynamic. In this blog, we embark on a comprehensive guide, navigating through the essential tools that empower data scientists to unravel the mysteries hidden within datasets and shape the future of information analysis. For those seeking a structured and immersive learning experience, complementing this tech-centric journey with a well-crafted data science course is the key to unlocking boundless opportunities in this evolving field.
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc... (Sarah Aerni)
Slides from the Pivotal Open Source Hub Meetup
"Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science!"
As the need for data science as a key differentiator grows in all industries, from large corporations to startups, the need to get to results quickly is enabled by sharing ideas and methods in the community. The data science team at Pivotal leverages and contributes to this community of publicly available and open source technologies as part of their practice. We will share the resources we use by highlighting specific toolkits for building models (e.g. MADlib, R) and visualization (e.g. Gephi and Circos) along with their benefits and limitations by sharing examples from Pivotal's data science engagements. At the end of this session we hope to have answered the questions: Where can I get started with Data Science? Which toolkit is most appropriate for building a model with my dataset? How can I visualize my results to have the greatest impact?
Bio: Sarah Aerni is a member of the Pivotal Data Science team with a focus on healthcare and life science. She has a background in the field of Bioinformatics, developing tools to help biomedical researchers understand their data. She holds a B.S. in Biology with a specialization in Bioinformatics and a minor in French Literature from UCSD, and an M.S. and Ph.D. in Biomedical Informatics from Stanford University. During her time as a researcher she focused on the interface between machine learning and biology, building computational models enabling research for a broad range of fields in biomedicine. She also co-founded a start-up providing informatics services to researchers and small companies. At Pivotal she works with customers in life science and healthcare building models to derive insight and business value from their data.
Pandas is an open source Python library that provides high-performance data structures and data analysis tools. It allows users to work with structured and unstructured data, clean and manipulate data sets, and perform complex analyses. The presentation will provide an overview of Pandas functionality, demonstrate how to download and install it, and showcase examples of using Pandas to clean and analyze financial data sets. There will be time for Q&A at the end.
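A minimal sketch of the kind of cleaning step such a presentation might demonstrate, on hypothetical financial data (the column names and values are invented for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical price series with typical problems: a duplicate row for one
# trading day and missing closing prices.
prices = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-02", "2024-01-03",
                            "2024-01-03", "2024-01-04"]),
    "close": [101.5, np.nan, np.nan, 103.0],
})

cleaned = (prices
           .drop_duplicates(subset="date")  # keep one row per trading day
           .set_index("date")
           .ffill())                        # carry the last known price forward
print(cleaned)
```

Forward-filling is one common convention for gaps in price series; other choices (interpolation, dropping rows) may be preferable depending on the analysis.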
Introduction to Big Data: Smart Factory (Jongwook Woo)
Jongwook Woo presents an introduction to big data and smart factories. He discusses his background working with big data technologies and partnerships. The document then covers what big data is, common tools like Hadoop and Spark, and how big data is used in smart factories to collect, analyze and visualize machine data to improve operations. It concludes with a high-level summary of using big data for smart factory applications.
Advanced Analytics and Machine Learning with Data Virtualization (Denodo)
Watch: https://bit.ly/2DYsUhD
Advanced data science techniques, like machine learning, have proven an extremely useful tool for deriving valuable insights from existing data. Platforms like Spark and complex libraries for R, Python and Scala put advanced techniques at the fingertips of data scientists. However, these data scientists spend most of their time looking for the right data and massaging it into a usable format. Data virtualization offers a new alternative that addresses these issues in a more efficient and agile way.
Attend this webinar and learn:
- How data virtualization can accelerate data acquisition and massaging, providing the data scientist with a powerful tool to complement their practice
- How popular tools from the data science ecosystem: Spark, Python, Zeppelin, Jupyter, etc. integrate with Denodo
- How you can use the Denodo Platform with large data volumes in an efficient way
- How Prologis accelerated their use of Machine Learning with data virtualization
This document discusses various tools and technologies used in data science. It covers popular programming languages like Python, R, Java and C++; databases like MySQL, NoSQL, SQL Server and Oracle; data analytics tools like SAS, Tableau, SPSS and Excel; APIs like TensorFlow; servers and frameworks like Hadoop and Spark; and compares SQL and NoSQL databases. It provides details on languages and tools like R, Python, Excel, SAS, SPSS and discusses their uses and popularity in data science.
Bringing Machine Learning and Knowledge Graphs Together
Six Core Aspects of Semantic AI:
- Hybrid Approach
- Data Quality
- Data as a Service
- Structured Data Meets Text
- No Black-box
- Towards Self-optimizing Machines
The document summarizes an Open Data Science Conference and iRODS User Group meeting. It discusses technologies like Julia, Stan, Scikit-learn, Apache Spark, Apache Hadoop, and Apache Hive that were presented. It provides information on keynote speakers and their affiliated companies. The document also lists topics for training workshops and good talks available online. Finally, it summarizes questions asked about iRODS and provides information on implementing data policy rules.
Pentaho Data Integration 4.0 and MySQL (AHMED ENNAJI)
This document provides an overview of Pentaho Data Integration (PDI) version 4 and its support for MySQL. It begins with an introduction to Pentaho as an open source business intelligence suite. It then discusses the key components and features of PDI, including extraction, transformation, loading, and support for over 35 database types. New features in version 4 are highlighted, such as improved visualization, logging, and plugin architecture. The document concludes with a section focused on MySQL support in PDI, including JDBC/ODBC integration and bulk loading jobs.
This document provides an overview of getting started with data science using Python. It discusses what data science is, why it is in high demand, and the typical skills and backgrounds of data scientists. It then covers popular Python libraries for data science like NumPy, Pandas, Scikit-Learn, TensorFlow, and Keras. Common data science steps are outlined including data gathering, preparation, exploration, model building, validation, and deployment. Example applications and case studies are discussed along with resources for learning including podcasts, websites, communities, books, and TV shows.
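The preparation, model building, and validation steps listed above can be sketched as follows; the choice of scikit-learn, the iris toy dataset, and logistic regression are assumptions made for illustration, not details from the document:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)                                # data gathering
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)  # preparation
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)        # model building
print(f"held-out accuracy: {model.score(X_te, y_te):.2f}")       # validation
```

Deployment, the final step listed, would typically mean serializing the fitted model and serving it behind an API, which is out of scope for a sketch this size.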
The document provides an introduction to data mining. It discusses the growth of data from terabytes to petabytes and how data mining can help extract knowledge from large datasets. The document outlines the evolution of sciences from empirical to theoretical to computational and now data-driven. It also describes the evolution of database technology and defines data mining as the process of discovering interesting patterns from large amounts of data. The key steps of the knowledge discovery process are discussed.
GNU Parallel: Lab meeting technical talk (Hoffman Lab)
The document summarizes an upcoming lab meeting technical talk on GNU Parallel, a shell tool for executing jobs in parallel. The talk will cover why GNU Parallel is useful, basic examples and syntax from its tutorial, additional advanced syntax for various tasks, recently added features since 2020, and more examples from the tutorial and the speaker's own use of GNU Parallel.
This document summarizes a new technique and Python package called TCRpower for quantifying the detection power of T-cell receptor sequencing methods using spike-in standards. TCRpower uses a negative binomial model to estimate detection probabilities of target T-cell receptors based on sequencing read counts. It calibrates this model using spike-in controls containing known T-cell receptor sequences added at defined concentrations. Results from applying TCRpower to PCR-based T-cell receptor sequencing data show it can reliably detect clonotypes down to a frequency of 10⁻⁶ but has higher variability for rarer clonotypes below 300 per million RNA. TCRpower improves method selection, optimization and reproducibility for T-cell receptor sequencing.
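To make the model concrete, here is an illustrative negative-binomial detection-probability calculation with SciPy. The parameter values and the mapping onto SciPy's parameterization are assumptions for illustration, not TCRpower's calibrated model:

```python
from scipy.stats import nbinom

# Assumed values: expected read count for a spiked-in receptor, and an
# overdispersion parameter (these would come from calibration in TCRpower).
mean, dispersion = 5.0, 2.0

# SciPy's nbinom(n, p) parameterization: n = dispersion, p = n / (n + mean).
n = dispersion
p = n / (n + mean)

# Detection probability under the model: P(read count >= 1).
p_detect = 1.0 - nbinom.pmf(0, n, p)
print(f"P(detect) = {p_detect:.3f}")
```

Lower spike-in concentrations reduce `mean`, which drives `p_detect` down; this is the mechanism behind the detection-power curves the package produces.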
This document provides an overview of getting started with data science using Python. It discusses what data science is, why it is in high demand, and the typical skills and backgrounds of data scientists. It then covers popular Python libraries for data science like NumPy, Pandas, Scikit-Learn, TensorFlow, and Keras. Common data science steps are outlined including data gathering, preparation, exploration, model building, validation, and deployment. Example applications and case studies are discussed along with resources for learning including podcasts, websites, communities, books, and TV shows.
The document provides an introduction to data mining. It discusses the growth of data from terabytes to petabytes and how data mining can help extract knowledge from large datasets. The document outlines the evolution of sciences from empirical to theoretical to computational and now data-driven. It also describes the evolution of database technology and defines data mining as the process of discovering interesting patterns from large amounts of data. The key steps of the knowledge discovery process are discussed.
Pandas: a high-level, data-centric, Python extension and plotting library
1. LAB MEETING—TECHNICAL TALK
COBY VINER
PYTHON SOFTWARE HIERARCHY
LIB. HIGHLIGHTS
BASIC PANDAS
PANDAS PLOTS
PANDAS USE CASE
PANDAS/R
PANDAS/SQL
REFERENCES
LAB MEETING—TECHNICAL TALK
PANDAS: A HIGH-LEVEL, DATA-CENTRIC, PYTHON
EXTENSION AND PLOTTING LIBRARY
Coby Viner
Hoffman Lab
Thursday, June 18, 2015
2. OVERVIEW
A PYTHON HIERARCHY OF DATA ANALYTICS
Library highlights
SOME BASIC PANDAS
PANDAS PLOTS
PANDAS USE CASE: ML ALG. SUMMARY & PREP. OF PLOTS
PANDAS VS. R
PANDAS VS. SQL
3. A PYTHON HIERARCHY OF DATA ANALYTICS
[Diagram: the scientific-Python stack — Python, NumPy, matplotlib, IPython, SciPy, Pandas, Cython, nose, SymPy, StatsModels, and the SciKits (scikit-learn, scikit-image)]
4. LIBRARY HIGHLIGHTS
A fast and efficient DataFrame object for data manipulation with integrated indexing;
W. McKinney, “Data Structures for Statistical Computing in Python,” in Proceedings of the 9th Python in Science Conference, S. van der Walt and J. Millman, Eds., 2010, pp. 51–56.
Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
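A minimal sketch of the I/O tools (the table contents here are illustrative): a DataFrame can be round-tripped through CSV text, and `read_excel`, `read_sql`, and `read_hdf` follow the same pattern.

```python
import io

import pandas as pd

# Write a small table as CSV text and read it straight back.
df = pd.DataFrame({"gene": ["TP53", "BRCA1"], "score": [0.9, 0.7]})
buf = io.StringIO()
df.to_csv(buf, index=False)   # write CSV
buf.seek(0)
back = pd.read_csv(buf)       # read it back
```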
Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
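A short sketch of label-based alignment (values and labels invented for illustration): arithmetic matches on index labels, labels present on only one side become NaN, and `fillna` tidies the result.

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "z"])
total = a + b              # "x" has no partner in b, so it becomes NaN
cleaned = total.fillna(0.0)  # replace missing values with 0
```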
Flexible reshaping and pivoting of data sets;
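A pivoting sketch (column names and values are made up): a long-format table with one row per observation is reshaped into a wide one with one row per sample and one column per metric.

```python
import pandas as pd

long_df = pd.DataFrame({
    "sample": ["s1", "s1", "s2", "s2"],
    "metric": ["precision", "recall", "precision", "recall"],
    "value": [0.9, 0.8, 0.7, 0.6],
})
# One row per sample, one column per metric.
wide = long_df.pivot(index="sample", columns="metric", values="value")
```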
Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
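The three flavors of subsetting can be sketched on a toy frame (the data is arbitrary): `.loc` slices by label (inclusive on both ends), `.iloc` by position, and a boolean mask filters rows.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2013-01-01", periods=3)
df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=dates, columns=list("ABCD"))
sub = df.loc["2013-01-02":, ["A", "C"]]  # label-based; slice is inclusive
head = df.iloc[0:2]                      # position-based
big = df[df["A"] > 0]                    # boolean subsetting
```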
Columns can be inserted and deleted from data structures for size mutability;
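Size mutability in one toy example (column names arbitrary): assignment inserts a derived column and `del` removes one in place.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df["C"] = df["A"] + df["B"]  # insert a derived column
del df["B"]                  # delete a column in place
```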
Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;
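Split-apply-combine in its smallest form (the chromosome/length table is invented): split rows by a key column, apply an aggregation, and combine the results into one Series.

```python
import pandas as pd

df = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr2", "chr2"],
    "length": [100, 200, 50, 150],
})
# Split by chrom, apply mean, combine into one Series indexed by chrom.
mean_len = df.groupby("chrom")["length"].mean()
```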
High performance merging and joining of data sets;
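Merging behaves like a SQL join; a minimal sketch with invented keys shows the inner vs. outer variants.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b"], "x": [1, 2]})
right = pd.DataFrame({"key": ["b", "c"], "y": [3, 4]})
inner = pd.merge(left, right, on="key")               # matching keys only
outer = pd.merge(left, right, on="key", how="outer")  # keep everything
```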
Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
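A hierarchical-index sketch (labels invented): a two-level MultiIndex stores (chrom, pos) pairs in a 1-D Series, and selecting on the outer level drops a dimension.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("chr1", 100), ("chr1", 200), ("chr2", 100)],
    names=["chrom", "pos"],
)
s = pd.Series([0.1, 0.2, 0.3], index=idx)
chr1 = s.loc["chr1"]  # result is a 1-D Series indexed by pos
```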
Time series functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. [. . . ]
[D]omain-specific time offsets and join time series without losing data;
Highly optimized for performance, with critical code paths written in Cython or C.
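The time-series features above can be sketched in a few lines, using the current method-style API (at the time of this talk the equivalents were module-level functions such as `pd.rolling_mean`); the data is synthetic.

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2013-01-01", periods=6, freq="D")
ts = pd.Series(np.arange(6, dtype=float), index=rng)
smooth = ts.rolling(window=3).mean()  # moving-window statistic
coarse = ts.resample("2D").sum()      # frequency conversion
lagged = ts.shift(1)                  # date shifting/lagging
```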
16. LAB MEETING—
TECHNICAL
TALK
COBY VINER
PYTHON SOFTWARE
HIERARCHY
LIB. HIGHLIGHTS
BASIC PANDAS
PANDAS PLOTS
PANDAS USE CASE
PANDAS/R
PANDAS/SQL
REFERENCES
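The hierarchical-indexing and time-series features quoted above can be sketched briefly. This is a minimal illustration using the modern pandas API (the talk predates some of these method names); all data here is invented.

```python
import numpy as np
import pandas as pd

# Hierarchical (MultiIndex) axis indexing: two index levels on the rows.
idx = pd.MultiIndex.from_product([['bar', 'baz'], ['one', 'two']],
                                 names=['first', 'second'])
s = pd.Series(np.arange(4.0), index=idx)
print(s['bar'])  # partial indexing selects by the outer level

# Time-series functionality: date range generation, frequency
# conversion, moving-window statistics, and date shifting.
ts = pd.Series(np.arange(10.0),
               index=pd.date_range('2013-01-01', periods=10))
weekly = ts.resample('W').mean()       # frequency conversion
rolling = ts.rolling(window=3).mean()  # moving-window statistics
shifted = ts.shift(1, freq='D')        # date shifting/lagging
```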
SOME BASIC PANDAS
Basic new data structures include Series and DataFrame.
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: import matplotlib.pyplot as plt
In [4]: s = pd.Series([1, 3, 5, np.nan])
In [5]: s
Out[5]:
0     1
1     3
2     5
3   NaN
dtype: float64
SOME BASIC PANDAS
In [6]: dates = pd.date_range('20130101', periods=6)
In [7]: dates
Out[7]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03',
               '2013-01-04', '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D', tz=None)
In [8]: df = pd.DataFrame(np.random.randn(6,4),
                          index=dates, columns=list('ABCD'))
In [9]: df
Out[9]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
SOME BASIC PANDAS
In [12]: df2.dtypes
Out[12]:
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
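The df2 queried here is not defined on the slides shown (its definition falls on an omitted slide). A DataFrame producing exactly these dtypes can be reconstructed following the 10 Minutes to pandas tutorial these examples draw on; this is one plausible definition, not necessarily the speaker's:

```python
import numpy as np
import pandas as pd

# A mixed-dtype DataFrame: scalars are broadcast across all rows.
df2 = pd.DataFrame({'A': 1.0,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(['test', 'train', 'test', 'train']),
                    'F': 'foo'})
print(df2.dtypes)
```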
SOME BASIC PANDAS
df.describe()
df.T
df.sort_index(axis=1, ascending=False)
df.sort_values(by='B')  # formerly df.sort(columns='B')
Selection can be done as in NumPy, but new optimized
accessors are provided: .at, .iat, .loc, .iloc and .ix
(the last since deprecated).
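A short sketch of how the four accessors address the same cell, using a frame like the one built earlier (.ix is omitted, as it was later removed from pandas):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

a = df.loc[dates[1], 'A']   # label-based selection
b = df.iloc[1, 0]           # integer-position-based selection
c = df.at[dates[1], 'A']    # optimized label-based scalar access
d = df.iat[1, 0]            # optimized integer-based scalar access
assert a == b == c == d     # all four address the same cell
```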
SOME BASIC PANDAS
In [35]: df.iloc[1:3,:] # slicing rows explicitly
Out[35]:
A B C D
2013-01-02 1.212112 -0.173215 0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804
In [40]: df[df > 0] # where-style boolean selection
Out[40]:
A B C D
2013-01-01 0.469112 NaN NaN NaN
2013-01-02 1.212112 NaN 0.119209 NaN
2013-01-03 NaN NaN NaN 1.071804
2013-01-04 0.721555 NaN NaN 0.271860
2013-01-05 NaN 0.567020 0.276232 NaN
2013-01-06 NaN 0.113648 NaN 0.524988
SOME BASIC PANDAS
In [66]: df.apply(np.cumsum)
Out[66]:
A B C D F
2013-01-01 0.000000 0.000000 -1.509059 5 NaN
2013-01-02 1.212112 -0.173215 -1.389850 10 1
2013-01-03 0.350263 -2.277784 -1.884779 15 3
2013-01-04 1.071818 -2.984555 -2.924354 20 6
2013-01-05 0.646846 -2.417535 -2.648122 25 10
2013-01-06 -0.026844 -2.303886 -4.126549 30 15
In [67]: df.apply(lambda x: x.max() - x.min())
Out[67]:
A 2.073961
B 2.671590
C 1.785291
D 0.000000
F 4.000000
SOME BASIC PANDAS
In [95]: stacked = df2.stack()
In [96]: stacked
Out[96]:
first second
bar one A 0.029399
B -0.542108
two A 0.282696
B -0.087302
baz one A -1.575170
B 1.771208
two A 0.816482
B 1.100230
dtype: float64
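stack() has an inverse, unstack(), and the round trip recovers the original frame. A sketch rebuilding a frame shaped like the one above (values here are random stand-ins for the slide's):

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['bar', 'baz'], ['one', 'two']],
                                   names=['first', 'second'])
df2 = pd.DataFrame(np.random.randn(4, 2), index=index, columns=['A', 'B'])

stacked = df2.stack()         # columns pivoted into an inner index level
restored = stacked.unstack()  # inverse operation recovers df2
```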
PANDAS PLOTS
Everything matplotlib can do, Pandas can do better. . .
It uses matplotlib and permits directly overriding behaviour
via matplotlib’s lower-level functions.
df2 = pd.DataFrame(np.random.rand(10, 4),
                   columns=['a', 'b', 'c', 'd'])
df2.plot(kind='bar');
PANDAS PLOTS
It also has some nice and intuitive sub-plotting features:
df.plot(subplots=True, layout=(2, 3), figsize=(6, 6),
sharex=False)
PANDAS USE CASE: ML ALG. SUMMARY &
PREP. OF PLOTS
for i, group in data.groupby(obj_mapping, axis=0, sort=False):
    ax = group.plot(kind='barh', legend=False)
    ax.set_title(...)
    ax.set_xlabel(<...> obj_n_mapping[i]).title() <...>)
    ax.set_ylabel('Machine learning algorithm')
    ax.set_yticklabels(<list comprehension>)
    ax.xaxis.grid(True, which='both')
    ax.yaxis.grid(False)
    for tic in ax.yaxis.get_major_ticks():
        tic.tick1On = tic.tick2On = False
    patches, labels = ax.get_legend_handles_labels()
    ax.legend(patches[::-1], labels[::-1], loc='upper center',
              bbox_to_anchor=(0.5, -0.1), fancybox=True,
              shadow=True, ncol=5)
PANDAS USE CASE: ML ALG. SUMMARY &
PREP. OF PLOTS
for t_idx, t in enumerate(ax.get_legend().get_texts()):
    <edit various legend items>
for ext in ['pdf', 'pgf']:
    plt.savefig(<path> + ext, bbox_inches='tight')
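The loop above is project-specific pseudocode: data, obj_mapping, obj_n_mapping and the elided pieces belong to the speaker's codebase. A self-contained, runnable sketch of the same grouped-barh pattern, using invented toy data and group names, might look like:

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-ins for the talk's `data` and `obj_mapping`.
metrics = pd.DataFrame(
    {'Accuracy': [91, 88, 85], 'F1 score': [90, 86, 84]},
    index=['Random forests', 'Logistic regression', 'k-NN'])
group_of = {name: 'all algorithms' for name in metrics.index}

# Grouping a DataFrame by a dict maps index labels to group keys.
for key, group in metrics.groupby(group_of, sort=False):
    ax = group.plot(kind='barh', legend=False)
    ax.set_title('Metrics by machine learning algorithm ({})'.format(key))
    ax.set_xlabel('Score (%)')
    ax.set_ylabel('Machine learning algorithm')
    handles, labels = ax.get_legend_handles_labels()
    ax.legend(handles[::-1], labels[::-1], loc='upper center',
              bbox_to_anchor=(0.5, -0.1), ncol=2)
plt.close('all')
```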
[Figure: horizontal bar chart, “Metrics for machine learning algorithm vs. model accuracy, maximizing accuracy”. x-axis: Accuracy (%), 0–100; y-axis: machine learning algorithm (ADA boost, Bagging, k-NN, Logistic regression, Random forests, Linear SVM; NR and R variants); series: Accuracy, Val. Accuracy, Precision, Recall, F1 score.]
[Figure: horizontal bar chart, “Metrics for machine learning algorithm vs. model accuracy, maximizing F1 score”. x-axis: F1 Score (%), 0–100; y-axis: machine learning algorithm (ADA boost, Bagging, k-NN, Logistic regression, Random forests, Linear SVM; NR and R variants); series: Val. F1 score, F1 score, Recall, Precision, Accuracy.]
PANDAS VS. R
Very similar abilities as far as data manipulation is concerned. . .
R data.frame column selections ↔ similar in Pandas via
df.loc; non-contiguous columns via df.iloc[:,
np.r_[:x, y:z]].
R’s aggregate/plyr’s ddply ↔ Pandas’ groupby().
R’s %in% ↔ Pandas’ isin().
R’s tapply() ↔ Pandas’ pivot_table().
R’s subset() ↔ Pandas’ query().
df <- data.frame(a=rnorm(10), b=rnorm(10))
with(df, a + b)
df$a + df$b # same as the previous expression
W. McKinney, Comparison with R / R libraries, 2015.
PANDAS VS. R
df = pd.DataFrame({'a': np.random.randn(10),
                   'b': np.random.randn(10)})
df.eval('a + b')
df.a + df.b # same as the previous expression
plyr data structure mapping:
R           Python
array       list
lists       dictionary or list of objects
data.frame  dataframe
melt (from R’s reshape2) on a data frame can be done the
same way in Pandas via pd.melt. Most other plyr functions
are covered by Pandas’ pivot tables.
W. McKinney, Comparison with R / R libraries, 2015.
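For instance, the melt pattern, adapted from the pandas comparison-with-R documentation (the cheese frame is the doc's toy example):

```python
import pandas as pd

cheese = pd.DataFrame({'first': ['John', 'Mary'],
                       'last': ['Doe', 'Bo'],
                       'height': [5.5, 6.0],
                       'weight': [130, 150]})

# R (reshape2): melt(cheese, id=c("first", "last"))
long_form = pd.melt(cheese, id_vars=['first', 'last'])
```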
PANDAS VS. R
A pivot table example:
df:
     A    B      C  D
0  foo  one  small  1
1  foo  one  large  2
2  foo  one  large  2
3  foo  two  small  3
4  foo  two  small  3
5  bar  one  large  4
6  bar  one  small  5
7  bar  two  small  6
8  bar  two  large  7
pd.pivot_table(df, values='D', index=['A', 'B'],
               columns=['C'], aggfunc=np.sum)
C        large  small
A   B
bar one      4      5
    two      7      6
foo one      4      1
    two    NaN      6
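The same table can also be produced with groupby plus unstack, which makes the tapply analogy above concrete. A sketch reproducing the example:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo'] * 5 + ['bar'] * 4,
    'B': ['one', 'one', 'one', 'two', 'two', 'one', 'one', 'two', 'two'],
    'C': ['small', 'large', 'large', 'small', 'small',
          'large', 'small', 'small', 'large'],
    'D': [1, 2, 2, 3, 3, 4, 5, 6, 7]})

pt = pd.pivot_table(df, values='D', index=['A', 'B'],
                    columns=['C'], aggfunc='sum')
# Equivalent: aggregate on all three keys, then pivot C into columns.
gt = df.groupby(['A', 'B', 'C'])['D'].sum().unstack('C')
```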
PANDAS VS. R
R’s factor is analogous to categorical data in Pandas:
cut(c(1,2,3,4,5,6), 3)
factor(c(1,2,3,2,2,3))
pd.cut(pd.Series([1,2,3,4,5,6]), 3)
pd.Series([1,2,3,2,2,3]).astype("category")
W. McKinney, Comparison with R / R libraries, 2015.
PANDAS VS. SQL
Null checking via notnull() and isnull()
Group by is analogous
Use agg() to pass a dictionary of functions to apply to
particular columns
Conduct joins via join() or merge()
UNION ALL via concat()
UNION via concat(<...>).drop_duplicates()
W. McKinney, Comparison with SQL, 2015.
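To make the mapping concrete, a sketch on a toy tips frame (a four-row, invented stand-in for the dataset used in the pandas comparison-with-SQL docs):

```python
import pandas as pd

tips = pd.DataFrame({'total_bill': [16.99, 10.34, 21.01, 23.68],
                     'tip': [1.01, 1.66, 3.50, 3.31],
                     'smoker': ['No', 'No', 'Yes', 'No'],
                     'day': ['Sun', 'Sun', 'Sat', 'Sun']})

# SQL: SELECT smoker, AVG(tip), COUNT(*) FROM tips GROUP BY smoker;
summary = tips.groupby('smoker').agg({'tip': 'mean', 'day': 'size'})

# SQL: SELECT * FROM tips INNER JOIN days USING (day);
days = pd.DataFrame({'day': ['Sat', 'Sun'], 'is_weekend': [True, True]})
joined = pd.merge(tips, days, on='day', how='inner')

# SQL: UNION ALL of two result sets
both = pd.concat([tips.head(2), tips.tail(2)])
```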
PANDAS VS. SQL
SELECT total_bill, tip, smoker, time
FROM tips
LIMIT 5;
tips[['total_bill', 'tip', 'smoker', 'time']].head(5)
SELECT *
FROM tips
WHERE time = 'Dinner' AND tip > 5.00;
tips[(tips['time'] == 'Dinner') & (tips['tip'] > 5.00)]
W. McKinney, Comparison with SQL, 2015.
REFERENCES
W. McKinney, “Data Structures for Statistical Computing
in Python,” in Proceedings of the 9th Python in Science
Conference, S. van der Walt and J. Millman, Eds., 2010,
pp. 51–56.
——, Comparison with R / R libraries, 2015.
——, Comparison with SQL, 2015.
——, Python for Data Analysis. Sebastopol, Calif.: O’Reilly,
2013.
F. Pedregosa, G. Varoquaux, A. Gramfort, et al.,
“Scikit-learn: Machine learning in Python,” The Journal of
Machine Learning Research, vol. 12, pp. 2825–2830,
2011.
F. Pérez and B. E. Granger, “IPython: A system for
interactive scientific computing,” Computing in Science &
Engineering, vol. 9, no. 3, pp. 21–29, 2007.
E. Jones, T. Oliphant, P. Peterson, et al., SciPy: Open
source scientific tools for Python, 2001–.
S. Behnel, R. Bradshaw, C. Citro, et al., “Cython: The
best of both worlds,” Computing in Science &
Engineering, vol. 13, no. 2, pp. 31–39, 2011.
S. van der Walt, S. C. Colbert, and G. Varoquaux, “The
NumPy array: A structure for efficient numerical
computation,” Computing in Science & Engineering,
vol. 13, no. 2, pp. 22–30, 2011.
J. D. Hunter, “Matplotlib: A 2D graphics environment,”
Computing in Science & Engineering, vol. 9, no. 3,
pp. 90–95, 2007.
M. Harrower and C. A. Brewer, “ColorBrewer.org: An
online tool for selecting colour schemes for maps,” The
Cartographic Journal, vol. 40, no. 1, pp. 27–37, 2003.
W. McKinney, 10 Minutes to pandas — pandas 0.16.2
documentation, 2015.