This document discusses structured data challenges in finance and statistics. It introduces Wes McKinney and his work developing pandas, an open-source Python library designed for working with structured and time series data. Pandas includes data structures like the DataFrame, which allows for fast and flexible data manipulation, indexing, and aggregation of tabular data. The document argues that existing tools are still lacking for working with structured data and that pandas was created to optimize ease-of-use, flexibility, and performance.
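As a rough illustration of the kind of tabular manipulation described here, the following minimal pandas sketch (the tickers, prices, and dates are hypothetical) builds a DataFrame, selects rows by label, and aggregates a column:

    import pandas as pd

    # Hypothetical price records; columns and values are illustrative only.
    df = pd.DataFrame(
        {"ticker": ["AAPL", "MSFT", "AAPL", "MSFT"],
         "price": [185.2, 410.5, 186.0, 412.3]},
        index=pd.to_datetime(["2024-01-02", "2024-01-02",
                              "2024-01-03", "2024-01-03"]),
    )
    day_one = df.loc["2024-01-02"]                     # label-based row selection
    mean_price = df.groupby("ticker")["price"].mean()  # aggregation per ticker
    print(day_one)
    print(mean_price)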
pandas: a Foundational Python Library for Data Analysis and Statistics (Wes McKinney)
Pandas is a Python library for data analysis and manipulation. It provides high performance tools for structured data, including DataFrame objects for tabular data with row and column indexes. Pandas aims to have a clean and consistent API that is both performant and easy to use for tasks like data cleaning, aggregation, reshaping and merging of data.
Data Structures for Statistical Computing in Python (Wes McKinney)
The document discusses statistical data structures in Python. It summarizes that structured arrays are commonly used to store statistical data sets but have limitations. The R data frame is introduced as a flexible alternative that inspired the pandas library in Python. Pandas aims to create intuitive data structures for statistical analysis with labeled axes and automatic data alignment. Its core data structure, the DataFrame, functions similarly to R's data frame.
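A minimal sketch of the automatic data alignment that labeled axes make possible (the labels and numbers are invented): arithmetic between two Series aligns them by index label, filling unmatched labels with NaN:

    import pandas as pd

    s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
    s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])
    # Addition aligns on the labels; "a" and "d" have no match and become NaN.
    print(s1 + s2)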
SciPy 2011 Time Series Analysis in Python (Wes McKinney)
1) The document discusses statsmodels, a Python library for statistical modeling that implements standard statistical models. It includes tools for linear regression, descriptive statistics, statistical tests, time series analysis, and more.
2) The talk provides an overview of using statsmodels for time series analysis, including descriptive statistics, autoregressive moving average (ARMA) models, vector autoregression (VAR) models, and filtering tools.
3) The discussion highlights the development of statsmodels and the need for integrated statistical data structures and user interfaces to make Python more competitive with R for data analysis and statistics.
Pandas is a Python library used for working with structured and time series data. It provides data structures like Series (1D array) and DataFrame (2D tabular structure) that are built on NumPy arrays for fast and efficient data manipulation. Key features of Pandas include fast DataFrame objects with indexing, loading data from different formats, handling missing data, reshaping/pivoting datasets, slicing/subsetting large datasets, and merging/joining data. The document provides an overview of Pandas, why it is useful, its main data structures (Series and DataFrame), and how to create and use them.
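For example, a short sketch of the Series and DataFrame constructors and the missing-data handling mentioned above (the values and the file name in the comment are hypothetical):

    import numpy as np
    import pandas as pd

    s = pd.Series([1.5, np.nan, 3.0], index=["x", "y", "z"])      # 1D labeled array
    df = pd.DataFrame({"a": [1, 2, None], "b": [0.1, 0.2, 0.3]})  # 2D tabular structure
    cleaned = df.dropna()   # drop rows containing missing values
    filled = df.fillna(0)   # or fill them with a default
    # Loading from other formats looks like: pd.read_csv("prices.csv")  # hypothetical file
    print(cleaned)
    print(filled)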
pandas: Powerful data analysis tools for Python (Wes McKinney)
Wes McKinney introduced pandas, a Python data analysis library built on NumPy. Pandas provides data structures and tools for cleaning, manipulating, and working with relational and time-series data. Key features include DataFrame for 2D data, hierarchical indexing, merging and joining data, and grouping and aggregating data. Pandas is used heavily in financial applications and has over 1500 unit tests, ensuring stability and reliability. Future goals include better time series handling and integration with other Python data science packages.
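A brief sketch of the hierarchical indexing plus grouping and aggregation mentioned above (the sector/ticker labels and returns are made up):

    import pandas as pd

    idx = pd.MultiIndex.from_tuples(
        [("tech", "AAPL"), ("tech", "MSFT"), ("energy", "XOM")],
        names=["sector", "ticker"],
    )
    returns = pd.Series([0.012, 0.008, -0.003], index=idx)
    # Aggregate over the outer level of the hierarchical index.
    print(returns.groupby(level="sector").mean())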
Presentation on data preparation with pandas (AkshitaKanther)
Data preparation is the first step after you get your hands on any kind of dataset. This is the step when you pre-process raw data into a form that can be easily and accurately analyzed. Proper data preparation allows for efficient analysis - it can eliminate errors and inaccuracies that could have occurred during the data gathering process and can thus help in removing some bias resulting from poor data quality. Therefore a lot of an analyst's time is spent on this vital step.
This document provides an overview of Python for data analysis using the pandas library. It discusses key pandas concepts like Series and DataFrames for working with one-dimensional and multi-dimensional labeled data structures. It also covers common data analysis tasks in pandas such as data loading, aggregation, grouping, pivoting, filtering, handling time series data, and plotting.
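As a small sketch of the pivoting task listed above (the cities and temperatures are invented), long-format records are pivoted into a date-by-city table:

    import pandas as pd

    long = pd.DataFrame({
        "date": pd.to_datetime(["2024-01-01", "2024-01-01",
                                "2024-01-02", "2024-01-02"]),
        "city": ["Berlin", "Paris", "Berlin", "Paris"],
        "temp": [3.1, 5.4, 2.8, 6.0],
    })
    wide = long.pivot_table(index="date", columns="city", values="temp")
    print(wide)
    # wide.plot() would draw one line per city (requires matplotlib).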
The document summarizes the SciPy 2010 conference which had 187 attendees. The major theme was parallel computing and GPUs, with tutorials on high performance computing, Python concurrency, and GPU programming in Python. There were also sessions on parallel libraries and GPU frameworks like Theano. A minor theme was statistics and data structures, with talks on Pandas, Statsmodels, and N-dimensional data arrays. Other sessions covered bioinformatics, astronomy, machine learning, and Python libraries like NumPy and PyZMQ. Attendees also participated in sprints to contribute to Python scientific computing packages.
Introduction to Data Science: Prerequisites (tidyverse), Import Data (readr), Data Tidying (tidyr): pivot_longer(), pivot_wider(), separate(), unite(); Data Transformation (dplyr - Grammar of Manipulation): arrange(), filter(), select(), mutate(), summarise(); Data Visualization (ggplot - Grammar of Graphics): Column Chart, Stacked Column Graph, Bar Graph, Line Graph, Dual Axis Chart, Area Chart, Pie Chart, Heat Map, Scatter Chart, Bubble Chart. (A rough pandas analogue of the tidying and transformation verbs is sketched below.)
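For readers coming from pandas, a purely illustrative sketch of analogous operations (the column names are hypothetical): melt/pivot roughly correspond to pivot_longer()/pivot_wider(), and query/assign/groupby-agg to filter()/mutate()/summarise():

    import pandas as pd

    df = pd.DataFrame({"id": [1, 2], "q1": [10, 20], "q2": [30, 40]})
    long = df.melt(id_vars="id", var_name="quarter", value_name="sales")   # ~ pivot_longer()
    wide = long.pivot(index="id", columns="quarter", values="sales")       # ~ pivot_wider()
    summary = (long.query("sales > 15")                                    # ~ filter()
                   .assign(sales_k=lambda d: d["sales"] / 1000)            # ~ mutate()
                   .groupby("quarter", as_index=False)["sales_k"].sum())   # ~ summarise()
    print(summary)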
Data engineering and analytics using Python (Purna Chander)
This document provides an overview of data engineering and analytics using Python. It discusses Jupyter notebooks and commonly used Python modules for data science like Pandas, NumPy, SciPy, Matplotlib and Seaborn. It describes the Anaconda distribution and the key features of Pandas, including data loading, structures like DataFrames and Series, and core operations like filtering, mapping, joining, sorting, cleaning and grouping. It also demonstrates data visualization using Seaborn and a machine learning example of linear regression.
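A small sketch of the joining, grouping, and sorting operations mentioned (the tables are made up):

    import pandas as pd

    orders = pd.DataFrame({"order_id": [1, 2, 3],
                           "cust": ["A", "B", "A"],
                           "amount": [9.5, 20.0, 12.5]})
    customers = pd.DataFrame({"cust": ["A", "B"], "region": ["EU", "US"]})
    # SQL-style join on the shared key, then group, aggregate, and sort.
    joined = orders.merge(customers, on="cust", how="inner")
    by_region = joined.groupby("region")["amount"].sum().sort_values(ascending=False)
    print(by_region)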
This document discusses tools for large scale data analysis. It begins by defining business value as anything that makes people more likely to give money or that saves costs. It then discusses how data has outgrown local storage and requires scaling out to clusters and distributed systems. The document lists various systems that can be used for data ingestion, storage, querying, processing and output. It covers batch systems like Hadoop and real-time systems like Storm. It emphasizes that to generate business value, one needs to start analyzing big data from various sources like web logs and sensors, and parse the noise to find signals.
Pandas is a Python library for data analysis and manipulation of structured data. It allows working with time series, grouping data, merging datasets, and performing statistical computations. Pandas provides data structures like Series for 1D data and DataFrame for 2D data that make it easy to reindex, select subsets, and handle missing data. It integrates well with NumPy and Matplotlib for numerical processing and visualization.
ffbase: statistical functions for large datasets (Edwin de Jonge)
This document introduces ffbase, an R package that adds statistical functions and utilities for working with large datasets stored in ff format. ffbase allows standard R code to be used on ff objects by rewriting expressions to operate chunkwise. It also connects ff data to other packages for large data analysis. The goal is to make working with large out-of-memory data more convenient and productive within the R environment.
Introduction to Pandas and Time Series Analysis [PyCon DE] (Alexander Hendorf)
Most data is tied to a period or to some point in time, and we can gain a lot of insight by analyzing what happened when. The better the quality and accuracy of our data, the better our predictions can become.
Unfortunately, the data we have to deal with is often aggregated, for example on a monthly basis, but not all months are the same: they may have 28 or 31 days, four or five weekends, and so on. The data is fitted to our calendar, which was designed to track the Earth orbiting the sun, not to please data scientists.
Dealing with periodical data can be a challenge. This talk will show how you can deal with it using pandas.
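A minimal sketch (with invented data) of how pandas can handle such calendar-bound periods: a daily series is grouped into calendar-month periods and aggregated, with each month keeping its true length:

    import numpy as np
    import pandas as pd

    rng = pd.date_range("2024-01-01", "2024-03-31", freq="D")
    daily = pd.Series(np.random.default_rng(0).normal(size=len(rng)), index=rng)
    periods = daily.index.to_period("M")              # calendar-month periods
    monthly_mean = daily.groupby(periods).mean()
    days_per_month = daily.groupby(periods).size()    # 31, 29, 31 for early 2024
    print(monthly_mean)
    print(days_per_month)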
Introduction to Pandas and Time Series Analysis [Budapest BI Forum] (Alexander Hendorf)
As a senior consultant at the German management consultancy Königsweg, Alexander guides enterprises and institutions through digitalisation and automation change processes.
Alexander has always loved data almost as much as music, so it is no wonder that he organises local meetups and is one of the 25 MongoDB Community Masters.
He loves to share this expertise and engages with the global community as organiser and program chair of the EuroPython conference, and as a speaker and trainer at international conferences such as MongoDB World, EuroPython, CeBIT, and PyData.
Pandas data transformational data structure patterns and challenges (Rajesh M)
The needs and requirements for data transformation technologies, be it big data, machine learning, deep learning, or simple search and reporting, are still maturing because of a fundamental loss of focus on the data structure patterns that can enable them. This presentation is oriented towards that gap.
Pandas is a powerful Python library for data analysis and manipulation. It provides rich data structures for working with structured and time series data easily. Pandas allows for data cleaning, analysis, modeling, and visualization. It builds on NumPy and provides data frames for working with tabular data similarly to R's data frames, as well as time series functionality and tools for plotting, merging, grouping, and handling missing data.
Artificial intelligence and data stream mining (Albert Bifet)
Big Data and Artificial Intelligence have the potential to fundamentally shift the way we interact with our surroundings. The challenge of deriving insights from data streams has been recognized as one of the most exciting and key opportunities for both academia and industry. Advanced analysis of big data streams from sensors and devices is bound to become a key area of artificial intelligence research as the number of applications requiring such processing increases. Dealing with the evolution over time of such data streams, i.e., with concepts that drift or change completely, is one of the core issues in stream mining. In this talk, I will present an overview of data stream mining, industrial applications, open source tools, and current challenges of data stream mining.
This document provides an overview of the Python programming language, including its history, key features, applications, popular uses, and data analysis libraries. It describes Python's origins in the late 1980s, common versions, and naming based on the Monty Python comedy troupe. The document outlines Python's simplicity, open source nature, object orientation, portability, extensive libraries, and popular uses like web development, science/engineering, education, and more. It also lists several major companies and organizations that use Python.
MOA is a framework for online machine learning from data streams. It includes algorithms for classification, regression, clustering and frequent pattern mining that can incorporate data and update models on the fly. MOA is closely related to WEKA and includes tools for evaluating streaming algorithms on data from sensors and IoT devices. It provides an environment for designing and running experiments on streaming machine learning algorithms at massive scales.
This document provides instructions for extracting data from graphs and converting it to a mathematical model using open source software. It describes using g3data to extract numerical data points from an image of a graph, then inputting that data into Eureqa to generate an exponential formula that fits the data points. The goal is to extrapolate the relationship between kinematic viscosity and temperature to higher temperatures than shown in the original graph.
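The document itself shows no code, but the fitting step it describes could look roughly like this SciPy sketch; the extracted (temperature, viscosity) points and the exponential form below are hypothetical:

    import numpy as np
    from scipy.optimize import curve_fit

    temp = np.array([20.0, 40.0, 60.0, 80.0, 100.0])   # degrees C, made-up points
    visc = np.array([1.00, 0.66, 0.47, 0.36, 0.29])    # kinematic viscosity, made up

    def model(t, a, b):
        return a * np.exp(-b * t)   # assumed exponential relationship

    (a, b), _ = curve_fit(model, temp, visc, p0=(1.0, 0.01))
    # Extrapolate beyond the range shown in the original graph.
    print(model(150.0, a, b))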
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI) (Serban Tanasa)
1) The document provides a quick guide to using data.table in R and Pentaho Data Integration (PDI) for fast data loading and manipulation. It discusses benchmarks showing data.table is 2-20x faster than traditional methods for reading, ordering, and transforming large data.
2) The outline discusses how to use basic data.table functions for speed gains and to overcome R's scaling limitations. It also provides a very brief overview of PDI's capabilities for Extract/Transform/Load (ETL) workflows without writing code.
3) The benchmarks section shows data.table is up to 500% faster than traditional R methods for reading large CSV files and orders of magnitude faster for sorting and aggregating.
Visual analysis of high-volume time series data is ubiquitous in many industries, including finance, banking, and discrete manufacturing. Contemporary, RDBMS-based systems for visualization of high-volume time series data have difficulty coping with the hard latency requirements of interactive visualizations and dissipate a lot of expensive network bandwidth. Current solutions for lowering the volume of time series data disregard the properties of the resulting visualization and achieve only poor visualization quality.
In this work, we introduce M4, a simple aggregation-based time series dimensionality reduction technique that is superior to existing approaches in that it provides lower visualization errors at higher data reduction ratios. Focusing on the semantics of line charts, as the predominant form of time-series visualization, we explain in detail why current data reduction techniques fail and how our approach achieves superiority by respecting the process of line rasterization. We describe how to incorporate the proposed aggregation model at the query level in a visualization-driven query-rewriting system. Our approach is generic and applicable to any visualization system that relies on relational data sources. Using real-world data sets from high-tech manufacturing, stock markets, and engineering domains, we demonstrate that our visualization-oriented data aggregation can reduce data volumes by up to two orders of magnitude while preserving perfect visualizations.
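A rough, purely illustrative sketch of the idea (not the authors' implementation): bucket the series into one group per pixel column and keep only the first, last, minimum, and maximum point of each group:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(1)
    df = pd.DataFrame({"t": np.arange(100_000),                    # hypothetical series
                       "v": np.cumsum(rng.normal(size=100_000))})

    width_px = 800                                                 # target chart width
    df["px"] = df["t"] * width_px // (df["t"].max() + 1)           # pixel-column bucket
    grouped = df.groupby("px")
    keep = pd.unique(np.concatenate([
        grouped["t"].idxmin().to_numpy(),   # first point per column
        grouped["t"].idxmax().to_numpy(),   # last point per column
        grouped["v"].idxmin().to_numpy(),   # minimum value per column
        grouped["v"].idxmax().to_numpy(),   # maximum value per column
    ]))
    reduced = df.loc[keep].sort_values("t")
    print(len(df), "->", len(reduced))      # roughly two orders of magnitude smaller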
Introduction to the R Statistical Computing Environment (izahn)
Get an introduction to R, the open-source system for statistical computation and graphics. With hands-on exercises, learn how to import and manage datasets, create R objects, and conduct basic statistical analyses. Full workshop materials can be downloaded from http://projects.iq.harvard.edu/rtc/event/introduction-r
Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs (Jason Riedy)
Graph-structured data in network security, social networks, finance, and other applications are not only massive but also under continual evolution. The changes often are scattered across the graph, permitting novel parallel and incremental analysis algorithms. We discuss analysis algorithms for streaming graph data to maintain both local and global metrics with low latency and high efficiency.
Documenting SACS compliance with Microsoft Excel (Sandra Nicks)
This document describes how to use Microsoft Excel to document an institution's compliance with SACS accreditation standards using queries and pivot tables. It involves five steps: 1) Setting up initial worksheets with faculty, course, and schedule data; 2) Creating a query to combine the data; 3) Creating pivot tables to analyze full-time faculty and terminal degree percentages by academic program; 4) Populating a report table with the results; 5) Updating the report table with new semesters' data. The process allows quick updates to track compliance over time by pulling from the underlying data.
This document provides an overview of MS Access and database design. It discusses key concepts like relational databases, tables, records, and fields. It also outlines the steps to create tables and define fields, add additional tables, create queries, forms and reports, and use templates to design a database in MS Access. The goal is to organize data without duplication and ensure consistency through techniques like normalization.
The document discusses practical data visualization and provides examples of different types of visualizations and the tools used to create them. It covers why visualization is important, such as to preserve complexity and evaluate data quality. It also discusses different visualization types like one-dimensional, planar, temporal, multidimensional, hierarchical, and network visualizations. Additionally, it discusses showing uncertainty in data and demonstrates various visualization tools like JMP, Tableau, and others.
From Artwork to Cyber Attacks: Lessons Learned in Building Knowledge Graphs u... (Craig Knoblock)
Over the last few years we have been building domain-specific knowledge graphs for a variety of real-world problems, including creating virtual museums, combating human trafficking, identifying illegal arms sales, and predicting cyber attacks. We have developed a variety of techniques to construct such knowledge graphs, including techniques for extracting data from online sources, aligning the data to a domain ontology, and linking the data across sources. In this talk I will present these techniques and describe our experience in applying Semantic Web technologies to build knowledge graphs for real-world problems.
The document discusses the design of a database for a university to track student club participation. A design team determined that tables were needed to track clubs, students, club memberships, and club events. The team defined the fields for each table, including primary keys. Examples of normalized database tables are also provided, along with explanations of 1st, 2nd, and 3rd normal forms. Additional database topics like data types, file-based systems, and database security are also briefly covered.
Data Mining mod1 ppt.pdf bca sixth semester notes (asnaparveen414)
1. Data mining involves the automated analysis of large datasets to discover patterns and relationships. It has grown in importance due to the massive growth in data from various sources like business, science, and social media.
2. A typical data mining system includes components for data cleaning, data transformation, pattern evaluation, and knowledge presentation from datasets in databases or data warehouses. Data mining algorithms are applied to extract useful patterns.
3. Data mining draws from multiple disciplines including database technology, statistics, machine learning, and visualization. It aims to discover knowledge from data that is too large for traditional data analysis methods to handle effectively.
This document contains notes from a database fundamentals class taught by Eng. Javier Daza on April 8, 2024. The notes cover the history and evolution of databases, definitions of databases, types of databases including relational and NoSQL, and characteristics and advantages of databases. The class included activities on database history, a pre-test quiz, and a discussion of the top Gartner technology trends and technologies from CES 2024. The goal of the class was to provide context on relational databases by exploring related topics.
Beyond Linked Data - Exploiting Entity-Centric Knowledge on the Web (Stefan Dietze)
This document discusses enabling discovery and search of linked data and knowledge graphs. It presents approaches for dataset recommendation including using vocabulary overlap and existing links between datasets. It also discusses profiling datasets to create topic profiles using entity extraction and ranking techniques. These recommendation and profiling approaches aim to help with discovering relevant datasets and entities for a given topic or task.
EE184405 Statistics and Stochastics - Descriptive Statistics 1: Graphs (yusufbf)
Statistics is a field of science that studies how to collect data so that it can be described and processed, and then how to perform induction/inference in order to draw conclusions, so that decisions can be made on the basis of the available data.
DATA ==> STATISTICAL PROCESSING ==> INFORMATION
Descriptive statistics is a way of describing a problem based on the available data, namely by arranging the data in such a way that its characteristics can be easily understood, making it useful for subsequent purposes.
The document discusses creating tables and enforcing data integrity in SQL. It describes creating a user-defined datatype to resolve inconsistencies between table structures. It also explains different types of data integrity such as entity integrity, domain integrity, referential integrity, and user-defined integrity. Constraints like primary key, unique, foreign key, check, and default can be used to enforce these integrity rules when creating or altering tables. The tasks involve creating tables with various constraints to define rules for the newspaper and news ad tables.
The document provides an introduction to Eyeriss, an energy-efficient reconfigurable accelerator for deep convolutional neural networks (CNNs). Some key points:
- Eyeriss uses a row stationary dataflow that reduces energy costs compared to other dataflows like weight stationary and output stationary.
- It has a 4-level memory hierarchy from DRAM to register files to minimize data movement costs.
- A network-on-chip and multicast/point-to-point delivery allow single-cycle data delivery between components.
- Compression techniques like run-length compression are used to further reduce data movement costs.
This document provides an overview of the CS639: Data Management for Data Science course. It discusses that data science is becoming increasingly important as more fields utilize data-driven approaches. The course will teach students the basics of managing and analyzing data to obtain useful insights. It will cover topics like data storage, predictive analytics, data integration, and communicating findings. The goal is for students to learn fundamental concepts and design data science workflows and pipelines. The course will include lectures, programming assignments, a midterm, and final exam.
This document summarizes research on analyzing web search logs and modeling user behavior. It discusses:
1) Three query corpora that were analyzed from different sites, including academic, health, and search engine logs.
2) The data models used to structure the log data and methods for identifying search sessions and clustering queries.
3) Techniques for modeling user behaviors like popular queries, document access patterns, and analyzing search sessions to understand information needs and search strategies.
4) Methods for clustering queries both quantitatively based on session variables and qualitatively by analyzing queries at the conceptual and semantic levels.
Retrieval, Crawling and Fusion of Entity-centric Data on the Web (Stefan Dietze)
Stefan Dietze gave a keynote presentation covering three main topics:
1) Challenges in entity retrieval from heterogeneous linked datasets and knowledge graphs due to diversity and lack of standardization.
2) Approaches for enabling discovery and search through dataset recommendation, profiling, and entity retrieval methods that cluster entities to address link sparsity.
3) Going beyond linked data to exploit semantics embedded in web markup, with case studies in data fusion for entity reconciliation and retrieval.
This document provides an introduction to data mining concepts and techniques. It discusses why data mining is needed due to the massive growth of data, defines data mining as the extraction of patterns from large data sets, and outlines the data mining process. A variety of data types that can be mined are described, including relational, transactional, time-series, text and web data. The document also covers major data mining functionalities like classification, clustering, association rule mining and trend analysis. Top 10 popular data mining algorithms are listed.
This document provides an introduction to data mining concepts and techniques. It discusses why data mining is needed due to the massive growth of data. It defines data mining as the extraction of interesting patterns from large datasets. The document outlines the key steps in the knowledge discovery process and how data mining fits within business intelligence applications. It also describes different types of data that can be mined and popular data mining algorithms.
This document provides an agenda and materials for a one-day workshop on qualitative data analysis. The workshop will include two exercises. The first involves selecting quotes, assigning codes, and creating memos from narrative data. The second uses grounded theory methods to map themes, quotes and codes from the data. The workshop aims to teach participants tools for analyzing text, documents and images within and across different settings.
This document discusses data science, data, and dashboards. It provides an overview of different types of data science questions including descriptive, exploratory, inferential, predictive, causal, and mechanistic. It also outlines the typical data science process from collecting and preparing data to building mathematical models and deploying them. Additionally, it covers different types of structured and unstructured data as well as considerations for big data. Finally, guidelines are presented for building effective dashboards, including focusing on a single screen, using space effectively, providing context, and highlighting important information without clutter.
Similar to Structured Data Challenges in Finance and Statistics (20)
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci... (Wes McKinney)
The document discusses the future of composable data systems and provides an overview from Wes McKinney. Some key points:
- Composable data systems are designed to be modular and reusable across different components through open standards and protocols. This allows new engines to be developed more easily.
- The data landscape is shifting to an era of composability, where monolithic systems will be replaced by modular, reusable pieces.
- Areas of focus for composable systems include execution engines, query interfaces, storage protocols, and optimization.
- Projects like Apache Arrow, Ibis, Substrait, and modular engines like DuckDB, DataFusion, and Velox are moving the industry toward composability.
Solving Enterprise Data Challenges with Apache Arrow (Wes McKinney)
This document discusses Apache Arrow, an open-source library that enables fast and efficient data interchange and processing. It summarizes the growth of Arrow and its ecosystem, including new features like the Arrow C++ query engine and Arrow Rust DataFusion. It also highlights how enterprises are using Arrow to solve challenges around data interoperability, access speed, query performance, and embeddable analytics. Case studies describe how companies like Microsoft, Google Cloud, Snowflake, and Meta leverage Arrow in their products and platforms. The presenter promotes Voltron Data's enterprise subscription and upcoming conference to support business use of Apache Arrow.
Apache Arrow: High Performance Columnar Data Framework (Wes McKinney)
- Apache Arrow is an open-source project that provides a shared data format and library for high performance data analytics across multiple languages. It aims to unify database and data science technology stacks.
- In 2021, Ursa Labs joined forces with GPU-accelerated computing pioneers to form Voltron Data, continuing development of Apache Arrow and related projects like Arrow Flight and the Arrow R package.
- Upcoming releases of the Arrow R package will bring additional query execution capabilities like joins and window functions to improve performance and efficiency of analytics workflows in R.
Apache Arrow Flight: A New Gold Standard for Data Transport (Wes McKinney)
This document discusses how structured data is often moved inefficiently between systems, causing waste. It introduces Apache Arrow, an open standard for in-memory data, and how Arrow can help make data movement more efficient. Systems like Snowflake and BigQuery are now using Arrow to help speed up query result fetching by enabling zero-copy data transfers and sharing file formats between query processing and storage.
ACM TechTalks: Apache Arrow and the Future of Data Frames (Wes McKinney)
Wes McKinney gave a talk on Apache Arrow and the future of data frames. He discussed how Arrow aims to standardize columnar data formats and reduce inefficiencies in data processing. It defines an efficient binary format for transferring data between systems and programming languages. As more tools support Arrow natively, it will become more efficient to process data directly in Arrow format rather than converting between data structures. Arrow is gaining adoption in popular data tools like Spark, BigQuery, and InfluxDB to improve performance.
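As a small illustration (not from the talk itself) of working with data in Arrow format, a pandas DataFrame can be converted to an Arrow table and written to the Feather/Arrow IPC format so that another process or language can read it with minimal copying; the file name below is hypothetical:

    import pandas as pd
    import pyarrow as pa
    import pyarrow.feather as feather

    df = pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})   # made-up data
    table = pa.Table.from_pandas(df)    # columnar, language-independent representation
    feather.write_feather(table, "example.feather")                  # hypothetical file name
    # Another process (Python, R, C++, ...) can read this with little or no copying.
    round_tripped = feather.read_table("example.feather").to_pandas()
    print(round_tripped)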
Apache Arrow: Present and Future @ ScaledML 2020 (Wes McKinney)
This document discusses Apache Arrow, an open source project that provides cross-language data structures and algorithms for efficient data analytics. It summarizes the history and goals of Arrow, provides examples of how it has been adopted, and outlines ongoing development initiatives. Key points include that Arrow aims to accelerate data processing by standardizing columnar data formats and protocols, it has seen widespread adoption with over 50M installs in 2019, and active areas of work include the C++ development platform and Arrow Flight RPC framework.
PyCon Colombia 2020 - Python for Data Analysis: Past, Present, and Future (Wes McKinney)
Wes McKinney gave a presentation on the past, present, and future of Python for data analysis. He discussed the origins and development of pandas over the past 12 years from the first open source release in 2009 to the current state. Key points included pandas receiving its first formal funding in 2019, its large community of contributors, and factors driving Python's growth for data science like its package ecosystem and education. McKinney also addressed early concerns about Python and looked to the future, highlighting projects like Apache Arrow that aim to improve performance and interoperability.
Apache Arrow: Leveling Up the Analytics Stack (Wes McKinney)
This document discusses the development of Apache Arrow, an open source in-memory data format designed for efficient analytical data processing on modern hardware. It provides a brief history of big data and analytics technologies leading to the need for Arrow. Key points about Arrow include that it aims to eliminate data serialization, enable code sharing across languages, and has over 400 contributors representing 11 programming languages. Notable subcomponents include DataFusion, Gandiva, and Plasma; and development is supported by organizations like Ursa Labs.
Apache Arrow Workshop at VLDB 2019 / BOSS Session (Wes McKinney)
A technical deep dive for database system developers into the Arrow columnar format, binary protocol, C++ development platform, and Arrow Flight RPC.
See demo Jupyter notebooks at https://github.com/wesm/vldb-2019-apache-arrow-workshop
Apache Arrow: Leveling Up the Data Science Stack (Wes McKinney)
Ursa Labs builds cross-language libraries like Apache Arrow for data science. Arrow provides a columnar data format and utilities for efficient serialization, IO, and querying across programming languages. Ursa Labs contributes to Arrow and funds open source developers to grow the Arrow ecosystem. Their goal is to reduce the CPU time spent on data serialization and enable faster data analysis in languages like R.
This document discusses the history and development of Python data analysis tools, including pandas. It covers Wes McKinney's work on pandas from 2008 to the present, including the motivations for making data analysis easier and more productive. It also summarizes the development of related projects like Apache Arrow for standardizing columnar data representations to improve code reuse across languages.
Apache Arrow at DataEngConf Barcelona 2018 (Wes McKinney)
Wes McKinney is a leading open source developer who created Python's pandas library and now leads the Apache Arrow project. Apache Arrow is an open standard for in-memory analytics that aims to improve data sharing and reuse across systems by defining a common columnar data format and memory layout. It allows data to be accessed and algorithms to be reused across different programming languages with near-zero data copying. Arrow is being integrated into various data systems and is working to expand its computational libraries and language support.
Apache Arrow: Cross-language Development Platform for In-memory Data (Wes McKinney)
Apache Arrow is an open standard for in-memory columnar data and an analytical data processing platform. It aims to simplify system architectures, improve interoperability between systems, and enable data and algorithms to be reused across different programming languages. Arrow provides a portable in-memory data format and computational libraries to build analytical data processing systems. It is language-independent and supports data sharing and algorithm reuse between libraries and processes via shared memory with near-zero overhead.
Apache Arrow -- Cross-language development platform for in-memory data (Wes McKinney)
Wes McKinney is the creator of Python's pandas project and a primary developer of Apache Arrow, Apache Parquet, and other open-source projects. Apache Arrow is an open-source cross-language development platform for in-memory analytics that aims to improve data science tools. It provides a shared standard for memory interoperability and computation across languages through its columnar memory format and libraries. Apache Arrow has growing adoption in data science systems and is working to expand language support and computational capabilities.
Shared Infrastructure for Data Science (Wes McKinney)
Wes McKinney discussed the evolution of data science tools and infrastructure over the past 10 years and a vision for the next 10 years. He argued that current data science languages like Python, R, and Julia operate in "silos" with separate implementations for data storage, processing, and analytics. However, new projects like Apache Arrow aim to break down these silos by establishing shared standards for in-memory data formats and interchange that can unite the implementations across languages. Arrow provides a portable data frame format, zero-copy interchange capabilities, and potential for high performance data access and flexible computation engines. This would allow data science work to be more portable across programming languages while improving performance.
Data Science Without Borders (JupyterCon 2017) (Wes McKinney)
Talk about building shared, language-agnostic computational infrastructure for data science. Discusses the motivation and work that's happening in the Apache Arrow project to help (http://arrow.apache.org)
Memory Interoperability in Analytics and Machine Learning (Wes McKinney)
Wes McKinney gave a talk on Apache Arrow, an open source project for memory interoperability between analytics and machine learning systems. Arrow provides efficient columnar memory structures and zero-copy sharing of data between applications. It defines common data types and schemas that can be used across programming languages. Arrow is implemented in C++ and provides language bindings for other languages like Python. It aims to improve performance for tasks like data loading, preprocessing, modeling and serving. Projects like pandas, Spark and Ray are exploring using Arrow internally for more efficient data handling.
OpenID AuthZEN Interop Read Out - Authorization (David Brossard)
During Identiverse 2024 and EIC 2024, members of the OpenID AuthZEN WG got together and demoed their authorization endpoints conforming to the AuthZEN API.
How to Get CNIC Information System with Paksim Ga.pptx (danishmna97)
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
GraphRAG for Life Science to increase LLM accuracy (Tomaz Bratanic)
GraphRAG for the life science domain, where you retrieve information from biomedical knowledge graphs using LLMs to increase the accuracy and performance of generated answers.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices want to take full advantage of the features available on those devices, but many of those features provide convenience and capability while sacrificing security. This best practices guide outlines steps users can take to better protect personal devices and information.
Skybuffer SAM4U tool for SAP license adoption (Tatiana Kojar)
Manage and optimize your license adoption and consumption with SAM4U, a free SAP software asset management tool for customers.
SAM4U, an SAP complimentary software asset management tool for customers, delivers a detailed and well-structured overview of license inventory and usage with a user-friendly interface. We offer a hosted, cost-effective, and performance-optimized SAM4U setup in the Skybuffer Cloud environment. You retain ownership of the system and data, while we manage the ABAP 7.58 infrastructure, ensuring fixed Total Cost of Ownership (TCO) and exceptional services through the SAP Fiori interface.
Project Management Semester Long Project - Acuity (jpupo2018)
Acuity is an innovative learning app designed to transform the way you engage with knowledge. Powered by AI technology, Acuity takes complex topics and distills them into concise, interactive summaries that are easy to read & understand. Whether you're exploring the depths of quantum mechanics or seeking insight into historical events, Acuity provides the key information you need without the burden of lengthy texts.
Monitoring and Managing Anomaly Detection on OpenShift.pdf (Tosin Akinosho)
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
What do a Lego brick and the XZ backdoor have in common? (Speck&Tech)
ABSTRACT: At first glance, a Lego brick and the XZ backdoor might appear to have in common only the fact that they are both building blocks, or dependencies, of creative and software projects. In reality, a Lego brick and the XZ backdoor case have much more in common than that.
Join the presentation to dive into a story of interoperability, standards and open formats, and then discuss the important role contributors play in a sustainable open source community.
BIO: An advocate of free software and of standard, open formats. She has been an active member of the Fedora and openSUSE projects and co-founded the LibreItalia Association, where she has been involved in several LibreOffice-related events, migrations, and training activities. She previously worked on LibreOffice migrations and training courses for various public administrations and private organisations. Since January 2020 she has worked at SUSE as a Software Release Engineer for Uyuni and SUSE Manager, and when she is not pursuing her passion for computers and for Geeko, she cultivates her curiosity about astronomy (which is where her nickname deneb_alpha comes from).
Driving Business Innovation: Latest Generative AI Advancements & Success StorySafe Software
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxSitimaJohn
Ocean Lotus cyber threat actors represent a sophisticated, persistent, and politically motivated group that poses a significant risk to organizations and individuals in the Southeast Asian region. Their continuous evolution and adaptability underscore the need for robust cybersecurity measures and international cooperation to identify and mitigate the threats posed by such advanced persistent threat groups.
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfMalak Abu Hammad
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on integration of Salesforce with Bonterra Impact Management.
Interested in deploying an integration with Salesforce for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
UiPath Test Automation using UiPath Test Suite series, part 6DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
UiPath Test Automation with generative AI and Open AI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with Open AI advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Digital Marketing Trends in 2024 | Guide for Staying AheadWask
https://www.wask.co/ebooks/digital-marketing-trends-in-2024
Feeling lost in the digital marketing whirlwind of 2024? Technology is changing, consumer habits are evolving, and staying ahead of the curve feels like a never-ending pursuit. This e-book is your compass. Dive into actionable insights to handle the complexities of modern marketing. From hyper-personalization to the power of user-generated content, learn how to build long-term relationships with your audience and unlock the secrets to success in the ever-shifting digital landscape.
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Structured Data Challenges in Finance and Statistics
1. Structured Data Challenges in Finance and Statistics
Wes McKinney
Rice Statistics, 21 November 2011
Wes McKinney () Structured data challenges Rice Statistics 1 / 43
2. Me
S.B., MIT Math ’07
3 years in the quant finance business
Now: starting a software company, initially to build financial data analysis and research systems
My blog: http://blog.wesmckinney.com
Twitter: @wesmckinn
Book! “Python for Data Analysis”, to hit the shelves later next year from O’Reilly Media
Wes McKinney () Structured data challenges Rice Statistics 2 / 43
3. Structured data
cname year agefrom ageto ls lsc pop ccode
0 Australia 1950 15 19 64.3 15.4 558 AUS
1 Australia 1950 20 24 48.4 26.4 645 AUS
2 Australia 1950 25 29 47.9 26.2 681 AUS
3 Australia 1950 30 34 44 23.8 614 AUS
4 Australia 1950 35 39 42.1 21.9 625 AUS
5 Australia 1950 40 44 38.9 20.1 555 AUS
6 Australia 1950 45 49 34 16.9 491 AUS
7 Australia 1950 50 54 29.6 14.6 439 AUS
8 Australia 1950 55 59 28 12.9 408 AUS
9 Australia 1950 60 64 26.3 12.1 356 AUS
Wes McKinney () Structured data challenges Rice Statistics 3 / 43
4. Partial list of structured data necessities
Table modification: column insertion/deletion/type changes
Rich axis indexing, metadata
Easy data alignment
Aggregation and transformation by group (“group by”)
Missing data (NA) handling
Pivoting and reshaping
Merging and joining
Time series-specific manipulations
Fast Input/Output: text files, databases, HDF5, ...
Wes McKinney () Structured data challenges Rice Statistics 4 / 43
5. Are existing tools good enough?
We care nearly equally about
Ease-of-use (syntax / API fits your mental model)
Expressiveness
Performance (speed and memory usage)
Clean, consistent interface design is hard
Wes McKinney () Structured data challenges Rice Statistics 5 / 43
6. Auxiliary concerns
Any tool needs to integrate well with:
Statistical modeling tools
Data visualization (plotting)
Target users
Computer scientists, statisticians, software engineers?
Data scientists?
Wes McKinney () Structured data challenges Rice Statistics 6 / 43
7. Are existing tools good enough?
The typical players
R data.frame and friends + CRAN libraries
SQL and other relational databases
Python / NumPy: structured (record) arrays
Commercial products: SAS, Stata, MS Excel...
My conclusion: we still have a ways to go
R has become demonstrably better in the last 5 years (e.g. via plyr, reshape2)
Wes McKinney () Structured data challenges Rice Statistics 7 / 43
8. Deeper problems in many industries
Facilitating the research process is only part of the problem
Much of academia: “Production systems?”
Industry: a wasteland of misshapen wheels or expensive vendor products
Explosive growth in data-driven production systems
Hybrid-language systems are not always a good idea
Wes McKinney () Structured data challenges Rice Statistics 8 / 43
9. The big data conundrum
Great effort being invested in the (difficult) problem of large-scale data processing, e.g. MapReduce-based
Less effort in the fundamental tooling for data manipulation / preparation / integration
Single-node performance does matter
Single-node code development time matters too
Wes McKinney () Structured data challenges Rice Statistics 9 / 43
10. pandas: my effort in this arena
Pick your favorite: panel data structures or Python structured data analysis
Started building in April 2008 back at AQR Capital
Open-sourced (BSD license) mid-2009
Heavily tested, being used by many companies (including lots of financial firms) as the cornerstone of their systems
Goal: optimal balance of ease-of-use, flexibility, and performance
Heavy development the last 6 months
Wes McKinney () Structured data challenges Rice Statistics 10 / 43
11. Why did I start from scratch?
Accusations of NIH Syndrome abound
In 2008 I simultaneously needed
Agile, high performance data structures
A high productivity programming language for implementing all of the non-computational business logic
A production application platform that would seamlessly integrate with an interactive data analysis / research platform
In short, I was rebuilding major financial systems and I found my options inadequate
Thrilling innovation opportunity!
Wes McKinney () Structured data challenges Rice Statistics 11 / 43
12. Why did I use Python?
High productivity general purpose language
Well thought-out object-oriented model
Excellent software-development tools
Easy for MATLAB/R users to learn
Flexible built-in data structures (dicts, sets, lists, tuples)
The right open-source scientific computing tools
Powerful array processing (NumPy)
Abundant tools for performance computing
Wes McKinney () Structured data challenges Rice Statistics 12 / 43
13. But, Python is not perfect
For statistical computing, a chicken-and-egg problem
Python’s plotting libraries are not designed for statistical graphics
Built-in data structures are not especially optimized for my large data use cases
Occasional semantic / syntactic niggles
Wes McKinney () Structured data challenges Rice Statistics 13 / 43
14. Partial list of structured data necessities
Table modification: column insertion/deletion/type changes
Rich axis indexing, metadata
Easy data alignment
Aggregation and transformation by group (“group by”)
Missing data (NA) handling
Pivoting and reshaping
Merging and joining
Time series-specific manipulations
Fast Input/Output: text files, databases, HDF5, ...
Wes McKinney () Structured data challenges Rice Statistics 14 / 43
15. DataFrame, the pandas workhorse
A 2D tabular data structure with row and column indexes
Fast for row- and column-oriented operations
Supports heterogeneous columns WITHOUT sacrificing performance in the homogeneous (e.g. floating point only) case
Wes McKinney () Structured data challenges Rice Statistics 15 / 43
16. DataFrame
cname year agefrom ageto ls lsc pop ccode
0 Australia 1950 15 19 64.3 15.4 558 AUS
1 Australia 1950 20 24 48.4 26.4 645 AUS
2 Australia 1950 25 29 47.9 26.2 681 AUS
3 Australia 1950 30 34 44 23.8 614 AUS
4 Australia 1950 35 39 42.1 21.9 625 AUS
5 Australia 1950 40 44 38.9 20.1 555 AUS
6 Australia 1950 45 49 34 16.9 491 AUS
7 Australia 1950 50 54 29.6 14.6 439 AUS
8 Australia 1950 55 59 28 12.9 408 AUS
9 Australia 1950 60 64 26.3 12.1 356 AUS
Wes McKinney () Structured data challenges Rice Statistics 16 / 43
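A minimal sketch (not from the original slides) of how a frame like the one above can be built or loaded with the public pandas API; the file name is hypothetical:
import pandas as pd

# construct a small frame by hand with heterogeneous column types
df = pd.DataFrame({
    'cname': ['Australia', 'Australia', 'Australia'],
    'year': [1950, 1950, 1950],
    'agefrom': [15, 20, 25],
    'ageto': [19, 24, 29],
    'ls': [64.3, 48.4, 47.9],
    'lsc': [15.4, 26.4, 26.2],
    'pop': [558, 645, 681],
    'ccode': ['AUS', 'AUS', 'AUS'],
})

# or load the full data set from a delimited text file (hypothetical path)
# df = pd.read_csv('lifesat.csv')

print(df.dtypes)   # mix of object, int64, and float64 columns
print(df.head())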
17. Axis indexing and metadata
Basic concept: labeled axes in use throughout the library
Need to support
Fast lookups (constant time)
Data realignment / selection by labels (linear)
Munging together irregularly indexed data
Key innovation: the index is a data structure itself; different implementations can support more sophisticated indexing
Axis labels can be any immutable Python object
Wes McKinney () Structured data challenges Rice Statistics 17 / 43
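A small illustration of the labeled-index idea, using only the public pandas API (a sketch, not code from the talk):
import numpy as np
import pandas as pd

idx = pd.Index(['d', 'a', 'b', 'c', 'e'])   # labels can be any immutable Python object

# constant-time label -> integer location lookup
print(idx.get_loc('b'))          # 2

# realignment / selection by label; labels absent from the data become NaN
s = pd.Series(np.arange(5.0), index=idx)
print(s.reindex(['a', 'b', 'z']))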
18. Irregularly indexed data
[Diagram: a DataFrame drawn as a grid, with labeled Columns across the top and a labeled row INDEX down the left-hand side]
Wes McKinney () Structured data challenges Rice Statistics 18 / 43
19. Axis indexing
Axis index: labels map to integer positions, e.g. d → 0, a → 1, b → 2, c → 3, e → 4
Wes McKinney () Structured data challenges Rice Statistics 19 / 43
20. Why does this matter?
Real world data is highly irregular, especially time series
Operations between DataFrame objects automatically align on the
indexes
Nearly impossible to have errors due to misaligned data
Can vastly facilitate munging unstructured data into structured form
Grants immense freedom in writing research code
Time series are just a special case of a general indexed data structure
Wes McKinney () Structured data challenges Rice Statistics 20 / 43
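A sketch of the automatic alignment described above: arithmetic between differently indexed objects aligns on the union of the labels and marks missing entries as NA.
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
s2 = pd.Series([10.0, 20.0, 30.0], index=['b', 'c', 'd'])

# result is indexed by the union of labels; 'a' and 'd' have no overlap -> NaN
print(s1 + s2)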
22. Hierarchical indexing
Basic idea: represent high dimensional data in a lower-dimensional structure that is easier to reason about
Axis index with k levels of indexing
Slice chunks of data in constant time!
Provides a very natural way of implementing reshaping operations
Advantage over a truly N-dimensional object: space-efficient dense representation if groups are unbalanced
Extremely useful for econometric models on panel data
Wes McKinney () Structured data challenges Rice Statistics 22 / 43
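A minimal sketch of a hierarchically indexed Series and the reshaping it enables (values are made up for illustration):
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('Australia', 1950), ('Australia', 1955),
     ('Austria', 1950), ('Austria', 1955)],
    names=['country', 'year'])
s = pd.Series([64.3, 62.1, 31.1, 33.5], index=index, name='ls')

# slice out an entire outer-level group in one step
print(s['Australia'])

# pivot the inner level into columns: a lower-dimensional view of two-key data
print(s.unstack('year'))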
24. Joining and merging
Join and merge-type operations are very easy to implement with indexing in place
Multi-key join: same code as aligning hierarchically-indexed DataFrames
Will illustrate this with examples
Wes McKinney () Structured data challenges Rice Statistics 24 / 43
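A sketch of a multi-key merge using today's pandas API (pd.merge); the column names follow the earlier example, the frames themselves are made up:
import pandas as pd

left = pd.DataFrame({'cname': ['Australia', 'Australia', 'Austria'],
                     'year': [1950, 1955, 1950],
                     'ls': [64.3, 62.1, 31.1]})
right = pd.DataFrame({'cname': ['Australia', 'Austria'],
                      'year': [1950, 1950],
                      'pop': [558, 3310]})

# join on multiple keys; by default only rows present in both frames are kept
merged = pd.merge(left, right, on=['cname', 'year'], how='inner')
print(merged)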
25. Supporting size mutability
In order to have good row-oriented performance, need to store like-typed columns in a single ndarray
“Column” insertion: accumulate 1 × N × ... homogeneous columns, later consolidate with other like-typed into a single block
I.e. avoid reallocate-copy or array concatenation steps as long as possible
Column deletions can be no-copy events (since ndarrays support views)
Wes McKinney () Structured data challenges Rice Statistics 25 / 43
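What size mutability looks like from the user's side (a sketch; the block consolidation described above happens internally and is not shown):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 3), columns=['a', 'b', 'c'])

# column insertion: a new boolean column joins the existing float columns
df['flag'] = df['a'] > 0

# column deletion: drops the column without rewriting the rest of the data
del df['b']

print(df.dtypes)   # float64 and bool columns coexist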
26. DataFrame, under the hood
[Diagram: “Actually” vs. “You see” — internally, like-typed columns are consolidated into blocks (object, float, int, bool), while the user sees the columns in their original logical order]
Wes McKinney () Structured data challenges Rice Statistics 26 / 43
29. Reshaping implementation nuances
Must carefully deal with unbalanced group sizes / missing data
I play vectorization tricks with the NumPy memory layout: no for loops!
Care must be taken to handle heterogeneous and homogeneous data cases
Wes McKinney () Structured data challenges Rice Statistics 29 / 43
30. GroupBy
High level process
split data set into groups
apply function to each group (an aggregation or a transformation)
combine results intelligently into a result data structure
Can be used to emulate SQL GROUP BY operations
Wes McKinney () Structured data challenges Rice Statistics 30 / 43
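The split / apply / combine steps in one line, together with the SQL statement they emulate (a sketch with made-up data):
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b', 'b'],
                   'value': [1.0, 2.0, 3.0, 4.0]})

# roughly: SELECT key, SUM(value) FROM df GROUP BY key
result = df.groupby('key')['value'].sum()
print(result)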
31. GroupBy
Grouping closely related to indexing
Create correspondence between axis labels and group labels using one of:
Array of group labels (like a DataFrame column)
Python function to be applied to each axis tick
Can group by multiple keys
For a hierarchically indexed axis, can select a level and group by that (or some transformation thereof)
Wes McKinney () Structured data challenges Rice Statistics 31 / 43
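A sketch of the label-based grouping options just listed: a Python function applied to each axis tick, and a level of a hierarchical index.
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0],
              index=['apple', 'avocado', 'banana', 'blueberry'])

# group by a Python function applied to each axis label (here: its first letter)
print(s.groupby(lambda label: label[0]).mean())

# for a hierarchically indexed axis, group by a named level instead
idx = pd.MultiIndex.from_tuples([('AUS', 1950), ('AUS', 1955),
                                 ('AUT', 1950), ('AUT', 1955)],
                                names=['ccode', 'year'])
s2 = pd.Series([558, 645, 3310, 3420], index=idx)
print(s2.groupby(level='ccode').sum())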
32. Anatomy of GroupBy
grouped = obj.groupby([key1, key2, key3])
This returns a GroupBy object
Each of the keys could be any of:
A Python function
A vector
A column name
Wes McKinney () Structured data challenges Rice Statistics 32 / 43
33. Anatomy of GroupBy
aggregate, transform, and the more general apply supported
group_means = grouped.agg(np.mean)
group_means2 = grouped.mean()
demeaned = grouped.transform(lambda x: x - x.mean())
Wes McKinney () Structured data challenges Rice Statistics 33 / 43
34. Anatomy of GroupBy
The GroupBy object is also iterable
group_means = {}
for group_name, group in grouped:
    group_means[group_name] = group.mean()
Wes McKinney () Structured data challenges Rice Statistics 34 / 43
35. GroupBy and hierarchical indexing
Hierarchical indexing came about as the natural result of a multi-key
aggregation:
>>> group_means = df.groupby(['country', 'agefrom']).mean()
>>> group_means[['ls', 'lsc', 'pop']].unstack('country')
                    ls                  lsc                  pop
country      Australia  Austria  Australia  Austria  Australia  Austria
agefrom
15               70.03     31.1      26.17    14.67       6163     3310
20               58.02    59.98      34.51    45.83       1113      531
25               57.02     45.5      33.02    32.73       5021     2791
30               59.16    46.56      35.29    33.87       1082      527
35               59.58    43.29      34.53     30.3       1053    528.8
40                58.8    40.98      33.92    27.88       1005    522.5
45               56.79    39.19      31.71       26      927.2    503.5
50               54.71    37.14      30.37    23.94      836.6    475.2
55               51.93    34.82      26.98    22.11      735.8    443.2
60               49.92    32.35      25.85     20.2      632.6    408.8
65               47.02    29.59      22.98    18.44      522.5    361.9
70               46.27    28.52      22.53    17.72      410.5    295.8
75               46.28    28.06      23.41    18.42      624.5    437.6
Wes McKinney () Structured data challenges Rice Statistics 35 / 43
36. What makes GroupBy hard?
factor-izing the group labels is very expensive
Function call overhead on small groups
To sort or not to sort?
Cheaper than computing the group labels!
Munging together results in exceptional cases is tricky
Wes McKinney () Structured data challenges Rice Statistics 36 / 43
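The "factorizing" step in isolation, using the function pandas exposes for it today (pd.factorize); a sketch with made-up labels:
import pandas as pd

labels = ['Australia', 'Austria', 'Australia', 'Austria', 'Australia']

# codes: dense integer group ids; uniques: the distinct labels in order of appearance
codes, uniques = pd.factorize(labels)
print(codes)     # [0 1 0 1 0]
print(uniques)   # ['Australia' 'Austria']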
37. Time series operations
Fixed frequency indexing (DateRange)
Domain-specific time offsets (like business day, business month end)
Frequency conversion
Forward-filling / back-filling / interpolation
Leading/lagging
In the works (later this year), better/faster upsampling/downsampling
Wes McKinney () Structured data challenges Rice Statistics 37 / 43
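A sketch of these operations with the current pandas API (the fixed-frequency DateRange mentioned above later became pd.date_range):
import numpy as np
import pandas as pd

# fixed-frequency index: business days
rng = pd.date_range('2011-11-01', periods=10, freq='B')
ts = pd.Series(np.random.randn(10), index=rng)

# leading / lagging
lagged = ts.shift(1)

# frequency conversion to calendar days, forward-filling the gaps
daily = ts.asfreq('D').ffill()

print(lagged.head())
print(daily.head())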
38. Shoot-out with R’s time series objects
“Inner join” addition between two irregular 500K-length time series sampled from 1 million-length time series. Timestamps are POSIXct (R) / int64 (Python)
package timing factor
pandas 21.5 ms 1.0
xts 41.3 ms 1.92
fts 370 ms 17.2
its 1117 ms 51.95
zoo 3479 ms 161.8
Where there is smoke, there is fire?
Hardware: MacBook Pro (Core i7), with R 2.13.1 and pandas 0.5.1git / Python 2.7.2
Wes McKinney () Structured data challenges Rice Statistics 38 / 43
39. Erm, inner join?
Intersecting time stamps loses information
Wes McKinney () Structured data challenges Rice Statistics 39 / 43
40. My (mild) performance obsession
Wes McKinney () Structured data challenges Rice Statistics 40 / 43
41. My (mild) performance obsession
Good performance stems from many factors
Well-designed algorithms (from a complexity standpoint)
Minimizing copying of data
Minimizing interpreter / function call overhead
Taking advantage of memory layout
I value functionality over performance, but I do spend a lot of time profiling code and grokking performance characteristics
Wes McKinney () Structured data challenges Rice Statistics 41 / 43