Wes McKinney gave a talk on Apache Arrow, an open source project for memory interoperability between analytics and machine learning systems. Arrow provides efficient columnar memory structures and zero-copy sharing of data between applications. It defines common data types and schemas that can be used across programming languages. Arrow is implemented in C++ and provides language bindings for other languages like Python. It aims to improve performance for tasks like data loading, preprocessing, modeling and serving. Projects like pandas, Spark and Ray are exploring using Arrow internally for more efficient data handling.
Memory Interoperability in Analytics and Machine Learning
2. Me
March 26, 2017
• Currently: Software Architect at Two Sigma Investments
• Creator of Python pandas project
• PMC member for Apache Arrow and Apache Parquet
• Author of Python for Data Analysis
• Other Python projects: Ibis, Feather, statsmodels
All Rights Reserved 2
4. This talk
• Benefits of interoperable data and metadata
• Challenges to sharing memory between runtime environments
• Apache Arrow: Purpose and C++ architecture
• Opportunities for collaboration
• Example application: pandas 2.0
5. Changing hardware landscape
• Intel has released its first production 3D XPoint SSD
• Reported to be 1000x faster than NAND and less expensive than RAM
• Convergence of RAM and shared memory / mmap performance
6. Changing software landscape
• Next-gen ML / AI frameworks (TensorFlow, Torch, etc.)
• DIY open source architectures for machine learning in production
• Streaming / batch data processing pipelines
• Data cleaning and feature engineering
• Model fitting / scoring / serving
7. “Zero-copy” memory interfaces
• Enables computational tools to process a dataset without any additional serialization, or transfer to a different memory space
• Can do random access on a dataset that does not fit in RAM
• Another interpretation: reading a dataset is a metadata-only conversion
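As a rough illustration of the zero-copy idea, the sketch below memory-maps a file of integers and reads a small window from it without loading or deserializing the rest. It uses only the Python standard library; the file name `values.bin` and the helpers `write_int64_file` and `read_window` are hypothetical names chosen for this example, not part of any project mentioned in the talk.

```python
import mmap
import os
import tempfile
from array import array


def write_int64_file(path, n):
    """Write the integers 0..n-1 as native-endian 64-bit values."""
    with open(path, "wb") as f:
        array("q", range(n)).tofile(f)


def read_window(path, start, count):
    """Return values[start:start+count] without reading the whole file."""
    with open(path, "rb") as f:
        # Mapping the file only sets up address space; pages load on access.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # cast() reinterprets the mapped bytes as int64 in place (no copy).
            with memoryview(mapped) as raw, raw.cast("q") as values:
                return values[start:start + count].tolist()


scratch_dir = tempfile.mkdtemp()
path = os.path.join(scratch_dir, "values.bin")  # hypothetical scratch file
write_int64_file(path, 100_000)
window = read_window(path, 99_000, 3)
print(window)
```

Here "opening" the dataset amounts to attaching type information (`"q"`, a 64-bit integer) to bytes that already exist, which is one way to read the "metadata-only conversion" bullet above.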
8. Challenges to zero-copy memory sharing
• Cross-language issues
  • Type metadata + logical types
  • Byte/bit-level memory layout
• Language-specific issues
  • In-memory data structures
  • Memory allocation and sharing constructs
9. What is pandas?
• Popular in-memory data manipulation tool for Python
• Focused on tabular datasets (“data frames”)
• Sprawling codebase spanning multiple areas
  • IO for many data formats
  • Array manipulations / data preparation
  • OLAP-style analytics
• Internals implemented using NumPy array objects
10. NumPy
• Tensor memory model ("ndarray") for numeric data
• Strided, homogeneously-typed, byte-addressable memory
• APL-inspired semantics
• Zero-copy construction from compatible memory layouts
• Computational tools support both strided and contiguous memory access
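The NumPy properties listed above can be seen in a few lines. This is a minimal sketch assuming NumPy is installed; the variable names (`raw`, `flat`, `matrix`, `column`) are illustrative only.

```python
import numpy as np

raw = bytearray(12 * 8)                    # 96 bytes owned by Python, not by NumPy
flat = np.frombuffer(raw, dtype=np.int64)  # zero-copy ndarray over those bytes
flat[:] = np.arange(12)                    # writes land directly in `raw`

matrix = flat.reshape(3, 4)                # same memory, new shape and strides
column = matrix[:, 1]                      # strided view: one element every 32 bytes

column[0] = 100                            # modifies the shared memory
print(matrix.strides, column.strides, flat[1])
```

No data is copied at any step: `flat`, `matrix`, and `column` are three descriptions (shape, strides, dtype) of the same 96 bytes. The `column` view is not contiguous, which is why computational code has to handle both strided and contiguous access.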
11. pandas: Technical debt + Architectural issues
• A tensor library like NumPy is an awkward fit for pandas use cases
• Multidimensionality + strided memory access complicate algorithms
• Lack of built-in missing value support
• Weak on native string, variable-length, or nested types
• pandas is at its core an “in-memory columnar” problem, similar to analytical SQL engines
12. Thesis: Tensors and Tables
March 26, 2017
• 2 data structures best suited for zero-copy sharing
• Tensors: N-dimensional, homogeneously-typed arrays
• Tables: Column-oriented, heterogeneously typed
• These data structures can be defined using common memory and metadata
primitives
13. Observations
• A Tensor is semantically a multidimensional view of a 1D block of memory
• Writing computational code targeting arbitrary tensors is much more difficult
than 1D contiguous arrays
• Tensors of non-fixed size types (e.g. strings) occur less frequently
14. Apache Arrow
• github.com/apache/arrow
• Collaboration amongst broad set of OSS projects around language-agnostic
shared data structures
• Initial focus
• In-memory columnar tables
• Canonical metadata
• Interoperability between JVM and native code (C/C++) ecosystem
15. High performance data interchange
[Figure: data interchange between systems, “Today” versus “With Arrow”. Source: Apache Arrow]
16. What does Apache Arrow give you?
• Cache-efficient columnar memory: optimized for CPU affinity and SIMD /
parallel processing, O(1) random value access
• Zero-copy messaging / IPC: Language-agnostic metadata, batch/file-based
and streaming binary formats
• Complex schema support: Flat and nested data types
• Main implementations in C++ and Java: with integration tests
• Bindings / implementations for C, Python, Ruby, and JavaScript in various stages of development
17. Arrow in C++
• Reusable memory management and IO subsystem for native code applications
• Layered in multiple components
• Memory management
• Type metadata / schemas
• Array / Table containers
• IO interfaces
• Zero-copy IPC / messaging
18. Arrow C++: Memory management
• arrow::Buffer
• RAII-based memory lifetime with std::shared_ptr<Buffer>
• arrow::MemoryMappedBuffer: for memory maps
• arrow::MemoryPool
• Abstract memory allocator for tracking all allocations
19. Arrow C++: Type metadata
• arrow::DataType
• Base class for fixed size, variable size, and nested datatypes
• arrow::Field
• Type + name + additional metadata
• arrow::Schema
• Collection of fields
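The relationship between data types, fields, and schemas can be mirrored in a few Python dataclasses. This is a hypothetical sketch of the structure described above, not the Arrow C++ or Python API; every class, attribute, and method name below is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class DataType:
    """Base class for all (hypothetical) data types."""
    name: str

@dataclass(frozen=True)
class FixedWidthType(DataType):
    bit_width: int

@dataclass(frozen=True)
class ListType(DataType):
    value_type: DataType  # nested types refer to a child type

@dataclass(frozen=True)
class Field:
    """Type + name + additional metadata."""
    name: str
    type: DataType
    nullable: bool = True
    metadata: Tuple[Tuple[str, str], ...] = ()

@dataclass(frozen=True)
class Schema:
    """Ordered collection of fields."""
    fields: Tuple[Field, ...]

    def names(self):
        return [f.name for f in self.fields]

    def field_by_name(self, name):
        for f in self.fields:
            if f.name == name:
                return f
        raise KeyError(name)

schema = Schema((
    Field("id", FixedWidthType("int32", 32), nullable=False),
    Field("tags", ListType("list", DataType("utf8"))),
))
```

Keeping the schema as plain immutable metadata, separate from the data buffers, is what lets two processes agree on a layout before any values are exchanged.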
20. Arrow C++: Array / Table containers
• arrow::Array
• 1-dimensional columnar arrays: Int32Array, ListArray, StructArray, etc.
• Support for dictionary-encoded arrays
• arrow::RecordBatch
• Collection of equal-length arrays
• arrow::Column
• Logical table “column” as chunked array
• arrow::Table
• Collection of columns
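To make the columnar array idea concrete, here is a minimal pure-Python sketch of a variable-length string column stored as three buffers: a validity bitmap (least-significant bit first), 32-bit offsets, and concatenated UTF-8 data. The helper names are hypothetical, and details such as buffer padding and alignment are left out. A small dictionary-encoding helper is included as well.

```python
import struct

def build_string_array(values):
    """Encode a list of str/None as (validity bitmap, int32 offsets, data)."""
    validity = bytearray((len(values) + 7) // 8)
    offsets = [0]
    data = bytearray()
    for i, v in enumerate(values):
        if v is not None:
            validity[i // 8] |= 1 << (i % 8)  # set bit i: value is present
            data += v.encode("utf-8")
        offsets.append(len(data))  # nulls and empty strings add zero bytes
    packed = struct.pack("<%di" % len(offsets), *offsets)
    return bytes(validity), packed, bytes(data)

def get_value(validity, offsets, data, i):
    """O(1) random access to element i."""
    if not (validity[i // 8] >> (i % 8)) & 1:
        return None
    start, end = struct.unpack_from("<2i", offsets, 4 * i)
    return data[start:end].decode("utf-8")

def dictionary_encode(values):
    """Replace repeated values with integer indices into a dictionary."""
    dictionary, indices, seen = [], [], {}
    for v in values:
        if v not in seen:
            seen[v] = len(dictionary)
            dictionary.append(v)
        indices.append(seen[v])
    return dictionary, indices

validity, offsets, data = build_string_array(["arrow", None, "", "pandas"])
dictionary, indices = dictionary_encode(["a", "b", "a", "a"])
```

The validity bitmap is what distinguishes a missing value from an empty string, which addresses the missing-value and string-type weaknesses listed earlier for NumPy-backed pandas.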
21. Arrow C++: IO interfaces
• arrow::{InputStream, OutputStream}
• arrow::RandomAccessFile
• Abstract file interface
• arrow::MemoryMappedFile
• Zero-copy reads to arrow::Buffer
• Specific implementations for OS files, HDFS, etc.
22. Arrow C++: Messaging / IPC
• Metadata read/write using Google’s FlatBuffers library
• Encapsulated Message type
• Write record batches, read with zero-copy
• arrow::{FileWriter, FileReader}
• Random access / “batch” binary format
• arrow::{StreamWriter, StreamReader}
• Streaming binary format
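The "write record batches, read with zero-copy" idea can be illustrated with a toy length-prefixed stream. This is deliberately not Arrow's wire format (which carries FlatBuffers metadata, as noted above); the framing and the function names are assumptions chosen to keep the sketch short.

```python
import struct

def write_stream(payloads):
    """Concatenate payloads, each prefixed by its little-endian uint32 length."""
    out = bytearray()
    for p in payloads:
        out += struct.pack("<I", len(p))
        out += p
    return bytes(out)

def read_stream(buf):
    """Yield each payload as a memoryview slice of `buf` (no bytes copied)."""
    view = memoryview(buf)
    pos = 0
    while pos < len(view):
        (length,) = struct.unpack_from("<I", view, pos)
        pos += 4
        yield view[pos:pos + length]
        pos += length

stream = write_stream([b"batch-0", b"", b"batch-two"])
batches = list(read_stream(stream))
```

Each yielded batch is a window onto the original buffer, so reading costs only the small amount of framing metadata that has to be parsed. A file-style format would additionally store the batch offsets in one place so that any batch can be located without scanning the stream.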
23. In development: arrow::Tensor
• Targeting interoperability with memory layouts as used in NumPy,
TensorFlow, Torch, or other standard tensor-based frameworks
• data: arrow::Buffer
• shape: dimension sizes
• strides: memory ordering
• Zero-copy reads using Arrow’s shared memory tools
• Support Tensor math libraries for C++ like xtensor
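The data / shape / strides triple listed above is enough to describe a tensor as a view over a flat buffer. The following is a hypothetical pure-Python sketch of that idea, not the arrow::Tensor API; the `Tensor` class and the `row_major_strides` helper are illustrative names.

```python
import struct
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Tensor:
    data: memoryview          # flat buffer of little-endian float64 values
    shape: Tuple[int, ...]    # dimension sizes
    strides: Tuple[int, ...]  # byte step per dimension (memory ordering)

    def __getitem__(self, index):
        # Byte offset is the dot product of the index and the strides.
        offset = sum(i * s for i, s in zip(index, self.strides))
        return struct.unpack_from("<d", self.data, offset)[0]

    def transpose(self):
        # Reversing shape and strides reorders the axes without moving data.
        return Tensor(self.data, self.shape[::-1], self.strides[::-1])

def row_major_strides(shape, itemsize):
    """Byte strides for a C-ordered (row-major) layout."""
    strides = []
    step = itemsize
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))

buf = memoryview(struct.pack("<6d", 0.0, 1.0, 2.0, 3.0, 4.0, 5.0))
t = Tensor(buf, (2, 3), row_major_strides((2, 3), 8))
tt = t.transpose()
```

Both `t` and `tt` share the same buffer, so the transpose is a metadata-only operation, which is the property that makes shared-memory tensor reads cheap.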
24. Example use: Ray ML framework from Berkeley RISELab
Source: https://arxiv.org/abs/1703.03924
• Shared memory-based object store
• Zero-copy tensor reads using Arrow libraries
25. Example use: pandas 2.0
• In-planning rearchitecture of pandas’s internals
• libpandas — largely Python-agnostic C++11 library
• Decoupling pandas data structures from NumPy tensors
• Support analytics targeting native Arrow memory
• Multicore / parallel algorithms
• Leverage latest SIMD intrinsics
• Lazy-loading DataFrames from primary input formats
• CSV, JSON, HDF5, Apache Parquet
26. Other examples
• Spark integration (SPARK-13534)
• Weld integration (ARROW-649)
27. Thank you
• Building code and community around
• IO subsystems
• Metadata
• Data structures and in-memory formats