Data Science Languages and Industry Analytics (Wes McKinney)
September 19, 2015 talk at Berkeley Institute for Data Science. On how comparatively poor JSON / structured data tools pose a challenge for the data science languages (Python, R, Julia, etc.).
Improving Python and Spark (PySpark) Performance and Interoperability (Wes McKinney)
Slides from Spark Summit East 2017 — February 9, 2017 in Boston. Discusses ongoing development work to accelerate Python-on-Spark performance using Apache Arrow and other tools
Apache Arrow is a new standard for in-memory columnar data processing. It is a complement to Apache Parquet and Apache ORC. In this deck we review key design goals and how Arrow works in detail.
Python Data Wrangling: Preparing for the Future (Wes McKinney)
Given at PyCon HK on October 29, 2016. About open source work in progress to advance the Python pandas project internals and leverage synergies with other efforts in OSS data technology
It’s no longer a world of just relational databases. Companies are increasingly adopting specialized datastores such as Hadoop, HBase, MongoDB, Elasticsearch, Solr and S3. Apache Drill, an open source, in-memory, columnar SQL execution engine, enables interactive SQL queries against more datastores.
Arun Murthy, from the Hadoop team at Yahoo!, will introduce a compendium of best practices for applications running on Apache Hadoop. In fact, we introduce the notion of a Grid Pattern which, similar to a Design Pattern, represents a general reusable solution for applications running on the Grid. He will also cover the anti-patterns of applications running on Apache Hadoop clusters. Arun will enumerate characteristics of well-behaved applications and provide guidance on appropriate uses of various features and capabilities of the Hadoop framework. It is largely prescriptive in nature; a useful way to look at the presentation is to understand that applications that follow, in spirit, the best practices prescribed here are very likely to be efficient, well-behaved in the multi-tenant environment of Apache Hadoop clusters, and unlikely to fall afoul of most policies and limits.
This talk was presented at Spark Summit East 2016. Below is the abstract:
In this talk, we’ll discuss the challenges of analyzing large-scale time series data sets and introduce the TS-for-Spark library. Whether we need to build models over data coming in every second from thousands of sensors or dig into the histories of millions of financial instruments, large-scale time series data shows up in a variety of domains. Time series data has an innate structure not found in other data sets, and thus presents both unique challenges and opportunities. The open source Spark-TS package provides both Scala and Python APIs for munging, manipulating, and modeling time series data, on top of Spark. We'll cover its core abstractions, like the TimeSeriesRDD and DateTimeIndex, as well as some of the statistical modeling functionality it provides on top of them.
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX (rhatr)
Graph relationships are everywhere. In fact, more often than not, analyzing relationships between points in your datasets lets you extract more business value from your data.
Consider social graphs, or relationships of customers to each other and products they purchase, as two of the most common examples. Now, if you think you have a scalability issue just analyzing points in your datasets, imagine what would happen if you wanted to start analyzing the arbitrary relationships between those data points: the amount of potential processing will increase dramatically, and the kind of algorithms you would typically want to run would change as well.
If your Hadoop batch-oriented approach with MapReduce works reasonably well, for scalable graph processing you have to embrace an in-memory, explorative, and iterative approach. One of the best ways to tame this complexity is known as the Bulk synchronous parallel approach. Its two most widely used implementations are available as Hadoop ecosystem projects: Apache Giraph (used at Facebook), and Apache GraphX (as part of a Spark project).
In this talk we will focus on practical advice on how to get up and running with Apache Giraph and GraphX; start analyzing simple datasets with built-in algorithms; and finally how to implement your own graph processing applications using the APIs provided by the projects. We will finally compare and contrast the two, and try to lay out some principles of when to use one vs. the other.
PayPal provides an online money transfer network. Each payment flow connects senders and receivers into a giant network where each sender/receiver is a node and each transaction is an edge. Traditionally, the risk score of a transaction is computed based on the characteristics of the involved sender/receiver/transaction. In this talk, we will describe a novel network inference approach to calculate a transaction risk score that also includes the risk profile of neighboring senders and receivers using Apache Giraph. The approach reveals additional risk insights not possible with the traditional method. We leverage Hadoop to support a graph computation involving hundreds of millions of nodes and edges.
If you're building relational, time-series, IOT, or real-time architectures using Hadoop, you will find Apache Kudu an attractive choice. With Kudu, you'll be able to build your applications more simply and with fewer moving parts.
Hadoop has become faster and more capable, and has continued to narrow the gap compared to traditional database technologies. However, for developers looking for up-to-the-second analytics on fast-moving data, some important gaps remain that prevent many applications from transitioning to Hadoop-based architectures. Users are often caught between a rock and a hard place: columnar formats such as Apache Parquet offer extremely fast scan rates for analytics, but little to no ability for real-time modification or row-by-row indexed access. Online systems such as HBase offer very fast random access, but scan rates that are too slow for large scale data warehousing and analytical workloads.
This talk will describe Kudu, the new addition to the open source Hadoop ecosystem with out-of-the-box integration with Apache Spark and Apache Impala. Kudu fills the gap described above to provide a new option to achieve fast scans and fast random access from a single API.
Talk on Apache Kudu, presented by Asim Jalis at SF Data Engineering Meetup on 2/23/2016.
http://www.meetup.com/SF-Data-Engineering/events/228293610/
Big Data applications need to ingest streaming data and analyze it. HBase is great at ingesting streaming data but not good at analytics. HDFS is great at analytics but not at ingesting streaming data. Frequently applications ingest data into HBase and then move it to HDFS for analytics.
What if you could use a single system for both use cases? This could dramatically simplify your data pipeline architecture.
This is where Kudu comes in. Kudu is a storage system that lives between HDFS and HBase. It is good at both ingesting streaming data and good at analyzing it using Spark, MapReduce, and SQL.
HUG_Ireland_Apache_Arrow_Tomer_Shiran John Mulhall
A presentation by Tomer Shiran, CEO of Dremio made to Hadoop User Group (HUG) Ireland on "Hadoop Summit Night" on April 12th, 2016. This presentation covers Apache Arrow in detail.
Jump Start with Apache Spark 2.0 on Databricks (Databricks)
Apache Spark 2.0 has laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data.
In this introductory part lecture and part hands-on workshop you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
What’s new in Spark 2.0
SparkSessions vs SparkContexts
Datasets/Dataframes and Spark SQL
Introduction to Structured Streaming concepts and APIs
The columnar roadmap: Apache Parquet and Apache Arrow (DataWorks Summit)
The Hadoop ecosystem has standardized on columnar formats: Apache Parquet for on-disk storage and Apache Arrow for in-memory. With this trend, deep integration with columnar formats is a key differentiator for big data technologies. Vertical integration from storage to execution greatly improves the latency of accessing data by pushing projections and filters to the storage layer, reducing time spent in IO reading from disk, as well as CPU time spent decompressing and decoding. Standards like Arrow and Parquet make this integration even more valuable as data can now cross system boundaries without incurring costly translation. Cross-system programming using languages such as Spark, Python, or SQL can become as fast as native internal performance.
In this talk we’ll explain how Parquet is improving at the storage level, with metadata and statistics that will facilitate more optimizations in query engines in the future. We’ll detail how the new vectorized reader from Parquet to Arrow enables much faster reads by removing abstractions as well as several future improvements. We will also discuss how standard Arrow-based APIs pave the way to breaking the silos of big data. One example is Arrow-based universal function libraries that can be written in any language (Java, Scala, C++, Python, R, ...) and will be usable in any big data system (Spark, Impala, Presto, Drill). Another is a standard data access API with projection and predicate push downs, which will greatly simplify data access optimizations across the board.
Speaker
Julien Le Dem, Principal Engineer, WeWork
Jump Start into Apache® Spark™ and Databricks (Databricks)
These are the slides from the Jump Start into Apache Spark and Databricks webinar on February 10th, 2016.
---
Spark is a fast, easy to use, and unified engine that allows you to solve many Data Sciences and Big Data (and many not-so-Big Data) scenarios easily. Spark comes packaged with higher-level libraries, including support for SQL queries, streaming data, machine learning, and graph processing. We will leverage Databricks to quickly and easily demonstrate, visualize, and debug our code samples; the notebooks will be available for you to download.
Transformation Processing Smackdown; Spark vs Hive vs Pig (Lester Martin)
Compare and contrast using Spark, Hive and Pig for transformation processing requirements. Video of my "talk" at https://www.youtube.com/watch?v=36_MayK5eU4.
Conference page for the talk is at https://devnexus.com/s/devnexus2017/presentations/17533.
Big Data, Data Lake, Fast Data - Dataserialiation-Formats (Guido Schmutz)
The concept of "Data Lake" is in everyone's mind today. The idea of storing all the data that accumulates in a company in a central location and making it available sounds very interesting at first. But Data Lake can quickly turn from a clear, beautiful mountain lake into a huge pond, especially if it is inexpertly entrusted with all the source data formats that are common in today's enterprises, such as XML, JSON, CSV or unstructured text data. Who, after some time, still has an overview of which data, which format and how they have developed over different versions? Anyone who wants to help themselves from the Data Lake must ask themselves the same questions over and over again: what information is provided, what data types do they have and how has the content changed over time?
Data serialization frameworks such as Apache Avro and Google Protocol Buffer (Protobuf), which enable platform-independent data modeling and data storage, can help. This talk will discuss the possibilities of Avro and Protobuf and show how they can be used in the context of a data lake and what advantages can be achieved. The support on Avro and Protobuf by Big Data and Fast Data platforms is also a topic.
Talk at Hug FR on December 4, 2012 about the new Apache Drill project. Notably, this talk includes an introduction to the converging specification for the logical plan in Drill.
Apache Arrow Workshop at VLDB 2019 / BOSS Session (Wes McKinney)
Technical deep dive for database system developers in the Arrow columnar format, binary protocol, C++ development platform, and Arrow Flight RPC.
See demo Jupyter notebooks at https://github.com/wesm/vldb-2019-apache-arrow-workshop
Similar to Apache Arrow (Strata-Hadoop World San Jose 2016) (20)
Data Science Without Borders (JupyterCon 2017) (Wes McKinney)
Talk about building shared, language-agnostic computational infrastructure for data science. Discusses the motivation and work that's happening in the Apache Arrow project to help (http://arrow.apache.org)
Enabling Python to be a Better Big Data Citizen (Wes McKinney)
These slides are from my talk at the NYC Python Meetup at ODSC Office NYC on February 17, 2016. It discusses Python's architectural challenges to interoperate with the Hadoop ecosystem and how a new project, Apache Arrow, will help.
2. DREMIO
Who
Wes McKinney
• Engineer at Cloudera, formerly DataPad CEO/founder
• Wrote bestseller Python for Data Analysis (2012)
• Open source projects
– Python {pandas, Ibis, statsmodels}
– Apache {Arrow, Parquet, Kudu (incubating)}
• Mostly work in Python and Cython/C/C++
Jacques Nadeau
• CTO & Co-Founder at Dremio, formerly Architect at MapR
• Open Source projects
– Apache {Arrow, Parquet, Calcite, Drill, HBase, Phoenix}
• Mostly work in Java
3. DREMIO
Arrow in a Slide
• New Top-level Apache Software Foundation project
– Announced Feb 17, 2016
• Focused on Columnar In-Memory Analytics
1. 10-100x speedup on many workloads
2. Common data layer enables companies to choose best of breed systems
3. Designed to work with any programming language
4. Support for both relational and complex data as-is
• Developers from 13+ major open source projects involved
– A significant % of the world’s data will be processed through Arrow!
– Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R
6. DREMIO
Overview
• A high speed in-memory representation
• Well-documented and cross language compatible
• Designed to take advantage of modern CPU characteristics
• Embeddable in execution engines, storage layers, etc.
8. DREMIO
High Performance Sharing & Interchange
Today
• Each system has its own internal memory format
• 70-80% CPU wasted on serialization and deserialization
• Similar functionality implemented in multiple projects
With Arrow
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g., Parquet-to-Arrow reader)
[Diagram, Today: Pandas, Drill, Impala, HBase, Kudu, Cassandra, Parquet, and Spark linked pairwise by "Copy & Convert" steps]
[Diagram, With Arrow: the same systems all sharing Arrow Memory]
9. DREMIO
Shared Need > Open Source Opportunity
• Columnar is Complex
• Shredded Columnar is even more complex
• We all need to go to same place
• Take Advantage of Open Source approach
• Once we pick a shared solution, we get interchange for “free”
“We are also considering switching to a columnar canonical in-memory format for data that needs to be materialized during query processing, in order to take advantage of SIMD instructions”
- Impala Team
“A large fraction of the CPU time is spent waiting for data to be fetched from main memory…we are designing cache-friendly algorithms and data structures so Spark applications will spend less time waiting to fetch data from memory and more time doing useful work”
- Spark Team
18. DREMIO
Java: Memory Management (& NVMe)
• Chunk-based managed allocator
– Built on top of Netty’s JEMalloc implementation
• Create a tree of allocators
– Limit and transfer semantics across allocators
– Leak detection and location accounting
• Wrap native memory from other applications
• New support for integration with Intel’s Persistent Memory library via Apache Mnemonic
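The allocator-tree idea above (per-node limits, transfer of ownership between allocators, and leak detection) can be sketched in a few lines. This is only an accounting model in Python, not Arrow's Java allocator: the names Allocator, Buffer, new_child, transfer_to, and close are hypothetical, and a bytearray stands in for the chunk-managed native memory.

```python
class Allocator:
    """Hypothetical accounting node; illustrates the idea, not Arrow's Java API."""

    def __init__(self, name, limit, parent=None):
        self.name = name
        self.limit = limit
        self.parent = parent
        self.used = 0

    def new_child(self, name, limit):
        return Allocator(name, limit, parent=self)

    def _reserve(self, nbytes):
        # Check this node and every ancestor before committing anything.
        node = self
        while node is not None:
            if node.used + nbytes > node.limit:
                raise MemoryError(f"{node.name}: limit of {node.limit} bytes exceeded")
            node = node.parent
        node = self
        while node is not None:
            node.used += nbytes
            node = node.parent

    def _release(self, nbytes):
        node = self
        while node is not None:
            node.used -= nbytes
            node = node.parent

    def allocate(self, nbytes):
        self._reserve(nbytes)
        return Buffer(self, nbytes)

    def close(self):
        # Leak detection: anything still accounted here was never released.
        if self.used:
            raise RuntimeError(f"{self.name}: leaked {self.used} bytes")


class Buffer:
    def __init__(self, owner, nbytes):
        self.owner = owner
        self.data = bytearray(nbytes)

    def transfer_to(self, target):
        """Move accounting for this buffer to another allocator in the tree."""
        nbytes = len(self.data)
        self.owner._release(nbytes)
        try:
            target._reserve(nbytes)
        except MemoryError:
            self.owner._reserve(nbytes)  # roll back to the previous owner
            raise
        self.owner = target

    def release(self):
        if self.owner is not None:
            self.owner._release(len(self.data))
            self.owner = None


root = Allocator("root", limit=100)
a = root.new_child("a", limit=64)
b = root.new_child("b", limit=32)
big = a.allocate(48)
small = a.allocate(16)

try:
    big.transfer_to(b)  # 48 bytes do not fit under b's 32-byte limit
    rejected = False
except MemoryError:
    rejected = True

small.transfer_to(b)  # 16 bytes fit, so ownership moves from a to b
big.release()

try:
    b.close()  # b still owns `small`, so this reports a leak
    leak = None
except RuntimeError as exc:
    leak = str(exc)
```

Releasing from the old owner before reserving on the new one avoids double-counting the bytes at a shared ancestor during a transfer.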
20. DREMIO
Common Message Pattern
• Schema Negotiation
– Logical Description of structure
– Identification of dictionary encoded Nodes
• Dictionary Batch
– Dictionary ID, Values
• Record Batch
– Batches of records up to 64K
– Leaf nodes up to 2B values
[Diagram: Schema Negotiation, then 0..N Dictionary Batches, then 1..N Record Batches]
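The message sequence on this slide (one schema, then zero or more dictionary batches, then one or more record batches) can be illustrated with a toy stream. The framing here is an assumption made for the sketch: a 1-byte kind plus a 4-byte length followed by a JSON body. It is not the Arrow wire format, and the field names "city" and "temp" are invented.

```python
import io
import json
import struct

# Hypothetical message kinds and framing, for illustration only.
SCHEMA, DICTIONARY, RECORD = 0, 1, 2
HEADER = struct.Struct("<BI")  # kind (1 byte), body length (4 bytes)
MAX_RECORDS = 64 * 1024  # the slide's "up to 64K" records per batch


def write_message(out, kind, payload):
    body = json.dumps(payload).encode("utf-8")
    out.write(HEADER.pack(kind, len(body)))
    out.write(body)


def read_messages(inp):
    while True:
        header = inp.read(HEADER.size)
        if not header:
            return
        kind, length = HEADER.unpack(header)
        yield kind, json.loads(inp.read(length))


def decode_stream(inp):
    """Rebuild plain columns, resolving dictionary-encoded fields."""
    schema, dictionaries, batches = None, {}, []
    for kind, payload in read_messages(inp):
        if kind == SCHEMA:
            schema = payload
        elif kind == DICTIONARY:
            dictionaries[payload["id"]] = payload["values"]
        else:
            batch = {}
            for field in schema["fields"]:
                column = payload["columns"][field["name"]]
                if "dictionary" in field:
                    values = dictionaries[field["dictionary"]]
                    column = [values[i] for i in column]
                batch[field["name"]] = column
            batches.append(batch)
    return batches


stream = io.BytesIO()
# 1. Schema: logical structure, plus which fields are dictionary encoded.
write_message(stream, SCHEMA, {"fields": [
    {"name": "city", "dictionary": 0},
    {"name": "temp"},
]})
# 2. Dictionary batch: dictionary ID and its values.
write_message(stream, DICTIONARY, {"id": 0, "values": ["NYC", "SF"]})
# 3. Record batches: dictionary-encoded columns carry indices only.
for columns in ({"city": [0, 1, 0], "temp": [21, 17, 22]},
                {"city": [1], "temp": [16]}):
    assert len(columns["temp"]) <= MAX_RECORDS
    write_message(stream, RECORD, {"columns": columns})

stream.seek(0)
batches = decode_stream(stream)
```

Because the dictionary is sent once, each record batch only has to carry small integer indices for the "city" column.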
21. DREMIO
Record Batch Construction
[Diagram: Schema Negotiation, Dictionary Batch, then Record Batches; one Record Batch is expanded below]
Example record:
{
  name: 'wes',
  iq: 180,
  addresses: [
    {number: 2, street: 'a'},
    {number: 3, street: 'bb'}
  ]
}
Record batch layout:
• data header (describes offsets into data)
• name (bitmap), name (offset), name (data)
• iq (bitmap), iq (data)
• addresses (bitmap), addresses (list offset)
• addresses.number (bitmap), addresses.number
• addresses.street (bitmap), addresses.street (offset), addresses.street (data)
Each box is contiguous memory, entirely contiguous on wire
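To make the layout concrete, the sketch below builds the bitmap, offset, and data buffers for the example record and concatenates them behind a header of (label, start, length) entries. It rests on several assumptions: 32-bit little-endian integers, validity bits numbered from the least-significant bit, and no alignment padding between buffers. The helper names are hypothetical, and the result is not byte-compatible with Arrow.

```python
import struct


def pack_bitmap(valid):
    """One validity bit per value, least-significant bit first."""
    bits = bytearray((len(valid) + 7) // 8)
    for i, is_valid in enumerate(valid):
        if is_valid:
            bits[i // 8] |= 1 << (i % 8)
    return bytes(bits)


def pack_offsets(lengths):
    """Running totals: value i spans offsets[i]:offsets[i + 1]."""
    offsets = [0]
    for n in lengths:
        offsets.append(offsets[-1] + n)
    return offsets


def pack_int32(values):
    return struct.pack(f"<{len(values)}i", *values)


def string_column(values):
    encoded = [v.encode("utf-8") for v in values]
    return {
        "bitmap": pack_bitmap([True] * len(values)),
        "offset": pack_offsets([len(e) for e in encoded]),
        "data": b"".join(encoded),
    }


records = [{"name": "wes", "iq": 180,
            "addresses": [{"number": 2, "street": "a"},
                          {"number": 3, "street": "bb"}]}]
all_valid = pack_bitmap([True] * len(records))
# Child values of the list column are stored flattened across all records.
flat = [a for r in records for a in r["addresses"]]

name = string_column([r["name"] for r in records])
street = string_column([a["street"] for a in flat])
list_offset = pack_offsets([len(r["addresses"]) for r in records])

buffers = [
    ("name (bitmap)", name["bitmap"]),
    ("name (offset)", pack_int32(name["offset"])),
    ("name (data)", name["data"]),
    ("iq (bitmap)", all_valid),
    ("iq (data)", pack_int32([r["iq"] for r in records])),
    ("addresses (bitmap)", all_valid),
    ("addresses (list offset)", pack_int32(list_offset)),
    ("addresses.number (bitmap)", pack_bitmap([True] * len(flat))),
    ("addresses.number", pack_int32([a["number"] for a in flat])),
    ("addresses.street (bitmap)", street["bitmap"]),
    ("addresses.street (offset)", pack_int32(street["offset"])),
    ("addresses.street (data)", street["data"]),
]

# The header records where each buffer starts inside one contiguous body.
header, body = [], bytearray()
for label, buf in buffers:
    header.append((label, len(body), len(buf)))
    body += buf
```

The two street values 'a' and 'bb' end up as the single data buffer b"abb" with offsets [0, 1, 3], which is why a reader can slice out any value without parsing the others.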
22. DREMIO
RPC & IPC: Moving Data Between Systems
RPC
• Avoid Serialization & Deserialization
• Layer TBD: Focused on supporting vectored I/O
– Scatter/gather reads/writes against socket
IPC
• Alpha implementation using memory mapped files
– Moving data between Python and Drill
• Working on shared allocation approach
– Shared reference counting and well-defined ownership semantics
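The memory-mapped-file approach to IPC can be sketched with the Python standard library alone. One process writes a column of 64-bit integers in native byte order; another maps the file and reads the values through a typed view, without a deserialization step. The file layout (an 8-byte count followed by the raw values) and the function names are assumptions made for this sketch, not the alpha implementation the slide refers to.

```python
import mmap
import os
import struct
import tempfile
from array import array


def write_column(path, values):
    """Producer side: 8-byte count, then raw native-endian int64 values."""
    with open(path, "wb") as f:
        f.write(struct.pack("=q", len(values)))
        f.write(array("q", values).tobytes())


def sum_column(path):
    """Consumer side: map the file and read the values in place."""
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        (count,) = struct.unpack_from("=q", mm, 0)
        # Slicing and casting a memoryview creates typed views over the
        # mapped pages; no bytes are copied into Python objects up front.
        with memoryview(mm) as raw, \
                raw[8:8 + 8 * count] as body, \
                body.cast("q") as column:
            return sum(column)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "column.bin")
    write_column(path, [1, 2, 3, 40])
    total = sum_column(path)
```

Native byte order is acceptable here because both sides are assumed to run on the same machine; the views are released explicitly before the mapping closes.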
24. DREMIO
Real World Example: Python With Spark or Drill
[Diagram: a SQL engine feeds input partitions 0 … n - 1 to user-supplied Python code; each Python function takes one partition as input and produces output, and output partitions 0 … n - 1 flow back to the SQL engine]
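The partition flow in this example can be sketched as follows, assuming the engine hands each partition to the user function as a whole column rather than row by row. The names apply_udf and celsius_to_fahrenheit are hypothetical, and array.array stands in for an Arrow column.

```python
from array import array


def apply_udf(partitions, udf):
    """Engine side: call the user function once per partition."""
    return [udf(batch) for batch in partitions]


def celsius_to_fahrenheit(batch):
    """User-supplied code: receives and returns a whole column."""
    return array("d", (c * 9 / 5 + 32 for c in batch))


in_partitions = [array("d", [0.0, 100.0]), array("d", [-40.0])]
out_partitions = apply_udf(in_partitions, celsius_to_fahrenheit)
```

Calling the function once per partition amortizes the cost of crossing between the engine and Python over many values.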
25. DREMIO
Real World Example: Feather File Format for Python and R
• Problem: fast, language-agnostic binary data frame file format
• Written by Wes McKinney (Python) and Hadley Wickham (R)
• Read speeds close to disk IO performance
[Diagram: a Feather file holds Arrow arrays 0 … n (Apache Arrow memory) followed by Feather metadata (Google flatbuffers)]
26. DREMIO
Real World Example: Feather File Format for Python and R
R
library(feather)
path <- "my_data.feather"
write_feather(df, path)
df <- read_feather(path)
Python
import feather
path = 'my_data.feather'
feather.write_dataframe(df, path)
df = feather.read_dataframe(path)
27. DREMIO
What’s Next
• Parquet for Python & C++
– Using Arrow Representation
• Available IPC Implementation
• Spark, Drill Integration
– Faster UDFs, Storage interfaces
28. DREMIO
Get Involved
• Join the community
– dev@arrow.apache.org
– Slack: https://apachearrowslackin.herokuapp.com/
– http://arrow.apache.org
– @ApacheArrow, @wesmckinn, @intjesus