This document discusses improving Python and Spark performance and interoperability with Apache Arrow. It begins with an overview of current limitations of PySpark UDFs, such as inefficient data movement and scalar computation. It then introduces Apache Arrow, an open source in-memory columnar data format, and how it can help by allowing more efficient data sharing and vectorized computation. The document shows how Arrow improved PySpark UDF performance by 53x through vectorization and reduced serialization. It outlines future plans to further optimize UDFs and integration with Spark and other projects.
If you have your own Columnar format, stop now and use Parquet 😛Julien Le Dem
Lightning talk presented at HPTS 2015: http://hpts.ws/
Apache Parquet is the de facto standard columnar storage for big data. Open source and proprietary SQL engines already integrate with it as their users don’t want to load and duplicate their data in every tool. Users want an open, interoperable, efficient format to experiment with the many options they have. The format is defined by the open source community integrating feedback from many teams working on query engines (including but not limited to Impala, Drill, Hawq, SparkSQL, Presto, Hive, etc) or on infrastructure at scale (Twitter, Netflix, Stripe, Criteo, ...). Building on its initial success, the Parquet community is defining new features for the next iteration of the format. For example: improved metadata layout, type system completude or mergeable statistics used for planning.
If you have your own Columnar format, stop now and use Parquet 😛Julien Le Dem
Lightning talk presented at HPTS 2015: http://hpts.ws/
Apache Parquet is the de facto standard columnar storage for big data. Open source and proprietary SQL engines already integrate with it as their users don’t want to load and duplicate their data in every tool. Users want an open, interoperable, efficient format to experiment with the many options they have. The format is defined by the open source community integrating feedback from many teams working on query engines (including but not limited to Impala, Drill, Hawq, SparkSQL, Presto, Hive, etc) or on infrastructure at scale (Twitter, Netflix, Stripe, Criteo, ...). Building on its initial success, the Parquet community is defining new features for the next iteration of the format. For example: improved metadata layout, type system completude or mergeable statistics used for planning.
It’s no longer a world of just relational databases. Companies are increasingly adopting specialized datastores such as Hadoop, HBase, MongoDB, Elasticsearch, Solr and S3. Apache Drill, an open source, in-memory, columnar SQL execution engine, enables interactive SQL queries against more datastores.
Apache parquet - Apache big data North America 2017techmaddy
Apache Parquet brings the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem. Apache Parquet is built from the ground up with complex nested data structures in mind, and uses the record shredding and assembly algorithm described in the Dremel paper. We believe this approach is superior to simple flattening of nested name spaces. Apache Parquet is built to support very efficient compression and encoding schemes. Multiple projects have demonstrated the performance impact of applying the right compression and encoding scheme to the data. Apache Parquet allows compression schemes to be specified on a per-column level and is future-proofed to allow adding more encodings as they are invented and implemented. This talk highlights the internal implementation of Apache Parquet.
Hadoop Infrastructure @Uber Past, Present and FutureDataWorks Summit
Uber’s mission is to provide transportation as reliable as running water and for fulfilling that mission data plays a critical role. In Uber, Hadoop plays a critical role in Data Infrastructure. We want to talk about the journey of Hadoop @Uber and our future plans in terms of scaling for billions of trips. We will talk about most unique use case Uber have and how Hadoop and eco system which we built, helped us in this journey. We want to talk about how we scaled from 10 -> 2000 and In future to scale up to 10’s X1000 of Nodes. We will talk about our mistakes, learning and wins and how we process billions of events per day. We will talk about the unique challenges and real world use-cases and how we will co-locate the Uber’s service architecture with batch (e.g data pipelines, machine learning and analytical workloads). Uber have done lot of improvements to current Hadoop eco system and uniquely solved some of the problems in a way which is never been solved in the past. This presentation will help audience to use this as an example and even encourage them to enhance the eco system. This will help to increase the community of these project and overall help the whole big data space. Audience is anybody who is working on Big Data and want to understand how to scale Hadoop and eco system for 10s of thousands of node. This talk will help them understand the Hadoop ecosystem and how to efficiently use that. It will also introduce them to some of the awesome technologies which Uber team is building in big data space.
The columnar roadmap: Apache Parquet and Apache ArrowDataWorks Summit
The Hadoop ecosystem has standardized on columnar formats—Apache Parquet for on-disk storage and Apache Arrow for in-memory. With this trend, deep integration with columnar formats is a key differentiator for big data technologies. Vertical integration from storage to execution greatly improves the latency of accessing data by pushing projections and filters to the storage layer, reducing time spent in IO reading from disk, as well as CPU time spent decompressing and decoding. Standards like Arrow and Parquet make this integration even more valuable as data can now cross system boundaries without incurring costly translation. Cross-system programming using languages such as Spark, Python, or SQL can becomes as fast as native internal performance.
In this talk we’ll explain how Parquet is improving at the storage level, with metadata and statistics that will facilitate more optimizations in query engines in the future. We’ll detail how the new vectorized reader from Parquet to Arrow enables much faster reads by removing abstractions as well as several future improvements. We will also discuss how standard Arrow-based APIs pave the way to breaking the silos of big data. One example is Arrow-based universal function libraries that can be written in any language (Java, Scala, C++, Python, R, ...) and will be usable in any big data system (Spark, Impala, Presto, Drill). Another is a standard data access API with projection and predicate push downs, which will greatly simplify data access optimizations across the board.
Speaker
Julien Le Dem, Principal Engineer, WeWork
Hadoop Distributed File System (HDFS) evolves from a MapReduce-centric storage system to a generic, cost-effective storage infrastructure where HDFS stores all data of inside the organizations. The new use case presents a new sets of challenges to the original HDFS architecture. One challenge is to scale the storage management of HDFS - the centralized scheme within NameNode becomes a main bottleneck which limits the total number of files stored. Although a typical large HDFS cluster is able to store several hundred petabytes of data, it is inefficient to handle large amounts of small files under the current architecture.
In this talk, we introduce our new design and in-progress work that re-architects HDFS to attack this limitation. The storage management is enhanced to a distributed scheme. A new concept of storage container is introduced for storing objects. HDFS blocks are stored and managed as objects in the storage containers instead of being tracked only by NameNode. Storage containers are replicated across DataNodes using a newly-developed high-throughput protocol based on the Raft consensus algorithm. Our current prototype shows that under the new architecture the storage management of HDFS scales 10x better, demonstrating that HDFS is capable of storing billions of files.
Apache Spark 2.3 boosts advanced analytics and deep learning with PythonDataWorks Summit
Python is one of the most popular programming languages for advanced analytics, data science, machine learning, and deep learning. One of Python’s greatest assets is its extensive set of libraries, such as Numpy, Pandas, Scikit-learn, Theano, TensorFlow, Keras, and so on. Apache Spark is becoming the core component for big data processing and playing important role to help data scientists solve complicated problems. It has a great significance and strong demand to integrate Spark with the extremely rich Python ecosystems to handle challenges in artificial intelligence. In the latest Spark 2.3, some very exciting features were put in, for example: vectorized UDF in PySpark, which leverages Apache Arrow to provide high performance interoperability between Spark and Pandas/Numpy; Image format in dataFrame/dataset, which can improve Spark and TensorFlow (or other deep learning libraries) interoperability; high-efficiency parallel modeling tuning with Spark MLlib, etc. In this talk, we'll share best practice on real use cases and hands-on experiences to illustrate the power of these new features and bring more discussions on this topic.
Speaker: Yanbo Liang, Staff Software Engineer, Hortonworks
Data Science at Scale on MPP databases - Use Cases & Open Source ToolsEsther Vasiete
Pivotal workshop slide deck for Structure Data 2016 held in San Francisco.
Abstract:
Learn how data scientists at Pivotal build machine learning models at massive scale on open source MPP databases like Greenplum and HAWQ (under Apache incubation) using in-database machine learning libraries like MADlib (under Apache incubation) and procedural languages like PL/Python and PL/R to take full advantage of the rich set of libraries in the open source community. This workshop will walk you through use cases in text analytics and image processing on MPP.
This is slides from our recent HadoopIsrael meetup. It is dedicated to comparison Spark and Tez frameworks.
In the end of the meetup there is small update about our ImpalaToGo project.
Improving Python and Spark (PySpark) Performance and InteroperabilityWes McKinney
Slides from Spark Summit East 2017 — February 9, 2017 in Boston. Discusses ongoing development work to accelerate Python-on-Spark performance using Apache Arrow and other tools
Improving Python and Spark Performance and Interoperability with Apache Arrow...Databricks
Apache Spark has become a popular and successful way for Python programming to parallelize and scale up data processing. In many use cases though, a PySpark job can perform worse than an equivalent job written in Scala. It is also costly to push and pull data between the user’s Python environment and the Spark master.
Apache Arrow-based interconnection between the various big data tools (SQL, UDFs, machine learning, big data frameworks, etc.) enables you to use them together seamlessly and efficiently, without overhead. When collocated on the same processing node, read-only shared memory and IPC avoid communication overhead. When remote, scatter-gather I/O sends the memory representation directly to the socket avoiding serialization costs.
Hadoop makes it relatively easy to store petabytes of data. However, storing data is not enough; columnar layouts for storage and in-memory execution allow the analysis of large amounts of data very quickly and efficiently. It provides the ability for multiple applications to share a common data representation and perform operations at full CPU throughput using SIMD and Vectorization. For interoperability, row based encodings (CSV, Thrift, Avro) combined with general purpose compression algorithms (GZip, LZO, Snappy) are common but inefficient. As discussed extensively in the database literature, a columnar layout with statistics and sorting provides vertical and horizontal partitioning, thus keeping IO to a minimum. Additionally a number of key big data technologies have or will soon have in-memory columnar capabilities. This includes Kudu, Ibis and Drill. Sharing a common in-memory columnar representation allows interoperability without the usual cost of serialization.
Understanding modern CPU architecture is critical to maximizing processing throughput. We’ll discuss the advantages of columnar layouts in Parquet and Arrow for in-memory processing and data encodings used for storage (dictionary, bit-packing, prefix coding). We’ll dissect and explain the design choices that enable us to achieve all three goals of interoperability, space and query efficiency. In addition, we’ll provide an overview of what’s coming in Parquet and Arrow in the next year.
It’s no longer a world of just relational databases. Companies are increasingly adopting specialized datastores such as Hadoop, HBase, MongoDB, Elasticsearch, Solr and S3. Apache Drill, an open source, in-memory, columnar SQL execution engine, enables interactive SQL queries against more datastores.
Apache parquet - Apache big data North America 2017techmaddy
Apache Parquet brings the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem. Apache Parquet is built from the ground up with complex nested data structures in mind, and uses the record shredding and assembly algorithm described in the Dremel paper. We believe this approach is superior to simple flattening of nested name spaces. Apache Parquet is built to support very efficient compression and encoding schemes. Multiple projects have demonstrated the performance impact of applying the right compression and encoding scheme to the data. Apache Parquet allows compression schemes to be specified on a per-column level and is future-proofed to allow adding more encodings as they are invented and implemented. This talk highlights the internal implementation of Apache Parquet.
Hadoop Infrastructure @Uber Past, Present and FutureDataWorks Summit
Uber’s mission is to provide transportation as reliable as running water and for fulfilling that mission data plays a critical role. In Uber, Hadoop plays a critical role in Data Infrastructure. We want to talk about the journey of Hadoop @Uber and our future plans in terms of scaling for billions of trips. We will talk about most unique use case Uber have and how Hadoop and eco system which we built, helped us in this journey. We want to talk about how we scaled from 10 -> 2000 and In future to scale up to 10’s X1000 of Nodes. We will talk about our mistakes, learning and wins and how we process billions of events per day. We will talk about the unique challenges and real world use-cases and how we will co-locate the Uber’s service architecture with batch (e.g data pipelines, machine learning and analytical workloads). Uber have done lot of improvements to current Hadoop eco system and uniquely solved some of the problems in a way which is never been solved in the past. This presentation will help audience to use this as an example and even encourage them to enhance the eco system. This will help to increase the community of these project and overall help the whole big data space. Audience is anybody who is working on Big Data and want to understand how to scale Hadoop and eco system for 10s of thousands of node. This talk will help them understand the Hadoop ecosystem and how to efficiently use that. It will also introduce them to some of the awesome technologies which Uber team is building in big data space.
The columnar roadmap: Apache Parquet and Apache ArrowDataWorks Summit
The Hadoop ecosystem has standardized on columnar formats—Apache Parquet for on-disk storage and Apache Arrow for in-memory. With this trend, deep integration with columnar formats is a key differentiator for big data technologies. Vertical integration from storage to execution greatly improves the latency of accessing data by pushing projections and filters to the storage layer, reducing time spent in IO reading from disk, as well as CPU time spent decompressing and decoding. Standards like Arrow and Parquet make this integration even more valuable as data can now cross system boundaries without incurring costly translation. Cross-system programming using languages such as Spark, Python, or SQL can becomes as fast as native internal performance.
In this talk we’ll explain how Parquet is improving at the storage level, with metadata and statistics that will facilitate more optimizations in query engines in the future. We’ll detail how the new vectorized reader from Parquet to Arrow enables much faster reads by removing abstractions as well as several future improvements. We will also discuss how standard Arrow-based APIs pave the way to breaking the silos of big data. One example is Arrow-based universal function libraries that can be written in any language (Java, Scala, C++, Python, R, ...) and will be usable in any big data system (Spark, Impala, Presto, Drill). Another is a standard data access API with projection and predicate push downs, which will greatly simplify data access optimizations across the board.
Speaker
Julien Le Dem, Principal Engineer, WeWork
Hadoop Distributed File System (HDFS) evolves from a MapReduce-centric storage system to a generic, cost-effective storage infrastructure where HDFS stores all data of inside the organizations. The new use case presents a new sets of challenges to the original HDFS architecture. One challenge is to scale the storage management of HDFS - the centralized scheme within NameNode becomes a main bottleneck which limits the total number of files stored. Although a typical large HDFS cluster is able to store several hundred petabytes of data, it is inefficient to handle large amounts of small files under the current architecture.
In this talk, we introduce our new design and in-progress work that re-architects HDFS to attack this limitation. The storage management is enhanced to a distributed scheme. A new concept of storage container is introduced for storing objects. HDFS blocks are stored and managed as objects in the storage containers instead of being tracked only by NameNode. Storage containers are replicated across DataNodes using a newly-developed high-throughput protocol based on the Raft consensus algorithm. Our current prototype shows that under the new architecture the storage management of HDFS scales 10x better, demonstrating that HDFS is capable of storing billions of files.
Apache Spark 2.3 boosts advanced analytics and deep learning with PythonDataWorks Summit
Python is one of the most popular programming languages for advanced analytics, data science, machine learning, and deep learning. One of Python’s greatest assets is its extensive set of libraries, such as Numpy, Pandas, Scikit-learn, Theano, TensorFlow, Keras, and so on. Apache Spark is becoming the core component for big data processing and playing important role to help data scientists solve complicated problems. It has a great significance and strong demand to integrate Spark with the extremely rich Python ecosystems to handle challenges in artificial intelligence. In the latest Spark 2.3, some very exciting features were put in, for example: vectorized UDF in PySpark, which leverages Apache Arrow to provide high performance interoperability between Spark and Pandas/Numpy; Image format in dataFrame/dataset, which can improve Spark and TensorFlow (or other deep learning libraries) interoperability; high-efficiency parallel modeling tuning with Spark MLlib, etc. In this talk, we'll share best practice on real use cases and hands-on experiences to illustrate the power of these new features and bring more discussions on this topic.
Speaker: Yanbo Liang, Staff Software Engineer, Hortonworks
Data Science at Scale on MPP databases - Use Cases & Open Source ToolsEsther Vasiete
Pivotal workshop slide deck for Structure Data 2016 held in San Francisco.
Abstract:
Learn how data scientists at Pivotal build machine learning models at massive scale on open source MPP databases like Greenplum and HAWQ (under Apache incubation) using in-database machine learning libraries like MADlib (under Apache incubation) and procedural languages like PL/Python and PL/R to take full advantage of the rich set of libraries in the open source community. This workshop will walk you through use cases in text analytics and image processing on MPP.
This is slides from our recent HadoopIsrael meetup. It is dedicated to comparison Spark and Tez frameworks.
In the end of the meetup there is small update about our ImpalaToGo project.
Improving Python and Spark (PySpark) Performance and InteroperabilityWes McKinney
Slides from Spark Summit East 2017 — February 9, 2017 in Boston. Discusses ongoing development work to accelerate Python-on-Spark performance using Apache Arrow and other tools
Improving Python and Spark Performance and Interoperability with Apache Arrow...Databricks
Apache Spark has become a popular and successful way for Python programming to parallelize and scale up data processing. In many use cases though, a PySpark job can perform worse than an equivalent job written in Scala. It is also costly to push and pull data between the user’s Python environment and the Spark master.
Apache Arrow-based interconnection between the various big data tools (SQL, UDFs, machine learning, big data frameworks, etc.) enables you to use them together seamlessly and efficiently, without overhead. When collocated on the same processing node, read-only shared memory and IPC avoid communication overhead. When remote, scatter-gather I/O sends the memory representation directly to the socket avoiding serialization costs.
Hadoop makes it relatively easy to store petabytes of data. However, storing data is not enough; columnar layouts for storage and in-memory execution allow the analysis of large amounts of data very quickly and efficiently. It provides the ability for multiple applications to share a common data representation and perform operations at full CPU throughput using SIMD and Vectorization. For interoperability, row based encodings (CSV, Thrift, Avro) combined with general purpose compression algorithms (GZip, LZO, Snappy) are common but inefficient. As discussed extensively in the database literature, a columnar layout with statistics and sorting provides vertical and horizontal partitioning, thus keeping IO to a minimum. Additionally a number of key big data technologies have or will soon have in-memory columnar capabilities. This includes Kudu, Ibis and Drill. Sharing a common in-memory columnar representation allows interoperability without the usual cost of serialization.
Understanding modern CPU architecture is critical to maximizing processing throughput. We’ll discuss the advantages of columnar layouts in Parquet and Arrow for in-memory processing and data encodings used for storage (dictionary, bit-packing, prefix coding). We’ll dissect and explain the design choices that enable us to achieve all three goals of interoperability, space and query efficiency. In addition, we’ll provide an overview of what’s coming in Parquet and Arrow in the next year.
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...Dremio Corporation
Essentially every successful analytical DBMS in the market today makes use of column-oriented data structures. In the Hadoop ecosystem, Apache Parquet (and Apache ORC) provide similar advantages in terms of processing and storage efficiency. Apache Arrow is the in-memory counterpart to these formats and has been been embraced by over a dozen open source projects as the de facto standard for in-memory processing. In this session the PMC Chair for Apache Arrow and the PMC Chair for Apache Parquet discuss the future of column-oriented processing.
Improving Python and Spark Performance and Interoperability with Apache ArrowLi Jin
Apache Spark has become a popular and successful way for Python programming to parallelize and scale up data processing. In many use cases though, a PySpark job can perform worse than an equivalent job written in Scala. It is also costly to push and pull data between the user’s Python environment and the Spark master.
Apache Arrow-based interconnection between the various big data tools (SQL, UDFs, machine learning, big data frameworks, etc.) enables you to use them together seamlessly and efficiently, without overhead. When collocated on the same processing node, read-only shared memory and IPC avoid communication overhead. When remote, scatter-gather I/O sends the memory representation directly to the socket avoiding serialization costs.
Improving Python and Spark Performance and Interoperability with Apache ArrowTwo Sigma
Apache Arrow-based interconnection between various big data tools (SQL, UDFs, machine learning, big data frameworks, etc.) enables you to use them together seamlessly and efficiently,
From Mainframe to Microservices: Vanguard’s Move to the Cloud - ENT331 - re:I...Amazon Web Services
Maintaining control of sensitive data is critical in the highly regulated financial investments environment that Vanguard operates in. This need for data control complicated Vanguard's move to the cloud. They needed to expand globally to provide a great user experience while at the same time maintaining their mainframe-based backend data architecture. In this session, Vanguard discusses the creative approach they took to decouple their monolithic backend architecture to empower a microservices architecture while maintaining compliance with regulations. They also cover solutions implemented to successfully meet their requirements for security, latency, and end-state consistency.
A framework used to bridge between the language of business and PLCSMagnus Färneland
Presentation on the topic of how using a standard like PLCS (Product Life cycle Support, ISO10303-239) can increase the quality of your current and future data.
Presentation held by Magnus Färneland at the GPDIS conference in Phoenix, AZ
[db tech showcase Tokyo 2017] C13:There and back again or how to connect Orac...Insight Technology, Inc.
The presentation will show how we can combine Oracle relation database and ''Big Data'' platforms using Oracle and 3-d party tools and utilities. We will discuss ''Big Data'' and what we mean by it, talk about different topologies and how they can be applied to the business needs. Primarily the presentation will focus on Oracle integration tools such as Oracle GoldenGate, ODI, BD SQL and how they can help us with data ingestion and further usage of the data in our workflow. In addition, we are going to touch alternative tools and utilities and how they can be used.
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheDremio Corporation
From DataEngConf 2017 - Everybody wants to get to data faster. As we move from more general solution to specific optimization techniques, the level of performance impact grows. This talk will discuss how layering in-memory caching, columnar storage and relational caching can combine to provide a substantial improvement in overall data science and analytical workloads. It will include a detailed overview of how you can use Apache Arrow, Calcite and Parquet to achieve multiple magnitudes improvement in performance over what is currently possible.
Using LLVM to accelerate processing of data in Apache ArrowDataWorks Summit
Most query engines follow an interpreter-based approach where a SQL query is translated into a tree of relational algebra operations then fed through a conventional tuple-based iterator model to execute the query. We will explore the overhead associated with this approach and how the performance of query execution on columnar data can be improved using run-time code generation via LLVM.
Generally speaking, the best case for optimal query execution performance is a hand-written query plan that does exactly what is needed by the query for the exact same data types and format. Vectorized query processing models amortize the cost of function calls. However, research has shown that hand-written code for a given query plan has the potential to outperform the optimizations associated with a vectorized query processing model.
Over the last decade, the LLVM compiler framework has seen significant development. Furthermore, the database community has realized the potential of LLVM to boost query performance by implementing JIT query compilation frameworks. With LLVM, a SQL query is translated into a portable intermediary representation (IR) which is subsequently converted into machine code for the desired target architecture.
Dremio is built on top of Apache Arrow’s in-memory columnar vector format. The in-memory vectors map directly to the vector type in LLVM and that makes our job easier when writing the query processing algorithms in LLVM. We will talk about how Dremio implemented query processing logic in LLVM for some operators like FILTER and PROJECT. We will also discuss the performance benefits of LLVM-based vectorized query execution over other methods.
Speaker
Siddharth Teotia, Dremio, Software Engineer
Apache Arrow is designed to make things faster. Its focused on speeding communication between systems as well as processing within any one system. In this talk I'll start by discussing what Arrow is and why it was built. This will include covering an overview of the key components, goals, vision and current state. I’ll then take the audience through a detailed engineering review of how we used Arrow to solve several problems when building the Apache-Licensed Dremio product. This will include talking about Arrow performance characteristics, working with Arrow APIs, managing memory, sizing Arrow vectors, and moving data between processes and/or nodes. We’ll also review several code examples of specific data processing implementations and how they interact with Arrow data. Lastly we’ll spend a short amount of time on what’s next for Arrow. This will be a highly technical talk targeted towards people building data infrastructure systems and complex workflows.
Enabling Python to be a Better Big Data CitizenWes McKinney
These slides are from my talk at the NYC Python Meetup at ODSC Office NYC on February 17, 2016. It discusses Python's architectural challenges to interoperate with the Hadoop ecosystem and how a new project, Apache Arrow, will help.
FINRA's Managed Data Lake: Next-Gen Analytics in the Cloud - ENT328 - re:Inve...Amazon Web Services
Financial Impact Regulatory Authority (FINRA)'s Technology Group has changed its customers' relationship with data by creating a managed data lake that enables discovery on petabytes of capital markets' data, while saving time and money over traditional analytics solutions. FINRA's managed data lake unlocks the value in its data to accelerate analytics and machine learning at scale. The data lake includes a centralized data catalog and separates storage from compute, allowing users to query from petabytes of data in seconds. Learn how FINRA uses Spot Instances and services such as Amazon S3, Amazon EMR, Amazon Redshift, and AWS Lambda to provide the right tool for the right job at each step in the data processing pipeline. All of this is done while meeting FINRA's security and compliance responsibilities as a financial regulator.
Strata NY 2016: The future of column-oriented data processing with Arrow and ...Julien Le Dem
In pursuit of speed, big data is evolving toward columnar execution. The solid foundation laid by Arrow and Parquet for a shared columnar representation across the ecosystem promises a great future. Julien Le Dem and Jacques Nadeau discuss the future of columnar and the hardware trends it takes advantage of, like RDMA, SSDs, and nonvolatile memory.
During this webinar, we will review best practices and lessons learned from working with large and mid-size companies on their deployment of PostgreSQL. We will explore the practices that helped industry leaders move through these stages quickly, and get as much value out of PostgreSQL as possible without incurring undue risk.
We have identified a set of levers that companies can use to accelerate their success with PostgreSQL:
- Application Tiering
- Collaboration between DBAs and Development Teams
- Evangelizing
- Standardization and Automation
- Balance of Migration and New Development
Horses for Courses: Database RoundtableEric Kavanagh
The blessing and curse of today's database market? So many choices! While relational databases still dominate the day-to-day business, a host of alternatives has evolved around very specific use cases: graph, document, NoSQL, hybrid (HTAP), column store, the list goes on. And the database tools market is teeming with activity as well. Register for this special Research Webcast to hear Dr. Robin Bloor share his early findings about the evolving database market. He'll be joined by Steve Sarsfield of HPE Vertica, and Robert Reeves of Datical in a roundtable discussion with Bloor Group CEO Eric Kavanagh. Send any questions to info@insideanalysis.com, or tweet with #DBSurvival.
Similar to Improving Python and Spark Performance and Interoperability with Apache Arrow (20)
Data and AI summit: data pipelines observability with open lineageJulien Le Dem
Presentation of Data lineage an Observability with OpenLineage at the "Data and AI summit" (formerly Spark summit). With a focus on the Apache Spark integration for OpenLineage
How to use Parquet as a basis for ETL and analyticsJulien Le Dem
Parquet is a columnar format designed to be extremely efficient and interoperable across the hadoop ecosystem. Its integration in most of the Hadoop processing frameworks (Impala, Hive, Pig, Cascading, Crunch, Scalding, Spark, …) and serialization models (Thrift, Avro, Protocol Buffers, …) makes it easy to use in existing ETL and processing pipelines, while giving flexibility of choice on the query engine (whether in Java or C++). In this talk, we will describe how one can us Parquet with a wide variety of data analysis tools like Spark, Impala, Pig, Hive, and Cascading to create powerful, efficient data analysis pipelines. Data management is simplified as the format is self describing and handles schema evolution. Support for nested structures enables more natural modeling of data for Hadoop compared to flat representations that create the need for often costly joins.
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Improving Python and Spark Performance and Interoperability with Apache Arrow
1. Improving Python and Spark
Performance and Interoperability
with Apache Arrow
Julien Le Dem
Principal Architect
Dremio
Li Jin
Software Engineer
Two Sigma Investments
We are a quantitative hedge fund based in New York City
Spark build-in functions overs basic math opeartors and functions, such as ‘mean’, ‘stddev’, ‘sum’
This is in comparison with build-in spark functions such as mean and sum where the python code gets translated into java code and executed in JVM
Grouped based udf doesn’t exist now.
Not sure about calling: lambda values: np.mean(np.array(values)) “grouped based udf”
60x is computed by:
T1 = df.agg(F.sum(df.v1))
T2 = df.withColumn(‘v2’, when(df.v1 > 0, 1.0).otherwise(-1.0)).agg(F.sum(‘v2’)
T3 = df.withColumn(‘v3’, F.udf(lambda x: 1.0 if x > 0 else -1.0, DoubleType())(df.v1)).agg(F.sum(“v3”))
(T3 – T1) / (T2 – T1)
Grouping here can be by some id, category, or time based grouping, daily average
Groups are materialized is one of the performance issue. However, this is not the one we focus on here.
For instance, groupBy().agg(collect_list()) fully materialized the group. It takes about 1.5 seconds to do that, while the normalization examples takes more than 25 seconds
Groups are materialized is one of the performance issue. However, this is not the one we focus on here.
For instance, groupBy().agg(collect_list()) fully materialized the group. It takes about 1.5 seconds to do that, while the normalization examples takes more than 25 seconds
(complicated code. Audience should not read the code. The presenter shows the boiler plate code vs actually code. The point here is show audience how difficult it is to do such things)
(complicated code. Audience should not read the code. The presenter shows the boiler plate code vs actually code. The point here is show audience how difficult it is to do such things)
(complicated code. Audience should not read the code. The presenter shows the boiler plate code vs actually code. The point here is show audience how difficult it is to do such things)
Scatter/gather I/O also unknown as vectored I/O is the a method where single procedure call reads data from a stream and write to multiple buffers
To illustrate further, here is the profiling result of python worker during one UDF evaluation.
Without profiling, It took about 2 seconds to process 2M doubles. And most of the time is spent in serialization and udf evalution
So we made some changes to the PySpark
Note: Without profile, 4 seconds -> 2 seconds
Let’s now compare Scalar vs Vectorized UDF. For 2 M doubles, the runtime of the python worker goes from 2 seconds to about 0.14 seconds, that’s a 15x speed up.
Note: Without profile, 4 seconds -> 2 seconds
Let’s now compare Scalar vs Vectorized UDF. For 2 M doubles, the runtime of the python worker goes from 2 seconds to about 0.14 seconds, that’s a 15x speed up.
Note: Without profile, 4 seconds -> 2 seconds
Let’s now compare Scalar vs Vectorized UDF. For 2 M doubles, the runtime of the python worker goes from 2 seconds to about 0.14 seconds, that’s a 15x speed up.
Groups are materialized is one of the performance issue. However, this is not the one we focus on here.
For instance, groupBy().agg(collect_list()) fully materialized the group. It takes about 1.5 seconds to do that, while the normalization examples takes more than 25 seconds
Make it clear before and after
Less time on this slide
RPC: Generic way to exchange data.
IPC: Share memory
Integration: