For Ruby to be widely used for data processing, Apache Arrow support is important. With Apache Arrow, we can do the following:
* Super fast large data interchange and processing
* Reading/writing data in several famous formats such as CSV and Apache Parquet
* Reading/writing partitioned large data on cloud storage such as Amazon S3
This talk describes the following:
* What is Apache Arrow
* How to use Apache Arrow with Ruby
* How to integrate with Ruby 3.0 features such as MemoryView and Ractor
Apache Spark™ is a fast and general engine for large-scale data processing. Spark is written in Scala and runs on top of the JVM, but Python is one of the officially supported languages. So how does it actually work? How can Python communicate with Java / Scala? In this talk, we’ll dive into the PySpark internals and try to understand how to write and test high-performance PySpark applications.
Back in ye olde days of Spark, using Python with Spark was an exercise in patience. Data was moving up and down from Python to Scala, being serialised constantly. Leveraging SparkSQL and avoiding UDFs made things better, as did the constant improvement of the optimisers (Catalyst and Tungsten). But with Spark 2.3, PySpark has sped up tremendously thanks to the (still experimental) addition of the Arrow serialisers.
In this talk we will learn how PySpark has improved its performance in Apache Spark 2.3 by using Apache Arrow. To do this, we will travel through the internals of Spark to find how Python interacts with the Scala core, and some of the internals of Pandas to see how data moves from Python to Scala via Arrow.
https://github.com/rberenguel/pyspark-arrow-pandas
London Spark Meetup: Project Tungsten, Oct 12 2015 – Chris Fregly
Building on a previous talk about how Spark beat Hadoop @ 100TB Daytona GraySort, we present low-level details of Project Tungsten which includes many CPU and Memory optimizations.
How does that PySpark thing work? And why Arrow makes it faster? – Rubén Berenguel
Amazon Elastic MapReduce (EMR) is a web service that allows you to easily and securely provision and manage your Hadoop clusters. In this talk, we will introduce you to Amazon EMR design patterns, such as using various data stores, including Amazon S3; how to take advantage of both transient and active clusters; and other Amazon EMR architectural patterns. We will dive deep into how to dynamically scale your cluster and address the ways you can fine-tune your cluster. We will discuss bootstrapping Hadoop applications from our partner ecosystem that you can use natively with Amazon EMR. Lastly, we will share best practices on how to keep your Amazon EMR cluster cost-effective.
Intro to PySpark: Python Data Analysis at scale in the Cloud – Daniel Zivkovic
Why would you care? Because PySpark is a cloud-agnostic analytics tool for Big Data processing, "hidden" in:
* AWS Glue - Managed ETL Service
* Amazon EMR - Big Data Platform
* Google Cloud Dataproc - Cloud-native Spark and Hadoop
* Azure HDInsight - Microsoft implementation of Apache Spark in the cloud
In this #ServerlessTO talk, Jonathan Rioux - Head of Data Science at EPAM Canada and author of the book PySpark in Action (https://www.manning.com/books/pyspark-in-action) - will get you acquainted with PySpark, the Python API for Spark.
Event details: https://www.meetup.com/Serverless-Toronto/events/269124392/
Event recording: https://youtu.be/QGxytMbrjGY
As always, BIG thanks to our knowledge sponsor Manning Publications, who generously offered to raffle not 1 but 3 of Jonathan's books!
RSVP for more exciting (online) events at https://www.meetup.com/Serverless-Toronto/events/
AWS re:Invent 2016: How to Scale and Operate Elasticsearch on AWS (DEV307) – Amazon Web Services
Elasticsearch has quickly become the leading open source technology for scaling search and building document services. Many software providers have come to rely on it to serve the needs of high-performance, production applications.
In this talk, we’ll go deep on lessons learned from three years in production scaling from a few shards to more than 100 spread across 100s of nodes on AWS--to serve real-time queries against 100s of millions of documents.
Attendees will learn:
* How to capacity plan for ES on AWS
* How to scale and reshard on AWS with zero downtime
* What AWS and ES metrics to collect and alert on
* Tips on day to day ES operations
Session sponsored by SignalFx.
An introduction to working with large data sets in R on Amazon EC2 with S3 support, covering data distribution and job fragmentation using Hadoop as the MapReduce implementation.
Real-Time Big Data at In-Memory Speed, Using Storm – Nati Shalom
Storm, a popular framework from Twitter, is used for real-time event processing. The challenge presented is how to manage the state of your real-time data processing at all times. In addition, you need Storm to integrate with your batch processing system (such as Hadoop) in a consistent manner.
This session will demonstrate how to integrate Storm with an in-memory database/grid, and explore various strategies for integrating the data grid with Hadoop and Cassandra, seamlessly. By achieving smooth integration with consistent management, you will be able to easily manage all the tiers of your Big Data stack in a consistent and effective way.
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013 – Sonal Raj
This talk briefly outlines the Storm framework and Neo4J graph database, and how to compositely use them to perform computations on complex graphs in Python using the Petrel and Py2neo packages. This talk was given at PyCon India 2013.
Brussels Spark Meetup Oct 30, 2015: Spark After Dark 1.5: Real-time, Advanc... – Chris Fregly
Combining the most popular and technically-deep material from his wildly popular Advanced Apache Spark Meetup, Chris Fregly will provide code-level deep dives into the latest performance and scalability advancements within the Apache Spark Ecosystem by exploring the following:
1) Building a Scalable and Performant Spark SQL/DataFrames Data Source Connector such as Spark-CSV, Spark-Cassandra, Spark-ElasticSearch, and Spark-Redshift
2) Speeding Up Spark SQL Queries using Partition Pruning and Predicate Pushdowns with CSV, JSON, Parquet, Avro, and ORC
3) Tuning Spark Streaming Performance and Fault Tolerance with KafkaRDD and KinesisRDD
4) Maintaining Stability during High Scale Streaming Ingestion using Approximations and Probabilistic Data Structures from Spark, Redis, and Twitter's Algebird
5) Building Effective Machine Learning Models using Feature Engineering, Dimension Reduction, and Natural Language Processing with MLlib/GraphX, ML Pipelines, DIMSUM, Locality Sensitive Hashing, and Stanford's CoreNLP
6) Tuning Core Spark Performance by Acknowledging Mechanical Sympathy for the Physical Limitations of OS and Hardware Resources such as CPU, Memory, Network, and Disk with Project Tungsten, Asynchronous Netty, and Linux epoll
* Demos *
This talk features many interesting and audience-interactive demos - as well as code-level deep dives into many of the projects listed above.
All demo code is available on Github at the following link: https://github.com/fluxcapacitor/pipeline/wiki
In addition, the entire demo environment has been Dockerized and made available for download on Docker Hub at the following link: https://hub.docker.com/r/fluxcapacitor/pipeline/
* Speaker Bio *
Chris Fregly is a Principal Data Solutions Engineer for the newly-formed IBM Spark Technology Center, an Apache Spark Contributor, a Netflix Open Source Committer, as well as the Organizer of the global Advanced Apache Spark Meetup and Author of the Upcoming Book, Advanced Spark.
Previously, Chris was a Data Solutions Engineer at Databricks and a Streaming Data Engineer at Netflix.
When Chris isn’t contributing to Spark and other open source projects, he’s creating book chapters, slides, and demos to share knowledge with his peers at meetups and conferences throughout the world.
(BDT208) A Technical Introduction to Amazon Elastic MapReduce – Amazon Web Services
"Amazon EMR provides a managed framework which makes it easy, cost effective, and secure to run data processing frameworks such as Apache Hadoop, Apache Spark, and Presto on AWS. In this session, you learn the key design principles behind running these frameworks on the cloud and the feature set that Amazon EMR offers. We discuss the benefits of decoupling compute and storage and strategies to take advantage of the scale and the parallelism that the cloud offers, while lowering costs. Additionally, you hear from AOL’s Senior Software Engineer on how they used these strategies to migrate their Hadoop workloads to the AWS cloud and lessons learned along the way.
In this session, you learn the benefits of decoupling storage and compute and allowing them to scale independently; how to run Hadoop, Spark, Presto, and other supported Hadoop applications on Amazon EMR; how to use Amazon S3 as a persistent data store and process data directly from Amazon S3; deployment strategies and how to avoid common mistakes when deploying at scale; and how to use Spot Instances to scale your transient infrastructure effectively."
Learning Stream Processing with Apache Storm – Eugene Dvorkin
Over the last couple of years, Apache Storm became a de-facto standard for developing real-time analytics and complex event processing applications. Storm enables you to tackle real-time data processing challenges the same way Hadoop enables batch processing of Big Data. Storm enables companies to have "Fast Data" alongside "Big Data". Some use cases where Storm can be used are Fraud Detection, Operational Intelligence, Machine Learning, ETL, Analytics, etc.
In this meetup, Eugene Dvorkin, Architect @WebMD and NYC Storm User Group organizer will teach Apache Storm and Stream Processing fundamentals. While this meeting is geared toward new Storm users, experienced users may find something interesting as well.
The following topics will be covered:
• Why use Apache Storm?
• Common use cases
• Storm Architecture - components, concepts, topology
• Building simple Storm topology with Java and Groovy
• Trident and micro-batch processing
• Fault tolerance and guaranteed message delivery
• Running and monitoring Storm in production
• Kafka
• Storm at WebMD
• Resources
Apache Cassandra and Spark, when combined, can give powerful OLTP and OLAP functionality for your data. We’ll walk through the basics of both of these platforms before diving into applications combining the two. Joins, changing a partition key, or importing data are usually difficult in Cassandra, but we’ll see how to do these and other operations in a set of simple Spark shell one-liners!
Organizations often need to quickly analyze large amounts of data, such as logs generated from a wide variety of sources and formats. However, traditional approaches require a lot of time and effort designing complex data transformation and loading processes; and configuring data warehouses. Using AWS, you can start querying your datasets within minutes. In this session you will learn how you can deploy a managed Presto environment in minutes to interactively query log data using standard ANSI SQL. Presto is a popular open source SQL engine for running interactive analytic queries against data sources of all sizes. We will talk about common use cases and best practices for running Presto on Amazon EMR.
Real time big data analytics with Storm by Ron Bodkin of Think Big Analytics – Data Con LA
This talk provides an overview of the open source Storm system for processing Big Data in realtime. The talk starts with an overview of the technology, including key components: Nimbus, Zookeeper, Topology, Tuple, Trident. It looks at integration with Hadoop through YARN and recent improvements. The presentation then dives into the complex Big Data architecture in which Storm can be integrated. The result is a compelling stack of technologies including integrated Hadoop clusters, MPP, and NoSQL databases.
After this, we look at example use cases for Storm: real-time advertising statistics, updating a Machine Learned model for content popularity predictions, and financial compliance monitoring.
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent... – Amazon Web Services
Big data technologies let you work with any velocity, volume, or variety of data in a highly productive environment. Join the General Manager of Amazon EMR, Peter Sirota, to learn how to scale your analytics, use Hadoop with Amazon EMR, write queries with Hive, develop real world data flows with Pig, and understand the operational needs of a production data platform.
Presentation slides for "In-Memory Storage Evolution in Apache Spark" at Spark+AI Summit 2019
https://databricks.com/session/in-memory-storage-evolution-in-apache-spark
RubyKaigi 2022 - Fast data processing with Ruby and Apache Arrow – Kouhei Sutou
I introduced Ruby and Apache Arrow integration including the "super fast large data interchange and processing" Apache Arrow feature at RubyKaigi Takeout 2021.
This talk introduces how we can use the "super fast large data interchange and processing" Apache Arrow feature in Ruby. Here are some use cases:
* Fast data retrieval (fast pluck) from DB such as MySQL and PostgreSQL for batch processes in a Ruby on Rails application
* Fast data interchange with JavaScript for dynamic visualization in a Ruby on Rails application
* Fast OLAP with in-process DB such as DuckDB and Apache Arrow DataFusion in a Ruby on Rails application or irb session
Crossing the Bridge: Connecting Rails and your Front-end Framework – Daniel Spector
Presented at Railsconf 2015 by Daniel Spector, @danielspecs.
Crossing the Bridge explores tools, patterns and best practices to connect your JavaScript MVC framework to Rails in the most seamless way possible. The talk progresses from demonstrating the standard API request cycle to preloading data into your client-side framework to rendering your JavaScript on the server. It explores Isomorphic JavaScript and ways of implementing it with Rails.
Alpine academy apache spark series #1 introduction to cluster computing wit... – Holden Karau
Alpine academy apache spark series #1: introduction to cluster computing with Python & a wee bit of Scala. This is the first in the series and is aimed at the intro level; the next one will cover MLlib & ML.
Pycon Colombia 2018
One year ago I joined a team that favours Serverless; since then I’ve been building and maintaining lots of services using Serverless. With a pinch of skepticism, I sailed through some of the challenges and tooling, and I want to share with the community the pains and glory of it.
Rails and the Apache SOLR Search Engine – David Keener
What good is content if nobody can find it? Many information sites are like icebergs, with only a limited amount of content directly accessible to users and the rest, the "underwater" portion, only available through searches. This talk shows how Rails web sites can take advantage of the world-class Apache SOLR search engine to provide sophisticated and customizable search features. We'll cover how to get started with SOLR, integrating with SOLR using the Sunspot gem, indexing, hit highlighting, and other topics.
Super simple introduction to REST-APIs (2nd version) – Patrick Savalle
See also: https://hersengarage.nl/rest-api-design-as-a-craft-not-an-art-a3fd97ed3ef4
An API is an interface, or client-server contract, and REST is an HTTP design pattern. A REST API is the de facto standard for web interfaces. It maps server resources onto URLs and allows CRUD-like (Create-Read-Update-Delete) manipulations of those resources.
In this presentation we cover the basics of:
- The HTTP protocol
- The REST design pattern
- The API
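To make the resource/verb mapping concrete, here is a minimal Python sketch of a hypothetical `/users` resource; the function name, store, and status codes are illustrative, not taken from the presentation. Each HTTP method maps onto exactly one CRUD operation against the collection.

```python
# In-memory store standing in for a server-side resource collection.
store = {}
next_id = [1]  # boxed in a list so handle() can update it

def handle(method, path, body=None):
    """Dispatch (method, path) onto CRUD operations for /users/<id>."""
    parts = path.strip("/").split("/")
    if parts[0] != "users":
        return 404, None
    if method == "POST" and len(parts) == 1:        # Create
        uid = next_id[0]
        next_id[0] += 1
        store[uid] = body
        return 201, uid
    if method == "GET" and len(parts) == 2:         # Read
        uid = int(parts[1])
        return (200, store[uid]) if uid in store else (404, None)
    if method == "PUT" and len(parts) == 2:         # Update
        store[int(parts[1])] = body
        return 200, body
    if method == "DELETE" and len(parts) == 2:      # Delete
        return 204, store.pop(int(parts[1]), None)
    return 405, None
```

`POST /users` creates a resource and returns its new id; `GET`, `PUT`, and `DELETE` then address `/users/<id>` individually, which is the resource-to-URL mapping the slides describe.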
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...” – Provectus
Apache Beam is an open source, unified model and set of language-specific SDKs for defining and executing data processing pipelines, as well as data ingestion and integration flows, supporting both batch and streaming use cases. In this presentation I will provide a general overview of Apache Beam and a comparison of the programming models of Apache Beam and Apache Spark.
Join this video course on Udemy via the link below:
https://www.udemy.com/mastering-rtos-hands-on-with-freertos-arduino-and-stm32fx/?couponCode=SLIDESHARE
>> The Complete FreeRTOS Course with Programming and Debugging <<
"The biggest objective of this course is to demystify RTOS practically, using FreeRTOS and STM32 MCUs"
A STEP-by-STEP guide to port/run FreeRTOS using a development setup which includes:
1) Eclipse + STM32F4xx + FreeRTOS + SEGGER SystemView
2) FreeRTOS+Simulator (For windows)
We demystify the complete architecture-related (ARM Cortex-M) code of FreeRTOS, which will massively help you to put this kernel on any target hardware of your choice.
Similar to: RubyKaigi Takeout 2021 - Red Arrow - Ruby and Apache Arrow
Apache Arrow 1.0 - A cross-language development platform for in-memory data – Kouhei Sutou
Apache Arrow is a cross-language development platform for in-memory data. You can use Apache Arrow to process large data effectively in Python and other languages such as R. Apache Arrow is the future of data processing. Apache Arrow 1.0, the first major version, was released on 2020-07-24. It's a good time to get to know Apache Arrow and start using it.
Apache Arrow - A cross-language development platform for in-memory dataKouhei Sutou
Apache Arrow is the future for data processing systems. This talk describes how to solve data sharing overhead in data processing system such as Spark and PySpark. This talk also describes how to accelerate computation against your large data by Apache Arrow.
csv, one of the standard libraries, in Ruby 2.6 has many improvements:
* Default gemified
* Faster CSV parsing
* Faster CSV writing
* Clean new CSV parser implementation for further improvements
* Reconstructed test suites for further improvements
* Benchmark suites for further performance improvements
These improvements are done without breaking backward compatibility.
This talk describes details of these improvements by a new csv maintainer.
PGroonga 2 – Make PostgreSQL rich full text search system backend!Kouhei Sutou
PGroonga 2.0 has been released with 2 years development since PGroonga 1.0.0. PGroonga 1.0.0 just provides fast full text search with all languages support. It's important because it's a lacked feature in PostgreSQL. PGroonga 2.0 provides more useful features to implement rich full text search system with PostgreSQL. This session shows how to implement rich full text search system with PostgreSQL!
This talk describes about PGroonga that resolves these problems.
PostgreSQL Conference Japan 2017の資料です。
PostgreSQLが苦手な全文検索機能を強力に補完する拡張機能PGroongaの最新情報を紹介します。PGroongaは単なる全文検索モジュールではありません。PostgreSQLで簡単にリッチな全文検索システムを実現するための機能一式を提供します。PGroongaは最新PostgreSQLへの対応も積極的です。例としてロジカルレプリケーションと組み合わせた活用方法も紹介します。
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
As Europe's leading economic powerhouse and the fourth-largest hashtag#economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like hashtag#Russia and hashtag#China, hashtag#Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in hashtag#cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to hashtag#AdvancedPersistentThreats (hashtag#APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
RubyKaigi Takeout 2021 - Red Arrow - Ruby and Apache Arrow
1. Red Arrow - Ruby and Apache Arrow Powered by Rabbit 3.0.1
Red Arrow
Ruby and Apache Arrow
Sutou Kouhei
ClearCode Inc.
RubyKaigi Takeout 2021
2021-09-11
2. Sutou Kouhei
A president Rubyist
The president of ClearCode Inc.
3. Sutou Kouhei
An Apache Arrow contributor
A member of the Apache Arrow PMC
PMC: Project Management Committee
#2 in commits
4. Sutou Kouhei
The pioneer of Ruby and Arrow
The author of Red Arrow
Red Arrow:
The official Apache Arrow library for Ruby
GObject Introspection based bindings
Apache Arrow GLib is developed for Red Arrow
5. GObject Introspection?
A way to implement bindings
See also: "Ruby bindings 2016 - How to create bindings" (RubyKaigi 2016, 2016-09-09)
https://rubykaigi.org/2016/presentations/ktou.html
6. Why do I work on Red Arrow?
To use Ruby for data processing!
At least a part of data processing
Results of my 5 years of work:
We can use Ruby for some data processing!
7. Goal of this talk
You want to use Ruby for some data processing
You join the Red Data Tools project
8. Red Data Tools project?
Red Data Tools is a project that provides data processing tools for Ruby
https://red-data-tools.github.io/
9. Data processing?
... how?
10. Step 0: Why do you want to process data?
What problem do you want to solve?
What data is needed for it?
...
No Red Arrow support in this area
11. Step 1: Collect data
Where are the data?
Where are the collected data stored?
...
Red Arrow offers some support in this area
12. Common datasets
require "datasets"
Datasets::Iris.new
Datasets::PostalCodeJapan.new
Datasets::Wikipedia.new
Red Datasets
https://github.com/red-data-tools/red-datasets
13. Output: Local file
require "datasets-arrow"
dataset = Datasets::PostalCodeJapan.new
dataset.to_arrow.save("codes.csv")
dataset.to_arrow.save("codes.arrow")
Red Datasets Arrow
https://github.com/red-data-tools/red-datasets-arrow
14. #save
General serialization API for table data
Serializes to the specified format
If you use a Red Arrow object for in-memory table data, you can serialize to many formats! Cool!
Extensible!
15. #save: Implementation
module Arrow
  class Table
    def save(output)
      saver = TableSaver.new(self, output)
      saver.save
    end
  end
end
16. #save: Implementation
class Arrow::TableSaver
  def save
    format = detect_format(@output)
    __send__("save_as_#{format}")
  end

  def save_as_csv
  end
end
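The slide elides `detect_format`. A minimal sketch of how such dispatch could work is below; this is my assumption based on the file extensions used elsewhere in the talk, not the actual Red Arrow implementation:

```ruby
# Hypothetical sketch: pick a format symbol from the output path's
# extension, defaulting to :arrow. Not the real Arrow::TableSaver code.
def detect_format(output)
  extension = File.extname(output.to_s).delete_prefix(".")
  extension.empty? ? :arrow : extension.to_sym
end

detect_format("codes.csv")     # => :csv
detect_format("codes.parquet") # => :parquet
detect_format("codes")         # => :arrow
```

The symbol is then interpolated into the method name (`save_as_csv`, `save_as_parquet`, ...), which is what makes the include-based extension on the next slides possible.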
17. #save: Extended by Red Parquet
module Parquet::ArrowTableSavable
  def save_as_parquet
  end

  Arrow::TableSaver.include(self)
end
Red Parquet is a subproject of Red Arrow
18. #save: Extended
require "datasets-arrow"
require "parquet"
dataset = Datasets::PostalCodeJapan.new
dataset.to_arrow.save("codes.parquet")
19. Output: Online storage: Fluentd
fluent-plugin-s3-arrow:
Collect data with Fluentd
Format data as Apache Parquet with Red Arrow
Store data in Amazon S3 with fluent-plugin-s3
By @kanga33 at Speee/Red Data Tools
https://github.com/red-data-tools/fluent-plugin-s3-arrow/
20. Output: Online storage: Red Arrow
require "datasets-arrow"
require "arrow-dataset"
dataset = Datasets::PostalCodeJapan.new
url = URI("s3://mybucket/codes.parquet")
dataset.to_arrow.save(url)
Implementing...
21. #save: Implementing...
class Arrow::TableSaver
  def save
    if @output.is_a?(URI)
      __send__("save_to_uri")
    else
      __send__("save_to_file")
    end
  end
end
22. Collect data with Red Arrow: Wrap up
Usable as a serializer for common formats
Usable as a writer to common locations in the near future...
23. Step 2: Read data
What format is used?
Where are the collected data?
How large are the collected data?
24. Format
require "arrow"
table = Arrow::Table.load("data.csv")
table = Arrow::Table.load("data.json")
table = Arrow::Table.load("data.arrow")
table = Arrow::Table.load("data.orc")
25. .load
General deserialization API for table data
Deserializes common formats
Extensible!
26. .load: Implementation
module Arrow
  def Table.load(input)
    loader = TableLoader.new(self, input)
    loader.load
  end
end
27. .load: Implementation
class Arrow::TableLoader
  def load
    format = detect_format(@input)
    __send__("load_as_#{format}")
  end

  def load_as_csv
  end
end
28. .load: Extended by Red Parquet
module Parquet::ArrowTableLoadable
  def load_as_parquet
  end

  Arrow::TableLoader.include(self)
end
Red Parquet is a subproject of Red Arrow
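The include-based extension pattern above — a "plugin" module defines `load_as_*` methods and includes itself into the loader — can be sketched in plain, self-contained Ruby. All names here (`TinyLoader`, `ParquetLoadable`) are illustrative, not the real Red Arrow / Red Parquet code:

```ruby
# Minimal sketch of the extension pattern used by Arrow::TableLoader.
class TinyLoader
  def initialize(input)
    @input = input
  end

  def load
    # Dispatch on the file extension, like detect_format on the slides.
    format = File.extname(@input).delete_prefix(".")
    __send__("load_as_#{format}")
  end

  def load_as_csv
    "loaded #{@input} as CSV"
  end
end

# A separate module adds a new format by including itself into the
# loader, the way Red Parquet extends Arrow::TableLoader.
module ParquetLoadable
  def load_as_parquet
    "loaded #{@input} as Parquet"
  end

  TinyLoader.include(self)
end

TinyLoader.new("data.parquet").load # => "loaded data.parquet as Parquet"
```

Because dispatch is by method name, adding a format never requires touching the loader itself; requiring the plugin file is enough.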
29. .load: Extended
require "parquet"
table = Arrow::Table.load("data.parquet")
30. .load: More extensible
class Arrow::TableLoader
  def load
    if @input.is_a?(URI)
      __send__("load_from_uri")
    else
      __send__("load_from_file")
    end
  end
end
31. .load: Extended by Red Arrow Dataset
module ArrowDataset::ArrowTableLoadable
  def load_from_uri
  end

  Arrow::TableLoader.include(self)
end
Red Arrow Dataset is a subproject of Red Arrow
32. Location: Online storage
require "arrow-dataset"
url = URI("s3://bucket/path...")
table = Arrow::Table.load(url)
33. Location: RDBMS
require "arrow-activerecord"
User.all.to_arrow
Red Arrow Active Record
https://github.com/red-data-tools/red-arrow-activerecord
34. Location: Network
require "arrow-flight"
client = ArrowFlight::Client.new(url)
info = client.list_flights[0]
reader = client.do_get(info.endpoints[0].ticket)
table = reader.read_all
Introducing Apache Arrow Flight: A Framework for Fast Data Transport
https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/
35. Large data
Apache Arrow format
Designed for large data
Needed for large data:
Fast load
...
36. Fast load: Benchmark
require "csv"
require "datasets-arrow"

dataset = Datasets::PostalCodeJapan.new
table = dataset.to_arrow # 124271 records
n = 5
n.times do |i|
  table.save("codes.#{i}.csv")
  table.save("codes.#{i}.arrow")
  CSV.read("codes.#{i}.csv")
  Arrow::Table.load("codes.#{i}.csv")
  Arrow::Table.load("codes.#{i}.arrow")
  table = table.concatenate([table])
end
37. Fast load: Benchmark: All (chart)
38. Fast load: Benchmark: Red Arrow (chart)
39. How to implement fast load
See also: "Why is the Apache Arrow format fast?" (Apache Arrowフォーマットはなぜ速いのか, db tech showcase ONLINE 2020, 2020-12-08)
https://slide.rabbit-shocker.org/authors/kou/db-tech-showcase-online-2020/
40. Read data with Red Arrow: Wrap up
Easy to read common formats
Easy to read from common locations
Large data ready
41. Step 3: Explore data
Preprocess data
Filter out needless data
...
Summarize and visualize data
...
Red Arrow can be used for some of these operations
42. Filter: Red Arrow
table = Datasets::PostalCodeJapan.new.to_arrow
table.n_rows # 124271

filtered_table = table.slice do |slicer|
  slicer.prefecture == "東京都" # Tokyo
end
filtered_table.n_rows # 3887
43. Filter: Performance
dataset = Datasets::PostalCodeJapan.new
arrow_dataset = dataset.to_arrow

dataset.find_all do |row|
  row.prefecture == "東京都" # Tokyo
end # 1.256s

arrow_dataset.slice do |slicer|
  slicer.prefecture == "東京都" # Tokyo
end # 0.001s
44. Filter: Performance (chart)
45. Apache Arrow data: Interchangeable
At low cost thanks to fast load
Systems that can consume Apache Arrow data are increasing
e.g. DuckDB: an in-process SQL OLAP DBMS (a SQLite-like DBMS for OLAP)
OLAP: OnLine Analytical Processing
46. Filter: DuckDB
require "arrow-duckdb"
require "datasets-arrow"

codes = Datasets::PostalCodeJapan.new.to_arrow
db = DuckDB::Database.open
c = db.connect
c.register("codes", codes) do # Use codes without copying
  c.query("SELECT * FROM codes WHERE prefecture = ?",
          "東京都", # Tokyo
          output: :arrow) # Output as Apache Arrow data
   .to_table.n_rows # 3887
end
47. Summarize: Group + aggregation
iris = Datasets::Iris.new.to_arrow
iris.group(:label).count(:sepal_length)
# count(sepal_length) label
# 0 50 Iris-setosa
# 1 50 Iris-versicolor
# 2 50 Iris-virginica
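For comparison, the same group-and-count in plain Ruby on row-oriented data might look like the sketch below. The `rows` here are made-up sample rows, not the real Iris dataset; Red Arrow performs the equivalent operation on columnar data instead:

```ruby
# Plain-Ruby analogue of group(:label).count(:sepal_length)
# over a hypothetical array of row hashes.
rows = [
  {label: "Iris-setosa",    sepal_length: 5.1},
  {label: "Iris-setosa",    sepal_length: 4.9},
  {label: "Iris-virginica", sepal_length: 6.3},
]

counts = rows.group_by {|row| row[:label]}  # label => [rows...]
             .transform_values(&:size)     # label => count
# => {"Iris-setosa" => 2, "Iris-virginica" => 1}
```

The row-by-row version touches every field of every row; the columnar version only scans the columns involved, which is where the performance gap comes from.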
48. Visualize: Charty
require "charty"

Charty.backends.use("pyplot")
Charty.scatter_plot(data: iris,
                    x: :sepal_length,
                    y: :sepal_width,
                    color: :label)
      .save("iris.png")
49. Visualize: Charty: Result (chart)
50. Step 4: Use insights
Write reports
Build a model
...
No Red Arrow support in this area for now
Can still be used for passing data to other tools such as DuckDB and Charty
51. Data processing and Red Arrow
Red Arrow helps us in some areas
Collect, read and explore data
Some tools can integrate with Red Arrow
Fluentd, DuckDB, Charty, ...
52. Red Arrow and Ruby 3.0
MemoryView support
Ractor support
53. MemoryView
MemoryView provides a way to share multidimensional homogeneous arrays of fixed-size elements in memory among extension libraries.
https://docs.ruby-lang.org/en/master/doc/memory_view_md.html
https://tech.speee.jp/entry/2020/12/24/093131 (Japanese)
54. Numeric arrays in Red Arrow
Arrow::NumericArray family
1-dimensional numeric array
Arrow::Tensor
Multidimensional homogeneous numeric arrays
55. MemoryView: Red Arrow
Arrow::NumericArray family
Export as MemoryView: Supported
Import from MemoryView: Not yet
Arrow::Tensor
Export/Import: Not yet
Join Red Data Tools to work on this!
56. MemoryView: C++
Some problems were found by this work:
Can't use private as a member name
Can't assign to a const variable even with a cast
Ruby 3.1 will fix them
57. Ractor
Ractor is designed to provide parallel execution in Ruby without thread-safety concerns.
https://docs.ruby-lang.org/en/master/doc/ractor_md.html
https://techlife.cookpad.com/entry/2020/12/26/131858 (Japanese)
58. Red Arrow and concurrency
Red Arrow data are immutable
Ractor can share frozen objects
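The frozen-object sharing rule can be checked with `Ractor.shareable?` (Ruby 3.0+; Ractor is still experimental). This sketch uses plain Ruby objects rather than a Red Arrow table, so it is standalone:

```ruby
# Deeply frozen objects are shareable across Ractors; mutable ones are not.
s = "Tokyo"
Ractor.shareable?(s)        # => false (mutable string)
s.freeze
Ractor.shareable?(s)        # => true

# Ractor.make_shareable deep-freezes a whole structure at once,
# the same call the next slide applies to an Arrow::Table.
table_like = [{prefecture: "東京都"}]
Ractor.make_shareable(table_like)
Ractor.shareable?(table_like) # => true
```

Because Red Arrow's data structures are immutable, they can be made shareable this way and then read from multiple Ractors in parallel without copying.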
59. Ractor: Red Arrow
require "datasets-arrow"

table = Datasets::PostalCodeJapan.new.to_arrow
Ractor.make_shareable(table)
Ractor.new(table) do |t|
  t.slice do |slicer|
    slicer.prefecture == "東京都" # Tokyo
  end
end
60. Ractor: Red Arrow: Benchmark
n_ractors = 4
n_jobs_per_ractor = 1000
n_jobs = n_ractors * n_jobs_per_ractor

# Sequential baseline
n_jobs.times do
  table.slice {|s| s.prefecture == "東京都"}
end

# Parallel with Ractors
n_ractors.times.collect do
  Ractor.new(table, n_jobs_per_ractor) do |t, n|
    n.times {t.slice {|s| s.prefecture == "東京都"}}
  end
end.each(&:take)
61. Ractor: Red Arrow: Benchmark (chart)
62. Wrap up
Ruby can be used in some data processing work
Red Arrow helps you!
Ruby 3.0 has useful features for data processing work
Red Arrow is starting to support them
63. Goal of this talk
You want to use Ruby for some data processing
You join the Red Data Tools project
64. Future work
Implement DataFusion bindings by adding a C API to DataFusion
DataFusion: Apache Arrow native query execution framework written in Rust
https://github.com/apache/arrow-datafusion/
Add an Active Record-like API to Red Arrow
Improve MemoryView/Ractor support
65. Red Data Tools
Join us!
https://red-data-tools.github.io/
https://gitter.im/red-data-tools/en
https://red-data-tools.github.io/ja/
https://gitter.im/red-data-tools/ja
66. OSS Gate on-boarding
Supports OSS projects such as Ruby and Red Arrow in accepting newcomers
Contact me if you are:
An OSS project member who wants to accept newcomers
A company that wants to support OSS Gate on-boarding
https://oss-gate.github.io/on-boarding/
67. ClearCode Inc.
Recruitment: Developers to work on Red Arrow related business
https://www.clear-code.com/recruitment/
Business: Apache Arrow/Red Arrow related technical support and consulting
https://www.clear-code.com/contact/