SlideShare a Scribd company logo
1 of 36
Download to read offline
Sadayuki Furuhashi
Founder & Software Architect
Treasure Data, inc.
EmbulkAn open-source plugin-based parallel bulk data loader
that makes painful data integration work relaxed.
Sharing our knowledge on RubyGems to manage arbitrary files.
A little about me...
> Sadayuki Furuhashi
> github/twitter: @frsyuki
> Treasure Data, Inc.
> Founder & Software Architect
> Open-source hacker
> MessagePack - Efficient object serializer
> Fluentd - An unified data collection tool
> Prestogres - PostgreSQL protocol gateway for Presto
> Embulk - A plugin-based parallel bulk data loader
> ServerEngine - A Ruby framework to build multiprocess servers
> LS4 - A distributed object storage with cross-region replication
> kumofs - A distributed strong-consistent key-value data store
Today’s talk
> What’s Embulk?
> How Embulk works?
> The architecture
> Writing Embulk plugins
> Roadmap & Development
> Q&A + Discussion
What’s Embulk?
> An open-source parallel bulk data loader
> using plugins
> to make data integration relaxed.
What’s Embulk?
> An open-source parallel bulk data loader
> loads records from “A” to “B”
> using plugins
> for various kinds of “A” and “B”
> to make data integration relaxed.
> which was very painful…
Storage, RDBMS,
NoSQL, Cloud Service,
etc.
broken records,

transactions (idempotency),

performance, …
The pains of bulk data loading
Example: load a 10GB CSV file to PostgreSQL
> 1. First attempt → fails
> 2. Write a script to make the records cleaned
• Convert ”20150127T190500Z” → “2015-01-27 19:05:00 UTC”
• Convert “N" → “”
• many cleanings…
> 3. Second attempt → another error
• Convert “Inf” → “Infinity”
> 4. Fix the script, retry, retry, retry…
> 5. Oh, some data got loaded twice!?
The pains of bulk data loading
Example: load a 10GB CSV file to PostgreSQL
> 6. Ok, the script worked.
> 7. Register it to cron to sync data every day.
> 8. One day… it fails with another error
• Convert invalid UTF-8 byte sequence to U+FFFD
The pains of bulk data loading
Example: load 10GB CSV × 720 files
> Most of scripts are slow.
• People have little time to optimize bulk load scripts
> One file takes 1 hour → 720 files takes 1 month (!?)
A lot of integration efforts for each storages:
> XML, JSON, Apache log format (+some custom), …
> SAM, BED, BAI2, HDF5, TDE, SequenceFile, RCFile…
> MongoDB, Elasticsearch, Redshift, Salesforce, …
The problems:
> Data cleaning (normalization)
> How to normalize broken records?
> Error handling
> How to remove broken records?
> Idempotent retrying
> How to retry without duplicated loading?
> Performance optimization
> How to optimize the code or parallelize?
The problems at Treasure Data
Treasure Data Service?
> “Fast, powerful SQL access to big data from connected
applications and products, with no new infrastructure or
special skills required.”
> Customers want to try Treasure Data, but
> SEs write scripts to bulk load their data. Hard work :(
> Customers want to migrate their big data, but
> Hard work :(
> Fluentd solved streaming data collection, but
> bulk data loading is another problem.
A solution:
> Package the efforts as a plugin.
> data cleaning, error handling, retrying
> Share & reuse the plugin.
> don’t repeat the pains!
> Keep improving the plugin code.
> rather than throwing away the efforts every time
> using OSS-style pull-reqs & frequent releases.
Embulk
Embulk is an open-source, plugin-based
parallel bulk data loader

that makes data integration works relaxed.
HDFS
MySQL
Amazon S3
Embulk
CSV Files
SequenceFile
Salesforce.com
Elasticsearch
Cassandra
Hive
Redis
HDFS
MySQL
Amazon S3
Embulk
CSV Files
SequenceFile
Salesforce.com
Elasticsearch
Cassandra
Hive
Redis
✓ Parallel execution
✓ Data validation
✓ Error recovery
✓ Deterministic behavior
✓ Idempotet retrying
bulk load
HDFS
MySQL
Amazon S3
Embulk
CSV Files
SequenceFile
Salesforce.com
Elasticsearch
Cassandra
Hive
Redis
✓ Parallel execution
✓ Data validation
✓ Error recovery
✓ Deterministic behavior
✓ Idempotet retrying
Plugins Plugins
bulk load
How Embulk works?
# install
$ wget https://bintray.com/artifact/download/
embulk/maven/embulk-0.2.0.jar -o embulk.jar
$ chmod 755 embulk.jar
Installing embulk
Bintray

releases
Embulk is released on Bintray
wget embulk.jar
# install
$ wget https://bintray.com/artifact/download/
embulk/maven/embulk-0.2.0.jar -o embulk.jar
$ chmod 755 embulk.jar

# guess
$ vi partial-config.yml
$ ./embulk guess partial-config.yml

-o config.yml
Guess format & schema in:
type: file
paths: [data/examples/]
out:

type: example
in:
type: file
paths: [data/examples/]
decoders:
- {type: gzip}
parser:
charset: UTF-8
newline: CRLF
type: csv
delimiter: ','
quote: '"'
header_line: true
columns:
- name: time

type: timestamp

format: '%Y-%m-%d %H:%M:%S'
- name: account

type: long
- name: purchase

type: timestamp

format: '%Y%m%d'
- name: comment

type: string
out:

type: example
guess
by guess plugins
# install
$ wget https://bintray.com/artifact/download/
embulk/maven/embulk-0.2.0.jar -o embulk.jar
$ chmod 755 embulk.jar

# guess
$ vi partial-config.yml
$ ./embulk guess partial-config.yml

-o config.yml

# preview
$ ./embulk preview config.yml
$ vi config.yml # if necessary
+--------------------------------------+---------------+--------------------+
| time:timestamp | uid:long | word:string |
+--------------------------------------+---------------+--------------------+
| 2015-01-27 19:23:49 UTC | 32,864 | embulk |
| 2015-01-27 19:01:23 UTC | 14,824 | jruby |
| 2015-01-28 02:20:02 UTC | 27,559 | plugin |
| 2015-01-29 11:54:36 UTC | 11,270 | fluentd |
+--------------------------------------+---------------+--------------------+
Preview & fix config
# install
$ wget https://bintray.com/artifact/download/
embulk/maven/embulk-0.2.0.jar -o embulk.jar
$ chmod 755 embulk.jar

# guess
$ vi partial-config.yml
$ ./embulk guess partial-config.yml

-o config.yml

# preview
$ ./embulk preview config.yml
$ vi config.yml # if necessary
# run
$ ./embulk run config.yml -o config.yml
in:
type: file
paths: [data/examples/]
decoders:
- {type: gzip}
parser:
charset: UTF-8
newline: CRLF
type: csv
delimiter: ','
quote: '"'
header_line: true
columns:
- name: time

type: timestamp

format: '%Y-%m-%d %H:%M:%S'
- name: account

type: long
- name: purchase

type: timestamp

format: '%Y%m%d'
- name: comment

type: string
last_paths: [data/examples/sample_001.csv.gz]
out:

type: example
Deterministic run
in:
type: file
paths: [data/examples/]
decoders:
- {type: gzip}
parser:
charset: UTF-8
newline: CRLF
type: csv
delimiter: ','
quote: '"'
header_line: true
columns:
- name: time

type: timestamp

format: '%Y-%m-%d %H:%M:%S'
- name: account

type: long
- name: purchase

type: timestamp

format: '%Y%m%d'
- name: comment

type: string
last_paths: [data/examples/sample_002.csv.gz]
out:

type: example
Repeat
# install
$ wget https://bintray.com/artifact/download/
embulk/maven/embulk-0.2.0.jar -o embulk.jar
$ chmod 755 embulk.jar

# guess
$ vi partial-config.yml
$ ./embulk guess partial-config.yml

-o config.yml

# preview
$ ./embulk preview config.yml
$ vi config.yml # if necessary
# run
$ ./embulk run config.yml -o config.yml
# repeat
$ ./embulk run config.yml -o config.yml
$ ./embulk run config.yml -o config.yml
The architecture
InputPlugin OutputPlugin
Embulk
executor plugin
read records write records
InputPlugin OutputPlugin
Embulk
executor plugin
MySQL, Cassandra,
HBase, Elasticsearch,

Treasure Data, …
record
record
InputPlugin
FileInputPlugin
OutputPlugin
FileOutputPlugin
EncoderPlugin
FormatterPlugin
DecoderPlugin
ParserPlugin
Embulk
executor plugin
read files
decompress
parse files
into records
write files
compress
format records
into files
InputPlugin
FileInputPlugin
OutputPlugin
FileOutputPlugin
EncoderPlugin
FormatterPlugin
DecoderPlugin
ParserPlugin
Embulk
executor plugin
HDFS, S3,

Riak CS, …
gzip, bzip2,

3des, …
CSV, JSON,

RCFile, …
buffer
buffer
record
record
buffer
buffer
Writing Embulk plugins
InputPlugin
module Embulk
class InputExample < InputPlugin
Plugin.register_input('example', self)
def self.transaction(config, &control)
# read config
task = {
'message' =>
config.param('message', :string, default: nil)
}
threads = config.param('threads', :int, default:
2)
columns = [
Column.new(0, 'col0', :long),
Column.new(1, 'col1', :double),
Column.new(2, 'col2', :string),
]
# BEGIN here
commit_reports = yield(task, columns, threads)
# COMMIT here
puts "Example input finished"
return {}
end
def run(task, schema, index, page_builder)
puts "Example input thread #{@index}…"
10.times do |i|
@page_builder.add([i, 10.0, "example"])
end
@page_builder.finish
commit_report = { }
return commit_report
end
end
end
OutputPlugin
module Embulk
class OutputExample < OutputPlugin
Plugin.register_output('example', self)
def self.transaction(
config, schema,
processor_count, &control)
# read config
task = {
'message' =>
config.param('message', :string, default: "record")
}
puts "Example output started."
commit_reports = yield(task)
puts "Example output finished. Commit
reports = #{commit_reports.to_json}"
return {}
end
def initialize(task, schema, index)
puts "Example output thread #{index}..."
super
@message = task.prop('message', :string)
@records = 0
end
def add(page)
page.each do |record|
hash = Hash[schema.names.zip(record)]
puts "#{@message}: #{hash.to_json}"
@records += 1
end
end
def finish
end
def abort
end
def commit
commit_report = {
"records" => @records
}
return commit_report
end
end
end
GuessPlugin
# guess_gzip.rb
module Embulk
class GzipGuess < GuessPlugin
Plugin.register_guess('gzip', self)
GZIP_HEADER = "x1f
x8b".force_encoding('ASCII-8BIT').freeze
def guess(config, sample_buffer)
if sample_buffer[0,2] == GZIP_HEADER
return {"decoders" => [{"type" => "gzip"}]}
end
return {}
end
end
end
# guess_
module Embulk
class GuessNewline < TextGuessPlugin
Plugin.register_guess('newline', self)
def guess_text(config, sample_text)
cr_count = sample_text.count("r")
lf_count = sample_text.count("n")
crlf_count = sample_text.scan(/rn/).length
if crlf_count > cr_count / 2 && crlf_count >
lf_count / 2
return {"parser" => {"newline" => "CRLF"}}
elsif cr_count > lf_count / 2
return {"parser" => {"newline" => "CR"}}
else
return {"parser" => {"newline" => "LF"}}
end
end
end
end
Releasing to RubyGems
Examples
> embulk-plugin-postgres-json.gem
> https://github.com/frsyuki/embulk-plugin-postgres-json
> embulk-plugin-redis.gem
> https://github.com/komamitsu/embulk-plugin-redis
> embulk-plugin-input-sfdc-event-log-files.gem
> https://github.com/nahi/embulk-plugin-input-sfdc-event-
log-files
Roadmap & Development
Roadmap
> Add missing JRuby Plugin APIs
> ParserPlugin, FormatterPlugin
> DecoderPlugin, EncoderPlugin
> Add Executor plugin SPI
> Add ssh distributed executor
> embulk run —command ssh %host embulk run %task
> Add MapReduce executor
> Add support for nested records (?)
Contributing to the Embulk project
> Pull-requests & issues on Github
> Posting blogs
> “I tried Embulk. Here is how it worked”
> “I read Embulk code. Here is how it’s written”
> “Embulk is good because…but bad because…”
> Talking on Twitter with a word “embulk"
> Writing & releasing plugins
> Windows support
> Integration to other software
> ETL tools, Fluentd, Hadoop, Presto, …
Q&A + Discussion?
Hiroshi Nakamura
@nahi
Muga Nishizawa
@muga_nishizawa
Sadayuki Furuhashi
@frsyuki
Embulk committers:
https://jobs.lever.co/treasure-data
Cloud service for the entire data pipeline.
We’re hiring!

More Related Content

What's hot

最近のストリーム処理事情振り返り
最近のストリーム処理事情振り返り最近のストリーム処理事情振り返り
最近のストリーム処理事情振り返りSotaro Kimura
 
Hadoop -NameNode HAの仕組み-
Hadoop -NameNode HAの仕組み-Hadoop -NameNode HAの仕組み-
Hadoop -NameNode HAの仕組み-Yuki Gonda
 
Cassandra serving netflix @ scale
Cassandra serving netflix @ scaleCassandra serving netflix @ scale
Cassandra serving netflix @ scaleVinay Kumar Chella
 
Where狙いのキー、order by狙いのキー
Where狙いのキー、order by狙いのキーWhere狙いのキー、order by狙いのキー
Where狙いのキー、order by狙いのキーyoku0825
 
OLAP for Big Data (Druid vs Apache Kylin vs Apache Lens)
OLAP for Big Data (Druid vs Apache Kylin vs Apache Lens)OLAP for Big Data (Druid vs Apache Kylin vs Apache Lens)
OLAP for Big Data (Druid vs Apache Kylin vs Apache Lens)SANG WON PARK
 
[Cloud OnAir] GCP 上でストリーミングデータ処理基盤を構築してみよう! 2018年9月13日 放送
[Cloud OnAir] GCP 上でストリーミングデータ処理基盤を構築してみよう! 2018年9月13日 放送[Cloud OnAir] GCP 上でストリーミングデータ処理基盤を構築してみよう! 2018年9月13日 放送
[Cloud OnAir] GCP 上でストリーミングデータ処理基盤を構築してみよう! 2018年9月13日 放送Google Cloud Platform - Japan
 
Cassandraとh baseの比較して入門するno sql
Cassandraとh baseの比較して入門するno sqlCassandraとh baseの比較して入門するno sql
Cassandraとh baseの比較して入門するno sqlYutuki r
 
Data platformdesign
Data platformdesignData platformdesign
Data platformdesignRyoma Nagata
 
SolrとElasticsearchを比べてみよう
SolrとElasticsearchを比べてみようSolrとElasticsearchを比べてみよう
SolrとElasticsearchを比べてみようShinsuke Sugaya
 
ストリーム処理プラットフォームにおけるKafka導入事例 #kafkajp
ストリーム処理プラットフォームにおけるKafka導入事例 #kafkajpストリーム処理プラットフォームにおけるKafka導入事例 #kafkajp
ストリーム処理プラットフォームにおけるKafka導入事例 #kafkajpYahoo!デベロッパーネットワーク
 
大量のデータ処理や分析に使えるOSS Apache Spark入門(Open Source Conference 2021 Online/Kyoto 発表資料)
大量のデータ処理や分析に使えるOSS Apache Spark入門(Open Source Conference 2021 Online/Kyoto 発表資料)大量のデータ処理や分析に使えるOSS Apache Spark入門(Open Source Conference 2021 Online/Kyoto 発表資料)
大量のデータ処理や分析に使えるOSS Apache Spark入門(Open Source Conference 2021 Online/Kyoto 発表資料)NTT DATA Technology & Innovation
 
SQL大量発行処理をいかにして高速化するか
SQL大量発行処理をいかにして高速化するかSQL大量発行処理をいかにして高速化するか
SQL大量発行処理をいかにして高速化するかShogo Wakayama
 
ストリーム処理を支えるキューイングシステムの選び方
ストリーム処理を支えるキューイングシステムの選び方ストリーム処理を支えるキューイングシステムの選び方
ストリーム処理を支えるキューイングシステムの選び方Yoshiyasu SAEKI
 
超実践 Cloud Spanner 設計講座
超実践 Cloud Spanner 設計講座超実践 Cloud Spanner 設計講座
超実践 Cloud Spanner 設計講座Samir Hammoudi
 
kubernetes初心者がKnative Lambda Runtime触ってみた(Kubernetes Novice Tokyo #13 発表資料)
kubernetes初心者がKnative Lambda Runtime触ってみた(Kubernetes Novice Tokyo #13 発表資料)kubernetes初心者がKnative Lambda Runtime触ってみた(Kubernetes Novice Tokyo #13 発表資料)
kubernetes初心者がKnative Lambda Runtime触ってみた(Kubernetes Novice Tokyo #13 発表資料)NTT DATA Technology & Innovation
 
[215]네이버콘텐츠통계서비스소개 김기영
[215]네이버콘텐츠통계서비스소개 김기영[215]네이버콘텐츠통계서비스소개 김기영
[215]네이버콘텐츠통계서비스소개 김기영NAVER D2
 
基本に戻ってInnoDBの話をします
基本に戻ってInnoDBの話をします基本に戻ってInnoDBの話をします
基本に戻ってInnoDBの話をしますyoku0825
 

What's hot (20)

最近のストリーム処理事情振り返り
最近のストリーム処理事情振り返り最近のストリーム処理事情振り返り
最近のストリーム処理事情振り返り
 
Hadoop -NameNode HAの仕組み-
Hadoop -NameNode HAの仕組み-Hadoop -NameNode HAの仕組み-
Hadoop -NameNode HAの仕組み-
 
PayPayでのk8s活用事例
PayPayでのk8s活用事例PayPayでのk8s活用事例
PayPayでのk8s活用事例
 
Cassandra serving netflix @ scale
Cassandra serving netflix @ scaleCassandra serving netflix @ scale
Cassandra serving netflix @ scale
 
Where狙いのキー、order by狙いのキー
Where狙いのキー、order by狙いのキーWhere狙いのキー、order by狙いのキー
Where狙いのキー、order by狙いのキー
 
OLAP for Big Data (Druid vs Apache Kylin vs Apache Lens)
OLAP for Big Data (Druid vs Apache Kylin vs Apache Lens)OLAP for Big Data (Druid vs Apache Kylin vs Apache Lens)
OLAP for Big Data (Druid vs Apache Kylin vs Apache Lens)
 
[Cloud OnAir] GCP 上でストリーミングデータ処理基盤を構築してみよう! 2018年9月13日 放送
[Cloud OnAir] GCP 上でストリーミングデータ処理基盤を構築してみよう! 2018年9月13日 放送[Cloud OnAir] GCP 上でストリーミングデータ処理基盤を構築してみよう! 2018年9月13日 放送
[Cloud OnAir] GCP 上でストリーミングデータ処理基盤を構築してみよう! 2018年9月13日 放送
 
Cassandraとh baseの比較して入門するno sql
Cassandraとh baseの比較して入門するno sqlCassandraとh baseの比較して入門するno sql
Cassandraとh baseの比較して入門するno sql
 
Data platformdesign
Data platformdesignData platformdesign
Data platformdesign
 
SolrとElasticsearchを比べてみよう
SolrとElasticsearchを比べてみようSolrとElasticsearchを比べてみよう
SolrとElasticsearchを比べてみよう
 
ストリーム処理プラットフォームにおけるKafka導入事例 #kafkajp
ストリーム処理プラットフォームにおけるKafka導入事例 #kafkajpストリーム処理プラットフォームにおけるKafka導入事例 #kafkajp
ストリーム処理プラットフォームにおけるKafka導入事例 #kafkajp
 
大量のデータ処理や分析に使えるOSS Apache Spark入門(Open Source Conference 2021 Online/Kyoto 発表資料)
大量のデータ処理や分析に使えるOSS Apache Spark入門(Open Source Conference 2021 Online/Kyoto 発表資料)大量のデータ処理や分析に使えるOSS Apache Spark入門(Open Source Conference 2021 Online/Kyoto 発表資料)
大量のデータ処理や分析に使えるOSS Apache Spark入門(Open Source Conference 2021 Online/Kyoto 発表資料)
 
Vacuum徹底解説
Vacuum徹底解説Vacuum徹底解説
Vacuum徹底解説
 
SQL大量発行処理をいかにして高速化するか
SQL大量発行処理をいかにして高速化するかSQL大量発行処理をいかにして高速化するか
SQL大量発行処理をいかにして高速化するか
 
ストリーム処理を支えるキューイングシステムの選び方
ストリーム処理を支えるキューイングシステムの選び方ストリーム処理を支えるキューイングシステムの選び方
ストリーム処理を支えるキューイングシステムの選び方
 
超実践 Cloud Spanner 設計講座
超実践 Cloud Spanner 設計講座超実践 Cloud Spanner 設計講座
超実践 Cloud Spanner 設計講座
 
kubernetes初心者がKnative Lambda Runtime触ってみた(Kubernetes Novice Tokyo #13 発表資料)
kubernetes初心者がKnative Lambda Runtime触ってみた(Kubernetes Novice Tokyo #13 発表資料)kubernetes初心者がKnative Lambda Runtime触ってみた(Kubernetes Novice Tokyo #13 発表資料)
kubernetes初心者がKnative Lambda Runtime触ってみた(Kubernetes Novice Tokyo #13 発表資料)
 
Apache Hadoopの新機能Ozoneの現状
Apache Hadoopの新機能Ozoneの現状Apache Hadoopの新機能Ozoneの現状
Apache Hadoopの新機能Ozoneの現状
 
[215]네이버콘텐츠통계서비스소개 김기영
[215]네이버콘텐츠통계서비스소개 김기영[215]네이버콘텐츠통계서비스소개 김기영
[215]네이버콘텐츠통계서비스소개 김기영
 
基本に戻ってInnoDBの話をします
基本に戻ってInnoDBの話をします基本に戻ってInnoDBの話をします
基本に戻ってInnoDBの話をします
 

Viewers also liked

【Unite 2017 Tokyo】「黒騎士と白の魔王」にみるC#で統一したサーバー/クライアント開発と現実的なUniRx使いこなし術
【Unite 2017 Tokyo】「黒騎士と白の魔王」にみるC#で統一したサーバー/クライアント開発と現実的なUniRx使いこなし術【Unite 2017 Tokyo】「黒騎士と白の魔王」にみるC#で統一したサーバー/クライアント開発と現実的なUniRx使いこなし術
【Unite 2017 Tokyo】「黒騎士と白の魔王」にみるC#で統一したサーバー/クライアント開発と現実的なUniRx使いこなし術Unity Technologies Japan K.K.
 
ZeroFormatterに見るC#で最速のシリアライザを作成する100億の方法
ZeroFormatterに見るC#で最速のシリアライザを作成する100億の方法ZeroFormatterに見るC#で最速のシリアライザを作成する100億の方法
ZeroFormatterに見るC#で最速のシリアライザを作成する100億の方法Yoshifumi Kawai
 
RuntimeUnitTestToolkit for Unity
RuntimeUnitTestToolkit for UnityRuntimeUnitTestToolkit for Unity
RuntimeUnitTestToolkit for UnityYoshifumi Kawai
 
NextGen Server/Client Architecture - gRPC + Unity + C#
NextGen Server/Client Architecture - gRPC + Unity + C#NextGen Server/Client Architecture - gRPC + Unity + C#
NextGen Server/Client Architecture - gRPC + Unity + C#Yoshifumi Kawai
 
「黒騎士と白の魔王」gRPCによるHTTP/2 - API, Streamingの実践
「黒騎士と白の魔王」gRPCによるHTTP/2 - API, Streamingの実践「黒騎士と白の魔王」gRPCによるHTTP/2 - API, Streamingの実践
「黒騎士と白の魔王」gRPCによるHTTP/2 - API, Streamingの実践Yoshifumi Kawai
 
Fluentd - road to v1 -
Fluentd - road to v1 -Fluentd - road to v1 -
Fluentd - road to v1 -N Masahiro
 
H2O - making HTTP better
H2O - making HTTP betterH2O - making HTTP better
H2O - making HTTP betterKazuho Oku
 
ガチ(?)対決!OSSのジョブ管理ツール
ガチ(?)対決!OSSのジョブ管理ツールガチ(?)対決!OSSのジョブ管理ツール
ガチ(?)対決!OSSのジョブ管理ツール賢 秋穂
 
A Framework for LightUp Applications of Grani
A Framework for LightUp Applications of GraniA Framework for LightUp Applications of Grani
A Framework for LightUp Applications of GraniYoshifumi Kawai
 
Fluentd and Embulk Game Server 4
Fluentd and Embulk Game Server 4Fluentd and Embulk Game Server 4
Fluentd and Embulk Game Server 4N Masahiro
 
EmbulkとDigdagとデータ分析基盤と
EmbulkとDigdagとデータ分析基盤とEmbulkとDigdagとデータ分析基盤と
EmbulkとDigdagとデータ分析基盤とToru Takahashi
 
How to Make Own Framework built on OWIN
How to Make Own Framework built on OWINHow to Make Own Framework built on OWIN
How to Make Own Framework built on OWINYoshifumi Kawai
 
async/await不要論
async/await不要論async/await不要論
async/await不要論bleis tift
 
AWS + Windows(C#)で構築する.NET最先端技術によるハイパフォーマンスウェブアプリケーション開発実践
AWS + Windows(C#)で構築する.NET最先端技術によるハイパフォーマンスウェブアプリケーション開発実践AWS + Windows(C#)で構築する.NET最先端技術によるハイパフォーマンスウェブアプリケーション開発実践
AWS + Windows(C#)で構築する.NET最先端技術によるハイパフォーマンスウェブアプリケーション開発実践Yoshifumi Kawai
 
The History of Reactive Extensions
The History of Reactive ExtensionsThe History of Reactive Extensions
The History of Reactive ExtensionsYoshifumi Kawai
 
Reactive Programming by UniRx for Asynchronous & Event Processing
Reactive Programming by UniRx for Asynchronous & Event ProcessingReactive Programming by UniRx for Asynchronous & Event Processing
Reactive Programming by UniRx for Asynchronous & Event ProcessingYoshifumi Kawai
 
UniRx - Reactive Extensions for Unity
UniRx - Reactive Extensions for UnityUniRx - Reactive Extensions for Unity
UniRx - Reactive Extensions for UnityYoshifumi Kawai
 
HttpClient詳解、或いは非同期の落とし穴について
HttpClient詳解、或いは非同期の落とし穴についてHttpClient詳解、或いは非同期の落とし穴について
HttpClient詳解、或いは非同期の落とし穴についてYoshifumi Kawai
 
Metaprogramming Universe in C# - 実例に見るILからRoslynまでの活用例
Metaprogramming Universe in C# - 実例に見るILからRoslynまでの活用例Metaprogramming Universe in C# - 実例に見るILからRoslynまでの活用例
Metaprogramming Universe in C# - 実例に見るILからRoslynまでの活用例Yoshifumi Kawai
 

Viewers also liked (19)

【Unite 2017 Tokyo】「黒騎士と白の魔王」にみるC#で統一したサーバー/クライアント開発と現実的なUniRx使いこなし術
【Unite 2017 Tokyo】「黒騎士と白の魔王」にみるC#で統一したサーバー/クライアント開発と現実的なUniRx使いこなし術【Unite 2017 Tokyo】「黒騎士と白の魔王」にみるC#で統一したサーバー/クライアント開発と現実的なUniRx使いこなし術
【Unite 2017 Tokyo】「黒騎士と白の魔王」にみるC#で統一したサーバー/クライアント開発と現実的なUniRx使いこなし術
 
ZeroFormatterに見るC#で最速のシリアライザを作成する100億の方法
ZeroFormatterに見るC#で最速のシリアライザを作成する100億の方法ZeroFormatterに見るC#で最速のシリアライザを作成する100億の方法
ZeroFormatterに見るC#で最速のシリアライザを作成する100億の方法
 
RuntimeUnitTestToolkit for Unity
RuntimeUnitTestToolkit for UnityRuntimeUnitTestToolkit for Unity
RuntimeUnitTestToolkit for Unity
 
NextGen Server/Client Architecture - gRPC + Unity + C#
NextGen Server/Client Architecture - gRPC + Unity + C#NextGen Server/Client Architecture - gRPC + Unity + C#
NextGen Server/Client Architecture - gRPC + Unity + C#
 
「黒騎士と白の魔王」gRPCによるHTTP/2 - API, Streamingの実践
「黒騎士と白の魔王」gRPCによるHTTP/2 - API, Streamingの実践「黒騎士と白の魔王」gRPCによるHTTP/2 - API, Streamingの実践
「黒騎士と白の魔王」gRPCによるHTTP/2 - API, Streamingの実践
 
Fluentd - road to v1 -
Fluentd - road to v1 -Fluentd - road to v1 -
Fluentd - road to v1 -
 
H2O - making HTTP better
H2O - making HTTP betterH2O - making HTTP better
H2O - making HTTP better
 
ガチ(?)対決!OSSのジョブ管理ツール
ガチ(?)対決!OSSのジョブ管理ツールガチ(?)対決!OSSのジョブ管理ツール
ガチ(?)対決!OSSのジョブ管理ツール
 
A Framework for LightUp Applications of Grani
A Framework for LightUp Applications of GraniA Framework for LightUp Applications of Grani
A Framework for LightUp Applications of Grani
 
Fluentd and Embulk Game Server 4
Fluentd and Embulk Game Server 4Fluentd and Embulk Game Server 4
Fluentd and Embulk Game Server 4
 
EmbulkとDigdagとデータ分析基盤と
EmbulkとDigdagとデータ分析基盤とEmbulkとDigdagとデータ分析基盤と
EmbulkとDigdagとデータ分析基盤と
 
How to Make Own Framework built on OWIN
How to Make Own Framework built on OWINHow to Make Own Framework built on OWIN
How to Make Own Framework built on OWIN
 
async/await不要論
async/await不要論async/await不要論
async/await不要論
 
AWS + Windows(C#)で構築する.NET最先端技術によるハイパフォーマンスウェブアプリケーション開発実践
AWS + Windows(C#)で構築する.NET最先端技術によるハイパフォーマンスウェブアプリケーション開発実践AWS + Windows(C#)で構築する.NET最先端技術によるハイパフォーマンスウェブアプリケーション開発実践
AWS + Windows(C#)で構築する.NET最先端技術によるハイパフォーマンスウェブアプリケーション開発実践
 
The History of Reactive Extensions
The History of Reactive ExtensionsThe History of Reactive Extensions
The History of Reactive Extensions
 
Reactive Programming by UniRx for Asynchronous & Event Processing
Reactive Programming by UniRx for Asynchronous & Event ProcessingReactive Programming by UniRx for Asynchronous & Event Processing
Reactive Programming by UniRx for Asynchronous & Event Processing
 
UniRx - Reactive Extensions for Unity
UniRx - Reactive Extensions for UnityUniRx - Reactive Extensions for Unity
UniRx - Reactive Extensions for Unity
 
HttpClient詳解、或いは非同期の落とし穴について
HttpClient詳解、或いは非同期の落とし穴についてHttpClient詳解、或いは非同期の落とし穴について
HttpClient詳解、或いは非同期の落とし穴について
 
Metaprogramming Universe in C# - 実例に見るILからRoslynまでの活用例
Metaprogramming Universe in C# - 実例に見るILからRoslynまでの活用例Metaprogramming Universe in C# - 実例に見るILからRoslynまでの活用例
Metaprogramming Universe in C# - 実例に見るILからRoslynまでの活用例
 

Similar to Embulk, an open-source plugin-based parallel bulk data loader

Fighting Against Chaotically Separated Values with Embulk
Fighting Against Chaotically Separated Values with EmbulkFighting Against Chaotically Separated Values with Embulk
Fighting Against Chaotically Separated Values with EmbulkSadayuki Furuhashi
 
Treasure Data and OSS
Treasure Data and OSSTreasure Data and OSS
Treasure Data and OSSN Masahiro
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09Chris Purrington
 
Plugin-based software design with Ruby and RubyGems
Plugin-based software design with Ruby and RubyGemsPlugin-based software design with Ruby and RubyGems
Plugin-based software design with Ruby and RubyGemsSadayuki Furuhashi
 
Play framework productivity formula
Play framework   productivity formula Play framework   productivity formula
Play framework productivity formula Sorin Chiprian
 
Site Performance - From Pinto to Ferrari
Site Performance - From Pinto to FerrariSite Performance - From Pinto to Ferrari
Site Performance - From Pinto to FerrariJoseph Scott
 
PuppetDB: Sneaking Clojure into Operations
PuppetDB: Sneaking Clojure into OperationsPuppetDB: Sneaking Clojure into Operations
PuppetDB: Sneaking Clojure into Operationsgrim_radical
 
SXSW 2012 JavaScript MythBusters
SXSW 2012 JavaScript MythBustersSXSW 2012 JavaScript MythBusters
SXSW 2012 JavaScript MythBustersElena-Oana Tabaranu
 
Digdagによる大規模データ処理の自動化とエラー処理
Digdagによる大規模データ処理の自動化とエラー処理Digdagによる大規模データ処理の自動化とエラー処理
Digdagによる大規模データ処理の自動化とエラー処理Sadayuki Furuhashi
 
12 core technologies you should learn, love, and hate to be a 'real' technocrat
12 core technologies you should learn, love, and hate to be a 'real' technocrat12 core technologies you should learn, love, and hate to be a 'real' technocrat
12 core technologies you should learn, love, and hate to be a 'real' technocratJonathan Linowes
 
Clug 2011 March web server optimisation
Clug 2011 March  web server optimisationClug 2011 March  web server optimisation
Clug 2011 March web server optimisationgrooverdan
 
12 core technologies you should learn, love, and hate to be a 'real' technocrat
12 core technologies you should learn, love, and hate to be a 'real' technocrat12 core technologies you should learn, love, and hate to be a 'real' technocrat
12 core technologies you should learn, love, and hate to be a 'real' technocratlinoj
 
Caching and tuning fun for high scalability
Caching and tuning fun for high scalabilityCaching and tuning fun for high scalability
Caching and tuning fun for high scalabilityWim Godden
 
Drupal Performance : DrupalCamp North
Drupal Performance : DrupalCamp NorthDrupal Performance : DrupalCamp North
Drupal Performance : DrupalCamp NorthPhilip Norton
 
High concurrency,
Low latency analytics
using Spark/Kudu
 High concurrency,
Low latency analytics
using Spark/Kudu High concurrency,
Low latency analytics
using Spark/Kudu
High concurrency,
Low latency analytics
using Spark/KuduChris George
 
Profiling PHP with Xdebug / Webgrind
Profiling PHP with Xdebug / WebgrindProfiling PHP with Xdebug / Webgrind
Profiling PHP with Xdebug / WebgrindSam Keen
 
Symfony finally swiped right on envvars
Symfony finally swiped right on envvarsSymfony finally swiped right on envvars
Symfony finally swiped right on envvarsSam Marley-Jarrett
 
Caching and tuning fun for high scalability
Caching and tuning fun for high scalabilityCaching and tuning fun for high scalability
Caching and tuning fun for high scalabilityWim Godden
 
The Future is Now: Leveraging the Cloud with Ruby
The Future is Now: Leveraging the Cloud with RubyThe Future is Now: Leveraging the Cloud with Ruby
The Future is Now: Leveraging the Cloud with RubyRobert Dempsey
 

Similar to Embulk, an open-source plugin-based parallel bulk data loader (20)

Fighting Against Chaotically Separated Values with Embulk
Fighting Against Chaotically Separated Values with EmbulkFighting Against Chaotically Separated Values with Embulk
Fighting Against Chaotically Separated Values with Embulk
 
Treasure Data and OSS
Treasure Data and OSSTreasure Data and OSS
Treasure Data and OSS
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
 
Plugin-based software design with Ruby and RubyGems
Plugin-based software design with Ruby and RubyGemsPlugin-based software design with Ruby and RubyGems
Plugin-based software design with Ruby and RubyGems
 
Play framework productivity formula
Play framework   productivity formula Play framework   productivity formula
Play framework productivity formula
 
Site Performance - From Pinto to Ferrari
Site Performance - From Pinto to FerrariSite Performance - From Pinto to Ferrari
Site Performance - From Pinto to Ferrari
 
PuppetDB: Sneaking Clojure into Operations
PuppetDB: Sneaking Clojure into OperationsPuppetDB: Sneaking Clojure into Operations
PuppetDB: Sneaking Clojure into Operations
 
SXSW 2012 JavaScript MythBusters
SXSW 2012 JavaScript MythBustersSXSW 2012 JavaScript MythBusters
SXSW 2012 JavaScript MythBusters
 
Digdagによる大規模データ処理の自動化とエラー処理
Digdagによる大規模データ処理の自動化とエラー処理Digdagによる大規模データ処理の自動化とエラー処理
Digdagによる大規模データ処理の自動化とエラー処理
 
12 core technologies you should learn, love, and hate to be a 'real' technocrat
12 core technologies you should learn, love, and hate to be a 'real' technocrat12 core technologies you should learn, love, and hate to be a 'real' technocrat
12 core technologies you should learn, love, and hate to be a 'real' technocrat
 
Clug 2011 March web server optimisation
Clug 2011 March  web server optimisationClug 2011 March  web server optimisation
Clug 2011 March web server optimisation
 
Scaling PHP apps
Scaling PHP appsScaling PHP apps
Scaling PHP apps
 
12 core technologies you should learn, love, and hate to be a 'real' technocrat
12 core technologies you should learn, love, and hate to be a 'real' technocrat12 core technologies you should learn, love, and hate to be a 'real' technocrat
12 core technologies you should learn, love, and hate to be a 'real' technocrat
 
Caching and tuning fun for high scalability
Caching and tuning fun for high scalabilityCaching and tuning fun for high scalability
Caching and tuning fun for high scalability
 
Drupal Performance : DrupalCamp North
Drupal Performance : DrupalCamp NorthDrupal Performance : DrupalCamp North
Drupal Performance : DrupalCamp North
 
High concurrency,
Low latency analytics
using Spark/Kudu
 High concurrency,
Low latency analytics
using Spark/Kudu High concurrency,
Low latency analytics
using Spark/Kudu
High concurrency,
Low latency analytics
using Spark/Kudu
 
Profiling PHP with Xdebug / Webgrind
Profiling PHP with Xdebug / WebgrindProfiling PHP with Xdebug / Webgrind
Profiling PHP with Xdebug / Webgrind
 
Symfony finally swiped right on envvars
Symfony finally swiped right on envvarsSymfony finally swiped right on envvars
Symfony finally swiped right on envvars
 
Caching and tuning fun for high scalability
Caching and tuning fun for high scalabilityCaching and tuning fun for high scalability
Caching and tuning fun for high scalability
 
The Future is Now: Leveraging the Cloud with Ruby
The Future is Now: Leveraging the Cloud with RubyThe Future is Now: Leveraging the Cloud with Ruby
The Future is Now: Leveraging the Cloud with Ruby
 

More from Sadayuki Furuhashi

Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019
Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019
Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019Sadayuki Furuhashi
 
Automating Workflows for Analytics Pipelines
Automating Workflows for Analytics PipelinesAutomating Workflows for Analytics Pipelines
Automating Workflows for Analytics PipelinesSadayuki Furuhashi
 
Fluentd at Bay Area Kubernetes Meetup
Fluentd at Bay Area Kubernetes MeetupFluentd at Bay Area Kubernetes Meetup
Fluentd at Bay Area Kubernetes MeetupSadayuki Furuhashi
 
DigdagはなぜYAMLなのか?
DigdagはなぜYAMLなのか?DigdagはなぜYAMLなのか?
DigdagはなぜYAMLなのか?Sadayuki Furuhashi
 
Logging for Production Systems in The Container Era
Logging for Production Systems in The Container EraLogging for Production Systems in The Container Era
Logging for Production Systems in The Container EraSadayuki Furuhashi
 
分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11
分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11
分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11Sadayuki Furuhashi
 
Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Sadayuki Furuhashi
 
Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014Sadayuki Furuhashi
 
Fluentd - Set Up Once, Collect More
Fluentd - Set Up Once, Collect MoreFluentd - Set Up Once, Collect More
Fluentd - Set Up Once, Collect MoreSadayuki Furuhashi
 
Prestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for PrestoPrestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for PrestoSadayuki Furuhashi
 
What's new in v11 - Fluentd Casual Talks #3 #fluentdcasual
What's new in v11 - Fluentd Casual Talks #3 #fluentdcasualWhat's new in v11 - Fluentd Casual Talks #3 #fluentdcasual
What's new in v11 - Fluentd Casual Talks #3 #fluentdcasualSadayuki Furuhashi
 
How we use Fluentd in Treasure Data
How we use Fluentd in Treasure DataHow we use Fluentd in Treasure Data
How we use Fluentd in Treasure DataSadayuki Furuhashi
 
How to collect Big Data into Hadoop
How to collect Big Data into HadoopHow to collect Big Data into Hadoop
How to collect Big Data into HadoopSadayuki Furuhashi
 

More from Sadayuki Furuhashi (20)

Scripting Embulk Plugins
Scripting Embulk PluginsScripting Embulk Plugins
Scripting Embulk Plugins
 
Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019
Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019
Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019
 
Making KVS 10x Scalable
Making KVS 10x ScalableMaking KVS 10x Scalable
Making KVS 10x Scalable
 
Automating Workflows for Analytics Pipelines
Automating Workflows for Analytics PipelinesAutomating Workflows for Analytics Pipelines
Automating Workflows for Analytics Pipelines
 
Fluentd at Bay Area Kubernetes Meetup
Fluentd at Bay Area Kubernetes MeetupFluentd at Bay Area Kubernetes Meetup
Fluentd at Bay Area Kubernetes Meetup
 
DigdagはなぜYAMLなのか?
DigdagはなぜYAMLなのか?DigdagはなぜYAMLなのか?
DigdagはなぜYAMLなのか?
 
Logging for Production Systems in The Container Era
Logging for Production Systems in The Container EraLogging for Production Systems in The Container Era
Logging for Production Systems in The Container Era
 
分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11
分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11
分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11
 
Embuk internals
Embuk internalsEmbuk internals
Embuk internals
 
Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1
 
Prestogres internals
Prestogres internalsPrestogres internals
Prestogres internals
 
Presto+MySQLで分散SQL
Presto+MySQLで分散SQLPresto+MySQLで分散SQL
Presto+MySQLで分散SQL
 
Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014
 
Fluentd - Set Up Once, Collect More
Fluentd - Set Up Once, Collect MoreFluentd - Set Up Once, Collect More
Fluentd - Set Up Once, Collect More
 
Prestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for PrestoPrestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for Presto
 
What's new in v11 - Fluentd Casual Talks #3 #fluentdcasual
What's new in v11 - Fluentd Casual Talks #3 #fluentdcasualWhat's new in v11 - Fluentd Casual Talks #3 #fluentdcasual
What's new in v11 - Fluentd Casual Talks #3 #fluentdcasual
 
How we use Fluentd in Treasure Data
How we use Fluentd in Treasure DataHow we use Fluentd in Treasure Data
How we use Fluentd in Treasure Data
 
Fluentd meetup at Slideshare
Fluentd meetup at SlideshareFluentd meetup at Slideshare
Fluentd meetup at Slideshare
 
How to collect Big Data into Hadoop
How to collect Big Data into HadoopHow to collect Big Data into Hadoop
How to collect Big Data into Hadoop
 
Fluentd meetup
Fluentd meetupFluentd meetup
Fluentd meetup
 

Recently uploaded

User Experience Designer | Kaylee Miller Resume
User Experience Designer | Kaylee Miller ResumeUser Experience Designer | Kaylee Miller Resume
User Experience Designer | Kaylee Miller ResumeKaylee Miller
 
Leveling Up your Branding and Mastering MERN: Fullstack WebDev
Leveling Up your Branding and Mastering MERN: Fullstack WebDevLeveling Up your Branding and Mastering MERN: Fullstack WebDev
Leveling Up your Branding and Mastering MERN: Fullstack WebDevpmgdscunsri
 
Practical Advice for FDA’s 510(k) Requirements.pdf
Practical Advice for FDA’s 510(k) Requirements.pdfPractical Advice for FDA’s 510(k) Requirements.pdf
Practical Advice for FDA’s 510(k) Requirements.pdfICS
 
Steps to Successfully Hire Ionic Developers
Steps to Successfully Hire Ionic DevelopersSteps to Successfully Hire Ionic Developers
Steps to Successfully Hire Ionic Developersmichealwillson701
 
8 key point on optimizing web hosting services in your business.pdf
8 key point on optimizing web hosting services in your business.pdf8 key point on optimizing web hosting services in your business.pdf
8 key point on optimizing web hosting services in your business.pdfOffsiteNOC
 
renewable energy renewable energy renewable energy renewable energy
renewable energy renewable energy renewable energy  renewable energyrenewable energy renewable energy renewable energy  renewable energy
renewable energy renewable energy renewable energy renewable energyjeyasrig
 
VuNet software organisation powerpoint deck
VuNet software organisation powerpoint deckVuNet software organisation powerpoint deck
VuNet software organisation powerpoint deckNaval Singh
 
Mobile App Development company Houston
Mobile  App  Development  company HoustonMobile  App  Development  company Houston
Mobile App Development company Houstonjennysmithusa549
 
Large Scale Architecture -- The Unreasonable Effectiveness of Simplicity
Large Scale Architecture -- The Unreasonable Effectiveness of SimplicityLarge Scale Architecture -- The Unreasonable Effectiveness of Simplicity
Large Scale Architecture -- The Unreasonable Effectiveness of SimplicityRandy Shoup
 
Building Generative AI-infused apps: what's possible and how to start
Building Generative AI-infused apps: what's possible and how to startBuilding Generative AI-infused apps: what's possible and how to start
Building Generative AI-infused apps: what's possible and how to startMaxim Salnikov
 
Technical improvements. Reasons. Methods. Estimations. CJ
Technical improvements.  Reasons. Methods. Estimations. CJTechnical improvements.  Reasons. Methods. Estimations. CJ
Technical improvements. Reasons. Methods. Estimations. CJpolinaucc
 
8 Steps to Build a LangChain RAG Chatbot.
8 Steps to Build a LangChain RAG Chatbot.8 Steps to Build a LangChain RAG Chatbot.
8 Steps to Build a LangChain RAG Chatbot.Ritesh Kanjee
 
CYBER SECURITY AND CYBER CRIME COMPLETE GUIDE.pLptx
CYBER SECURITY AND CYBER CRIME COMPLETE GUIDE.pLptxCYBER SECURITY AND CYBER CRIME COMPLETE GUIDE.pLptx
CYBER SECURITY AND CYBER CRIME COMPLETE GUIDE.pLptxBarakaMuyengi
 
MUT4SLX: Extensions for Mutation Testing of Stateflow Models
MUT4SLX: Extensions for Mutation Testing of Stateflow ModelsMUT4SLX: Extensions for Mutation Testing of Stateflow Models
MUT4SLX: Extensions for Mutation Testing of Stateflow ModelsUniversity of Antwerp
 
Boost Efficiency: Sabre API Integration Made Easy
Boost Efficiency: Sabre API Integration Made EasyBoost Efficiency: Sabre API Integration Made Easy
Boost Efficiency: Sabre API Integration Made Easymichealwillson701
 
Flutter the Future of Mobile App Development - 5 Crucial Reasons.pdf
Flutter the Future of Mobile App Development - 5 Crucial Reasons.pdfFlutter the Future of Mobile App Development - 5 Crucial Reasons.pdf
Flutter the Future of Mobile App Development - 5 Crucial Reasons.pdfMind IT Systems
 
MinionLabs_Mr. Gokul Srinivas_Young Entrepreneur
MinionLabs_Mr. Gokul Srinivas_Young EntrepreneurMinionLabs_Mr. Gokul Srinivas_Young Entrepreneur
MinionLabs_Mr. Gokul Srinivas_Young EntrepreneurPriyadarshini T
 
Telebu Social -Whatsapp Business API : Mastering Omnichannel Business Communi...
Telebu Social -Whatsapp Business API : Mastering Omnichannel Business Communi...Telebu Social -Whatsapp Business API : Mastering Omnichannel Business Communi...
Telebu Social -Whatsapp Business API : Mastering Omnichannel Business Communi...telebusocialmarketin
 
openEuler Community Overview - a presentation showing the current scale
openEuler Community Overview - a presentation showing the current scaleopenEuler Community Overview - a presentation showing the current scale
openEuler Community Overview - a presentation showing the current scaleShane Coughlan
 

Recently uploaded (20)

User Experience Designer | Kaylee Miller Resume
User Experience Designer | Kaylee Miller ResumeUser Experience Designer | Kaylee Miller Resume
User Experience Designer | Kaylee Miller Resume
 
20140812 - OBD2 Solution
20140812 - OBD2 Solution20140812 - OBD2 Solution
20140812 - OBD2 Solution
 
Leveling Up your Branding and Mastering MERN: Fullstack WebDev
Leveling Up your Branding and Mastering MERN: Fullstack WebDevLeveling Up your Branding and Mastering MERN: Fullstack WebDev
Leveling Up your Branding and Mastering MERN: Fullstack WebDev
 
Practical Advice for FDA’s 510(k) Requirements.pdf
Practical Advice for FDA’s 510(k) Requirements.pdfPractical Advice for FDA’s 510(k) Requirements.pdf
Practical Advice for FDA’s 510(k) Requirements.pdf
 
Steps to Successfully Hire Ionic Developers
Steps to Successfully Hire Ionic DevelopersSteps to Successfully Hire Ionic Developers
Steps to Successfully Hire Ionic Developers
 
8 key point on optimizing web hosting services in your business.pdf
8 key point on optimizing web hosting services in your business.pdf8 key point on optimizing web hosting services in your business.pdf
8 key point on optimizing web hosting services in your business.pdf
 
renewable energy renewable energy renewable energy renewable energy
renewable energy renewable energy renewable energy  renewable energyrenewable energy renewable energy renewable energy  renewable energy
renewable energy renewable energy renewable energy renewable energy
 
VuNet software organisation powerpoint deck
VuNet software organisation powerpoint deckVuNet software organisation powerpoint deck
VuNet software organisation powerpoint deck
 
Mobile App Development company Houston
Mobile  App  Development  company HoustonMobile  App  Development  company Houston
Mobile App Development company Houston
 
Large Scale Architecture -- The Unreasonable Effectiveness of Simplicity
Large Scale Architecture -- The Unreasonable Effectiveness of SimplicityLarge Scale Architecture -- The Unreasonable Effectiveness of Simplicity
Large Scale Architecture -- The Unreasonable Effectiveness of Simplicity
 
Building Generative AI-infused apps: what's possible and how to start
Building Generative AI-infused apps: what's possible and how to startBuilding Generative AI-infused apps: what's possible and how to start
Building Generative AI-infused apps: what's possible and how to start
 
Technical improvements. Reasons. Methods. Estimations. CJ
Technical improvements.  Reasons. Methods. Estimations. CJTechnical improvements.  Reasons. Methods. Estimations. CJ
Technical improvements. Reasons. Methods. Estimations. CJ
 
8 Steps to Build a LangChain RAG Chatbot.
8 Steps to Build a LangChain RAG Chatbot.8 Steps to Build a LangChain RAG Chatbot.
8 Steps to Build a LangChain RAG Chatbot.
 
CYBER SECURITY AND CYBER CRIME COMPLETE GUIDE.pLptx
CYBER SECURITY AND CYBER CRIME COMPLETE GUIDE.pLptxCYBER SECURITY AND CYBER CRIME COMPLETE GUIDE.pLptx
CYBER SECURITY AND CYBER CRIME COMPLETE GUIDE.pLptx
 
MUT4SLX: Extensions for Mutation Testing of Stateflow Models
MUT4SLX: Extensions for Mutation Testing of Stateflow ModelsMUT4SLX: Extensions for Mutation Testing of Stateflow Models
MUT4SLX: Extensions for Mutation Testing of Stateflow Models
 
Boost Efficiency: Sabre API Integration Made Easy
Boost Efficiency: Sabre API Integration Made EasyBoost Efficiency: Sabre API Integration Made Easy
Boost Efficiency: Sabre API Integration Made Easy
 
Flutter the Future of Mobile App Development - 5 Crucial Reasons.pdf
Flutter the Future of Mobile App Development - 5 Crucial Reasons.pdfFlutter the Future of Mobile App Development - 5 Crucial Reasons.pdf
Flutter the Future of Mobile App Development - 5 Crucial Reasons.pdf
 
MinionLabs_Mr. Gokul Srinivas_Young Entrepreneur
MinionLabs_Mr. Gokul Srinivas_Young EntrepreneurMinionLabs_Mr. Gokul Srinivas_Young Entrepreneur
MinionLabs_Mr. Gokul Srinivas_Young Entrepreneur
 
Telebu Social -Whatsapp Business API : Mastering Omnichannel Business Communi...
Telebu Social -Whatsapp Business API : Mastering Omnichannel Business Communi...Telebu Social -Whatsapp Business API : Mastering Omnichannel Business Communi...
Telebu Social -Whatsapp Business API : Mastering Omnichannel Business Communi...
 
openEuler Community Overview - a presentation showing the current scale
openEuler Community Overview - a presentation showing the current scaleopenEuler Community Overview - a presentation showing the current scale
openEuler Community Overview - a presentation showing the current scale
 

Embulk, an open-source plugin-based parallel bulk data loader

  • 1. Sadayuki Furuhashi Founder & Software Architect Treasure Data, inc. EmbulkAn open-source plugin-based parallel bulk data loader that makes painful data integration work relaxed. Sharing our knowledge on RubyGems to manage arbitrary files.
  • 2. A little about me... > Sadayuki Furuhashi > github/twitter: @frsyuki > Treasure Data, Inc. > Founder & Software Architect > Open-source hacker > MessagePack - Efficient object serializer > Fluentd - An unified data collection tool > Prestogres - PostgreSQL protocol gateway for Presto > Embulk - A plugin-based parallel bulk data loader > ServerEngine - A Ruby framework to build multiprocess servers > LS4 - A distributed object storage with cross-region replication > kumofs - A distributed strong-consistent key-value data store
  • 3. Today’s talk > What’s Embulk? > How Embulk works? > The architecture > Writing Embulk plugins > Roadmap & Development > Q&A + Discussion
  • 4. What’s Embulk? > An open-source parallel bulk data loader > using plugins > to make data integration relaxed.
  • 5. What’s Embulk? > An open-source parallel bulk data loader > loads records from “A” to “B” > using plugins > for various kinds of “A” and “B” > to make data integration relaxed. > which was very painful… Storage, RDBMS, NoSQL, Cloud Service, etc. broken records,
 transactions (idempotency),
 performance, …
  • 6. The pains of bulk data loading Example: load a 10GB CSV file to PostgreSQL > 1. First attempt → fails > 2. Write a script to make the records cleaned • Convert ”20150127T190500Z” → “2015-01-27 19:05:00 UTC” • Convert “N" → “” • many cleanings… > 3. Second attempt → another error • Convert “Inf” → “Infinity” > 4. Fix the script, retry, retry, retry… > 5. Oh, some data got loaded twice!?
  • 7. The pains of bulk data loading Example: load a 10GB CSV file to PostgreSQL > 6. Ok, the script worked. > 7. Register it to cron to sync data every day. > 8. One day… it fails with another error • Convert invalid UTF-8 byte sequence to U+FFFD
  • 8. The pains of bulk data loading Example: load 10GB CSV × 720 files > Most of scripts are slow. • People have little time to optimize bulk load scripts > One file takes 1 hour → 720 files takes 1 month (!?) A lot of integration efforts for each storages: > XML, JSON, Apache log format (+some custom), … > SAM, BED, BAI2, HDF5, TDE, SequenceFile, RCFile… > MongoDB, Elasticsearch, Redshift, Salesforce, …
  • 9. The problems: > Data cleaning (normalization) > How to normalize broken records? > Error handling > How to remove broken records? > Idempotent retrying > How to retry without duplicated loading? > Performance optimization > How to optimize the code or parallelize?
  • 10. The problems at Treasure Data Treasure Data Service? > “Fast, powerful SQL access to big data from connected applications and products, with no new infrastructure or special skills required.” > Customers want to try Treasure Data, but > SEs write scripts to bulk load their data. Hard work :( > Customers want to migrate their big data, but > Hard work :( > Fluentd solved streaming data collection, but > bulk data loading is another problem.
  • 11. A solution: > Package the efforts as a plugin. > data cleaning, error handling, retrying > Share & reuse the plugin. > don’t repeat the pains! > Keep improving the plugin code. > rather than throwing away the efforts every time > using OSS-style pull-reqs & frequent releases.
  • 12. Embulk Embulk is an open-source, plugin-based parallel bulk data loader
 that makes data integration works relaxed.
  • 14. HDFS MySQL Amazon S3 Embulk CSV Files SequenceFile Salesforce.com Elasticsearch Cassandra Hive Redis ✓ Parallel execution ✓ Data validation ✓ Error recovery ✓ Deterministic behavior ✓ Idempotet retrying bulk load
  • 15. HDFS MySQL Amazon S3 Embulk CSV Files SequenceFile Salesforce.com Elasticsearch Cassandra Hive Redis ✓ Parallel execution ✓ Data validation ✓ Error recovery ✓ Deterministic behavior ✓ Idempotet retrying Plugins Plugins bulk load
  • 17. # install $ wget https://bintray.com/artifact/download/ embulk/maven/embulk-0.2.0.jar -o embulk.jar $ chmod 755 embulk.jar Installing embulk Bintray
 releases Embulk is released on Bintray wget embulk.jar
  • 18. # install $ wget https://bintray.com/artifact/download/ embulk/maven/embulk-0.2.0.jar -o embulk.jar $ chmod 755 embulk.jar
 # guess $ vi partial-config.yml $ ./embulk guess partial-config.yml
 -o config.yml Guess format & schema in: type: file paths: [data/examples/] out:
 type: example in: type: file paths: [data/examples/] decoders: - {type: gzip} parser: charset: UTF-8 newline: CRLF type: csv delimiter: ',' quote: '"' header_line: true columns: - name: time
 type: timestamp
 format: '%Y-%m-%d %H:%M:%S' - name: account
 type: long - name: purchase
 type: timestamp
 format: '%Y%m%d' - name: comment
 type: string out:
 type: example guess by guess plugins
  • 19. # install $ wget https://bintray.com/artifact/download/ embulk/maven/embulk-0.2.0.jar -o embulk.jar $ chmod 755 embulk.jar
 # guess $ vi partial-config.yml $ ./embulk guess partial-config.yml
 -o config.yml
 # preview $ ./embulk preview config.yml $ vi config.yml # if necessary +--------------------------------------+---------------+--------------------+ | time:timestamp | uid:long | word:string | +--------------------------------------+---------------+--------------------+ | 2015-01-27 19:23:49 UTC | 32,864 | embulk | | 2015-01-27 19:01:23 UTC | 14,824 | jruby | | 2015-01-28 02:20:02 UTC | 27,559 | plugin | | 2015-01-29 11:54:36 UTC | 11,270 | fluentd | +--------------------------------------+---------------+--------------------+ Preview & fix config
  • 20. # install $ wget https://bintray.com/artifact/download/ embulk/maven/embulk-0.2.0.jar -o embulk.jar $ chmod 755 embulk.jar
 # guess $ vi partial-config.yml $ ./embulk guess partial-config.yml
 -o config.yml
 # preview $ ./embulk preview config.yml $ vi config.yml # if necessary # run $ ./embulk run config.yml -o config.yml in: type: file paths: [data/examples/] decoders: - {type: gzip} parser: charset: UTF-8 newline: CRLF type: csv delimiter: ',' quote: '"' header_line: true columns: - name: time
 type: timestamp
 format: '%Y-%m-%d %H:%M:%S' - name: account
 type: long - name: purchase
 type: timestamp
 format: '%Y%m%d' - name: comment
 type: string last_paths: [data/examples/sample_001.csv.gz] out:
 type: example Deterministic run
  • 21. in: type: file paths: [data/examples/] decoders: - {type: gzip} parser: charset: UTF-8 newline: CRLF type: csv delimiter: ',' quote: '"' header_line: true columns: - name: time
 type: timestamp
 format: '%Y-%m-%d %H:%M:%S' - name: account
 type: long - name: purchase
 type: timestamp
 format: '%Y%m%d' - name: comment
 type: string last_paths: [data/examples/sample_002.csv.gz] out:
 type: example Repeat # install $ wget https://bintray.com/artifact/download/ embulk/maven/embulk-0.2.0.jar -o embulk.jar $ chmod 755 embulk.jar
 # guess $ vi partial-config.yml $ ./embulk guess partial-config.yml
 -o config.yml
 # preview $ ./embulk preview config.yml $ vi config.yml # if necessary # run $ ./embulk run config.yml -o config.yml # repeat $ ./embulk run config.yml -o config.yml $ ./embulk run config.yml -o config.yml
  • 24. InputPlugin OutputPlugin Embulk executor plugin MySQL, Cassandra, HBase, Elasticsearch,
 Treasure Data, … record record
  • 26. InputPlugin FileInputPlugin OutputPlugin FileOutputPlugin EncoderPlugin FormatterPlugin DecoderPlugin ParserPlugin Embulk executor plugin HDFS, S3,
 Riak CS, … gzip, bzip2,
 3des, … CSV, JSON,
 RCFile, … buffer buffer record record buffer buffer
  • 28. InputPlugin module Embulk class InputExample < InputPlugin Plugin.register_input('example', self) def self.transaction(config, &control) # read config task = { 'message' => config.param('message', :string, default: nil) } threads = config.param('threads', :int, default: 2) columns = [ Column.new(0, 'col0', :long), Column.new(1, 'col1', :double), Column.new(2, 'col2', :string), ] # BEGIN here commit_reports = yield(task, columns, threads) # COMMIT here puts "Example input finished" return {} end def run(task, schema, index, page_builder) puts "Example input thread #{@index}…" 10.times do |i| @page_builder.add([i, 10.0, "example"]) end @page_builder.finish commit_report = { } return commit_report end end end
  • 29. OutputPlugin module Embulk class OutputExample < OutputPlugin Plugin.register_output('example', self) def self.transaction( config, schema, processor_count, &control) # read config task = { 'message' => config.param('message', :string, default: "record") } puts "Example output started." commit_reports = yield(task) puts "Example output finished. Commit reports = #{commit_reports.to_json}" return {} end def initialize(task, schema, index) puts "Example output thread #{index}..." super @message = task.prop('message', :string) @records = 0 end def add(page) page.each do |record| hash = Hash[schema.names.zip(record)] puts "#{@message}: #{hash.to_json}" @records += 1 end end def finish end def abort end def commit commit_report = { "records" => @records } return commit_report end end end
  • 30. GuessPlugin # guess_gzip.rb module Embulk class GzipGuess < GuessPlugin Plugin.register_guess('gzip', self) GZIP_HEADER = "x1f x8b".force_encoding('ASCII-8BIT').freeze def guess(config, sample_buffer) if sample_buffer[0,2] == GZIP_HEADER return {"decoders" => [{"type" => "gzip"}]} end return {} end end end # guess_ module Embulk class GuessNewline < TextGuessPlugin Plugin.register_guess('newline', self) def guess_text(config, sample_text) cr_count = sample_text.count("r") lf_count = sample_text.count("n") crlf_count = sample_text.scan(/rn/).length if crlf_count > cr_count / 2 && crlf_count > lf_count / 2 return {"parser" => {"newline" => "CRLF"}} elsif cr_count > lf_count / 2 return {"parser" => {"newline" => "CR"}} else return {"parser" => {"newline" => "LF"}} end end end end
  • 31. Releasing to RubyGems Examples > embulk-plugin-postgres-json.gem > https://github.com/frsyuki/embulk-plugin-postgres-json > embulk-plugin-redis.gem > https://github.com/komamitsu/embulk-plugin-redis > embulk-plugin-input-sfdc-event-log-files.gem > https://github.com/nahi/embulk-plugin-input-sfdc-event- log-files
  • 33. Roadmap > Add missing JRuby Plugin APIs > ParserPlugin, FormatterPlugin > DecoderPlugin, EncoderPlugin > Add Executor plugin SPI > Add ssh distributed executor > embulk run —command ssh %host embulk run %task > Add MapReduce executor > Add support for nested records (?)
  • 34. Contributing to the Embulk project > Pull-requests & issues on Github > Posting blogs > “I tried Embulk. Here is how it worked” > “I read Embulk code. Here is how it’s written” > “Embulk is good because…but bad because…” > Talking on Twitter with a word “embulk" > Writing & releasing plugins > Windows support > Integration to other software > ETL tools, Fluentd, Hadoop, Presto, …
  • 35. Q&A + Discussion? Hiroshi Nakamura @nahi Muga Nishizawa @muga_nishizawa Sadayuki Furuhashi @frsyuki Embulk committers:
  • 36. https://jobs.lever.co/treasure-data Cloud service for the entire data pipeline. We’re hiring!