3. Topics
Embulk plugin development
Retry! Retry!! Retry!!!
Exception handling
Battle with external service’s specs
Write unit test
Java or JRuby ?
Using Embulk at Treasure Data
Integration test
Implement new API endpoint
Infrastructure management
4. We’re using Embulk as a bulkload tool
Pluggable bulkload tool
Released as OSS
We’re using the same version as the OSS release
10. Java or JRuby ?
Embulk supports both Java-based and JRuby-based plugins
Java-based plugin:
High performance
Filter / Parser / Formatter / Encoder / Decoder plugins
These plugins need high performance
Some enterprise services and software provide a Java SDK.
Write with Java 7 (the MapReduce Executor needs Java 7)
JRuby-based plugin:
Easy to write
Suitable when the network is the bottleneck (e.g. cloud services).
11. Exception handling to avoid infinite retry
ConfigException:
The transaction method should validate all config values
It should throw ConfigException or a subclass when validation fails
public ConfigDiff transaction(ConfigSource config, FileInputPlugin.Control control)
{
    …
    if (task.getFiles().isEmpty()) {
        throw new ConfigException("File is empty");
    }
}
DataException:
The plugin should throw DataException or a subclass when it finds an invalid record
…
} catch (CsvTokenizer.InvalidFormatException | CsvTokenizer.InvalidValueException … e) {
    if (stopOnInvalidRecord) {
        throw new DataException("Invalid record"); // throw an exception if stopOnInvalidRecord is true
    }
    log.warn("Invalid record"); // only log a warning if stopOnInvalidRecord is false
}
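The point of the two exception types can be sketched as a retry wrapper that gives up immediately on them. This is a minimal sketch, not Embulk's or Treasure Data's actual retry code: `RetrySketch` and `callWithRetry` are hypothetical names, and the nested `ConfigException` and `DataException` classes are local stand-ins for the Embulk classes so that the example compiles on its own.

```java
import java.util.function.Supplier;

// Minimal sketch: retry failures that may be transient, but fail fast on
// exceptions that a retry cannot fix, so the job never retries forever.
class RetrySketch {
    // Local stand-ins (assumption) for Embulk's ConfigException and DataException.
    static class ConfigException extends RuntimeException {
        ConfigException(String message) { super(message); }
    }

    static class DataException extends RuntimeException {
        DataException(String message) { super(message); }
    }

    // Hypothetical helper: runs the task at most maxAttempts times.
    static <T> T callWithRetry(Supplier<T> task, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.get();
            } catch (ConfigException | DataException e) {
                // Bad config or bad data stays bad on the next attempt: do not retry.
                throw e;
            } catch (RuntimeException e) {
                // Treated as possibly transient (assumption): remember it and try again.
                last = e;
            }
        }
        throw last; // all attempts failed
    }
}
```

A caller would wrap the remote call, for example `RetrySketch.callWithRetry(() -> listFiles(), 3)`, where `listFiles` is a placeholder for whatever the plugin does against the external service.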
12. Battle with external service’s specs
Example of building the next token for object storage.
Next token: the point from which to resume listing files stored in a bucket or container.
Azure Blob Storage:
String path = "/path/to/file";
String str = String.format("%06d", path.length()) + "!" + path + "!"
        + "000028" + "!" + "9999-12-31T23:59:59.9999999Z" + "!";
String encodedString = BaseEncoding.base64().encode(str.getBytes(Charsets.UTF_8));
String nextToken = "2" + "!" + encodedString.length() + "!" + encodedString;
Google Cloud Storage:
String path = "/path/to/file";
byte[] utf8 = path.getBytes(Charsets.UTF_8);
byte[] encoding = new byte[utf8.length + 2];
encoding[0] = 0x0a;
encoding[1] = (byte) utf8.length; // single length byte, so only paths shorter than 128 bytes work
System.arraycopy(utf8, 0, encoding, 2, utf8.length);
String nextToken = BaseEncoding.base64().encode(encoding);
AWS S3:
String path = "/path/to/file"; // use the path string itself as the next token
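For experimenting with these formats without Guava, the three token builders can be written against the JDK only. This is a sketch under assumptions: `NextTokenSketch` and its method names are hypothetical, `java.util.Base64` replaces Guava's `BaseEncoding`, and the token layouts are taken from the slide as-is rather than from any service documentation.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Minimal sketch of the three next-token formats shown on the slide.
class NextTokenSketch {
    // Azure Blob Storage style: "2!<length of base64>!<base64 of a '!'-delimited string>".
    static String azureNextToken(String path) {
        String str = String.format("%06d", path.length()) + "!" + path + "!"
                + "000028" + "!" + "9999-12-31T23:59:59.9999999Z" + "!";
        String encoded = Base64.getEncoder()
                .encodeToString(str.getBytes(StandardCharsets.UTF_8));
        return "2" + "!" + encoded.length() + "!" + encoded;
    }

    // Google Cloud Storage style: base64 of [0x0a, length, UTF-8 bytes of the path].
    static String gcsNextToken(String path) {
        byte[] utf8 = path.getBytes(StandardCharsets.UTF_8);
        if (utf8.length > 127) {
            // The slide's layout uses one length byte, so longer paths are rejected here.
            throw new IllegalArgumentException("path too long for a single length byte");
        }
        byte[] encoding = new byte[utf8.length + 2];
        encoding[0] = 0x0a;
        encoding[1] = (byte) utf8.length;
        System.arraycopy(utf8, 0, encoding, 2, utf8.length);
        return Base64.getEncoder().encodeToString(encoding);
    }

    // AWS S3 style: the path string itself is the next token.
    static String s3NextToken(String path) {
        return path;
    }
}
```

Decoding a generated token with `Base64.getDecoder()` is a quick way to check that the layout matches what the service returned for the same path.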
13. Write unit test
We need 80% coverage before a plugin can be used on our platform.
But it is difficult to write tests for Embulk plugins 😞
Unit tests without a remote connection that I have seen so far:
SFTP: create a Java-based virtual SFTP server on the local machine.
DynamoDB: AWS provides a downloadable version of DynamoDB.
Filter / Parser / Formatter / Encoder / Decoder plugins
80% coverage is difficult to reach without connecting to the service:
Connect to the remote service on each test run
Set confidential values in environment variables.
Use “Encryption keys” and “Encryption files” on Travis CI.
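Reading confidential values from environment variables can be sketched as below. This is a minimal sketch under assumptions: `TestCredentialSketch`, its methods, and the variable names in the usage note are hypothetical and do not come from any Embulk plugin.

```java
import java.util.Map;
import java.util.Optional;

// Minimal sketch: look up test credentials in an environment map and decide
// whether tests that need the remote service can run at all.
class TestCredentialSketch {
    // Returns the value only when the variable is set and not blank.
    static Optional<String> lookup(Map<String, String> env, String name) {
        String value = env.get(name);
        if (value == null || value.trim().isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(value);
    }

    // True only when every required variable is present.
    static boolean canRunRemoteTests(Map<String, String> env, String... requiredNames) {
        for (String name : requiredNames) {
            if (!lookup(env, name).isPresent()) {
                return false;
            }
        }
        return true;
    }
}
```

In a test class this could be called as `TestCredentialSketch.canRunRemoteTests(System.getenv(), "EXAMPLE_ACCESS_KEY", "EXAMPLE_SECRET_KEY")`, and the result passed to JUnit 4's `Assume.assumeTrue` so that the remote tests are skipped rather than failed on machines without credentials.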
15. Architecture of Treasure Data
(Diagram. Components: td commands, Web Console, Load Balancer, TD API (API servers), Bulkload API (API servers), Perfect Queue, TD worker (worker process), MySQL. Arrow labels: Request, Response, guess/preview, enqueue, dequeue, Submit Job (retry if needed), Execute with MR / Local Executor.)
16. TD API / Bulkload API
guess/preview is processed on separate API servers.
guess/preview needs a quick response.
(Diagram. Components: Load Balancer, TD API (API servers), Bulkload API (API servers), Perfect Queue. Arrow labels: Request, Response, guess/preview, data import, enqueue. Legend: Queuing, HTTP Request/Response.)
17. Huge data arrives
Embulk config with thousands of columns:
Needs thorough validation in the transaction method
Plugins should return clear error or warning messages
Huge data:
Retry logic in the plugin is important
Retry when a retryable exception happens
Use the MapReduce Executor
Reduce the difference in usage between instances.
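Validation that stays readable with thousands of columns can be sketched as collecting every problem first and reporting them in one message. This is a minimal sketch under assumptions: `ColumnValidationSketch` and its checks (empty and duplicate names) are hypothetical examples, and `IllegalArgumentException` stands in for the ConfigException a real plugin would throw from its transaction method.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal sketch: validate all column names up front and report every problem
// at once, so the user does not fix one column per failed job.
class ColumnValidationSketch {
    // Returns one message per invalid column, in column order.
    static List<String> findProblems(List<String> columnNames) {
        List<String> problems = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        for (int i = 0; i < columnNames.size(); i++) {
            String name = columnNames.get(i);
            if (name == null || name.trim().isEmpty()) {
                problems.add("column #" + i + ": name is empty");
            } else if (!seen.add(name)) {
                problems.add("column #" + i + ": duplicate name '" + name + "'");
            }
        }
        return problems;
    }

    // Throws a single exception that lists every problem found.
    static void validate(List<String> columnNames) {
        List<String> problems = findProblems(columnNames);
        if (!problems.isEmpty()) {
            // A real plugin would throw ConfigException here (assumption).
            throw new IllegalArgumentException(
                    problems.size() + " invalid column(s): " + String.join("; ", problems));
        }
    }
}
```

Reporting the column index along with the name keeps the message usable even when the config is too large to scan by eye.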
18. Write integration test
Write integration tests for each connector (result output) with RSpec
Does td connector:guess (embulk guess) work?
Does td connector:preview (embulk preview) work?
Does td connector:issue (embulk run) work as expected?
Does it work with the LocalExecutor?
Does it work with the MapReduce Executor?
Does it work with filter plugins?
Does scheduled execution work as expected?
many test cases × each service
19. Want to improve…
Target service times out 😞
Target service returns a 50x error 😞
API limit exceeded 😞
CI failure
Long execution time
many test cases × each service
20. Want to implement…
The API endpoints are not enough
guess
preview
issue (run)
GUI console
CUI
Unclear until the user runs a job (or guess or preview) and the plugin returns a result or a ConfigException:
Are the username and password valid?