2. About Me
• Kai Sasaki (佐々木 海)
• @Lewuathe (Twitter)
• Software Engineer at Treasure Data Inc.
• Maintains and develops Hadoop/Presto infrastructure
3. Topic
• Treasure Data infrastructure
• Hive 2.0 changes
• Migration architecture
• Resource management for multi-tenancy
• Performance comparison
4. • Live Data Management Platform
• Original creator of Fluentd/Embulk/Digdag
• 70+ integrations with
• BI tools
• Mobile/IoT
• Cloud Storage
• and more
6. • Hive/Pig/Presto data processing interface
• 40000+ Hive queries / day
• 130000+ Presto queries / day
• Plazma Cloud Storage
• 450000+ records/sec imported
8. Hive 2.0
• Includes major new features
• Fixed 600+ bugs
• 140+ improvements or new features
• Backward compatible as much as possible
• Hive 1.x stable line
• 2.1.0 has been available since June 20, 2016
http://www.slideshare.net/HadoopSummit/apache-hive-20-sql-speed-scale
9. Hive 2.0
• HPLSQL
• LLAP
• HBase metastore
• Improvements of Hive on Spark
• CBO improvements
10. HPLSQL
• Procedural SQL like Oracle’s PL/SQL
• Cursor
• loops (WHILE, FOR, LOOP)
• branches (IF)
• External library which communicates through JDBC
• http://www.hplsql.org/doc
11. LLAP
• Sub-second Queries in Hive
• Saves JVM container launch time
• Data caching
• Suited to ad hoc or interactive use cases
• Beta in 2.0
http://hortonworks.com/wp-content/uploads/2014/09/Screen-Shot-2014-09-02-at-5.03.47-PM.png
13. HBase metastore
• Uses HBase as the Hive metastore
• Fetching thousands of partitions
• Limit on concurrent connections
• Will support transactions with Apache Omid
• Alpha in Hive 2.0
http://hortonworks.com/wp-content/uploads/2014/09/Screen-Shot-2014-09-02-at-5.03.47-PM.png
16. That’s all?
• Operation cost of migration
• Managing multiple clusters
• Testing and verifying multiple packages
• Differences in configuration and parameters
• Need to reduce operation cost at the same time
18. Challenge
• NO DOWNTIME
• NO HARMFUL OPERATION
• Change package easily
• Separate from other components (microservice)
• NO DEGRADATION
• Automatic query test and validation
19. NO DOWNTIME
• Hadoop cluster Blue-Green deployment
• Reliable queue system separated from Hadoop
→ PerfectQueue
• Reliable storage system separated from Hadoop
→ Plazma
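The blue-green idea above can be sketched as follows. This is only an illustration, assuming jobs wait in a queue that lives outside both Hadoop clusters so that traffic can be switched between jobs. `ClusterRouter` and the cluster names are hypothetical and are not taken from Treasure Data's system.

```python
from collections import deque


class ClusterRouter:
    """Routes queued jobs to whichever cluster is currently active."""

    def __init__(self, blue, green):
        self.clusters = {"blue": blue, "green": green}
        self.active = "blue"

    def switch(self):
        # Flip traffic to the other cluster. Queued jobs are untouched
        # because the queue lives outside both clusters.
        self.active = "green" if self.active == "blue" else "blue"

    def dispatch(self, queue):
        # Pull one job from the external queue and pair it with the active cluster.
        job = queue.popleft()
        return self.clusters[self.active], job


queue = deque(["q1", "q2", "q3"])
router = ClusterRouter(blue="hadoop-v1", green="hadoop-v2")
first = router.dispatch(queue)   # served by the old cluster
router.switch()                  # cut over while jobs are still waiting
second = router.dispatch(queue)  # served by the new cluster
```

Because the queue and the storage are separate from the cluster, the switch itself does not need to move any job or data.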
20. PerfectQueue
• Distributed queue built on top of RDBMS
• At-least-once semantics
• Graceful and live restarting
• State consistency by transaction
• https://github.com/treasure-data/perfectqueue
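To illustrate how an RDBMS-backed queue can give at-least-once delivery, here is a minimal sketch using Python's built-in `sqlite3`. It is not PerfectQueue's actual schema or API; the table layout, function names, and the visibility-timeout approach are assumptions chosen for illustration. A real multi-worker deployment would also need row-level locking in the database.

```python
import sqlite3


def create_queue(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        "id TEXT PRIMARY KEY, payload TEXT, visible_at REAL)"
    )


def submit(conn, task_id, payload, now):
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO tasks VALUES (?, ?, ?)", (task_id, payload, now))


def acquire(conn, now, timeout=60.0):
    """Lease the oldest visible task by pushing its visibility time forward."""
    with conn:
        row = conn.execute(
            "SELECT id, payload FROM tasks WHERE visible_at <= ? "
            "ORDER BY visible_at, id LIMIT 1",
            (now,),
        ).fetchone()
        if row is None:
            return None
        conn.execute(
            "UPDATE tasks SET visible_at = ? WHERE id = ?",
            (now + timeout, row[0]),
        )
        return row


def finish(conn, task_id):
    # Only an explicit finish removes the task; a crashed worker never calls it.
    with conn:
        conn.execute("DELETE FROM tasks WHERE id = ?", (task_id,))


conn = sqlite3.connect(":memory:")
create_queue(conn)
submit(conn, "job-1", "SELECT 1", now=0.0)
leased = acquire(conn, now=1.0)         # worker A takes the task
hidden = acquire(conn, now=2.0)         # nothing is visible while it is leased
redelivered = acquire(conn, now=100.0)  # lease expired without finish: delivered again
finish(conn, "job-1")
empty = acquire(conn, now=1000.0)
```

A task is removed only on `finish`, so a worker that dies mid-task causes redelivery rather than loss, which is the at-least-once trade-off: consumers must tolerate duplicates.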
21. Plazma
• Distributed cloud-based storage
• PostgreSQL + S3/Riak CS
• Enables time-index pushdown for Hive/Pig/Presto
• Column-oriented IO (mpc1)
• Data consistency with transactional API
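Time-index pushdown can be pictured as pruning files by their time range before any data is read. The sketch below assumes a metadata index that records a minimum and maximum timestamp per stored partition; the `Partition` type, field names, and paths are hypothetical and do not reflect Plazma's real schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    path: str      # hypothetical object key in S3/Riak CS
    min_time: int  # inclusive, unix seconds
    max_time: int  # inclusive, unix seconds


def prune(partitions, start, end):
    """Return only partitions whose time range overlaps [start, end)."""
    return [p for p in partitions if p.max_time >= start and p.min_time < end]


index = [
    Partition("part-a", 0, 3599),
    Partition("part-b", 3600, 7199),
    Partition("part-c", 7200, 10799),
]
# A query restricted to the second hour only needs one of the three files.
selected = [p.path for p in prune(index, start=3600, end=7200)]
```

The query engine then fetches only the selected objects from storage, so a narrow time filter avoids scanning the rest of the table.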
29. NO HARMFUL OPS
• Automatic package version upgrade
• Chef server specifies the version
• Hadoop package repository
• S3 remote package repository
• Hadoop as a REST service
• elephant-server
38. NO DEGRADATION
• Validation of
• Parameter difference
• Query result difference
• Performance deterioration
• Automatic testing and persistent result tables
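The three validations listed above could be sketched as follows. The helper names and the 20% tolerance are assumptions for illustration only; the configuration keys are real Hive properties, but the values shown are made up.

```python
from collections import Counter


def param_diff(old, new):
    """Return {key: (old_value, new_value)} for every setting that differs."""
    keys = set(old) | set(new)
    return {
        k: (old.get(k), new.get(k))
        for k in sorted(keys)
        if old.get(k) != new.get(k)
    }


def same_result(rows_a, rows_b):
    """Compare result sets as multisets, since row order is not guaranteed."""
    return Counter(map(tuple, rows_a)) == Counter(map(tuple, rows_b))


def is_slower(old_seconds, new_seconds, tolerance=0.2):
    """Flag a run that is more than `tolerance` (20% by default) slower."""
    return new_seconds > old_seconds * (1 + tolerance)


v1_conf = {"hive.execution.engine": "mr", "hive.cbo.enable": "false"}
v2_conf = {"hive.execution.engine": "mr", "hive.cbo.enable": "true"}
changed = param_diff(v1_conf, v2_conf)
match = same_result([(1, "a"), (2, "b")], [(2, "b"), (1, "a")])
regressed = is_slower(old_seconds=100.0, new_seconds=130.0)
```

Comparing results as multisets keeps duplicate rows significant while ignoring ordering differences between the two versions.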
43. Verification between persistent result sets
[Diagram labels: App → request (REST) → elephant-server; PQ → pull → submit to clusters v1 and v2. For each cluster: 1. upload params and configurations (S3), 2. upload query result (Plazma), 3. send metrics.]
44. Resource management
• Define one resource per account
• Workload type of an account varies
• Batch, ad hoc, BI tool…
• Requires high-level resource management across clusters
• An account can have multiple resource pools
• For service and internal purpose
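One way to picture an account with several resource pools is the sketch below. The class names, the `max_slots` unit, and the fallback-to-batch rule are all hypothetical; the slides do not describe the real data model.

```python
from dataclasses import dataclass, field


@dataclass
class ResourcePool:
    name: str
    workload: str   # e.g. "batch", "adhoc", "bi"
    max_slots: int  # hypothetical capacity unit


@dataclass
class Account:
    account_id: int
    pools: dict = field(default_factory=dict)

    def add_pool(self, pool):
        self.pools[pool.workload] = pool

    def pool_for(self, workload, default="batch"):
        # Fall back to the default pool when no pool matches the workload.
        return self.pools.get(workload, self.pools[default])


account = Account(account_id=1)
account.add_pool(ResourcePool("acct1-batch", "batch", max_slots=100))
account.add_pool(ResourcePool("acct1-adhoc", "adhoc", max_slots=20))
chosen = account.pool_for("adhoc").name
fallback = account.pool_for("bi").name
```

Keeping the pool definition at the account level, rather than per cluster, is what lets the same limits apply when a query is routed to a different cluster.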