These slides cover building a highly available, high-throughput system for receiving and storing different kinds of information, with horizontal scaling, using HBase, Flume and Grizzly hosted on low-cost Amazon EC2 instances. The talk describes the HBase HA cluster setup process, with useful hints and EC2 pitfalls, and the Flume setup process, comparing the standalone and embedded versions of Flume and showing the differences and use cases of each. Much attention is paid to Flume-to-HBase streaming, with tweaks and different approaches for speeding it up.
13. 13
AWS: VPC
● IP control
● Elastic IP permanency
● Internal DNS via Route 53
● Security
14. 14
AWS: Storage Type Options
Instance-based
Pro:
● Direct hardware attach
● No network costs
● Fast
Contra:
● Fixed volume size, based on instance type
● Data erased when the instance is stopped or terminated
EBS
Pro:
● Flexible volume configuration
● Magnetic/SSD choice
● Persistent
● Instance independent
● Pay only for occupied volume
Contra:
● Network attached
● Pay for I/O requests
S3
Pro:
● Fast via Amazon API
● Cheap
● Cloud storage
Contra:
● Very slow when attached to an instance
15. 15
AWS: EBS
Pro:
● Flexible volume configuration
● Magnetic/SSD choice
● Persistent
● Instance independent
● Pay only for occupied volume
Contra:
● Network attached
● Pay for I/O requests
19. 19
Major Compaction
2015-03-28 11:11:15,894 INFO org.apache.hadoop.hbase.regionserver.Store: Completed major compaction of 3 file(s) in data of users,4012ea33-46b9-42cd-9094-f2a939991fbf,1421554452486.c9e5790f5b45f4306fe860a5ec5480d7. into 136b805bebbe44319f032b7d6d89a5dc, size=331.7 M; total size for store is 331.7 M
2015-03-28 11:11:15,894 INFO org.apache.hadoop.hbase.regionserver.compactions.CompactionRequest: completed compaction: regionName=users,4012ea33-46b9-42cd-9094-f2a939991fbf,1421554452486.c9e5790f5b45f4306fe860a5ec5480d7., storeName=data, fileCount=3, fileSize=334.2 M, priority=4, time=181415339541360; duration=17sec
20. 20
Major Compaction: jitter
● Default: 0.3-0.5 (depends on version)
● 14 HRegions on 3 servers → ~2-3 simultaneous major compactions per day
<property>
  <name>hbase.hregion.majorcompaction.jitter</name>
  <value>0.8</value>
</property>
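To see why a larger jitter spreads major compactions apart, here is a minimal Java sketch of how the jitter widens the window in which each region schedules its next major compaction. It approximates HBase's scheduling as I understand it and is not a copy of the HBase source: the class and method names are hypothetical, and the 24-hour base period is an assumption taken from the "per day" figure on the slide.

```java
import java.util.Random;

class MajorCompactionJitter {
    /**
     * Delay until the next major compaction: the base period shifted by a
     * random offset of at most +/- (period * jitter).
     * Hypothetical helper, not an HBase API.
     */
    static long nextMajorCompactionDelayMs(long periodMs, double jitterPct, Random rnd) {
        long spread = Math.round(periodMs * jitterPct);
        // Uniform offset in roughly [-spread, +spread]
        return periodMs + spread - Math.round(2.0 * spread * rnd.nextDouble());
    }

    public static void main(String[] args) {
        long day = 24L * 60 * 60 * 1000; // assumed base period: 24 hours
        Random rnd = new Random();
        // 14 regions, as on the slide; each one draws its own delay
        for (int region = 1; region <= 14; region++) {
            long delay = nextMajorCompactionDelayMs(day, 0.8, rnd);
            System.out.printf("region %2d: next major compaction in %.1f h%n",
                    region, delay / 3_600_000.0);
        }
    }
}
```

Under these assumptions a jitter of 0.8 spreads the delays over roughly 4.8 to 43.2 hours, so fewer regions are likely to compact at the same moment than with a narrower window.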
26. 26
Flume: Standalone vs Embedded
Flume Standalone
Pro:
● Little or no code
● Variety of data sources
● Scalability
● Data transformation
Contra:
● Network overhead
Flume Embedded
Pro:
● No network overhead
● Less data transformation
● Full process control
Contra:
● A lot of code
● Complexity
● Own data source handling
29. 29
Conclusions
● AWS: VPC + EBS is a great deal for an HDFS/HBase cluster
● Don't use a large NewSize for HBase
● Use Snappy
● Use async batch puts where possible
● Use SSD for HBase data
● Remember about HBase compaction
● Embedded Flume could be a good solution
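The "batch puts" advice can be illustrated with a small, library-free Java sketch of client-side batching: records are buffered and handed to the store in groups instead of one call per record. Everything here is hypothetical. `BatchingWriter` and its `sink` callback are stand-ins, not HBase or Flume APIs, and the sketch covers only the batching half. An asynchronous variant would hand each batch to a background executor or an async client instead of calling the sink inline.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class BatchingWriter<T> {
    private final int batchSize;
    // Hypothetical stand-in for whatever performs the real batched put
    private final Consumer<List<T>> sink;
    private final List<T> buffer = new ArrayList<>();

    BatchingWriter(int batchSize, Consumer<List<T>> sink) {
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Buffers one item and flushes once the batch is full. */
    synchronized void write(T item) {
        buffer.add(item);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Sends whatever is buffered as a single batch; no-op when empty. */
    synchronized void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        List<T> batch = new ArrayList<>(buffer);
        buffer.clear();
        sink.accept(batch);
    }

    public static void main(String[] args) {
        BatchingWriter<String> writer =
                new BatchingWriter<>(3, batch -> System.out.println("put batch: " + batch));
        for (int i = 1; i <= 7; i++) {
            writer.write("row" + i);
        }
        writer.flush(); // push the final partial batch
    }
}
```

The trade-off is latency against throughput: a larger batch size means fewer round trips, but records wait longer in memory and are lost if the process dies before a flush.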