Accumulo on EC2

Slides from a presentation

    1. 1. Accumulo in the Cloud: Amazon’s EC2
    2. 2. Apache Accumulo
    3. 3. BigTable
    4. 4. Java
    5. 5. Apache Hadoop
    6. 6. Advanced Features
    7. 7. Cell-level security
    8. 8. Row  Column                 Visibility        Time Stamp  Value
          Joe  photos:vacation        family | friends  -           photo425.jpg
          Joe  photos:expo            work              -           photo648.jpg
          Joe  photos:bachelor_party  friends           -           photo772.jpg
    9. 9. Row  Column                 Visibility        Time Stamp  Value
          Joe  photos:vacation        family | friends  -           photo425.jpg
          Joe  photos:expo            work              -           photo648.jpg
          Joe  photos:bachelor_party  friends           -           photo772.jpg
          FRIENDS
    10. 10. Row  Column                 Visibility        Time Stamp  Value
            Joe  photos:vacation        family | friends  -           photo425.jpg
            Joe  photos:expo            work              -           photo648.jpg
            Joe  photos:bachelor_party  friends           -           photo772.jpg
            FAMILY
    11. 11. Row  Column                 Visibility        Time Stamp  Value
            Joe  photos:vacation        family | friends  -           photo425.jpg
            Joe  photos:expo            work              -           photo648.jpg
            Joe  photos:bachelor_party  friends           -           photo772.jpg
            WORK
    12. 12. Combiners:
    13. 13. Combiners: Server-side Computation
    14. 14. Row     Column         Visibility  Time Stamp  Value
            storeA  sales:shoes    acct        20120305    100
            storeA  sales:shoes    acct        20120303    550
            storeA  sales:shoes    acct        20120302    300
            storeA  sales:cameras  acct        20120305    100
            storeA  sales:cameras  acct        20120303    200
            COMBINER: SUM()
    15. 15. Row     Column         Visibility  Time Stamp  Value
            storeA  sales:shoes    acct        -           950
            storeA  sales:cameras  acct        -           300
            COMBINER: SUM()
    16. 16. Row     Column         Visibility  Time Stamp  Value
            storeA  sales:shoes    acct        -           950
            storeA  sales:cameras  acct        -           300
            storeA  sales:shoes    acct        20120306    150
            COMBINER: SUM()
    17. 17. Row     Column         Visibility  Time Stamp  Value
            storeA  sales:shoes    acct        -           1100
            storeA  sales:cameras  acct        -           300
            COMBINER: SUM()
    18. 18. Combiners execute
    19. 19. Query time
    20. 20. Asynchronously in background
    21. 21. Insert / updates are fast
    22. 22. 10s of thousands per second per server
    23. 23. WRITE PERFORMANCE (chart: Inserts and Updates, Accumulo vs. Leading brand; axis 0 to 30000)
    24. 24. Amazon Elastic Compute Cloud
    25. 25. Really fast provisioning
    26. 26. On-demand hardware
    27. 27. Scale up/down in minutes
    28. 28. Independence: DIFFERENT MACHINES, DATA CENTERS, GEOGRAPHIC LOCATIONS
    29. 29. Spot Instances
    30. 30. Set max price for instances
    31. 31. They’re yours until you’re outbid
    32. 32. SPOT INSTANCE PRICING HISTORY
    33. 33. Are Big Data and Cloud a match made in heaven?
    34. 34. MATCH. Big Data: Always need more hardware, not always easy to predict. Cloud: Add more machines as needed.
    35. 35. MATCH. Big Data: Need so much hardware, have to use lots of commodity machines and fault-tolerant software. Cloud: Sells exclusively commodity virtual servers that blow up occasionally.
    36. 36. MATCH. Big Data: Having lots of independent I/O (IOPS and bandwidth) is important, both for MapReduce and servicing requests. Cloud: Offers disks local to the instance.
    37. 37. MISMATCH. Big Data: Having lots of independent I/O (IOPS and bandwidth) is important, both for MapReduce and servicing requests. Cloud: Offers shared storage over ethernet (SAN).
    38. 38. MISMATCH. Big Data: Large volumes of data have ‘mass’, get harder to move around. Cloud: Create an entire cluster from nothing, bring all the data in, process, write it all out.
    39. 39. MISMATCH, BUT OK. Big Data: Software benefits from lots of independent hardware, and using all the hardware. Cloud: Heavy reliance on virtualization.
    40. 40. Scalable administration and elasticity
    41. 41. When a machine fails
    42. 42. Is admin intervention required?
    43. 43. Will clients see exceptions?
    44. 44. When adding a machine
    45. 45. Must data move before it can service requests?
    46. 46. Must processes be restarted?
    47. 47. Accumulo: FAILOVER AUTOMATIC; RE-REPLICATION AUTOMATIC; DATA REPLICATED ASYNCHRONOUSLY; REQUEST-LOAD BALANCED INDEPENDENTLY; NEW MACHINES DISCOVERED AUTOMATICALLY
    48. 48. Some Other NoSQL dbs ... NOT SO MUCH
    49. 49. Your DB needs to scale in the human dimension too.
    50. 50. Accumulo in EC2
    51. 51. Instance Types
    52. 52. 64-bit machines
    53. 53. no 32-bit t1.micros, m1.small, or m1.medium
    54. 54. about 2-4GB RAM / core
    55. 55. about 2 disks / core
    56. 56. m1.larges
    57. 57. 7.5GB RAM for 4 cores* (*EC2 COMPUTE UNITS)
    58. 58. 2 x 420GB disks
    59. 59. m1.xlarge
    60. 60. 15GB RAM for 8 cores
    61. 61. 4 x 420GB disks
    62. 62. m2.xlarge
    63. 63. 17.1GB RAM for 6.5 cores
    64. 64. only 1 x 420GB disk
    65. 65. Bigger instances have more RAM, CPU ...
    66. 66. Not more disks
    67. 67. EBS is an option ...
    68. 68. RAID is not necessary
    69. 69. Lose some independence
    70. 70. Exceptions
    71. 71. HDFS NameNode
    72. 72. Might get a machine with more memory
    73. 73. 68GB is the current max
    74. 74. Use RAID across multiple NameNode disks
    75. 75. Where to place instances?
    76. 76. Region = geographic area
    77. 77. Availability Zone = Data Center
    78. 78. Can span multiple AZs
    79. 79. Tested on four AZs in East US region
    80. 80. Spanning regions?
    81. 81. Cross-site/WAN replication
    82. 82. AMIs
    83. 83. Ubuntu Maverick (10.10) x86_64
    84. 84. CentOS
    85. 85. Cloudera Hadoop
    86. 86. Cloudera’s Whirr HTTPS://CCP.CLOUDERA.COM/DISPLAY/CDHDOC/WHIRR+INSTALLATION
    87. 87. OS Config
    88. 88. No swapping
    89. 89. Swappiness = 0
    90. 90. Up open file limits to 64k
    91. 91. Software
    92. 92. Hadoop
    93. 93. Only need HDFS
    94. 94. MapReduce is fine though
    95. 95. Version 0.20
    96. 96. Cloudera CDH3u2
    97. 97. or MapR!
    98. 98. ZooKeeper
    99. 99. Version 3.3.1 or greater
    100. 100. Java
    101. 101. Sun Java JDK 1.6
    102. 102. OpenJDK 1.6
    103. 103. Accumulo 1.3.5 or 1.4.0
    104. 104. wget HTTP://INCUBATOR.APACHE.ORG/ACCUMULO/DOWNLOADS
    105. 105. Configuration
    106. 106. Use internal IP addresses for important machines
    107. 107. HDFS NameNode
    108. 108. Accumulo Master
    109. 109. MapReduce JobTracker
    110. 110. EC2 Security Groups
    111. 111. a.k.a. the Firewall
    112. 112. Ports
              2181, 2888, 3888  ZOOKEEPER
              4560              ACCUMULO MONITOR
              9000              HDFS
              9001              JOBTRACKER
              9997              TABLET SERVER
              9999              MASTER SERVER
              11224             ACCUMULO LOGGER
              12234             ACCUMULO TRACER
              50010             DATANODE DATA
              50020             DATANODE METADATA
              50060             TASKTRACKERS
              50070             NAMENODE HTTP MONITOR
              50075             DATANODE HTTP MONITOR
              50091             ACCUMULO
              50095             ACCUMULO HTTP MONITOR
    113. 113. Create SSH key on Accumulo Master
    114. 114. Distribute to tablet servers
    115. 115. Follow ordinary Hadoop and Accumulo config steps
    116. 116. Scaling Up
    117. 117. Provision new instances
    118. 118. Only need to be identically configured
    119. 119. Could store config files into an AMI
    120. 120. Start HDFS data node
    121. 121. Start Accumulo loggers and Tablet Server
    122. 122. Tablet servers assigned tablets immediately
    123. 123. HDFS blocks can be rebalanced
    124. 124. Some new blocks will go to new machines
    125. 125. Scaling Up (chart labels: 20, 40, 100, 200, 400)
    126. 126. Scaling Up: ABOUT 85% INCREASE IN WRITE RATE EACH TIME CLUSTER SIZE DOUBLED (chart labels: 20, 40, 100, 200, 400)
    127. 127. Scaling Up: HIT 1 MILLION WRITES PER SECOND AT 100 M1.LARGES WITH 50 CLIENTS (chart labels: 20, 40, 100, 200, 400)
    128. 128. m1.large NameNode served 400 node cluster
    129. 129. Running ‘at scale’
    130. 130. Some machines will be lost
    131. 131. Everything keeps working
    132. 132. Accumulo auto-recovers using write-ahead logs
    133. 133. Occasionally provision replacement machines
    134. 134. On your schedule
    135. 135. Clients see no exceptions / errors
    136. 136. Watch writes and reads scale on monitor page HTTP://ACCUMULO-MONITOR:50095
    137. 137. See a list of failed machines too
    138. 138. How many machines?
    139. 139. May need up to 2x VMs vs. bare metal
    140. 140. Scaling Down
    141. 141. Identify a set of machines to remove
    142. 142. Stop tablet servers, loggers
    143. 143. accumulo admin stop <node> (IN VERSION 1.4)
    144. 144. Decommission HDFS datanodes HTTP://DEVELOPER.YAHOO.COM/HADOOP/TUTORIAL/MODULE2.HTML
    145. 145. NameNode will re-replicate blocks
    146. 146. Remaining machines have enough storage?
    147. 147. Can lower replication factor if necessary
    148. 148. When decommissioning is done, terminate instances
    149. 149. Scale Up again!
    150. 150. Repeat.
    151. 151. Details: HTTP://WWW.ACCUMULODATA.COM/EC2.HTML
    152. 152. Questions
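
The cell-level security slides (7 to 11) can be illustrated with a small sketch. This is not the Accumulo API: it is a simplified Python model that handles only OR (`|`) expressions, whereas real visibility expressions also allow `&` and parentheses. The function names `is_visible` and `scan` are hypothetical, and the three cells are one reading of the flattened table in the transcript (visibilities `family | friends`, `work`, `friends`).

```python
def is_visible(expression, authorizations):
    """Return True if any label in an OR-only expression is authorized.

    Simplifying assumption: the expression contains only labels joined by '|'.
    """
    labels = {label.strip() for label in expression.split("|")}
    return not labels.isdisjoint(authorizations)


def scan(cells, authorizations):
    """Return the values of the cells the given authorizations may read."""
    return [
        value
        for _column, expression, value in cells
        if is_visible(expression, authorizations)
    ]


# (column, visibility expression, value) for row "Joe", as read from the slides.
cells = [
    ("photos:vacation", "family | friends", "photo425.jpg"),
    ("photos:expo", "work", "photo648.jpg"),
    ("photos:bachelor_party", "friends", "photo772.jpg"),
]

for auths in ({"friends"}, {"family"}, {"work"}):
    print(sorted(auths), scan(cells, auths))
```

Under this reading, a scan with the `friends` authorization returns the vacation and bachelor-party photos, `family` returns only the vacation photo, and `work` returns only the expo photo.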
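
The effect of the SUM() combiner on slides 14 to 17 can be sketched in a few lines. Accumulo combiners are server-side Java iterators; the Python below only models the result they produce, and the name `sum_combine` is hypothetical. The time stamp of the entry added on slide 16 is assumed to be 20120306.

```python
from collections import defaultdict


def sum_combine(entries):
    """Sum the values of all entries that share (row, column, visibility)."""
    totals = defaultdict(int)
    for row, column, visibility, _timestamp, value in entries:
        totals[(row, column, visibility)] += value
    return dict(totals)


# Entries from slide 14: (row, column, visibility, time stamp, value).
slide_14 = [
    ("storeA", "sales:shoes", "acct", 20120305, 100),
    ("storeA", "sales:shoes", "acct", 20120303, 550),
    ("storeA", "sales:shoes", "acct", 20120302, 300),
    ("storeA", "sales:cameras", "acct", 20120305, 100),
    ("storeA", "sales:cameras", "acct", 20120303, 200),
]
combined = sum_combine(slide_14)

# Slide 16 inserts one more shoes sale (time stamp assumed); combining again folds it in.
slide_16 = slide_14 + [("storeA", "sales:shoes", "acct", 20120306, 150)]
recombined = sum_combine(slide_16)

print(combined)
print(recombined)
```

The totals match the slides: shoes 950 and cameras 300 before the new insert, shoes 1100 afterwards.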
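
The scaling claim on slides 126 and 127 (about 85% more write rate per doubling, 1 million writes per second at 100 m1.larges) implies a simple growth formula. The sketch below is arithmetic only: the function name `projected_write_rate` is hypothetical, and the projected figures are extrapolations from the two stated numbers, not measurements from the presentation.

```python
import math


def projected_write_rate(base_rate, base_nodes, nodes, gain_per_doubling=1.85):
    """Extrapolate write rate, assuming a fixed gain each time the cluster doubles."""
    doublings = math.log2(nodes / base_nodes)
    return base_rate * gain_per_doubling ** doublings


base_rate = 1_000_000  # writes per second at 100 nodes, from slide 127
for nodes in (100, 200, 400):
    rate = projected_write_rate(base_rate, 100, nodes)
    print(f"{nodes} nodes: about {rate:,.0f} writes/s (extrapolated)")

# An 85% gain per doubling means each node keeps 1.85 / 2 of its throughput.
per_node_efficiency = 1.85 / 2
print(f"per-node efficiency per doubling: {per_node_efficiency:.1%}")
```

Read this way, doubling the cluster keeps roughly 92.5% of per-node throughput, which is why the slide treats scaling as close to linear.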
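
The port list on slide 112 is easier to turn into EC2 security group rules when it is kept as data. The sketch below makes no AWS calls: it only expands the slide's ports into (port, protocol, source, service) tuples that could then be fed to whatever tooling is in use. The names `ACCUMULO_EC2_PORTS` and `ingress_rules`, and the example source range `10.0.0.0/8`, are assumptions for illustration.

```python
# Ports as listed on slide 112, grouped by the service label on the slide.
ACCUMULO_EC2_PORTS = {
    "zookeeper": [2181, 2888, 3888],
    "accumulo monitor": [4560],
    "hdfs": [9000],
    "jobtracker": [9001],
    "tablet server": [9997],
    "master server": [9999],
    "accumulo logger": [11224],
    "accumulo tracer": [12234],
    "datanode data": [50010],
    "datanode metadata": [50020],
    "tasktrackers": [50060],
    "namenode http monitor": [50070],
    "datanode http monitor": [50075],
    "accumulo": [50091],
    "accumulo http monitor": [50095],
}


def ingress_rules(ports_by_service, source_cidr):
    """Expand the port table into one TCP ingress rule per port."""
    return [
        (port, "tcp", source_cidr, service)
        for service, ports in ports_by_service.items()
        for port in ports
    ]


# Hypothetical internal address range; substitute the cluster's own.
rules = ingress_rules(ACCUMULO_EC2_PORTS, "10.0.0.0/8")
for port, protocol, cidr, service in rules:
    print(f"{protocol} {port:<6} from {cidr}  # {service}")
```

Restricting the source range to the cluster's internal addresses fits the slide's advice to use internal IP addresses for the important machines.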
