Transcript of "ETL With Cassandra Streaming Bulk Loading"

1. Cassandra ETL: Streaming Bulk Loading
   Alex Araujo (alexaraujo@gmail.com)
2. Background
   • Sharded MySQL ETL platform on EC2 (EBS)
   • Database size: up to 1 TB
   • Write latencies exponentially proportional to data size
3. Background
   • Cassandra Thrift loading on EC2 (ephemeral RAID0)
   • Database size: ∑ available node space
   • Write latencies ~linearly proportional to number of nodes
   • 12 XL node cluster (RF=3): 6-125x improvement over EBS-backed MySQL systems
4. Thrift ETL
   • Thrift overhead: converting to/from internal structures
   • Routing from coordinator nodes
   • Writing to commitlog
   • Internal structures -> on-disk format
   Source: http://wiki.apache.org/cassandra/BinaryMemtable
5. Bulk Load
   • Core functionality
   • Existing ETL nodes for bulk loading
   • Move data file & index generation off C* nodes
6. BMT Bulk Load
   • Requires StorageProxy API (Java)
   • Rows not live until flush
   • Wiki example uses Hadoop
   Source: http://wiki.apache.org/cassandra/BinaryMemtable
7. Streaming Bulk Load
   • Cassandra as fat client
   • BYO SSTables
   • sstableloader [options] /path/to/keyspace_dir
     • Can ignore a list of nodes (-i)
     • keyspace_dir should be the name of the keyspace and contain the generated SSTable Data & Index files
     (see the invocation sketch below)
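   As a concrete illustration of the sstableloader bullets above, here is a minimal sketch of invoking the loader against a directory named after the keyspace. The "/path/to/Prefs" path and the "Prefs" keyspace name are taken from the importer slides later in the deck; driving the loader from Java via ProcessBuilder is an assumption for illustration, not the deck's exact code.

      import java.io.File;

      public class StreamSSTables {
          public static void main(String[] args) throws Exception {
              // Directory must be named after the keyspace and hold the
              // generated SSTable Data & Index files (assumed path).
              File keyspaceDir = new File("/path/to/Prefs");

              // -v and --debug are the flags used on the close() slide;
              // -i would take a list of nodes to ignore.
              ProcessBuilder pb = new ProcessBuilder(
                      "sstableloader", "-v", "--debug", keyspaceDir.getAbsolutePath());
              pb.inheritIO();
              int exitCode = pb.start().waitFor();
              if (exitCode != 0) {
                  System.err.println("sstableloader exited with " + exitCode);
              }
          }
      }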
8. Data model:
   Users:             UserId | email | name | ...
                      <Hash> | ...
   UserGroups:        GroupId | UserId | ...
                      <UUID>  | {"date_joined":"<date>","date_left":"<date>","active":<true|false>}
   UserGroupTimeline: GroupId | <TimeUUID> | ...
                      <UUID>  | UserId
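   The ETL JAR slides below look up column comparators from Model.COLUMN_FAMILY_COMPARATORS. A minimal sketch of what such a map might contain, assuming UTF8-named columns for Users and UserGroups and TimeUUID-named columns for UserGroupTimeline (the concrete comparator types are an assumption; the deck does not show them):

      import java.util.HashMap;
      import java.util.Map;
      import org.apache.cassandra.db.marshal.AbstractType;
      import org.apache.cassandra.db.marshal.TimeUUIDType;
      import org.apache.cassandra.db.marshal.UTF8Type;

      public class Model {
          // Column-family name -> column name comparator (assumed types)
          public static final Map<String, AbstractType> COLUMN_FAMILY_COMPARATORS =
                  new HashMap<String, AbstractType>();
          static {
              COLUMN_FAMILY_COMPARATORS.put("Users", UTF8Type.instance);                 // email, name, ...
              COLUMN_FAMILY_COMPARATORS.put("UserGroups", UTF8Type.instance);            // UserId column names
              COLUMN_FAMILY_COMPARATORS.put("UserGroupTimeline", TimeUUIDType.instance); // TimeUUID column names
          }
      }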
9. Setup
   • Opscode Chef 0.10.2 on EC2
   • Cassandra 0.8.2-dev-SNAPSHOT (trunk)
   • Custom Java ETL JAR
   • The Grinder 3.4 (Jython) test harness
10. Chef 0.10.2
    • knife-ec2 bootstrap with --ephemeral
    • ec2::ephemeral_raid0 recipe
      • Installs mdadm, unmounts default /mnt, creates RAID0 array on /mnt/md0
11. Chef 0.10.2
    • cassandra::default recipe
      • Downloads/extracts apache-cassandra-<version>-bin.tar.gz
      • Links /var/lib/cassandra to /raid0/cassandra
      • Creates cassandra user & directories, increases file limits, sets up cassandra service, generates config files
12. Chef 0.10.2
    • cassandra::cluster_node recipe
      • Determines # nodes in cluster
      • Calculates initial_token (see the sketch below); generates cassandra.yaml
      • Creates keyspace and column families
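    The initial_token calculation itself is straightforward: node i of N gets token i * 2^127 / N, splitting the ring into equal ranges. A minimal sketch, in Java rather than the recipe's Ruby, assuming the default RandomPartitioner and its 0..2^127 token space:

       import java.math.BigInteger;

       public class TokenCalc {
           public static void main(String[] args) {
               int nodeCount = 12; // e.g. the 12-node cluster from the Background slide
               BigInteger ringSize = BigInteger.valueOf(2).pow(127);
               for (int i = 0; i < nodeCount; i++) {
                   // Evenly spaced tokens around the ring
                   BigInteger token = ringSize.multiply(BigInteger.valueOf(i))
                                              .divide(BigInteger.valueOf(nodeCount));
                   System.out.println("node " + i + " initial_token: " + token);
               }
           }
       }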
13. Chef 0.10.2
    • cassandra::bulk_load_node recipe
      • Generates same cassandra.yaml with empty initial_token
      • Installs/configures Grinder scripts; Java ETL JAR
14. ETL JAR
15. ETL JAR

    for (File file : files) {
      CBLI importer = new CBLI(...);  // CassandraBulkLoadImporter
      importer.open();
      // Processing omitted
      importer.close();
    }
16. ETL JAR

    CassandraBulkLoadImporter.initSSTableWriters():

    File tempFiles = new File("/path/to/Prefs");
    tempFiles.mkdirs();
    for (String cfName : COLUMN_FAMILY_NAMES) {
      SSTableSimpleUnsortedWriter writer = new SSTableSimpleUnsortedWriter(
          tempFiles,
          Model.Prefs.Keyspace.name,
          cfName,
          Model.COLUMN_FAMILY_COMPARATORS.get(cfName),
          null,            // No Super CFs
          bufferSizeInMB);
      tableWriters.put(cfName, writer);
    }
17. ETL JAR

    CassandraBulkLoadImporter.processSuppressionRecips():

    for (User user : users) {
      String key = user.getUserId();
      SSTableSimpleUnsortedWriter writer = tableWriters.get(Model.Users.CF.name);
      // rowKey() converts String to ByteBuffer (sketched below)
      writer.newRow(rowKey(key));
      org.apache.cassandra.thrift.Column column = newUserColumn(user);
      writer.addColumn(column.name, column.value, column.timestamp);
      ...  // Repeat for each column family
    }
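    The rowKey() helper is not shown in the deck; a minimal sketch, assuming it simply UTF-8 encodes the key with Cassandra's ByteBufferUtil:

       import java.nio.ByteBuffer;
       import org.apache.cassandra.utils.ByteBufferUtil;

       // Assumed implementation: UTF-8 encode the String key into a ByteBuffer.
       static ByteBuffer rowKey(String key) {
           return ByteBufferUtil.bytes(key);
       }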
18. ETL JAR

    CassandraBulkLoadImporter.close():

    for (String cfName : COLUMN_FAMILY_NAMES) {
      try {
        tableWriters.get(cfName).close();
      } catch (IOException e) {
        log.error("close failed for " + cfName);
        throw new RuntimeException(cfName + " did not close");
      }
      String streamCmd = "sstableloader -v --debug " + tempFiles.getAbsolutePath();
      Process stream = Runtime.getRuntime().exec(streamCmd);
      if (stream.waitFor() != 0) log.error("stream failed");
    }
19. cassandra_bulk_load.py

    import random
    import sys
    import uuid
    from java.io import File
    from net.grinder.script.Grinder import grinder
    from net.grinder.script import Statistics
    from net.grinder.script import Test
    from com.mycompany import App
    from com.mycompany.tool import SingleColumnBulkImport
20. cassandra_bulk_load.py

    input_files = []  # files to load
    site_id = str(uuid.uuid4())
    import_id = random.randint(1, 1000000)
    list_ids = []  # lists users will be loaded to

    try:
        App.INSTANCE.start()
        dao = App.INSTANCE.getUserDAO()
        bulk_import = SingleColumnBulkImport(
            dao.prefsKeyspace, input_files, site_id, list_ids, import_id)
    except:
        exception = sys.exc_info()[1]
        print exception.message
        print exception.stackTrace
21. cassandra_bulk_load.py

    # Import stats
    grinder.statistics.registerDataLogExpression(
        "Users Imported", "userLong0")
    grinder.statistics.registerSummaryExpression(
        "Total Users Imported", "(+ userLong0)")
    grinder.statistics.registerDataLogExpression(
        "Import Time", "userLong1")
    grinder.statistics.registerSummaryExpression(
        "Import Total Time (sec)", "(/ (+ userLong1) 1000)")
    rate_expression = ("(* (/ userLong0 (/ (/ userLong1 1000) " +
                       str(num_threads) + ")) " + str(replication_factor) + ")")
    grinder.statistics.registerSummaryExpression(
        "Cluster Insert Rate (users/sec)", rate_expression)
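    For reference, the Cluster Insert Rate expression above appears to work out to (total users imported) / ((total import time in ms / 1000) / num_threads) * replication_factor: total time is divided by the thread count to approximate wall-clock seconds, and the result is scaled by the replication factor to count the row copies written across the cluster per second.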
22. cassandra_bulk_load.py

    # Import and record stats
    def import_and_record():
        bulk_import.importFiles()
        grinder.statistics.forCurrentTest.setLong(
            "userLong0", bulk_import.totalLines)
        grinder.statistics.forCurrentTest.setLong(
            "userLong1", grinder.statistics.forCurrentTest.time)

    # Create an Import Test with a test number and a description
    import_test = Test(1, "Recip Bulk Import").wrap(import_and_record)

    # A TestRunner instance is created for each thread
    class TestRunner:
        # This method is called for every run
        def __call__(self):
            import_test()
23. Stress Results
    • Once the Data and Index files are generated, streaming bulk load is FAST
      • Average: ~2.5x increase over Thrift
      • ~15-300x increase over MySQL
    • Impact on the cluster is minimal
    • Observed downside: writing our own SSTables is slower than letting Cassandra build them
24. Q's?