ETL With Cassandra Streaming Bulk Loading



  1. Cassandra ETL: Streaming Bulk Loading (Alex Araujo)
  2. Background
     • Sharded MySQL ETL platform on EC2 (EBS)
     • Database size: up to 1 TB
     • Write latencies exponentially proportional to data size
  3. Background
     • Cassandra Thrift loading on EC2 (ephemeral RAID0)
     • Database size: ∑ available node space
     • Write latencies ~linearly proportional to number of nodes
     • 12 XL node cluster (RF=3): 6-125x improvement over EBS-backed MySQL systems
  4. Thrift ETL
     • Thrift overhead: converting to/from internal structures
     • Routing from coordinator nodes
     • Writing to the commitlog
     • Converting internal structures to the on-disk format
  5. Bulk Load
     • Core functionality
     • Reuse existing ETL nodes for bulk loading
     • Move data file & index generation off the C* nodes
  6. BMT Bulk Load
     • Requires the StorageProxy API (Java)
     • Rows are not live until flush
     • Wiki example uses Hadoop
  7. Streaming Bulk Load
     • Cassandra as a fat client
     • BYO (bring your own) SSTables
     • sstableloader [options] /path/to/keyspace_dir
       • Can ignore a list of nodes (-i)
       • keyspace_dir should be named after the keyspace and contain the generated SSTable Data & Index files
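To make the keyspace_dir convention concrete, here is a minimal Python sketch that lays out a directory the way sstableloader expects. The keyspace name "Prefs", the column family names, and the 0.8-era `<ColumnFamily>-g-<generation>-<Component>.db` file naming are assumptions for illustration, not taken from the talk:

```python
import os

# Assumption: keyspace is named "Prefs"; SSTable file names follow the
# 0.8-era <ColumnFamily>-g-<generation>-<Component>.db pattern (illustrative).
keyspace_dir = "Prefs"  # the directory name must match the keyspace name
os.makedirs(keyspace_dir, exist_ok=True)
for cf in ("Users", "UserGroups"):
    for component in ("Data", "Index"):
        name = "%s-g-1-%s.db" % (cf, component)
        open(os.path.join(keyspace_dir, name), "w").close()

print(sorted(os.listdir(keyspace_dir)))
# The cluster would then be fed with:
#   sstableloader -v --debug Prefs/
```

In a real run the Data and Index files are produced by the SSTable writers on the ETL node, not touched into existence; only the layout matters here.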
  8. Data Model
     • Users: row key <Hash>; columns: UserId, email, name, ...
     • UserGroups: row key GroupId; column name: UserId <UUID>; column value: {"date_joined":"<date>","date_left":"<date>","active":<true|false>}
     • UserGroupTimeline: row key GroupId; column name: <TimeUUID>; column value: UserId <UUID>
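The UserGroups column value above is a small JSON document stored under the member's UserId. A hedged sketch of serializing it; the function name and field handling are illustrative, not from the talk:

```python
import json

def user_group_value(date_joined, date_left, active):
    """Build the JSON blob stored as the UserGroups column value."""
    return json.dumps({
        "date_joined": date_joined,
        "date_left": date_left,   # None while the user is still a member
        "active": active,
    })

print(user_group_value("2011-07-01", None, True))
```

The serialized string is what would be handed to the SSTable writer as the column value bytes.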
  9. Setup
     • Opscode Chef 0.10.2 on EC2
     • Cassandra 0.8.2-dev-SNAPSHOT (trunk)
     • Custom Java ETL JAR
     • The Grinder 3.4 (Jython) test harness
  10. Chef 0.10.2
      • knife-ec2 bootstrap with --ephemeral
      • ec2::ephemeral_raid0 recipe
        • Installs mdadm, unmounts the default /mnt, creates a RAID0 array on /mnt/md0
  11. Chef 0.10.2
      • cassandra::default recipe
        • Downloads/extracts apache-cassandra-<version>-bin.tar.gz
        • Links /var/lib/cassandra to /raid0/cassandra
        • Creates the cassandra user & directories, increases file limits, sets up the cassandra service, generates config files
  12. Chef 0.10.2
      • cassandra::cluster_node recipe
        • Determines the number of nodes in the cluster
        • Calculates initial_token; generates cassandra.yaml
        • Creates the keyspace and column families
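For a cluster on the RandomPartitioner, evenly spaced initial tokens are simply i * 2**127 / N for node i of N nodes. A sketch of the calculation the recipe presumably performs (the function name is illustrative; the talk does not show the recipe's internals):

```python
def initial_tokens(num_nodes):
    """Evenly spaced RandomPartitioner tokens over the 0..2**127 ring."""
    return [i * (2 ** 127) // num_nodes for i in range(num_nodes)]

# A 12-node cluster: node 0 at token 0, node 1 at 2**127/12, and so on.
for i, token in enumerate(initial_tokens(12)):
    print(i, token)
```

Even spacing keeps the data (and the write load from streaming) balanced across the ring.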
  13. Chef 0.10.2
      • cassandra::bulk_load_node recipe
        • Generates the same cassandra.yaml with an empty initial_token
        • Installs/configures the Grinder scripts and the Java ETL JAR
  14. ETL JAR
  15. ETL JAR

      for (File file : files) {
          CassandraBulkLoadImporter importer = new CassandraBulkLoadImporter(...);
          // Processing omitted
          importer.close();
      }
  16. ETL JAR: CassandraBulkLoadImporter.initSSTableWriters()

      File tempFiles = new File("/path/to/Prefs");
      tempFiles.mkdirs();
      for (String cfName : COLUMN_FAMILY_NAMES) {
          SSTableSimpleUnsortedWriter writer = new SSTableSimpleUnsortedWriter(
              tempFiles,
              cfName,
              Model.COLUMN_FAMILY_COMPARATORS.get(cfName),
              null,            // no super column families
              bufferSizeInMB);
          writers.put(cfName, writer);
      }
  17. ETL JAR: CassandraBulkLoadImporter.processSuppressionRecips()

      for (User user : users) {
          String key = user.getUserId();
          SSTableSimpleUnsortedWriter writer = tableWriters.get(cfName);
          // rowKey() converts String to ByteBuffer
          writer.newRow(rowKey(key));
          o.a.c.t.Column column = newUserColumn(user);
          writer.addColumn(, column.value, column.timestamp);
          ... // Repeat for each column family
      }
  18. ETL JAR: CassandraBulkLoadImporter.close()

      for (String cfName : COLUMN_FAMILY_NAMES) {
          try {
              tableWriters.get(cfName).close();
          } catch (IOException e) {
              log.error("close failed for " + cfName);
              throw new RuntimeException(cfName + " did not close");
          }
          String streamCmd = "sstableloader -v --debug " + tempFiles.getAbsolutePath();
          Process stream = Runtime.getRuntime().exec(streamCmd);
          if (stream.waitFor() != 0)
              log.error("stream failed");
      }
  19. cassandra_bulk_load.py

      import random
      import sys
      import uuid

      from import File
      from net.grinder.script.Grinder import grinder
      from net.grinder.script import Statistics
      from net.grinder.script import Test
      from com.mycompany import App
      from com.mycompany.tool import SingleColumnBulkImport
  20. cassandra_bulk_load.py

      input_files = []  # files to load
      site_id = str(uuid.uuid4())
      import_id = random.randint(1, 1000000)
      list_ids = []  # lists users will be loaded to

      try:
          App.INSTANCE.start()
          dao = App.INSTANCE.getUserDAO()
          bulk_import = SingleColumnBulkImport(
              dao.prefsKeyspace, input_files, site_id, list_ids, import_id)
      except:
          exception = sys.exc_info()[1]
          print exception.message
          print exception.stackTrace
  21. Import stats

      grinder.statistics.registerDataLogExpression(
          "Users Imported", "userLong0")
      grinder.statistics.registerSummaryExpression(
          "Total Users Imported", "(+ userLong0)")
      grinder.statistics.registerDataLogExpression(
          "Import Time", "userLong1")
      grinder.statistics.registerSummaryExpression(
          "Import Total Time (sec)", "(/ (+ userLong1) 1000)")
      rate_expression = ("(* (/ userLong0 (/ (/ userLong1 1000) "
          + str(num_threads) + ")) " + str(replication_factor) + ")")
      grinder.statistics.registerSummaryExpression(
          "Cluster Insert Rate (users/sec)", rate_expression)
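The Grinder rate expression above is plain arithmetic: total users divided by approximate wall-clock seconds (total milliseconds over the thread count), scaled by the replication factor to count every replica write. A plain-Python sketch of the same calculation (the function name is illustrative):

```python
def cluster_insert_rate(users, total_ms, num_threads, replication_factor):
    """Cluster-wide inserts/sec, counting each replica write once."""
    # total_ms is summed across threads, so dividing by the thread count
    # approximates wall-clock time
    wall_seconds = (total_ms / 1000.0) / num_threads
    return (users / wall_seconds) * replication_factor

# e.g. 100,000 users in 50,000 thread-ms across 4 threads at RF=3
print(cluster_insert_rate(100000, 50000, 4, 3))  # → 24000.0
```

Multiplying by RF reflects that each logical row lands on RF nodes, which is the number the Thrift path would have had to write through coordinators.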
  22. Import and record stats

      def import_and_record():
          bulk_import.importFiles()
          grinder.statistics.forCurrentTest.setLong(
              "userLong0", bulk_import.totalLines)
          grinder.statistics.forCurrentTest.setLong(
              "userLong1", grinder.statistics.forCurrentTest.time)

      # Create an Import Test with a test number and a description
      import_test = Test(1, "Recip Bulk Import").wrap(import_and_record)

      # A TestRunner instance is created for each thread
      class TestRunner:
          # This method is called for every run
          def __call__(self):
              import_test()
  23. Stress Results
      • Once the Data and Index files are generated, streaming bulk load is FAST
        • Average: ~2.5x increase over Thrift
        • ~15-300x increase over MySQL
      • Impact on the cluster is minimal
      • Observed downside: writing your own SSTables is slower than having Cassandra write them
  24. Q's?