ETL With Cassandra Streaming Bulk Loading



1. Cassandra ETL: Streaming Bulk Loading (Alex Araujo)
2. Background
   • Sharded MySQL ETL platform on EC2 (EBS)
   • Database size: up to 1 TB
   • Write latencies grew roughly exponentially with data size
3. Background
   • Cassandra Thrift loading on EC2 (ephemeral RAID0)
   • Database size: ∑ available node space
   • Write latencies scale roughly linearly with the number of nodes
   • 12 XL node cluster (RF=3): 6-125x improvement over EBS-backed MySQL systems
4. Thrift ETL
   • Thrift overhead: converting to/from internal structures (a per-column Thrift insert is sketched below for contrast)
   • Routing from coordinator nodes
   • Writing to the commitlog
   • Converting internal structures to the on-disk format
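For contrast, here is a minimal sketch of the per-column Thrift write path these slides measure against, using the raw 0.8-era Thrift API. The host, port, keyspace name ("Prefs"), column family, and column values are illustrative assumptions, not taken from the deck:

    import org.apache.cassandra.thrift.*;
    import org.apache.cassandra.utils.ByteBufferUtil;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TFramedTransport;

    public class ThriftInsertSketch {
        public static void main(String[] args) throws Exception {
            // Every insert pays for Thrift (de)serialization, coordinator
            // routing, and a commitlog append before it is acknowledged.
            TFramedTransport transport = new TFramedTransport(
                    new org.apache.thrift.transport.TSocket("localhost", 9160));
            Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
            client.set_keyspace("Prefs");                      // assumed keyspace name

            Column column = new Column();
            column.setName(ByteBufferUtil.bytes("email"));
            column.setValue(ByteBufferUtil.bytes(""));
            column.setTimestamp(System.currentTimeMillis() * 1000);  // microseconds

            client.insert(ByteBufferUtil.bytes("user-id-1"),   // row key
                          new ColumnParent("Users"),           // column family
                          column, ConsistencyLevel.QUORUM);
            transport.close();
        }
    }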
5. Bulk load
   • Core functionality
   • Reuse existing ETL nodes for bulk loading
   • Move data file & index generation off the C* nodes
6. BMT (Binary Memtable) bulk load
   • Requires the StorageProxy API (Java)
   • Rows are not live until flush
   • The wiki example uses Hadoop
7. Streaming bulk load
   • Cassandra as a fat client
   • Bring your own SSTables
   • sstableloader [options] /path/to/keyspace_dir
   • Can ignore a list of nodes (-i)
   • keyspace_dir must be named after the keyspace and contain the generated SSTable Data & Index files
8. Data model
   • Users: row key UserId (<Hash>); columns: email, name, ... (see the writer sketch below)
   • UserGroups: row key GroupId (<UUID>); column name: UserId; column value: {"date_joined":"<date>","date_left":"<date>","active":<true|false>}
   • UserGroupTimeline: row key GroupId (<UUID>); column name: <TimeUUID>; column value: UserId
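To make "bring your own SSTables" concrete, here is a minimal sketch that writes one Users row into a keyspace-named directory that sstableloader can stream. It uses the 0.8-era SSTableSimpleUnsortedWriter API; the keyspace name "Prefs" is inferred from later slides, and the directory, row key, and column values are assumptions:

    import org.apache.cassandra.db.marshal.AsciiType;
    import org.apache.cassandra.utils.ByteBufferUtil;

    public class UsersSSTableSketch {
        public static void main(String[] args) throws Exception {
            // sstableloader expects a directory named after the keyspace.
            File keyspaceDir = new File("/tmp/Prefs");
            keyspaceDir.mkdirs();

            SSTableSimpleUnsortedWriter writer = new SSTableSimpleUnsortedWriter(
                    keyspaceDir, "Prefs", "Users",
                    AsciiType.instance,   // column-name comparator, must match the schema
                    null,                 // no subcomparator: standard (non-super) CF
                    32);                  // flush a new SSTable every ~32 MB of rows

            long timestamp = System.currentTimeMillis() * 1000;  // microseconds
            writer.newRow(ByteBufferUtil.bytes("user-id-1"));    // row key: UserId
            writer.addColumn(ByteBufferUtil.bytes("email"),
                             ByteBufferUtil.bytes(""), timestamp);
            writer.addColumn(ByteBufferUtil.bytes("name"),
                             ByteBufferUtil.bytes("Jane Doe"), timestamp);
            writer.close();  // Data & Index files now sit under /tmp/Prefs

            // Then stream them in with: sstableloader /tmp/Prefs
        }
    }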
9. Setup
   • Opscode Chef 0.10.2 on EC2
   • Cassandra 0.8.2-dev-SNAPSHOT (trunk)
   • Custom Java ETL JAR
   • The Grinder 3.4 (Jython) test harness
10. Chef 0.10.2
   • knife-ec2 bootstrap with --ephemeral
   • ec2::ephemeral_raid0 recipe
     • Installs mdadm, unmounts the default /mnt, creates a RAID0 array on /mnt/md0
11. Chef 0.10.2
   • cassandra::default recipe
     • Downloads/extracts apache-cassandra-<version>-bin.tar.gz
     • Links /var/lib/cassandra to /raid0/cassandra
     • Creates the cassandra user & directories, increases file limits, sets up the cassandra service, generates config files
12. Chef 0.10.2
   • cassandra::cluster_node recipe
     • Determines the number of nodes in the cluster
     • Calculates each node's initial_token (see the sketch below); generates cassandra.yaml
     • Creates the keyspace and column families
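The deck does not show the token math. Assuming the RandomPartitioner (token range 0 to 2^127), the standard initial_token calculation such a recipe performs is token(i) = i * 2^127 / nodeCount; a minimal sketch:

    import java.math.BigInteger;

    public class InitialTokenSketch {
        public static void main(String[] args) {
            int nodeCount = 12;  // matches the 12-node cluster in these slides
            // Space tokens evenly over [0, 2^127) so each node owns an equal range.
            BigInteger range = BigInteger.ONE.shiftLeft(127);
            for (int i = 0; i < nodeCount; i++) {
                BigInteger token = range.multiply(BigInteger.valueOf(i))
                                        .divide(BigInteger.valueOf(nodeCount));
                System.out.println("node " + i + ": initial_token = " + token);
            }
        }
    }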
13. Chef 0.10.2
   • cassandra::bulk_load_node recipe
     • Generates the same cassandra.yaml with an empty initial_token
     • Installs/configures the Grinder scripts and the Java ETL JAR
14. ETL JAR
15. ETL JAR

    for (File file : files) {
        CassandraBulkLoadImporter importer = new CassandraBulkLoadImporter(...);
        // Processing omitted
        importer.close();
    }
16. ETL JAR

    CassandraBulkLoadImporter.initSSTableWriters():

    File tempFiles = new File("/path/to/Prefs");
    tempFiles.mkdirs();
    for (String cfName : COLUMN_FAMILY_NAMES) {
        SSTableSimpleUnsortedWriter writer = new SSTableSimpleUnsortedWriter(
            tempFiles,
            "Prefs",  // keyspace name, inferred from the directory
            cfName,
            Model.COLUMN_FAMILY_COMPARATORS.get(cfName),  // map sketched below
            null,     // no subcomparator: no super CFs
            bufferSizeInMB);
        tableWriters.put(cfName, writer);
    }
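The Model.COLUMN_FAMILY_COMPARATORS map is not shown in the deck. A plausible sketch for the data model on slide 8 follows; the specific comparator choices are assumptions, and each must match the comparator declared in the schema:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.cassandra.db.marshal.AbstractType;
    import org.apache.cassandra.db.marshal.AsciiType;
    import org.apache.cassandra.db.marshal.TimeUUIDType;

    public class Model {
        // Hypothetical: column-name comparator per column family.
        static final Map<String, AbstractType> COLUMN_FAMILY_COMPARATORS =
                new HashMap<String, AbstractType>();
        static {
            COLUMN_FAMILY_COMPARATORS.put("Users", AsciiType.instance);       // named columns
            COLUMN_FAMILY_COMPARATORS.put("UserGroups", AsciiType.instance);  // named columns
            COLUMN_FAMILY_COMPARATORS.put("UserGroupTimeline",
                                          TimeUUIDType.instance);             // <TimeUUID> columns
        }
    }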
17. ETL JAR

    CassandraBulkLoadImporter.processSuppressionRecips():

    for (User user : users) {
        String key = user.getUserId();
        SSTableSimpleUnsortedWriter writer = tableWriters.get(cfName);
        // rowKey() converts a String to a ByteBuffer
        // (rowKey() and newUserColumn() are sketched below)
        writer.newRow(rowKey(key));
        o.a.c.t.Column column = newUserColumn(user);  // org.apache.cassandra.thrift.Column
        writer.addColumn(, column.value, column.timestamp);
        ... // Repeat for each column family
    }
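The rowKey() and newUserColumn() helpers are not shown in the deck. A minimal sketch of what they presumably do, inside CassandraBulkLoadImporter; the User accessors and the choice of column name/value are assumptions:

    import java.nio.ByteBuffer;
    import org.apache.cassandra.thrift.Column;
    import org.apache.cassandra.utils.ByteBufferUtil;

    // Hypothetical helpers matching their use on the previous slide.
    private static ByteBuffer rowKey(String key) {
        return ByteBufferUtil.bytes(key);  // UTF-8-encode the row key
    }

    private static Column newUserColumn(User user) {
        Column column = new Column();
        column.setName(ByteBufferUtil.bytes(user.getUserId()));  // assumed column name
        column.setValue(ByteBufferUtil.bytes(user.getEmail()));  // assumed column value
        column.setTimestamp(System.currentTimeMillis() * 1000);  // microseconds
        return column;
    }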
18. ETL JAR

    CassandraBulkLoadImporter.close():

    for (String cfName : COLUMN_FAMILY_NAMES) {
        try {
            tableWriters.get(cfName).close();
        } catch (IOException e) {
            log.error("close failed for " + cfName);
            throw new RuntimeException(cfName + " did not close");
        }
    }
    // Stream the generated files once all writers are closed
    // (a ProcessBuilder variant is sketched below)
    String streamCmd = "sstableloader -v --debug " + tempFiles.getAbsolutePath();
    Process stream = Runtime.getRuntime().exec(streamCmd);
    if (stream.waitFor() != 0)
        log.error("stream failed");
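Runtime.exec() as used above silently discards sstableloader's output. A sketch of the same step with ProcessBuilder (same command and paths, assuming the enclosing method declares throws for IOException/InterruptedException) that surfaces streaming progress in the ETL logs:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    // Run sstableloader and echo its combined output.
    ProcessBuilder pb = new ProcessBuilder(
            "sstableloader", "-v", "--debug", tempFiles.getAbsolutePath());
    pb.redirectErrorStream(true);  // merge stderr into stdout
    Process stream = pb.start();
    BufferedReader out = new BufferedReader(
            new InputStreamReader(stream.getInputStream()));
    for (String line; (line = out.readLine()) != null; ) {;
    }
    if (stream.waitFor() != 0) {
        throw new RuntimeException("sstableloader exited with an error");
    }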
19. cassandra_bulk_load.py

    import random
    import sys
    import uuid
    from import File
    from net.grinder.script.Grinder import grinder
    from net.grinder.script import Statistics
    from net.grinder.script import Test
    from com.mycompany import App
    from com.mycompany.tool import SingleColumnBulkImport
20. cassandra_bulk_load.py

    input_files = []  # files to load
    site_id = str(uuid.uuid4())
    import_id = random.randint(1, 1000000)
    list_ids = []  # lists users will be loaded to
    try:
        App.INSTANCE.start()
        dao = App.INSTANCE.getUserDAO()
        bulk_import = SingleColumnBulkImport(
            dao.prefsKeyspace, input_files, site_id,
            list_ids, import_id)
    except:
        exception = sys.exc_info()[1]
        print exception.message
        print exception.stackTrace
21. Import stats

    grinder.statistics.registerDataLogExpression(
        "Users Imported", "userLong0")
    grinder.statistics.registerSummaryExpression(
        "Total Users Imported", "(+ userLong0)")
    grinder.statistics.registerDataLogExpression(
        "Import Time", "userLong1")
    grinder.statistics.registerSummaryExpression(
        "Import Total Time (sec)", "(/ (+ userLong1) 1000)")
    # users per wall-clock second (thread-seconds / num_threads),
    # scaled by RF to count replica writes across the cluster
    rate_expression = ("(* (/ userLong0 (/ (/ userLong1 1000) "
                       + str(num_threads) + ")) "
                       + str(replication_factor) + ")")
    grinder.statistics.registerSummaryExpression(
        "Cluster Insert Rate (users/sec)", rate_expression)
22. Import and record stats

    def import_and_record():
        bulk_import.importFiles()
        grinder.statistics.forCurrentTest.setLong(
            "userLong0", bulk_import.totalLines)
        grinder.statistics.forCurrentTest.setLong(
            "userLong1", grinder.statistics.forCurrentTest.time)

    # Create an Import Test with a test number and a description
    import_test = Test(1, "Recip Bulk Import").wrap(import_and_record)

    # A TestRunner instance is created for each thread
    class TestRunner:
        # This method is called for every run
        def __call__(self):
            import_test()
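The deck never shows the file that drives the script above. A minimal sketch; the property names are standard Grinder 3 keys, but the process/thread/run counts here are assumptions:

    # -- sketch, assumed values
    grinder.script = cassandra_bulk_load.py
    grinder.processes = 1
    grinder.threads = 4
    grinder.runs = 1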
23. Stress results
   • Once the Data and Index files are generated, streaming bulk load is FAST
     • Average: ~2.5x faster than Thrift
     • ~15-300x faster than MySQL
   • Impact on the cluster is minimal
   • Observed downside: writing our own SSTables is slower than letting Cassandra write them
24. Q's?