ETL With Cassandra Streaming Bulk Loading

    Presentation Transcript

    • Cassandra ETL: Streaming Bulk Loading
      Alex Araujo, alexaraujo@gmail.com
    • Background
      • Sharded MySQL ETL platform on EC2 (EBS)
      • Database size: up to 1 TB
      • Write latencies grew exponentially with data size
    • Background
      • Cassandra Thrift loading on EC2 (ephemeral RAID0)
      • Database size: ∑ available node space
      • Write latencies ~linearly proportional to number of nodes
      • 12 XL node cluster (RF=3): 6-125x improvement over EBS-backed MySQL systems
    • Thrift ETL
      • Thrift overhead: converting to/from internal structures
      • Routing from coordinator nodes
      • Writing to commitlog
      • Internal structures -> on-disk format
      Source: http://wiki.apache.org/cassandra/BinaryMemtable
    • Bulk Load
      • Core functionality
      • Existing ETL nodes for bulk loading
      • Move data file & index generation off C* nodes
    • BMT Bulk Load
      • Requires StorageProxy API (Java)
      • Rows not live until flush
      • Wiki example uses Hadoop
      Source: http://wiki.apache.org/cassandra/BinaryMemtable
    • Streaming Bulk Load
      • Cassandra as Fat Client
      • BYO SSTables
      • sstableloader [options] /path/to/keyspace_dir
        • Can ignore list of nodes (-i)
        • keyspace_dir should be name of keyspace and contain generated SSTable Data & Index files
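The sstableloader invocation described on this slide can be sketched as a small helper that checks the directory layout and builds the command line. This is an illustrative sketch, not code from the talk: the function name and its parameters are hypothetical, and it assumes SSTable components use the usual `-Data.db` and `-Index.db` file name suffixes and that `-i` takes a comma-separated node list.

```python
import os


def build_sstableloader_command(keyspace_dir, ignore_nodes=None,
                                loader="sstableloader"):
    """Return the argv list for streaming the SSTables in keyspace_dir.

    Hypothetical helper: only the command shape comes from the slide.
    """
    # The directory must contain generated Data and Index files.
    names = os.listdir(keyspace_dir)
    for suffix in ("-Data.db", "-Index.db"):
        if not any(name.endswith(suffix) for name in names):
            raise ValueError("no *%s file in %s" % (suffix, keyspace_dir))

    argv = [loader]
    if ignore_nodes:
        # -i: nodes that should not be streamed to.
        argv += ["-i", ",".join(ignore_nodes)]
    # The last path component doubles as the keyspace name.
    argv.append(os.path.abspath(keyspace_dir))
    return argv
```

For a directory `/path/to/Prefs`, the helper returns `["sstableloader", "/path/to/Prefs"]`, which can then be handed to a process launcher.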
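    • Column families
      • Users: row key UserId (<Hash>); columns email, name, ...
      • UserGroups: row key GroupId (<UUID>); columns UserId, ...; column value {"date_joined":"<date>","date_left":"<date>","active":<true|false>}
      • UserGroupTimeline: row key GroupId (<UUID>); columns <TimeUUID>, ...; column value UserId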
    • Setup
      • Opscode Chef 0.10.2 on EC2
      • Cassandra 0.8.2-dev-SNAPSHOT (trunk)
      • Custom Java ETL JAR
      • The Grinder 3.4 (Jython) Test Harness
    • Chef 0.10.2
      • knife-ec2 bootstrap with --ephemeral
      • ec2::ephemeral_raid0 recipe
        • Installs mdadm, unmounts default /mnt, creates RAID0 array on /mnt/md0
    • Chef 0.10.2
      • cassandra::default recipe
        • Downloads/extracts apache-cassandra-<version>-bin.tar.gz
        • Links /var/lib/cassandra to /raid0/cassandra
        • Creates cassandra user & directories, increases file limits, sets up cassandra service, generates config files
    • Chef 0.10.2
      • cassandra::cluster_node recipe
        • Determines # nodes in cluster
        • Calculates initial_token; generates cassandra.yaml
        • Creates keyspace and column families
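The slides do not show how the recipe calculates initial_token. A common approach, sketched here under the assumption that the cluster uses RandomPartitioner (token space 0 to 2**127), is to space the tokens evenly around the ring. The function name is hypothetical.

```python
def initial_tokens(node_count):
    """Evenly spaced initial_token values for a RandomPartitioner ring.

    Assumption: the token space is 0..2**127, so node i gets i * 2**127 / N.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    ring_size = 2 ** 127
    return [i * ring_size // node_count for i in range(node_count)]


# One token per node for the 12-node cluster mentioned earlier.
for index, token in enumerate(initial_tokens(12)):
    print("node %d: initial_token = %d" % (index, token))
```

Each node's cassandra.yaml would then receive the token matching its position in the cluster.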
    • Chef 0.10.2
      • cassandra::bulk_load_node recipe
        • Generates same cassandra.yaml with empty initial_token
        • Installs/configures grinder scripts; Java ETL JAR
    • ETL JAR
    • ETL JAR
      for (File file : files) {
          CBLI importer = new CBLI(...); // CBLI: CassandraBulkLoadImporter
          importer.open();
          // Processing omitted
          importer.close();
      }
    • ETL JAR
      CassandraBulkLoadImporter.initSSTableWriters():
      File tempFiles = new File("/path/to/Prefs");
      tempFiles.mkdirs();
      for (String cfName : COLUMN_FAMILY_NAMES) {
          SSTableSimpleUnsortedWriter writer = new SSTableSimpleUnsortedWriter(
              tempFiles,
              Model.Prefs.Keyspace.name,
              cfName,
              Model.COLUMN_FAMILY_COMPARATORS.get(cfName),
              null, // No Super CFs
              bufferSizeInMB);
          tableWriters.put(cfName, writer);
      }
    • ETL JAR
      CassandraBulkLoadImporter.processSuppressionRecips():
      for (User user : users) {
          String key = user.getUserId();
          SSTableSimpleUnsortedWriter writer = tableWriters.get(Model.Users.CF.name);
          // rowKey() converts String to ByteBuffer
          writer.newRow(rowKey(key));
          org.apache.cassandra.thrift.Column column = newUserColumn(user);
          writer.addColumn(column.name, column.value, column.timestamp);
          ... // Repeat for each column family
      }
    • ETL JAR
      CassandraBulkLoadImporter.close():
      for (String cfName : COLUMN_FAMILY_NAMES) {
          try {
              tableWriters.get(cfName).close();
          } catch (IOException e) {
              log.error("close failed for " + cfName);
              throw new RuntimeException(cfName + " did not close");
          }
          // Note the trailing space before the directory path
          String streamCmd = "sstableloader -v --debug " + tempFiles.getAbsolutePath();
          Process stream = Runtime.getRuntime().exec(streamCmd);
          // A non-zero exit status means the stream failed
          if (stream.waitFor() != 0)
              log.error("stream failed");
      }
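The close() step shells out to sstableloader and treats a non-zero exit status as a failed stream. The same check can be sketched in plain Python 3 (not the Jython used by the test harness). The helper name is hypothetical, and the usage lines substitute the Python interpreter for sstableloader so the example runs without a Cassandra install.

```python
import subprocess
import sys


def run_and_check(argv):
    """Run argv, wait for it to exit, and return True only on exit status 0."""
    completed = subprocess.run(argv)
    return completed.returncode == 0


# Stand-in commands: one exits with status 0, the other with status 3.
succeeded = run_and_check([sys.executable, "-c", "pass"])
failed = not run_and_check([sys.executable, "-c", "import sys; sys.exit(3)"])
print("success detected:", succeeded, "failure detected:", failed)
```

Passing the argument list directly, rather than a single concatenated string, avoids depending on whitespace between the flags and the path.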
    • cassandra_bulk_load.py
      import random
      import sys
      import uuid
      from java.io import File
      from net.grinder.script.Grinder import grinder
      from net.grinder.script import Statistics
      from net.grinder.script import Test
      from com.mycompany import App
      from com.mycompany.tool import SingleColumnBulkImport
    • cassandra_bulk_load.py
      input_files = []  # files to load
      site_id = str(uuid.uuid4())
      import_id = random.randint(1, 1000000)
      list_ids = []  # lists users will be loaded to
      try:
          App.INSTANCE.start()
          dao = App.INSTANCE.getUserDAO()
          bulk_import = SingleColumnBulkImport(
              dao.prefsKeyspace, input_files, site_id, list_ids, import_id)
      except:
          exception = sys.exc_info()[1]
          print exception.message
          print exception.stackTrace
    • cassandra_bulk_load.py
      # Import stats
      grinder.statistics.registerDataLogExpression(
          "Users Imported", "userLong0")
      grinder.statistics.registerSummaryExpression(
          "Total Users Imported", "(+ userLong0)")
      grinder.statistics.registerDataLogExpression(
          "Import Time", "userLong1")
      grinder.statistics.registerSummaryExpression(
          "Import Total Time (sec)", "(/ (+ userLong1) 1000)")
      rate_expression = ("(* (/ userLong0 (/ (/ userLong1 1000)"
                         + str(num_threads) + ")) "
                         + str(replication_factor) + ")")
      grinder.statistics.registerSummaryExpression(
          "Cluster Insert Rate (users/sec)", rate_expression)
    • cassandra_bulk_load.py
      # Import and record stats
      def import_and_record():
          bulk_import.importFiles()
          grinder.statistics.forCurrentTest.setLong(
              "userLong0", bulk_import.totalLines)
          grinder.statistics.forCurrentTest.setLong(
              "userLong1", grinder.statistics.forCurrentTest.time)

      # Create an Import Test with a test number and a description
      import_test = Test(1, "Recip Bulk Import").wrap(import_and_record)

      # A TestRunner instance is created for each thread
      class TestRunner:
          # This method is called for every run.
          def __call__(self):
              import_test()
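The "Cluster Insert Rate (users/sec)" expression registered in cassandra_bulk_load.py is written in The Grinder's prefix notation, which can be hard to read. The sketch below restates the same arithmetic as an ordinary function, ignoring any rounding The Grinder itself may apply. The function name and the sample figures are hypothetical, not measurements from the talk.

```python
def cluster_insert_rate(users_imported, import_time_ms,
                        num_threads, replication_factor):
    """Mirror of (* (/ userLong0 (/ (/ userLong1 1000) threads)) rf)."""
    seconds = import_time_ms / 1000.0           # (/ userLong1 1000)
    per_thread_seconds = seconds / num_threads  # (/ ... num_threads)
    users_per_second = users_imported / per_thread_seconds
    # Each user row is written to replication_factor nodes.
    return users_per_second * replication_factor


# Hypothetical run: 1,000,000 users, 200,000 ms total, 4 threads, RF=3.
print(cluster_insert_rate(1000000, 200000, 4, 3))  # 60000.0
```

Dividing the recorded time by the thread count treats the threads as running concurrently, and multiplying by the replication factor counts every replica write toward the cluster-wide rate.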
    • Stress Results
      • Once Data and Index files are generated, streaming bulk load is FAST
        • Average: ~2.5x increase over Thrift
        • ~15-300x increase over MySQL
      • Impact on cluster is minimal
      • Observed downside: writing your own SSTables is slower than letting Cassandra write them
    • Q’s?