Data Munging with Hadoop

  1. Data Munging with Hadoop
     August 2013
     Giannis Neokleous (@yiannis_n)
  2. Hadoop at Knewton
     ● AWS - EMR
       ○ Use S3 as the filesystem, then Redshift
     ● Some Cascading
     ● Processing
       ○ Student events (at different processing stages)
       ○ Recommendations
       ○ Course graphs
       ○ Compute metrics
     ● Use internal tools for launching EMR clusters
  3. Analytics Team at Knewton?
     ● Currently building the platform, tools, interfaces, and data for other teams
     ● Source of new model/metric development
       ○ Compute and data processing platform, data warehouse
       ○ All your data are belong to us
     ● Making access to bulk data easier for data scientists
     ● The nature of the data we publish calls for solid validation methods
       ○ Analytics enables that
     ● Multiple sources, lots of data models
       ○ Recommendations, student events, course graphs, rec context, etc.
  4. Dealing with multiple data source types
  5. Why?
     ● Teams use multiple data stores
       ○ Want to bootstrap the data stores
       ○ Want to bulk-analyze the data stores
     ● You decide to build a data warehouse to store all your data
     ● Not everything sits in human-readable log files
     ● You simulate traffic through your system and want to push it to services
       ○ Load tests
  6. Stuck! How do you bring in all the data from all the data sources?
  7. Solutions
     ● Ask the service owners to publish their payload every day in an easy-to-process format (log files?)
       ○ Consistency is hard (source =?= payload)
     ● In the case of services, use the API
     ● Rely on packaged tools to transform data
       ○ Slow transformation processes
       ○ One extra transformation step
  8. Can you do better?
     ● Understand your input sources or output destinations
       ○ Write custom:
         ■ InputFormats / OutputFormats
         ■ RecordReaders / RecordWriters
     ● Cleaner code in your mappers and reducers
       ○ Good separation between data model initialization and actual logic
       ○ Meaningful objects in mappers and reducers
  9. Really? Where?
  10. Examples
      We have used this for:
      ● Bootstrapping data stores like Cassandra, SenseiDB, and Redshift
      ● Writing Lucene indices
      ● Bulk-reading data stores for analytics
      ● Load tests
        ○ Simulating real traffic
        ○ Publishing to queues
  11. kk, show me more
  12. A little about MapReduce
      ● InputFormat
        ○ Figures out where the data is, what to read, and how to read it
        ○ Divides the data among record readers
      ● RecordReader
        ○ Instantiated by InputFormats
        ○ Does the actual reading
      ● Mapper
        ○ Key/value pairs get passed in by the record readers
      ● Reducer
        ○ Key/value pairs get passed in from the mappers
        ○ All identical keys end up in the same reducer
  13. A little about MapReduce
      ● OutputFormat
        ○ Figures out where and how to write the data
        ○ Divides the data among record writers
        ○ Decides what to do after the data has been written
      ● RecordWriter
        ○ Instantiated by OutputFormats
        ○ Does the actual writing
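The components on the last two slides can be sketched in plain Java (no Hadoop dependencies) to show the one guarantee everything else relies on: after the shuffle, every value emitted under the same key lands in a single reduce call. All class and method names here are illustrative, not Hadoop's API.

```java
import java.util.*;
import java.util.stream.*;

// A plain-Java sketch of the map -> shuffle -> reduce flow, using word count
// as the example job.
public class MiniMapReduce {
    public static Map<String, Integer> wordCount(List<String> lines) {
        // Map phase: emit a (word, 1) pair for every token.
        List<Map.Entry<String, Integer>> mapped = lines.stream()
            .flatMap(line -> Arrays.stream(line.split("\\s+")))
            .filter(w -> !w.isEmpty())
            .map(w -> Map.entry(w, 1))
            .collect(Collectors.toList());

        // Shuffle phase: group pairs by key, so each key's values sit together.
        Map<String, List<Integer>> shuffled = mapped.stream()
            .collect(Collectors.groupingBy(Map.Entry::getKey,
                Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        // Reduce phase: one call per key, summing its values.
        Map<String, Integer> reduced = new TreeMap<>();
        shuffled.forEach((k, vs) ->
            reduced.put(k, vs.stream().mapToInt(Integer::intValue).sum()));
        return reduced;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("a b a", "b c"))); // {a=2, b=2, c=1}
    }
}
```

In real Hadoop the shuffle also sorts keys and moves data across machines, but the grouping contract the reducer sees is the same.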
  14. Data Flow
      [Diagram: an InputFormat pulls from DB tables and hands the data to
      RecordReaders, which feed Map tasks on the Hadoop cluster; Reduce tasks
      push results through an OutputFormat's RecordWriters to the output.]
  15. Input
  16. What do I need?
      You need to:
      ● Write a custom record reader extending org.apache.hadoop.mapreduce.RecordReader
      ● Write a custom input format extending org.apache.hadoop.mapreduce.InputFormat
      ● Define them in your job configuration:

        Job job = new Job(new Configuration());
        job.setInputFormatClass(MyAwesomeInputFormat.class);
  17. RecordReader interface

      public abstract class ExampleRecordReader
          extends RecordReader<LongWritable, RecommendationWritable> {

        public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException;
        public boolean nextKeyValue() throws IOException, InterruptedException;
        public LongWritable getCurrentKey() throws IOException, InterruptedException;
        public RecommendationWritable getCurrentValue()
            throws IOException, InterruptedException;
        public float getProgress() throws IOException, InterruptedException;
        public void close() throws IOException;
      }
  18. InputFormat interface

      public class ExampleInputFormat
          extends InputFormat<LongWritable, RecommendationWritable> {

        public List<InputSplit> getSplits(JobContext context)
            throws IOException, InterruptedException;
        public RecordReader<LongWritable, RecommendationWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException;
      }

      ● Often you subclass an InputFormat further down the class hierarchy and
        override a few methods
      ● Good way to filter early
        ○ Skip files you really don't need to read
          ■ Works great with time-series data
  19. JsonRecordReader

      public abstract class JsonRecordReader<V extends Writable>
          extends RecordReader<LongWritable, V> {

        public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
          Configuration conf = context.getConfiguration();
          FileSplit fileSplit = (FileSplit) split;
          start = fileSplit.getStart();
          end = start + fileSplit.getLength();
          compressionCodecs = new CompressionCodecFactory(conf);
          in = initLineReader(fileSplit, conf);
          pos = start;
          gson = gsonBuilder.create();
        }

        public boolean nextKeyValue() throws IOException, InterruptedException {
          key.set(pos);
          Text jsonText = new Text();
          int newSize = 0;
          if (pos < end) {
            newSize = in.readLine(jsonText);
            if (newSize > 0 && !jsonText.toString().isEmpty()) {
              for (ObjectDecorator<String> decorator : decorators) {
                jsonText = new Text(decorator.decorateObject(jsonText.toString()));
              }
              V tempValue = (V) gson.fromJson(jsonText.toString(),
                  getDataClass(jsonText.toString()));
              value = tempValue;
            }
            pos += newSize;
          }
          if (newSize == 0 || jsonText.toString().isEmpty()) {
            key = null;
            value = null;
            return false;
          }
          return true;
        }

        public LongWritable getCurrentKey() throws IOException, InterruptedException {
          return key;
        }

        public V getCurrentValue() throws IOException, InterruptedException {
          return value;
        }

        public float getProgress() throws IOException, InterruptedException {
          if (start == end) {
            return 0.0f;
          }
          return Math.max(0.0f, Math.min(1.0f, (pos - start) / (float) (end - start)));
        }

        public void registerDeserializer(Type type, JsonDeserializer<?> jsonDes) {
          gsonBuilder.registerTypeAdapter(type, jsonDes);
        }

        // Other methods omitted
      }
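The decorator loop in nextKeyValue above rewrites each raw JSON line before Gson sees it. Here is a standalone sketch of that idea, with no Gson or Hadoop dependencies; the ObjectDecorator interface mirrors the slide, but this implementation (and the "fileName" field it injects) is an assumption for illustration, not the library's actual code.

```java
// Standalone sketch: "decorate" a JSON object string with extra context
// (the name of the file the record came from) before deserialization.
public class JsonFilenameDecoratorSketch {
    interface ObjectDecorator<T> { T decorateObject(T obj); }

    // Inserts a "fileName" field right after the opening brace of a JSON object.
    static ObjectDecorator<String> filenameDecorator(String fileName) {
        return json -> {
            int brace = json.indexOf('{');
            if (brace < 0) return json; // not a JSON object; leave untouched
            return json.substring(0, brace + 1)
                + "\"fileName\":\"" + fileName + "\","
                + json.substring(brace + 1);
        };
    }

    public static void main(String[] args) {
        ObjectDecorator<String> d = filenameDecorator("recs-part-00000.json");
        System.out.println(d.decorateObject("{\"studentId\":42}"));
        // {"fileName":"recs-part-00000.json","studentId":42}
    }
}
```

This is useful when, as on the next slide, the source file name carries information (e.g. a date or partition) that the record itself lacks.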
  20. RecommendationsInputFormat

      public class RecommendationsInputFormat
          extends FileInputFormat<LongWritable, RecommendationWritable> {

        public RecordReader<LongWritable, RecommendationWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
          JsonRecordReader rr = new JsonRecordReader<RecommendationWritable>() {
            @Override
            protected Class<?> getDataClass(String jsonStr) {
              return RecommendationWritable.class;
            }
          };
          FileSplit fileSplit = (FileSplit) split;
          rr.addDecorator(new JsonFilenameDecorator(fileSplit.getPath().getName()));
          return rr;
        }
      }

      ● Inherits from FileInputFormat, so everything else is taken care of
      ● Define createRecordReader and instantiate a JsonRecordReader
  21. Output
  22. OutputFormat interface

      public abstract class ExampleOutputFormat
          extends OutputFormat<LongWritable, RecommendationWritable> {

        public abstract RecordWriter<LongWritable, RecommendationWritable> getRecordWriter(
            TaskAttemptContext context) throws IOException, InterruptedException;
        public abstract void checkOutputSpecs(JobContext context)
            throws IOException, InterruptedException;
        public abstract OutputCommitter getOutputCommitter(TaskAttemptContext context)
            throws IOException, InterruptedException;
      }

      ● getRecordWriter should initialize the RecordWriter and pass it an output
        stream (it doesn't have to be HDFS-specific)
        ○ Example from TextOutputFormat:

          public RecordWriter<K, V> getRecordWriter(TaskAttemptContext job)
              throws IOException, InterruptedException {
            ...
            FileSystem fs = file.getFileSystem(conf);
            if (!isCompressed) {
              FSDataOutputStream fileOut = fs.create(file, false);
              return new LineRecordWriter<K, V>(fileOut, keyValueSeparator);
            } else {
              FSDataOutputStream fileOut = fs.create(file, false);
              return new LineRecordWriter<K, V>(
                  new DataOutputStream(codec.createOutputStream(fileOut)),
                  keyValueSeparator);
            }
            ...
          }
  23. RecordWriter interface

      public abstract class ExampleRecordWriter
          extends RecordWriter<LongWritable, RecommendationWritable> {

        public void write(LongWritable key, RecommendationWritable value)
            throws IOException, InterruptedException;
        public void close(TaskAttemptContext context)
            throws IOException, InterruptedException;
      }

      ● Can use an existing (non-HDFS) output stream from a library
      ● If it's a file, you can write to the temp work dir and, on close, move
        the file to HDFS
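The "write to a temp work dir, then move on close" pattern from the last bullet can be sketched with the local filesystem standing in for HDFS. In a real RecordWriter the final move would target the job's HDFS output path; all names here are hypothetical.

```java
import java.io.IOException;
import java.nio.file.*;

// Sketch: records are appended to a temp file during write(), and the
// finished file is promoted to the output directory on close().
public class TempThenMoveWriter {
    public static Path writeAndCommit(Path workDir, Path outDir, String name,
                                      Iterable<String> records) throws IOException {
        Files.createDirectories(workDir);
        Files.createDirectories(outDir);
        Path tmp = workDir.resolve(name + ".tmp");

        // "write()" phase: stream records into the temp file.
        Files.write(tmp, records);

        // "close()" phase: move the complete file into the output dir, so
        // readers never observe a half-written file there.
        return Files.move(tmp, outDir.resolve(name),
                          StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("writer-demo");
        Path out = writeAndCommit(base.resolve("work"), base.resolve("out"),
                                  "part-00000", java.util.List.of("rec1", "rec2"));
        System.out.println(Files.readAllLines(out)); // [rec1, rec2]
    }
}
```

The payoff is atomic-looking commits: a failed task leaves only junk in the work dir, never a partial file in the output.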
  24. Does it matter what I write?
  25. Yes!
      ● You have to worry about splits
      ● Avro, Sequence files, Thrift (with a little work)
        ○ Automatically serialize and deserialize objects on read/write
      [Format-comparison table omitted; borrowed from Hadoop: The Definitive Guide]
  26. What else?
  27. Bulk-read Cassandra, starting today
      ● Apply the same ideas to bulk-reading (or writing, though not today) Cassandra
      ● No need to make a single call to Cassandra
  28. A little about SSTables
      ● Sorted
        ○ Both row keys and columns
      ● Key/value pairs
        ○ Rows: the row key is the key, the columns are the value
        ○ Columns: the column name is the key, the column value is the value
      ● Immutable
      ● File names consist of 4 parts, e.g. ColumnFamily-hd-3549-Data.db
        ○ Column family name: ColumnFamily
        ○ Version: hd
        ○ Table number: 3549
        ○ Component type, one of: Data, Index, Filter, Statistics, [CompressionInfo]
          (one table consists of 4 or 5 component types)
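The 4-part naming scheme above is easy to parse. A small sketch, assuming the column family name itself contains no hyphens (the parser and its names are illustrative, not Cassandra's own code):

```java
// Splits an SSTable file name like "StudentEvents-hd-3549-Data.db" into its
// four parts: column family, version, table number, and component type.
public class SSTableNameParser {
    public record SSTableName(String columnFamily, String version,
                              int tableNumber, String component) {}

    public static SSTableName parse(String fileName) {
        String base = fileName.endsWith(".db")
            ? fileName.substring(0, fileName.length() - 3)
            : fileName;
        String[] parts = base.split("-");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Unexpected SSTable name: " + fileName);
        }
        return new SSTableName(parts[0], parts[1],
                               Integer.parseInt(parts[2]), parts[3]);
    }

    public static void main(String[] args) {
        SSTableName n = parse("StudentEvents-hd-3549-Data.db");
        System.out.println(n.columnFamily() + " / " + n.component());
        // StudentEvents / Data
    }
}
```

This kind of parsing is what lets an input format (next slide) keep only the *-Data.db files for the column family it cares about.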
  29. SSTableInputFormat
      ● An input format specifically for SSTables
        ○ Extends FileInputFormat
      ● Includes a DataPathFilter for narrowing the input down to *-Data.db files
      ● Expands all subdirectories of the input and filters by column family
      ● Configures the Comparator, Subcomparator, and Partitioner classes used by
        the column family
      ● Two subtypes:
        ○ FileInputFormat -> SSTableInputFormat ->
          SSTableColumnInputFormat, SSTableRowInputFormat
  30. SSTableRecordReader
      ● A record reader specifically for SSTables
      ● On init:
        ○ Copies the table locally (decompresses it, if using Priam)
        ○ Opens the table for reading (only needs the Data, Index, and
          CompressionInfo components)
        ○ Creates a TableScanner from the reader
      ● Two subtypes:
        ○ RecordReader -> SSTableRecordReader ->
          SSTableColumnRecordReader, SSTableRowRecordReader
  31. Example job

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf);

        // Load additional properties from a conf file
        ClassLoader loader = SSTableMRExample.class.getClassLoader();
        conf.addResource(loader.getResource("knewton-site.xml"));

        SSTableInputFormat.setPartitionerClass(RandomPartitioner.class.getName(), job);
        SSTableInputFormat.setComparatorClass(LongType.class.getName(), job);
        SSTableInputFormat.setColumnFamilyName("StudentEvents", job);

        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(StudentEventWritable.class);

        // Define mappers/reducers -- the only thing you have to write
        job.setMapperClass(StudentEventMapper.class);
        job.setReducerClass(StudentEventReducer.class);

        // Read each column as a separate record
        job.setInputFormatClass(SSTableColumnInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        SSTableInputFormat.addInputPaths(job, args[0]);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
      }

      public class StudentEventMapper extends
          SSTableColumnMapper<Long, StudentEvent, LongWritable, StudentEventWritable> {

        // Sees row key/column pairs; remember to skip deleted columns (tombstones)
        @Override
        public void performMapTask(Long key, StudentEvent value, Context context) {
          // do stuff here
        }

        // Some other trivial methods omitted
      }
  32. KnewtonMRTools
      ● Open-sourcing today!
      ● The plan is to have reusable record readers and writers
      ● Currently includes the JsonRecordReader
      ● Includes example jobs with sample input; self-contained
  33. KassandraMRHelper
      ● Find out more:
        ○ Source: (link on slide)
        ○ Slides: (link on slide)
        ○ Presentation video: (link on slide)
      ● Has all you need to get started with bulk-reading SSTables with Hadoop
      ● Includes an example job that reads "student events"
      ● Handles compressed tables
      ● Use Priam? Even better: it can snappy-decompress Priam backups
      ● Don't have a cluster up or a table handy?
        ○ Use com.knewton.mapreduce.cassandra.WriteSampleSSTable in the test
          source directory to generate one
  34. Thank you! Questions?
      Giannis Neokleous (@yiannis_n)