Scalding Presentation

Scalding, Scala, MapReduce
24th Hadoop London Meetup


Usage Rights

© All Rights Reserved

  • Monday, June 30, 2014, 6:30 PM to 9:30 PM
    Barclays Accelerator, 69-89 Mile End Road, E1 4UJ, London
    http://www.meetup.com/big-data-london/events/188925412/
  • Here are my contact details. Find a number of open-source projects related to Hadoop & MapReduce that I have been contributing to on GitHub, and also the technical blog http://scalding.io
  • First book ever available on Scala + MapReduce + Hadoop. Comes with hundreds of ready-to-run examples.
    Book @ Amazon = http://amazon.co.uk/dp/1783287012
    Book @ PACKT = http://packtpub.com/programming-mapreduce-with-scalding/book
    GitHub repository with examples = https://github.com/scalding-io/ProgrammingWithScalding
  • http://github.com/twitter/scalding
  • Once upon a time..
  • Hadoop provides HDFS for the distributed storage of large files, and services for the coordinated execution of MapReduce tasks. The Java MapReduce API is very verbose: a simple WordCount example takes 70 lines of code.
  • In-memory systems (e.g. Memcached, Redis), document databases, search systems
  • Explain: taps, tuples, pipes
  • Show parallelism
  • Cascading word count requires 20 lines. Java MapReduce API word count requires 70 lines. So we manage to remove 50 lines of code (about 70%) by using Cascading.
  • What does Scalding add on top of Cascading?
  • Parquet => efficient columnar storage. For a Scalding application to execute, all defined input and output taps must participate in the pipeline. Reading & writing files.
  • 15 map operations, all translated into map phases
  • How many items does each of our shops sell ?
  • Code reference - https://github.com/scalding-io/ProgrammingWithScalding/tree/master/chapter5/src/main/scala/externaloperations
  • Testing is challenging in the context of MapReduce
  • Driven takes Cascading application development to the next level with management and monitoring capabilities for your Cascading apps
  • This is the stack presented so far
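
The question raised in the notes above, how many items each shop sells, is a group-and-sum. As a minimal sketch of that logic in plain Scala collections (this is not the Scalding API and it runs in memory, not on Hadoop; the `Sale` case class, the `ShopTotals` object and the sample rows are hypothetical names chosen for illustration):

```scala
// Plain-Scala analogue of a group-by-shop sum. Illustrative only:
// Sale, ShopTotals and the sample data are assumptions, not Scalding code.
final case class Sale(shopId: String, itemId: String, quantity: Long)

object ShopTotals {
  // Group the rows by shop, then add up the quantities within each group.
  def totalSoldItems(sales: Seq[Sale]): Map[String, Long] =
    sales
      .groupBy(_.shopId)
      .map { case (shopId, rows) => shopId -> rows.map(_.quantity).sum }

  def main(args: Array[String]): Unit = {
    val sales = Seq(
      Sale("shop-1", "apple", 3L),
      Sale("shop-1", "pear", 2L),
      Sale("shop-2", "apple", 5L)
    )
    // Print one tab-separated line per shop, sorted by shop id.
    totalSoldItems(sales).toSeq.sortBy(_._1).foreach {
      case (shopId, total) => println(s"$shopId\t$total")
    }
  }
}
```

The slides express the same aggregation as a `groupBy('shopId)` with a `sum` over `'quantity`; the difference is that Scalding plans it as a MapReduce job instead of evaluating it on a local collection.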

Scalding Presentation Transcript

  • MapReduce with Scalding Antonios Chalkiopoulos 24th Big Data London Meetup Scalding.io
  • $ whoami: http://scalding.io, http://github.com/scalding-io, @chalkiopoulos
  • My recent achievement..
  • What are we gonna talk about..?
  • A Scala API on top of Cascading
  • But what is Cascading?
  • A few years ago I started on a fresh Big Data team…
  • How do we efficiently develop MapReduce jobs for our new Hadoop cluster?
  • MapReduce Techs (diagram, ordered by abstraction): Hadoop, Java MapReduce
  • Java MapReduce word count example
  • MapReduce Techs (diagram, ordered by abstraction): Hadoop, Java MapReduce, then Pig, Hive, Cascading and others
  • The promise of Cascading
  • [1] A simple, high-level Java API for MapReduce that is easy to understand and work with.
  • [2] Extensions to MANY platforms
  • Cascading connects to:
    NoSQL databases: MongoDB, Cassandra, HBase, Accumulo, …
    SQL databases
    Hadoop filesystem
    Local filesystem
    In-memory systems: Redis, Memcached, …
    Search platforms: ElasticSearch, Solr, …
  • How does it work?
  • A pipeline architecture
  • Diagram: data flows from a source tap, through the pipeline, to a sink tap
  • Diagram: log files and customer data are combined (Log & Customer) into the final results
  • Cascading Example
  • Word count in Cascading
      // Imports omitted, as on the slide.
      public class WordCount {
        public static void main(String[] args) {
          Properties properties = new Properties();
          FlowConnector.setApplicationJarClass(properties, WordCount.class);
          // Source reads lines of text; sink writes (word, count) pairs.
          Scheme sourceScheme = new TextLine(new Fields("line"));
          Scheme sinkScheme = new TextLine(new Fields("word", "count"));
          Tap source = new Hfs(sourceScheme, args[0]);
          Tap sink = new Hfs(sinkScheme, args[1], SinkMode.REPLACE);
          Pipe assembly = new Pipe("Word Count");
          // Split each line into words on Unicode letter boundaries.
          String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
          Function function = new RegexGenerator(new Fields("word"), regex);
          assembly = new Each(assembly, new Fields("line"), function);
          // Group by word and count the occurrences in each group.
          assembly = new GroupBy(assembly, new Fields("word"));
          Aggregator count = new Count(new Fields("count"));
          assembly = new Every(assembly, count);
          FlowConnector flowConnector = new FlowConnector(properties);
          Flow flow = flowConnector.connect("word-count", source, sink, assembly);
          flow.complete();
        }
      }
    70% less boilerplate code, but still some infrastructure code
  • No boilerplate code at all; functional; robust & scalable; runs on the JVM
  • Here it comes (diagram, ordered by abstraction): Hadoop, Java MapReduce, then Pig, Hive, Cascading and others, with Scalding on top
  • The power of Scala on top of Cascading
  • Scala fits naturally with data
  • Word count in Scalding
      import com.twitter.scalding._

      class WordCountJob(args: Args) extends Job(args) {
        TextLine("input.txt").read
          // Map phase: split each line into words.
          .flatMap('line -> 'word) { line: String => line.split("\\s+") }
          // Reduce phase: count the occurrences of each word.
          .groupBy('word) { _.size }
          .write(Tsv("results.tsv"))
      }
  • Who is using it? Many, many others…
  • Scalding…
    …was open sourced by Twitter in 2011
    …has more than 100 open source contributors
    …exposes the right abstractions
    …maximizes expressiveness
    …promotes extensibility
    …adds new capabilities to Cascading
  • Core Concepts
  • Sources & Sinks
      Tsv("data.tsv", ('productID, 'price, 'quantity))
        .read
        .write(UnpackedAvroSource("data.avro"))
    Supported formats: Tsv, Csv, Osv, Avro, Parquet, …
  • Map Operations
      pipe1.filter('age) { age: Int => age > 18 }
      pipe1.map('price -> 'withVAT) { price: Double => price * 1.2 }
      pipe1.project('name, 'surname)
    15 map operations, translated into map phases
  • Join operations
      pipe1.joinWithSmaller('productId -> 'productId, pipe2)
      pipe1.joinWithLarger('productId -> 'productId, pipe2)
      pipe1.joinWithTiny('productId -> 'productId, pipe2)
    Optimize by hinting the relative sizes
      pipe1
        .joinWithSmaller('productId -> 'productId, pipe2,
          joiner = new LeftJoin)
    Supports Left, Right, Inner, Outer Joins
  • Group operations
      val pipe = Tsv("input", ('shopId, 'itemId, 'quantity))
        .groupBy('shopId) {
          _.sum[Long]('quantity -> 'totalSoldItems)
        }
        .write(Tsv("results.tsv"))
    .groupBy: group by particular fields
    .groupAll: group all data
  • Pipe operations
      val p = (pipe1 ++ pipe2)        // Concatenate 2 pipes
        .debug                        // Print sample data to screen
        .addTrap(Tsv("bogus_lines"))  // Dirty data are recorded
    Simple pipe operations
  • Connect with external systems
  • Scalding + Hive
      class HiveExample(args: Args) extends Job(args) {
        val USER_SCHEMA = List('userId, 'username, 'photo)
        HiveSource("myHiveTable", SinkMode.KEEP)
          .withHCatScheme(osvInputScheme(fields = USER_SCHEMA))
          .write(Tsv("outputFromHive"))
      }
    Define the schema; query HCatalog; read directly from HDFS
  • Scalding + ElasticSearch
      val schema = List('number, 'product, 'description)
      val readES = ElasticSearchTap("localhost", 9200, "index/firstType", "", schema)
        .read
        .write(Tsv("data/es-out.tsv"))
      val writeES = Tsv("data.tsv")
        .read
        .write(ElasticSearchTap("localhost", 9200, "index/secondType", "", schema))
    Read from ElasticSearch in one line! Also index new data in ES
  • Design patterns
  • Dependency Injection; Late bound; External Operations
  • How about defining external operations?
      val pipe1 = Tsv("omniture.tsv", OMNITURE_SCHEMA)
        .read
        .ETLOmnitureData
        .calculateOmnitureUserStats
        .joinWithCustomerDB('userId -> 'userId, customerPipe)
        .write(Tsv("omniture-results.tsv"))
    Custom operations: re-usable modular code; single responsibility; testability
    Full code: http://bit.ly/1pNSUKf
  • Scalding Testing
  • Testing challenges in the context of MR
    Layers: Unit / Component Tests, Integration Tests, System Tests, Acceptance Tests
    Scalding enables testing in every layer & TDD
  • Example
      class TsvWordCountJobTest extends FlatSpec
        with ShouldMatchers with TupleConversions {
        "WordCountJob" should "count words" in {
          JobTest(new WordCountJob(_))
            .arg("input", "inFile")
            .arg("output", "outFile")
            .source(TextLine("inFile"), List("0" -> "cool Scala cool"))
            .sink[(String, Int)](Tsv("outFile")) { out =>
              out.toList should contain ("cool" -> 2)
            }
            .run
            .finish
        }
      }
    Replaces taps with in-memory collections and asserts the expected output
  • Monitoring
  • “Driven takes Cascading application development to the next level with management and monitoring capabilities for your apps” http://driven.cascading.io
  • Collects telemetry data and exposes it through a Web UI
  • Advanced Concepts
  • Scalding adds: Typed API, Matrix API, Graphs, Machine Learning Algorithms
  • What is the future like?
  • So far… (diagram of the stack so far, ordered by abstraction)
  • Summingbird: a unified API for everything. Diagram labels: Real Time, Batch, Hybrid; Storm, Tez, Spark. Enables the Lambda architecture.
  • Questions?
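
The word count slide chains `flatMap`, `groupBy` and `size`. To show that shape without a cluster, here is a minimal sketch using only Scala's standard collections. It is an in-memory analogue, not the Scalding API, and the object name `WordCountSketch` and the sample input are hypothetical:

```scala
// Plain-Scala analogue of the flatMap -> groupBy -> size word count.
// Runs on a local collection; WordCountSketch is an illustrative name.
object WordCountSketch {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      // "Map" side: split every line into whitespace-separated words.
      .flatMap(line => line.split("\\s+").toSeq)
      // Drop the empty tokens that blank or indented lines produce.
      .filter(_.nonEmpty)
      // "Reduce" side: group identical words and count each group.
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }

  def main(args: Array[String]): Unit = {
    val counts = wordCount(Seq("cool Scala cool"))
    // Print one tab-separated line per word, sorted by word.
    counts.toSeq.sortBy(_._1).foreach {
      case (word, n) => println(s"$word\t$n")
    }
  }
}
```

The correspondence with the slide is one-to-one: the `flatMap` step plays the role of the map phase and the `groupBy` plus count plays the role of the reduce phase, which is one reason the Scala collections style carries over so directly to MapReduce pipelines.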