Scalding Presentation

Scalding, Scala, MapReduce
24th Hadoop London Meetup

Published in: Technology, Education

  • Monday, June 30, 2014
    6:30 PM to 9:30 PM

    Barclays Accelerator 69-89 Mile End Road, E1 4UJ, London

    http://www.meetup.com/big-data-london/events/188925412/

  • Here are my contact details.
    You can find a number of open-source projects related to Hadoop & MapReduce that I have been contributing to on GitHub

    And also the technical blog http://scalding.io
  • First book ever available on Scala + MapReduce + Hadoop
    Comes with hundreds of ready-to-run examples

    Book @ Amazon = http://amazon.co.uk/dp/1783287012
    Book @ PACKT = http://packtpub.com/programming-mapreduce-with-scalding/book

    GitHub repository with examples = https://github.com/scalding-io/ProgrammingWithScalding
  • http://github.com/twitter/scalding
  • Once upon a time..
  • Hadoop provides HDFS for the distributed storage of large files and services for coordinated execution of MapReduce tasks
    The Java MapReduce API is very verbose: a simple WordCount example takes 70 lines of code
  • In-memory systems, e.g. Memcached, Redis, etc.
    Document Databases
    Search systems
  • Explain:
    Taps, Tuples, pipes
  • Show parallelism
  • Cascading word count requires 20 lines
    Java MapReduce API word count requires 70 lines

    = We remove 50 lines of code (about 70%) by using Cascading
  • Is this what Scalding adds on top of Cascading?
  • Parquet => Efficient columnar storage
    For a Scalding application to execute, all defined input and output taps must participate in the pipeline.
    Reading & Writing files
  • // 15 map operations – that are translated into map phases
  • How many items does each of our shops sell?
  • Code reference - https://github.com/scalding-io/ProgrammingWithScalding/tree/master/chapter5/src/main/scala/externaloperations
  • Testing is challenging in the context of MapReduce
  • Driven takes Cascading application development to the next level with management and monitoring capabilities for your Cascading apps
  • This is the stack presented so far
  • Scalding Presentation

    1. MapReduce with Scalding. Antonios Chalkiopoulos, 24th Big Data London Meetup. Scalding.io
    2. $ whoami: http://scalding.io, http://github.com/scalding-io, @chalkiopoulos
    3. My recent achievement..
    4. What are we gonna talk about..?
    5. (no text on slide)
    6. A Scala API on top of Cascading
    7. But what is ?
    8. A few years ago I started on a fresh Big Data team…
    9. How do we efficiently develop MapReduce jobs for our new Hadoop cluster?
    10. MapReduce Techs (diagram; axis label: abstraction): Java MapReduce, Hadoop
    11. Java MapReduce word count example
    12. MapReduce Techs (diagram; axis label: abstraction): Java MapReduce, Pig, Hive, Hadoop, Cascading, Others
    13. The promise of Cascading
    14. [1] A simple, high-level Java API for MapReduce that is easy to understand and work with.
    15. [2] Extensions to MANY platforms
    16. Cascading connects to: NoSQL databases (MongoDB, Cassandra, HBase, Accumulo, …), SQL databases, Hadoop filesystem, local filesystem, in-memory systems (Redis, Memcached, …), search platforms (ElasticSearch, Solr, …)
    17. How does it work?
    18. A pipeline architecture
    19. Diagram: data flowing from a Source tap to a Sink tap
    20. Diagram: Log files, Customer Data, Log & Customer, Final Results
    21. Cascading Example
    22. Word count in Cascading

        public class WordCount {
          public static void main(String[] args) {
            Properties properties = new Properties();
            FlowConnector.setApplicationJarClass(properties, WordCount.class);

            // Read text lines from args[0]; write (word, count) pairs to args[1]
            Scheme sourceScheme = new TextLine(new Fields("line"));
            Scheme sinkScheme = new TextLine(new Fields("word", "count"));
            Tap source = new Hfs(sourceScheme, args[0]);
            Tap sink = new Hfs(sinkScheme, args[1], SinkMode.REPLACE);

            // Split each line into words, group by word, count each group
            Pipe assembly = new Pipe("Word Count");
            String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
            Function function = new RegexGenerator(new Fields("word"), regex);
            assembly = new Each(assembly, new Fields("line"), function);
            assembly = new GroupBy(assembly, new Fields("word"));
            Aggregator count = new Count(new Fields("count"));
            assembly = new Every(assembly, count);

            // Connect source, sink and assembly into a flow and run it
            FlowConnector flowConnector = new FlowConnector(properties);
            Flow flow = flowConnector.connect("word-count", source, sink, assembly);
            flow.complete();
          }
        }

        70% less boilerplate code
        But still some infrastructure code
    23. (no text on slide)
    24. No boilerplate code at all; functional; robust & scalable; runs on the JVM
    25. Here it comes. MapReduce Techs (diagram; axis label: abstraction): Java MapReduce, Pig, Hive, Hadoop, Cascading, Others, Scalding
    26. The power of Scala on top of Cascading
    27. Scala fits naturally with data
    28. Word count in Scalding

        import com.twitter.scalding._

        class WordCountJob(args: Args) extends Job(args) {
          TextLine("input.txt").read
            .flatMap('line -> 'word) { line: String => line.split("\\s+") }  // Map phase
            .groupBy('word) { _.size }                                        // Reduce phase
            .write(Tsv("results.tsv"))
        }
    29. Who is using it? Many, many others…
    30. Scalding…
        …was open sourced by Twitter in 2011
        …has more than 100 open source contributors
        …exposes the right abstractions
        …maximizes expressiveness
        …promotes extensibility
        …adds new capabilities to Cascading
    31. Core Concepts
    32. Sources & Sinks

        Tsv("data.tsv", ('productID, 'price, 'quantity))
          .read
          .write(UnpackedAvroSource("data.avro"))

        Supported formats: Tsv, Csv, Osv, Avro, Parquet, …
    33. Map Operations

        pipe1.filter('age) { age: Int => age > 18 }
        pipe1.map('price -> 'withVAT) { price: Double => price * 1.2 }
        pipe1.project('name, 'surname)

        15 map operations translated into map phases
    34. Join operations

        pipe1.joinWithSmaller('productId -> 'productId, pipe2)
        pipe1.joinWithLarger('productId -> 'productId, pipe2)
        pipe1.joinWithTiny('productId -> 'productId, pipe2)

        Optimize by hinting the relative sizes
        Supports Left, Right, Inner, Outer Joins

        pipe1
          .joinWithSmaller('productId -> 'productId, pipe2, joiner = new LeftJoin)
    35. Group operations

        val pipe = Tsv("input", ('shopId, 'itemId, 'quantity))
          .groupBy('shopId) {
            _.sum[Long]('quantity -> 'totalSoldItems)
          }
          .write(Tsv("results.tsv"))

        .groupBy: group by particular fields
        .groupAll: group all data
    36. Pipe operations

        val p = (pipe1 ++ pipe2)          // Concatenate 2 pipes
          .debug                          // Print sample data to screen
          .addTrap(Tsv("bogus_lines"))    // Dirty data are recorded

        Simple pipe operations
    37. Connect with external systems
    38. Scalding + Hive

        class HiveExample(args: Args) extends Job(args) {
          val USER_SCHEMA = List('userId, 'username, 'photo)
          HiveSource("myHiveTable", SinkMode.KEEP)
            .withHCatScheme(osvInputScheme(fields = USER_SCHEMA))
            .write(Tsv("outputFromHive"))
        }

        Define the schema
        Query HCatalog
        Read directly from HDFS
    39. Scalding + ElasticSearch

        val schema = List('number, 'product, 'description)
        val readES = ElasticSearchTap("localhost", 9200, "index firstType", "", schema)
          .read
          .write(Tsv("data/es-out.tsv"))
        val writeES = Tsv("data.tsv")
          .read
          .write(ElasticSearchTap("localhost", 9200, "index/secondType", "", schema))

        Read from ElasticSearch in one line!
        Also index new data in ES
    40. Design patterns
    41. Dependency Injection; Late bound; External Operations
    42. How about defining external operations?

        val pipe1 = Tsv("omniture.tsv", OMNITURE_SCHEMA)
          .read
          .ETLOmnitureData
          .calculateOmnitureUserStats
          .joinWithCustomerDB('userId -> 'userId, customerPipe)
          .write(Tsv("omniture-results.tsv"))

        Custom operations:
          Re-usable modular code
          Single responsibility
          Testability
        Full code: http://bit.ly/1pNSUKf
    43. Scalding Testing
    44. Testing challenges in the context of MR: Acceptance Tests, Unit – Component Tests, System Tests, Integration Tests. Scalding enables testing in every layer & TDD
    45. Example

        class TsvWordCountJobTest extends FlatSpec
          with ShouldMatchers with TupleConversions {
          "WordCountJob" should "count words" in {
            JobTest(new WordCountJob(_))
              .arg("input", "inFile")
              .arg("output", "outFile")
              .source(TextLine("inFile"), List("0" -> "cool Scala cool"))
              .sink[(String, Int)](Tsv("outFile")) { out =>
                out.toList should contain ("cool" -> 2)
              }
              .run
              .finish
          }
        }

        Replaces taps with in-memory collections and asserts the expected output
    46. Monitoring
    47. “Driven takes Cascading application development to the next level with management and monitoring capabilities for your apps” http://driven.cascading.io
    48. Collects telemetry data and exposes it through a web UI
    49. Advanced Concepts
    50. Scalding adds: Typed API, Matrix API, Graphs, Machine learning algorithms
    51. What does the future look like?
    52. So far… (diagram; axis label: abstraction)
    53. Batch, Real Time, Hybrid (diagram; axis label: abstraction). Summingbird: a unified API for everything. Storm, Tez, Spark. Enables the Lambda architecture
    54. Questions?
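
Slides 28 and 35 both follow the same shape: a map step, then a grouped aggregation. As a rough illustration, that shape can be sketched with plain Scala collections, with no Hadoop or Scalding involved. The names `PipelineSketch`, `Sale`, `wordCount` and `totalSoldItems` are hypothetical and not part of Scalding, and everything here runs in memory rather than as MapReduce phases.

```scala
object PipelineSketch {

  // Analogue of slide 28: split lines into words (the "map" side),
  // then group identical words and count each group (the "reduce" side).
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty) // drop empty tokens produced by leading whitespace
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }

  // Hypothetical record type standing in for the ('shopId, 'itemId, 'quantity) fields.
  final case class Sale(shopId: String, itemId: String, quantity: Long)

  // Analogue of slide 35: total quantity sold per shop.
  def totalSoldItems(sales: Seq[Sale]): Map[String, Long] =
    sales
      .groupBy(_.shopId)
      .map { case (shop, rows) => shop -> rows.map(_.quantity).sum }
}
```

For example, `PipelineSketch.wordCount(Seq("cool Scala cool"))` counts "cool" twice, which is the same expectation the test on slide 45 asserts against the real job.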