YARN webinar series: Using Scalding to write applications to Hadoop and YARN

This webinar focuses on introducing Scalding for developers and writing applications for Hadoop and YARN using Scalding. Guest speaker Jonathan Coveney from Twitter provides an overview, use cases, limitations, and core concepts.

Published in: Technology

  1. 1. Scalding YARN Webinar Series September 18, 2014 Page 1 © Hortonworks Inc. 2014 Ajay Singh, Director - Hortonworks Jonathan Coveney, Senior Software Engineer - Twitter
  2. 2. Agenda ● Introduction: Ajay Singh, Hortonworks o Modern Data Architecture and how Cascading and Scalding fit in ● Scalding: Jonathan Coveney, Twitter o Why Scalding? o Core Concepts and Limitations o Scalding at Twitter o Resources
  3. 3. Speakers Ajay Singh is Hortonworks Director of Technical Channels and leads the strategic alliances with partners from a technology standpoint, such as driving alignment on roadmaps, product certifications and demos. Ajay is dedicated to building, scaling and delivering exceptional go-to-market solutions with partners. Jonathan Coveney currently works at Twitter, where he has spent a lot of time maintaining and updating Scalding; in the past, he has worked extensively on Apache Pig. He is deeply interested in functional programming, as well as developing usable, scalable APIs for data processing at scale.
  4. 4. A Modern Data Architecture [Diagram] ● Applications: Business Analytics, Custom Applications, Packaged Applications ● Data System: Repositories (RDBMS, EDW, MPP) and Enterprise Hadoop (Governance & Integration, Security, Operations, Data Access, Data Management) ● Sources: Existing Sources (CRM, ERP, Clickstream, Logs), Emerging Sources (Sensor, Sentiment, Geo, Unstructured) ● Dev & Data Tools: Build & Test ● Operational Tools: Manage & Monitor
  5. 5. HDP 2.1: Enterprise Hadoop (Hortonworks Data Platform) [Diagram] ● Governance & Integration: Data Workflow, Lifecycle & Governance (Falcon, Sqoop, Flume, NFS, WebHDFS) ● Data Access: Batch (MapReduce), Script (Pig), SQL (Hive/Tez, HCatalog), NoSQL (HBase, Accumulo), Stream (Storm), Search (Solr), Others (In-Memory Analytics, ISV engines), Cascading ● Data Management: YARN: Data Operating System, HDFS (Hadoop Distributed File System) ● Security: Authentication, Authorization, Accounting, Data Protection o Storage: HDFS o Resources: YARN o Access: Hive, … o Pipeline: Falcon o Cluster: Knox ● Operations: Provision, Manage & Monitor (Ambari, Zookeeper), Scheduling (Oozie) ● Deployment Choice: Linux, Windows, On-Premise, Cloud
  6. 6. Cascading SDK HDP Integrates and delivers Cascading SDK • Collection of tools, documentation, libraries, tutorials and example projects • Key Benefits • Simplified Development • Multi Language Support • Reuse existing skills and tools • Native YARN Integration Hortonworks delivers Enterprise support • Backed by Concurrent Hortonworks and Concurrent Advance Enterprise Data Application Development on Hadoop
  7. 7. HDP Integration of Cascading SDK • Write once and deploy on your fabric of choice • Integration with data processing layer allows Cascading to take advantage of advances in interactive applications • Sep 17th - Cascading 3.0 WIP Now Supports Apache Tez – http://www.cascading.org/2014/09/17/cascading-3-0-wip-now-supports-apache-tez/ [Diagram] Presentation & Application: Java (Cascading), Scala (Scalding), SQL (Lingual), ML (Pattern), shown over both Batch Data Processing (MapReduce, current) and Interactive Data Processing (Tez, WIP), on top of Efficient Cluster Resource Management & Shared Services (YARN). Enable both existing and new applications to provide value to the organization
  8. 8. Cascading.org Scalding Resources Scalding Resources on Cascading.org • Videos and Tutorials • Mailing List • Newsletter Cascading 3.0 WIP With Tez Support • https://github.com/cwensel/cascading/tree/wip-3.0/cascading-hadoop2-tez Scalding Training Debuts This Fall • In-person, 1-day class with labs • Email: info@cascading.io
  9. 9. Jonathan Coveney, Twitter, @jco
  10. 10. Why Scalding? Writing raw MapReduce is difficult! ● Scalding is o Less verbose o Less error prone (type checking!) o Easier to evolve o Performant enough
  11. 11. But what about Hive and Pig? ● Really good for certain things o Excellent for quick, ad-hoc work o Easy to understand o Can leverage existing knowledge (i.e. SQL) ● Not always the best for maintainability o Composition isn’t great o Testing is difficult o Type safety is lacking
  12. 12. So… Cascading? ● Still pretty verbose! ● But you can use normal Java tools o Maven o JUnit o IDEs ● Handles the low level details for you ● A good target for higher level languages
  13. 13. Scalding ● Concise, expressive syntax ● Testable ● Abstractable ● Composable Because it’s in a full-featured, functional language!
  14. 14. But Scala is scary! ● Scalding doesn’t force you to use more complicated features ● Can just write less-verbose Java if desired ● Functional programming is an important paradigm, especially for big data Learning new things is good for your brain :)
  15. 15. Example Scalding job
      class Webinar(args: Args) extends Job(args) {
        import TDsl._
        TextLine(args("input"))             // read the input as lines of text
          .flatMap { _.split("\\s+") }      // split each line on whitespace
          .map { w => (w, 1L) }             // pair every word with a count of 1
          .group                            // group by word (reduce phase)
          .sum                              // add up the counts per word
          .write(TypedTsv[(String, Long)](args("output")))
      }
      “Hadoop is a system for counting words” -Oscar Boykin, @posco
  16. 16. Core concepts ● Source o How to read or write data ● TypedPipe[T] o A distributed list of T o Kind of like a Seq[T] in Scala’s collections library ● Grouped[K, T] o A grouping on K o Represents transition to reduce phase
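The "kind of like a Seq[T]" analogy can be tried locally. The following is a minimal in-memory sketch that uses only Scala's standard collections, not Scalding: the object name `LocalWordCount` is made up for illustration, a `Seq[String]` stands in for a `TypedPipe[String]`, and `groupBy` followed by a per-key sum stands in for `.group` and `.sum`.

```scala
object LocalWordCount {
  // A Seq[String] plays the role of a TypedPipe[String] here (assumption:
  // this is only an analogy, nothing is distributed).
  def wordCount(lines: Seq[String]): Map[String, Long] =
    lines
      .flatMap(_.split("\\s+").toSeq)   // one line -> zero or more words
      .filter(_.nonEmpty)               // drop the empty token an empty line yields
      .map(w => (w, 1L))                // pair each word with a count of 1
      .groupBy(_._1)                    // rough analogue of .group
      .map { case (word, pairs) => word -> pairs.map(_._2).sum } // analogue of .sum
}
```

Running `LocalWordCount.wordCount(Seq("hadoop counts words", "words words"))` gives a map from each word to its count, which is handy for checking the logic of a job before running it on a cluster.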
  17. 17. Word Co-Occurrence
      TextLine(args("input"))
        .flatMap { line =>
          val words = line.split("\\s+")
          // emit (word, Map(otherWord -> 1)) for every ordered pair of distinct words
          for (w1 <- words; w2 <- words if (w1 != w2)) yield (w1, Map(w2 -> 1L))
        }.group[String, Map[String, Long]]
        .sum                                // merges the maps, adding counts per key
        .flatMap { case (word, wordMap) =>
          wordMap.map { case (otherWord, count) => (word, otherWord, count) }
        }.write(TypedTsv[(String, String, Long)](args("output")))
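To see what the co-occurrence job computes, here is a small local sketch of the same logic on plain Scala collections. It is not Scalding code: `LocalCoOccurrence` and `mergeCounts` are hypothetical names, and `mergeCounts` is a hand-written stand-in for what `.sum` does to `Map[String, Long]` values.

```scala
object LocalCoOccurrence {
  // Hand-written stand-in for summing two count maps: add values key by key.
  def mergeCounts(a: Map[String, Long], b: Map[String, Long]): Map[String, Long] =
    b.foldLeft(a) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0L) + v) }

  def coOccurrences(lines: Seq[String]): Set[(String, String, Long)] = {
    // Same shape as the job: (word, Map(otherWord -> 1)) per ordered pair of distinct words.
    val pairs: Seq[(String, Map[String, Long])] = lines.flatMap { line =>
      val words = line.split("\\s+").filter(_.nonEmpty).toSeq
      for (w1 <- words; w2 <- words if w1 != w2) yield (w1, Map(w2 -> 1L))
    }
    pairs
      .groupBy(_._1)                    // analogue of .group
      .toSeq
      .flatMap { case (word, group) =>
        // analogue of .sum over the maps in each group
        val merged = group.map(_._2).foldLeft(Map.empty[String, Long])(mergeCounts)
        merged.map { case (other, count) => (word, other, count) }
      }
      .toSet
  }
}
```

For the two lines `"a b"` and `"a b c"`, the pair `(a, b)` is counted twice and `(a, c)` once, and the result is symmetric because every ordered pair is emitted.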
  18. 18. Important concepts Scalding leverages a lot of Scala idioms, as well as concepts from functional programming ● map o a 1 to 1 mapping for every piece of data ● flatMap o a 1 to 0 or more mapping for every piece of data
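The difference between the two operations is easy to see on an ordinary `Seq`. This is a small standard-library sketch (the object name `MapVsFlatMap` is only for illustration); the same method names exist on `TypedPipe`.

```scala
object MapVsFlatMap {
  val lines: Seq[String] = Seq("to be", "", "or not")

  // map: exactly one output per input, so three lines give three lengths.
  val lengths: Seq[Int] = lines.map(_.length)

  // flatMap: zero or more outputs per input; the empty line contributes nothing.
  val words: Seq[String] =
    lines.flatMap(_.split("\\s+").filter(_.nonEmpty).toSeq)
}
```

Here `lengths` has one element per line, while `words` has four elements drawn from only two of the three lines.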
  19. 19. Important concepts (continued) ● Typeclasses o The separation of computation from data types o Think Java’s Comparator (but way more powerful) o These are what power .sum
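A rough illustration of the typeclass idea behind `.sum`: the sketch below defines a deliberately tiny, hypothetical `Monoid` trait in plain Scala. It is not Algebird's actual definition, which is richer; it only shows how the "how to add" logic can live outside the data type and be picked up implicitly, including for nested types such as `Map[K, V]`.

```scala
object TypeclassSketch {
  // Hypothetical minimal typeclass: how to combine values of T, and what "empty" is.
  trait Monoid[T] {
    def zero: T
    def plus(a: T, b: T): T
  }

  object Monoid {
    implicit val longMonoid: Monoid[Long] = new Monoid[Long] {
      def zero: Long = 0L
      def plus(a: Long, b: Long): Long = a + b
    }

    // A Map is summable whenever its values are: combine values key by key.
    implicit def mapMonoid[K, V](implicit v: Monoid[V]): Monoid[Map[K, V]] =
      new Monoid[Map[K, V]] {
        def zero: Map[K, V] = Map.empty
        def plus(a: Map[K, V], b: Map[K, V]): Map[K, V] =
          b.foldLeft(a) { case (acc, (k, x)) =>
            acc.updated(k, acc.get(k).map(v.plus(_, x)).getOrElse(x))
          }
      }
  }

  // One generic function works for every type that has a Monoid instance.
  def sumAll[T](xs: Seq[T])(implicit m: Monoid[T]): T =
    xs.foldLeft(m.zero)(m.plus)
}
```

The same `sumAll` call works on `Seq[Long]` and on `Seq[Map[String, Long]]` without either type knowing about the other, which is the sense in which computation is separated from data types.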
  20. 20. Limitations Scalding’s limitations are MapReduce’s limitations ● Bad at iterative jobs ● Lots of checkpointing, serialization, sorting However... ● Cascading on Tez could help! o in progress as part of Cascading 3.0 ● So could Cascading on Spark!
  21. 21. The cutting edge ● REPL support ● Executor[T] o Decoupling TypedPipes from specifics of the execution engine o Makes iterative algorithms much easier to express ● Macros o Allowing easier use of case classes o Closure analysis?
  22. 22. Scalding at Twitter ● Thousands of users o Engineers AND data scientists ● Many thousands of jobs every day o ETL o Recommendations o Email o Time series analysis When you use Twitter, you’re using features powered by Scalding!
  23. 23. Useful practices ● A standardized “Job” subclass with company-specific information o Want the common case to be as simple as possible o Especially should configure serialization for users ● Separate data from functions on data o At Twitter, this means Thrift for data, and various Scala functions operating on that data o Decouples the specification of some data from the derived data people want based on it
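The "separate data from functions on data" practice might look roughly like this. Everything here is hypothetical: the `PageView` case class stands in for a Thrift-generated record, and the function names are invented for illustration rather than taken from Twitter's code.

```scala
object DataVsFunctions {
  // Hypothetical record; in the practice described above this would be a
  // Thrift-defined type rather than a hand-written case class.
  final case class PageView(userId: Long, url: String)

  // Derived data lives in plain functions over the record, so the data
  // specification does not have to change when new derived values are needed.
  def isSecure(view: PageView): Boolean = view.url.startsWith("https://")

  def viewsPerUser(views: Seq[PageView]): Map[Long, Long] =
    views.groupBy(_.userId).map { case (user, vs) => user -> vs.size.toLong }
}
```

Because the functions are ordinary Scala, they can be unit tested on small in-memory collections and then reused inside a job's `map` or `filter` steps.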
  24. 24. Q&A
  25. 25. Contribute! ● Scalding ● Algebird o Math inspired aggregators (.sum uses it) ● Bijection o Conversion and serialization made fun ● Summingbird o Abstraction for batch and online map/reduce (see resources for more)
  26. 26. More resources Scalding/Algebird • Oscar Boykin: Algebra for Scalable Analytics • Avi Bryant: Add ALL the Things • Oscar Boykin, Argyris Zymnis: Scalding: Powerful & Concise MapReduce You might also be interested in… • Summingbird! Streaming real-time and batch analytics, unified and made beautiful • Oscar Boykin: Introduction to Summingbird • Oscar Boykin, Sam Ritchie, Ian O’Connell, Jimmy Lin: Summingbird, A Framework for Integrating Batch and Online MapReduce Computations
  27. 27. Next Webinar – Oct 2 - Spark Writing applications to Hadoop and YARN using Spark • October 2nd at 9am Pacific Time • Register Find all webinars • Hortonworks.com/webinars Find past recorded webinars • Hortonworks.com/webinars/#library
  28. 28. Thank you!