Introduction to Designing and Building Big Data Applications


Learn what the course covers, from capturing data to building a search interface; the spectrum of processing engines, Apache projects, and ecosystem tools available for converged analytics; who is best suited to attend the course and what prior knowledge you should have; and the benefits of building applications with an enterprise data hub.

Published in: Software, Technology

  1. Introduction to Designing and Building Big Data Applications
     Tom Wheeler | Senior Curriculum Developer | April 2014
  2. Agenda
     • Cloudera's Learning Path for Developers
     • Target Audience and Prerequisites
     • Course Outline
     • Short Presentation Based on Actual Course Material
     • Question and Answer Session
  3. Learning Path: Developers (Create Powerful New Data Processing Tools)
     • Developer Training
       • Learn to code and write MapReduce programs for production
       • Master advanced API topics required for real-world data analysis
     • HBase Training
       • Design schemas to minimize latency on massive data sets
       • Scale hundreds of thousands of operations per second
     • Intro to Data Science
       • Implement recommenders and data experiments
       • Draw actionable insights from analysis of disparate data
     • Big Data Applications
       • Build converged applications using multiple processing engines
       • Develop enterprise solutions using components across the EDH
     • Pictured: Aaron T. Myers, Software Engineer
  4. Hadoop Professionals: Build or Buy? Professional Certification Decreases Hiring Risk
     • 25%: An engineer with Hadoop skills requires a minimum salary premium of 25%
     • $115K: Hadoop developers are now the top paid in tech, starting at $115K
     • $300K: Compensation for a very senior Data Scientist opens at $300K
     • Sources: Business Insider, “10 Tech Skills That Will Instantly Net You A $100,000+ Salary,” 11 August 2012. Business Insider, “30 Tech Skills That Will Instantly Net You A $100,000+ Salary,” 21 February 2013. GigaOm, “Big Data Skills Bring Big Dough,” 17 February 2012.
  5. Why Cloudera Training? Aligned to Best Practices and the Pace of Change
     1 Broadest Range of Courses: Developer, Admin, Analyst, HBase, Data Science
     2 Most Experienced Instructors: More than 20,000 students trained since 2009
     3 Leader in Certification: Over 8,000 accredited Cloudera professionals
     4 Trusted Source for Training: 100,000+ people have attended online courses
     5 State of the Art Curriculum: Courses updated as Hadoop evolves
     6 Widest Geographic Coverage: Most classes offered, in 50 cities worldwide plus online
     7 Most Relevant Platform & Community: CDH deployed more than all other distributions combined
     8 Depth of Training Material: Hands-on labs and VMs support live instruction
     9 Ongoing Learning: Video tutorials and e-learning complement training
     10 Commitment to Big Data Education: University partnerships to teach Hadoop in the classroom
  6. About the Course: Designing and Building Big Data Applications
  7. Target Audience
     • Intended for people who write code, such as
       • Software Engineers
       • Data Engineers
       • ETL Developers
  8. Course Prerequisites
     • Successful completion of our Developer course
       • Or equivalent practical experience
     • Intermediate-level Java skills
     • Basic familiarity with Linux
     • Knowledge of SQL or HiveQL is also helpful
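  9. Example of Required Java Skill Level
     • Do you understand the following code? Could you write something similar?

         package com.cloudera.example;

         import java.io.IOException;
         import org.apache.hadoop.io.IntWritable;
         import org.apache.hadoop.io.LongWritable;
         import org.apache.hadoop.io.Text;
         import org.apache.hadoop.mapreduce.Mapper;

         // A Mapper that reads (byte offset, line) pairs and emits (Text, IntWritable) pairs
         public class Example extends Mapper<LongWritable, Text, Text, IntWritable> {

           @Override
           public void map(LongWritable key, Text value, Context ctx)
               throws IOException, InterruptedException {
             // The slide ends here; the method body is not shown
           }
         }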
  10. Example of Required Linux Skill Level
     • Are you comfortable editing text files on a Linux system?
     • Are you familiar with the following commands?

         $ mkdir -p /tmp/incoming/web_logs
         $ cd /var/log/web
         $ mv *.log /tmp/incoming/web_logs
  11. Course Objectives
     • During this course, you will learn to
       • Determine which Hadoop-related tools are appropriate for specific tasks
       • Understand how file formats, serialization, and data compression affect application compatibility and performance
       • Design and evolve schemas in Apache Avro
       • Create, populate, and access data sets with the Kite SDK
       • Integrate with external systems using Apache Sqoop and Apache Flume
       • Integrate Apache Flume with existing applications and develop custom components to extend Flume’s capabilities
  12. Course Objectives (continued)
     • Create, package, and deploy Oozie jobs to manage processing workflows
     • Develop Java-based data processing pipelines with Apache Crunch
     • Implement user-defined functions for use in Apache Hive and Impala
     • Index both static and streaming data sets with Cloudera Search
     • Use Hue to build a Web-based interface for Search queries
     • Integrate results from Impala and Cloudera Search into your applications
  13. Scenario for Hands-On Exercises
     • Frequent hands-on exercises
       • Based on a hypothetical but realistic scenario
       • Each works towards building a working application
     • Slide logo: Loudacre Mobile
  14. Tools Used in Hands-On Exercises
     • Ingest and Data Management: HDFS, Sqoop, Flume, Kite SDK / Morphlines
     • Interactive Queries: Impala, Search
     • Batch Processing: MapReduce, Crunch, Hive
     • Also shown: HCatalog, Avro
  15. Data Sources Used in Hands-On Exercises
     • Sources feeding the Enterprise Data Hub: RDBMS, Telecom Switches, CRM System, Point of Sale Terminals, Web Servers
     • Data sets
       • Equipment Records
       • Customer Records
       • Call Detail Records (Fixed-Width)
       • Phone Activations (XML)
       • Static Documents (HTML)
       • Log Files (Text)
       • Device Status (CSV and TSV)
       • Chat Transcripts (JSON)
  16. Development Environment
     • Exercises use a real-world development environment
       • IDE (Eclipse)
       • Unit testing library (JUnit)
       • Build and configuration management tool (Maven)
  17. Course Outline
     • Introduction
     • Application Architecture *
     • Designing and Using Data Sets *
     • Using the Kite SDK Data Module *
     • Importing Relational Data with Apache Sqoop *
     • Capturing Data with Apache Flume *
     * This chapter contains one or more hands-on exercises
  18. Course Outline (continued)
     • Developing Custom Flume Components *
     • Managing Workflows with Apache Oozie *
     • Processing Data Pipelines with Apache Crunch *
     • Working with Tables in Apache Hive *
     • Developing User-Defined Functions *
     • Executing Interactive Queries with Impala *
  19. Course Outline (continued)
     • Understanding Cloudera Search
     • Indexing Data with Cloudera Search *
     • Presenting Results to Users *
     • Conclusion
  20. Course Excerpt
     • Based on chapter 3: Designing and Using Data Sets
  21. What is Data Serialization?
     • Serialization represents data as a series of bytes
       • Allows us to store and transmit data
     • There are many ways of serializing data
       • How do you serialize the number 108125150?
       • 4 bytes when stored as a Java int
       • 9 bytes when stored as text (one byte per decimal digit)
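The two sizes quoted on the slide can be checked with a short sketch. The class and method names below are illustrative only and are not part of the course material; the text encoding is assumed to be UTF-8, which uses one byte per decimal digit.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class SerializationSizes {

    // Serialize the value as a fixed-width binary int (always 4 bytes in Java).
    static byte[] asBinaryInt(int value) {
        return ByteBuffer.allocate(Integer.BYTES).putInt(value).array();
    }

    // Serialize the value as decimal text (one byte per digit in UTF-8).
    static byte[] asText(int value) {
        return Integer.toString(value).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        int number = 108125150;
        System.out.println("binary int: " + asBinaryInt(number).length + " bytes");
        System.out.println("text:       " + asText(number).length + " bytes");
    }
}
```

The same value occupies a different number of bytes depending only on the representation chosen, which is why the choice of serialization format affects storage and I/O.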
  22. Implications of Data Serialization
     • Affects performance and storage space
     • Chosen method may limit portability
       • Java's built-in serialization is Java-specific
       • Writables are Hadoop-specific
     • May also limit backwards compatibility
       • Often depends on a specific version of a class
     • Avro was developed to address these challenges
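As a rough illustration of the storage point, the sketch below compares Java's built-in object serialization of a boxed Integer with a raw 4-byte encoding of the same value. The class and method names are hypothetical, and the exact serialized size depends on the JVM's stream format, so the code prints the size rather than assuming it.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

public class JavaSerializationOverhead {

    // Size of the value when written with Java's built-in object serialization,
    // which also records a stream header and class metadata.
    static int javaSerializedSize(int value) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(Integer.valueOf(value));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }

    // Size of the value when written as a plain 4-byte binary int.
    static int rawIntSize(int value) {
        return ByteBuffer.allocate(Integer.BYTES).putInt(value).array().length;
    }

    public static void main(String[] args) {
        int number = 108125150;
        System.out.println("Java serialization: " + javaSerializedSize(number) + " bytes");
        System.out.println("Raw binary int:     " + rawIntSize(number) + " bytes");
    }
}
```

The extra bytes carry class information that only a Java reader can interpret, which is one way a serialization method can cost space and limit portability at the same time.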
  23. What is Apache Avro?
     • Avro is an open source data serialization framework
       • Widely supported throughout the Hadoop ecosystem
     • Offers compatibility without sacrificing performance
       • Data is serialized according to a schema you define
       • Read and write from Java, C, C++, C#, Python, PHP, etc.
       • Optimized binary encoding for efficient storage
       • Defines rules for schema evolution
  24. Avro Schemas
     • Avro schemas define the structure of your data
       • Similar to a CREATE TABLE in SQL, but more flexible
       • Defined using JSON syntax

         Metadata:  id      name   title        bonus
         Data:      108424  Alice  Salesperson  2500
                    101837  Bob    Manager      3000
                    107812  Chuck  President    9000
                    105476  Dan    Accountant   3000
  25. Simple Types in Avro Schemas
     • These are among the simple (scalar) types in Avro

         Name     Description
         null     An absence of a value
         boolean  A binary value
         int      32-bit signed integer
         long     64-bit signed integer
         float    Single-precision floating point value
         double   Double-precision floating point value
         string   Sequence of Unicode characters
  26. Complex Types in Avro Schemas
     • These are the complex types in Avro

         Name    Description
         record  A user-defined type composed of one or more named fields
         enum    A specified set of values
         array   Zero or more values of the same type
         map     Set of key-value pairs; key is string while value is of specified type
         union   Exactly one value matching a specified set of types
         fixed   A fixed number of 8-bit unsigned bytes
  27. Schema Example
     • SQL CREATE TABLE statement

         CREATE TABLE employees
           (id INT,
            name VARCHAR(30),
            title VARCHAR(20),
            bonus INT);
  28. Schema Example (Continued)
     • Equivalent Avro schema

         {"namespace": "",
          "type": "record",
          "name": "Employee",
          "fields": [
            {"name": "id", "type": "int"},
            {"name": "name", "type": "string"},
            {"name": "title", "type": "string"},
            {"name": "bonus", "type": "int"}
          ]}
  29. Mapping Avro Schema to Java Object
     • Approaches for mapping a Java object to a schema
       • Generic: Write code to map each field manually
       • Reflect: Generate a schema from an existing class
       • Specific: Generate a Java class from your schema
  30. Considerations for File Formats
     • Hadoop and its ecosystem support many file formats
       • May ingest in one format and convert to another
     • Format selection involves several considerations
       • Ingest pattern
       • Tool compatibility
       • Expected lifetime
       • Storage and performance requirements
  31. Data Compression
     • Each file format may also support compression
       • Reduces amount of disk space required to store data
     • Tradeoff between time (CPU spent compressing and decompressing) and space
       • Can greatly improve performance
       • Many Hadoop jobs are I/O-bound, so reading fewer bytes often outweighs the CPU cost
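A minimal sketch of the space side of this tradeoff, using the gzip support in the Java standard library on repetitive log-like text. The class name, the helper methods, and the sample log line are made up for illustration; in a real Hadoop job the codec would normally be configured on the file format or the job rather than called directly like this.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CompressionDemo {

    // Build repetitive, log-like sample data (a hypothetical log line repeated 1000 times).
    static byte[] sampleLog() {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            text.append("GET /index.html HTTP/1.1 200\n");
        }
        return text.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Compress a byte array with gzip.
    static byte[] gzip(byte[] input) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Decompress a gzip byte array back to the original bytes.
    static byte[] gunzip(byte[] compressed) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int read;
            while ((read = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] original = sampleLog();
        byte[] compressed = gzip(original);
        System.out.println("original:   " + original.length + " bytes");
        System.out.println("compressed: " + compressed.length + " bytes");
    }
}
```

Fewer bytes on disk also means fewer bytes to read, which is where the performance gain for I/O-bound jobs comes from; the price is the CPU time spent in `gzip` and `gunzip`.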
  32. Data Partitioning
     • Refers to organizing data according to access patterns
       • Improves performance by limiting input to the partitions a job actually needs
     • Common partitioning schemes
       • Customers: partition by state, province, or region
       • Events: separate by year, month, and day
  33. Partitioning Example
     • Imagine that you store all Web server log files in HDFS
       • Marketing runs monthly jobs for search engine optimization
       • Security runs daily jobs to identify attempted exploits
     • Diagram: a directory tree for 2014 with month directories (March, April, May) and day directories beneath a month (01 through 13 shown); the month directory is the input for the monthly job and a single day directory is the input for the daily job
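The year/month/day layout in this example can be sketched as a pair of path-building helpers. The base directory `/data/web_logs`, the numeric month directories, and the class and method names are assumptions made for illustration; the slide does not specify the actual directory names.

```java
import java.time.LocalDate;
import java.time.YearMonth;

public class LogPartitions {

    // Hypothetical base directory for Web server logs in HDFS.
    static final String BASE_DIR = "/data/web_logs";

    // Input path for a daily job: one day directory, e.g. .../2014/04/07
    static String dayPath(LocalDate date) {
        return String.format("%s/%04d/%02d/%02d",
                BASE_DIR, date.getYear(), date.getMonthValue(), date.getDayOfMonth());
    }

    // Input path for a monthly job: the month directory containing every day of that month.
    static String monthPath(YearMonth month) {
        return String.format("%s/%04d/%02d",
                BASE_DIR, month.getYear(), month.getMonthValue());
    }

    public static void main(String[] args) {
        System.out.println("daily job input:   " + dayPath(LocalDate.of(2014, 4, 7)));
        System.out.println("monthly job input: " + monthPath(YearMonth.of(2014, 4)));
    }
}
```

Because each job is pointed at only the directory it needs, the daily security job never reads the rest of the month, and the monthly marketing job never reads other months.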
  34. Register for training and certification at
     • Use discount code Apps10 to save 10% on new enrollments in Big Data Applications classes delivered by Cloudera until July 4, 2014*
     • Enter questions in the Q&A panel
     • Follow Cloudera University: @ClouderaU
     • Follow the Developer learning path:
     • Learn about the enterprise data hub:
     • Join the Cloudera user community:
     • Get Developer Certification:
     • Explore Developer resources for Hadoop:
     * Excludes classes sold or delivered by other partners