Introduction to Data Analyst Training


Published in: Technology, Education

  1. 1. Cloudera Data Analyst Training: Using Pig, Hive, and Impala with Hadoop
     Tom Wheeler
  2. 2. Agenda
     • Why Cloudera Training?
     • Target Audience and Prerequisites
     • Course Outline
     • Short Presentation Based on Actual Course Material
       – Understanding Hadoop, Pig, Hive, Sqoop, and Impala
     • Question and Answer Session
  3. 3. Rising demand for Big Data and analytics experts but a deficiency of talent will result in a shortfall of 32,000 trained professionals by 2015.
     Source: Accenture, “Analytics in Action,” March 2013.
  4. 4. Why Cloudera Training?
     1. Broadest Range of Courses: Developer, Admin, Analyst, HBase, Data Science
     2. Most Experienced Instructors: More than 15,000 students trained since 2009
     3. Widest Geographic Coverage: 50 cities worldwide plus online
     4. Leader in Certification: Over 5,000 accredited Cloudera professionals
     5. Leading Platform & Community: CDH deployed more than all other distributions combined
     6. Relevant Training Material: Classes updated regularly as tools evolve
     7. Practical Hands-On Exercises: Real-world labs complement live instruction
     8. Ongoing Learning: Video tutorials and e-learning complement training
  5. 5. Cloudera Trains the Top Companies
     • Big Data professionals from 55% of the Fortune 100 have attended live Cloudera training
     • Cloudera has trained employees from 100% of the top 20 global technology firms to use Hadoop
     Source: Fortune, “Fortune 500” and “Global 500,” May 2012.
  6. 6. What Do Our Students Say?
     • 94% would recommend or highly recommend Cloudera training to friends and colleagues
     • 88% indicate Cloudera training provided the Hadoop expertise their roles require
     • 66% draw on lessons from Cloudera training on at least a monthly basis
     Sources: Cloudera Past Public Training Participant Study, December 2012; Cloudera Customer Satisfaction Study, January 2013.
  7. 7. Cloudera is the best vendor evangelizing the Big Data movement and is doing a great service promoting Hadoop in the industry. Developer training was a great way to get started on my journey.
  8. 8. About the Course
     Cloudera Data Analyst Training: Using Pig, Hive, and Impala with Hadoop
  9. 9. Intended Audience
     • This course was created for people in analytical roles, including
       – Data Analyst
       – Business Intelligence Analyst
       – Operations Analyst
       – Reporting Specialist
     • Also useful for others who want to use high-level Big Data tools
       – Business Intelligence Developer
       – Data Warehouse Engineer
       – ETL Developers
  10. 10. Who Should Not Take this Course
     • Developers who want to learn details of MapReduce programming
       – Recommend Cloudera Developer Training for Apache Hadoop
     • System administrators who want to learn how to install/configure tools
       – Recommend Cloudera Administrator Training for Apache Hadoop
  11. 11. Course Prerequisites
     • No prior knowledge of Hadoop is required
     • What is required is an understanding of
       – Basic relational database concepts
       – Basic knowledge of SQL
           SELECT id, first_name, last_name
           FROM customers
           ORDER BY last_name;
       – Basic end-user UNIX commands
           $ mkdir /data
           $ cd /data
           $ rm /home/tomwheeler/salesreport.txt
  12. 12. Course Objectives
     During this course, you will learn
     • The purpose of Hadoop and its related tools
     • The features that Pig, Hive, and Impala offer for data acquisition, storage, and analysis
     • How to identify typical use cases for large-scale data analysis
     • How to load data from relational databases and other sources
     • How to manage data in HDFS and export it for use with other systems
     • How Pig, Hive, and Impala improve productivity for typical analysis tasks
     • The language syntax and data formats supported by these tools
  13. 13. Course Objectives (cont’d)
     • How to design and execute queries on data stored in HDFS
     • How to join diverse datasets to gain valuable business insight
     • How to analyze structured, semi-structured, and unstructured data
     • How Hive and Pig can be extended with custom functions and scripts
     • How to store and query data for better performance
     • How to determine which tool is the best choice for a given task
  14. 14. Course Outline
     • Hadoop Fundamentals
       – Hands-On Exercise: Data Ingest with Hadoop Tools
     • Introduction to Pig
     • Basic Data Analysis with Pig
       – Hands-On Exercise: Using Pig for ETL Processing
     • Processing Complex Data with Pig
       – Hands-On Exercise: Analyzing Ad Campaign Data with Pig
     • Multi-Dataset Operations with Pig
       – Hands-On Exercise: Analyzing Disparate Data Sets with Pig
     • Extending Pig
       – Hands-On Exercise: Extending Pig with Streaming and UDFs
  15. 15. Course Outline (cont’d)
     • Pig Troubleshooting and Optimization
       – Demo: Troubleshooting a Failed Job with the Web UI
     • Introduction to Hive
     • Relational Data Analysis with Hive
       – Hands-On Exercise: Running Hive Queries on the Shell, Scripts, and Hue
     • Hive Data Management
       – Hands-On Exercise: Data Management with Hive
     • Text Processing with Hive
       – Hands-On Exercise: Gaining Insight with Sentiment Analysis
     • Hive Optimization
     • Extending Hive
       – Hands-On Exercise: Data Transformation with Hive
  16. 16. Course Outline (cont’d)
     • Introduction to Impala
     • Analyzing Data with Impala
       – Hands-On Exercise: Interactive Analysis with Impala
     • Choosing the Best Tool for the Job
  17. 17. Velocity
     • We are generating data faster than ever
       – Processes are increasingly automated
       – People are increasingly interacting online
       – Systems are increasingly interconnected
  18. 18. Variety
     • We are producing a wide variety of data
       – Social network connections
       – Images, audio, and video
       – Server and application log files
       – Product ratings on shopping and review Web sites
       – And much more…
     • Not all of this maps cleanly to the relational model
  19. 19. Volume
     • Every day…
       – More than 1.5 billion shares are traded on the New York Stock Exchange
       – Facebook stores 2.7 billion comments and ‘Likes’
       – Google processes about 24 petabytes of data
     • Every minute…
       – Foursquare handles more than 2,000 check-ins
       – TransUnion makes nearly 70,000 updates to credit files
     • And every second…
       – Banks process more than 10,000 credit card transactions
  20. 20. Data Has Value
     • This data has many valuable applications
       – Product recommendations
       – Predicting demand
       – Marketing analysis
       – Fraud detection
       – And many, many more…
     • We must process it to extract that value
       – And processing all the data can yield more accurate results
  21. 21. We Need a System that Scales
     • We’re generating too much data to process with traditional tools
     • Two key problems to address
       – How can we reliably store large amounts of data at a reasonable cost?
       – How can we analyze all the data we have stored?
  22. 22. What is Apache Hadoop?
     • Scalable and economical data storage and processing
       – Distributed and fault-tolerant
       – Harnesses the power of industry standard hardware
     • Heavily inspired by technical documents published by Google
     • ‘Core’ Hadoop consists of two main components
       – Storage: the Hadoop Distributed File System (HDFS)
       – Processing: MapReduce
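To make the MapReduce processing model named on this slide more concrete, here is a minimal single-process sketch in Python. It is not Hadoop code: the `map_reduce` helper, the `mapper` and `reducer` functions, and the sample page-view lines are all hypothetical, and a real cluster would distribute each phase across many machines.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run a map phase, group values by key (the shuffle), then reduce each group."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return {key: reducer(key, values) for key, values in grouped.items()}


def mapper(line):
    # Each hypothetical input line is "<time> <product>"; emit (product, 1).
    product = line.split()[1]
    yield product, 1


def reducer(key, values):
    # Sum the 1s emitted for this product to get its page-view count.
    return sum(values)


# Hypothetical sample input, standing in for files stored in HDFS.
views = ["10:01 Tablet", "10:02 Notebook", "10:03 Tablet"]
counts = map_reduce(views, mapper, reducer)
print(counts)
```

Even this small count needs a mapper, a grouping step, and a reducer, which is the kind of low-level work that higher-level tools aim to hide.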
  23. 23. Apache Pig
     • Apache Pig builds on Hadoop to offer high-level data processing
       – This is an alternative to writing low-level MapReduce code
       – Pig is especially good at joining and transforming data

         people = LOAD '/user/training/customers' AS (cust_id, name);
         orders = LOAD '/user/training/orders' AS (ord_id, cust_id, cost);
         groups = GROUP orders BY cust_id;
         totals = FOREACH groups GENERATE group, SUM(orders.cost) AS t;
         result = JOIN totals BY group, people BY cust_id;
         DUMP result;
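For readers new to Pig Latin, the data flow in the script on this slide can be traced step by step in plain Python. This is only an illustration of what each statement computes, not how Pig executes it; the in-memory `people` and `orders` lists are hypothetical stand-ins for the two files the script loads.

```python
from collections import defaultdict

# Hypothetical stand-ins for the two LOAD statements.
people = [(1, "Alice"), (2, "Bob")]                      # (cust_id, name)
orders = [(100, 1, 20.0), (101, 2, 5.0), (102, 1, 7.5)]  # (ord_id, cust_id, cost)

# GROUP orders BY cust_id
groups = defaultdict(list)
for ord_id, cust_id, cost in orders:
    groups[cust_id].append(cost)

# FOREACH groups GENERATE group, SUM(orders.cost) AS t
totals = {cust_id: sum(costs) for cust_id, costs in groups.items()}

# JOIN totals BY group, people BY cust_id (inner join; keeps fields from both sides)
result = [
    (cust_id, totals[cust_id], cust_id, name)
    for cust_id, name in people
    if cust_id in totals
]

# DUMP result
for row in result:
    print(row)
```

Each Pig statement names an intermediate relation, which is why the script reads as a sequence of transformations rather than a single query.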
  24. 24. Use Case: ETL Processing
     • Pig is also widely used for Extract, Transform, and Load (ETL) processing
     [Diagram: data from Operations, Accounting, and Call Center is processed by Pig jobs running on a Hadoop cluster (validate data, fix errors, remove duplicates, encode values) and loaded into the Data Warehouse]
  25. 25. Apache Hive
     • Hive is another abstraction on top of MapReduce
       – Like Pig, it also reduces development time
       – Hive uses a SQL-like language called HiveQL

         SELECT customers.cust_id, SUM(cost) AS total
         FROM customers
         JOIN orders
         ON customers.cust_id = orders.cust_id
         GROUP BY customers.cust_id
         ORDER BY total DESC;
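The query on this slide is ordinary enough SQL that it can be tried without a cluster. The sketch below runs it against an in-memory SQLite database from Python purely as a stand-in: SQLite is not Hive, the table definitions and sample rows are assumptions, and HiveQL features beyond this simple join and aggregate may behave differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema and rows; the slide shows only the query itself.
conn.executescript("""
    CREATE TABLE customers (cust_id INTEGER, name TEXT);
    CREATE TABLE orders (ord_id INTEGER, cust_id INTEGER, cost REAL);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (100, 1, 20.0), (101, 2, 5.0), (102, 1, 7.5);
""")

# The query text from the slide, unchanged.
query = """
    SELECT customers.cust_id, SUM(cost) AS total
    FROM customers
    JOIN orders
    ON customers.cust_id = orders.cust_id
    GROUP BY customers.cust_id
    ORDER BY total DESC;
"""

rows = conn.execute(query).fetchall()
for cust_id, total in rows:
    print(cust_id, total)
```

Compared with the Pig script for the same task, the whole computation is stated as one declarative query, which is the main appeal for analysts who already know SQL.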
  26. 26. Use Case: Log File Analytics
     • Server log files are an important source of data
     • Hive allows you to treat a directory of log files like a table
       – Allows SQL-like queries against raw data
     Dualcore Inc. Public Web Site (June 1 - 8)
     Product   Unique Visitors  Page Views  Bounce Rate  Conversion Rate  Average Time on Page
     Tablet    5,278            5,894       23%          65%              17 seconds
     Notebook  4,139            4,375       47%          31%              23 seconds
     Stereo    2,873            2,981       61%          12%              42 seconds
     Monitor   1,749            1,862       74%          19%              26 seconds
     Router    987              1,139       56%          17%              37 seconds
     Server    314              504         48%          28%              53 seconds
     Printer   86               97          27%          64%              34 seconds
  27. 27. Apache Sqoop
     • Sqoop exchanges data between a database and Hadoop
     • It can import all tables, a single table, or a portion of a table into HDFS
       – Result is a directory in HDFS containing comma-delimited text files
     • Sqoop can also export data from HDFS back to the database
     [Diagram: Database ⇄ Hadoop Cluster]
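To picture the result of an import as described on this slide (a directory of comma-delimited text files), the following Python sketch mimics that outcome on the local filesystem. It does not use Sqoop or HDFS: the SQLite table, the temporary directory, and the file name `part-m-00000` are assumptions chosen for illustration.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical source database with one small table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (cust_id INTEGER, name TEXT);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
""")

# One output directory per table; a local temp directory stands in for HDFS.
out_dir = Path(tempfile.mkdtemp()) / "customers"
out_dir.mkdir()
part_file = out_dir / "part-m-00000"  # assumed file name, for illustration only

# Write each row as one comma-delimited line of text.
with part_file.open("w", newline="") as f:
    writer = csv.writer(f, lineterminator="\n")
    writer.writerows(
        conn.execute("SELECT cust_id, name FROM customers ORDER BY cust_id")
    )

contents = part_file.read_text()
print(contents)
```

Because the imported data is plain delimited text, tools such as Pig and Hive can read the directory directly once it is in place.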
  28. 28. Cloudera Impala
     • Massively parallel SQL engine which runs on a Hadoop cluster
       – Inspired by Google’s Dremel project
       – Can query data stored in HDFS or HBase tables
     • High performance
       – Typically at least 10 times faster than Pig, Hive, or MapReduce
       – High-level query language (subset of SQL)
     • Impala is 100% Apache-licensed open source
  29. 29. Where Impala Fits Into the Data Center
     [Diagram labels:]
     • Transaction Records from Application Database
     • Log Data from Web Servers
     • Documents from File Server
     • Hadoop Cluster with Impala
     • Analyst using Impala shell for ad hoc queries
     • Analyst using Impala via BI tool
  30. 30. Recap of Data Analysis/Processing Tools
     • MapReduce
       – Low-level processing and analysis
     • Pig
       – Procedural data flow language executed using MapReduce
     • Hive
       – SQL-based queries executed using MapReduce
     • Impala
       – High-performance SQL-based queries using a custom execution engine
  31. 31. Comparing Pig, Hive, and Impala
     Description of Feature              Pig   Hive  Impala
     SQL-based query language            No    Yes   Yes
     User-defined functions (UDFs)       Yes   Yes   No
     Process data with external scripts  Yes   Yes   No
     Extensible file format support      Yes   Yes   No
     Complex data types                  Yes   Yes   No
     Query latency                       High  High  Low
     Built-in data partitioning          No    Yes   Yes
     Accessible via ODBC / JDBC          No    Yes   Yes
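One way to use the comparison on this slide is to treat it as a lookup table and filter the tools by the features a task needs. The Python sketch below encodes the slide's rows as given (they reflect the tools as of this 2013 presentation); the feature keys and the `candidates` helper are hypothetical names, and "low latency" is recorded as true only for the tool the slide lists with low query latency.

```python
TOOLS = ["Pig", "Hive", "Impala"]

# Rows copied from the slide's table; the keys are hypothetical short names.
FEATURES = {
    "sql": {"Pig": False, "Hive": True, "Impala": True},
    "udfs": {"Pig": True, "Hive": True, "Impala": False},
    "external_scripts": {"Pig": True, "Hive": True, "Impala": False},
    "extensible_file_formats": {"Pig": True, "Hive": True, "Impala": False},
    "complex_types": {"Pig": True, "Hive": True, "Impala": False},
    "low_latency": {"Pig": False, "Hive": False, "Impala": True},
    "partitioning": {"Pig": False, "Hive": True, "Impala": True},
    "odbc_jdbc": {"Pig": False, "Hive": True, "Impala": True},
}


def candidates(*required):
    """Return the tools that the table marks as supporting every required feature."""
    return [tool for tool in TOOLS if all(FEATURES[f][tool] for f in required)]


print(candidates("sql", "low_latency"))
print(candidates("sql", "udfs"))
print(candidates("udfs", "complex_types"))
```

A filter like this only narrows the field; when several tools remain, the choice still depends on factors the table does not capture, such as team skills and existing scripts.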
  32. 32. • Submit questions in the Q&A panel
     • Watch on-demand video of this webinar at
     • Follow Cloudera University @ClouderaU
     • Attend Tom’s talk at OSCON:
     • Or Tom’s talks at StampedeCon:
     • Thank you for attending!
     Register now for Cloudera training at
     • Use discount code Wheeler_10 to save 10% on new enrollments in Data Analyst Training classes delivered by Cloudera until September 1, 2013
     • Use discount code 15off2 to save 15% on enrollments in two or more training classes delivered by Cloudera until September 1, 2013