Introduction to Data Analyst Training


  • Hello everyone – thank you for attending today’s session. I’m going to tell you about our new course designed specifically for Data Analysts and others in similar job roles. My name is Tom Wheeler, and I’m a Senior Curriculum Developer here at Cloudera. I created this course and have spent much of the last year working on it, so there’s nobody more familiar with it – and probably nobody more excited to talk about it – than me. [SEGUE] Here’s an overview of what I’m going to cover today…
  • I’ll begin by briefly explaining a little bit about Cloudera and our training program, then move on to talking about the Data Analyst class itself, including the types of people I had in mind when I created this course and the skills that those people should have to get the most out of it. I’ll then describe the outline to show you specifically what we cover and follow that up with a short presentation based on actual material from the course, so you can get a taste for what it’s really like and also learn a little bit about some of the tools we cover. Finally, I’ll conclude with a Question and Answer session, so let me reiterate that you can use the Q&A tab to submit questions during my presentation and I’ll answer as many of them as I can at the end. [SEGUE] So let me explain why Cloudera has invested so much in creating it and why I think it’s so important that it’s available now…
  • According to a report called “Analytics in Action,” released by Accenture in March, the need for Big Data and analytics experts is so intense that they predict a shortfall of 32,000 professionals with this training just two years from now. Of course, another way to look at this is that the people who do have this training should be in very high demand. And it’s exactly this problem – the unmet need that organizations of all types have for people who know how to analyze data at scale – that our training is meant to help solve. But before we get into the specifics of what this course offers, let me first explain why Cloudera is the best choice for Big Data training.
  • Here are just a few good reasons to choose Cloudera for training: We offer the broadest range of courses in this area, with a training course appropriate for pretty much any role. We teach those classes all over the world – and online too. Our instructors have trained more than 15,000 students in the past four years. Our instructors have years of experience – many of them worked in educational services for database vendors and related companies before they joined Cloudera – so they understand what students need to know and do a fantastic job of explaining it just the right way. More than a third of those students have gone on to earn Cloudera certifications to prove to employers that they have the experience the industry demands. Cloudera’s CDH distribution is the most popular distribution including Apache Hadoop – so popular, in fact, that it’s deployed more widely than all the other distributions combined. This technology changes rapidly, so we’re constantly updating the material to keep it current. Our classes have frequent hands-on exercises designed to let students practice what they’ve just learned by applying it to real-world scenarios using a private, virtualized Hadoop cluster. And we complement this instructor-led training with e-learning, available online, to illustrate related technologies and advanced concepts, so students can continue learning even after they’ve completed the class.
  • And the most successful companies in the world recognize the quality of Cloudera’s training. Big Data professionals from more than half of the Fortune 100 have attended at least one of our instructor-led classes, and every single one of the top 20 global technology firms has trusted Cloudera to train their employees on Hadoop. So what do these students say about the effectiveness of our training?
  • 94% say that they’d recommend – or highly recommend – Cloudera training to their friends and colleagues. 88% say that our training provided the Hadoop expertise they require in their current job roles. And two-thirds say that they use what they’ve learned in class at least once per month while on the job.
  • Statistics are nice because they can help to summarize a huge data set into a convenient number, but sometimes it’s helpful to look at a specific case that goes into that number. Cisco, the global leader in enterprise networking equipment and software, also recognizes the value of our training. … So the bottom line is that while some other companies have started teaching courses on Hadoop, Hive, and related tools, we’ve been doing it for years and remain the best. Our curriculum developers work with other experts at Cloudera – not only the engineers who actually wrote the software, but also the solution architects who work on-site with our customers to help them implement it, and our excellent support team who work with the customers to keep their data centers running smoothly. Having access to experts like these means that we’re able to include tips and advice that help make our training material so valuable for our students. Training courses from other companies that have sprouted up in this space might teach you how to use those tools, but our courses teach you how to use them effectively. We know that your data is important and your time is valuable -- helping you to make the best use of both is what makes our classes stand out from the rest.
  • So now that I’ve covered Cloudera’s training in general, let me talk more specifically about what students can expect from this course. I’ll begin by explaining the intended audience, as well as a few cases where a prospective student might be better suited to one of our other courses.
  • This is one of the most important slides in my entire presentation, because it illustrates what’s unique about this course compared to others that focus on a particular tool. Rather than give an exhaustive list of every aspect and possible feature of specific tools, we focus on what someone using them for data analysis needs to know. The target audience for this course is people whose primary job functions include data analysis, reporting, and decision support. This includes a variety of job titles, including Data Analyst, Business Intelligence Analyst, and Operations Analyst. Although these jobs encompass many different skills, the primary audience will be comfortable with SQL. And that’s essential, because Hadoop has historically been limited to people with deep technical skills – people who had an in-depth knowledge of Java programming and distributed computing. Let me back up and mention that I am one of those people – I have about fifteen years’ experience writing Java as a software engineer and about five years working with Hadoop, and as Jon mentioned, I even helped to design and implement a distributed data processing system with pretty similar goals about ten years ago – if I’d only had Hadoop back then! During my career, I’ve worked with dozens of analysts and I can tell you from first-hand experience that the last thing they want to do is wait for someone else to write a custom program just so they can crunch some numbers. An analyst already knows how to query a database to get an answer. So the types of people I’ve listed here have historically been under-served by Hadoop, but that’s changing with high-level analysis tools like Pig, Hive, and now Impala. These make it possible for analysts to do the types of tasks they’ve been doing for years, only now with a system that scales cost-effectively to meet their needs.
And teaching them how to analyze all that data effectively – on their own, without having to wait for a programmer to help them – is what this course is all about. But the course can also be really valuable to a more technical audience used to writing software to work with data in some capacity, including BI developers, data warehouse engineers, and ETL developers who Extract, Transform, and Load data from one system to another. And that brings me to my next point…
  • There are also some people who might be better served with another course. As I mentioned, I think this course is going to be really useful for developers too, but mainly for the ones that either already know how to write low-level MapReduce code for Hadoop, or would rather wait to learn about those details later on. I want to be clear that this course is about high-level analysis tools – it doesn’t require any programming experience at all, so for the people that want to get deep into the details of writing MapReduce code in Java, I’d recommend they take our Developer training course. And this is a course mainly focused on analysts, a group who are seldom responsible for – and almost never particularly interested in – managing software installations. Our focus in this class is on how to use tools like Pig, Hive, and Impala, so we do not go into detail on how to install or configure them. System administrators who want to learn how to do those things should instead sign up for our Administrator course. What I want to emphasize is that we set the scope to be appropriate for the audience, so we have time to cover the topics that really matter to them. That includes how you can get data out of other systems, like relational databases, data warehouses, or file servers, do large-scale analysis on that data with Hadoop tools like Pig, Hive, and Impala, and also how to access the results of that analysis.
  • In addition to not requiring any programming experience, this course also does not require any prior knowledge of Hadoop, or related tools like Pig, Hive, Sqoop, or Impala. We explain all those things early on – in fact, during the first chapter after the introductions, we teach students why these tools are useful, what each one of them does, and where they fit into the workflow of an analyst who uses them. And I’m going to show you some of that material a bit later today. But prospective students will need some basic skills to get the most out of this course. The first is an understanding of basic relational database concepts, and what I mean by that is the student should be familiar with terms like “field,” “table,” and “query.” If you’ve written queries for pretty much any relational database, whether it’s Oracle, MySQL, Microsoft SQL Server, IBM’s DB2 or Informix, or even something like Microsoft Access, you’ve almost certainly used SQL. I’ve shown a simple example of a SQL query that selects a few fields from the customers table. If that looks familiar to you, then you’ve already met the second requirement. I will also point out that in order to cover topics like user-defined functions and how you can process data with external scripts, it’s necessary to put those in context by showing a few lines of code in a high-level programming language like Python or Perl. But having experience with those languages is NOT a prerequisite for this class, because the instructor explains even those small snippets in detail, and at no point are students expected to write any code like this. And finally, almost every production Hadoop cluster in the world runs on a UNIX or Linux operating system, so it’s helpful to understand a few basic commands. I want to clarify that students won’t need to know anything about how to administer a UNIX system – these are basic commands for the end-user, much like the DOS commands you may have seen on Microsoft Windows systems.
But simply understanding what commands like these do is probably enough, because the specific UNIX commands you’ll use in the hands-on exercises are all shown in the exercise guide we provide to each student. So if you’re able to follow directions – and if seeing a few commands like this doesn’t make you break out in an allergic reaction – then you’ve met the third and final requirement for this class.
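To make that SQL prerequisite concrete – the slide’s own example isn’t reproduced in these notes, so this is a sketch of the kind of query meant, with the table and field names assumed for illustration:

```sql
-- A simple query selecting a few fields from the customers table
SELECT cust_id, name, city
FROM customers
WHERE state = 'CA';
```

If a query along these lines looks routine to you, that’s the level of SQL familiarity the course expects.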
  • And there are about a dozen specific things students who successfully complete this three-day course will learn. As I explained a moment ago, the course begins by explaining the basics of Hadoop, plus related tools like Pig, Hive, Sqoop, and Impala. We discuss several ways that these tools are being used by people in similar roles, and then delve into more detail using a mix of lecture and hands-on exercises that help students reinforce what they’ve learned by doing similar tasks with realistic data on a virtualized Hadoop cluster. In case you haven’t attended one of our training classes before, let me expand on this a bit. Students get a virtual machine pre-configured with Hadoop and all the tools we use in class – including Pig, Hive, Impala, and Sqoop – and use this for all hands-on exercises. That VM works just like a real cluster, and everything they learn is something they’ll be able to apply to their own cluster back at work. It also provides a perfect environment for them to experiment with; since it’s already got the tools as well as the datasets, students can continue exploring what they learned even after class is over. Since you can’t process data on Hadoop until you’ve loaded it into Hadoop, we cover this first. Students will learn how to add data to a Hadoop cluster, including both static files and tables imported from a relational database. We also explain how to access the results of this processing, whether that’s exporting it to a new file or table on the cluster, to a local file, or back to a relational database or enterprise data warehouse.
  • After loading some data into the Hadoop cluster during the first hands-on exercise, students learn – and get pretty extensive practice with – how to design and execute queries to process and analyze that data using three different tools: Pig, Hive, and Impala. Students learn the specifics of those tools, not only basic things like syntax, but also how to extend the features they offer and how to structure their queries for better performance. Now I’ll let you in on a secret objective I have for this course, which is sort of implied by the second and third items on this page. I certainly want students to come away with an understanding of how to analyze data using the various tools I’ve mentioned already. But they’ll also learn how to use some unusual data sources – things like Web server log files and metadata from other types of files – in addition to traditional tables from relational databases. For example, the ‘orders’ table from a relational database will tell you what a customer bought, but the Web server log file will tell you what they looked at but didn’t buy. That yields valuable business insight, and joining a bunch of otherwise unrelated data sets to uncover that insight wasn’t really practical until Hadoop made it possible to store and analyze vast amounts of data inexpensively. And so I think it’s essential to understand not only how the tasks they’ve been doing for years map to these new tools, but also what new possibilities these tools offer. Pig, Hive, Impala, and relational databases all have advantages and limitations, so one of the most important things students will learn is how to choose the best tool for a given task. Let me quickly cover the high-level course outline before moving on to show you a short presentation based on some of the actual material from the course.
  • I’m going to run through this pretty quickly, but I want to point out that a complete outline for this course – and, in fact, all of our courses – is available on the Web site. As I explained a moment ago, we start by explaining the basics of Hadoop and giving a high-level overview of how the different pieces fit together before we go into depth on specific tools. Students get frequent practice on what they’ve learned through realistic – and, judging from the feedback from the people who’ve already taken the course, really interesting – hands-on exercises. The first of these starts about an hour after class begins and gives students the opportunity to import data from files and tables from a relational database into Hadoop’s storage. Exercises like this continue through the last afternoon of class, with each exercise building on the last one. After covering the basics of Hadoop and its related tools, we move into a more thorough introduction to Apache Pig, a high-level tool for data processing originally developed at Yahoo. We start with the basics, but then move on to cover complex data types that will be new to most people coming from a relational database background. Students will gain experience with these things by doing one hands-on exercise to make the format of data sets from two different vendors consistent, and then they analyze that data during the next exercise to determine which advertisements are the most popular, cost the least to run, and which Web sites are the least effective places to show them. And even though we do these exercises in the context of an online retailer, the lessons students learn are applicable to pretty much any industry you could imagine. From there, we move on to how to group and join different data sets together, and that’s an important technique I’ve already mentioned.
Students then learn about the different ways that they can extend the built-in capabilities of Apache Pig, including creating custom functions and processing data through external programs. They get to practice all that during the next hands-on exercise, which is probably the most interesting one of all, since they use data about recent orders to select the best location for a new warehouse so products can reach customers more quickly. And I don’t want to spoil the surprise, but I’ll just say that some of the data students use in this exercise comes from some places they’d probably never considered analyzing before.
  • We then wrap up our discussion of Pig by explaining what to do when things don’t go quite as planned. We have an instructor-led demo that students can follow along with that shows a pretty typical job failure, then shows them how they can use their Web browser to find the log file that reveals the source of the error. And another way things don’t go quite as planned is when a job runs slower than it should – we cover this by walking step-by-step through a pretty basic Pig job and show how some simple changes can make it more than five times faster. Afterwards, we move into Apache Hive. Hive is a tool originally developed at Facebook with a similar goal to Pig – it offers a high-level language that’s much easier and faster to write than MapReduce code. And while Hive and Pig share similar goals and work pretty similarly under the hood, they have very different approaches. The high-level language that Hive uses is based on SQL, the standard language used to query relational databases. Students get the chance to query data with Hive in a variety of ways: from the UNIX command-line, from Hive’s own shell, and through the Web browser using a tool called Hue. Let me add here that if you’re not already familiar with Hue, learning about it might be worth the price of admission all by itself. We created it as an open source project here at Cloudera a few years ago, and it’s since become the standard Web-based user interface for these tools. Even if you’re not able to sign up for the class right away, I really encourage you to look at some of the short videos posted on Cloudera’s blog demonstrating the new features of Hue. After that we explain how to manage data in Hive; that is, how to create databases and tables, and also how to populate them with data, including a table with the complex data types that Hive supports, such as maps and arrays. And since much of the data organizations analyze with Hive is unstructured, knowing how to process text is especially important.
We have an entire chapter on that, and students will learn a number of text processing functions, including many of the natural language processing features supported in Hive. They’ll get to practice that during the optional lab where they use sentiment analysis techniques to analyze comments customers post about the products they buy, ultimately giving them far more insight into customer opinions than they’d ever find by analyzing simple numeric ratings. And as with Pig, we cover how to optimize queries, but also explain how techniques like partitioning and bucketing can help make analytic jobs run faster. Finally, we explain how to use extensions to Hive – for example, user-defined functions and support for custom data formats – and students get practice doing just that in the lab that follows.
  • After covering Hive, we move on to Impala. Like Hive and Pig, Impala is also a high-level open source tool for data analysis. From the perspective of an end-user, it’s quite similar to Hive, only it’s between five and fifty times faster, making it a good choice for interactive queries, where waiting minutes or even hours for Hive to return a result would keep analysts from asking the questions they need to ask to get their jobs done. What I mean here is that you choose a relational database for speed, but it comes at the cost of limited scalability. Conversely, you choose Hive (or Pig) for scalability, but these tools are batch-oriented, and even though they scale extremely well, it can take minutes or even hours for them to complete a query. With Impala, you no longer need to choose between speed and scalability, removing the limits on the types of ad hoc queries that analysts often perform. For example, you don’t want to type a query meant to answer the question “how many additional items did customers buy when they bought the product we advertised last month?” and then have to wait 20 minutes for the answer, only to re-run another query so you can compare these results to a different product you advertised in some other month. With Impala, these types of queries are typically answered in seconds. You might wonder, then, why we have so much more material on Hive than we do on Impala. It’s because most of what students just learned about Hive applies to Impala as well, so there’s no need to repeat it. But we definitely do cover the cases where Hive and Impala behave differently. And that takes us to the final and perhaps most valuable chapter of all in the course – the one where we compare the capabilities and limitations of these tools so students learn how to choose the best one for a given task.
That’s something that really sets our course apart – as I said before, it’s not about just learning how to use these tools, it’s about learning how to use them effectively. [SEGUE] Now that I’ve covered the outline, I think there’s really no better way to exemplify what this course is all about than to show actual material *from* the course. So that’s what I’ve done -- put together a presentation based on some of the things we cover, mainly centered around the material from the beginning of the course. I’ll begin by explaining some of the trends that have led to the intense demand for professionals who know the tools used to process Big Data.
  • The first of these three trends is velocity, and it refers to the increasing speed at which data is being generated. There are several factors that have led to this, including a tendency to automate processes that were previously manual. File cabinets are being replaced by file servers… the move towards electronic health records, particularly in the United States, is a great example of this. Another factor that’s increased the speed at which we create new data is that both people and machines are increasingly interconnected. Mobile devices allow us to create information as well as consume it, and nearly everybody in the developed world has one with them – connected to the internet – all the time. Your mobile phone obviously allows you to actually call people, but there are so many other ways that people communicate and share information now; for example: sending text messages, posting pictures to Instagram, uploading videos to YouTube, or checking in on a site like Foursquare or Facebook. Even when you’re not actively using the device, it’s still sending and receiving data as it checks for signals and new messages, and even that data has valuable applications. And that’s just mobile devices – consider all the data generated by modern cars, utility meters, traffic signals, and industrial sensors. These systems used to stand alone – offline – but now they’re connected and generating data faster than ever before.
  • Another change that has led to the surge in demand for these tools is that the type of data we generate has changed. This includes not only the messages posted on social networks, but the connections themselves – one of the cornerstones of LinkedIn, for example, is that they can predict who you might know and suggest those connections to you. I already mentioned the images, audio, and video files we create with our mobile devices; obviously creating these types of files produces a lot of data, but the act of simply viewing those files creates entries in log files on a server somewhere. With a relational database, you must know the format of your data up front, before you can even store a single record; this is known as “schema on write.” But one key difference with Hadoop is that you can store data without knowing its format, similar to how you’d store data on a file server. In fact, you don’t need to know about its format until you need to process it; this is known as “schema on read.” Both approaches have their merits, but Hadoop gives you a lot of flexibility, because you can store data today without worrying whether the format might change five years from now. This is ideal for precisely the types of data I’ve listed here, because most of these things don’t map neatly to relational database concepts like tables, columns, and rows.
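To illustrate “schema on read”: with Hive, for example, you can lay a table definition over files that already exist, and the structure is applied only when the data is queried. This is a sketch – the path and field names are assumptions for illustration, not actual course material:

```sql
-- The files under /data/weblogs may have been written long before this
-- table existed; Hive applies the schema only when the table is read.
CREATE EXTERNAL TABLE weblogs (
  ip      STRING,
  request STRING,
  status  INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/weblogs';
```

Contrast this with a relational database, where the CREATE TABLE must come first and every record is validated against the schema as it is written.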
  • And the last of the trends I’ll mention is Volume, which relates to the vast amount of data we’re creating. Although Hadoop can scale to easily handle petabytes of data – that’s thousands of terabytes – I’ll also add that it’s not yet that common for most organizations to have that volume of data. But I want to emphasize that what matters is that the trends I’ve just explained mean that we’re all producing data faster than ever before, so everybody – whether it’s Google or Facebook or the bank down the street from your house – is producing more data than they did last year, or even last month. And what all of them have in common is that they need a system that can meet those demands, not just now, but next year and three years from now. For example, if you have 3TB of data now and your data grows at a rate of 10% per month, you’ll have more than 90TB of data in just three years. And that doesn’t even consider all the data that organizations discard now – things like server log files or historic account information – because until now it’s been just too expensive to keep.
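The growth figure above is just compound interest applied to storage. A quick sketch of the arithmetic (the 3TB starting point and 10% monthly rate are the figures from the talk):

```python
def projected_size_tb(initial_tb, monthly_growth_rate, months):
    """Project storage needs assuming steady compound monthly growth."""
    return initial_tb * (1 + monthly_growth_rate) ** months

# 3 TB today, growing 10% per month, over three years (36 months)
print(round(projected_size_tb(3, 0.10, 36)))  # about 93 TB
```

The point of the exercise isn’t the exact number – it’s that steady double-digit monthly growth turns a modest data set into one that outgrows a single machine within a few years.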
  • As organizations realize the potential for low-cost, scalable data storage and processing that Hadoop offers, they tend to stop deleting “unneeded” data like log files, account histories, and so on, because they recognize that having access to more data will yield more useful and more accurate results, and that deleting this data – or even just exiling it to tape backup systems where it’s unlikely to ever be seen again – means that the valuable insight hiding within it would also be discarded. And so “delete” is becoming a dirty word. Consider the example of a coin toss: if you flip four times, you might come to the conclusion that there’s a 75% chance of the coin landing face up and a 25% chance of landing face down. But having more data – say the result of 1,000 coin tosses – will almost certainly lead you to observe that there’s a 50% chance of the coin landing either way. As an analyst trying to gather data for key business decisions, accuracy is essential. And having the ability to examine all the data can give you this accuracy. This is one of the main reasons that Hadoop is revolutionary, because previous systems forced you to choose a subset to work with. A single rating on an e-commerce site doesn’t create opportunity, but having a few million allows you to make product recommendations (and this is something we cover in detail in our Data Science course). A single tweet tells you what somebody was thinking about at a single moment in time. It may be interesting, but it’s more anecdotal than valuable. But if you take the tweets of all your customers and potential customers and analyze them, you can get a pretty good idea of what’s important to them.
And that can have a profound impact on your marketing strategy. There are a lot of valuable applications for this data, including making product recommendations of the type you probably know from Amazon or Netflix, predicting demand for products based on past sales history, analyzing customer behavior to optimize sales, and detecting fraudulent transactions. But simply storing all that data isn’t enough; you’ve got to process and analyze it to extract the value it contains.
  • So I’ve established that the value of data is related to how much of it you have. The trends I’ve explained mean that many organizations today – and certainly many more in the future – face a major opportunity. But it also means that they will outgrow the tools they’ve traditionally used to process it all. What they need is a system that addresses two fundamental concerns: how to store all that data at a reasonable cost, and how to analyze all the data they’ve stored.
  • And that is precisely what Hadoop is meant to do. Hadoop is an open source system for large-scale data storage and processing. It harnesses the power of relatively inexpensive servers – up to several thousand of them – to provide massive performance and storage capacity. A collection of machines running the Hadoop software is known as a “cluster.” Hadoop was factored out of a project to build an open source search engine. Doug Cutting, now an Architect at Cloudera, was trying to overcome scalability limitations in that system when he read two papers Google had recently published: “The Google File System” (2003) and “MapReduce: Simplified Data Processing on Large Clusters” (2004). Doug Cutting rewrote the project based on the concepts described in those papers. This ultimately led to the creation of Hadoop as its own project, because he recognized that this technology had applications beyond Web search – many of the same applications I described just a moment ago. And for a little bit of trivia, the elephant logo associated with Hadoop, shown here in the lower right corner, comes from the name of his son’s stuffed elephant toy. What is commonly called “Core Hadoop” is two things: storage (called HDFS) and processing (called MapReduce), but there are a number of related tools – things like Pig, Hive, and Impala – that build on top of Hadoop in order to make analyzing data even easier. In the course itself we actually explain both HDFS and MapReduce with some diagrams that make it easy for people without programming or system administration experience to understand, but in the interest of time, let’s get right to Pig, Hive, and Impala.
  • Pig can offer a major productivity boost for analysts. Although it ultimately runs as a MapReduce job on a Hadoop cluster and takes about the same amount of time to compute a result, it is much easier and faster to write a job to do the analysis with Pig. The six lines of Pig code shown here:
  • load data from two tab-delimited data sets, customers and orders
  • group the order data by customer ID
  • calculate the total cost of all orders for each of these customers
  • join the grouped totals with the customer data so we can associate customer details like “name” with each ID
  • and finally, display the result to the screen
Now that I’ve explained this script, I want to emphasize the point about productivity. It would take an experienced Java developer a few hours to write the low-level MapReduce code needed to do what I’ve shown here, because it would take around one hundred lines of code. But someone who successfully completes this Data Analyst course will be able to write something like this in maybe ten minutes. And that’s important for two reasons: first, it brings the power of Hadoop to people who aren’t programmers – and who don’t really care to spend months learning Java just so they can analyze some data. But secondly, developers who already know how to write the low-level code in Java are really going to save a lot of time and trouble by using a high-level tool like this, because it makes typical analytical tasks so much easier. Before I show you Hive, another high-level tool for data analysis, I want to point out what is unique about Pig. As you can see, you express what you want to do as a sequence of steps, much like you’d do in a shell script or a macro in Microsoft Excel. This makes Pig particularly suited to doing multiple transformations on the data like you’d have in an Extract, Transform, and Load process.
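As a sketch of what those six lines might look like in Pig Latin – the paths, schemas, and field names here are illustrative assumptions, not the actual course material:

```pig
-- Load two tab-delimited data sets (tab is Pig's default delimiter)
customers = LOAD '/data/customers' AS (cust_id:int, name:chararray);
orders    = LOAD '/data/orders'    AS (order_id:int, cust_id:int, cost:double);

-- Group the orders by customer ID and total each customer's order cost
by_cust = GROUP orders BY cust_id;
totals  = FOREACH by_cust GENERATE group AS cust_id, SUM(orders.cost) AS total;

-- Join the totals with the customer data and display the result
result = JOIN totals BY cust_id, customers BY cust_id;
DUMP result;
```

Each statement names an intermediate relation, which is what gives Pig its step-by-step, script-like feel compared to a single declarative SQL query.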
  • Data isn’t always delivered to you in the format you want, so imagine that you’ve got input data from one source that’s tab-delimited and has the fields in a particular order … and you have input data from another source that’s comma-delimited, has the fields in a different order, and maybe doesn’t format date fields the same way as the first set of files. Pig makes it easy to get that data into a consistent format, and in fact, that’s exactly what students do in the second hands-on exercise. ETL processing like what I’ve described is a pretty common use case for Pig, not only because it makes these complex transformations easy, but also because of the scalability that Hadoop provides. [SEGUE]: OK, now I want to explain Apache Hive, which is a similar tool that most analysts will feel pretty comfortable with right away….
  • Hive is another high-level abstraction for MapReduce. While Pig was created at Yahoo, Hive was initially created at Facebook when they found their existing data warehousing solution could not scale to process as much data as they were generating – a problem that Facebook faced five years ago and that other organizations will increasingly face. Like Pig, Hive lets you define analysis using a high-level language which ultimately gets turned into MapReduce jobs that analyze data in your cluster. However, unlike the custom step-by-step language that Pig uses, the syntax of Hive queries is pretty much the same as SQL. The example shown here is similar to the Pig Latin example from the previous slide. It calculates the total order cost for each customer, although unlike the previous example, this one also sorts the results. In either case, writing the equivalent MapReduce code might take even an experienced Java developer a few hours, but Hive makes it possible for someone who already knows SQL to do this in just a few minutes.
  • Although Hive allows you to run many of the same queries you’d run with a relational database, one thing that’s pretty unique is the flexibility you have with data sources. Hive can query tables of data from many sources, but as I touched on earlier, data does not always fit neatly into delimited columns and rows. That’s certainly the case with something like a Web server log file – if you’ve ever looked at one, you’ll recognize how complicated they are. Lots of specialized tools exist for creating reports about traffic on a Web site, and they exist precisely because people recognize how valuable this information is. Hive allows you to take a directory of these files and treat it like a table, letting you query it with SQL as if it were in a relational database. And that’s nice on its own, but even more valuable if you combine that with other data you have about your customers and products. Hive allows you to do this because you can join tables of data from different sources. That’s something students will get to do for themselves during the hands-on exercises for this class. [SEGUE]: So how could you get data about customers and orders out of your relational database or enterprise data warehouse and into Hadoop so you could query it with Hive?
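To make the “directory of log files as a table” idea concrete: Hive attaches a schema to raw files, so each line is read as a row of named columns. Conceptually that is like the small Python sketch below, which parses one log line into fields. The regex and the sample line are assumptions for illustration, loosely modeled on the Apache common log format; Hive itself does this kind of parsing through its table definition rather than ad hoc code.

```python
import re

# Simplified pattern for an Apache-common-log-style line.
# (An assumption for illustration -- a real Hive table would map
# these fields to columns via its table definition instead.)
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+)'
)

# Invented sample line in that format
sample = '10.0.0.1 - - [01/Jun/2013:10:15:32 -0500] "GET /tablet HTTP/1.1" 200 5894'

# Parse the line into named fields -- each one now acts like a column
row = LOG_PATTERN.match(sample).groupdict()
print(row["ip"], row["status"], row["request"])
```

Once every line is a row of columns like this, SQL-style questions (“how many 404s per page?”) become simple aggregations.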
  • The easiest way by far is to use Sqoop, another tool we cover in depth during the class. The name Sqoop is a contraction of “SQL-to-Hadoop”, and it’s a program that helps you import tables from a database into Hadoop so you can analyze them using the tools I’ve just discussed. Students use Sqoop during the very first exercise to import data from a relational database that they analyze throughout the class. In addition to being able to import data from a database into Hadoop, Sqoop can also export data. You can take the result of the analysis you’ve done in Hadoop and move it back to a database, so Sqoop is often a cornerstone of a complete analytical workflow.
  • There’s one other tool we cover in depth during the class. Last October at the Hadoop World conference in New York City, Cloudera announced Impala, a high-performance query engine for Hadoop. Like Hadoop itself, it’s heavily inspired by technology from Google – in fact, the engineer who leads Impala development at Cloudera joined us from Google, where he worked on scalable high-performance technology. From the perspective of an end user, Impala is similar to Hive. It uses a SQL-based syntax and even shares metadata with Hive, so generally speaking, the data you analyze with Hive can also be analyzed with Impala. The Hive query I showed a moment ago, for example, works exactly the same in Impala. But the defining difference between the two is speed. Hive is batch-oriented and best suited for long-running jobs where instant results aren’t required, while Impala is purpose-built for interactive queries. Although performance depends on a lot of factors, in general, a query in Impala may finish about ten times faster – and often as much as fifty times faster – than it would in Hive. Impala gets this performance because it uses a custom execution engine developed specifically for queries rather than the more generic MapReduce approach that Pig and Hive use. The caveat is that Pig and Hive are more customizable and offer some features that Impala does not currently support. I’ll cover those in a moment when I explain how these tools compare to one another. By the way, I find that many people are surprised to learn that we give this technology away for free. While Impala is not an Apache project, it’s 100% open source and is released under the same license as Hive, Pig, and Hadoop itself. [SEGUE] So what sort of things do people use Impala for?
  • Impala is most useful towards the end of an analytic workflow – after you’ve imported the data with Sqoop, perhaps processed it with Pig, or done some other analysis with Hive. What I want to emphasize is that you can take data from a variety of sources, store it in your Hadoop cluster, and then run fast, interactive queries on that data using Impala. One analyst might use Impala’s built-in query tool for that, while another might use a business intelligence tool, just as they might already do with an enterprise data warehouse system. Lots of “BI” tool vendors – companies like Tableau, Microstrategy, and Pentaho, to name just a few – have certified their products to work with Impala. [SEGUE]: I’ve now given you a high-level overview of the different analysis tools we cover in class, so let me quickly recap them and show how they compare to one another.
  • At the core of data processing in Hadoop is MapReduce. Although it’s possible to analyze data by writing MapReduce code directly, this requires a fair amount of specialized knowledge and experience as a programmer. And since high-level tools like Pig and Hive provide an alternative to writing MapReduce directly, it’s usually better to use them instead. Pig and Hive are similar in their capabilities, and both work by turning the high-level language that you write into a series of MapReduce jobs that run on your Hadoop cluster. But Pig and Hive have somewhat different approaches to solving problems: with Pig you express your data flow as a sequence of steps, while with Hive you write a query using a variant of SQL called HiveQL. Impala is almost identical to Hive in terms of syntax, but unlike the others, it uses a high-performance custom execution engine to query data rather than using MapReduce. This gives it a significant advantage in terms of speed, though when we compare the features of these three tools, we’ll see that Pig and Hive offer more opportunities for customization.
  • A common question that people ask is “Should I choose Pig, Hive, or Impala?” This is one of the questions that we answer in the class, and while the details require a bit more time than I have here, I’ve included one of the slides we use in class to summarize how these tools compare. The bottom line is that each of these tools has unique advantages and limitations, so we recommend that you don’t limit yourself to using just one of them. You’ll learn all three in this class – and most importantly, you’ll come away with an understanding of which one is the best choice for a particular task. [PAUSE]: Stay on this screen for at least 45 seconds so people can read it… take a drink or something [PAUSE]
  • [SEGUE]: OK, that’s the end of my presentation, so I’ll mention a few other noteworthy things and then move on to answering your questions. You can submit your questions in the Q&A panel, and if you want to watch this session again later – or recommend it to a friend – we’re going to post the recording on the Web site. And if you’d like to see me give a somewhat longer version of the Hadoop introduction I just covered, I encourage you to attend my workshop at the OSCON conference in Portland next week, or the workshops I’m doing at the Stampede Big Data conference in St. Louis on Thursday, August 1st. If you want to sign up for our new Data Analyst class, go to the Web site. We’ve got a special discount code for people who sign up for the Cloudera-delivered Data Analyst course between now and September 1. Just use the code “wheeler underscore ten” when you check out to save 10 percent. And if you enroll in another course in addition to this one through September 1st, you can use the discount code “fifteen off two” to save 15% off both of them. Thank you very much for your attention, and I’ll now take time to answer your questions.
  • Introduction to Data Analyst Training

    1. Cloudera Data Analyst Training: Using Pig, Hive, and Impala with Hadoop – Tom Wheeler
    2. Agenda
       • Why Cloudera Training?
       • Target Audience and Prerequisites
       • Course Outline
       • Short Presentation Based on Actual Course Material
         – Understanding Hadoop, Pig, Hive, Sqoop, and Impala
       • Question and Answer Session
    3. Rising demand for Big Data and analytics experts but a DEFICIENCY OF TALENT will result in a shortfall of 32,000 trained professionals by 2015. Source: Accenture, “Analytics in Action,” March 2013.
    4. Why Cloudera Training?
       1. Broadest Range of Courses – Developer, Admin, Analyst, HBase, Data Science
       2. Most Experienced Instructors – More than 15,000 students trained since 2009
       3. Widest Geographic Coverage – 50 cities worldwide plus online
       4. Leader in Certification – Over 5,000 accredited Cloudera professionals
       5. Leading Platform & Community – CDH deployed more than all other distributions combined
       6. Relevant Training Material – Classes updated regularly as tools evolve
       7. Practical Hands-On Exercises – Real-world labs complement live instruction
       8. Ongoing Learning – Video tutorials and e-learning complement training
    5. Cloudera Trains the Top Companies: Cloudera has trained employees from 55% of the Fortune 100, and Big Data professionals from 100% of the top 20 global technology firms have attended live Cloudera training to use Hadoop. Source: Fortune, “Fortune 500” and “Global 500,” May 2012.
    6. What Do Our Students Say?
       • 94% would recommend or highly recommend Cloudera training to friends and colleagues
       • 88% indicate Cloudera training provided the Hadoop expertise their roles require
       • 66% draw on lessons from Cloudera training on at least a monthly basis
       Sources: Cloudera Past Public Training Participant Study, December 2012. Cloudera Customer Satisfaction Study, January 2013.
    7. “Cloudera is the best vendor evangelizing the Big Data movement and is doing a great service promoting Hadoop in the industry. Developer training was a great way to get started on my journey.”
    8. Cloudera Data Analyst Training: Using Pig, Hive, and Impala with Hadoop – About the Course
    9. Intended Audience
       • This course was created for people in analytical roles, including
         – Data Analyst
         – Business Intelligence Analyst
         – Operations Analyst
         – Reporting Specialist
       • Also useful for others who want to use high-level Big Data tools
         – Business Intelligence Developer
         – Data Warehouse Engineer
         – ETL Developer
    10. Who Should Not Take this Course
       • Developers who want to learn details of MapReduce programming
         – Recommend Cloudera Developer Training for Apache Hadoop
       • System administrators who want to learn how to install/configure tools
         – Recommend Cloudera Administrator Training for Apache Hadoop
    11. Course Prerequisites
       • No prior knowledge of Hadoop is required
       • What is required is an understanding of
         – Basic relational database concepts
         – Basic knowledge of SQL
         – Basic end-user UNIX commands

       SELECT id, first_name, last_name
       FROM customers
       ORDER BY last_name;

       $ mkdir /data
       $ cd /data
       $ rm /home/tomwheeler/salesreport.txt
    12. Course Objectives – During this course, you will learn
       • The purpose of Hadoop and its related tools
       • The features that Pig, Hive, and Impala offer for data acquisition, storage, and analysis
       • How to identify typical use cases for large-scale data analysis
       • How to load data from relational databases and other sources
       • How to manage data in HDFS and export it for use with other systems
       • How Pig, Hive, and Impala improve productivity for typical analysis tasks
       • The language syntax and data formats supported by these tools
    13. Course Objectives (cont’d)
       • How to design and execute queries on data stored in HDFS
       • How to join diverse datasets to gain valuable business insight
       • How to analyze structured, semi-structured, and unstructured data
       • How Hive and Pig can be extended with custom functions and scripts
       • How to store and query data for better performance
       • How to determine which tool is the best choice for a given task
    14. Course Outline
       • Hadoop Fundamentals
         – Hands-On Exercise: Data Ingest with Hadoop Tools
       • Introduction to Pig
       • Basic Data Analysis with Pig
         – Hands-On Exercise: Using Pig for ETL Processing
       • Processing Complex Data with Pig
         – Hands-On Exercise: Analyzing Ad Campaign Data with Pig
       • Multi-Dataset Operations with Pig
         – Hands-On Exercise: Analyzing Disparate Data Sets with Pig
       • Extending Pig
         – Hands-On Exercise: Extending Pig with Streaming and UDFs
    15. Course Outline (cont’d)
       • Pig Troubleshooting and Optimization
         – Demo: Troubleshooting a Failed Job with the Web UI
       • Introduction to Hive
       • Relational Data Analysis with Hive
         – Hands-On Exercise: Running Hive Queries on the Shell, Scripts, and Hue
       • Hive Data Management
         – Hands-On Exercise: Data Management with Hive
       • Text Processing with Hive
         – Hands-On Exercise: Gaining Insight with Sentiment Analysis
       • Hive Optimization
       • Extending Hive
         – Hands-On Exercise: Data Transformation with Hive
    16. Course Outline (cont’d)
       • Introduction to Impala
       • Analyzing Data with Impala
         – Hands-On Exercise: Interactive Analysis with Impala
       • Choosing the Best Tool for the Job
    17. Velocity
       • We are generating data faster than ever
         – Processes are increasingly automated
         – People are increasingly interacting online
         – Systems are increasingly interconnected
    18. Variety
       • We are producing a wide variety of data
         – Social network connections
         – Images, audio, and video
         – Server and application log files
         – Product ratings on shopping and review Web sites
         – And much more…
       • Not all of this maps cleanly to the relational model
    19. Volume
       • Every day…
         – More than 1.5 billion shares are traded on the New York Stock Exchange
         – Facebook stores 2.7 billion comments and ‘Likes’
         – Google processes about 24 petabytes of data
       • Every minute…
         – Foursquare handles more than 2,000 check-ins
         – TransUnion makes nearly 70,000 updates to credit files
       • And every second…
         – Banks process more than 10,000 credit card transactions
    20. Data Has Value
       • This data has many valuable applications
         – Product recommendations
         – Predicting demand
         – Marketing analysis
         – Fraud detection
         – And many, many more…
       • We must process it to extract that value
         – And processing all the data can yield more accurate results
    21. We Need a System that Scales
       • We’re generating too much data to process with traditional tools
       • Two key problems to address
         – How can we reliably store large amounts of data at a reasonable cost?
         – How can we analyze all the data we have stored?
    22. What is Apache Hadoop?
       • Scalable and economical data storage and processing
         – Distributed and fault-tolerant
         – Harnesses the power of industry-standard hardware
       • Heavily inspired by technical documents published by Google
       • ‘Core’ Hadoop consists of two main components
         – Storage: the Hadoop Distributed File System (HDFS)
         – Processing: MapReduce
    23. Apache Pig
       • Apache Pig builds on Hadoop to offer high-level data processing
         – This is an alternative to writing low-level MapReduce code
         – Pig is especially good at joining and transforming data

       people = LOAD '/user/training/customers' AS (cust_id, name);
       orders = LOAD '/user/training/orders' AS (ord_id, cust_id, cost);
       groups = GROUP orders BY cust_id;
       totals = FOREACH groups GENERATE group, SUM(orders.cost) AS t;
       result = JOIN totals BY group, people BY cust_id;
       DUMP result;
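The data flow that this Pig script expresses – load, group, sum, join, dump – can be mimicked step by step in plain Python, which may help clarify what each relation holds. The tiny in-memory data sets here are invented for illustration; on a real cluster, Pig runs this same logic as MapReduce over files in HDFS.

```python
from collections import defaultdict

# Toy stand-ins for the two tab-delimited data sets (invented values)
people = [("c1", "Alice"), ("c2", "Bob")]           # (cust_id, name)
orders = [("o1", "c1", 10.0), ("o2", "c1", 5.5),    # (ord_id, cust_id, cost)
          ("o3", "c2", 7.25)]

# GROUP orders BY cust_id, then SUM(orders.cost) for each group
totals = defaultdict(float)
for _ord_id, cust_id, cost in orders:
    totals[cust_id] += cost

# JOIN totals BY group, people BY cust_id
result = [(cust_id, totals[cust_id], name)
          for cust_id, name in people if cust_id in totals]

# DUMP result
for row in result:
    print(row)
```

Each intermediate variable corresponds to one named relation in the script, which is exactly the “sequence of steps” style that distinguishes Pig from SQL-based tools.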
    24. Use Case: ETL Processing
       • Pig is also widely used for Extract, Transform, and Load (ETL) processing
       [Diagram: data from Operations, Accounting, and Call Center systems flows through Pig jobs running on a Hadoop cluster – which validate data, fix errors, remove duplicates, and encode values – into a data warehouse]
    25. Apache Hive
       • Hive is another abstraction on top of MapReduce
         – Like Pig, it also reduces development time
         – Hive uses a SQL-like language called HiveQL

       SELECT customers.cust_id, SUM(cost) AS total
       FROM customers JOIN orders
         ON customers.cust_id = orders.cust_id
       GROUP BY customers.cust_id
       ORDER BY total DESC;
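What this HiveQL computes – join, group, sum, sort descending – can also be sketched in plain Python. The toy rows below are invented for illustration; Hive would run the same logic as MapReduce jobs over tables in the cluster.

```python
from collections import defaultdict

# Toy rows standing in for the customers and orders tables (invented values)
customers = [{"cust_id": "c1"}, {"cust_id": "c2"}]
orders = [{"cust_id": "c1", "cost": 20.0},
          {"cust_id": "c2", "cost": 12.5},
          {"cust_id": "c1", "cost": 3.0}]

# JOIN ... ON cust_id, then GROUP BY cust_id with SUM(cost) AS total
totals = defaultdict(float)
known = {c["cust_id"] for c in customers}
for o in orders:
    if o["cust_id"] in known:        # inner-join semantics
        totals[o["cust_id"]] += o["cost"]

# ORDER BY total DESC
result = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(result)
```

Note that unlike the Pig version, the query says *what* result is wanted and the sorting comes last, mirroring how a SQL engine plans the work for you.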
    26. Use Case: Log File Analytics
       • Server log files are an important source of data
       • Hive allows you to treat a directory of log files like a table
         – Allows SQL-like queries against raw data

       Dualcore Inc. Public Web Site (June 1 - 8)
       Product    Unique Visitors   Page Views   Bounce Rate   Conversion Rate   Average Time on Page
       Tablet     5,278             5,894        23%           65%               17 seconds
       Notebook   4,139             4,375        47%           31%               23 seconds
       Stereo     2,873             2,981        61%           12%               42 seconds
       Monitor    1,749             1,862        74%           19%               26 seconds
       Router     987               1,139        56%           17%               37 seconds
       Server     314               504          48%           28%               53 seconds
       Printer    86                97           27%           64%               34 seconds
    27. Apache Sqoop
       • Sqoop exchanges data between a database and Hadoop
       • It can import all tables, a single table, or a portion of a table into HDFS
         – Result is a directory in HDFS containing comma-delimited text files
       • Sqoop can also export data from HDFS back to the database
       [Diagram: data flowing in both directions between a database and a Hadoop cluster]
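The shape of a Sqoop import result – database rows rendered as comma-delimited text – can be illustrated with a small Python sketch using sqlite3 in place of the source RDBMS. The table and values are invented for illustration; real Sqoop runs the import as parallel MapReduce tasks and writes the files into an HDFS directory.

```python
import csv
import io
import sqlite3

# In-memory database standing in for the source RDBMS (invented schema)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (cust_id TEXT, name TEXT)")
db.executemany("INSERT INTO customers VALUES (?, ?)",
               [("c1", "Alice"), ("c2", "Bob")])

# Render each row as comma-delimited text, the format Sqoop writes to HDFS
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
for row in db.execute("SELECT cust_id, name FROM customers"):
    writer.writerow(row)

print(out.getvalue())
# c1,Alice
# c2,Bob
```

Because the output is plain delimited text, the imported data is immediately usable by Pig, Hive, or Impala, which is what makes Sqoop a convenient first step in the workflow.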
    28. Cloudera Impala
       • Massively parallel SQL engine which runs on a Hadoop cluster
         – Inspired by Google’s Dremel project
         – Can query data stored in HDFS or HBase tables
       • High performance
         – Typically at least 10 times faster than Pig, Hive, or MapReduce
         – High-level query language (subset of SQL)
       • Impala is 100% Apache-licensed open source
    29. Where Impala Fits Into the Data Center
       [Diagram: transaction records from an application database, log data from Web servers, and documents from a file server all feed a Hadoop cluster running Impala; one analyst queries it with the Impala shell for ad hoc queries, another via a BI tool]
    30. Recap of Data Analysis/Processing Tools
       • MapReduce
         – Low-level processing and analysis
       • Pig
         – Procedural data flow language executed using MapReduce
       • Hive
         – SQL-based queries executed using MapReduce
       • Impala
         – High-performance SQL-based queries using a custom execution engine
    31. Comparing Pig, Hive, and Impala
       Feature                              Pig    Hive   Impala
       SQL-based query language             No     Yes    Yes
       User-defined functions (UDFs)        Yes    Yes    No
       Process data with external scripts   Yes    Yes    No
       Extensible file format support       Yes    Yes    No
       Complex data types                   Yes    Yes    No
       Query latency                        High   High   Low
       Built-in data partitioning           No     Yes    Yes
       Accessible via ODBC / JDBC           No     Yes    Yes
    32. • Submit questions in the Q&A panel
       • Watch on-demand video of this webinar at
       • Follow Cloudera University @ClouderaU
       • Attend Tom’s talk at OSCON:
       • Or Tom’s talks at StampedeCon:
       • Thank you for attending! Register now for Cloudera training at
       • Use discount code Wheeler_10 to save 10% on new enrollments in Data Analyst Training classes delivered by Cloudera until September 1, 2013
       • Use discount code 15off2 to save 15% on enrollments in two or more training classes delivered by Cloudera until September 1, 2013