Hadoop for the Absolute Beginner

Given on a free DevelopMentor webinar. A high-level overview of big data and the need for Hadoop. Also covers Pig, Hive, Yarn, and the future of Hadoop.

Transcript of "Hadoop for the Absolute Beginner"

  1. Hadoop for the Absolute Beginner Ike Ellis, MVP
  2. Agenda • What is Big Data? • Why is it a problem? • What is Hadoop? – MapReduce – HDFS • Pig • Hive • Sqoop • HCat • The Players • Maybe data visualization (depending on time) • Q&A
  3. What is Big Data? • Trendy? • Buzz words? • Process? • Big data is “a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications” – Wikipedia • So how do you know your data is big data? • When your existing data processing methodologies are no longer good enough.
  4. Traditional Data Warehouse Stack
  5. There are a lot of moving pieces back there… • Sometimes, that's our biggest challenge – Simple question – massive data • Do we really need to go through the pain of that huge stack?
  6. Big Data Characteristics • Volume – Large amount of data • Velocity – Needs to be processed quickly • Variety – Excel, SQL, OData feeds, CSVs, web downloads, JSON • Variability – Different semantics, in terms of meaning or context • Value
  7. Big Data Examples • Structured Data – Pre-defined Schema – Highly Structured – Relational • Semi-structured Data – Inconsistent structure – Cannot be stored in rows and tables in a typical database – Logs, tweets, data feeds, GPS coordinates • Unstructured Data – Lacks structure – Free-form text – Customer feedback forms – Audio – Video
  8. The Problem • So you have some data
  9. The Problem • And you want to clean and/or analyze it
  10. So you use the technology that you know • Excel • SQL Server • SQL Server Integration Services • SQL Server Reporting Services
  11. But what happens if it’s TONS of data • Like all the real estate transactions in the US for the last ten years? • Or GPS data from every bike in your bike rental store? • Or every swing and every pitch from every baseball game since 1890?
  12. Or what happens when the analysis is very complicated? • Tell me when earthquakes happen! • Tell me how shoppers view my website! • Tell me how to win my next election!
  13. So you use SQL Server, and have a lot of data, so…. • YOU SCALE UP! • But SQL Server can only have so much RAM, CPU, Disk I/O, Network I/O • So you hit a wall, probably with disk I/O • So you….
  14. Scale Out! • Add servers until the pain goes away…. All analysis is done away from the data servers
  15. But that's easier said than done • What's the process? • You take one large task, and break it up into lots of smaller tasks – How do you break them up? – Once it's broken up and processed, how do you put them back together? – How do you make sure you break them up evenly so they all execute at the same rate? – And really, you're breaking up two things: • Physical data • Computational Analysis – If one small task fails, how do you restart it? Log it? Recover from failure? – If one SQL Server fails, how do you divert all the new tasks away from it? – How do you load balance? • So you end up writing a lot of plumbing code….and even when you get done….you have one GIANT PROBLEM!
  16. Data Movement Data moves to achieve fault tolerance, to segment data, to reassemble data, to derive data, to output data, etc., etc. …and the network (and disk) is SLOW… you've saturated it.
  17. Oh, and another problem • In SQL, the performance difference between a query over 1MB of data and a query over 1TB of data is significant • The performance difference between a query over one server and over 20 servers is also significant
  18. So to summarize and repeat • Drive seek time….BIG PROBLEM • Drive channel latency…BIG PROBLEM • Data + processing time…BIG PROBLEM • Network Pipe I/O saturation…BIG PROBLEM • Lots of human problems – Building a data warehouse stack is a difficult challenge • Semi-structured data is difficult to handle – As data changes, it becomes less structured and less valuable – Flexible structures often give us fits
  19. Enter Hadoop • Why write your own framework to handle fault tolerance, logging, data partitioning, and heavy analysis when you can just use this one?
  20. What is Hadoop? • Hadoop is a distributed storage and processing technology for large scale applications – HDFS • Self-healing, distributed file system. Breaks files into blocks and stores them redundantly across the cluster – MapReduce • Framework for running large data processing jobs in parallel across many nodes and combining results • Open Source • Distributed Data Replication • Commodity hardware • Disparate hardware • Data and analysis co-location • Scalability • Reliable error handling
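     (A small aside on what driving HDFS looks like: it is a file system, operated with shell commands that mirror Unix ones. The paths below are made up; the hadoop fs commands themselves are standard.)
       # copy a local file into HDFS; it is split into blocks and each
       # block is stored redundantly across the cluster
       hadoop fs -put sales.csv /data/sales.csv
       # browse and read it back much like an ordinary file system
       hadoop fs -ls /data
       hadoop fs -cat /data/sales.csv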
  21. Hadoop Ecosystem
  22. Under the covers Hadoop works by keeping the compute next to the data (to minimize network I/O costs)
  23. MapReduce
  24. Segmentation Problem
  25. MapReduce Process – Very simple example
  26. Programming MapReduce • Steps – Define the inputs • Usually some files in HDFS/HBase (or Azure Blob Storage) – Write a map function – Write a reduce function – Define outputs • Usually some files in HDFS/HBase (or Azure Blob Storage) • Lots of options for both inputs and outputs • Functions are usually written in Java – Or Python – Even .NET (C#, F#)
  27. Scalability • Hadoop scales linearly with data size – Or analysis complexity – Scales to hundreds of petabytes • Data-parallel or compute-parallel • Extensive machine learning on <100GB of image data • Simple SQL queries on >100TB of clickstream data • Hadoop works for both!
  28. Hadoop allows you to write a query like this:
      Select productname, sum(costofgoods)
      From salesorders
      Group by productname
      • Over a ton of data, or a little data, and have it perform about the same • If it slows down, throw more nodes at it • Map is like the GROUP BY • While reduce is like the aggregate
  29. Why use Hadoop? • Who wants to write all that plumbing? – Segmenting data – Making it redundant and fault tolerant – Overcoming job failure – Logging – All those data providers – All the custom scripting languages and tooling – Synchronization – Scale-free programming model • Wide adoption • You specify the map() and reduce() functions – Let the framework do the rest
  30. What is Hadoop Good For? • Enormous datasets • Log Analysis • Calculating statistics on enormous datasets • Running large simulations • ETL • Machine learning • Building inverted indexes • Sorting – World record • Distributed Search • Tokenization • Image processing • No fancy hardware…good in the cloud • And so much more!
  31. What is Hadoop Bad For? • Low latency (not current data) • Sequential algorithms – Recursion • Joins (sometimes) • When all the data is structured and can fit on one database server with scaling up – It is NOT a replacement for a good RDBMS
  32. Relational vs Hadoop
  33. Another Problem • MapReduce functions are written in Java, Python, .NET, and a few other languages • Those are languages that are widely known • Except by analysts and DBAs, the exact kind of people who struggle with big data • Enter Pig & Hive – Abstraction for MapReduce – Sits over MapReduce – Spawns MapReduce jobs
  34. What MapReduce Functions look like
      function map(String name, String document):
        // name: document name
        // document: document contents
        for each word w in document:
          emit (w, 1)
      function reduce(String word, Iterator partialCounts):
        // word: a word
        // partialCounts: a list of aggregated partial counts
        sum = 0
        for each pc in partialCounts:
          sum += ParseInt(pc)
        emit (word, sum)
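     (For something runnable, here is roughly the same word count as a pair of Python scripts for Hadoop Streaming, one of the ways mentioned above to write the functions in Python. This is a sketch: the file names are hypothetical, and it relies on Streaming delivering the mapper's output to the reducer sorted by key.)
       # mapper.py - emit "word<TAB>1" for every word read from stdin
       import sys
       for line in sys.stdin:
           for word in line.split():
               print(word + "\t1")

       # reducer.py - keys arrive sorted, so all counts for one word are
       # contiguous; sum each run of lines and emit "word<TAB>total"
       import sys
       current, total = None, 0
       for line in sys.stdin:
           word, count = line.rsplit("\t", 1)
           if word != current:
               if current is not None:
                   print(current + "\t" + str(total))
               current, total = word, 0
           total += int(count)
       if current is not None:
           print(current + "\t" + str(total))
     (The pair is submitted with the hadoop-streaming jar, passing -mapper, -reducer, -input, and -output.)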
  35. Introduction to Pig • Pig – ETL for big data – Structure – Pig Latin • Parallel data processing for Hadoop • Not trying to get you to learn Pig. Just want you to want to learn it.
  36. Here’s what SQL looks like
      Select customername, count(orderdate) as totalOrders
      From salesOrders so
      Join customers c On so.custid = c.custid
      Group by customername
  37. Pig
      trx = load 'transaction' as (customer, orderamount);
      grouped = group trx by customer;
      ttl = foreach grouped generate group, SUM(trx.orderamount) as tp;
      cust = load 'customers' as (customer, postalcode);
      result = join ttl by group, cust by customer;
      dump result;
      Executes one step at a time
  38. Pig is like SSIS • One step at a time. One thing executes, then the next in the script, acting on the variable declarations above it
  39. How Pig Works • Pig Latin goes to a pre-processor • The pre-processor creates MapReduce jobs that get submitted to the JobTracker
  40. Pig Components • Data Types • Inputs & Outputs • Relational Operators • UDFs • Scripts & Testing
  41. Pig Data Types • Scalar – Int – Long – Float – Double – CharArray – ByteArray • Complex – Map (key/value pair) – Tuple (fixed-size ordered collection) – Bag (collection of tuples)
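     (These types show up when you declare a schema at load time. A sketch, reusing the transaction example from slide 37 and adding a hypothetical bag-of-tuples field:)
       trx = load 'transaction' as (customer:chararray, orderamount:double);
       orders = load 'orders' as (orderid:long, lines:bag{t:(sku:chararray, qty:int)});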
  42. Pig: Inputs/Outputs • Load – PigStorage – TextLoader – HBaseStorage • Store – PigStorage – HBaseStorage • Dump – Dumps to console – Don't dump a ton of data…uh oh…
  43. Pig: Relational Operators • Foreach – projection operator, applies expression to every row in the pipeline – Flatten – used with complex types, PIVOT • Filter – WHERE • Group, Cogroup – GROUP BY (Cogroup on multiple keys) • ORDER BY • Distinct • JOIN (INNER, OUTER, CROSS) • LIMIT – TOP • Sample – Random sample • Parallel – level of parallelism on the reducer side • Union
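     (A sketch of how a few of these chain together, continuing the transaction example; SUM is a Pig built-in:)
       big    = filter trx by orderamount > 100.0;   -- Filter is WHERE
       bycust = group big by customer;               -- Group is GROUP BY
       totals = foreach bycust generate group as customer, SUM(big.orderamount) as ttl;
       byttl  = order totals by ttl desc;            -- Order is ORDER BY
       top10  = limit byttl 10;                      -- Limit is TOP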
  44. Pig: UDFs • Written in Java/Python • String manipulation, math, complex type operations, parsing
  45. Pig: Useful commands • Describe – shows schema • Explain – shows the logical and physical MapReduce plan • Illustrate – runs a sample of your data to test your script • Stats – produced after every run and includes start/end times, # of records, MapReduce info • Supports parameter substitution and parameter files • Supports macros and functions (define) • Supports includes for script organization
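     (Run against the relations from the sketch above, these would look like:)
       describe totals;    -- prints the schema Pig has inferred
       explain totals;     -- shows the MapReduce plan Pig will generate
       illustrate totals;  -- exercises the pipeline on a small sample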
  46. Pig Demo
  47. Introduction to HIVE • Very popular • Hive Query Language • Defining Tables, Views, Partitioning • Querying and Integration • VERY SQL-LIKE • Developed by Facebook • Data Warehouse for Hadoop • Based on the SQL-92 specification
  48. SQL vs Hive • Almost useless to compare the two, because they are so similar • Create table Internal/External • Hive is schema-on-read – It defines a schema over your data that already exists in HDFS
  49. Hive is not a replacement for SQL • So don't throw out SQL just yet • Hive is for batch processing large data sets that may span hundreds, or even thousands, of machines – Not for row-level updates • Hive has high overhead when starting a job. It translates queries to MapReduce, so it takes time • Hive does not cache data • Hive performance tuning is mainly Hadoop performance tuning • Similarity in the query engine, but different architectures for different purposes • Way too slow for OLTP workloads
  50. Hive Components • Data Types • DDL • DML • Queries • Views, Indexes, Partitions • UDFs
  51. Hive Data Types • Scalar – TinyInt – SmallInt – Int – BigInt – Boolean – Float – Double – TimeStamp – String – Binary • Complex – Struct – Array (collection) – Map (key/value pair)
  52. What is a Hive Table? • CREATE DATABASE NewDB – LOCATION 'hdfshuaNewDB' • CREATE TABLE • A Hive table consists of: – Data: typically a file in HDFS – Schema: in the form of metadata stored in a relational database • Schema and data are separate – A schema can be defined for existing data – Data can be added or removed independently – Hive can be pointed to existing data • You have to define a schema if you have existing data in HDFS that you want to use in Hive
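     ("Hive can be pointed to existing data" is what an EXTERNAL table does. A minimal sketch with hypothetical paths and columns:)
       -- schema-on-read: nothing is moved or converted; Hive just records
       -- metadata describing the files already sitting in this directory
       CREATE EXTERNAL TABLE salesorders (productname STRING, costofgoods DOUBLE)
       ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
       LOCATION '/data/salesorders';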
  53. How does Hive work? • Hive as a Translation Tool – Compiles and executes queries – Hive translates the SQL Query to a MapReduce job • Hive as a structuring tool – Creates a schema around the data in HDFS – Tables stored in directories • Hive Tables have rows and columns and data types • Hive Metastore – Namespace with a set of tables – Holds table definitions • Partitioning – Choose a partition key – Specify key when you load data
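     (Partitioning in practice, as a sketch with hypothetical names:)
       -- each distinct logdate value becomes its own subdirectory
       -- under the table's directory in HDFS
       CREATE TABLE weblogs (ip STRING, url STRING)
       PARTITIONED BY (logdate STRING);
       LOAD DATA INPATH '/staging/2013-06-01.log'
       INTO TABLE weblogs PARTITION (logdate = '2013-06-01');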
  54. Define a Hive Table
      Create Table myTable (name string, age int)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ';'
      STORED AS TEXTFILE;
  55. Loading Data • Use LOAD DATA to import data into a Hive table:
      LOAD DATA LOCAL INPATH 'input/mydata/data.txt'
      INTO TABLE myTable;
      • The files are not modified by Hive – they are loaded as-is • Use the word OVERWRITE to write over a file of the same name • Hive can read all the files in a particular directory • The schema is checked when the data is queried – If a row does not match the schema, it will be read as null
  56. Querying Data • SELECT – WHERE – UNION ALL/DISTINCT – GROUP BY – HAVING – LIMIT – REGEX • Subqueries • JOIN – INNER – OUTER • ORDER BY – Uses a single reducer • SORT BY – Multiple reducers, with a sorted file from each
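     (A representative query touching several of these clauses; table and column names are hypothetical:)
       SELECT c.customername, count(*) AS totalorders
       FROM salesorders so
       JOIN customers c ON so.custid = c.custid
       WHERE so.orderdate >= '2013-01-01'
       GROUP BY c.customername
       HAVING count(*) > 5
       SORT BY totalorders;
     (SORT BY sorts within each reducer; swapping in ORDER BY would force everything through a single reducer to get a total order.)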
  57. Hive Demo
  58. Pig vs Hive • Famous Yahoo Blog Post – http://developer.yahoo.com/blogs/hadoop/pig-hive-yahoo464.html • PIG – ETL – For preparing data for easier analysis – Good for SQL authors that take the time to learn something new – Unless you store it, all data goes away when the script is finished • Hive – Analysis • When you have to answer a specific question – Good for SQL authors – Excel connectivity – Persists data in the Hadoop data store
  59. Sqoop • SQL to Hadoop – SQL Server/Oracle/Something with a JDBC driver • Import – From an RDBMS into HDFS • Export – From HDFS into an RDBMS • Other Commands – Create hive table – Evaluate import statement
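     (A typical import invocation, with a hypothetical connection string and table:)
       # pull one table out of SQL Server into HDFS over JDBC,
       # copying with 4 parallel map tasks
       sqoop import \
         --connect 'jdbc:sqlserver://dbserver;database=Sales' \
         --username ike --password secret \
         --table SalesOrders \
         --target-dir /data/salesorders \
         --num-mappers 4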
  60. HUE • Hadoop User Experience
  61. HCatalog • Metadata and table management system for Hadoop • Provides a shared schema and data type mechanism for various Hadoop tools (Pig, Hive, MapReduce) – Enables interoperability across data processing tools – Enables users to choose the best tools for their environments • Provides a table abstraction so that users need not be concerned with how data is stored – Presents users with a relational view of data
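     (For example, a Pig script can load a table that Hive defined, with the schema coming from the HCatalog metastore instead of being declared in the script. A sketch; the loader's package name has moved between releases, this is the older org.apache.hcatalog form:)
       orders = load 'salesdb.salesorders' using org.apache.hcatalog.pig.HCatLoader();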
  62. HCatalog DDL • CREATE/ALTER/DROP Table • SHOW TABLES • SHOW FUNCTIONS • DESCRIBE • Supports a subset of Hive DDL
  63. Why do we have HCat? • Tools don't tend to agree on – What a schema is – What data types are – How data is stored • HCatalog solution – Provides one consistent data model for various Hadoop tools – Provides a shared schema – Allows users to see when shared data is available
  64. HCatalog – HBase Integration • Connects HBase tables to HCatalog • Uses various Hadoop tools • Provides flexibility with data in HBase or HDFS
  65. HCat Demo
  66. HBase • NoSQL Database • Modeled after Google BigTable • Written in Java • Runs on top of HDFS • Features – Compression – In-memory operations – Bloom filters • Can serve as input or output for MapReduce jobs • Facebook's messaging platform uses it
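     (A taste of the HBase shell, with made-up table and column family names:)
       # create a table with one column family, write a cell, read it back
       create 'users', 'info'
       put 'users', 'row1', 'info:name', 'Ike'
       get 'users', 'row1'
       scan 'users'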
  67. Yarn • Apache Hadoop Next Gen MapReduce • Yet Another Resource Negotiator • Separates resource management and processing components – Breaking up the JobTracker • YARN was born of a need to enable a broader array of interaction patterns for data stored in HDFS beyond MapReduce
  68. Impala • Cloudera • Real-time queries for Hadoop • Low-latency queries using SQL to HDFS or HBase
  69. Storm • Free and open source distributed real-time computation system • Makes it easy to process unbounded streams of data • Storm is fast – A million tuples processed per second per node
  70. The Players • Hortonworks • Cloudera • MapR • Microsoft HDInsight • Microsoft PDW • IBM • Oracle • Amazon • Rackspace • Google
  71. The Future • Hadoop features will push into RDBMS systems • RDBMS features will continue to push into Hadoop • Tons of 3rd party vendors and open source projects have applications for Hadoop and RDBMS/Hadoop integration • Lots of buy-in, lots of progress, lots of changes
  72. How to Learn Hadoop • Lots of YouTube videos online • Hortonworks, MapR, and Cloudera all have good videos for free • Hortonworks sandbox • Azure HDInsight VMs • Hadoop: The Definitive Guide • Tons of blog posts • Lots of open source projects
  73. Ike Ellis • www.ikeellis.com • SQL PASS Book Readers – VC Leader • @Ike_Ellis • 619.922.9801 • Microsoft MVP • Quick Tips – YouTube • San Diego TIG Founder and Chairman • San Diego .NET User Group Steering Committee Member