The Fundamentals
Guide to HDP and
HDInsight
Gert Drapers (#DataDude)
Principal Software Design Engineer
This session gives you an architectural overview of, and an introduction to the inner workings of, HDP 2.0 (http://hortonworks.com/products/hdp-windows/) and HDInsight. The world has embraced the Hadoop toolkit to solve data problems ranging from ETL and data warehouses to event-processing pipelines. Because Hadoop consists of many components, services, and interfaces, understanding its architecture is crucial before you can successfully integrate it into your own environment.

The Fundamentals Guide to HDP and HDInsight

  1. The Fundamentals Guide to HDP and HDInsight. Gert Drapers (#DataDude), Principal Software Design Engineer
  2. http://www.economist.com/node/15579717?Story_ID=15579717
     Copyright © The Economist Newspaper Limited 2012. All rights reserved.
  3. The 4Vs of Big Data: Volume, Velocity, Variability, and Variety
     Source: http://nosql.mypopescu.com/post/9621746531/a-definition-of-big-data
  4. New in Hadoop 2
     • YARN
       • ResourceManager
       • NodeManager
       • ApplicationMaster
     • HDFS 2
       • NameNode HA
       • Snapshots
       • Federation
     Source: http://hortonworks.com/hadoop/yarn/
  5. Hortonworks Data Platform for Windows
     • Leverages work from Hortonworks and Microsoft
     • 100% open source Apache Hadoop
     • Built on the latest releases across Hadoop (2.2)
     • YARN
     • Stinger Phase 2 (faster queries)
     • Only distribution available on Windows Server
     • Harness existing .NET and Java skills to write MapReduce
     • Utilize familiar BI tools for analysis, including Microsoft Excel
     On-premises: self-deployed Hadoop
     See: http://hortonworks.com/products/releases/hdp-2-windows/
  6. Microsoft Azure HDInsight 3.0
     • Microsoft’s cloud Hadoop offering
     • 100% open source Apache Hadoop
     • Built on the latest releases across Hadoop (2.2)
     • YARN
     • Stinger Phase 2 (faster queries)
     • Up and running in minutes with no hardware to deploy
     • Harness existing .NET and Java skills to write MapReduce
     • Utilize familiar BI tools for analysis, including Microsoft Excel
     Cloud: Hadoop on Microsoft Azure
     See: http://www.windowsazure.com/en-us/solutions/big-data/
  7. Stinger Phase 2 in Hive 0.12
     • Query optimizer (QO) improvements
     • Predicate pushdown
     • ORC file improvements
     http://hortonworks.com/labs/stinger/
  8. Demo: Getting Started with Hadoop 2 in Azure with HDInsight
  9. HDFS
  10. HDFS Architecture: Distributed Fault-Tolerant File System
      • Block based (64 MB default)
      • Hierarchical file organization of directories and files
      • Write once, read many
      • Highly portable
      • Optimized for small numbers of very large files
      Source: http://hortonworks.com/hadoop/hdfs/
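To make the block-based layout on slide 10 concrete, here is a minimal Python sketch of how a file maps onto blocks. It assumes the 64 MB block size stated on the slide and a replication factor of 3, which the slide does not mention; the function name `hdfs_block_layout` is hypothetical and is not part of any Hadoop API.

```python
MB = 1024 * 1024


def hdfs_block_layout(file_size_bytes, block_size_bytes=64 * MB, replication=3):
    """Return (block_count, last_block_bytes, raw_bytes_stored) for one file.

    block_size_bytes defaults to the 64 MB figure from the slide;
    replication=3 is an assumption made for this sketch.
    """
    if file_size_bytes < 0:
        raise ValueError("file size must be non-negative")
    if file_size_bytes == 0:
        return 0, 0, 0
    # Integer ceiling division: every started block counts as one block.
    block_count = (file_size_bytes + block_size_bytes - 1) // block_size_bytes
    # The last block only holds whatever is left over.
    last_block_bytes = file_size_bytes - (block_count - 1) * block_size_bytes
    # Each byte is stored once per replica.
    raw_bytes_stored = file_size_bytes * replication
    return block_count, last_block_bytes, raw_bytes_stored


# A 200 MB file becomes three full 64 MB blocks plus one 8 MB block.
blocks, last, raw = hdfs_block_layout(200 * MB)
print(blocks, last // MB, raw // MB)
```

The same arithmetic hints at why HDFS favors a small number of very large files: a thousand 1 KB files still need a thousand separate blocks to be tracked, while a single 64 MB file needs one.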
  11. YARN
  12. A long time ago, in a data center far, far away…
  13. Episode IV: There was Map Reduce
  14. Introduction to Map/Reduce
      Functionally:
        Map:    f(k1, v1) → list(k2, v2)
        Reduce: f(k2, list(v2)) → (k2, v3)
      In practice, WordCount on "The quick brown fox jumps over the lazy dog":
        Map:     (the,1), (quick,1), (brown,1), (fox,1), (jumps,1), (over,1), (the,1), (lazy,1), (dog,1)
        Shuffle: (the,(1,1)), (quick,1), (brown,1), (fox,1), (jumps,1), (over,1), (lazy,1), (dog,1)
        Reduce:  (the,2), (quick,1), (brown,1), (fox,1), (jumps,1), (over,1), (lazy,1), (dog,1)
      In code: (the code image from the slide is not part of this transcript)
      Then scale to TB/PB of data over tens, hundreds, or thousands of nodes.
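The three phases from slide 14 can be sketched in a few lines of single-process Python. This is only an illustration of the functional model, not Hadoop's MapReduce API: the function names are hypothetical, and lowercasing the words is an assumption made so that "The" and "the" land on the same key, as they do on the slide.

```python
from collections import defaultdict


def map_phase(line):
    """Map: f(k1, v1) -> list(k2, v2). Emits (word, 1) for every word."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all values that share a key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key, values):
    """Reduce: f(k2, list(v2)) -> (k2, v3). Sums the counts for one word."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["The quick brown fox jumps over the lazy dog"])
print(counts["the"], counts["fox"])
```

In a real cluster the map calls run in parallel on the nodes that hold the input blocks, and the shuffle moves data across the network so that each reducer sees every value for its keys.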
  15. And Map Reduce was… good?
  16. Episode V: Then came the abstractions
  17. A pig who eats everything
  18. logs = LOAD 'wasb://sampledata@mwinklenortheurope.blob.core.windows.net/weblogs'
          USING PigStorage(' ') AS (
              datereq:chararray, timereq:chararray, s_sitename:chararray,
              cs_method:chararray, cs_uri_stem:chararray, cs_uri_query:chararray,
              s_port:chararray, cs_username:chararray, c_ip:chararray,
              cs_User_Agent:chararray, cs_Cookie:chararray, cs_Referer:chararray,
              cs_host:chararray, sc_status:chararray, sc_substatus:chararray,
              sc_win32_status:chararray, sc_bytes:int, cs_bytes:int, time_taken:int);

      SET default_parallel 5;

      -- remove header rows
      filtered_logs = FILTER logs BY datereq != '#';

      -- top 25 referrers by number of requests
      referrer_logs = GROUP filtered_logs BY cs_Referer;
      summary_referrer = FOREACH referrer_logs GENERATE $0, COUNT($1) AS COUNT,
          SUM(filtered_logs.sc_bytes) AS TotalEgress,
          AVG(filtered_logs.time_taken) AS AverageTimeTaken;
      sorted_summary = ORDER summary_referrer BY COUNT DESC;
      limit_summary = LIMIT sorted_summary 25;

      -- top 25 URI stems by number of requests
      grouped_by_stem = GROUP filtered_logs BY cs_uri_stem;
      summary_ip = FOREACH grouped_by_stem GENERATE $0, COUNT($1) AS NumberOfRequests,
          SUM(filtered_logs.sc_bytes) AS TotalEgress,
          AVG(filtered_logs.time_taken) AS AverageTimeTaken;
      sorted_summary = ORDER summary_ip BY NumberOfRequests DESC;
      limited_summary = LIMIT sorted_summary 25;

      STORE filtered_logs INTO 'wasb://output@mwinklenortheurope.blob.core.windows.net/tmp/results5/forhive' USING PigStorage('\t');
      STORE limited_summary INTO 'wasb://output@mwinklenortheurope.blob.core.windows.net/tmp/results5/stemstats' USING PigStorage('\t');
      STORE limit_summary INTO 'wasb://output@mwinklenortheurope.blob.core.windows.net/tmp/results5/referer_logs';
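For readers new to Pig, the GROUP / FOREACH … GENERATE / ORDER / LIMIT sequence on slide 18 is a group-by with aggregates followed by a top-N. The Python sketch below shows the same computation on a handful of in-memory rows. The sample rows and the `summarize` helper are hypothetical, and the sketch ignores Pig details such as null handling and parallelism.

```python
from collections import defaultdict


def summarize(rows, key, limit=25):
    """Group rows by `key`, then compute count, total bytes and mean time.

    Mirrors COUNT, SUM(sc_bytes) and AVG(time_taken) from the Pig script,
    followed by ORDER ... DESC on the count and LIMIT.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    summary = []
    for group_key, members in groups.items():
        count = len(members)
        total_egress = sum(r["sc_bytes"] for r in members)
        average_time = sum(r["time_taken"] for r in members) / count
        summary.append((group_key, count, total_egress, average_time))

    # Highest request count first, then keep only the top `limit` groups.
    summary.sort(key=lambda item: item[1], reverse=True)
    return summary[:limit]


# Hypothetical sample rows using column names from the Pig schema.
sample_rows = [
    {"cs_Referer": "http://a.example", "sc_bytes": 100, "time_taken": 10},
    {"cs_Referer": "http://a.example", "sc_bytes": 300, "time_taken": 30},
    {"cs_Referer": "http://b.example", "sc_bytes": 50, "time_taken": 5},
]
top_referrers = summarize(sample_rows, "cs_Referer")
print(top_referrers)
```

Pig compiles the equivalent script into MapReduce jobs, so the grouping here corresponds to the shuffle and the aggregates to the reduce step.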
  19. Hive for those who know SQL
  20. CREATE EXTERNAL TABLE websites_logs_raw (
          datereq STRING, timereq STRING, s_sitename STRING,
          cs_method STRING, cs_uri_stem STRING, cs_uri_query STRING,
          s_port STRING, cs_username STRING, c_ip STRING,
          cs_User_Agent STRING, cs_Cookie STRING, cs_Referer STRING,
          cs_host STRING, sc_status INT, sc_substatus STRING,
          sc_win32_status STRING, sc_bytes INT, cs_bytes INT, time_taken INT)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
      STORED AS TEXTFILE
      LOCATION 'wasb://sampledata@mwinklenortheurope.blob.core.windows.net/weblogs2'
      tblproperties ("skip.header.line.count"="1");

      set mapred.input.dir.recursive=true;
      set hive.mapred.supports.subdirectories=true;

      select count(*) from websites_logs_raw;
  21. Cascading/Scalding to bring a modern JVM API for analytics
  22. WordCount in Scalding
      See: https://github.com/twitter/scalding
  23. But the abstractions all shared one thing… Map Reduce
  24. WordCount in Scalding… (annotated with its Map Phase and Reduce Phase)
      See: https://github.com/twitter/scalding
  25. Map/Reduce v1 Architecture
      Source: http://hortonworks.com/wp-content/uploads/2012/08/MRArch.png
  26. Episode VI: One YARN to rule them all
  27. Compute Model != Resource Model
  28. YARN Architecture
      Source: http://hortonworks.com/wp-content/uploads/2012/08/YARNArch.png
      • Removes the contention caused by the JobTracker having to do everything
      • More resilient to ResourceManager (RM) failures
      • Scales to a larger number of active jobs
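The split between resource model and compute model on slides 27 and 28 can be illustrated with a toy model: one component only hands out capacity, while each application decides for itself what to run in it. The Python below is a deliberately simplified sketch and not the YARN API. The class and method names merely echo the component names from the slides, and the first-fit placement, node names, and memory sizes are assumptions.

```python
class ResourceManager:
    """Toy resource model: tracks free memory per node and grants containers."""

    def __init__(self, node_capacity_mb):
        self.free_mb = dict(node_capacity_mb)

    def allocate(self, memory_mb):
        # First-fit placement: take the first node with enough free memory.
        for node, free in self.free_mb.items():
            if free >= memory_mb:
                self.free_mb[node] = free - memory_mb
                return (node, memory_mb)
        return None  # the cluster is full for a request of this size

    def release(self, container):
        node, memory_mb = container
        self.free_mb[node] += memory_mb


class ApplicationMaster:
    """Toy compute model: one per application, negotiates its own containers."""

    def __init__(self, app_id, resource_manager):
        self.app_id = app_id
        self.rm = resource_manager

    def request_containers(self, task_count, memory_mb):
        granted = []
        for _ in range(task_count):
            container = self.rm.allocate(memory_mb)
            if container is None:
                break
            granted.append(container)
        pending = task_count - len(granted)
        return granted, pending


rm = ResourceManager({"node1": 2048, "node2": 1024})
am = ApplicationMaster("wordcount", rm)
granted, pending = am.request_containers(task_count=4, memory_mb=1024)
print(granted, pending)
```

Because the resource side knows nothing about maps or reduces, the same allocation logic can serve any kind of application, which is the point of the "Compute Model != Resource Model" slide.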
  29. Other Interesting YARN projects
  30. Some Existing YARN apps
      • Storm on YARN
      • HBase on YARN
      • Spark
      • Giraph
      • Hamster (MPI on YARN)
      • Memcached
      • Dryad
      Source: http://hortonworks.com/
  31. Writing your own YARN app for fun and profit…
  32. Start by clicking here
  33. Yikes…
  34. See Slide 20: Enter Abstractions
  35. Tez
      http://tez.incubator.apache.org/
      Source: http://hortonworks.com/blog/introducing-tez-faster-hadoop-processing/
  36. REEF
      http://www.reef-project.org/
  37. Kitten
      https://github.com/cloudera/kitten
      http://www.lua.org/manual/5.1
  38. What about .NET?
  39. Dryad on YARN: sources, background
  40. The Microsoft Data Platform
  41. Resources
      • All about HDInsight
      • Getting Started with HDInsight
      • Windows HDP 2.0
      • Hadoop project
      • HadoopSDK Codeplex project
      • Getting Started with YARN blog series
      • YARN book
  42. Let us know how you feel about this session! Give your feedback via www.techdaysapp.nl for a chance to win one of the 20 prizes*. Winners will be announced via Twitter (#TechDaysNL). Use the personal code on your badge.
      * All results are final; the prizes shown are examples.
  43. Backup Slides
  44. Moving Data Between Stores
      • Sqoop
        • Data in or out of relational store
      • Pig
        • Set of Storage & Loaders (JDBC, Mongo, etc.)
      • Hive
        • Table formats (Mongo, Azure Tables)
  45. Website log processing, Pig, Hive
  46. logs = LOAD 'wasb://sampledata@mwinklenortheurope.blob.core.windows.net/weblogs'
          USING PigStorage(' ') AS (
              datereq:chararray, timereq:chararray, s_sitename:chararray,
              cs_method:chararray, cs_uri_stem:chararray, cs_uri_query:chararray,
              s_port:chararray, cs_username:chararray, c_ip:chararray,
              cs_User_Agent:chararray, cs_Cookie:chararray, cs_Referer:chararray,
              cs_host:chararray, sc_status:chararray, sc_substatus:chararray,
              sc_win32_status:chararray, sc_bytes:int, cs_bytes:int, time_taken:int);

      SET default_parallel 100;

      -- remove header rows
      filtered_logs = FILTER logs BY datereq != '#';

      -- top 1000 URI stems by number of requests
      grouped_by_stem = GROUP filtered_logs BY cs_uri_stem;
      summary_ip = FOREACH grouped_by_stem GENERATE $0, COUNT($1) AS NumberOfRequests,
          SUM(filtered_logs.sc_bytes) AS TotalEgress,
          AVG(filtered_logs.time_taken) AS AverageTimeTaken;
      sorted_summary = ORDER summary_ip BY NumberOfRequests DESC;
      limited_summary = LIMIT sorted_summary 1000;

      --STORE limited_summary INTO 'wasb://output@mwinklenortheurope.blob.core.windows.net/build2014/stats' USING PigStorage('\t');
  47. CREATE EXTERNAL TABLE websites_logs_raw (
          datereq STRING, timereq STRING, s_sitename STRING,
          cs_method STRING, cs_uri_stem STRING, cs_uri_query STRING,
          s_port STRING, cs_username STRING, c_ip STRING,
          cs_User_Agent STRING, cs_Cookie STRING, cs_Referer STRING,
          cs_host STRING, sc_status INT, sc_substatus STRING,
          sc_win32_status STRING, sc_bytes INT, cs_bytes INT, time_taken INT)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
      STORED AS TEXTFILE
      LOCATION 'wasb://sampledata@mwinklenortheurope.blob.core.windows.net/weblogs2'
      tblproperties ("skip.header.line.count"="1");

      set mapred.input.dir.recursive=true;
      set hive.mapred.supports.subdirectories=true;

      select count(*) from websites_logs_raw;
  48. Interacting with SQL DB
  49. Sqoop import from the command line:

      bin\sqoop import --connect "jdbc:sqlserver://[yourserver].database.windows.net:1433;database=AdventureWorks2012;user=[username];password=[password]" --table SalesOrderDetail --hive-import -m 10 -- --schema Sales

      The same import as an HDInsight PowerShell job definition:

      New-AzureHDInsightSqoopJobDefinition -Command 'import --connect "jdbc:sqlserver://[yourserver].database.windows.net:1433;database=AdventureWorks2012;user=[username];password=[password]" --table SalesOrderDetail --hive-import -m 10 -- --schema Sales'

      Writing Pig output to SQL Server with DBStorage:

      REGISTER lib/piggybank.jar;
      REGISTER c:\apps\dist\sqljdbc_3.0\enu\sqljdbc4.jar;
      STORE limited_summary INTO '/doesnotmatter' USING org.apache.pig.piggybank.storage.DBStorage(
          'com.microsoft.sqlserver.jdbc.SQLServerDriver',
          'jdbc:sqlserver://[yourserver].database.windows.net;database=AdventureWorks2012;user=[username];password=[password]',
          'INSERT INTO OutputFromPig(cs_uri_stem, NumberOfRequests, TotalEgress, AverageTimeTaken) VALUES (?,?,?,?)');
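The DBStorage call on slide 49 passes a parameterized INSERT statement whose `?` placeholders receive one value per field of each output tuple. The Python sketch below shows that binding pattern using the standard-library `sqlite3` module as a stand-in for SQL Server. The in-memory database, the column types, and the sample tuples are assumptions for illustration; only the table name, column names, and INSERT text come from the slide.

```python
import sqlite3

# Same statement shape as the one passed to DBStorage on the slide.
INSERT_SQL = (
    "INSERT INTO OutputFromPig"
    "(cs_uri_stem, NumberOfRequests, TotalEgress, AverageTimeTaken) "
    "VALUES (?,?,?,?)"
)


def store_tuples(connection, tuples):
    """Bind each tuple's fields to the four placeholders and insert it."""
    with connection:  # commits on success, rolls back on error
        connection.executemany(INSERT_SQL, tuples)


# sqlite3 in-memory database standing in for the SQL Server target.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE OutputFromPig ("
    "cs_uri_stem TEXT, NumberOfRequests INTEGER, "
    "TotalEgress INTEGER, AverageTimeTaken REAL)"
)

# Hypothetical rows shaped like the limited_summary relation.
store_tuples(conn, [
    ("/index.html", 120, 48000, 15.5),
    ("/about.html", 30, 9000, 8.0),
])

rows = conn.execute(
    "SELECT cs_uri_stem, NumberOfRequests FROM OutputFromPig "
    "ORDER BY NumberOfRequests DESC"
).fetchall()
print(rows)
```

Using placeholders rather than string concatenation keeps values such as URI stems from being interpreted as SQL, which matters when the data comes from raw web logs.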