© Hortonworks Inc. 2011
Hadoop Introduction
Olivier Renault
Solution Engineer - Hortonworks
Hortonworks
A Brief History of Apache Hadoop
Timeline milestones, 2004 to 2013: Apache project established; Yahoo! begins to operate at scale; Hortonworks Data Platform; Enterprise Hadoop
• Focus on INNOVATION. 2005: Yahoo! creates team under E14 to work on Hadoop
• Focus on OPERATIONS. 2008: Yahoo! team extends focus to operations to support multiple projects & growing clusters
• STABILITY. 2011: Hortonworks created to focus on "Enterprise Hadoop". Starts with 24 key Hadoop engineers from Yahoo!
Hortonworks Snapshot
• We distribute the only 100% open source Enterprise Hadoop distribution: Hortonworks Data Platform
• We engineer, test & certify HDP for enterprise usage
• We employ the core architects, builders and operators of Apache Hadoop
• We drive innovation within Apache Software Foundation projects
• We are uniquely positioned to deliver the highest quality of Hadoop support
• We enable the ecosystem to work better with Hadoop
Develop, Distribute, Support: we develop, distribute and support the ONLY 100% open source Enterprise Hadoop distribution
Endorsed by Strategic Partners
Headquarters: Palo Alto, CA
Employees: 180+ and growing
Investors: Benchmark, Index, Yahoo
Leadership that Starts at the Core
• Driving next generation Hadoop
– YARN, MapReduce2, HDFS2, High Availability, Disaster Recovery
• 420k+ lines authored since 2006
– More than twice nearest contributor
• Deeply integrating w/ecosystem
– Enabling new deployment platforms (ex. Windows & Azure, Linux & VMware HA)
– Creating deeply engineered solutions (ex. Teradata big data appliance)
• All Apache, NO holdbacks
– 100% of code contributed to Apache
HDP: Enterprise Hadoop Distribution
Hortonworks Data Platform (HDP): Enterprise Hadoop
• The ONLY 100% open source and complete distribution
• Enterprise grade, proven and tested at scale
• Ecosystem endorsed to ensure interoperability
Components shown in the HDP diagram:
• HADOOP CORE (Distributed Storage & Processing): HDFS, WEBHDFS, MAP REDUCE, YARN (in 2.0)
• PLATFORM SERVICES (Enterprise Readiness): HA, DR, Snapshots, Security, …
• DATA SERVICES (Store, Process and Access Data): HCATALOG, HIVE, PIG, HBASE, SQOOP, FLUME
• OPERATIONAL SERVICES (Manage & Operate at Scale): OOZIE, AMBARI
• Runs on: OS, Cloud, VM, Appliance
Overview of Hadoop
In the Beginning
• It all started when Google needed a way to:
– Do page ranking
– Determine which web sites to provide for searches
Page Rank Solution - Simplified
• Google engineers developed an internal solution and published a paper on it titled "MapReduce: Simplified Data Processing on Large Clusters"
• It described a process something like this:
1. Map: many tasks look at links in parts of the data (for example, one part holds links to sites A, C, F and another holds links to sites B, D, E)
2. Shuffle: mapped results are shuffled to Reducers
3. Reduce: Reducers combine the links into a result
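The three steps above can be sketched in a few lines of Python. This is a minimal single-process illustration, not Google's or Hadoop's implementation: the `crawl` data and the function names `map_links`, `shuffle` and `reduce_links` are made up for the example, and counting inbound links stands in for the real ranking computation.

```python
from collections import defaultdict


def map_links(page, outlinks):
    """Map: emit (target, 1) for every link found on one page."""
    return [(target, 1) for target in outlinks]


def shuffle(pairs):
    """Shuffle: group all intermediate values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_links(target, counts):
    """Reduce: sum the counts collected for one target site."""
    return target, sum(counts)


# Hypothetical crawl data: page -> sites it links to.
crawl = {
    "page1": ["A", "C", "F"],
    "page2": ["B", "D", "E"],
    "page3": ["A", "B"],
}

# Step 1: each page could be handled by a separate map task.
intermediate = []
for page, outlinks in crawl.items():
    intermediate.extend(map_links(page, outlinks))

# Steps 2 and 3: group by target site, then reduce each group.
inbound = dict(reduce_links(k, v) for k, v in shuffle(intermediate).items())
print(inbound)
```

Because each map call only sees its own page and each reduce call only sees one key, both phases can be spread over many machines without the tasks talking to each other.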
Words to Websites - Simplified
• From words provide locations
• Provides what to display for a search
– Note: Page rank determines the order
• For example, to find URLs with books on them
Map input, K, V = <url, keyword>:
www.barnesandnoble.com    books calendars
www.yahoo.com             sports finance email celebrity
www.amazon.com            shoes books jeans
www.google.com            finance email search
www.microsoft.com         operating-system productivity system
Reduce output, K, V = <keyword, url>:
books        www.barnesandnoble.com www.amazon.com
email        www.google.com www.yahoo.com www.facebook.com
finance      www.yahoo.com www.google.com
groceries    www.walmart.com www.target.com
jeans        www.target.com www.amazon.com
Data Model
• MapReduce works on <key, value> pairs
(Key input, Value input) -> Map -> (Key intermediate, Value intermediate) -> Reduce -> (Key output, Value output)
Example:
(www.barnesandnoble.com, books calendars) -> Map -> (books, www.barnesandnoble.com) -> Reduce -> (books, www.barnesandnoble.com www.amazon.com)
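The input, intermediate and output pairs can be traced with a small inverted-index sketch in Python. It reuses four of the site rows from the slides; the function names `map_page` and `reduce_keyword` are illustrative only, and the in-memory grouping stands in for the shuffle that Hadoop performs across the cluster.

```python
from collections import defaultdict


def map_page(url, keywords):
    """(url, "kw1 kw2 ...") -> [(kw1, url), (kw2, url), ...]"""
    return [(keyword, url) for keyword in keywords.split()]


def reduce_keyword(keyword, urls):
    """(keyword, [url, url, ...]) -> (keyword, "url url ...")"""
    return keyword, " ".join(urls)


# Input pairs: (Key input, Value input)
pages = [
    ("www.barnesandnoble.com", "books calendars"),
    ("www.yahoo.com", "sports finance email celebrity"),
    ("www.amazon.com", "shoes books jeans"),
    ("www.google.com", "finance email search"),
]

# Intermediate pairs, grouped by key (the shuffle step).
groups = defaultdict(list)
for url, keywords in pages:
    for keyword, page_url in map_page(url, keywords):
        groups[keyword].append(page_url)

# Output pairs: (Key output, Value output)
index = dict(reduce_keyword(k, v) for k, v in sorted(groups.items()))
print(index["books"])
```

Looking up `index["books"]` gives the two bookseller URLs, which matches the (books, www.barnesandnoble.com www.amazon.com) output pair above.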
Hadoop Basic Core Architecture
• Hadoop Distributed File System (HDFS)
• MapReduce: Map (Mapper) -> Shuffle -> Reduce (Reducer)
HDFS & MapReduce
Enterprise Apache Hadoop
• Cluster Topology
• HDFS
• MapReduce
Cluster Topology
Diagram labels: Master Services, Slave Services
HDFS
• Distributed file system designed to run on commodity hardware
• Key assumptions
– Hardware failure is the norm
– Need streaming access to data sets
– Optimized for high throughput
– Data sets are large
– Append-only file system: write once, read many times
– Moving computation is cheaper than moving data
HDFS: Key Services
• NameNode
– Master service
– Manages the file system namespace and regulates access to files by clients
– Single service across the cluster
• DataNode
– Slave service, runs on slave nodes
– Manages block read/write for HDFS
– Pings NameNode for instructions
– If the heartbeat fails, the DataNode is removed from the cluster and replicas of its blocks on other DataNodes take over
• Secondary NameNode
– Merges the NameNode's file system image and edit logs
– Not a failover NameNode!
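The heartbeat rule for DataNodes can be illustrated with a toy model in Python. This is a sketch under stated assumptions, not HDFS code: the class `ToyNameNode`, its method names, the 30-second timeout and the node names are all hypothetical, and real HDFS uses its own configurable intervals and re-replication logic.

```python
class ToyNameNode:
    """Toy model: tracks DataNode heartbeats and where block replicas live."""

    def __init__(self, timeout_s, replication=3):
        self.timeout_s = timeout_s        # hypothetical heartbeat timeout
        self.replication = replication    # target number of replicas per block
        self.last_seen = {}               # datanode -> time of last heartbeat
        self.block_locations = {}         # block id -> set of datanodes

    def heartbeat(self, datanode, now):
        self.last_seen[datanode] = now

    def add_block(self, block, datanodes):
        self.block_locations[block] = set(datanodes)

    def expire_dead_nodes(self, now):
        """Drop nodes whose last heartbeat is older than the timeout."""
        dead = [dn for dn, seen in self.last_seen.items()
                if now - seen > self.timeout_s]
        for dn in dead:
            del self.last_seen[dn]
            for holders in self.block_locations.values():
                holders.discard(dn)
        return dead

    def under_replicated(self):
        """Blocks that now have fewer live replicas than the target."""
        return sorted(block for block, holders in self.block_locations.items()
                      if len(holders) < self.replication)


nn = ToyNameNode(timeout_s=30)
for dn in ("dn1", "dn2", "dn3", "dn4"):
    nn.heartbeat(dn, now=0)
nn.add_block("B1", ["dn1", "dn2", "dn3"])
nn.add_block("B2", ["dn2", "dn3", "dn4"])

# dn1 goes silent; the other nodes keep sending heartbeats.
for dn in ("dn2", "dn3", "dn4"):
    nn.heartbeat(dn, now=40)
dead = nn.expire_dead_nodes(now=45)
needs_copy = nn.under_replicated()
```

After `dn1` expires, block `B1` is still readable from `dn2` and `dn3`, and it shows up in `needs_copy` so a new replica can be scheduled.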
HDFS: File create lifecycle
[Diagram: an HDFS CLIENT issues Create for a FILE split into blocks B1 and B2; the NameNode coordinates; copies of B1 and B2 are placed on DataNodes across RACK1, RACK2 and RACK3; acks flow back; numbered steps 1 to 4 end with Complete.]
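The create flow in the diagram can be approximated in Python: split the file into blocks, pick a DataNode on each rack for every block, and treat a block as written once every replica has acknowledged. This is a simplified sketch; the rotation rule in `place_replicas`, the rack contents and the always-successful acks are assumptions for illustration and do not reproduce the real HDFS placement policy.

```python
def split_into_blocks(data, block_size):
    """Cut the file contents into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(block_index, racks, replication=3):
    """Toy policy: one DataNode per rack, rotating the starting rack per block."""
    rack_names = sorted(racks)
    chosen = []
    for r in range(replication):
        rack = rack_names[(block_index + r) % len(rack_names)]
        nodes = racks[rack]
        chosen.append((rack, nodes[block_index % len(nodes)]))
    return chosen


def write_file(data, racks, block_size):
    """Return {block id: replica pipeline}; each block needs an ack per replica."""
    layout = {}
    for i, _block in enumerate(split_into_blocks(data, block_size)):
        pipeline = place_replicas(i, racks)
        acks = [True for _ in pipeline]   # toy: every DataNode acknowledges
        if not all(acks):
            raise IOError(f"block B{i + 1} was not fully acknowledged")
        layout[f"B{i + 1}"] = pipeline
    return layout


# Hypothetical cluster: three racks with two DataNodes each.
racks = {
    "RACK1": ["dn1", "dn2"],
    "RACK2": ["dn3", "dn4"],
    "RACK3": ["dn5", "dn6"],
}
layout = write_file(b"x" * 150, racks, block_size=100)
```

With a 150-byte file and 100-byte blocks the sketch produces two blocks, each with one replica on every rack, so losing a single rack never loses a block.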
MapReduce
• A software framework for developing distributed applications that process vast amounts of data in parallel on large clusters
• A MapReduce job splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner across a cluster of nodes
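A small Python sketch shows the idea of splitting input into independent chunks and mapping them in parallel. Threads on one machine stand in for map tasks on cluster nodes, and the names `split_input`, `map_chunk` and `run_job`, the chunk size and the sample lines are all made up for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_input(lines, chunk_size):
    """Split the input into independent chunks of at most chunk_size lines."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def map_chunk(chunk):
    """Map task: count words in one chunk without looking at any other chunk."""
    counts = Counter()
    for line in chunk:
        counts.update(line.split())
    return counts


def run_job(lines, chunk_size=2, workers=3):
    chunks = split_input(lines, chunk_size)
    # Each chunk is handed to its own worker; no chunk depends on another.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    # Reduce: merge the partial counts into one result.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


lines = [
    "hadoop stores data",
    "hadoop processes data",
    "pig runs on hadoop",
    "hive runs on hadoop",
    "data data data",
]
word_counts = run_job(lines)
```

The result is the same whatever order the chunks finish in, which is what lets the framework schedule map tasks freely across nodes.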
MapReduce: Key Services
• Job Tracker
– Master service
– Schedules each job's component tasks on the Task Trackers
– Monitors task progress
– Reschedules failed tasks
• Task Tracker
– Spawns a job's tasks
– Reports progress to the Job Tracker
– Runs on slave nodes, collocated with the DataNode service
• Task (Map / Reduce)
– Spawned by the Task Tracker
– Executes a Map or Reduce task, encapsulating the business logic
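The "reschedules failed tasks" behaviour can be sketched as a retry loop in Python. This is a toy model, not Hadoop's scheduler: `ToyJobTracker`, the round-robin choice of Task Tracker, the limit of three attempts and the always-failing tracker `tt1` are assumptions made for the illustration.

```python
class ToyJobTracker:
    """Toy model: assigns tasks to Task Trackers and retries failures elsewhere."""

    def __init__(self, trackers, max_attempts=3):
        self.trackers = trackers
        self.max_attempts = max_attempts   # hypothetical retry limit
        self.log = []                      # (task, tracker, succeeded)

    def run_task(self, task, execute):
        for attempt in range(self.max_attempts):
            # Try a different tracker on every attempt.
            tracker = self.trackers[attempt % len(self.trackers)]
            succeeded = execute(task, tracker)
            self.log.append((task, tracker, succeeded))   # monitor progress
            if succeeded:
                return tracker
        raise RuntimeError(f"task {task} failed {self.max_attempts} times")


def execute(task, tracker):
    """Stand-in for running a task; tracker tt1 is assumed to be broken."""
    return tracker != "tt1"


jt = ToyJobTracker(["tt1", "tt2", "tt3"])
placements = {task: jt.run_task(task, execute)
              for task in ["map-0", "map-1", "reduce-0"]}
```

Every task first fails on `tt1` and then succeeds on `tt2`, so the job completes even though one Task Tracker never finishes any work.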
MapReduce: Job Lifecycle
• Map Phase: Mappers run on DataNode 1, DataNode 2 and DataNode 3
• Shuffle/Sort: data is shuffled across the network and sorted
• Reduce Phase: Reducers run on DataNode 1 and DataNode 3
The Key/Value Pairs of MapReduce
<K1, V1> -> Mapper -> <K2, V2> -> Shuffle/Sort -> <K2, (V2,V2,V2,V2)> -> Reducer -> <K3, V3>
• Map & Reduce operate on (key, value) pairs and output (key, value) pairs
• User provides map and reduce functions
• Input Key and Value are determined by the InputFormat
• Common InputFormats: TextInputFormat, KeyValueTextInputFormat, SequenceFileInputFormat
• Common OutputFormats: TextOutputFormat, SequenceFileOutputFormat
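How the InputFormat decides the mapper's input key and value can be shown with a Python sketch that mimics the record shapes of two of the formats listed above. TextInputFormat hands the mapper (byte offset, line) pairs, and KeyValueTextInputFormat splits each line at the first separator, a tab by default. The functions below are illustrative stand-ins, not Hadoop classes.

```python
def text_input_records(data):
    """Yield (byte offset, line) pairs, the record shape of TextInputFormat."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)


def key_value_text_records(data, sep="\t"):
    """Yield (key, value) pairs split at the first separator on each line,
    the record shape of KeyValueTextInputFormat."""
    for _offset, line in text_input_records(data):
        key, _, value = line.partition(sep)
        yield key, value


data = b"books\tamazon\nemail\tyahoo\n"
by_offset = list(text_input_records(data))
by_key = list(key_value_text_records(data))
```

The same bytes therefore reach the mapper as `(0, "books\tamazon")` with one format and as `("books", "amazon")` with the other, which is why the map function has to be written against the chosen InputFormat.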
PIG, HIVE
Enterprise Apache Hadoop
• Pig
• Hive
Pig
• An engine for executing programs on top of Hadoop
• It provides a language, Pig Latin, to specify these programs
Why use Pig?
• Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited sites by users aged 18 - 25
In Map-Reduce
170 lines of code, 4 hours to write
In Pig Latin
Users = load 'input/users' using PigStorage(',') as (name:chararray, age:int);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'input/pages' using PigStorage(',') as (user:chararray, url:chararray);
Jnd = join Fltrd by name, Pages by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into 'output/top5sites' using PigStorage(',');
9 lines of code, 15 minutes to write
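To make the data flow of the script concrete, here is a rough in-memory Python equivalent of the same steps (filter, join, group and count, order, limit). The sample `users` and `pages` rows and the function name `top_sites` are invented for the example, and the set-based join assumes user names are unique.

```python
from collections import Counter

# Hypothetical rows standing in for input/users and input/pages.
users = [("alice", 22), ("bob", 31), ("carol", 18), ("dave", 25)]
pages = [
    ("alice", "a.com"), ("carol", "a.com"), ("dave", "b.com"),
    ("bob", "c.com"), ("alice", "b.com"), ("carol", "a.com"),
]


def top_sites(users, pages, lo=18, hi=25, n=5):
    # Fltrd: keep users aged lo..hi (assumes each name appears once).
    fltrd = {name for name, age in users if lo <= age <= hi}
    # Jnd: keep page visits whose user survived the filter.
    jnd = [(user, url) for user, url in pages if user in fltrd]
    # Grpd + Smmd: group by url and count the visits.
    smmd = Counter(url for _user, url in jnd)
    # Srtd + Top5: order by clicks descending and keep the first n.
    return smmd.most_common(n)


top5 = top_sites(users, pages)
print(top5)
```

On a cluster Pig compiles the same steps into MapReduce jobs, so the script stays the same size while the data grows.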
Essence of Pig
• Map-Reduce is too low a level, SQL too high
• Pig Latin, a language intended to sit between the two
– Provides standard relational transforms (join, sort, etc.)
– Schemas are optional, used when available, can be defined at runtime
– User Defined Functions are first class citizens
Pig Elements
• Pig Latin
– High-level scripting language
– Requires no metadata or schema
– Statements translated into a series of MapReduce jobs
• Grunt
– Interactive shell
• Piggybank
– Shared repository for User Defined Functions (UDFs)
Hive
Motivation
• Hadoop as Enterprise Data Warehouse
• Ad hoc query support
• Schema information
• Tool for end-users
USED EXTENSIVELY FOR ANALYTICS & BUSINESS INTELLIGENCE
HiveQL Features
• HiveQL is similar to other SQL dialects
– Uses familiar relational database concepts (tables, rows, columns and schema)
– Based on the SQL-92 specification
• Treats Big Data as tables
• Converts SQL queries into MapReduce jobs
– User does not need to know MapReduce
• Also supports plugging custom MapReduce scripts into queries
Performing Queries
• SELECT
• WHERE clause
• UNION ALL and DISTINCT
• GROUP BY and HAVING
• JOIN
• ORDER BY
• LIMIT clause
– Rows returned are chosen at random
• Can use REGEX Column Specification
SELECT `(ds|hr)?+.+` FROM sales;
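The regex column specification in the query above selects every column of `sales` except `ds` and `hr`. A short Python sketch shows the matching rule on a list of column names. The column list is hypothetical, and because Python's `re` module only supports the possessive `?+` quantifier from version 3.11, the sketch uses a negative lookahead that accepts and rejects the same names.

```python
import re

# Same effect as Hive's (ds|hr)?+.+ : match any name except exactly "ds" or "hr".
# Written with a lookahead so it also runs on Python versions before 3.11.
EXCLUDE_DS_HR = re.compile(r"(?!(?:ds|hr)$).+")


def select_columns(columns, pattern):
    """Keep the columns whose full name matches the pattern."""
    return [column for column in columns if pattern.fullmatch(column)]


# Hypothetical columns of the sales table.
sales_columns = ["ds", "hr", "store", "amount", "dsx"]
selected = select_columns(sales_columns, EXCLUDE_DS_HR)
print(selected)
```

Note that `dsx` is kept: only names that are exactly `ds` or `hr` are excluded, which is the usual way to drop partition columns from a `SELECT *`-style query.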
Hive vs Pig
Pig and Hive work well together and many businesses use both.
Hive is a good choice:
• when you want to query the data
• when you need an answer to specific questions
• if you are familiar with SQL
Pig is a good choice:
• for ETL (Extract -> Transform -> Load)
• for preparing data for easier analysis
• when you have a long series of steps to perform
Ambari
Streamlining Hadoop Operations
Ambari: Install, Manage, Monitor, Tune
• Simplify Deployment and Maintenance: wizard-based install, handles dependency checks, recommends service mappings
• Ensure a Healthy Cluster: monitor, alert, heat maps
• Optimize Performance: root cause analysis for cluster tuning, fix problems BEFORE SLAs are breached
• Integrate with your operations: open APIs, standard web-tech
Community-Driven, Enterprise Class
• Productizes over a combined 100 person-years of operational Hadoop experience
• Stability and Scale: Ops & Dev team that took Yahoo! from 1000 to 45,000+ nodes
• Fast-paced, open source community driven innovation and integration: Red Hat, Teradata, HP, Microsoft contributions (and more)
Demonstration
Ā 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - InfographicHr365.us smith
Ā 

Recently uploaded (20)

Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
Ā 
Hot Sexy call girls in Patel NagaršŸ” 9953056974 šŸ” escort Service
Hot Sexy call girls in Patel NagaršŸ” 9953056974 šŸ” escort ServiceHot Sexy call girls in Patel NagaršŸ” 9953056974 šŸ” escort Service
Hot Sexy call girls in Patel NagaršŸ” 9953056974 šŸ” escort Service
Ā 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
Ā 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
Ā 
Dealing with Cultural Dispersion ā€” Stefano Lambiase ā€” ICSE-SEIS 2024
Dealing with Cultural Dispersion ā€” Stefano Lambiase ā€” ICSE-SEIS 2024Dealing with Cultural Dispersion ā€” Stefano Lambiase ā€” ICSE-SEIS 2024
Dealing with Cultural Dispersion ā€” Stefano Lambiase ā€” ICSE-SEIS 2024
Ā 
Maximizing Efficiency and Profitability with OnePlanā€™s Professional Service A...
Maximizing Efficiency and Profitability with OnePlanā€™s Professional Service A...Maximizing Efficiency and Profitability with OnePlanā€™s Professional Service A...
Maximizing Efficiency and Profitability with OnePlanā€™s Professional Service A...
Ā 
(Genuine) Escort Service Lucknow | Starting ā‚¹,5K To @25k with A/C šŸ§‘šŸ½ā€ā¤ļøā€šŸ§‘šŸ» 89...
(Genuine) Escort Service Lucknow | Starting ā‚¹,5K To @25k with A/C šŸ§‘šŸ½ā€ā¤ļøā€šŸ§‘šŸ» 89...(Genuine) Escort Service Lucknow | Starting ā‚¹,5K To @25k with A/C šŸ§‘šŸ½ā€ā¤ļøā€šŸ§‘šŸ» 89...
(Genuine) Escort Service Lucknow | Starting ā‚¹,5K To @25k with A/C šŸ§‘šŸ½ā€ā¤ļøā€šŸ§‘šŸ» 89...
Ā 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdf
Ā 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.ppt
Ā 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
Ā 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
Ā 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
Ā 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Ā 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....
Ā 
GOING AOT WITH GRAALVM ā€“ DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM ā€“ DEVOXX GREECE.pdfGOING AOT WITH GRAALVM ā€“ DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM ā€“ DEVOXX GREECE.pdf
Ā 
č‹±å›½UN学位čƁ,åŒ—å®‰ę™®é”æ大学ęƕäøščƁ书1:1制作
č‹±å›½UN学位čƁ,åŒ—å®‰ę™®é”æ大学ęƕäøščƁ书1:1åˆ¶ä½œč‹±å›½UN学位čƁ,åŒ—å®‰ę™®é”æ大学ęƕäøščƁ书1:1制作
č‹±å›½UN学位čƁ,åŒ—å®‰ę™®é”æ大学ęƕäøščƁ书1:1制作
Ā 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Ā 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
Ā 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
Ā 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
Ā 

OSDC 2013 | Introduction into Hadoop by Olivier Renault

  • 1. Hadoop Introduction. Olivier Renault, Solution Engineer, Hortonworks. © Hortonworks Inc. 2011
  • 2. Hortonworks
  • 3. A Brief History of Apache Hadoop (timeline, 2004 to 2013)
      • Focus on INNOVATION. 2005: Yahoo! creates team under E14 to work on Hadoop
      • Focus on OPERATIONS. 2008: Yahoo team extends focus to operations to support multiple projects & growing clusters
      • STABILITY. 2011: Hortonworks created to focus on "Enterprise Hadoop". Starts with 24 key Hadoop engineers from Yahoo
      • Milestones marked on the timeline: Apache project established; Yahoo! begins to operate at scale; Enterprise Hadoop; Hortonworks Data Platform
  • 4. Hortonworks Snapshot
      • Develop, Distribute, Support: we develop, distribute and support the ONLY 100% open source Enterprise Hadoop distribution
      • We distribute the only 100% Open Source Enterprise Hadoop Distribution: Hortonworks Data Platform
      • We engineer, test & certify HDP for enterprise usage
      • We employ the core architects, builders and operators of Apache Hadoop
      • We drive innovation within Apache Software Foundation projects
      • We are uniquely positioned to deliver the highest quality of Hadoop support
      • We enable the ecosystem to work better with Hadoop
      • Endorsed by strategic partners
      • Headquarters: Palo Alto, CA
      • Employees: 180+ and growing
      • Investors: Benchmark, Index, Yahoo
  • 5. Leadership that Starts at the Core
      • Driving next generation Hadoop
          – YARN, MapReduce2, HDFS2, High Availability, Disaster Recovery
      • 420k+ lines authored since 2006
          – More than twice nearest contributor
      • Deeply integrating w/ecosystem
          – Enabling new deployment platforms (ex. Windows & Azure, Linux & VMware HA)
          – Creating deeply engineered solutions (ex. Teradata big data appliance)
      • All Apache, NO holdbacks
          – 100% of code contributed to Apache
  • 6. HDP: Enterprise Hadoop Distribution
      • Hortonworks Data Platform (HDP), Enterprise Hadoop
          – The ONLY 100% open source and complete distribution
          – Enterprise grade, proven and tested at scale
          – Ecosystem endorsed to ensure interoperability
      • Components shown in the diagram
          – Hadoop Core (distributed storage & processing): HDFS, YARN (in 2.0), WebHDFS, MapReduce
          – Platform Services (enterprise readiness): HA, DR, Snapshots, Security, …
          – Data Services (store, process and access data): HCatalog, Hive, Pig, HBase, Sqoop, Flume
          – Operational Services (manage & operate at scale): Oozie, Ambari
      • Deployment targets: OS, Cloud, VM, Appliance
  • 7. Overview of Hadoop
  • 8. In the Beginning
      • It all started when Google needed a way to:
          – Do page ranking
          – Determine which web sites to provide for searches
  • 9. Page Rank Solution - Simplified
      • Google engineers developed an internal solution and provided a paper on it titled: "MapReduce: Simplified Data Processing on Large Clusters"
      • It described a process something like this:
          1. Many tasks look at links in parts of the data (Map)
          2. Mapped results are shuffled to Reducers
          3. Reducers compute the links into a result (Reduce)
      • Diagram: two Map tasks (one emitting links to sites A, C, F, the other links to sites B, D, E) feed two Reduce tasks
  • 10. Words to Websites - Simplified
      • From words provide locations
      • Provides what to display for a search
          – Note: Page rank determines the order
      • For example, to find URLs with books on them
      • Map input, <url, keyword>:
          – www.barnesandnoble.com: books, calendars
          – www.yahoo.com: sports, finance, email, celebrity
          – www.amazon.com: shoes, books, jeans
          – www.google.com: finance, email, search
          – www.microsoft.com: operating-system, productivity, system
      • Reduce output, <keyword, url>:
          – books: www.barnesandnoble.com, www.amazon.com
          – email: www.google.com, www.yahoo.com, www.facebook.com
          – finance: www.yahoo.com, www.google.com
          – groceries: www.walmart.com, www.target.com
          – jeans: www.target.com, www.amazon.com
  • 11. Data Model
      • MapReduce works on <key, value> pairs
      • Map turns (Key input, Value input) into (Key intermediate, Value intermediate); Reduce turns those into (Key output, Value output)
          – Input: (www.barnesandnoble.com, books calendars)
          – Intermediate: (books, www.barnesandnoble.com)
          – Output: (books, www.barnesandnoble.com www.amazon.com)
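The url-to-keyword inversion on slides 10 and 11 can be sketched in a few lines of plain Python. This is only an in-process illustration of the map, shuffle and reduce steps, not Hadoop code: the function names and the `PAGES` sample (a subset of the slide's example) are assumptions made for this sketch.

```python
from collections import defaultdict

# Hypothetical input records taken from the slide's example: url -> keywords.
PAGES = {
    "www.barnesandnoble.com": ["books", "calendars"],
    "www.amazon.com": ["shoes", "books", "jeans"],
    "www.google.com": ["finance", "email", "search"],
}


def map_page(url, keywords):
    """Map: one (url, keywords) record in, one (keyword, url) pair out per keyword."""
    for keyword in keywords:
        yield keyword, url


def shuffle(pairs):
    """Shuffle: group intermediate values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_keyword(keyword, urls):
    """Reduce: collapse all URLs seen for one keyword into a sorted list."""
    return keyword, sorted(set(urls))


def build_index(pages):
    """Run the three steps in sequence and return keyword -> urls."""
    intermediate = [pair for url, kws in pages.items() for pair in map_page(url, kws)]
    return dict(reduce_keyword(k, v) for k, v in shuffle(intermediate).items())


index = build_index(PAGES)
print(index["books"])
```

In a real cluster the map calls run on different nodes and the shuffle moves data over the network; here everything happens in one process so the data flow is easy to follow.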
  • 12. Hadoop Basic Core Architecture
      • Hadoop Distributed File System (HDFS)
      • MapReduce: Mapper, Shuffle, Reducer
  • 13. HDFS & MapReduce: Enterprise Apache Hadoop
  • 14. Agenda: Cluster Topology, HDFS, MapReduce
  • 15. Cluster Topology: Master Services, Slave Services
  • 16. Agenda: Cluster Topology, HDFS, MapReduce
  • 17. HDP: Enterprise Hadoop Distribution (repeat of the diagram on slide 6)
  • 18. HDFS
      • Distributed file system designed to run on commodity hardware
      • Key assumptions
          – Hardware failure is the norm
          – Need streaming access to data sets
          – Optimized for high throughput
          – Data sets are large
          – Append-only file system: write once, read many times
          – Moving computation is cheaper than moving data
  • 19. HDFS: Key Services
      • NameNode
          – Master service
          – Manages the file system namespace and regulates access to files by clients
          – Single service across the cluster
      • DataNode
          – Slave service; runs on slave nodes
          – Manages block read/write for HDFS
          – Pings NameNode for instructions
          – If the heartbeat fails, the DataNode is removed from the cluster and replicated blocks take over
      • Secondary NameNode
          – Merges the NameNode's file system image and edit logs
          – Not a failover NameNode!
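The Secondary NameNode's merge of the file system image with the edit log can be pictured as replaying a list of namespace operations over a snapshot. The sketch below is a toy model under that assumption: the operation names (`mkdir`, `create`, `delete`), the set-of-paths image and the `checkpoint` function are hypothetical and do not reflect HDFS's actual on-disk formats.

```python
def checkpoint(fsimage, edit_log):
    """Replay edit-log entries over a copy of the image and return the merged image."""
    merged = set(fsimage)
    for op, path in edit_log:
        if op in ("mkdir", "create"):
            merged.add(path)
        elif op == "delete":
            merged.discard(path)
        else:
            raise ValueError(f"unknown operation: {op}")
    return merged


# Hypothetical snapshot of the namespace and the changes logged since it was taken.
fsimage = {"/", "/data"}
edit_log = [
    ("create", "/data/a.txt"),
    ("mkdir", "/tmp"),
    ("delete", "/data/a.txt"),
    ("create", "/data/b.txt"),
]

new_image = checkpoint(fsimage, edit_log)
print(sorted(new_image))  # ['/', '/data', '/data/b.txt', '/tmp']
```

Once a merged image exists, the log entries it covers are no longer needed to rebuild the namespace, which is why the merge keeps the edit log from growing without bound. It does not make the merging node a standby for the NameNode.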
  • 20. HDFS: File create lifecycle
      • Diagram: an HDFS client sends a create request for a file to the NameNode; the file's blocks B1 and B2 are replicated across DataNodes in RACK1, RACK2 and RACK3; acks are returned; the numbered steps 1 to 4 end with Complete
  • 21. Agenda: Cluster Topology, HDFS, MapReduce
  • 22. HDP: Enterprise Hadoop Distribution (repeat of the diagram on slide 6)
  • 23. MapReduce
      • A software framework for developing distributed applications that process vast amounts of data in parallel on large clusters
      • A MapReduce job splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner across a cluster of nodes
  • 24. MapReduce: Key Services
      • Job Tracker
          – Master service
          – Schedules a job's component tasks on the Task Trackers
          – Monitors task progress
          – Reschedules failed tasks
      • Task Tracker
          – Spawns a job's tasks
          – Reports progress to the Job Tracker
          – Runs on slave nodes, collocated with the DataNode service
      • Task (Map / Reduce)
          – Spawned by the Task Tracker
          – Executes a Map/Reduce task, encapsulating the business logic
  • 25. MapReduce: Job Lifecycle
      • Map phase: a Mapper runs on each of DataNode 1, DataNode 2 and DataNode 3
      • Shuffle/Sort: data is shuffled across the network and sorted
      • Reduce phase: Reducers run on DataNode 1 and DataNode 3
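The shuffle/sort step between the map and reduce phases can be sketched as routing each key to exactly one reducer and sorting what arrives there. This is a minimal in-process sketch, not Hadoop's implementation: CRC32 is used here only as a stand-in for a stable key hash, and the function names and sample data are assumptions.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick a reducer for a key; CRC32 stands in for the key's hash code."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(mapper_outputs, num_reducers):
    """Route every (key, value) pair to one reducer, then sort and group by key."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in mapper_outputs:
        for key, value in pairs:
            buckets[partition(key, num_reducers)][key].append(value)
    # Each reducer receives its keys in sorted order, with all values for a key together.
    return [sorted(bucket.items()) for bucket in buckets]


# Hypothetical output of three mappers, one list per mapper.
mapper_outputs = [
    [("books", 1), ("jeans", 1)],
    [("books", 1), ("email", 1)],
    [("email", 1), ("books", 1)],
]

reducer_inputs = shuffle_and_sort(mapper_outputs, num_reducers=2)
for reducer_id, grouped in enumerate(reducer_inputs):
    print(reducer_id, grouped)
```

Because the reducer is chosen from the key alone, every value for a given key ends up at the same reducer no matter which mapper produced it.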
  • 26. The Key/Value Pairs of MapReduce
      • Flow: <K1, V1> → Mapper → <K2, V2> → Shuffle/Sort → <K2, (V2, V2, V2, V2)> → Reducer → <K3, V3>
      • Map & Reduce operate on (key, value) pairs and output (key, value) pairs
      • User provides map and reduce functions
      • Input Key and Value is determined by InputFormat
      • Common InputFormats: TextInputFormat, KeyValueTextInputFormat, SequenceFileInputFormat
      • Common OutputFormats: TextOutputFormat, SequenceFileOutputFormat
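The key/value flow on slide 26 can be traced with a word count written in plain Python. The sketch assumes the input arrives as (byte offset, line) pairs, which is the record shape TextInputFormat gives a mapper; `run_job`, `text_records`, `word_mapper` and `sum_reducer` are names invented for this illustration.

```python
from itertools import groupby
from operator import itemgetter


def run_job(records, mapper, reducer):
    """Run map, shuffle/sort and reduce in-process over (key, value) records."""
    # <K1, V1> -> Mapper -> <K2, V2>
    intermediate = [kv for k1, v1 in records for kv in mapper(k1, v1)]
    # Shuffle/sort: bring equal keys next to each other.
    intermediate.sort(key=itemgetter(0))
    output = []
    # <K2, (V2, V2, ...)> -> Reducer -> <K3, V3>
    for k2, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reducer(k2, [v2 for _, v2 in group]))
    return output


def text_records(text):
    """Yield (byte offset, line) pairs for each line of the text."""
    offset = 0
    for line in text.splitlines(keepends=True):
        yield offset, line.rstrip("\n")
        offset += len(line.encode("utf-8"))


def word_mapper(offset, line):
    """K1 = offset (ignored), V1 = line; emit K2 = word, V2 = 1."""
    for word in line.split():
        yield word.lower(), 1


def sum_reducer(word, counts):
    """K2 = word with all its counts; emit K3 = word, V3 = total."""
    yield word, sum(counts)


counts = run_job(text_records("pig hive pig\nhive pig\n"), word_mapper, sum_reducer)
print(counts)  # [('hive', 2), ('pig', 3)]
```

The user supplies only `word_mapper` and `sum_reducer`; everything in `run_job` corresponds to work the framework does on the user's behalf.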
  • 27. Pig, Hive: Enterprise Apache Hadoop
  • 28. Agenda: Pig, Hive
  • 29. HDP: Enterprise Hadoop Distribution (repeat of the diagram on slide 6)
  • 30. Pig
      • An engine for executing programs on top of Hadoop
      • It provides a language, Pig Latin, to specify these programs
  • 31. Why use Pig?
      • Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited sites by users aged 18 - 25
  • 32. In Map-Reduce: 170 lines of code, 4 hours to write
  • 33. In Pig Latin: 9 lines of code, 15 minutes to write
      Users = load 'input/users' using PigStorage(',') as (name:chararray, age:int);
      Fltrd = filter Users by age >= 18 and age <= 25;
      Pages = load 'input/pages' using PigStorage(',') as (user:chararray, url:chararray);
      Jnd = join Fltrd by name, Pages by user;
      Grpd = group Jnd by url;
      Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
      Srtd = order Smmd by clicks desc;
      Top5 = limit Srtd 5;
      store Top5 into 'output/top5sites' using PigStorage(',');
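To see what each Pig Latin statement on slide 33 contributes, the same pipeline can be written over in-memory lists in Python. This is a sketch for small data only, assuming user names are unique; the sample rows and the `top_sites` function are hypothetical stand-ins for `input/users` and `input/pages`.

```python
from collections import Counter

# Hypothetical sample rows standing in for input/users and input/pages.
users = [("alice", 22), ("bob", 31), ("carol", 18), ("dave", 25)]
pages = [
    ("alice", "a.com"), ("alice", "b.com"), ("bob", "a.com"),
    ("carol", "a.com"), ("dave", "b.com"), ("dave", "c.com"),
    ("carol", "b.com"),
]


def top_sites(users, pages, low=18, high=25, n=5):
    """Filter users by age, join on name, group by url, count, order, limit."""
    eligible = {name for name, age in users if low <= age <= high}  # Fltrd
    joined = [url for user, url in pages if user in eligible]       # Jnd
    clicks = Counter(joined)                                        # Grpd and Smmd
    return clicks.most_common(n)                                    # Srtd and Top5


print(top_sites(users, pages))  # [('b.com', 3), ('a.com', 2), ('c.com', 1)]
```

The Python version holds everything in memory on one machine, whereas Pig compiles the equivalent statements into MapReduce jobs that run across the cluster.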
  • 34. Essence of Pig
      • Map-Reduce is too low a level, SQL too high
      • Pig Latin, a language intended to sit between the two
          – Provides standard relational transforms (join, sort, etc.)
          – Schemas are optional, used when available, can be defined at runtime
          – User Defined Functions are first class citizens
  • 35. Pig Elements
      • Pig Latin
          – High-level scripting language
          – Requires no metadata or schema
          – Statements translated into a series of MapReduce jobs
      • Grunt
          – Interactive shell
      • Piggybank
          – Shared repository for User Defined Functions (UDFs)
  • 36. Agenda: Pig, Hive
  • 37. HDP: Enterprise Hadoop Distribution (repeat of the diagram on slide 6)
  • 38. Motivation
      • Hadoop as Enterprise Data Warehouse
      • Ad hoc query support
      • Schema information
      • Tool for end-users
      • Used extensively for analytics & business intelligence
  • 39. HiveQL Features
      • HiveQL is similar to other SQLs
          – Uses familiar relational database concepts (tables, rows, columns and schema)
          – Based on the SQL-92 specification
      • Treats Big Data as tables
      • Converts SQL queries into MapReduce jobs
          – User does not need to know MapReduce
      • Also supports plugging custom MapReduce scripts into queries
  • 40. Performing Queries
      • SELECT
      • WHERE clause
      • UNION ALL and DISTINCT
      • GROUP BY and HAVING
      • JOIN
      • ORDER BY
      • LIMIT clause
          – Rows returned are chosen at random
      • Can use REGEX column specification
          SELECT `(ds|hr)?+.+` FROM sales;
  • 41. Hive vs Pig
      • Pig and Hive work well together and many businesses use both
      • Hive is a good choice:
          – when you want to query the data
          – when you need an answer to specific questions
          – if you are familiar with SQL
      • Pig is a good choice:
          – for ETL (Extract -> Transform -> Load)
          – for preparing data for easier analysis
          – when you have a long series of steps to perform
  • 42. Ambari: Streamlining Hadoop Operations
  • 43. Ambari: Install, Manage, Monitor, Tune
      • Simplify deployment and maintenance: wizard-based install, handles dependency checks, recommends service mappings
      • Ensure a healthy cluster: monitor, alert, heat maps
      • Optimize performance: root cause analysis for cluster tuning; fix problems BEFORE SLAs are breached
      • Integrate with your operations: open APIs, standard web-tech
  • 44. Community-Driven, Enterprise Class
      • Productizes over a combined 100 person-years of operational Hadoop experience
      • Stability and scale: Ops & Dev team that took Yahoo! from 1,000 to 45,000+ nodes
      • Fast-paced, open source community-driven innovation and integration
      • Red Hat, Teradata, HP, Microsoft contributions (and more)
  • 45. Demonstration