SlideShare a Scribd company logo
1 of 26
Download to read offline
Introduction To Hadoop Ecosystem
InSemble Inc.
http://www.insemble.com
Agenda
What is Big Data ?1
Use Cases & Java Developer fit4
Hadoop Ecosystem3
Relevance to your Enterprise2
Demo5
Big Data Definitions
• Wikipedia defines it as “ Data Sets with sizes beyond the ability of
commonly used software tools to capture, curate, manage and process data
within a tolerable elapsed time
• Gartner defines it as Data with the following characteristics
– High Velocity
– High Variety
– High Volume
• Another Definition is “ Big Data is a large volume, unstructured data which
cannot be handled by traditional database management systems
Why a game changer
• Schema on Read
– Interpreting data at processing time
– Key, Values are not intrinsic properties of data but chosen by person
analyzing the data
• Move code to data
– With traditional, we bring data to code and I/O becomes a
bottleneck
– With distributed systems, we have to deal with our own
checkpointing/recovery
• More data beats better algorithms
Enterprise Relevance
• Missed Opportunities
– Channels
– Data that is analyzed
• Constraint was high cost
– Storage
– Processing
• Future-proof your business
– Schema on Read
– Access pattern not as relevant
– Not just future-proofing your architecture
Motivation and History
• Disk access speeds have not caught up with storage capacities
• Need a high speed parallel processing platform to process large
datasets on a distributed filesharing framework
• Google published MapReduce architecture in 2004
• Mapreduce framework
– Split the query, distribute it and process in parallel(Map Step)
– Gather the results and deliver it ( Reduce Step)
• Apache Open Source Project called Hadoop implemented the
MapReduce framework
– “Software library that gives users ability to process large datasets across cluster of
commodity hardware in a reliable, fault-tolerant manner using a simple programming model”
Hadoop Ecosystem
Source: Apache Hadoop Documentation
HDFS Architecture
Source: Hadoop Definitive Guide by Tom White
MapReduce framework
Hadoop 2 with YARN
Source: Hadoop In Practice by Alex Holmes
Map Reduce
• Restrictive programming model
– Key, values
– Map, reduce functions with only coordination being just
passing keys and values
• But still considered a general data-processing tool
– Google used for production search indexes
– Image Analysis
– Machine learning algorithms
PIG
• High level scripting language
• Data Flow Language
– Good for describing data analysis problems as data flows
– Can plugin UDFs written in other languages such as Java, Scala,
JRuby
– Other languages can execute PIG scripts
– Predominant use cases are
• Production ETL jobs
• Data exploration by analysts
• Higher Level Abstraction
– Map Reduce
– Tez
Hive
• Framework for data warehouse on top of
Hadoop
– SQL Access on HDFS
– Queries for Analysis
• Batch Oriented
– Impala
– Tez
HBase
• NoSQL database on Hadoop
– Based on Google’s BigTable
– Column oriented database on HDFS
• Regular Interactive/Update use cases
– Real time read/write random access
– Row updates are atomic
SQOOP
• Import/Export data from RDBMS into Hadoop
– HDFS,Hive, Hbase
– CouchBase
– Uses JDBC driver to get the data types of the columns
– Serialization/Deserialization
• Actual load done internally by Mapreduce jobs
Apache Flume
Source: Apache Flume Documentation
Real time streaming with Kafka &
Storm
• Kafka
– Pub/Sub messaging using topics
– Kafka producers publish to topics
• Storm
– Real time computational engine
– Consumes data from spouts and passes data to bolts
– Can run on top of YARN
– Uses Zookeeper, implemented in Clojure
– You define workflows as Directed Acyclic Graphs
– True stream processing engine, so used for low latency ingestion
– Can support At most once, At least once and Exactly Once semantics
Apache Spark
• High speed general purpose engine for large-scale data
processing
• Does not need Hadoop, just needs a shared file system such as
S3, NFS or HDFS
• Spark can run on YARN
• Spark is implemented in Scala
• Has Streaming API but a true batch processing engine that micro-
batches
• Can only support Exactly once, but under some failure
conditions degrades to At-least once
Common Use Cases
• Queries from Detail Record Data
• Queries from longer duration data
• Diagnostic/Metrics/Web Logs Data Analysis
• 360 degree view incorporating clickstream data
• Unable to generate report within the needed timeframe
• Capture and analyze sensor data
• Analyze large volume of image data
• Build User profiles from large volumes of data
• Sentiment Analysis
• Recommendation Engines
• Risk Analysis
Securing Hadoop Data
Source: http://www.voltage.com
Closing
• Technology in hyper growth phase
• Complex
• Tools/Productivity/Monitoring products
evolving
• Pilot Project
• Incremental Journey
Demo - Start HDP cluster in AWS
• Total 6 EC2 machine, type t2.medium
• RHEL 6.5, 3.75G Memory, 10G hard drive
• 1 Ambari server + 5-node cluster
• 1 Namenode + 1 Secondary node + 3 Data
Node
• Public data set from 

https://data.cityofchicago.org
Managing Hadoop Cluster using
Ambari
• Ambari in Indian language stands for a seat
sitting on top of an elephant
• Ambari is an Apache open source project that
is used to
• Provision Hadoop cluster
• Manage Hadoop cluster
• Monitor Hadoop cluster
• Agent-based deployment model
Ambari Architecture
Taken from http://docs.hortonworks.com/
Demo — Hue
• Apache Hue provides web interface for
analyzing data in Hadoop
• Use HCatalog to create table
• Demo Hive Script
• Demo Pig Script
Demo — Advanced Hive
• Use built-in UDF to extract latitude and longitude info
• Use custom UDF (scala) to calculate distance
between two locations
• Join tables between library and school data and find
libraries within 1 mile for each school
• Use Tableau to connect to Hive through ODBC driver
to plot social economy data

More Related Content

What's hot

Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Uwe Printz
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystemsunera pathan
 
Top Hadoop Big Data Interview Questions and Answers for Fresher
Top Hadoop Big Data Interview Questions and Answers for FresherTop Hadoop Big Data Interview Questions and Answers for Fresher
Top Hadoop Big Data Interview Questions and Answers for FresherJanBask Training
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyRohit Kulkarni
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecasesudhakara st
 
The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemCloudera, Inc.
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceeakasit_dpu
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Thanh Nguyen
 
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | EdurekaWhat are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | EdurekaEdureka!
 
HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY pptsravya raju
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3tcloudcomputing-tw
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and HadoopFlavio Vit
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce introGeoff Hendrey
 
Apache hadoop introduction and architecture
Apache hadoop  introduction and architectureApache hadoop  introduction and architecture
Apache hadoop introduction and architectureHarikrishnan K
 

What's hot (20)

Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop Ecosystem
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystem
 
Top Hadoop Big Data Interview Questions and Answers for Fresher
Top Hadoop Big Data Interview Questions and Answers for FresherTop Hadoop Big Data Interview Questions and Answers for Fresher
Top Hadoop Big Data Interview Questions and Answers for Fresher
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
 
Apache Hadoop at 10
Apache Hadoop at 10Apache Hadoop at 10
Apache Hadoop at 10
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecase
 
The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop Ecosystem
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1
 
PPT on Hadoop
PPT on HadoopPPT on Hadoop
PPT on Hadoop
 
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | EdurekaWhat are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
 
HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY ppt
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
 
SQOOP - RDBMS to Hadoop
SQOOP - RDBMS to HadoopSQOOP - RDBMS to Hadoop
SQOOP - RDBMS to Hadoop
 
Hadoop
Hadoop Hadoop
Hadoop
 
Hadoop overview
Hadoop overviewHadoop overview
Hadoop overview
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce intro
 
Apache hadoop introduction and architecture
Apache hadoop  introduction and architectureApache hadoop  introduction and architecture
Apache hadoop introduction and architecture
 

Viewers also liked

Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview Senthil Kumar
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop EcosystemJ Singh
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Uwe Printz
 
Introduction to Big Data Technologies: Hadoop/EMR/Map Reduce & Redshift
Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & RedshiftIntroduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift
Introduction to Big Data Technologies: Hadoop/EMR/Map Reduce & RedshiftDataKitchen
 
An introduction to Big-Data processing applying hadoop
An introduction to Big-Data processing applying hadoopAn introduction to Big-Data processing applying hadoop
An introduction to Big-Data processing applying hadoopAmir Sedighi
 
Haddop in Business Intelligence
Haddop in Business IntelligenceHaddop in Business Intelligence
Haddop in Business IntelligenceHGanesh
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introductionDong Ngoc
 
Big data on Azure for Architects
Big data on Azure for ArchitectsBig data on Azure for Architects
Big data on Azure for ArchitectsTomasz Kopacz
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Uwe Printz
 
MongoDB and Fractal Tree Indexes
MongoDB and Fractal Tree IndexesMongoDB and Fractal Tree Indexes
MongoDB and Fractal Tree IndexesMongoDB
 

Viewers also liked (20)

Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop Ecosystem
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop Ecosystem
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
What is Big Data?
What is Big Data?What is Big Data?
What is Big Data?
 
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
 
Big data ppt
Big  data pptBig  data ppt
Big data ppt
 
Introduction to Big Data Technologies: Hadoop/EMR/Map Reduce & Redshift
Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & RedshiftIntroduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift
Introduction to Big Data Technologies: Hadoop/EMR/Map Reduce & Redshift
 
An introduction to Big-Data processing applying hadoop
An introduction to Big-Data processing applying hadoopAn introduction to Big-Data processing applying hadoop
An introduction to Big-Data processing applying hadoop
 
Desayuno de arquitectos: Big data en azure
Desayuno de arquitectos: Big data en azureDesayuno de arquitectos: Big data en azure
Desayuno de arquitectos: Big data en azure
 
Haddop in Business Intelligence
Haddop in Business IntelligenceHaddop in Business Intelligence
Haddop in Business Intelligence
 
Hadoop-Introduction
Hadoop-IntroductionHadoop-Introduction
Hadoop-Introduction
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
Hadoop
HadoopHadoop
Hadoop
 
Big data on Azure for Architects
Big data on Azure for ArchitectsBig data on Azure for Architects
Big data on Azure for Architects
 
Using hadoop for big data
Using hadoop for big dataUsing hadoop for big data
Using hadoop for big data
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)
 
MongoDB and Fractal Tree Indexes
MongoDB and Fractal Tree IndexesMongoDB and Fractal Tree Indexes
MongoDB and Fractal Tree Indexes
 

Similar to Introduction To Hadoop Ecosystem

Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics PlatformN Masahiro
 
Introduction to Hadoop and Big Data
Introduction to Hadoop and Big DataIntroduction to Hadoop and Big Data
Introduction to Hadoop and Big DataJoe Alex
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : BeginnersShweta Patnaik
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : BeginnersShweta Patnaik
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : BeginnersShweta Patnaik
 
Big Data tools in practice
Big Data tools in practiceBig Data tools in practice
Big Data tools in practiceDarko Marjanovic
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014cdmaxime
 
Survey of Real-time Processing Systems for Big Data
Survey of Real-time Processing Systems for Big DataSurvey of Real-time Processing Systems for Big Data
Survey of Real-time Processing Systems for Big DataLuiz Henrique Zambom Santana
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introductionSandeep Singh
 
BDA R20 21NM - Summary Big Data Analytics
BDA R20 21NM - Summary Big Data AnalyticsBDA R20 21NM - Summary Big Data Analytics
BDA R20 21NM - Summary Big Data AnalyticsNetajiGandi1
 
An Introduction of Apache Hadoop
An Introduction of Apache HadoopAn Introduction of Apache Hadoop
An Introduction of Apache HadoopKMS Technology
 

Similar to Introduction To Hadoop Ecosystem (20)

Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
 
Cloudera Hadoop Distribution
Cloudera Hadoop DistributionCloudera Hadoop Distribution
Cloudera Hadoop Distribution
 
Introduction to Hadoop and Big Data
Introduction to Hadoop and Big DataIntroduction to Hadoop and Big Data
Introduction to Hadoop and Big Data
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
 
Hadoop Primer
Hadoop PrimerHadoop Primer
Hadoop Primer
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Big Data tools in practice
Big Data tools in practiceBig Data tools in practice
Big Data tools in practice
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
 
Getting started big data
Getting started big dataGetting started big data
Getting started big data
 
Survey of Real-time Processing Systems for Big Data
Survey of Real-time Processing Systems for Big DataSurvey of Real-time Processing Systems for Big Data
Survey of Real-time Processing Systems for Big Data
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introduction
 
BDA R20 21NM - Summary Big Data Analytics
BDA R20 21NM - Summary Big Data AnalyticsBDA R20 21NM - Summary Big Data Analytics
BDA R20 21NM - Summary Big Data Analytics
 
Hadoop
HadoopHadoop
Hadoop
 
An Introduction of Apache Hadoop
An Introduction of Apache HadoopAn Introduction of Apache Hadoop
An Introduction of Apache Hadoop
 
Apache drill
Apache drillApache drill
Apache drill
 

Recently uploaded

9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxFurkanTasci3
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...ThinkInnovation
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAbdelrhman abooda
 

Recently uploaded (20)

9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptx
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
 

Introduction To Hadoop Ecosystem

  • 1. Introduction To Hadoop Ecosystem InSemble Inc. http://www.insemble.com
  • 2. Agenda What is Big Data ?1 Use Cases & Java Developer fit4 Hadoop Ecosystem3 Relevance to your Enterprise2 Demo5
  • 3. Big Data Definitions • Wikipedia defines it as “ Data Sets with sizes beyond the ability of commonly used software tools to capture, curate, manage and process data within a tolerable elapsed time • Gartner defines it as Data with the following characteristics – High Velocity – High Variety – High Volume • Another Definition is “ Big Data is a large volume, unstructured data which cannot be handled by traditional database management systems
  • 4. Why a game changer • Schema on Read – Interpreting data at processing time – Key, Values are not intrinsic properties of data but chosen by person analyzing the data • Move code to data – With traditional, we bring data to code and I/O becomes a bottleneck – With distributed systems, we have to deal with our own checkpointing/recovery • More data beats better algorithms
  • 5. Enterprise Relevance • Missed Opportunities – Channels – Data that is analyzed • Constraint was high cost – Storage – Processing • Future-proof your business – Schema on Read – Access pattern not as relevant – Not just future-proofing your architecture
  • 6. Motivation and History • Disk access speeds have not caught up with storage capacities • Need a high speed parallel processing platform to process large datasets on a distributed filesharing framework • Google published MapReduce architecture in 2004 • Mapreduce framework – Split the query, distribute it and process in parallel(Map Step) – Gather the results and deliver it ( Reduce Step) • Apache Open Source Project called Hadoop implemented the MapReduce framework – “Software library that gives users ability to process large datasets across cluster of commodity hardware in a reliable, fault-tolerant manner using a simple programming model”
  • 7. Hadoop Ecosystem Source: Apache Hadoop Documentation
  • 8. HDFS Architecture Source: Hadoop Definitive Guide by Tom White
  • 10. Hadoop 2 with YARN Source: Hadoop In Practice by Alex Holmes
  • 11. Map Reduce • Restrictive programming model – Key, values – Map, reduce functions with only coordination being just passing keys and values • But still considered a general data-processing tool – Google used for production search indexes – Image Analysis – Machine learning algorithms
  • 12. PIG • High level scripting language • Data Flow Language – Good for describing data analysis problems as data flows – Can plugin UDFs written in other languages such as Java, Scala, JRuby – Other languages can execute PIG scripts – Predominant use cases are • Production ETL jobs • Data exploration by analysts • Higher Level Abstraction – Map Reduce – Tez
  • 13. Hive • Framework for data warehouse on top of Hadoop – SQL Access on HDFS – Queries for Analysis • Batch Oriented – Impala – Tez
  • 14. HBase • NoSQL database on Hadoop – Based on Google’s BigTable – Column oriented database on HDFS • Regular Interactive/Update use cases – Real time read/write random access – Row updates are atomic
  • 15. SQOOP • Import/Export data from RDBMS into Hadoop – HDFS,Hive, Hbase – CouchBase – Uses JDBC driver to get the data types of the columns – Serialization/Deserialization • Actual load done internally by Mapreduce jobs
  • 16. Apache Flume Source: Apache Flume Documentation
  • 17. Real time streaming with Kafka & Storm • Kafka – Pub/Sub messaging using topics – Kafka producers publish to topics • Storm – Real time computational engine – Consumes data from spouts and passes data to bolts – Can run on top of YARN – Uses Zookeeper, implemented in Clojure – You define workflows as Directed Acyclic Graphs – True stream processing engine, so used for low latency ingestion – Can support At most once, At least once and Exactly Once semantics
  • 18. Apache Spark • High speed general purpose engine for large-scale data processing • Does not need Hadoop, just needs a shared file system such as S3, NFS or HDFS • Spark can run on YARN • Spark is implemented in Scala • Has Streaming API but a true batch processing engine that micro- batches • Can only support Exactly once, but under some failure conditions degrades to At-least once
  • 19. Common Use Cases • Queries from Detail Record Data • Queries from longer duration data • Diagnostic/Metrics/Web Logs Data Analysis • 360 degree view incorporating clickstream data • Unable to generate report within the needed timeframe • Capture and analyze sensor data • Analyze large volume of image data • Build User profiles from large volumes of data • Sentiment Analysis • Recommendation Engines • Risk Analysis
  • 20. Securing Hadoop Data Source: http://www.voltage.com
  • 21. Closing • Technology in hyper growth phase • Complex • Tools/Productivity/Monitoring products evolving • Pilot Project • Incremental Journey
  • 22. Demo - Start HDP cluster in AWS • Total 6 EC2 machine, type t2.medium • RHEL 6.5, 3.75G Memory, 10G hard drive • 1 Ambari server + 5-node cluster • 1 Namenode + 1 Secondary node + 3 Data Node • Public data set from 
 https://data.cityofchicago.org
  • 23. Managing Hadoop Cluster using Ambari • Ambari in Indian language stands for a seat sitting on top of an elephant • Ambari is an Apache open source project that is used to • Provision Hadoop cluster • Manage Hadoop cluster • Monitor Hadoop cluster • Agent-based deployment model
  • 24. Ambari Architecture Taken from http://docs.hortonworks.com/
  • 25. Demo — Hue • Apache Hue provides web interface for analyzing data in Hadoop • Use HCatalog to create table • Demo Hive Script • Demo Pig Script
  • 26. Demo — Advanced Hive • Use built-in UDF to extract latitude and longitude info • Use custom UDF (scala) to calculate distance between two locations • Join tables between library and school data and find libraries within 1 mile for each school • Use Tableau to connect to Hive through ODBC driver to plot social economy data