Hadoop & Big Data: Revealed 
Presenter: Sachin Holla 
Date: 08/29/2014
Big Data: An Overview 
Big Data 
- High volume 
- High velocity 
- High variety information assets 
- High Veracity 
- Require new forms of processing 
- Like NoSQL, MapReduce, Machine Learning 
Examples 
- Large Hadron Collider 
  - 150 million sensors delivering data 40 million times per second 
  - If all sensor data were recorded, the data flow would exceed 150 million petabytes per year, or roughly 500 exabytes per day 
- Tipp24 (European lotteries) 
  - Analyzes billions of transactions and hundreds of customer attributes 
  - Led to a 90% decrease in the time it took to build predictive models
DATA: ON A BIG SCALE
Hadoop: Elephant in the Room 
Apache Hadoop 
- open-source Java-based software framework 
- distributed processing of large data sets 
- On clusters of computers based on commodity hardware. 
Hadoop’s Benefits (Historical context) 
- Don’t rely on Hardware to provide HA (“Big Iron”) 
- Failures are expected and assumed 
- Framework handles failures to provide a HA computing service 
- “Scale Up v/s Scale Out” 
Key Components 
- Hadoop Distributed File System (HDFS™) – the file system 
- Hadoop MapReduce – the programming model 
- Hadoop (v2) YARN: the resource manager 
Year | Activity
2002 | Nutch started
2003 | GFS white paper published
2004 | Google MapReduce white paper
2005 | First MR implementation
2006 | Hadoop project in Apache
2008 | Hadoop in Y! production
2009 | Wins 500GB sort contest
What’s the Hadoop Arch., Kenneth ? 
(1/2)
What’s the Hadoop Arch., Kenneth ? 
(2/2)
Hadoop: FAQs 
- What is a MapReduce job and why do I care? 
  - The data-processing paradigm in Hadoop 
  - Batch-mode or in real-time 
  - In Java or in a variety of other languages (see below) 
  - There are higher-level frameworks that help too, like Pig, Hive, etc. 
- I don’t drink java anymore - what do I do? 
  - Hadoop is Java-based but ... 
  - Hadoop Streaming supports Python, Ruby, R, etc. 
  - I/O-bound: no difference. CPU-bound: Java is better 
- What is Hadoop 2 and how will it affect my big data needs? (See slide #14) 
  - Much more scalable 
  - Programming models v/s cluster & resource management 
- Under what scenarios should I not use Hadoop? 
  - Need answers in a hurry 
  - Queries are complex and need optimization 
  - Require random, interactive access to data 
  - Store sensitive data 
  - Replacing a data warehouse 
- What are the differences between Hadoop & a traditional database? 
  - Hadoop is not a DB 
  - ACID properties 
  - Unstructured / mixture of data sources 
  - SQL access
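The MapReduce paradigm named in the first FAQ can be sketched in a few lines of plain Python. This is only an in-memory illustration of the map, shuffle and reduce phases, not Hadoop's implementation; the function names (`map_reduce`, `emit_runs`, `best_score`) and the sample rows are hypothetical.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Run map -> shuffle -> reduce in a single process (illustration only)."""
    groups = defaultdict(list)
    for record in records:
        # Map phase: each record yields zero or more (key, value) pairs.
        for key, value in map_fn(record):
            # Shuffle phase: values that share a key are collected together.
            groups[key].append(value)
    # Reduce phase: each key's list of values collapses into one result.
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def emit_runs(row):
    """Hypothetical mapper: turn a (player, runs) row into a key/value pair."""
    player, runs = row
    yield player, runs


def best_score(player, scores):
    """Hypothetical reducer: keep the highest score seen for a player."""
    return max(scores)


# Made-up sample data, standing in for a large distributed input.
rows = [("ann", 41), ("bob", 17), ("ann", 73), ("bob", 52)]
top = map_reduce(rows, emit_runs, best_score)
print(top)  # {'ann': 73, 'bob': 52}
```

On a real cluster the map and reduce calls would run on different machines and the shuffle would move data over the network; the programmer still only writes the two small functions.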
Hadoop Stack: Snapshot 
Technology | Domain | Description
HDFS | File Storage | Java-based file storage - reliable and scalable access
MapReduce | Programming Framework | Original framework for distributed processing of data
Hadoop YARN | Resource Mgmt | Next-generation framework - MR and non-MR models
Pig | ETL / Data Flow | Allows high-level analysis of large data. Generates MR jobs
Hive | SQL Interface | DW - allows data summarization and ad-hoc queries
HBase | Columnar NoSQL storage | Column-oriented NoSQL data storage system
Sqoop | Data Exchange | Easy data import/export from Hadoop clusters
ZooKeeper | Process Coordination | Highly available system for process coordination
Oozie | Workflow Scheduler | Helps manage complex DAG job workflows
Ambari | Cluster Monitoring | Installation, admin & monitoring for Hadoop clusters
Avro | Serializer | Serializes data in an efficient binary format. Schemas are defined in JSON.
Spark | Real-time data processing | Powerful processing engine - speed, ease of use, and sophisticated analytics (using ML).
Data Science: The Scoop 
What is Data Science or a Data Scientist? 
- To understand data, to process it, to extract value from it, to visualize it, to communicate it 
- Single source v/s disparate sources 
- Mine data for insight to extract business/competitive value 
What is Machine Learning then? 
- The science of getting computers to act without being explicitly programmed. 
- Machine learning and statistics may be the stars, but DS orchestrates the whole show. 
Practical Uses 
- Product Recommendation 
- Medical Diagnosis 
- Stock Trading 
- Face Detection
Demo: Let’s get dirty! 
- Hadoop running on a Single-Node Pseudo Cluster (Linux VM) 
  - Start Hadoop 
- HelloWorld Hadoop style 
  - Run a MapReduce job (wordcount) 
- No Java here 
  - Use Python scripts to run a MapReduce job 
- Lipstick on a Pig 
  - Perform ETL on some stocks/dividend data 
- Give me Hive 
  - Calculate Top Batter Scores 
- Can you feel the HBase 
  - Dump Sales Data into HBase and then access via Hive 
- Use AWS to show a ‘real’ cluster 
  - Connect to AWS and start up the cluster 
  - Demo performance using the wordcount example 
* All Demos, installation guide and references available @ GitHub
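For the "No Java here" step, a wordcount written for Hadoop Streaming might look roughly like the sketch below. It is not the demo's actual script (that lives in the GitHub repository mentioned above). It assumes Streaming's default convention of tab-separated `key<TAB>value` lines and a sort by key between the two stages; the function names and the `run_local` helper are hypothetical stand-ins for `cat input | mapper | sort | reducer`.

```python
import itertools


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper writes to stdout."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts for each run of identical keys; input must be sorted by key."""
    parsed = (line.split("\t", 1) for line in sorted_lines)
    for word, group in itertools.groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_local(lines):
    """Local stand-in for the cluster: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))


# Made-up input lines for a quick local check.
result = run_local(["the quick brown fox", "The lazy dog and the fox"])
for line in result:
    print(line)
```

In an actual Streaming job the mapper and reducer would be two separate scripts that read `sys.stdin` and print to standard output, with Hadoop doing the sort and the data movement in between.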
… And, that’s a wrap !
Backup
Typical Hadoop Cluster
Hadoop Stack: Visualized
Hadoop: v1 -> v2

Editor's Notes

  1. Introduce Hadoop, Map-Reduce and HDFS concepts. Hadoop Apache Hadoop is an open-source software framework allowing for distributed processing of large data sets across clusters of computers on commodity hardware. USP Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures. Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data. Hadoop YARN: A framework for job scheduling and cluster resource management. Hadoop MapReduce: A YARN-based system for parallel processing of large data sets. - Mike Cafarella and Doug Cutting estimated a system supporting a one-billion-page index would cost around half a million dollars in hardware, with a monthly running cost of $30,000. - Nutch was started in 2002, and a working crawler and search system quickly emerged. However, they realized that their architecture wouldn’t scale to the billions of pages on the Web. - Help was at hand with the publication of a paper in 2003 that described the architecture of Google’s distributed filesystem, called GFS, which was being used in production at Google. GFS, or something like it, would solve their storage needs for the very large files generated as a part of the web crawl and indexing process. In particular, GFS would free up time being spent on administrative tasks such as managing storage nodes. - In 2004, they set about writing an open source implementation, the Nutch Distributed Filesystem (NDFS). In 2004, Google published the paper that introduced MapReduce to the world. Early in 2005, the Nutch developers had a working MapReduce implementation in Nutch - in February 2006 they moved out of Nutch to form an independent subproject of Lucene called Hadoop. 
At around the same time, Doug Cutting joined Yahoo!, which provided a dedicated team and the resources to turn Hadoop into a system that ran at web scale. This was demonstrated in February 2008, when Yahoo! announced that its production search index was being generated by a 10,000-core Hadoop cluster. In May 2009, it was announced that a team at Yahoo! had used Hadoop to sort one terabyte in 62 seconds.
  2. What is a Map-Reduce job and why do I care? It is the data-processing paradigm in Hadoop. Processing can be batch-mode or real-time. Jobs can be written in Java or in a variety of other languages (see below). Higher-level frameworks such as Pig and Hive help too. I don’t drink Java anymore – what do I do? Hadoop is Java-based, but Hadoop Streaming supports Python, Ruby, R, etc. For I/O-bound jobs there is no difference; for CPU-bound jobs Java is better. What is Hadoop2 and how will it affect my big data needs (see slide #14)? It is much more scalable (3,500 -> ~10,000 nodes) and adds an abstraction between the programming models (MapReduce, Impala, etc.) and cluster and resource management. Under what scenarios should I not use Hadoop? Need answers in a hurry: MR crunching can sometimes take hours or days. Queries are complex and require extensive optimization: optimizing queries needs serious tech skills. Require random, interactive access to data: SQL on Hadoop is getting better but is not yet comparable. Store sensitive data: Hadoop has less-than-stellar security capabilities. Replacing a data warehouse: Hadoop can pre-process raw data and hand it over to the DW to run analytic workloads. What are the differences between Hadoop and a traditional database? Hadoop is not a DB; it is more like a file system (HDFS). Traditional DBs have ACID properties, which Hadoop doesn’t support OOTB. Traditional DBs can support unstructured data, but less efficiently; Hadoop shines with a mixture of data sources. Hadoop SQL access is one or more orders of magnitude slower than traditional SQL.
  3. Hortonworks and Cloudera: they both offer the same basic service to their customers, namely enterprise-ready Hadoop with greater security and stability, as well as training for companies unfamiliar with the technology. Many have drawn the dividing line at how Hortonworks and Cloudera approach data warehouses, suggesting Hortonworks wants to complement existing data warehouse storage and Cloudera wants to do away with it altogether. Yet Cloudera’s suggested deployment for its Enterprise Data Hub does incorporate legacy warehouse storage. A greater distinction can be found in the technologies the companies offer. Hortonworks are open-source purists, using only technology that’s open-sourced through the Apache Foundation; when you pay for Cloudera, you pay for a whole stack of proprietary and open source components, including online NoSQL (HBase), analytic SQL (Impala), in-memory processing and machine learning (Apache Spark) and data management (Cloudera Manager). Hortonworks vs. Cloudera. Money raised: Hortonworks, $225 million; Cloudera, $900 million ($740 million from a recent partnership with Intel). Customers: Hortonworks added 250 customers in the past five quarters, with big names including Spotify, eBay, Bloomberg and Samsung; Cloudera is estimated at around the 350 mark, with big names including Nokia, Mastercard, BT and eBay (curiously appearing on both Hortonworks’ and Cloudera’s customer lists). Partners: Hortonworks lists around 300 on its website, including SAP, HP and Dell; Cloudera has over 1,000, including HP, IBM, Intel… MapR is founded on the idea that the Apache Hadoop core is a beautiful thing that needs to grow up fast to have the most impact on the enterprise. MapR has added some proprietary software to help manage the installation, configuration, and operation of its distribution. But MapR rejects open source purity: Srivas has taken significant parts of Hadoop and re-implemented them in an API-compatible manner.
Hortonworks and Cloudera argue that the API-compatible approach means MapR isn’t open source. MapR argues back: Do you want read-write access to your file system? Do you want to be able to handle lots of small files? Do you want to support NFS in a production-quality manner so that other software you have can use the data in HDFS? Do you want better security that doesn’t require Kerberos? Do you want to be able to run other software, like Vertica, on the machines in the Hadoop cluster?