Hadoop & Big Data: Revealed 
Presenter: Sachin Holla 
Date: 08/29/2014
Big Data: An Overview 
Big Data 
- High volume 
- High velocity 
- High variety information assets 
- High Veracity 
- Require new forms of processing 
- Like NoSQL, MapReduce, Machine Learning 
Examples 
- Large Hadron Collider 
  - 150 million sensors -> data 40 million times/sec 
  - data flow > 150 million petabytes (annual), or ~500 exabytes per day 
- Tipp24 (European lotteries) 
  - Analyze billions of transactions and hundreds of customer attributes 
  - Leads to a 90% decrease in the time it took to build predictive models
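The two LHC figures can be cross-checked with a quick unit conversion. This is a minimal sketch, assuming decimal units (1 exabyte = 1,000 petabytes) and a 365-day year; the function name is illustrative only.

```python
# Figure quoted on the slide; decimal SI units are an assumption.
PETABYTES_PER_YEAR = 150_000_000
PB_PER_EB = 1_000
DAYS_PER_YEAR = 365


def exabytes_per_day(petabytes_per_year: float) -> float:
    """Convert an annual volume in petabytes to a daily rate in exabytes."""
    return petabytes_per_year / PB_PER_EB / DAYS_PER_YEAR


if __name__ == "__main__":
    print(f"{exabytes_per_day(PETABYTES_PER_YEAR):.0f} EB/day")
```

Under these assumptions the annual figure works out to roughly 411 EB per day, the same order of magnitude as the rounded ~500 EB per day quoted on the slide.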
DATA: ON A BIG SCALE
Hadoop: Elephant in the Room 
Apache Hadoop 
- open-source Java-based software framework 
- distributed processing of large data sets 
- On clusters of computers based on commodity hardware. 
Hadoop’s Benefits (Historical context) 
- Don’t rely on Hardware to provide HA (“Big Iron”) 
- Failures are expected and assumed 
- Framework handles failures to provide a HA computing service 
- “Scale Up v/s Scale Out” 
Key Components 
- Hadoop Distributed File System (HDFS™) – the file system 
- Hadoop MapReduce – the programming model 
- Hadoop (v2) YARN: the resource manager 
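To make "the programming model" concrete, here is a minimal single-process sketch of the three MapReduce phases (map, shuffle, reduce) applied to word counting. It only illustrates the idea: it does not use Hadoop, and all function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key (the framework does this on a cluster)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data node"]))
```

On a real cluster the map and reduce functions run in parallel on many nodes, and the shuffle moves data between them over the network; the developer writes only the two functions.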
Year  Activity 
2002  Nutch started 
2003  GFS white paper published 
2004  Google MapReduce white paper 
2005  First MR implementation 
2006  Hadoop project in Apache 
2008  Hadoop in Y! production 
2009  Wins 500GB sort contest
What’s the Hadoop Arch., Kenneth? (1/2)
What’s the Hadoop Arch., Kenneth? (2/2)
Hadoop: FAQs 
- What is a Map-Reduce job and why do I care? 
  - Data-processing paradigm in Hadoop 
  - Batch-mode or in real-time 
  - In Java or in a variety of other languages (see below) 
  - Higher-level frameworks such as Pig and Hive help too 
- I don’t drink Java anymore: what do I do? 
  - Hadoop is Java-based but ... 
  - Hadoop Streaming supports Python, Ruby, R, etc. 
  - I/O-bound: no difference. CPU-bound: Java is better 
- What is Hadoop2 and how will it affect my big data needs? (See slide #14) 
  - Much more scalable 
  - Separates programming models from cluster & resource management 
- Under what scenarios should I not use Hadoop? 
  - Need answers in a hurry 
  - Queries are complex and need optimization 
  - Require random, interactive access to data 
  - Store sensitive data 
  - Replacing a data warehouse 
- What are the differences between Hadoop & a traditional database? 
  - Hadoop is not a DB 
  - ACID properties 
  - Unstructured data / mixture of data sources 
  - SQL access
Hadoop Stack: Snapshot 
Technology   Domain                     Description 
HDFS         File Storage               Java-based file storage - reliable and scalable access 
MapReduce    Programming Framework      Original framework for distributed processing of data 
Hadoop YARN  Resource Mgmt              Next generation framework - MR and non-MR models 
Pig          ETL / Data Flow            Allows high-level analysis of large data. Generates MR jobs 
Hive         SQL Interface              DW - allows data summarization and ad-hoc queries 
HBase        Columnar NoSQL storage     Column-oriented NoSQL data storage system 
Sqoop        Data Exchange              Easy data import/export from Hadoop clusters 
ZooKeeper    Process Coordination       Highly available system for process coordination 
Oozie        Workflow Scheduler         Helps manage complex DAG job workflows 
Ambari       Cluster Monitoring         Installation, Admin & Monitoring for Hadoop clusters 
Avro         Serializer                 Serializes data in efficient binary format. Schemas are defined in JSON. 
Spark        Real-time data processing  Powerful processing engine - speed, ease of use, and sophisticated analytics (using ML)
Data Science: The Scoop 
What is Data Science or a Data Scientist? 
- To understand data, to process it, to extract value from it, to visualize it, to communicate it 
- Single source v/s disparate sources 
- Mine data for insight to extract business/competitive value 
What is Machine Learning then? 
- The science of getting computers to act without being explicitly programmed. 
- Machine learning and statistics may be the stars, but DS orchestrates the whole show. 
Practical Uses 
- Product Recommendation 
- Medical Diagnosis 
- Stock Trading 
- Face Detection
Demo: Let's get dirty! 
- Hadoop running on Single-Node Pseudo Cluster (Linux VM) 
  - Start Hadoop 
- HelloWorld Hadoop style 
  - Run a MapReduce job (wordcount) 
- No Java here 
  - Use Python scripts to run a MapReduce job 
- Lipstick on a Pig 
  - Perform ETL on some stocks/dividend data 
- Give me Hive 
  - Calculate Top Batter Scores 
- Can you feel the HBase 
  - Dump Sales Data into HBase and then access via Hive 
- Use AWS to show a ‘real’ cluster 
  - Connect to AWS and start up the cluster 
  - Demo performance using wordcount example 
* All demos, installation guide and references available @ GitHub
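The "No Java here" demo relies on Hadoop Streaming, where the mapper and reducer are ordinary programs that read lines on stdin and write tab-separated key/value lines on stdout, with the reducer receiving its input sorted by key. The sketch below is not the presenter's script (that lives in the GitHub repository mentioned above); it is an illustrative word count that assumes a hypothetical convention of passing `map` or `reduce` as the first argument.

```python
import sys
from itertools import groupby
from typing import Iterable, Iterator


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Sum counts per word; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Hypothetical convention: the role ("map" or "reduce") is the first argument.
if __name__ == "__main__" and len(sys.argv) > 1:
    step = mapper if sys.argv[1] == "map" else reducer
    for out in step(sys.stdin):
        print(out)
```

Because both functions take any iterable of lines, the pipeline can be tried locally without a cluster by feeding the sorted mapper output straight into the reducer, which mimics the sort that Hadoop performs between the two stages.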
… And, that’s a wrap !
Backup
Typical Hadoop Cluster
Hadoop Stack: Visualized
Hadoop: v1 -> v2

Editor's Notes

  1. Introduce Hadoop, Map-Reduce and HDFS concepts.

     Hadoop: Apache Hadoop is an open-source software framework allowing for distributed processing of large data sets across clusters of computers on commodity hardware.

     USP: Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

     - Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
     - Hadoop YARN: A framework for job scheduling and cluster resource management.
     - Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

     - Mike Cafarella and Doug Cutting estimated a system supporting a one-billion-page index would cost around half a million dollars in hardware, with a monthly running cost of $30,000.
     - Nutch was started in 2002, and a working crawler and search system quickly emerged. However, they realized that their architecture wouldn’t scale to the billions of pages on the Web.
     - Help was at hand with the publication of a paper in 2003 that described the architecture of Google’s distributed filesystem, called GFS, which was being used in production at Google. GFS, or something like it, would solve their storage needs for the very large files generated as a part of the web crawl and indexing process. In particular, GFS would free up time being spent on administrative tasks such as managing storage nodes.
     - In 2004, they set about writing an open source implementation, the Nutch Distributed Filesystem (NDFS).
     - In 2004, Google published the paper that introduced MapReduce to the world. Early in 2005, the Nutch developers had a working MapReduce implementation in Nutch.
     - In February 2006, NDFS and the MapReduce implementation moved out of Nutch to form an independent subproject of Lucene called Hadoop. At around the same time, Doug Cutting joined Yahoo!, which provided a dedicated team and the resources to turn Hadoop into a system that ran at web scale.
     - This was demonstrated in February 2008 when Yahoo! announced that its production search index was being generated by a 10,000-core Hadoop cluster.
     - In May 2009, it was announced that a team at Yahoo! used Hadoop to sort one terabyte in 62 seconds.
  2. What is a MapReduce job and why do I care?
     - It is the data-processing paradigm in Hadoop
     - Batch-mode or in real-time
     - In Java or in a variety of other languages (see below)
     - Higher-level frameworks such as Pig and Hive help too
     I don't drink Java anymore. What do I do?
     - Hadoop is Java-based, but Hadoop Streaming supports Python, Ruby, R, etc.
     - For I/O-bound jobs the language makes no difference; for CPU-bound jobs Java performs better
     What is Hadoop 2 and how will it affect my big data needs? (See slide #14)
     - Much more scalable (3,500 -> ~10,000 nodes)
     - An abstraction between the programming models (MapReduce, Impala, etc.) and cluster and resource management
     Under what scenarios should I not use Hadoop?
     - You need answers in a hurry: MapReduce crunching can sometimes take hours or days
     - Queries are complex and require extensive optimization: optimizing them needs serious technical skills
     - You require random, interactive access to data: SQL on Hadoop is getting better but is not yet comparable
     - You store sensitive data: Hadoop has less than stellar security capabilities
     - You want to replace a data warehouse: Hadoop can pre-process raw data and hand it over to the DW to run analytic workloads
     What are the differences between Hadoop and a traditional database?
     - Hadoop is not a DB; it is more like a file system (HDFS)
     - Traditional DBs have ACID properties; Hadoop doesn't support these out of the box
     - Traditional DBs can handle unstructured data, but less efficiently; Hadoop shines with a mixture of data sources
     - Hadoop SQL access is one or more orders of magnitude slower than traditional SQL
  3. Hortonworks and Cloudera: both offer the same basic service to their customers, namely enterprise-ready Hadoop with greater security and stability, as well as training for companies unfamiliar with the technology. Many have drawn the dividing line at how Hortonworks and Cloudera approach data warehouses, suggesting that Hortonworks wants to complement existing data warehouse storage while Cloudera wants to do away with it altogether. Yet Cloudera's suggested deployment for its Enterprise Data Hub does incorporate legacy warehouse storage.
     A greater distinction can be found in the technologies the companies offer. Hortonworks are open-source purists, using only technology that is open-sourced through the Apache Software Foundation. When you pay for Cloudera, you pay for a whole stack of proprietary and open-source components, including online NoSQL (HBase), analytic SQL (Impala), in-memory processing and machine learning (Apache Spark) and data management (Cloudera Manager).
     Hortonworks vs. Cloudera:
     - Money raised: Hortonworks $225 million; Cloudera $900 million ($740 million from a recent partnership with Intel).
     - Customers: Hortonworks added 250 customers in the past five quarters; big names include Spotify, eBay, Bloomberg and Samsung. Cloudera is estimated at around the 350 mark; big names include Nokia, Mastercard, BT and eBay (curiously appearing on both Hortonworks' and Cloudera's customer lists).
     - Partners: Hortonworks lists around 300 on its website, including SAP, HP and Dell. Cloudera has over 1,000, including HP, IBM, Intel…
     MapR is founded on the idea that the Apache Hadoop core is a beautiful thing that needs to grow up fast to have the most impact on the enterprise. MapR has added proprietary software that helps manage the installation, configuration and operation of its distribution. But MapR rejects open-source purity: Srivas has taken significant parts of Hadoop and re-implemented them in an API-compatible manner.
     Hortonworks and Cloudera argue that the API-compatible approach means MapR isn't open source. MapR argues back:
     - Do you want to have read-write access to your file system?
     - Do you want to be able to handle lots of small files?
     - Do you want to support NFS in a production-quality manner so that other software you have can use the data in HDFS?
     - Do you want better security that doesn't require Kerberos?
     - Do you want to be able to run other software, like Vertica, on the machines in the Hadoop cluster?