Hadoop Infrastructure (Oct. 3rd, 2012)

•

0 likes•406 views

John Dougherty

Technology

What is Hadoop?

• Apache’s implementation of Google’s BigTable
• Uses a Java VM in order to parse instructions
• Uses sequential writes & column based file
structures with HDFS
• Grants the ability to read/write/manipulate very
large data sets/structures.

What is BigTable
• Contains the framework that was based on,
and is used in, hadoop

• Uses a commodity approach to hardware

• Extreme scalability and redundancy

Commodity Perspective
• Commercial Hardware cost vs. failure rate
– Roughly double the cost of commodity
– Roughly 5% failure rate

• Commodity Hardware cost vs. failure rate
– Roughly half the cost of commodity
– Rougly 10-15% failure rate

What is HDFS
• Backend file system for the Hadoop platform

• Allows for easy operability/node management

• Certain technologies can replace or augment
– Hbase (Augments HDFS)
– Cassandra (Replaces HDFS)

What works with Hadoop?
• Middleware and connectivity tools improve functionality

• Hive, Pig, Cassandra (all sub-projects of Apache’s Hadoop)
help to connect and utilize

• Each application set has different uses

Pig

Schedulers/Configurators
• Zookeeper
– Helps you in configuring many nodes
– Can be integrated easily
• Oozie
– A job resource/scheduler for hadoop
– Open source
• Flume
– Concatenator/Aggregator (Dist. log collection)

Middleware
• Hive
– Data warehouse, connects natively to hadoop’s internals
– Uses HiveQL to create queries
– Easily extendable with plugins/macros
• Pig
– Hive-like in that it uses its own query language (pig latin)
– Easily extendable, more like SQL than Hive
• Sqoop
– Connects databases and datasets
– Limited, but powerful

How can Hadoop/Hbase/MapReduce help?

• You have a very large data set(s)
• You require results on your data in a timely
manner
• You don’t enjoy spending millions on
infrastructure
• Your data is large enough to cause a classic
RDBMS headaches

Column Based Data
• Developer woes
– Extract/Transfer/Load is a concern for complicated
schemas
– Egress/Ingress between existing queries/results
becomes complicated
– Solutions are deployed with walls of functionality
– Hard questions turn into hard queries

Column Based Data (cont.)
• Developer joys
– You can now process PB, into EB, and beyond
– Your extended datasets can be aggregated, not
easily; but also unlike ever before
– You can extend your daily queries to include
historical data, even incorporating into existing
real-time data usage

Future Projects/Approaches

• Cross discipline data sharing/comparisons
• Complex statistical models re-constructed
• Massive data set conglomeration and
standardization (Public sector data, mostly)

How some software makes it easier
• Alteryx
– Very similar to Talend for interface, visual
– Allows easy integration into reporting (Crystal Reports)
• Qubole
– This will be expanded on shortly
– Easy to use interface and management of data
• Hortonworks (Open Source)
– Management utility for internal cluster deployments
• Cloudera (Open, to an extent)
– Management utility from Cloudera, also for internal deployments

What's hot

Cloud Optimized Big DataJoydeep Sen Sarma

SpringPeople Introduction to Apache HadoopSpringPeople

The Meta of Hadoop - COMAD 2012Joydeep Sen Sarma

AWS Database ServicesMackenzie LeJeune

HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster Cloudera, Inc.

HBasePooja Sunkapur

Large-scale Web Apps @ PinterestHBaseCon

Introduction to hbaseUday Vakalapudi

Harmonizing Multi-tenant HBase Clusters for Managing Workload DiversityHBaseCon

4. hadoop גיא לבנברגTaldor Group

Google cloud certification data engineerJoseph Holbrook, Chief Learning Officer (CLO)

BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use caseDavid Lauzon

Big Data and Hadoop - History, Technical Deep Dive, and Industry TrendsEsther Kundin

HBaseCon 2012 | Building a Large Search Platform on a Shoestring BudgetCloudera, Inc.

Hadoop online trainingsGeek Trainings

HBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big DataCloudera, Inc.

Sql over hadoop ver 3Sudheesh Narayanan

Kudu demoHemanth Kumar Ratakonda

Transform your DBMS to drive engagement innovation with Big DataAshnikbiz

What's hot (20)

Cloud Optimized Big Data

SpringPeople Introduction to Apache Hadoop

The Meta of Hadoop - COMAD 2012

AWS Database Services

HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster

HBase

Large-scale Web Apps @ Pinterest

Introduction to hbase

Harmonizing Multi-tenant HBase Clusters for Managing Workload Diversity

4. hadoop גיא לבנברג

Google cloud certification data engineer

BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use case

Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends

HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget

Hadoop online trainings

HBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big Data

Sql over hadoop ver 3

Kudu demo

Transform your DBMS to drive engagement innovation with Big Data

Viewers also liked

Pre cd and artist research61141

QtllbDuyen Hoang

Applying Big DataJohn Dougherty

Usability pptClayton Benjamin

Rosalia de CastroDeni-sa

Vir’s ib educators ankeetaparth_damania

Big Data ROIJohn Dougherty

Aca advocacyschacctf

SEO Pricing & CostChakra Systems

Evolución de los avances tecnológicosMerlys Escarpeta

Enc 3241 document_design1Clayton Benjamin

PRUEBA TOEFLPaula Andrea Osorio

Enc 3241 colorClayton Benjamin

Top 150 global design firmsSamar Momin

Subculture hippieannaskelding

Rosalia de CastroDeni-sa

Catedra virtual de cultura ciudadanaLuisa Paternina

페차쿠차ekwjd4793

Jiit 2013 14 project presentation aniket mishraAniket Mishra

페차쿠차_ 조연진연진 조

Viewers also liked (20)

Pre cd and artist research

Qtllb

Applying Big Data

Usability ppt

Rosalia de Castro

Vir’s ib educators ankeeta

Big Data ROI

Aca advocacy

SEO Pricing & Cost

Evolución de los avances tecnológicos

Enc 3241 document_design1

PRUEBA TOEFL

Enc 3241 color

Top 150 global design firms

Subculture hippie

Rosalia de Castro

Catedra virtual de cultura ciudadana

페차쿠차

Jiit 2013 14 project presentation aniket mishra

페차쿠차_ 조연진

Similar to Hadoop Infrastructure (Oct. 3rd, 2012)

Introduction to HadoopDr. C.V. Suresh Babu

SQL Server 2012 and Big DataMicrosoft TechNet - Belgium and Luxembourg

Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3tcloudcomputing-tw

Big Data and Cloud ComputingFarzad Nozarian

Hadoop jonHumoyun Ahmedov

Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...Global Business Events

Big data - Online TrainingLearntek1

Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosLester Martin

Big data Hadoop Ayyappan Paramesh

Apache Hadoop HiveSome corner at the Laboratory

Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2tcloudcomputing-tw

Apache hadoop basicssaili mane

Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valleymarkgrover

Backup and Disaster Recovery in Hadooplarsgeorge

Hadoop introduction , Why and What is Hadoop ?sudhakara st

M. Florence Dayana - Hadoop Foundation for Analytics.pptxDr.Florence Dayana

INTRODUCTION TO BIG DATA HADOOPKrishna Sujeer

Hadoop presentationChandra Sekhar Saripaka

Bridging Oracle Database and Hadoop by Alex Gorbachev, Pythian from Oracle Op...Alex Gorbachev

Technologies for Data Analytics PlatformN Masahiro

Similar to Hadoop Infrastructure (Oct. 3rd, 2012) (20)

Introduction to Hadoop

SQL Server 2012 and Big Data

Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Big Data and Cloud Computing

Hadoop jon

Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...

Big data - Online Training

Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos

Big data Hadoop

Apache Hadoop Hive

Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2

Apache hadoop basics

Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

Backup and Disaster Recovery in Hadoop

Hadoop introduction , Why and What is Hadoop ?

M. Florence Dayana - Hadoop Foundation for Analytics.pptx

INTRODUCTION TO BIG DATA HADOOP

Hadoop presentation

Bridging Oracle Database and Hadoop by Alex Gorbachev, Pythian from Oracle Op...

Technologies for Data Analytics Platform

Recently uploaded

Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro

Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson

Gen AI in Business - Global Trends Report 2024.pdfAddepto

DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell

DMCC Future of Trade Web3 - Special EditionDubai Multi Commodity Centre

Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar

H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati

"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays

Take control of your SAP testing with UiPath Test SuiteDianaGray10

Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren

What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett

Artificial intelligence in cctv survelliance.pptxhariprasad279825

Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz

Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos

"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays

SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero

How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe

Search Engine Optimization SEO PDF for 2024.pdfRankYa

Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3

How to write a Business Continuity PlanDatabarracks

Recently uploaded (20)

Unraveling Multimodality with Large Language Models.pdf

Are Multi-Cloud and Serverless Good or Bad?

Gen AI in Business - Global Trends Report 2024.pdf

DSPy a system for AI to Write Prompts and Do Fine Tuning

DMCC Future of Trade Web3 - Special Edition

Unleash Your Potential - Namagunga Girls Coding Club

H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day

"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...

Take control of your SAP testing with UiPath Test Suite

Advanced Test Driven-Development @ php[tek] 2024

What's New in Teams Calling, Meetings and Devices March 2024

Artificial intelligence in cctv survelliance.pptx

Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost

Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)

"Debugging python applications inside k8s environment", Andrii Soldatenko

SIP trunking in Janus @ Kamailio World 2024

How AI, OpenAI, and ChatGPT impact business and software.

Search Engine Optimization SEO PDF for 2024.pdf

Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx

How to write a Business Continuity Plan

Hadoop Infrastructure (Oct. 3rd, 2012)

1. Infrastructure and Stack Presented by John Dougherty, Viriton 10/03/2012 john.dougherty@viriton.com

2. What is Hadoop? • Apache’s implementation of Google’s BigTable • Uses a Java VM in order to parse instructions • Uses sequential writes & column based file structures with HDFS • Grants the ability to read/write/manipulate very large data sets/structures.

3. What is Hadoop? (cont.) VS.

4. What is BigTable • Contains the framework that was based on, and is used in, hadoop • Uses a commodity approach to hardware • Extreme scalability and redundancy

5. Commodity Perspective • Commercial Hardware cost vs. failure rate – Roughly double the cost of commodity – Roughly 5% failure rate • Commodity Hardware cost vs. failure rate – Roughly half the cost of commodity – Rougly 10-15% failure rate

6. Breaking Down the Complexity

7. What is HDFS • Backend file system for the Hadoop platform • Allows for easy operability/node management • Certain technologies can replace or augment – Hbase (Augments HDFS) – Cassandra (Replaces HDFS)

8. What works with Hadoop? • Middleware and connectivity tools improve functionality • Hive, Pig, Cassandra (all sub-projects of Apache’s Hadoop) help to connect and utilize • Each application set has different uses Pig

9. Layout of Middleware

10. Schedulers/Configurators • Zookeeper – Helps you in configuring many nodes – Can be integrated easily • Oozie – A job resource/scheduler for hadoop – Open source • Flume – Concatenator/Aggregator (Dist. log collection)

11. Middleware • Hive – Data warehouse, connects natively to hadoop’s internals – Uses HiveQL to create queries – Easily extendable with plugins/macros • Pig – Hive-like in that it uses its own query language (pig latin) – Easily extendable, more like SQL than Hive • Sqoop – Connects databases and datasets – Limited, but powerful

12. How can Hadoop/Hbase/MapReduce help? • You have a very large data set(s) • You require results on your data in a timely manner • You don’t enjoy spending millions on infrastructure • Your data is large enough to cause a classic RDBMS headaches

13. Column Based Data • Developer woes – Extract/Transfer/Load is a concern for complicated schemas – Egress/Ingress between existing queries/results becomes complicated – Solutions are deployed with walls of functionality – Hard questions turn into hard queries

14. Column Based Data (cont.) • Developer joys – You can now process PB, into EB, and beyond – Your extended datasets can be aggregated, not easily; but also unlike ever before – You can extend your daily queries to include historical data, even incorporating into existing real-time data usage

15. Future Projects/Approaches • Cross discipline data sharing/comparisons • Complex statistical models re-constructed • Massive data set conglomeration and standardization (Public sector data, mostly)

16. How some software makes it easier • Alteryx – Very similar to Talend for interface, visual – Allows easy integration into reporting (Crystal Reports) • Qubole – This will be expanded on shortly – Easy to use interface and management of data • Hortonworks (Open Source) – Management utility for internal cluster deployments • Cloudera (Open, to an extent) – Management utility from Cloudera, also for internal deployments

Hadoop Infrastructure (Oct. 3rd, 2012)

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Hadoop Infrastructure (Oct. 3rd, 2012)

Similar to Hadoop Infrastructure (Oct. 3rd, 2012) (20)

Recently uploaded

Recently uploaded (20)

Hadoop Infrastructure (Oct. 3rd, 2012)