The Good, The Bad and the Ugly
How to tame the Big Data Beast
Guy Loewenberg
May 2013
Overview
• Data Explosion
Overview
• Big Data: A collection of data sets so large and complex that
it becomes difficult to process using on-hand database
management tools or traditional data processing applications
• Hadoop: A framework that allows distributed
processing of large data-sets across clusters of
computers using a simple programming model
• 1000 Kilobytes = 1 Megabyte
• 1000 Megabytes = 1 Gigabyte
• 1000 Gigabytes = 1 Terabyte
• 1000 Terabytes = 1 Petabyte
• 1000 Petabytes = 1 Exabyte
• 1000 Exabytes = 1 Zettabyte
• 1000 Zettabytes = 1 Yottabyte
• 1000 Yottabytes = 1 Brontobyte (informal name, not an SI unit)
• 1000 Brontobytes = 1 Geopbyte (informal name, not an SI unit)
Most US SME corporations
Most US large corporations
Leaders like Facebook & Google
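The unit ladder above is plain arithmetic: each step is a factor of 1000. A minimal sketch of that conversion, assuming decimal units and stopping at the yottabyte; the function name `human_readable` and the abbreviations are illustrative choices, not part of the original slides:

```python
# Decimal unit ladder from the slide: each step up is a factor of 1000.
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def human_readable(num_bytes: int) -> str:
    """Format a byte count with the largest unit that keeps the value below 1000."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value fits, or at the top of the ladder.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:.1f} {unit}"
        value /= 1000


if __name__ == "__main__":
    print(human_readable(1500))          # kilobyte range
    print(human_readable(2 * 10**15))    # petabyte range
```
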
Hadoop Basics
• Designed to scale
• Uses commodity hardware
• Processes data in batches
• Can process data at very large scale (petabytes)
Core Hadoop
• Core Hadoop is built from two main systems:
– Hadoop Distributed File System (HDFS)
– MapReduce programming framework
Hadoop architecture
• Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage.
– The NameNode controls HDFS, whereas the DataNodes handle block replication and read/write operations and carry the HDFS workload
– They work in a master/slave mode.
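The division of labour between the NameNode and the DataNodes can be illustrated with a toy model. This is only a sketch under stated assumptions: the function names `place_blocks` and `re_replicate` are hypothetical, placement is simple round-robin (real HDFS placement is rack-aware), and "self-healing" is reduced to copying lost replicas onto surviving nodes.

```python
def place_blocks(file_size, block_size, datanodes, replication=3):
    """Toy NameNode: split a file into blocks and record which DataNodes hold each one."""
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    num_blocks = -(-file_size // block_size)  # ceiling division
    n = len(datanodes)
    # Round-robin placement (an assumption for illustration only).
    return {
        block_id: [datanodes[(block_id + r) % n] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def re_replicate(block_map, failed, datanodes, replication=3):
    """Toy self-healing: restore the replica count after one DataNode fails."""
    live = [d for d in datanodes if d != failed]
    healed = {}
    for block_id, nodes in block_map.items():
        kept = [d for d in nodes if d != failed]
        spare = [d for d in live if d not in kept]
        healed[block_id] = kept + spare[: replication - len(kept)]
    return healed


if __name__ == "__main__":
    nodes = ["dn1", "dn2", "dn3", "dn4"]
    block_map = place_blocks(file_size=300, block_size=128, datanodes=nodes)
    print(block_map)
    print(re_replicate(block_map, failed="dn2", datanodes=nodes))
```

The master holds only metadata (the block map); the data itself lives on the slaves, which is why losing one DataNode is recoverable as long as other replicas exist.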
Hadoop architecture
• MapReduce: Distributed, fault-tolerant resource management and scheduling coupled with a scalable data programming abstraction.
– The JobTracker schedules jobs and allocates activities to TaskTracker nodes, which execute the requested map and reduce tasks
– They work in a master/slave mode
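The programming abstraction itself is small. The following single-process sketch of the classic word count shows the three stages (map, shuffle, reduce); it is an illustration in plain Python, not Hadoop's Java API, and the function names are hypothetical. On a real cluster the JobTracker would distribute the map and reduce calls across TaskTracker nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence on an in-memory input."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


if __name__ == "__main__":
    print(run_job(["big data", "Big beast"]))
```
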
Hadoop software architecture
• MapReduce: parallel data processing framework for large data sets
• HDFS: Hadoop Distributed File System
• Oozie: MapReduce job scheduler
• HBase: key-value database
• Pig: analysis language for large data sets
• Hive: high-level language for analyzing large data sets
• ZooKeeper: distributed coordination system
• Solr / Lucene: search engine and query engine library
What Hadoop can’t do
• Hadoop lets you perform batch analysis on whatever data you have stored within Hadoop. That data does not have to be structured.
– Many solutions take advantage of Hadoop's low storage cost to keep structured data there instead of in an RDBMS, but shifting data back and forth between Hadoop and an RDBMS would be overkill.
– Transactional data is highly complex: a single transaction on an e-commerce site can generate many steps that all have to be completed quickly. That scenario is not ideal for Hadoop.
– Structured data sets that require very low latency are also a poor fit.
Comparing RDBMS to MapReduce
                RDBMS                   MapReduce
Data size       Gigabytes               Petabytes
Access          Interactive and batch   Batch
Structure       Fixed schema            Unstructured schema
Language        SQL                     Procedural (Java, C++, Ruby, etc.)
Integrity       High                    Low
Scaling         Nonlinear               Linear
Updates         Read and write          Write once, read many times
Latency         Low                     High
What Hadoop can do
• High data volumes, stored in Hadoop and queried at length later using MapReduce functions
– index building
– pattern recognition
– creating recommendation engines
– sentiment analysis
• Hadoop should be integrated within your existing IT infrastructure in order to capitalize on the countless pieces of data that flow into your organization.
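Index building, the first workload listed above, maps naturally onto the same map-then-group pattern. A minimal in-memory sketch, assuming documents are short strings keyed by a numeric id; `map_doc` and `build_index` are hypothetical names, and a real job would run the map step on many nodes in parallel:

```python
from collections import defaultdict


def map_doc(doc_id, text):
    """Map: emit (word, doc_id) once for each distinct word in a document."""
    for word in set(text.lower().split()):
        yield word, doc_id


def build_index(docs):
    """Group the emitted pairs by word to form an inverted index."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word, emitted_id in map_doc(doc_id, text):
            index[word].add(emitted_id)
    # Sort the posting lists so the result is deterministic.
    return {word: sorted(ids) for word, ids in index.items()}


if __name__ == "__main__":
    docs = {1: "hadoop stores data", 2: "hadoop processes data in batches"}
    print(build_index(docs)["hadoop"])
```
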
Hadoop Maturity?!
• Inaccessible to analysts without programming ability
• Clusters keep no record of who changed which record and when it was changed
• Storage functionality that IT teams have always depended on (snapshots, mirroring) is lacking in HDFS
• Incompatibility with existing tools
• Data without structure has limited value and applying the
structure at query time requires a lot of Java code.
• Limited documentation
• Limited troubleshooting capabilities
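"Applying the structure at query time" means the raw data is stored as-is and every query carries its own parsing logic. The slide refers to Java; the sketch below uses Python only to keep the idea short. The three-field log format, the regular expression, and the function names are assumptions made for illustration.

```python
import re

# Hypothetical raw log format: "<ip> <http status> <bytes sent>"
LOG_PATTERN = re.compile(r"(?P<ip>\S+) (?P<status>\d{3}) (?P<bytes>\d+)")


def parse(line):
    """Impose a schema on one raw line; return None if the line does not fit."""
    match = LOG_PATTERN.fullmatch(line.strip())
    if match is None:
        return None
    return {
        "ip": match["ip"],
        "status": int(match["status"]),
        "bytes": int(match["bytes"]),
    }


def total_bytes_by_status(raw_lines):
    """The 'query': parse each line on the fly, skip malformed ones, then aggregate."""
    totals = {}
    for line in raw_lines:
        record = parse(line)
        if record is None:
            continue
        totals[record["status"]] = totals.get(record["status"], 0) + record["bytes"]
    return totals


if __name__ == "__main__":
    lines = ["10.0.0.1 200 512", "garbage", "10.0.0.2 200 100", "10.0.0.3 404 20"]
    print(total_bytes_by_status(lines))
```

Every new question about the data needs its own version of `parse`, which is the cost the bullet above points to.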
Choosing your infrastructure
• Define what you want to achieve
– POC
– Scale (few, tens, hundreds)
– One-time, periodic, continuous
• Infrastructure design
– Servers, storage, network, rack-space
– Form a joint team of Hadoop app/dev and infrastructure specialists (facilities/server/network) when building a solution
– Virtual machines vs. physical machines (I/O performance, high CPU, network)
Choosing your infrastructure
• Network infrastructure
– Data movement between nodes (rack-awareness,
replication factor)
– Data between sites (Hosting/Service)
• Storage (architecture, disks)
– Local disks, JBOD
– Increase default block-size
• Operations
– Monitor
– Backup (configuration files, journal, Checkpoint …)
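Block size and replication factor translate directly into capacity numbers. A rough sketch of that arithmetic follows; the function name `storage_plan` and the example sizes are illustrative values, not recommendations from the slides:

```python
def storage_plan(file_bytes, block_bytes, replication=3):
    """Estimate block count, replica count and raw disk usage for one file."""
    blocks = -(-file_bytes // block_bytes)  # ceiling division
    return {
        "blocks": blocks,                        # entries the NameNode must track
        "block_replicas": blocks * replication,  # block copies spread over DataNodes
        "raw_bytes": file_bytes * replication,   # disk consumed across the cluster
    }


if __name__ == "__main__":
    one_tb = 10**12
    print(storage_plan(one_tb, 64 * 10**6))
    print(storage_plan(one_tb, 128 * 10**6))
```

Doubling the block size roughly halves the number of blocks the NameNode has to track, while the raw disk usage depends only on the replication factor.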
Performance & Scale considerations
• Consider running each of the following on a dedicated, standalone server, not shared with other Hadoop processes:
– NameNode, Secondary NameNode and/or Checkpoint Node
– JobTracker and the HBase (or any DB) Master
• Consider a dedicated physical environment
Thank you!
Hadoop - The Good, The Bad and the Ugly
Guy Loewenberg
SUPPORTING SLIDES
HDFS Architecture
Improving RDBMS with Hadoop
• Accelerating nightly batch business processes.
• Storage of extremely high volumes of enterprise data
• Creation of automatic redundant backups
• Improving the scalability of applications
• Use of Java for data processing instead of SQL.
• Produce just-in-time feeds for dashboards and business intelligence
• Handling urgent, ad hoc requests for data
• Turning unstructured data into relational data
• Taking on tasks that require massive parallelism
• Moving existing algorithms, code, frameworks, and components to
a highly distributed computing environment.


Editor's Notes

  1. NameNode and DataNode are HDFS components that work in a master/slave mode. The NameNode is the major component that controls HDFS, whereas the DataNodes handle block replication and read/write operations and carry the HDFS workload.
  2. JobTracker and TaskTracker are also components that work in a master/slave mode: the JobTracker controls the map and reduce tasks at individual nodes, among other duties. The TaskTrackers run at the node level and maintain communication with the JobTracker for all nodes within the cluster.
  3. The main components include:
     – Hadoop: Java software framework to support data-intensive distributed applications
     – ZooKeeper: a highly reliable distributed coordination system
     – MapReduce: a flexible parallel data processing framework for large data sets
     – HDFS: Hadoop Distributed File System
     – Oozie: a MapReduce job scheduler
     – HBase: key-value database
     – Hive: a high-level language built on top of MapReduce for analyzing large data sets
     – Pig: enables the analysis of large data sets using Pig Latin, a high-level language compiled into MapReduce for parallel data processing