Zohar Elkayam
www.realdbamagic.com
Twitter: @realmgic
Things Every Oracle DBA
Needs to Know about the
Hadoop Ecosystem
Who am I?
• Zohar Elkayam, CTO at Brillix
• DBA, team leader, database trainer, public speaker, and a
senior consultant for over 18 years
• Oracle ACE Associate
• Involved with Big Data projects since 2011
• Blogger – www.realdbamagic.com and www.ilDBA.co.il
http://brillix.co.il
About Brillix
• Brillix is a leading company that specializes in Data
Management
• We provide professional services, training and
consulting for Databases, Security, NoSQL, and Big
Data solutions
• Providing the Brillix Big Data Experience Center
Agenda
• What is the Big Data challenge?
• A Big Data Solution: Hadoop
• HDFS
• MapReduce and YARN
• Hadoop Ecosystem: HBase, Sqoop, Hive, Pig and other
tools
• Where does the DBA fit in?
The Challenge
The Big Data Challenge
Volume
• Big data comes in one size: Big.
• Size is measured in Terabytes (10^12 bytes), Petabytes (10^15),
Exabytes (10^18), Zettabytes (10^21)
• The storing and handling of the data becomes an issue
• Producing value out of the data in a reasonable time is
an issue
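To make the "reasonable time" point concrete, here is a rough back-of-the-envelope sketch. The 100 MB/s per-disk read rate and the function name scan_seconds are assumptions chosen for illustration, not figures from this talk.

```python
# Back-of-the-envelope scan times.
# ASSUMPTION: each disk reads sequentially at about 100 MB/s (a round number,
# not a measurement), and the work divides evenly across disks.
TERABYTE = 10**12
PETABYTE = 10**15


def scan_seconds(size_bytes, disks, mb_per_sec_per_disk=100.0):
    """Seconds needed to read size_bytes when spread evenly over `disks`."""
    bytes_per_sec = disks * mb_per_sec_per_disk * 10**6
    return size_bytes / bytes_per_sec


if __name__ == "__main__":
    for disks in (1, 100, 1000):
        hours = scan_seconds(PETABYTE, disks) / 3600
        print(f"1 PB over {disks:>4} disk(s): {hours:,.1f} hours")
```

Under these assumptions a single disk needs months to read one petabyte, while a thousand disks working in parallel need a few hours, which is the motivation for scaling out.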
Variety
• Big Data extends beyond structured data, including semi-structured
and unstructured information: logs, text, audio and videos
• Wide variety of rapidly evolving data types requires highly flexible
stores and handling
Unstructured          | Structured
Objects               | Tables
Flexible              | Columns and Rows
Structure Unknown     | Predefined Structure
Textual and Binary    | Mostly Textual
Velocity
• The speed at which the data is generated and
collected
• Streaming data and large volume data movement
• High velocity of data capture – requires rapid ingestion
• Might cause a backlog problem
Okay, So What Defines Big Data?
• When the data is too big or moves too fast to handle in
a sensible amount of time
• When the data doesn’t fit any conventional database
structure
• When the solution to the business need becomes part
of the problem
• When we think that we can still produce value from that
data and want to handle it
Value
Big data is not about the size of the data,
it’s about the value within the data
How to do Big Data
Big Data in Practice
• Big data is big: technological infrastructure solutions
are needed
• Big data is complicated:
• We need developers to manage handling of the data
• We need DevOps to manage the clusters
• We need data analysts and data scientists to produce value
Infrastructure Challenges
• Infrastructure that is built for:
• Large-scale
• Distributed / scaled out
• Data-intensive jobs that spread the problem across clusters
of server nodes
Infrastructure Challenges (cont.)
• Storage: Efficient and cost-effective enough to capture
and store terabytes, if not petabytes, of data
• Network infrastructure that can quickly import large data
sets and then replicate them to various nodes for
processing
• Security capabilities that protect highly-distributed
infrastructure and data
A Big Data Solution:
Apache Hadoop
Apache Hadoop
• Open source project run by the Apache Software Foundation (2006)
• Hadoop brings the ability to cheaply process large
amounts of data, regardless of its structure
• It has been the driving force behind the growth of the
big data industry
• Get the public release from:
• http://hadoop.apache.org/core/
Original Hadoop Components
• HDFS: Hadoop Distributed File System – distributed file
system that runs in a clustered environment
• MapReduce – programming paradigm for running
processes over a clustered environment.
Main idea: bring the program to the data
Hadoop Benefits
• Reliable solution based on unreliable hardware
• Designed for large files
• Load data first, structure later
• Designed to maximize throughput of large scans
• Designed to leverage parallelism
• Designed for scale out
• Flexible development platform
• Solution Ecosystem
What Hadoop Is Not
• Hadoop is not a database – it is not a replacement for
a DW, or for other relational databases
• Hadoop is not for OLTP/real-time systems
• Very good for large amounts of data, not so much for smaller sets
• Designed for clusters – there is no Hadoop monster
server (single server)
Hadoop Limitations
• Hadoop is scalable but it’s not fast
• Some assembly is required
• Batteries are not included
• DIY mindset
• Open source limitations apply
• Technology is changing very rapidly
Hadoop under the Hood
Original Hadoop 1.0 Components
• HDFS: Hadoop Distributed File System – distributed file
system that runs in a clustered environment
• MapReduce – programming paradigm for running
processes over a clustered environment
Hadoop 2.0
• Hadoop 2.0 changed the Hadoop concept and
introduced better resource management and speed:
• Hadoop Common
• HDFS
• YARN
• Multiple data processing frameworks including MapReduce, Spark and others
HDFS is...
• A distributed file system
• Designed to reliably store data using commodity hardware
• Designed to expect hardware failures and stay resilient
• Intended for large files
• Designed for batch inserts
Files and Blocks
• Files are split into blocks (single unit of storage)
• Managed by Namenode and stored on Datanodes
• Transparent to users
• Replicated across machines at load time
• Same block is stored on multiple machines
• Good for fault-tolerance and access
• Default replication factor is 3
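The splitting and replication described above can be sketched roughly as follows. This is a toy model, not HDFS code: the 128 MB block size is an assumed setting, the datanode names are hypothetical, and the round-robin placement stands in for the real rack-aware placement policy.

```python
# Toy model of splitting a file into blocks and placing replicas.
BLOCK_SIZE = 128 * 1024 * 1024  # ASSUMPTION: 128 MB blocks
REPLICATION = 3                 # default replication factor from the slide


def block_count(file_bytes, block_size=BLOCK_SIZE):
    """Number of blocks needed to hold file_bytes (the last block may be partial)."""
    return -(-file_bytes // block_size)  # integer ceiling division


def place_blocks(file_bytes, datanodes, replication=REPLICATION,
                 block_size=BLOCK_SIZE):
    """Map each block id to `replication` distinct datanodes (round-robin)."""
    if replication > len(datanodes):
        raise ValueError("need at least as many datanodes as replicas")
    placement = {}
    for block_id in range(block_count(file_bytes, block_size)):
        placement[block_id] = [
            datanodes[(block_id + r) % len(datanodes)]
            for r in range(replication)
        ]
    return placement


if __name__ == "__main__":
    one_gb = 1024 ** 3
    nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical datanode names
    for block_id, replicas in place_blocks(one_gb, nodes).items():
        print(block_id, replicas)
```

With these settings a 1 GB file becomes 8 blocks and 24 stored block replicas, so losing any single datanode still leaves two copies of every block.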
HDFS Node Types
HDFS has three types of nodes:
• Datanodes
• Responsible for the actual file storage
• Serve file data to clients
• Namenode (MasterNode)
• Distributes files across the cluster
• Responsible for the replication between
the datanodes and for tracking file block locations
• BackupNode
• Backup node for the NameNode
HDFS is Good for...
• Storing large files
• Terabytes, Petabytes, etc...
• Millions rather than billions of files
• 128MB or more per file
• Streaming data
• Write-once, read-many-times patterns
• Optimized for streaming reads rather than random reads
HDFS is Not So Good For...
• Low-latency reads / Real-time application
• High-throughput rather than low latency for small chunks of data
• HBase addresses this issue
• Large numbers of small files
• Better for millions of large files instead of billions of small files
• Multiple Writers
• Single writer per file
• Writes go at the end of files, no support for arbitrary offsets
MapReduce is...
• A programming model for expressing distributed
computations at a massive scale
• An execution framework for organizing and performing
such computations
• Bring the code to the data, not the data to the code
The MapReduce Paradigm
• Imposes key-value input/output
• We implement two functions:
• MAP - Takes a large problem, divides it into sub-problems and
performs the same function on all sub-problems
Map(k1, v1) -> list(k2, v2)
• REDUCE - Combine the output from all sub-problems (each key goes
to the same reducer)
Reduce(k2, list(v2)) -> list(v3)
• Framework handles everything else (almost)
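As an illustration of the two functions, here is a minimal single-process word count. It is a sketch of the paradigm only: run_job is a hypothetical stand-in for the framework (grouping by key plays the role of the shuffle), and nothing here uses the Hadoop API.

```python
from collections import defaultdict


def map_fn(_line_no, line):
    """Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce(k2, list(v2)) -> list(v3): sum the counts for one word."""
    yield word, sum(counts)


def run_job(records, mapper, reducer):
    """Stand-in for the framework: map, group by key (shuffle/sort), reduce."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)      # every value for a key lands together
    output = []
    for k2 in sorted(groups):          # keys are processed in sorted order
        output.extend(reducer(k2, groups[k2]))
    return output


lines = [(0, "the quick brown fox"), (1, "the lazy dog"), (2, "the fox")]
result = run_job(lines, map_fn, reduce_fn)

if __name__ == "__main__":
    print(result)
```

On a real cluster the mapper calls run in parallel next to the data, and all values for one key are routed to the same reducer.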
Divide and Conquer
MapReduce is Good For...
• Embarrassingly parallel algorithms
• Summing, grouping, filtering, joining
• Off-line batch jobs on massive data sets
• Analyzing an entire large dataset
MapReduce is NOT Good For...
• Jobs that need shared state or coordination
• Tasks are shared-nothing
• Shared-state requires scalable state store
• Low-latency jobs
• Jobs on smaller datasets
• Finding individual records
YARN
• Takes care of distributed processing and coordination
• Scheduling
• Jobs are broken down into smaller chunks called tasks
• These tasks are scheduled to run on data nodes
• Task Localization with Data
• Framework strives to place tasks on the nodes that host the
segment of data to be processed by that specific task
• Code is moved to where the data is
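The idea of placing tasks next to their data can be sketched with a toy scheduler. This is not how YARN is implemented; the function name schedule, the node names and the one-slot-per-node setup are all assumptions made for illustration.

```python
def schedule(block_hosts, free_slots):
    """Assign each task to a node, preferring nodes that host its data block.

    block_hosts: {task: [nodes holding a replica of the task's block]}
    free_slots:  {node: number of free task slots}
    Returns {task: (node, is_data_local)}.
    """
    slots = dict(free_slots)  # copy so the caller's dict is left untouched
    assignment = {}
    for task, hosts in block_hosts.items():
        local = [node for node in hosts if slots.get(node, 0) > 0]
        if local:
            node = local[0]                    # data-local placement
        else:
            remote = [node for node, free in slots.items() if free > 0]
            if not remote:
                raise RuntimeError("no free slots left in the cluster")
            node = remote[0]                   # fall back to a remote read
        slots[node] -= 1
        assignment[task] = (node, node in hosts)
    return assignment


# Hypothetical cluster: four nodes with one free slot each.
block_hosts = {
    "t0": ["dn1", "dn2"],
    "t1": ["dn1", "dn3"],
    "t2": ["dn1", "dn2"],
    "t3": ["dn1"],
}
free_slots = {"dn1": 1, "dn2": 1, "dn3": 1, "dn4": 1}
assignment = schedule(block_hosts, free_slots)

if __name__ == "__main__":
    for task, (node, is_local) in assignment.items():
        print(task, node, "local" if is_local else "remote")
```

Three of the four tasks run where a replica of their block lives; the last one has to read its block over the network, which is the case the framework tries to avoid.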
YARN (cont.)
• Error Handling
• Failures are an expected behavior so tasks are automatically
retried on other machines
• Data Synchronization
• Shuffle and Sort barrier re-arranges and moves data between
machines
• Input and output are coordinated by the framework
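The error-handling behavior can be illustrated with a small retry loop. This is a sketch under stated assumptions: run_with_retries, count_lines, the node names and the limit of three attempts are all hypothetical and are not taken from YARN itself.

```python
def run_with_retries(task, nodes, max_attempts=3):
    """Run task(node) on successive nodes until one attempt succeeds.

    Returns (result, attempt_number). Raises RuntimeError if every
    attempted node fails.
    """
    errors = []
    for attempt, node in enumerate(nodes[:max_attempts], start=1):
        try:
            return task(node), attempt
        except Exception as exc:      # a failed attempt is expected, not fatal
            errors.append((node, repr(exc)))
    raise RuntimeError(f"task failed on all attempted nodes: {errors}")


BROKEN_NODES = {"dn1"}  # hypothetical: pretend this machine has a bad disk


def count_lines(node):
    """Hypothetical task that fails on broken nodes and succeeds elsewhere."""
    if node in BROKEN_NODES:
        raise IOError(f"disk failure on {node}")
    return 42


outcome = run_with_retries(count_lines, ["dn1", "dn2", "dn3"])

if __name__ == "__main__":
    print(outcome)
```

Because tasks are shared-nothing, rerunning one on another machine is safe: the retry simply reads another replica of the same block.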
Extending Hadoop
Improving Hadoop
• Core Hadoop is complicated so some tools and solution
frameworks were added to make things easier
• There are over 80 different Apache projects for big data
solutions that use Hadoop
• Hadoop Distributions collect some of these tools and
release them as a complete package
• Cloudera
• HortonWorks
• MapR
• Amazon EMR
Common HADOOP 2.0 Technology Eco System
Databases and DB Connectivity
• HBase: NoSQL key/value, wide-column oriented
datastore that is native to HDFS
• Sqoop: a tool designed to import data from relational
databases into Hadoop (HDFS, HBase, or Hive) and export it back
HBase
• HBase is the closest thing we had to a database in the early
Hadoop days
• Distributed key/value, wide-column oriented database
built on top of HDFS - providing Bigtable-like capabilities
• Does not have a query language: only get, put, and scan
commands
• Usually compared with Cassandra (non-Hadoop native
Apache project)
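The get, put, and scan access pattern can be sketched with a tiny in-memory store that keeps row keys sorted. This is a toy model only: SortedKeyValueStore and the sample row keys are hypothetical, and nothing here is the HBase client API.

```python
import bisect


class SortedKeyValueStore:
    """Toy wide-column store: rows are kept in sorted row-key order."""

    def __init__(self):
        self._keys = []   # sorted list of row keys
        self._rows = {}   # row key -> {column: value}

    def put(self, row_key, column, value):
        """Insert or overwrite one cell; rows may have different columns."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Return all cells of one row (empty dict if the row is absent)."""
        return dict(self._rows.get(row_key, {}))

    def scan(self, start_key, stop_key):
        """Yield rows with start_key <= row key < stop_key, in key order."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        for key in self._keys[lo:hi]:
            yield key, dict(self._rows[key])


store = SortedKeyValueStore()
store.put("user#0002", "info:name", "bob")
store.put("user#0001", "info:name", "alice")
store.put("user#0002", "info:city", "haifa")
store.put("order#0001", "info:total", "42")

if __name__ == "__main__":
    print(store.get("user#0002"))
    for key, cells in store.scan("user#", "user#~"):
        print(key, cells)
```

Every operation is addressed by row key or key range, which is why row-key design matters so much more here than in a relational schema.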
When do we use HBase?
• Huge volumes of randomly accessed data
• HBase is at its best when it’s accessed in a distributed fashion
by many clients (high consistency)
• Consider HBase when we are loading data by key, searching
data by key (or range), serving data by key, querying data by
key or when storing data by row that doesn’t conform well to a
schema.
When NOT to use HBase
• HBase doesn’t use SQL, doesn’t have an optimizer, and
doesn’t support transactions or joins
• HBase doesn’t have data types
• See project Apache Phoenix for better data structure
and query language when using HBase
Sqoop2
• Sqoop is a command line tool for moving data from RDBMS to Hadoop
• Uses a MapReduce program or Hive to load the data
• Can also export data from HBase to RDBMS
• Comes with connectors to MySQL, PostgreSQL, Oracle, SQL Server and
DB2.
• Example:
$ bin/sqoop import --connect 'jdbc:sqlserver://10.80.181.127;username=dbuser;password=dbpasswd;database=tpch' \
    --table lineitem --hive-import
$ bin/sqoop export --connect 'jdbc:sqlserver://10.80.181.127;username=dbuser;password=dbpasswd;database=tpch' \
    --table lineitem --export-dir /data/lineitemData
Improving MapReduce Programmability
• Pig: Programming language that simplifies Hadoop
actions: loading, transforming and sorting of data
• Hive: enables Hadoop to operate as a data warehouse
using SQL-like syntax
Pig
• Pig is an abstraction on top of Hadoop
• Provides a high-level programming language designed for data
processing
• Scripts are converted into MapReduce code and executed on
the Hadoop cluster
• Makes ETL processing and other simple MapReduce jobs
easier without writing MapReduce code
• Often replaced by more up-to-date tools like
Apache Spark
Hive
• Data warehousing solution built on top of Hadoop
• Provides SQL-like query language named HiveQL
• Minimal learning curve for people with SQL expertise
• Data analysts are the target audience
• Early Hive development work started at Facebook in
2007
Hive Provides
• Ability to bring structure to various data formats
• Simple interface for ad hoc querying, analyzing and
summarizing large amounts of data
• Access to files on various data stores such as HDFS
and HBase
• Also see: Apache Impala (mainly in Cloudera)
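"Bringing structure to various data formats" is often described as schema on read: the files stay raw, and a schema is applied only when they are queried. The sketch below imitates that idea in plain Python rather than HiveQL; the pipe-delimited sample data, the column names and both function names are assumptions made for illustration.

```python
# Raw, untyped lines as they might sit in a file (hypothetical sample data).
RAW_LINES = ["1|widget|2.50", "2|gadget|10.00", "3|widget|4.00"]

# The schema lives apart from the data and is applied at read time.
SCHEMA = [("order_id", int), ("product", str), ("price", float)]


def read_with_schema(lines, schema, delimiter="|"):
    """Yield one typed dict per raw line, using the given column schema."""
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


def total_by_product(rows):
    """Roughly what a SUM(price) ... GROUP BY product query would compute."""
    totals = {}
    for row in rows:
        totals[row["product"]] = totals.get(row["product"], 0.0) + row["price"]
    return totals


totals = total_by_product(read_with_schema(RAW_LINES, SCHEMA))

if __name__ == "__main__":
    print(totals)
```

Changing the schema means changing only the SCHEMA list; the stored lines are never rewritten, which is the practical meaning of "load data first, structure later".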
Improving Hadoop – More useful tools
• For improving coordination: ZooKeeper
• For improving log collection: Flume
• For improving scheduling/orchestration: Oozie
• For improving UI: Hue/Ambari
ZooKeeper
• ZooKeeper is a centralized service for maintaining configuration
information, naming, providing distributed synchronization, and providing
group services
• It allows distributed processes to coordinate with each other through a
shared hierarchical namespace which is organized similarly to a standard
file system
• ZooKeeper stamps each update with a number that reflects the order of all
ZooKeeper transactions
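The two ideas above, a file-system-like namespace and a global ordering number on every update, can be sketched with a tiny in-memory tree. This is a toy model, not the ZooKeeper client API: TinyZNodeTree and its methods are hypothetical, and the counter only imitates the role of ZooKeeper's transaction id (zxid).

```python
class TinyZNodeTree:
    """Toy hierarchical namespace where every update gets an ordering number."""

    def __init__(self):
        self._zxid = 0                    # counter imitating a transaction id
        self._nodes = {"/": (b"", 0)}     # path -> (data, zxid of last update)

    @staticmethod
    def _parent(path):
        parent = path.rsplit("/", 1)[0]
        return parent or "/"

    def create(self, path, data=b""):
        """Create a node under an existing parent; returns the update's number."""
        if path in self._nodes:
            raise KeyError(f"node already exists: {path}")
        if self._parent(path) not in self._nodes:
            raise KeyError(f"parent does not exist: {self._parent(path)}")
        self._zxid += 1
        self._nodes[path] = (data, self._zxid)
        return self._zxid

    def set(self, path, data):
        """Overwrite a node's data; returns the update's number."""
        if path not in self._nodes:
            raise KeyError(f"no such node: {path}")
        self._zxid += 1
        self._nodes[path] = (data, self._zxid)
        return self._zxid

    def get(self, path):
        """Return (data, number of the last update that touched the node)."""
        return self._nodes[path]

    def children(self, path):
        """List the direct children of a node, sorted by path."""
        prefix = path.rstrip("/") + "/"
        return sorted(
            p for p in self._nodes
            if p != path and p.startswith(prefix) and "/" not in p[len(prefix):]
        )


tree = TinyZNodeTree()
tree.create("/app")
tree.create("/app/config", b"v1")
tree.set("/app/config", b"v2")

if __name__ == "__main__":
    print(tree.get("/app/config"))
    print(tree.children("/"))
```

Because every change carries a strictly increasing number, two clients that see the same number know they have seen the same history of updates.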
Flume
• Flume is a distributed system for collecting log data from many
sources, aggregating it, and writing it to HDFS
• Flume does for files what Sqoop does for RDBMS
• Flume maintains a central list of ongoing data flows, stored
redundantly in ZooKeeper.
Is Hadoop the Only Big Data Solution?
• No – There are other solutions:
• Apache Spark and Apache Mesos frameworks
• NoSQL systems (Apache Cassandra, Couchbase, MongoDB and
many others)
• Stream analysis (Apache Kafka, Apache Flink)
• Machine learning (Apache Mahout, Spark MLlib)
• Some can be integrated with Hadoop, but some are
independent
Where Does the DBA Fit In?
• Big Data solutions are not databases. Databases are
probably not going to disappear, but we feel the change
even today: DBAs must be ready for the change
• DBAs are the perfect candidates to transition into Big Data
Experts:
• Have system (OS, disk, memory, hardware) experience
• Can understand data easily
• DBAs are used to working with developers and other data users
What Do DBAs Need Now?
• DBAs will need to know more programming: Java,
Scala, Python, R or any other popular language in the
Big Data world will do
• DBAs need to understand the position shifts, and the
introduction of DevOps, Data Scientists, CDO etc.
• Big Data is changing daily: we need to learn, read, and
be involved before we are left behind…
Q&A
Summary
• Big Data is here – it’s complicated and RDBMS does
not fit anymore
• Big Data solutions are evolving; Hadoop is an example
of such a solution
• Spark is a very popular Big Data solution
• DBAs need to be ready for the change: Big Data
solutions are not databases and we must make ourselves
ready
Thank You
Zohar Elkayam
twitter: @realmgic
Zohar@Brillix.co.il
www.realdbamagic.com
http://brillix.co.il59

More Related Content

What's hot

MySQL 5.7 New Features for Developers
MySQL 5.7 New Features for DevelopersMySQL 5.7 New Features for Developers
MySQL 5.7 New Features for Developers
Zohar Elkayam
 
Docker Concepts for Oracle/MySQL DBAs and DevOps
Docker Concepts for Oracle/MySQL DBAs and DevOpsDocker Concepts for Oracle/MySQL DBAs and DevOps
Docker Concepts for Oracle/MySQL DBAs and DevOps
Zohar Elkayam
 
Oracle Database Performance Tuning Advanced Features and Best Practices for DBAs
Oracle Database Performance Tuning Advanced Features and Best Practices for DBAsOracle Database Performance Tuning Advanced Features and Best Practices for DBAs
Oracle Database Performance Tuning Advanced Features and Best Practices for DBAs
Zohar Elkayam
 
SQLcl the next generation of SQLPlus?
SQLcl the next generation of SQLPlus?SQLcl the next generation of SQLPlus?
SQLcl the next generation of SQLPlus?
Zohar Elkayam
 
Simplify Consolidation with Oracle Database 12c
Simplify Consolidation with Oracle Database 12cSimplify Consolidation with Oracle Database 12c
Simplify Consolidation with Oracle Database 12c
Maris Elsins
 
Oracle Database In-Memory Option for ILOUG
Oracle Database In-Memory Option for ILOUGOracle Database In-Memory Option for ILOUG
Oracle Database In-Memory Option for ILOUG
Zohar Elkayam
 
Oracle 12c New Features For Better Performance
Oracle 12c New Features For Better PerformanceOracle 12c New Features For Better Performance
Oracle 12c New Features For Better Performance
Zohar Elkayam
 
Fast, Flexible Application Development with Oracle Database Cloud Service
Fast, Flexible Application Development with Oracle Database Cloud ServiceFast, Flexible Application Development with Oracle Database Cloud Service
Fast, Flexible Application Development with Oracle Database Cloud Service
Gustavo Rene Antunez
 
Architecting Your Own DBaaS in a Private Cloud with EM12c
Architecting Your Own DBaaS in a Private Cloud with EM12cArchitecting Your Own DBaaS in a Private Cloud with EM12c
Architecting Your Own DBaaS in a Private Cloud with EM12c
Gustavo Rene Antunez
 
Winning performance challenges in oracle multitenant
Winning performance challenges in oracle multitenantWinning performance challenges in oracle multitenant
Winning performance challenges in oracle multitenant
Pini Dibask
 
Oracle Fleet Patching and Provisioning Deep Dive Webcast Slides
Oracle Fleet Patching and Provisioning Deep Dive Webcast SlidesOracle Fleet Patching and Provisioning Deep Dive Webcast Slides
Oracle Fleet Patching and Provisioning Deep Dive Webcast Slides
Ludovico Caldara
 
Maa goldengate-rac-2007111
Maa goldengate-rac-2007111Maa goldengate-rac-2007111
Maa goldengate-rac-2007111
pablitosax
 
Oracle Goldengate training by Vipin Mishra
Oracle Goldengate training by Vipin Mishra Oracle Goldengate training by Vipin Mishra
Oracle Goldengate training by Vipin Mishra
Vipin Mishra
 
Hadoop For Enterprises
Hadoop For EnterprisesHadoop For Enterprises
Hadoop For Enterprises
nvvrajesh
 
An introduction into Oracle Enterprise Manager Cloud Control 12c Release 3
An introduction into Oracle Enterprise Manager Cloud Control 12c Release 3An introduction into Oracle Enterprise Manager Cloud Control 12c Release 3
An introduction into Oracle Enterprise Manager Cloud Control 12c Release 3
Marco Gralike
 
An AMIS Overview of Oracle database 12c (12.1)
An AMIS Overview of Oracle database 12c (12.1)An AMIS Overview of Oracle database 12c (12.1)
An AMIS Overview of Oracle database 12c (12.1)
Marco Gralike
 
How many ways to monitor oracle golden gate-Collaborate 14
How many ways to monitor oracle golden gate-Collaborate 14How many ways to monitor oracle golden gate-Collaborate 14
How many ways to monitor oracle golden gate-Collaborate 14
Bobby Curtis
 
Database Consolidation using Oracle Multitenant
Database Consolidation using Oracle MultitenantDatabase Consolidation using Oracle Multitenant
Database Consolidation using Oracle Multitenant
Pini Dibask
 
Oracle GoldenGate for Oracle DBAs
Oracle GoldenGate for Oracle DBAsOracle GoldenGate for Oracle DBAs
Oracle GoldenGate for Oracle DBAs
Guatemala User Group
 
Enabling real interactive BI on Hadoop
Enabling real interactive BI on HadoopEnabling real interactive BI on Hadoop
Enabling real interactive BI on Hadoop
DataWorks Summit
 

What's hot (20)

MySQL 5.7 New Features for Developers
MySQL 5.7 New Features for DevelopersMySQL 5.7 New Features for Developers
MySQL 5.7 New Features for Developers
 
Docker Concepts for Oracle/MySQL DBAs and DevOps
Docker Concepts for Oracle/MySQL DBAs and DevOpsDocker Concepts for Oracle/MySQL DBAs and DevOps
Docker Concepts for Oracle/MySQL DBAs and DevOps
 
Oracle Database Performance Tuning Advanced Features and Best Practices for DBAs
Oracle Database Performance Tuning Advanced Features and Best Practices for DBAsOracle Database Performance Tuning Advanced Features and Best Practices for DBAs
Oracle Database Performance Tuning Advanced Features and Best Practices for DBAs
 
SQLcl the next generation of SQLPlus?
SQLcl the next generation of SQLPlus?SQLcl the next generation of SQLPlus?
SQLcl the next generation of SQLPlus?
 
Simplify Consolidation with Oracle Database 12c
Simplify Consolidation with Oracle Database 12cSimplify Consolidation with Oracle Database 12c
Simplify Consolidation with Oracle Database 12c
 
Oracle Database In-Memory Option for ILOUG
Oracle Database In-Memory Option for ILOUGOracle Database In-Memory Option for ILOUG
Oracle Database In-Memory Option for ILOUG
 
Oracle 12c New Features For Better Performance
Oracle 12c New Features For Better PerformanceOracle 12c New Features For Better Performance
Oracle 12c New Features For Better Performance
 
Fast, Flexible Application Development with Oracle Database Cloud Service
Fast, Flexible Application Development with Oracle Database Cloud ServiceFast, Flexible Application Development with Oracle Database Cloud Service
Fast, Flexible Application Development with Oracle Database Cloud Service
 
Architecting Your Own DBaaS in a Private Cloud with EM12c
Architecting Your Own DBaaS in a Private Cloud with EM12cArchitecting Your Own DBaaS in a Private Cloud with EM12c
Architecting Your Own DBaaS in a Private Cloud with EM12c
 
Winning performance challenges in oracle multitenant
Winning performance challenges in oracle multitenantWinning performance challenges in oracle multitenant
Winning performance challenges in oracle multitenant
 
Oracle Fleet Patching and Provisioning Deep Dive Webcast Slides
Oracle Fleet Patching and Provisioning Deep Dive Webcast SlidesOracle Fleet Patching and Provisioning Deep Dive Webcast Slides
Oracle Fleet Patching and Provisioning Deep Dive Webcast Slides
 
Maa goldengate-rac-2007111
Maa goldengate-rac-2007111Maa goldengate-rac-2007111
Maa goldengate-rac-2007111
 
Oracle Goldengate training by Vipin Mishra
Oracle Goldengate training by Vipin Mishra Oracle Goldengate training by Vipin Mishra
Oracle Goldengate training by Vipin Mishra
 
Hadoop For Enterprises
Hadoop For EnterprisesHadoop For Enterprises
Hadoop For Enterprises
 
An introduction into Oracle Enterprise Manager Cloud Control 12c Release 3
An introduction into Oracle Enterprise Manager Cloud Control 12c Release 3An introduction into Oracle Enterprise Manager Cloud Control 12c Release 3
An introduction into Oracle Enterprise Manager Cloud Control 12c Release 3
 
An AMIS Overview of Oracle database 12c (12.1)
An AMIS Overview of Oracle database 12c (12.1)An AMIS Overview of Oracle database 12c (12.1)
An AMIS Overview of Oracle database 12c (12.1)
 
How many ways to monitor oracle golden gate-Collaborate 14
How many ways to monitor oracle golden gate-Collaborate 14How many ways to monitor oracle golden gate-Collaborate 14
How many ways to monitor oracle golden gate-Collaborate 14
 
Database Consolidation using Oracle Multitenant
Database Consolidation using Oracle MultitenantDatabase Consolidation using Oracle Multitenant
Database Consolidation using Oracle Multitenant
 
Oracle GoldenGate for Oracle DBAs
Oracle GoldenGate for Oracle DBAsOracle GoldenGate for Oracle DBAs
Oracle GoldenGate for Oracle DBAs
 
Enabling real interactive BI on Hadoop
Enabling real interactive BI on HadoopEnabling real interactive BI on Hadoop
Enabling real interactive BI on Hadoop
 

Similar to Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem

Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016
Zohar Elkayam
 
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
Rittman Analytics
 
002 Introduction to hadoop v3
002   Introduction to hadoop v3002   Introduction to hadoop v3
002 Introduction to hadoop v3
Dendej Sawarnkatat
 
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with HadoopCloudera, Inc.
 
Big data applications
Big data applicationsBig data applications
Big data applications
Juan Pablo Paz Grau, Ph.D., PMP
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
Prashanth Yennampelli
 
50 Shades of SQL
50 Shades of SQL50 Shades of SQL
50 Shades of SQL
DataWorks Summit
 
Big Data Strategy for the Relational World
Big Data Strategy for the Relational World Big Data Strategy for the Relational World
Big Data Strategy for the Relational World Andrew Brust
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
Adaryl "Bob" Wakefield, MBA
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
Roorkee College of Engineering, Roorkee
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
arslanhaneef
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
sonukumar379092
 
Big data and hadoop overvew
Big data and hadoop overvewBig data and hadoop overvew
Big data and hadoop overvew
Kunal Khanna
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoopMohit Tare
 
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
Venneladonthireddy1
 
Big data - Online Training
Big data - Online TrainingBig data - Online Training
Big data - Online Training
Learntek1
 
Simple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform ConceptSimple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform Concept
Satish Mohan
 
Big data analysis using hadoop cluster
Big data analysis using hadoop clusterBig data analysis using hadoop cluster
Big data analysis using hadoop cluster
Furqan Haider
 
Scaling etl with hadoop shapira 3
Scaling etl with hadoop   shapira 3Scaling etl with hadoop   shapira 3
Scaling etl with hadoop shapira 3Gwen (Chen) Shapira
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
York University
 

Similar to Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem (20)

Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016
 
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
 
002 Introduction to hadoop v3
002   Introduction to hadoop v3002   Introduction to hadoop v3
002 Introduction to hadoop v3
 
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with Hadoop
 
Big data applications
Big data applicationsBig data applications
Big data applications
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
50 Shades of SQL
50 Shades of SQL50 Shades of SQL
50 Shades of SQL
 
Big Data Strategy for the Relational World
Big Data Strategy for the Relational World Big Data Strategy for the Relational World
Big Data Strategy for the Relational World
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Big data and hadoop overvew
Big data and hadoop overvewBig data and hadoop overvew
Big data and hadoop overvew
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
 
Big data - Online Training
Big data - Online TrainingBig data - Online Training
Big data - Online Training
 
Simple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform ConceptSimple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform Concept
 
Big data analysis using hadoop cluster
Big data analysis using hadoop clusterBig data analysis using hadoop cluster
Big data analysis using hadoop cluster
 
Scaling etl with hadoop shapira 3
Scaling etl with hadoop   shapira 3Scaling etl with hadoop   shapira 3
Scaling etl with hadoop shapira 3
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 

More from Zohar Elkayam

PL/SQL New and Advanced Features for Extreme Performance
PL/SQL New and Advanced Features for Extreme PerformancePL/SQL New and Advanced Features for Extreme Performance
PL/SQL New and Advanced Features for Extreme Performance
Zohar Elkayam
 
The art of querying – newest and advanced SQL techniques
The art of querying – newest and advanced SQL techniquesThe art of querying – newest and advanced SQL techniques
The art of querying – newest and advanced SQL techniques
Zohar Elkayam
 
Oracle Advanced SQL and Analytic Functions
Oracle Advanced SQL and Analytic FunctionsOracle Advanced SQL and Analytic Functions
Oracle Advanced SQL and Analytic Functions
Zohar Elkayam
 
Things Every Oracle DBA Needs to Know About the Hadoop Ecosystem (c17lv version)
Things Every Oracle DBA Needs to Know About the Hadoop Ecosystem (c17lv version)Things Every Oracle DBA Needs to Know About the Hadoop Ecosystem (c17lv version)
Things Every Oracle DBA Needs to Know About the Hadoop Ecosystem (c17lv version)
Zohar Elkayam
 
Advanced PL/SQL Optimizing for Better Performance 2016
Advanced PL/SQL Optimizing for Better Performance 2016Advanced PL/SQL Optimizing for Better Performance 2016
Advanced PL/SQL Optimizing for Better Performance 2016
Zohar Elkayam
 
Oracle Database Advanced Querying (2016)
Oracle Database Advanced Querying (2016)Oracle Database Advanced Querying (2016)
Oracle Database Advanced Querying (2016)
Zohar Elkayam
 
OOW2016: Exploring Advanced SQL Techniques Using Analytic Functions
OOW2016: Exploring Advanced SQL Techniques Using Analytic FunctionsOOW2016: Exploring Advanced SQL Techniques Using Analytic Functions
OOW2016: Exploring Advanced SQL Techniques Using Analytic Functions
Zohar Elkayam
 
Is SQLcl the Next Generation of SQL*Plus?
Is SQLcl the Next Generation of SQL*Plus?Is SQLcl the Next Generation of SQL*Plus?
Is SQLcl the Next Generation of SQL*Plus?
Zohar Elkayam
 
Exploring Advanced SQL Techniques Using Analytic Functions
Exploring Advanced SQL Techniques Using Analytic FunctionsExploring Advanced SQL Techniques Using Analytic Functions
Exploring Advanced SQL Techniques Using Analytic Functions
Zohar Elkayam
 
Exploring Advanced SQL Techniques Using Analytic Functions
Exploring Advanced SQL Techniques Using Analytic FunctionsExploring Advanced SQL Techniques Using Analytic Functions
Exploring Advanced SQL Techniques Using Analytic Functions
Zohar Elkayam
 
Advanced PLSQL Optimizing for Better Performance
Advanced PLSQL Optimizing for Better PerformanceAdvanced PLSQL Optimizing for Better Performance
Advanced PLSQL Optimizing for Better Performance
Zohar Elkayam
 
Oracle Database Advanced Querying
Oracle Database Advanced QueryingOracle Database Advanced Querying
Oracle Database Advanced Querying
Zohar Elkayam
 
Oracle Data Guard A to Z
Oracle Data Guard A to ZOracle Data Guard A to Z
Oracle Data Guard A to Z
Zohar Elkayam
 
Oracle Data Guard Broker Webinar
Oracle Data Guard Broker WebinarOracle Data Guard Broker Webinar
Oracle Data Guard Broker Webinar
Zohar Elkayam
 

Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem

  • 1. Zohar Elkayam www.realdbamagic.com Twitter: @realmgic Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem
  • 2. Who am I? • Zohar Elkayam, CTO at Brillix • DBA, team leader, database trainer, public speaker, and a senior consultant for over 18 years • Oracle ACE Associate • Involved with Big Data projects since 2011 • Blogger – www.realdbamagic.com and www.ilDBA.co.il http://brillix.co.il
  • 3. About Brillix • Brillix is a leading company that specializes in Data Management • We provide professional services, training and consulting for Databases, Security, NoSQL, and Big Data solutions • Providing the Brillix Big Data Experience Center
  • 4. Agenda • What is the Big Data challenge? • A Big Data Solution: Hadoop • HDFS • MapReduce and YARN • Hadoop Ecosystem: HBase, Sqoop, Hive, Pig and other tools • Where does the DBA fit in?
  • 6. The Big Data Challenge
  • 7. Volume • Big data comes in one size: Big. • Size is measured in terabytes (10^12 bytes), petabytes (10^15), exabytes (10^18), and zettabytes (10^21) • Storing and handling the data becomes an issue • Producing value out of the data in a reasonable time is an issue
  • 8. Variety • Big Data extends beyond structured data, including semi-structured and unstructured information: logs, text, audio and videos • Wide variety of rapidly evolving data types requires highly flexible stores and handling • Unstructured vs. structured: objects vs. tables; flexible vs. columns and rows; structure unknown vs. predefined structure; textual and binary vs. mostly textual
  • 9. Velocity • The speed at which the data is being generated and collected • Streaming data and large volume data movement • High velocity of data capture – requires rapid ingestion • Might cause a backlog problem
  • 10. Okay, So What Defines Big Data? • When the data is too big or moves too fast to handle in a sensible amount of time • When the data doesn’t fit any conventional database structure • When the solution to the business need becomes part of the problem • When we think that we can still produce value from that data and want to handle it
  • 11. Value • Big data is not about the size of the data; it’s about the value within the data
  • 12. How to do Big Data
  • 14. Big Data in Practice • Big data is big: technological infrastructure solutions needed • Big data is complicated: • We need developers to manage handling of the data • We need devops to manage the clusters • We need data analysts and data scientists to produce value
  • 15. Infrastructure Challenges • Infrastructure that is built for: • Large-scale • Distributed / scaled out • Data-intensive jobs that spread the problem across clusters of server nodes
  • 16. Infrastructure Challenges (cont.) • Storage: Efficient and cost-effective enough to capture and store terabytes, if not petabytes, of data • Network infrastructure that can quickly import large data sets and then replicate them to various nodes for processing • Security capabilities that protect highly-distributed infrastructure and data
  • 17. A Big Data Solution: Apache Hadoop
  • 18. Apache Hadoop • Open source project run by the Apache Software Foundation (2006) • Hadoop brings the ability to cheaply process large amounts of data, regardless of its structure • It has been the driving force behind the growth of the big data industry • Get the public release from: • http://hadoop.apache.org/core/
  • 19. Original Hadoop Components • HDFS: Hadoop Distributed File System – distributed file system that runs in a clustered environment • MapReduce – programming paradigm for running processes over a clustered environment. Main idea: bring the program to the data
  • 20. Hadoop Benefits • Reliable solution based on unreliable hardware • Designed for large files • Load data first, structure later • Designed to maximize throughput of large scans • Designed to leverage parallelism • Designed for scale out • Flexible development platform • Solution Ecosystem
  • 21. What Hadoop Is Not • Hadoop is not a database – it is not a replacement for a DW or for other relational databases • Hadoop is not for OLTP/real-time systems • Very good for large amounts of data, not so much for smaller sets • Designed for clusters – there is no Hadoop monster server (single server)
  • 22. Hadoop Limitations • Hadoop is scalable but it’s not fast • Some assembly is required • Batteries are not included • DIY mindset • Open source limitations apply • Technology is changing very rapidly
  • 23. Hadoop under the Hood
  • 24. Original Hadoop 1.0 Components • HDFS: Hadoop Distributed File System – distributed file system that runs in a clustered environment • MapReduce – programming paradigm for running processes over a clustered environment
  • 25. Hadoop 2.0 • Hadoop 2.0 changed the Hadoop conception and introduced better resource management and speed: • Hadoop Common • HDFS • YARN • Multiple data processing frameworks including MapReduce, Spark and others
  • 26. HDFS is... • A distributed file system • Designed to reliably store data using commodity hardware • Designed to expect hardware failures and stay resilient • Intended for large files • Designed for batch inserts
  • 27. Files and Blocks • Files are split into blocks (single unit of storage) • Managed by the NameNode and stored on DataNodes • Transparent to users • Replicated across machines at load time • Same block is stored on multiple machines • Good for fault-tolerance and access • Default replication factor is 3
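The block and replication arithmetic on this slide can be sketched in a few lines of Python. This is only an illustration: the 128 MB block size is an assumption (the common Hadoop 2 default, not stated on the slide), and `hdfs_footprint` is a hypothetical helper name, not part of any Hadoop API.

```python
import math


def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Return (number of blocks, raw MB stored across the cluster).

    block_size_mb=128 is an assumed default; replication=3 is the
    default replication factor quoted on the slide.
    """
    blocks = math.ceil(file_size_mb / block_size_mb)
    # The last block only occupies the bytes it actually holds, so raw
    # usage is simply the file size times the replication factor.
    raw_mb = file_size_mb * replication
    return blocks, raw_mb


blocks, raw_mb = hdfs_footprint(1000)
print(blocks, raw_mb)  # 8 blocks, 3000 MB of raw storage
```

A 1,000 MB file therefore needs 8 blocks and roughly three times its size in raw disk, which is worth keeping in mind when sizing a cluster.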
  • 28. HDFS Node Types HDFS has three types of nodes: • DataNodes • Responsible for actual file storage • Serve data from files to clients • NameNode (MasterNode) • Distributes files in the cluster • Responsible for replication between the DataNodes and for file block locations • BackupNode • Backup node for the NameNode
  • 29. HDFS is Good for... • Storing large files • Terabytes, Petabytes, etc... • Millions rather than billions of files • 128MB or more per file • Streaming data • Write-once, read-many-times patterns • Optimized for streaming reads rather than random reads
  • 30. HDFS is Not So Good For... • Low-latency reads / Real-time application • High-throughput rather than low latency for small chunks of data • HBase addresses this issue • A large number of small files • Better for millions of large files instead of billions of small files • Multiple writers • Single writer per file • Writes only at the end of files, no support for arbitrary offsets
  • 31. MapReduce is... • A programming model for expressing distributed computations at a massive scale • An execution framework for organizing and performing such computations • Bring the code to the data, not the data to the code
  • 32. The MapReduce Paradigm • Imposes key-value input/output • We implement two functions: • MAP - Takes a large problem, divides it into sub-problems, and performs the same function on all sub-problems: Map(k1, v1) -> list(k2, v2) • REDUCE - Combines the output from all sub-problems (each key goes to the same reducer): Reduce(k2, list(v2)) -> list(v3) • Framework handles everything else (almost)
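The two signatures above can be illustrated with the classic word count. The sketch below is a single-process simulation in plain Python, not Hadoop code: `map_fn`, `reduce_fn` and `run_job` are hypothetical names, and the in-memory grouping step stands in for the shuffle that the framework would perform across machines.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_fn(_word, counts):
    """Reduce(k2, list(v2)) -> list(v3): sum all counts for one word."""
    return [sum(counts)]


def run_job(records):
    # Map phase: apply the same function to every input record.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Shuffle: group values by key so each key reaches a single reducer.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)

    # Reduce phase: combine the grouped values per key.
    return {k2: reduce_fn(k2, v2s) for k2, v2s in sorted(groups.items())}


lines = [(0, "big data is big"), (1, "data is value")]
print(run_job(lines))
# {'big': [2], 'data': [2], 'is': [2], 'value': [1]}
```

Only `map_fn` and `reduce_fn` correspond to what a developer writes; everything inside `run_job` is the part the slide says the framework handles.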
  • 34. MapReduce is Good For... • Embarrassingly parallel algorithms • Summing, grouping, filtering, joining • Off-line batch jobs on massive data sets • Analyzing an entire large dataset
  • 35. MapReduce is NOT Good For... • Jobs that need shared state or coordination • Tasks are shared-nothing • Shared state requires a scalable state store • Low-latency jobs • Jobs on smaller datasets • Finding individual records
  • 36. YARN • Takes care of distributed processing and coordination • Scheduling • Jobs are broken down into smaller chunks called tasks • These tasks are scheduled to run on data nodes • Task Localization with Data • Framework strives to place tasks on the nodes that host the segment of data to be processed by that specific task • Code is moved to where the data is
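The "task localization" idea can be sketched as a toy greedy placement routine. This is an assumption-laden illustration, not YARN's actual scheduler: `place_tasks`, the node names and the slot counts are all hypothetical.

```python
def place_tasks(block_locations, free_slots):
    """Assign one task per block, preferring nodes that hold a replica.

    block_locations: dict of block id -> list of nodes holding a replica
    free_slots: dict of node name -> number of tasks it can still accept
    Returns a dict of block id -> (chosen node, ran_locally).
    """
    slots = dict(free_slots)  # work on a copy, leave the input untouched
    placement = {}
    for block, replicas in block_locations.items():
        local = [node for node in replicas if slots.get(node, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            # No replica holder has capacity: fall back to any free node,
            # which means the data must travel over the network.
            candidates = [n for n, free in slots.items() if free > 0]
            if not candidates:
                raise RuntimeError("no free slots left in the cluster")
            node, is_local = candidates[0], False
        slots[node] -= 1
        placement[block] = (node, is_local)
    return placement


blocks = {"b1": ["n1", "n2"], "b2": ["n1", "n3"], "b3": ["n1"]}
print(place_tasks(blocks, {"n1": 1, "n2": 1, "n3": 1}))
# {'b1': ('n1', True), 'b2': ('n3', True), 'b3': ('n2', False)}
```

In this run two of the three tasks execute where their data lives; the third runs remotely because its only replica holder is already busy.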
  • 37. YARN • Error Handling • Failures are an expected behavior so tasks are automatically retried on other machines • Data Synchronization • Shuffle and Sort barrier re-arranges and moves data between machines • Input and output are coordinated by the framework
  • 39. Improving Hadoop • Core Hadoop is complicated so some tools and solution frameworks were added to make things easier • There are over 80 different Apache projects for big data solutions which use Hadoop • Hadoop distributions collect some of these tools and release them as a complete package • Cloudera • HortonWorks • MapR • Amazon EMR
  • 40. Common Hadoop 2.0 Technology Ecosystem
  • 41. Databases and DB Connectivity • HBase: NoSQL key/value, wide-column oriented datastore that is native to HDFS • Sqoop: a tool designed to move data between relational databases and Hadoop (HDFS, HBase, or Hive), importing from and exporting to the RDBMS
  • 42. HBase • HBase is the closest thing we had to a database in the early Hadoop days • Distributed key/value, wide-column oriented database built on top of HDFS - providing Bigtable-like capabilities • Does not have a query language: only get, put, and scan commands • Usually compared with Cassandra (non-Hadoop native Apache project)
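To get a feel for what "only get, put, and scan" means in practice, here is a minimal in-memory sketch of a sorted, row-keyed store. It is not the HBase client API: `ToyKeyValueStore`, the row keys and the `family:qualifier` column names are illustrative assumptions.

```python
import bisect


class ToyKeyValueStore:
    """In-memory stand-in for a row-keyed, wide-column store."""

    def __init__(self):
        self._keys = []  # row keys kept in sorted order
        self._rows = {}  # row key -> {column name: value}

    def put(self, row_key, column, value):
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        # Return a copy; a missing row yields an empty dict.
        return dict(self._rows.get(row_key, {}))

    def scan(self, start_key, stop_key):
        """Yield (row_key, columns) for start_key <= row_key < stop_key."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        for key in self._keys[lo:hi]:
            yield key, dict(self._rows[key])


store = ToyKeyValueStore()
store.put("user#001", "info:name", "Dana")
store.put("user#002", "info:name", "Avi")
store.put("user#002", "info:city", "Haifa")
store.put("order#17", "info:total", "99")

print(store.get("user#002"))
print([key for key, _ in store.scan("user#", "user#~")])
```

Every access path goes through the row key, either exactly (get) or as a sorted range (scan), which is why row key design matters so much more here than in a relational schema.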
  • 43. When do we use HBase? • Huge volumes of randomly accessed data • HBase is at its best when it’s accessed in a distributed fashion by many clients (high consistency) • Consider HBase when we are loading data by key, searching data by key (or range), serving data by key, querying data by key or when storing data by row that doesn’t conform well to a schema.
  • 44. When NOT to use HBase • HBase doesn’t use SQL, doesn’t have an optimizer, and doesn’t support transactions or joins • HBase doesn’t have data types • See project Apache Phoenix for better data structure and query language when using HBase
  • 45. Sqoop2 • Sqoop is a command line tool for moving data from RDBMS to Hadoop • Uses a MapReduce program or Hive to load the data • Can also export data from HBase to RDBMS • Comes with connectors to MySQL, PostgreSQL, Oracle, SQL Server and DB2. • Example:
      $bin/sqoop import --connect 'jdbc:sqlserver://10.80.181.127;username=dbuser;password=dbpasswd;database=tpch' --table lineitem --hive-import
      $bin/sqoop export --connect 'jdbc:sqlserver://10.80.181.127;username=dbuser;password=dbpasswd;database=tpch' --table lineitem --export-dir /data/lineitemData
  • 46. Improving MapReduce Programmability • Pig: Programming language that simplifies Hadoop actions: loading, transforming and sorting of data • Hive: enables Hadoop to operate as data warehouse using SQL-like syntax
  • 47. Pig • Pig is an abstraction on top of Hadoop • Provides high level programming language designed for data processing • Scripts converted into MapReduce code, and executed on the Hadoop Clusters • Makes ETL processing and other simple MapReduce easier without writing MapReduce code • Often replaced by more up-to-date tools like Apache Spark
  • 48. Hive • Data warehousing solution built on top of Hadoop • Provides SQL-like query language named HiveQL • Minimal learning curve for people with SQL expertise • Data analysts are target audience • Early Hive development work started at Facebook in 2007
  • 49. Hive Provides • Ability to bring structure to various data formats • Simple interface for ad hoc querying, analyzing and summarizing large amounts of data • Access to files on various data stores such as HDFS and HBase • Also see: Apache Impala (mainly in Cloudera)
  • 50. Improving Hadoop – More useful tools • For improving coordination: ZooKeeper • For improving log collection: Flume • For improving scheduling/orchestration: Oozie • For improving UI: Hue/Ambari
  • 51. ZooKeeper • ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services • It allows distributed processes to coordinate with each other through a shared hierarchical namespace which is organized similarly to a standard file system • ZooKeeper stamps each update with a number that reflects the order of all ZooKeeper transactions
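The two ideas on this slide, a file-system-like namespace and a number stamped on every update, can be sketched with a small in-memory class. This is a toy model under stated assumptions, not the ZooKeeper client API: `ToyNamespace` and its method names are hypothetical, and it runs in one process with no replication.

```python
import itertools


class ToyNamespace:
    """Hierarchical namespace where every update gets an increasing id."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self._nodes = {"/": (b"", 0)}  # path -> (data, id of last update)

    @staticmethod
    def _parent(path):
        head = path.rsplit("/", 1)[0]
        return head or "/"

    def create(self, path, data=b""):
        if path in self._nodes:
            raise KeyError(f"{path} already exists")
        if self._parent(path) not in self._nodes:
            raise KeyError(f"parent of {path} does not exist")
        txid = next(self._next_id)
        self._nodes[path] = (data, txid)
        return txid

    def set(self, path, data):
        if path not in self._nodes:
            raise KeyError(f"{path} does not exist")
        txid = next(self._next_id)
        self._nodes[path] = (data, txid)
        return txid

    def get(self, path):
        return self._nodes[path]

    def children(self, path):
        prefix = path.rstrip("/") + "/"
        return sorted(
            p for p in self._nodes
            if p != path and p.startswith(prefix) and "/" not in p[len(prefix):]
        )


ns = ToyNamespace()
ns.create("/app")
ns.create("/app/config", b"v1")
ns.create("/app/workers")
ns.set("/app/config", b"v2")

print(ns.get("/app/config"))  # (b'v2', 4): the fourth update overall
print(ns.children("/app"))    # ['/app/config', '/app/workers']
```

Because the ids are handed out from a single counter, comparing two stamps tells a client which update happened first, which is the ordering guarantee the slide refers to.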
  • 52. Flume • Flume is a distributed system for collecting log data from many sources, aggregating it, and writing it to HDFS • Flume does for files what Sqoop does for RDBMS • Flume maintains a central list of ongoing data flows, stored redundantly in ZooKeeper.
  • 54. Is Hadoop the Only Big Data Solution? • No – There are other solutions: • Apache Spark and Apache Mesos frameworks • NoSQL systems (Apache Cassandra, Couchbase, MongoDB and many others) • Stream analysis (Apache Kafka, Apache Flink) • Machine learning (Apache Mahout, Spark MLlib) • Some can be integrated with Hadoop, but some are independent
  • 55. Where Does the DBA Fit In? • Big Data solutions are not databases. Databases are probably not going to disappear, but we feel the change even today: DBAs must be ready for the change • DBAs are the perfect candidates to transition into Big Data experts: • Have system (OS, disk, memory, hardware) experience • Can understand data easily • DBAs are used to working with developers and other data users
  • 56. What Do DBAs Need Now? • DBAs will need to know more programming: Java, Scala, Python, R or any other popular language in the Big Data world will do • DBAs need to understand the position shifts, and the introduction of DevOps, Data Scientists, CDO etc. • Big Data is changing daily: we need to learn, read, and be involved before we are left behind…
  • 58. Summary • Big Data is here – it’s complicated and RDBMS does not fit anymore • Big Data solutions are evolving; Hadoop is an example of such a solution • Spark is a very popular Big Data solution • DBAs need to be ready for the change: Big Data solutions are not databases and we must make ourselves ready
  • 59. Thank You Zohar Elkayam twitter: @realmgic Zohar@Brillix.co.il www.realdbamagic.com

Editor's Notes

  1. YARN = Yet Another Resource Negotiator
  2. Source: Hadoop documentation
  3. One very common use of Hadoop is taking web server or other logs from a large number of machines, and periodically processing them to pull out analytics information. The Flume project is designed to make the data gathering process easy and scalable, by running agents on the source machines that pass the data updates to collectors, which then aggregate them into large chunks that can be efficiently written as HDFS files. It’s usually set up using a command-line tool that supports common operations, like tailing a file or listening on a network socket, and has tunable reliability guarantees that let you trade off performance and the potential for data loss.
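The collector behaviour described in note 3, aggregating many small updates into large chunks before writing, can be sketched as follows. This is a toy model and not Flume's API or configuration: `ToyCollector`, the `flush_bytes` threshold and the sample log lines are assumptions, and the `chunks` list merely stands in for files written to HDFS.

```python
class ToyCollector:
    """Buffers small log events and emits them as larger chunks."""

    def __init__(self, flush_bytes):
        self.flush_bytes = flush_bytes
        self._buffer = []
        self._buffered = 0
        self.chunks = []  # stands in for files written to HDFS

    def receive(self, event):
        # Called once per event forwarded by an agent on a source machine.
        self._buffer.append(event)
        self._buffered += len(event)
        if self._buffered >= self.flush_bytes:
            self.flush()

    def flush(self):
        # Write everything buffered so far as one chunk, if there is any.
        if self._buffer:
            self.chunks.append(b"".join(self._buffer))
            self._buffer = []
            self._buffered = 0


collector = ToyCollector(flush_bytes=20)
for event in [b"web1 GET /\n", b"web2 GET /a\n", b"web1 POST /b\n"]:
    collector.receive(event)
collector.flush()  # final flush, e.g. on shutdown

print(len(collector.chunks))  # 2
```

A larger `flush_bytes` produces fewer, bigger files (which HDFS prefers) at the cost of holding more unwritten data in memory, which is the performance versus data-loss trade-off the note mentions.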