xGem BigData

Agenda
 What is Big Data?
 Big Data Technologies
 What is Hadoop?
 Big Data Components
 Hadoop Distributions
 Hortonworks Data Platform
 Log Analytics

What is Big Data?
Ernst and Young offers the following definition:
Big Data refers to the dynamic, large and disparate volumes of data being created by people, tools and machines. It
requires new, innovative, and scalable technology to collect, host and analytically process the vast amount of data
gathered in order to derive real-time business insights that relate to consumers, risk, profit, performance, productivity
management and enhanced shareholder value.
The research firm Gartner defines Big Data as follows:
Big Data is high-volume, high-velocity, and/or high-variety information assets that demand cost-effective,
innovative forms of information processing that enable enhanced insight, decision making and process
automation.

The 5 V's of Big Data
1. Velocity
Velocity is the idea that data is being generated extremely fast, a process that never stops. Attributes include near or real-time streaming and local and cloud-based technologies that can process information very quickly.
2. Volume
Volume is the scale of the data, or the increase in the amount of data stored.
3. Variety
Variety is the diversity of the data. Structured data fits neatly into rows and columns, as in relational databases. Unstructured data is not organized in a pre-defined way, for example tweets, blog posts, pictures, numbers, and even video data.
4. Veracity
Veracity is the conformity to facts and accuracy. Is the information real, or is it false?
5. Value

Big Data: Value
Value isn't just profit. It may be medical or social benefits; customer, employee, or personal satisfaction; or crime prevention. The main reason people invest time to understand Big Data is to derive value from it.

What is Apache Hadoop?
• Hadoop is an open-source software framework used to store and process huge amounts of data.
• Maintained by the Apache Software Foundation
• Transforms commodity hardware into a service that:
  • Stores petabytes of data reliably (HDFS)
  • Allows huge distributed computations (MapReduce)
• Key attributes:
  • Redundant and reliable
    • Doesn't stop or lose data even if hardware fails
  • Easy to program
  • Extremely powerful
    • Allows the development of big data algorithms & tools
  • Batch processing centric
  • Runs on commodity hardware
    • Computers & network
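
The MapReduce model mentioned above can be illustrated with a minimal single-process sketch in Python. This is not Hadoop's API; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical and chosen only to show the three phases that Hadoop would distribute across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "big clusters store big data"]))
```

In a real cluster the map and reduce calls would run in parallel on different machines, close to the HDFS blocks they read; the sketch only shows the data flow.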

Who uses Hadoop?
[Figure: timeline, 2006 to 2010. Source: The Datagraph Blog]

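How HDFS Works
[Figure: HDFS architecture, split into a namespace layer and a block storage layer]
Namenode (Namespace)
• Persistent namespace metadata & journal
• Namespace state: hierarchical namespace, file name → block IDs
• Block map: block ID → block locations
Datanodes (Block Storage)
• Block ID → data
• Send heartbeats & block reports to the Namenode
• Horizontally scale IO and storage
• Block replicas shown on four JBOD datanodes: (b1, b5, b3), (b2, b3, b1), (b3, b5, b2), (b1, b5, b2)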
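
The two lookups in the figure (file name to block IDs, block ID to block locations) can be sketched with plain Python dictionaries. This is an illustrative model only, not Hadoop code: the datanode names dn1 to dn4, the file paths, and the helper functions are assumptions; the block placement is taken from the figure.

```python
# Namespace state: file name -> block IDs (file paths are hypothetical).
namespace = {
    "/logs/day1.log": ["b1", "b2"],
    "/logs/day2.log": ["b3", "b5"],
}

# Block reports: what each datanode says it stores (placement from the figure).
block_reports = {
    "dn1": {"b1", "b5", "b3"},
    "dn2": {"b2", "b3", "b1"},
    "dn3": {"b3", "b5", "b2"},
    "dn4": {"b1", "b5", "b2"},
}


def build_block_map(reports):
    """Invert the block reports into block ID -> set of datanodes."""
    block_map = {}
    for node, blocks in reports.items():
        for block_id in blocks:
            block_map.setdefault(block_id, set()).add(node)
    return block_map


def locate(path, namespace, block_map):
    """Resolve a file to its blocks and the datanodes holding each block."""
    return [(b, sorted(block_map.get(b, ()))) for b in namespace[path]]


if __name__ == "__main__":
    block_map = build_block_map(block_reports)
    print(locate("/logs/day1.log", namespace, block_map))
```

The point of the split is that the Namenode only holds metadata, while the data itself stays on the datanodes, so storage and IO grow by adding datanodes.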

HDFS Data Reliability
[Figure: Namenode (namespace state, block map) and four JBOD datanodes recovering a bad/lost block replica]
• Datanodes periodically check block checksums
• When a bad/lost block replica is found:
  1. replicate
  2. copy
  3. blockReceived
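
The recovery sequence in the figure (checksum check, then replicate, copy, blockReceived) can be sketched as follows. This is a simplified model under stated assumptions, not HDFS code: the Datanode class, the re_replicate and demo functions, and the node names are hypothetical, zlib.crc32 stands in for whatever checksum the cluster uses, and the sketch assumes at least one healthy replica remains.

```python
import zlib


class Datanode:
    """Hypothetical datanode that stores each block with its checksum."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block ID -> (data, checksum recorded at write time)

    def store(self, block_id, data):
        self.blocks[block_id] = (data, zlib.crc32(data))

    def scan(self):
        """Periodic check: return IDs of blocks whose checksum no longer matches."""
        return [b for b, (data, crc) in self.blocks.items() if zlib.crc32(data) != crc]


def re_replicate(block_id, bad_node, nodes, block_map):
    """Drop a bad replica and restore the replica count on another node."""
    del bad_node.blocks[block_id]
    block_map[block_id].discard(bad_node.name)
    # 1. replicate: pick a node that still holds the block and a target that does not.
    source = next(n for n in nodes if n.name in block_map[block_id])
    target = next(n for n in nodes
                  if n.name not in block_map[block_id] and n is not bad_node)
    # 2. copy: the block data moves from source to target.
    data, _ = source.blocks[block_id]
    target.store(block_id, data)
    # 3. blockReceived: the block map learns about the new replica.
    block_map[block_id].add(target.name)
    return target.name


def demo():
    nodes = [Datanode(f"dn{i}") for i in range(1, 5)]
    block_map = {"b4": {"dn1", "dn2", "dn3"}}
    for node in nodes[:3]:
        node.store("b4", b"log data")
    # Simulate on-disk corruption on dn2: data changes, stored checksum does not.
    nodes[1].blocks["b4"] = (b"log dat?", nodes[1].blocks["b4"][1])
    bad_blocks = nodes[1].scan()
    target = re_replicate("b4", nodes[1], nodes, block_map)
    return bad_blocks, target, block_map


if __name__ == "__main__":
    print(demo())
```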

Log Analytics Systems Today
[Figure: network device logs feeding a log analytics platform]
• Not all data can be captured
• Not all captured data is valuable
• Transport all data

Expand Storage Options of Log Data
[Figure: network device logs, HDF, HDP, and the log analytics platform]
1. Integrate and enrich logs across data centers and security zones
2. Content-based routing based on dynamic evaluation of content, attributes, priority
3. Cost effectively expand collection and grow timescale of logs collected
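
Enrichment and content-based routing can be sketched in a few lines of Python. This is not the HDF API; the functions enrich and route, the destination names, and the routing rule (archive everything, forward only high-priority or "denied" records to the analytics platform) are assumptions made for illustration.

```python
def enrich(record, data_center, security_zone):
    """Attach origin attributes to a log record (a plain dict)."""
    return {**record, "data_center": data_center, "security_zone": security_zone}


def route(record):
    """Choose destinations by evaluating the record's content and priority."""
    # Assumption: every record goes to low-cost long-term storage.
    destinations = ["hdp_archive"]
    message = record.get("message", "").lower()
    # Assumption: only records judged valuable reach the analytics platform.
    if record.get("priority") == "high" or "denied" in message:
        destinations.append("log_analytics_platform")
    return destinations


if __name__ == "__main__":
    raw = {"priority": "low", "message": "Access DENIED for user guest"}
    record = enrich(raw, data_center="dc1", security_zone="dmz")
    print(record, route(record))
```

Routing on content rather than on source means the expensive analytics platform receives only the records worth analyzing, while the full history remains available in cheaper storage.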
