XGem
Big Data
Agenda
• What is Big Data?
• Big Data Technologies
• What is Hadoop?
• Big Data Components
• Hadoop Distributions
• Hortonworks Data Platform
• Log Analytics
What is Big Data?
Ernst and Young offers the following definition:
Big Data refers to the dynamic, large and disparate volumes of data being created by people, tools and machines. It
requires new, innovative, and scalable technology to collect, host and analytically process the vast amount of data
gathered in order to derive real-time business insights that relate to consumers, risk, profit, performance, productivity
management and enhanced shareholder value.
The research firm Gartner defines Big Data as follows:
Big Data is high-volume, high-velocity, and/or high-variety information assets that demand cost-effective,
innovative forms of information processing that enable enhanced insight, decision making and process
automation.
The 5 V's of Big Data
1. Velocity is the idea that data is being generated extremely fast, a process that never stops. Attributes include near or real-time streaming and local and cloud-based technologies that can process information very quickly.
2. Volume is the scale of the data, or the increase in the amount of data stored.
3. Variety is the diversity of the data. We have structured data that fits neatly into rows and columns or relational databases, and unstructured data that is not organized in a pre-defined way, for example tweets, blog posts, pictures, numbers, and even video.
4. Veracity is conformity to facts and accuracy. Is the information real, or is it false?
5. Value isn't just profit. It may be medical or social benefits, or customer, employee, or personal satisfaction, or crime prevention. The main reason people invest time in understanding Big Data is to derive value from it.
Big Data Technologies
What is Apache Hadoop?
• Hadoop is an open-source software framework used to store and process huge amounts of data.
• Maintained by the Apache Software Foundation.
• Transforms commodity hardware into a service that:
  • Stores petabytes of data reliably (HDFS)
  • Allows huge distributed computations (MapReduce; see the word-count sketch after this list)
• Key attributes:
  • Redundant and reliable: doesn't stop or lose data even if hardware fails
  • Easy to program
  • Extremely powerful: allows the development of big data algorithms and tools
  • Batch-processing centric
  • Runs on commodity hardware (computers and network)
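To make the MapReduce bullet concrete, here is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API. The input and output paths come from the command line and are placeholders, not part of the original slides.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Mapper: emits (word, 1) for every token in its input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```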
Who built Hadoop?
Who uses Hadoop?
[Timeline of Hadoop adoption, 2006–2010; source: The Datagraph Blog]
How does HDFS work?
• The Namenode holds the persistent namespace (metadata and journal) plus the in-memory namespace state and block map: file name → block IDs, and block ID → block locations.
• Datanodes store the actual blocks (block ID → data) on local JBOD disks and report to the Namenode via heartbeats and block reports.
• The hierarchical namespace lives on the Namenode, while block storage scales horizontally: adding Datanodes scales both I/O and storage.
[Diagram: a Namenode above a row of Datanodes, each holding a subset of blocks (b1, b2, b3, b5) on JBOD storage.]
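A minimal sketch of how a client talks to HDFS through the Java FileSystem API. The Namenode URI (hdfs://namenode:8020) and the file path are assumptions for illustration; note that the client exchanges only metadata with the Namenode, while block data streams directly to and from Datanodes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed Namenode address; replace with your cluster's.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path. Writing asks the Namenode where to place
        // blocks, then streams the bytes to the chosen Datanodes.
        Path file = new Path("/logs/sample.txt");
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("hello hdfs");
        }

        // Reading asks the Namenode for block locations, then pulls the
        // bytes directly from a Datanode holding each block.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
    }
}
```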
HDFS Data Reliability
• Each block is replicated across multiple Datanodes, and Datanodes periodically check block checksums.
• When a block replica is bad or lost, the Namenode directs recovery: (1) it asks a Datanode holding a good replica to replicate the block, (2) that Datanode copies the block to another Datanode, and (3) the receiver confirms with a blockReceived report, updating the block map.
[Diagram: Namenode (namespace state and block map) coordinating re-replication of blocks b1–b5 across four JBOD Datanodes.]
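To see replication from the client side, here is a hedged sketch using the FileSystem.setReplication and FileStatus.getReplication calls from Hadoop's Java API; the file path is hypothetical. The Namenode schedules the extra copies asynchronously, and Datanodes confirm each new replica with a blockReceived report, as in the figure above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        // Connects to the cluster named by fs.defaultFS in core-site.xml.
        FileSystem fs = FileSystem.get(new Configuration());

        // Hypothetical path, for illustration only.
        Path file = new Path("/logs/sample.txt");

        // Ask for four replicas of each of this file's blocks.
        fs.setReplication(file, (short) 4);

        // Read back the replication factor recorded in the namespace.
        short factor = fs.getFileStatus(file).getReplication();
        System.out.println("requested replication factor: " + factor);
    }
}
```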
What is the Hadoop framework?
Hadoop Framework Components
Hadoop Distributions
Agenda
Hortonworks Solutions
Log Analytics Systems Today
Network device logs flow into a log analytics platform, but:
• Not all data can be captured
• Not all captured data is valuable
• Yet all data must be transported
Log Analytics with HDF and HDP
Network device logs flow through HDF (Hortonworks DataFlow) into the log analytics platform backed by HDP (Hortonworks Data Platform):
1. Integrate and enrich logs across data centers and security zones
2. Content-based routing based on dynamic evaluation of content, attributes, and priority (see the routing sketch below)
3. Cost-effectively expand collection and grow the timescale of logs collected
Expand storage options for log data.
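In HDF, content-based routing is configured with NiFi processors such as RouteOnContent or RouteOnAttribute rather than hand-written code; the following Java sketch is only a hypothetical illustration of the idea behind step 2: inspect each log line and route it to a destination based on its content and priority.

```java
import java.util.List;

public class LogRouterSketch {
    // Route each raw log line to a destination queue based on its content.
    // The queue names and severity keywords are illustrative assumptions.
    static String route(String logLine) {
        if (logLine.contains("ERROR") || logLine.contains("CRITICAL")) {
            return "alerts";        // high priority: security/ops review
        } else if (logLine.contains("WARN")) {
            return "warm-storage";  // medium priority: queryable store
        }
        return "cold-storage";      // everything else: cheap long-term archive
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "2017-03-01 10:00:01 ERROR auth failure from 10.0.0.5",
            "2017-03-01 10:00:02 INFO heartbeat ok");
        for (String line : lines) {
            System.out.println(route(line) + " <- " + line);
        }
    }
}
```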
Thanks!