What is Big Data?
Big Data Technologies
What is Hadoop?
Big Data Components
HortonWorks Data Platform
What is Big Data?
Ernst and Young offers the following definition:
Big Data refers to the dynamic, large and disparate volumes of data being created by people, tools and machines. It
requires new, innovative, and scalable technology to collect, host and analytically process the vast amount of data
gathered in order to derive real-time business insights that relate to consumers, risk, profit, performance, productivity
management and enhanced shareholder value.
The research firm Gartner defines Big Data as follows:
Big Data is high-volume, high-velocity, and/or high-variety information assets that demand cost-effective,
innovative forms of information processing that enable enhanced insight, decision making, and process automation.
The 5 V's of Big Data
Variety is the diversity of the data. We have structured
data that fits neatly into the rows and columns of
relational databases, and unstructured data that is not
organized in a pre-defined way, for example tweets,
blog posts, pictures, numbers, and even video data.
Velocity is the idea that data is being generated
extremely fast, a process that never stops. Attributes
include near or real-time streaming and local and
cloud-based technologies that can process information very quickly.
Veracity is the conformity to facts and accuracy:
is the information real, or is it false?
Volume is the scale of the data, or the increase in
the amount of data stored.
Value isn't just profit. It may be medical or social benefits, or
customer, employee, or personal satisfaction, or crime prevention. The
main reason people invest time in understanding Big Data is to
derive value from it.
What is Apache Hadoop?
• Hadoop is an open-source software
framework used to store and process huge
amounts of data.
• Maintained by the Apache Software Foundation
• Transforms clusters of commodity hardware into a
reliable, scalable storage and compute platform
• Stores petabytes of data reliably (HDFS)
• Allows huge distributed computations
• Key attributes:
• Redundant and reliable
• Doesn’t stop or lose data even if hardware fails
• Easy to program
• Extremely powerful
• Allows the development of big data
algorithms & tools
• Batch processing centric
• Runs on commodity hardware
• Computers & network
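The "huge distributed computations" bullet above refers to the MapReduce batch-processing model that Hadoop popularized. A minimal sketch of that model, a word count, is shown below in plain Python; the `map_phase`, shuffle, and `reduce_phase` helpers are illustrative stand-ins, not Hadoop APIs, and everything runs in one process rather than across a cluster.

```python
from collections import defaultdict

def map_phase(line):
    # Mapper: emit a (word, 1) pair for every word in one input line.
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(word, counts):
    # Reducer: collapse all counts for a single key into one total.
    return word, sum(counts)

def word_count(lines):
    # Shuffle: group mapper output by key, as the framework would
    # before handing each key's values to a reducer.
    grouped = defaultdict(list)
    for line in lines:
        for word, count in map_phase(line):
            grouped[word].append(count)
    return dict(reduce_phase(w, c) for w, c in grouped.items())

print(word_count(["big data", "big hadoop"]))
# → {'big': 2, 'data': 1, 'hadoop': 1}
```

In real Hadoop, the map and reduce functions are the only parts the programmer writes; the framework handles partitioning the input, shuffling intermediate pairs across machines, and re-running tasks when hardware fails.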
Log Analytics Systems Today
• Not all data can be captured
• Not all captured data is valuable
• Transporting all data is costly
1. Integrate and enrich logs across
data centers and security zones
2. Content-based routing based on dynamic
evaluation of content, attributes, priority
3. Cost-effectively expand collection and grow
the timescale of logs collected
Expand Storage Options of Log Data