BIG DATA
Presented By,
R.S.M.N.PRASAD.
(pvpsit)
OUTLINE
• Introduction
• Hadoop
• MapReduce
• Hypertable
• Advantages
BIG DATA
• The data comes from everywhere: sensors used to
gather climate information, posts to social media sites,
digital pictures and videos, purchase transaction records,
and cell phone GPS signals to name a few. This data
is called Big Data.
• Every day, we create 2.5 quintillion bytes of data (one
quintillion bytes = one billion gigabytes). In fact, 90% of
the data in the world today has been created in the last
two years alone.
IN FACT, IN A MINUTE…
• Email users send more than 204 million messages;
• Mobile Web receives 217 new users;
• Google receives over 2 million search queries;
• YouTube users upload 48 hours of new video;
• Facebook users share 684,000 bits of content;
• Twitter users send more than 100,000 tweets;
• Consumers spend $272,000 on Web shopping;
• Apple receives around 47,000 application downloads;
• Brands receive more than 34,000 Facebook 'likes';
• Tumblr blog owners publish 27,000 new posts;
• Instagram users share 3,600 new photos;
• Flickr users add 3,125 new photos;
• Foursquare users perform 2,000 check-ins;
• WordPress users publish close to 350 new blog posts.
Big Data Vectors
• High-volume:
Amount of data
• High-velocity:
The speed at which data is collected, acquired, generated,
or processed
• High-variety:
Different data type such as audio, video, image data
Big Data = Transactions + Interactions + Observations
What is Hadoop?
• HADOOP
High-Availability Distributed Object-Oriented Platform, or
“Hadoop”, is a software framework that analyzes structured
and unstructured data and distributes applications across
different servers.
• Basic Application of Hadoop
Hadoop is used for maintaining, scaling, error handling,
self-healing, and securing data at large scale. This data can
be structured or unstructured. The point is that when data
grows this large, traditional systems are unable to handle it.
HADOOP
DIFFERENT COMPONENTS ARE..........
Data Access Components :- PIG & HIVE
Data Storage Components :- HBASE
Data Integration Components :- APACHE FLUME, SQOOP, CHUKWA
Data Management Components :- AMBARI, ZOOKEEPER
Data Serialization Components :- THRIFT & AVRO
Data Intelligence Components :- APACHE MAHOUT, DRILL
What does it do?
• Hadoop implements Google’s MapReduce, using
HDFS
• MapReduce divides applications into many small
blocks of work.
• HDFS creates multiple replicas of data blocks for
reliability, placing them on compute nodes
around the cluster.
• MapReduce can then process the data where it
is located.
• Hadoop’s target is to run on clusters on the order
of 10,000 nodes.
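The block-and-replica scheme above can be sketched in a few lines of Python. This is a toy illustration only: the `place_blocks` function, the node names, and the round-robin placement policy are my own assumptions — real HDFS placement is rack-aware and considerably more involved.

```python
import itertools

def place_blocks(file_size, block_size, nodes, replication=3):
    """Toy HDFS-style placement: split a file into fixed-size blocks
    and assign each block to `replication` distinct nodes, rotating the
    starting node so replicas spread around the cluster.
    (Illustrative only; assumes replication <= len(nodes).)"""
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    ring = itertools.cycle(range(len(nodes)))
    for block_id in range(num_blocks):
        start = next(ring)
        placement[block_id] = [nodes[(start + r) % len(nodes)]
                               for r in range(replication)]
    return placement

# A 300-unit file with 128-unit blocks needs 3 blocks, each on 3 of 4 nodes.
print(place_blocks(file_size=300, block_size=128,
                   nodes=["n1", "n2", "n3", "n4"]))
```

With replicas spread like this, MapReduce can schedule each block's work on a node that already holds a copy, which is the "process the data where it is located" idea from the bullets above.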
How does MapReduce work?
• The run time partitions the input and provides it
to different Map instances;
• Map(key, value) → (key’, value’)
• The run time collects the (key’, value’) pairs and
distributes them to several Reduce functions so
that each Reduce function gets the pairs with the
same key’.
• Each Reduce produces a single output file (or none).
• Map and Reduce are user-written functions.
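The partition → map → shuffle-by-key → reduce flow above can be sketched as a toy single-process word count. The runtime loop, function names, and sample input here are illustrative assumptions, not Hadoop's actual API:

```python
from collections import defaultdict

# User-written Map: for each input line, emit (word, 1) pairs.
def map_fn(key, value):
    for word in value.split():
        yield word.lower(), 1

# User-written Reduce: receives one key' and all its values, sums them.
def reduce_fn(key, values):
    return key, sum(values)

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Toy runtime: feed each input record to Map, group the emitted
    (key', value') pairs by key' (the shuffle), then call Reduce once
    per distinct key'."""
    groups = defaultdict(list)
    for key, value in inputs:                  # partitioned input
        for k2, v2 in map_fn(key, value):      # Map(key, value) -> (key', value')
            groups[k2].append(v2)              # shuffle: collect by key'
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

lines = [(0, "big data big ideas"), (1, "big clusters")]
print(run_mapreduce(lines, map_fn, reduce_fn))
# -> {'big': 3, 'data': 1, 'ideas': 1, 'clusters': 1}
```

In real Hadoop the same two user functions run distributed across the cluster; only the runtime around them changes.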
HYPERTABLE
What is it?
• Open-source Bigtable clone
• Manages massive sparse tables with timestamped cell
versions
• Single primary key index
What is it not?
• No joins
• No secondary indexes (not yet)
• No transactions (not yet)
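The data model described above — sparse (row, column) cells holding timestamped versions, indexed only by the row key — can be sketched with a made-up `SparseTable` class (an assumption for illustration, not Hypertable's real API):

```python
import time

class SparseTable:
    """Toy Bigtable/Hypertable-style table: cells addressed by
    (row, column), each keeping multiple timestamped versions.
    The only index is the row key: no joins, no secondary indexes."""
    def __init__(self):
        self.rows = {}  # row key -> {column -> [(timestamp, value), ...]}

    def put(self, row, column, value, ts=None):
        ts = ts if ts is not None else time.time()
        self.rows.setdefault(row, {}).setdefault(column, []).append((ts, value))

    def get(self, row, column):
        """Return the newest version of a cell, or None if absent."""
        versions = self.rows.get(row, {}).get(column, [])
        return max(versions)[1] if versions else None

t = SparseTable()
t.put("com.example/", "title", "Old title", ts=1)
t.put("com.example/", "title", "New title", ts=2)
print(t.get("com.example/", "title"))  # -> New title
```

Note how sparsity falls out for free: a cell that was never written costs nothing, which is what makes "massive sparse tables" practical.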
SCALING
TABLE: VISUAL REPRESENTATION
TABLE: ACTUAL REPRESENTATION
SYSTEM OVERVIEW
RANGE SERVER
• Manages ranges of table data
• Caches updates in memory (Cell Cache)
• Periodically spills (compacts) cached updates to disk (CellStore)
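The cache-then-compact behavior above can be sketched as follows. This is a toy model: the class name, the size threshold, and spilling to an in-process list stand in for Hypertable's real CellCache/CellStore machinery:

```python
class RangeServer:
    """Toy range server: updates land in a mutable in-memory cell
    cache; when it grows past a threshold it is compacted (spilled)
    into an immutable CellStore-like snapshot."""
    def __init__(self, cache_limit=3):
        self.cell_cache = {}    # in-memory, mutable
        self.cell_stores = []   # immutable spilled snapshots, oldest first
        self.cache_limit = cache_limit

    def update(self, key, value):
        self.cell_cache[key] = value
        if len(self.cell_cache) >= self.cache_limit:
            self.compact()

    def compact(self):
        """Spill the cache to an immutable store and start fresh."""
        self.cell_stores.append(dict(self.cell_cache))
        self.cell_cache.clear()

    def read(self, key):
        if key in self.cell_cache:               # newest data first
            return self.cell_cache[key]
        for store in reversed(self.cell_stores):  # then newest spill
            if key in store:
                return store[key]
        return None

rs = RangeServer(cache_limit=2)
rs.update("a", 1)
rs.update("b", 2)   # hits the limit and triggers a compaction
print(rs.cell_stores, rs.read("a"))
```

Reads check the cache before the spilled stores, newest first, so a compaction never changes what a reader sees.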
PERFORMANCE OPTIMIZATIONS
Block Cache
• Caches CellStore blocks
• Blocks are cached uncompressed
Bloom Filter
• Avoids unnecessary disk access
• Filter by rows or rows + columns
• Configurable false positive rate
Access Groups
• Physically store co-accessed columns together
• Improves performance by minimizing I/O
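The Bloom filter mentioned above can be sketched in Python: a membership test that never gives a false negative, so a "no" answer safely skips the disk read, while the bit-array size and hash count tune the false-positive rate. The class and parameters here are illustrative, not Hypertable's implementation:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: set num_hashes bits per added item; an item
    'might be present' only if all its bits are set. No false
    negatives; false-positive rate shrinks as num_bits grows."""
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        # Derive num_hashes independent positions by salting the hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("row-17")
print(bf.might_contain("row-17"))  # True: an added row is never missed
print(bf.might_contain("row-99"))  # almost certainly False -> skip the read
```

This is why the filter can be kept per CellStore: a few bits per row answer "could this store contain the row?" without touching disk.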
ADVANTAGES
• Flexible: Easy access to structured and unstructured
data.
• Scalable: It can store and distribute very large data sets
across hundreds of inexpensive servers that operate in parallel.
• Efficient: By distributing the data, it can process it in
parallel on the nodes where the data is located.
• Resistant to Failure: It automatically maintains
multiple copies of data and automatically redeploys
computing tasks based on failures.
QUERIES?