Making sense of streaming Big Data
Flume – HBase: Real-Time Big Data Analytics with SLA
Dani Abel Rayan
Who am I?
Interned with Cloudera. Flume contributor. HBase user. Work with Karsten Schwan @ GaTech.
Joining as "Big Data Engineer" in a lead role to manage exponentially growing data for the makers of League of Legends (a Multiplayer Online Battle Arena) – recently received 400M.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
Why near Real-Time?
Activity stream data is a normal part of any website for reporting on usage of the site. Activity data is things like page views, information about what content was shown, searches, etc. This kind of thing is usually handled by logging the activity out to some kind of file and then periodically aggregating these files for analysis. In recent years, however, activity data has become a critical part of the production features of websites, and a slightly more sophisticated set of infrastructure is needed.
The Big Picture
Why would you want to build this? – Customer retargeting
The Big Picture
Content serving by measuring current audience interests. Product patterns – Twitter streams. S4 is being used for applications such as personalization, user feedback, malicious traffic detection, and real-time search. Location-based streams – find people matching a specific threshold in near real-time – real-time shopping/restaurant discounts. So many possibilities!
A Million Impressions in a Sec …
100 nodes (soon to be 500) in CERCS. Each one can generate 10,000 impressions in one second. Specific products are given known impression rates and the others are pseudo-random. The challenging task is to ensure that we can bucket the product impressions into the proper column families within an SLA of a few seconds.
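As an illustration of that workload (purely a sketch; the log path, record format, product names, and rates are assumptions, not from the deck), each node could generate its impression log along these lines:

    import java.io.FileWriter;
    import java.io.PrintWriter;
    import java.util.Random;

    // Sketch of a per-node impression generator writing a "*.hb" log that Flume will tail.
    public class ImpressionGenerator {
        public static void main(String[] args) throws Exception {
            PrintWriter log = new PrintWriter(new FileWriter("/var/log/impressions/products.hb", true));
            Random rnd = new Random();
            while (true) {
                long start = System.currentTimeMillis();
                for (int i = 0; i < 10000; i++) {             // ~10,000 impressions per second per node
                    // "product42" gets a known ~10% share; the rest are pseudo-random.
                    String product = (i % 10 == 0) ? "product42" : "product" + rnd.nextInt(1000);
                    log.println(System.currentTimeMillis() + "\t" + product + "\timpression");
                }
                log.flush();
                long elapsed = System.currentTimeMillis() - start;
                if (elapsed < 1000) Thread.sleep(1000 - elapsed);  // pace to one batch per second
            }
        }
    }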
What's the storage? HBase
Currently used for real-time analytics at companies like Facebook (including FB messaging), Yahoo!, and Twitter. The high-throughput stream of immutable activity data represents a real computational challenge, as the volume may easily be 10x or 100x larger than the next largest data source on a site. Do we really need to store everything? Nope! HBase has TTL for column families. Wait! What are "column families"?
HBase Data Model
HBase Data Model
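The two "HBase Data Model" slides were diagrams in the original deck. As a rough textual stand-in (a conceptual sketch, not the HBase client API), the logical model can be pictured as a nested sorted map:

    import java.util.SortedMap;

    // Illustrative only -- a mental model, not the HBase client API.
    // Logically, an HBase table maps (row key, column family, qualifier, timestamp) -> value,
    // with rows kept sorted by key and cells versioned by timestamp.
    class HBaseLogicalModel {
        SortedMap<byte[],                       // row key
            SortedMap<String,                   // column family, e.g. "host", "info", "log", "product"
                SortedMap<byte[],               // column qualifier
                    SortedMap<Long, byte[]>>>>  // timestamp -> cell value
            table;
    }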
HBase TTL
DESCRIPTION                                                             ENABLED
{NAME => 'test', FAMILIES => [
  {NAME => 'host',    BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
  {NAME => 'info',    BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
  {NAME => 'log',     BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
  {NAME => 'product', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '10',         BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
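For reference, a short TTL like the 10-second one on the 'product' family above can be set from the HBase shell. A minimal sketch, with the table and family names taken from the DESCRIBE output above and everything else left at its defaults (older HBase versions require disabling the table before altering it):

    create 'test', {NAME => 'host'}, {NAME => 'info'}, {NAME => 'log'},
                   {NAME => 'product', TTL => 10}    # cells in 'product' expire after ~10 seconds
    # or, for an existing table:
    disable 'test'
    alter 'test', {NAME => 'product', TTL => 10}
    enable 'test'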
HBase Architecture
Why NoSQL? Why HBase? Horizontal scalability. Commodity hardware. Hadoop based. Only one index possible – the row key. Very high write load -> 20 billion events per day (200,000 events per second) with a lag of less than 30 seconds (at Facebook).
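To make "bucket the product impressions into column families" concrete, here is a minimal sketch against the 0.90-era HBase Java client; the row-key scheme, qualifiers, and values are illustrative assumptions, only the table and family names come from the schema shown earlier:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ImpressionWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "test");
            table.setAutoFlush(false);            // batch puts client-side to sustain a high write rate

            // Row key: product id + reversed timestamp is one common choice; illustrative only.
            String rowKey = "product42-" + (Long.MAX_VALUE - System.currentTimeMillis());
            Put put = new Put(Bytes.toBytes(rowKey));
            // Bucket the impression under the short-TTL 'product' family.
            put.add(Bytes.toBytes("product"), Bytes.toBytes("impression"), Bytes.toBytes("1"));
            put.add(Bytes.toBytes("host"), Bytes.toBytes("name"), Bytes.toBytes("node-07"));
            table.put(put);

            table.flushCommits();
            table.close();
        }
    }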
Flume
Flume is a distributed, reliable, and available service for efficiently moving large amounts of data soon after the data is produced. The primary use case for Flume is as a logging system that gathers a set of log files on every machine in a cluster and aggregates them to a centralized persistent store such as the Hadoop Distributed File System (HDFS). The system was designed with these four key goals in mind: Reliability, Scalability, Manageability, Extensibility.
Where is it used?
Mozilla, Shopzilla, AOL, Simple GEOPath, …
Flume Architecture
Flume Data Model
Flume internally converts every external source of data into a stream of events. Events are Flume's unit of data and are a simple and flexible representation. An event is composed of a body and metadata. The event body is a string of bytes representing the content of an event. For example, a line in a log file is represented as an event whose body is the actual byte representation of that line. The event metadata is a table of key / value pairs that capture some detail about the event, such as the time it was created or the name of the machine on which it originated. This table can be appended as an event travels along a Flume flow, and the table can be read to control the operation of individual components of that flow. For example, the machine name attached to an event can be used to control the output path where the event is written at the end of the flow. An event's body can be up to 32KB long – although this limit can be controlled via a system property, it is recommended that it is not changed in order to preserve performance.
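As a rough illustration of that data model (a conceptual sketch, not Flume's actual Event class), an event boils down to a byte-array body plus a string-keyed metadata table:

    import java.util.HashMap;
    import java.util.Map;

    // Conceptual sketch of Flume's unit of data -- not the real com.cloudera.flume API.
    class SketchEvent {
        byte[] body;                                    // e.g. one raw log line, up to ~32KB by default
        Map<String, String> metadata = new HashMap<String, String>();  // e.g. timestamp, originating host

        SketchEvent(String logLine, String host) {
            this.body = logLine.getBytes();
            metadata.put("host", host);                 // later usable to pick an output path or row key
            metadata.put("timestamp", Long.toString(System.currentTimeMillis()));
        }
    }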
Flume – HBase connector
This is challenging since we need to map single-dimensional key-value pairs into a multi-dimensional key-value store. Many possible approaches:
1. "usage: hbase(\"table\", \"rowkey\", " + "\"cf1\"," + " \"c1\", \"val1\"[,\"cf2\", \"c2\", \"val2\", ....]{, " + KW_BUFFER_SIZE + "=int, " + KW_USE_WAL + "=true|false})";
2. "usage: attr2hbase(\"table\" [,\"sysFamily\"[, \"writeBody\"[,\"attrPrefix\"[,\"writeBufferSize\"[,\"writeToWal\"]]]]])";
https://issues.cloudera.org/browse/FLUME-6
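Putting the first usage string to work, a hypothetical Flume (OG) data-flow spec could look like the sketch below; the table, family, and qualifier names, the port, and the %{...} attribute escapes are assumptions for illustration, not taken from FLUME-6:

    # Sketch only -- syntax: node : source | sink
    # Agent tails the impression log and forwards events to a collector.
    agentNode     : tail("/var/log/impressions/products.hb") | agentSink("collectorNode", 35853)
    # Collector writes each event into HBase table "test"; row key and cell value
    # are pulled from the event (escapes here are illustrative assumptions).
    collectorNode : collectorSource(35853) | hbase("test", "%{rowkey}", "product", "impression", "%{body}")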
The Story so far … for a demo
Deployed Flume agent nodes on ~100 machines. Agents monitor a "specific" log directory on all the machines. Any logfile matching the "*.hb" suffix will be continuously tailed. Deployed Flume collector nodes on 5 machines. One HBase – pseudo-distributed.
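A sketch of what the agent side of that topology might look like as a Flume (OG) spec, assuming the tailDir source and the agent failover-chain sinks described in the Flume user guide; the directory, regex, collector host names, and port are placeholders:

    # Sketch only -- placeholders throughout.
    # ~100 agents: continuously tail every "*.hb" file in the watched log directory, and
    # fail over across the 5 collector nodes (agentDFOChain = disk-failover chain).
    agent : tailDir("/var/log/impressions/", ".*\.hb") | agentDFOChain("collector1:35853", "collector2:35853", "collector3:35853", "collector4:35853", "collector5:35853")
    # Each collector receives events on the same port and hands them to the HBase sink shown earlier.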
Demo
Single event – choose a machine (7, 9, 10, 11, 12), choose a VM (2, 3, …, 20). 1000's of events – let's do it again on multiple machines. Failover chain demo. Inactive demo.
Evaluation
Flume is an awesome product, but it isn't perfect: there are instances wherein certain flows, though they appear "ACTIVE", aren't spewing out any data, and sometimes flows go inactive or work slower because of long queues. No SLA guarantees – and the same goes for other products like S4 from Yahoo! and Kafka from LinkedIn.
Monalytics
SLA
Monalytics – combined monitoring and analysis systems used for managing large-scale data center systems. Given the scale and complexity of commodity software/hardware in data centers, performance problems in the streaming MapReduce system are inevitable. Solving those problems is extremely hard, and much harder still if you need to meet an SLA: "We will bucket the impressions in less than 2 sec, no matter what."
So many moving parts
Still needs to be adaptable. Needs to be configurable with new stores. End-to-end monitoring of the SLA. Reliability. Garbage collection pauses. A good use case for Monalytics. New technologies are really cool to implement, but once you are past the initial honeymoon phase, complexities start surfacing, and one either enters "cul-de-sac" mode or falls back on the more traditional methods.
Demo
Monalytics. Latency. Start – stop 100 nodes.
Similar systems to Flume
S4 from Yahoo! Kafka from LinkedIn. FlumeBase – an extension to support SQL-like constructs operating on data streams – another startup.
Thanks!
dr@verticalengine.com
Questions?
