The document provides an introduction and overview of streaming data. It discusses sources of streaming data such as operational monitoring, web analytics, online advertising, social media, and mobile/IoT data. It explains that streaming data is different from other data types in that it is always flowing in, loosely structured, and can have high-cardinality dimensions. Real-time architectures for streaming data need to have high availability, low latency, and horizontal scalability.
2. Agenda
• Sources of Streaming Data: Operational Monitoring, Web Analytics,
Online Advertising, Social Media, Mobile Data and the Internet of Things
• Why Streaming Data Is Different: Loosely Structured, High-Cardinality
Storage
• Real-Time Architecture Components: Collection, Data Flow, Processing,
Storage, Delivery
• Features of a Real-Time Architecture: High Availability, Low Latency,
Horizontal Scalability
• Languages for Real-Time Programming
• Understanding MapReduce Failure for Streaming Data
3. Introduction to Streaming Data
• It seems like the world moves at a faster pace every day.
• People and places become more connected, and people and organizations try to
react at an ever-increasing pace.
• As the limits of human ability to respond are reached, tools are built to process the
vast amounts of data available to decision makers, analyze it, present it, and, in
some cases, respond to events as they happen.
• The collection and processing of this data has a number of application areas,
some of which are discussed in the next section.
• These applications, which are discussed later in this chapter, require an
infrastructure and method of analysis specific to streaming data.
• Fortunately, like batch processing before it, the state of the art of streaming
infrastructure is focused on using commodity hardware and software to build its
systems rather than the specialized systems required for real-time analysis prior
to the Internet era.
• This, combined with flexible cloud-based environments, puts the implementation
of a real-time system within the reach of nearly any organization.
• These commodity systems allow organizations to analyze their data in real time
and scale that infrastructure to meet future needs as the organization grows and
changes over time.
4. Introduction to Streaming Data
The goal of this course is to allow a fairly broad range of potential users
and implementers in an organization to gain comfort with the complete
stack of applications.
When real-time projects reach a certain point, they should be agile and
adaptable systems that can be easily modified, which requires that the
users have a fair understanding of the stack as a whole in addition to their
own areas of focus.
“Real time” applies as much to the development of new analyses as it does
to the data itself. Any number of well-meaning projects have failed
because they took so long to implement that the people who requested the
project have either moved on to other things or simply forgotten why they
wanted the data in the first place.
By making the projects agile and incremental, this can be avoided as
much as possible. The first section of this chapter, “Sources of
Streaming Data,” covers some of the common sources and applications of
streaming data. They are arranged more or less chronologically and
provide some background on the origin of streaming data infrastructures.
5. Sources of Streaming Data
• There are a variety of sources of streaming data. This section introduces
some of the major categories of data.
• The data motion systems presented in this book got their start handling
data for website analytics and online advertising at places like LinkedIn,
Yahoo!, and Facebook.
• The processing systems were designed to meet the challenges of
processing social media data from Twitter and social networks like
LinkedIn.
• Google, whose business is largely related to online advertising, makes
heavy use of advanced algorithmic approaches. Google seems to be
especially interested in a technique called deep learning, which makes use
of very large-scale neural networks to learn complicated patterns.
• These systems are even enabling entirely new areas of data collection and
analysis by making the Internet of Things and other highly distributed data
collection efforts economically feasible.
• It is hoped that outlining some of the previous application areas provides
some inspiration for as-yet unforeseen applications of these
technologies.
6. Operational Monitoring
• Operational monitoring of physical systems was the original
application of streaming data. Originally, this would have been
implemented using specialized hardware and software (or even
analog and mechanical systems in the pre-computer era).
• The most common use case today of operational monitoring is
tracking the performance of the physical systems that power the
Internet.
• These datacenters house thousands—possibly even tens of
thousands—of discrete computer systems. All of these systems
continuously record data about their physical state from the
temperature of the processor, to the speed of the fan and the voltage
draw of their power supplies. They also record information about the
state of their disk drives and fundamental metrics of their operation,
such as processor load, network activity, and storage access times.
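As a rough illustration of what such a monitoring agent might look like (a sketch of my own, not any particular product; the field names, sampling rate, and the report destination are all assumptions), each host periodically samples its state and emits a record:

```java
import java.time.Instant;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Minimal sketch of a host metrics sampler; all field names are illustrative.
public class MetricsSampler {
    record HostSample(String host, Instant ts, double cpuLoad,
                      double cpuTempC, long diskReadMicros) {}

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Sample once per second, as a typical monitoring agent might.
        scheduler.scheduleAtFixedRate(() -> {
            HostSample s = new HostSample("web-042", Instant.now(),
                    readCpuLoad(), readCpuTemp(), readDiskLatency());
            report(s);
        }, 0, 1, TimeUnit.SECONDS);
    }

    // Stand-ins for real sensor reads; a real agent would query the OS or IPMI.
    static double readCpuLoad() { return Math.random(); }
    static double readCpuTemp() { return 40 + 20 * Math.random(); }
    static long readDiskLatency() { return (long) (Math.random() * 500); }

    // A real agent would ship the sample over the network to a central collector.
    static void report(HostSample s) { System.out.println(s); }
}
```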
7. Operational Monitoring
• To make the monitoring of all of these systems possible and to
identify problems, this data is collected and aggregated in real time
through a variety of mechanisms.
• The first systems tended to be specialized ad hoc mechanisms, but as
these techniques were applied to other areas, monitoring increasingly
moved onto the same general-purpose collection systems used for other
kinds of data.
8. Web Analytics
• The introduction of the commercial web, through e-commerce and online
advertising, led to the need to track activity on a website.
• Like the circulation numbers of a newspaper, the number of unique
visitors who see a website in a day is important information.
• For e-commerce sites, the data is less about the number of visitors than about
the various products they browse and the correlations between them.
• To analyze this data, a number of specialized log-processing tools were
introduced and marketed. With the rise of Big Data and tools like
Hadoop, much of the web analytics infrastructure shifted to these large
batch-based systems.
• They were used to implement recommendation systems and other
analyses. It also became clear that it was possible to conduct experiments
on the structure of websites to see how they affected various metrics of
interest. This is called A/B testing because—in the same way an
optometrist tries to determine the best prescription—two choices are
pitted against each other to determine which is best.
• These tests were mostly conducted sequentially, but this has a number of
problems, not the least of which is the amount of time needed to conduct
the study.
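As a sketch of how parallel A/B testing is commonly implemented (a general pattern, not a method prescribed by this text; the experiment and visitor names are invented), each visitor is deterministically hashed into a variant, so assignment is stable and many experiments can run at once:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Minimal sketch: deterministic A/B variant assignment by hashing a visitor ID.
// The experiment name salts the hash so different tests split users independently.
public class AbAssigner {
    static String assign(String visitorId, String experiment) {
        CRC32 crc = new CRC32();
        crc.update((experiment + ":" + visitorId).getBytes(StandardCharsets.UTF_8));
        return (crc.getValue() % 2 == 0) ? "A" : "B";
    }

    public static void main(String[] args) {
        // The same visitor always lands in the same variant of a given test.
        System.out.println(assign("visitor-123", "checkout-button-color"));
    }
}
```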
9. Web Analytics
• As more and more organizations mined their website data, the need to
reduce the time in the feedback loop and to collect data on a more continual
basis became more important.
• Using the tools of the system-monitoring community, it became possible to
also collect this data in real time and perform things like A/B tests in
parallel rather than in sequence.
• As the number of dimensions being measured and the need for appropriate
auditing (due to the metrics being used for billing) increased, the analytics
community developed much of the streaming infrastructure to safely move
data from their web servers spread around the world to processing and
billing systems.
• This sort of data still accounts for a vast source of information from a
variety of sources, although it is usually contained within an organization
rather than made publicly available. Applications range from simple
aggregation for billing to the real-time optimization of product
recommendations based on current browsing history (or viewing history, in
the case of a company like Netflix).
10. Online Advertising
• A major user and generator of real-time data is online advertising.
• The original forms of online advertising were similar to their print counterparts
with “buys” set up months in advance.
• With the rise of the advertising exchange and real-time bidding infrastructure, the
advertising market has become much more fluid for a large and growing portion
of traffic.
• For these applications, the money being spent in different environments and on
different sites is being managed on a per-minute basis in much the same way as
the stock market.
• In addition, these buys are often being managed to some sort of metric, such as
the number of purchases (called a conversion) or even the simpler metric of the
number of clicks on an ad.
• When a visitor arrives at a website via a modern advertising exchange, a call is
made to a number of bidding agencies (perhaps 30 or 40 at a time), who place
bids on the page view in real time.
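A highly simplified sketch of that fan-out-and-auction step follows (my own toy construction: the agent count, random bids, latency budget, and second-price rule are illustrative assumptions, not the mechanics of any real exchange):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Toy sketch of real-time bidding: fan a request out to many bidding agents,
// keep only the bids that arrive inside the latency budget, run an auction.
public class ToyExchange {
    record Bid(String bidder, double cpm) {}

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(32);
        List<Future<Bid>> pending = new ArrayList<>();
        for (int i = 0; i < 30; i++) { // ~30 agents are called per page view
            final int id = i;
            pending.add(pool.submit(() -> new Bid("agent-" + id, Math.random() * 5.0)));
        }

        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(100);
        List<Bid> bids = new ArrayList<>();
        for (Future<Bid> f : pending) {
            try { // one shared 100 ms budget; late responses are simply dropped
                bids.add(f.get(Math.max(0, deadline - System.nanoTime()), TimeUnit.NANOSECONDS));
            } catch (TimeoutException e) {
                f.cancel(true);
            }
        }

        bids.sort(Comparator.comparingDouble(Bid::cpm).reversed());
        if (bids.size() >= 2) {
            // Second-price auction: the winner pays the runner-up's bid.
            System.out.printf("%s wins and pays %.2f%n", bids.get(0).bidder(), bids.get(1).cpm());
        }
        pool.shutdown();
    }
}
```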
11. Online Advertising
• An auction is run, and the advertisement from the winning party is displayed. This
usually happens while the rest of the page is loading; the elapsed time is less than
about 100 milliseconds.
• If the page contains several advertisements, as many of them do, an auction is
often being run for all of them, sometimes on several different exchanges.
• All the parties in this process—the exchange, the bidding agent, the advertiser,
and the publisher—are collecting this data in real time for various purposes.
• For the exchange, this data is a key part of the billing process as well as
important for real-time feedback mechanisms that are used for a variety of
purposes.
• Examples include monitoring the exchange for fraudulent traffic and other risk
management activities, such as throttling the access to impressions to various
parties. Advertisers, publishers, and bidding agents on both sides of the exchange
are also collecting the data in real time.
12. Online Advertising
• Their goal is the management and optimization of the campaigns they are
currently running. From selecting the bid (in the case of the advertiser) or the
“reserve” price (in the case of the publisher), to deciding which exchange offers
the best prices for a particular type of traffic, the data is all being managed on a
moment-to-moment basis.
• A good-sized advertising campaign or a large website could easily see page views
in the tens or hundreds of millions. Including other events such as clicks and
conversions could easily double that.
• A bidding agent is usually acting on the part of many different advertisers or
publishers at once and will often be collecting on the order of hundreds of
millions to billions of events per day. Even a medium-sized exchange, sitting as a
central party, can have billions of events per day. All of this data is being
collected, moved, analyzed, and stored as it happens.
13. Social Media
• Another newer source of massive collections of data is social media,
especially public sources such as Twitter.
• As of the middle of 2013, Twitter reported that it collected around 500 million
tweets per day with spikes up to around 150,000 tweets per second. That number
has surely grown since then. This data is collected and disseminated in real time,
making it an important source of information for news outlets and other
consumers around the world.
• In 2011, Twitter users in New York City received information about an earthquake
outside of Washington, D.C. about 30 seconds before the tremors struck New
York itself. Combined with other sources like Facebook, Foursquare, and
upcoming communications platforms, this data is extremely large and varied.
• The data from applications like web analytics and online advertising, although
highly dimensional, are usually fairly well structured. The dimensions, such as
money spent or physical location, are fairly well understood quantitative values.
• In social media, however, the data is usually highly unstructured, at least as data
analysts understand the term. It is usually some form of “natural language” data
that must be parsed, processed, and somehow understood by automated systems.
This makes social media data incredibly rich, but incredibly challenging for
real-time systems to process.
14. Mobile Data and the Internet of Things
• One of the most exciting new sources of data was introduced to the world in 2007 in
the form of Apple's iPhone.
• Cellular data-enabled computers had been available for at least a decade, and devices
like Blackberries had put data in the hands of business users, but these devices were
still essentially specialist tools and were managed as such.
• The iPhone, Android phones, and other smartphones that followed made cellular data
a consumer technology, with the accompanying economies of scale that go hand in
hand with a massive increase in user base.
• It also put a general-purpose computer in the pocket of a large population.
Smartphones have the ability not only to report back to an online service, but also to
communicate with other nearby objects using technologies like Bluetooth LE.
• Technologies like so-called “wearables,” which make it possible to measure the
physical world the same way the virtual world has been measured for the last two
decades, have taken advantage of this new infrastructure.
• The applications range from the silly to the useful to the creepy. For example, a
wristband that measures sleep activity could trigger an automated coffee maker when
the user gets a poor night's sleep and needs to be alert the next day. The smell of
coffee brewing could even be used as an alarm. The communication between these
systems no longer needs to be direct or specialized, as envisioned in the various
“smart home” demonstration projects of the past 50 years. These tools are
possible today using tools like If This Then That (IFTTT) and other publicly available
systems built on similar infrastructure.
15. Mobile Data and the Internet of Things
• On a more serious note, important biometric data can be measured in real time
by remote facilities in a way that has previously been available only when using
highly specialized and expensive equipment, which has limited its application to
high-stress environments like space exploration. Now this data can be collected
for an individual over a long period of time (this is known in statistics as
longitudinal data) and pooled with other users' data to provide a more complete
picture of human biometric norms.
• Instead of taking a blood pressure test once a year in a cold room wearing a
paper dress, a person's blood pressure might be tracked over time with the goal
of “heading off problems at the pass.” Outside of health, there has long been the
idea of “smart dust”—large collections of inexpensive sensors that can be
distributed into an area of the world and remotely queried to collect interesting
data. The limitation of these devices has largely been the expense required to
manufacture relatively specialized pieces of equipment. This has been solved by
the commodification of data collection hardware and software (such as the
smartphone) and is now known as the Internet of Things.
16. Why Streaming Data Is Different
There are a number of aspects to streaming data that set it apart from
other kinds of data.
The three most important, covered in this section, are the “always-
on” nature of the data, the loose and changing data structure, and
the challenges presented by high-cardinality dimensions.
All three play a major role in decisions made in the design and
implementation of the various streaming frameworks.
Always On, Always Flowing –
The first is somewhat obvious: streaming data streams. The data is
always available, and new data is always being generated. This has a
few effects on the design of any collection and analysis system. First,
the collection itself needs to be very robust. Downtime for the primary
collection system means that data is permanently lost.
17. Why Streaming Data Is Different
Second, the fact that the data is always flowing means that the system
needs to be able to keep up with the data.
If 2 minutes are required to process 1 minute of data, the system will
not be real time for very long. Eventually, the problem will be so bad
that some data will have to be dropped to allow the system to catch
up.
In practice it is not enough to have a system that can merely “keep
up” with data in real time. It needs to be able to process data far more
quickly than real time.
Whether for intentional reasons, such as planned downtime, or due to
catastrophic failures, such as network outages, the system, in whole or
in part, will eventually go down.
18. Why Streaming Data Is Different
Failing to plan for this inevitability and having a system that can only
process at the same speed as events happen means that the system is now
delayed by the amount of data stored at the collectors while the system was
down.
A system that can process 1 hour of data in 1 minute, on the other hand, can
catch up fairly quickly with little need for intervention. A mature
environment with good horizontal scalability can even implement auto-
scaling.
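As a back-of-the-envelope sketch of this catch-up arithmetic (my own illustration, assuming a constant processing speedup k over real time), a backlog of D minutes drains in D / (k - 1) minutes:

```java
// Sketch: how long a backlog takes to drain when the processor runs
// k times faster than real time (k must be > 1 or the backlog never shrinks).
public class CatchUp {
    static double drainMinutes(double backlogMinutes, double k) {
        if (k <= 1) throw new IllegalArgumentException("system cannot catch up");
        return backlogMinutes / (k - 1);
    }

    public static void main(String[] args) {
        // The 60x system described above (1 hour of data in 1 minute):
        // a 1-hour outage drains in just over a minute.
        System.out.println(drainMinutes(60, 60)); // ~1.017 minutes
    }
}
```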
In this setting, as the delay increases, more processing power is temporarily
added to bring the delay back into acceptable limits. On the algorithmic side,
this always-flowing feature of streaming data is a bit of a double-edged
sword.
On the positive side, there is rarely a situation where there is not enough
data. If more data is required for an analysis, simply wait for enough data to
become available. It may require a long wait, but other analyses can be
conducted in the meantime that can provide early indicators of how the later
analysis might proceed.
19. Loosely Structured
• Streaming data is often loosely structured compared to many other
datasets. There are several reasons this happens, and although this
loose structure is not unique to streaming data, it seems to be more
common in the streaming settings than in other situations. Part of
the reason seems to be the type of data that is interesting in the
streaming setting.
• Streaming data comes from a variety of sources. Although some of
these sources are rigidly structured, many of them are carrying an
arbitrary data payload. Social media streams, in particular, will be
carrying data about everything from world events to the best slice
of pizza to be found in Brooklyn on a Sunday night. To make
things more difficult, the data is encoded as human language.
20. Loosely Structured
• Another reason is that there is a “kitchen sink” mentality to
streaming data projects. Most of the projects are fairly young and
exploring unknown territory, so it makes sense to toss as many
different dimensions into the data as possible. This is likely to
change over time, so the decision is also made to use a format that
can be easily modified, such as JavaScript Object Notation (JSON).
The general paradigm is to collect as much data as possible in the
event that it is actually interesting.
• Finally, the real-time nature of the data collection also means that
various dimensions may or may not be available at any given time.
For example, a service that converts IP addresses to a geographical
location may be temporarily unavailable. For a batch system this
does not present a problem; the analysis can always be redone later
when the service is once more available. The streaming system, on
the other hand, must be able to deal with changes in the available
dimensions and do the best it can.
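To make that last point concrete, here is a minimal sketch (the event fields and the notion of a geo dimension are invented for illustration) of a processor that treats every dimension as optional and degrades gracefully when one is missing:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Sketch: treating every dimension of a loosely structured event as optional.
// A parsed JSON event is represented here as a plain Map for self-containment.
public class LooseEvent {
    static String geoOrUnknown(Map<String, Object> event) {
        // The geo field may be missing entirely if the IP-to-location service
        // was down when the event was collected; a batch job could backfill it
        // later, but a streaming job must do the best it can right now.
        return Optional.ofNullable(event.get("geo"))
                .map(Object::toString)
                .orElse("unknown");
    }

    public static void main(String[] args) {
        Map<String, Object> event = new HashMap<>();
        event.put("url", "/products/42");        // always present
        // note: no "geo" key; the lookup service was unavailable
        System.out.println(geoOrUnknown(event)); // prints "unknown"
    }
}
```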
21. High-Cardinality Storage
• Cardinality refers to the number of unique values a piece of data can
take on. Formally, cardinality refers to the size of a set and can be
applied to the various dimensions of a dataset as well as the entire
dataset itself. This high cardinality often manifests itself in a “long
tail” feature of the data. For a given dimension (or combination of
dimensions) there is a small set of different states that are quite
common, usually accounting for the majority of the observed data,
and then a “long tail” of other data states that comprise a fairly small
fraction.
• This feature is common to both streaming and batch systems, but it is
much harder to deal with high cardinality in the streaming setting. In
the batch setting it is usually possible to perform multiple passes over
the dataset. A first pass over the data is often used to identify
dimensions with high cardinality and compute the states that make up
most of the data. These common states can be treated individually,
and the remaining state is combined into a single “other” state that
can usually be ignored.
22. High-Cardinality Storage
• In the streaming setting, the data can usually be processed a single time. If
the common cases are known ahead of time, this can be included in the
processing step. The long tail can also be combined into the “other” state,
and the analysis can proceed as it does in the batch case. If a batch study
has already been performed on an earlier dataset, it can be used to inform
the streaming analysis. However, it is often not known if the common states
for the current data will be the common states for future data. In fact,
changes in the mix of states might actually be the metric of interest. More
commonly, there is no previous data to perform an analysis upon. In this
case, the streaming system must attempt to deal with the data at its natural
cardinality.
• This is difficult both in terms of processing and in terms of storage.
Processing anything that involves a large number of distinct states
necessarily takes time. It also requires a linear amount of space to store
information about each distinct state, and, unlike in the batch setting,
storage is much more restricted because it must generally use very fast
main memory rather than the much slower tertiary storage of hard drives.
This has been relaxed somewhat with the introduction of high-performance
Solid State Drives (SSDs), but they are still orders of magnitude slower
than memory access.
23. High-Cardinality Storage
• As a result, a major topic of research in streaming data is how to deal
with high-cardinality data. This course discusses some of the
approaches to dealing with the problem. As an active area of
research, more solutions are being developed and improved every
day.
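As one flavor of those approaches, probabilistic counting trades a small, fixed amount of memory for an approximate answer. The following is a bare-bones Flajolet-Martin-style sketch written for illustration; production systems typically use the more accurate HyperLogLog family:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Bare-bones Flajolet-Martin-style distinct-count estimator: track the maximum
// number of trailing zero bits seen in item hashes. Constant memory, but a
// single estimator is very noisy; real systems (HyperLogLog) average many.
public class FmSketch {
    private int maxTrailingZeros = 0;

    public void add(String item) {
        CRC32 crc = new CRC32();
        crc.update(item.getBytes(StandardCharsets.UTF_8));
        long h = crc.getValue();
        if (h == 0) return; // skip the all-zero hash rather than count 32 zeros
        maxTrailingZeros = Math.max(maxTrailingZeros, Long.numberOfTrailingZeros(h));
    }

    public double estimate() {
        // 0.77351 is the standard Flajolet-Martin correction factor.
        return Math.pow(2, maxTrailingZeros) / 0.77351;
    }

    public static void main(String[] args) {
        FmSketch sketch = new FmSketch();
        for (int i = 0; i < 100_000; i++) sketch.add("user-" + i);
        System.out.println(sketch.estimate()); // rough order-of-magnitude estimate
    }
}
```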
24. Real-Time Processing Architecture
• Real time processing deals with streams of data that are
captured in real-time and processed with minimal latency to
generate real-time (or near-real-time) reports or automated
responses.
• For example, a real-time traffic monitoring solution might
use sensor data to detect high traffic volumes.
• This data could be used to dynamically update a map to show
congestion, or automatically initiate high-occupancy lanes or
other traffic management systems.
25. Real-Time Processing Architecture
• Real-time processing is defined as the processing of an
unbounded stream of input data, with very short latency
requirements, measured in milliseconds or seconds.
• This incoming data typically arrives in an unstructured or
semi-structured format, such as JSON, and has the same
processing requirements as batch processing, but with shorter
turnaround times to support real-time consumption.
• Processed data is often written to an analytical data store,
which is optimized for analytics and visualization.
27. Real-Time Processing Architecture
• Real-time ingestion/collection:
The architecture must include a way to capture and store real-time
messages to be consumed by a stream processing consumer.
• Stream processing:
After capturing real-time messages, the solution must process
them by filtering, aggregating, and otherwise preparing the data
for analysis.
• Analytical data store/storage:
Many big data solutions are designed to prepare data for
analysis and then serve the processed data in a structured
format that can be queried using analytical tools.
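A minimal end-to-end sketch of these components follows, using an in-memory queue as a stand-in for the real-time ingestion layer and a map as a stand-in for the analytical data store (the event shape and the filtering rule are assumptions for illustration):

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal sketch of ingest -> process -> store. A BlockingQueue stands in for
// a real message broker; a ConcurrentHashMap stands in for the analytical
// data store that reporting tools would query.
public class MiniPipeline {
    record PageView(String page, long tsMillis) {}

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<PageView> ingest = new LinkedBlockingQueue<>(); // collection
        Map<String, Long> countsByPage = new ConcurrentHashMap<>();   // storage

        // Stream processing: filter out health checks, then aggregate counts.
        Thread processor = new Thread(() -> {
            try {
                while (true) {
                    PageView pv = ingest.take();
                    if (pv.page().equals("/health")) continue;        // filter
                    countsByPage.merge(pv.page(), 1L, Long::sum);     // aggregate
                }
            } catch (InterruptedException e) { /* shutdown */ }
        });
        processor.setDaemon(true);
        processor.start();

        // A few synthetic events standing in for real traffic.
        ingest.put(new PageView("/home", System.currentTimeMillis()));
        ingest.put(new PageView("/health", System.currentTimeMillis()));
        ingest.put(new PageView("/home", System.currentTimeMillis()));

        Thread.sleep(200);                // let the processor drain the queue
        System.out.println(countsByPage); // analysis/reporting would query this
    }
}
```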
28. Real-Time Processing Architecture
• Analysis and reporting:
The goal of most big data solutions is to provide insights
into the data through analysis and reporting.
29. Features of a Real-Time Architecture
• Three key features allow real-time systems to operate in a streaming
environment: high availability, low latency, and horizontal
scalability.
• High Availability:
• A key difference between real-time streaming applications and
business-intelligence applications
• Achieved through distribution and replication
• Concerns the availability of both the service and the data
• When data becomes unavailable, replication allows it to be recovered
30. Features of a Real-Time Architecture
• Three key features allow real-time systems to operate in a streaming
environment: high availability, low latency, and horizontal
scalability.
• Low Latency:
• There are many seconds in a day (86,400)
• Traditionally refers to the time it takes to handle a single request
• In a real-time streaming application, low latency means something a bit
different. Rather than referring to the return time for a single
request, it refers to the amount of time between an event occurring
somewhere at the “edge” of the system and its being made available
to the processing and delivery frameworks.
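One common way to track this edge-to-processing latency (a general pattern, not something this text prescribes) is to stamp each event at its source and compare that stamp against the clock when the event reaches the processor; clock skew between machines is the usual caveat:

```java
import java.time.Duration;
import java.time.Instant;

// Sketch: measuring edge-to-processing latency by comparing an event's
// origin timestamp with the processing-time clock. Assumes the edge and
// processing hosts have reasonably synchronized clocks (e.g., via NTP).
public class LatencyTracker {
    record Event(String payload, Instant createdAtEdge) {}

    static Duration observeLatency(Event e) {
        return Duration.between(e.createdAtEdge(), Instant.now());
    }

    public static void main(String[] args) throws InterruptedException {
        Event e = new Event("click", Instant.now()); // stamped at the "edge"
        Thread.sleep(50);                            // simulate transport delay
        System.out.println(observeLatency(e).toMillis() + " ms"); // ~50 ms
    }
}
```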
31. Features of a Real-Time Architecture
• Three key features allow real-time systems to operate in a streaming
environment: high availability, low latency, and horizontal
scalability.
• Horizontal Scalability:
• Horizontal scaling refers to the practice of adding more
physically separate servers to a cluster of servers handling a
specific problem to improve performance.
• The opposite is vertical scaling, which is adding more resources
to a single server to improve performance.
• Works best with limited coordination between servers
• Relies on partitioning techniques to spread the work (see the sketch below)
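A common partitioning technique (illustrative, not specific to any framework) is to hash each record's key modulo the number of servers, so any node can route a record without coordinating with the others:

```java
// Sketch: key-based partitioning, the usual trick behind horizontal scaling.
// Hashing the key means any server can route a record with no coordination;
// all events for a given key consistently land on the same partition.
public class Partitioner {
    static int partitionFor(String key, int numPartitions) {
        // Math.floorMod guards against negative hashCode values.
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int partitions = 8; // one per server in this toy cluster
        System.out.println(partitionFor("user-42", partitions)); // always the same
        System.out.println(partitionFor("user-43", partitions)); // likely different
    }
}
```

One caveat of this simple scheme: changing the number of partitions reshuffles nearly every key, which is why many systems use consistent hashing instead.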
32. Languages for Real-Time Programming
• Java
• “Write Once, Run Anywhere”
• Scala
• Runs on the Java Virtual Machine
• Among the most popular languages used in real-time application development
• JavaScript
• Built into every web browser
• Supports a functional style of programming
33. Understanding MapReduce Failure for Streaming Data
MapReduce is good at processing large amounts of data in parallel.
Real-time processing, however, demands very low latency.
Data collected from multiple sources may not all have
arrived at the point of aggregation.
In the standard MapReduce model, the reduce phase cannot
start until the map phase has completed, and all the
intermediate data is persisted to disk before being
transferred to the reducers. All of this adds
significant latency to the processing.
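To contrast with that barrier, a streaming system folds each event into a running result the moment it arrives. Here is a tiny sketch of that incremental style (a word count, chosen only because it is the classic MapReduce example):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: incremental (streaming) word count. Unlike MapReduce, there is no
// barrier between phases and no intermediate data spilled to disk: each
// event updates the running counts immediately, so results are always current.
public class StreamingWordCount {
    private final Map<String, Long> counts = new ConcurrentHashMap<>();

    public void onEvent(String line) {
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) counts.merge(word, 1L, Long::sum);
        }
    }

    public static void main(String[] args) {
        StreamingWordCount wc = new StreamingWordCount();
        wc.onEvent("streaming data streams"); // results usable after every event
        wc.onEvent("data is always flowing");
        System.out.println(wc.counts); // e.g., {data=2, streaming=1, ...}
    }
}
```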