The document provides an introduction and overview of streaming data. It discusses sources of streaming data such as operational monitoring, web analytics, online advertising, social media, and mobile/IoT data. It explains that streaming data is different from other data types in that it is always flowing in, loosely structured, and can have high-cardinality dimensions. Real-time architectures for streaming data need to have high availability, low latency, and horizontal scalability.
2. Agenda
• Sources of Streaming Data: Operational Monitoring, Web Analytics,
Online Advertising, Social Media, Mobile Data and the Internet of Things
• Why Streaming Data Is Different: Loosely Structured, High-Cardinality
Storage
• Real-Time Architecture Components: Collection, Data Flow, Processing,
Storage, Delivery; Features of a Real-Time Architecture:
• High Availability, Low Latency, Horizontal Scalability
• Languages for Real-Time Programming; Understanding MapReduce,
Failure for Streaming Data
3. Introduction to Streaming Data
• It seems like the world moves at a faster pace every day.
• People and places become more connected, and people and organizations try to
react at an ever-increasing pace.
• As the limits of a human's ability to respond are reached, tools are built to
process the vast amounts of data available to decision makers, analyze it,
present it, and, in some cases, respond to events as they happen.
• The collection and processing of this data has a number of application areas,
some of which are discussed in the next section.
• These applications, which are discussed later in this chapter, require an
infrastructure and method of analysis specific to streaming data.
• Fortunately, like batch processing before it, the state of the art of streaming
infrastructure is focused on using commodity hardware and software to build its
systems rather than the specialized systems required for real-time analysis prior
to the Internet era.
• This, combined with flexible cloud-based environments, puts the implementation
of a real-time system within the reach of nearly any organization.
• These commodity systems allow organizations to analyze their data in real time
and scale that infrastructure to meet future needs as the organization grows and
changes over time.
4. Introduction to Streaming Data
The goal of this course is to allow a fairly broad range of potential users
and implementers in an organization to gain comfort with the complete
stack of applications.
When real-time projects reach a certain point, they should be agile and
adaptable systems that can be easily modified, which requires that the
users have a fair understanding of the stack as a whole in addition to their
own areas of focus.
“Real time” applies as much to the development of new analyses as it does
to the data itself. Any number of well-meaning projects have failed
because they took so long to implement that the people who requested the
project have either moved on to other things or simply forgotten why they
wanted the data in the first place.
By making the projects agile and incremental, this can be avoided as
much as possible. This chapter covers the first section, "Sources of
Streaming Data," which surveys some of the common sources and applications
of streaming data. They are arranged more or less chronologically and
provide some background on the origin of streaming data infrastructures.
5. Sources of Streaming Data
• There are a variety of sources of streaming data. This section introduces
some of the major categories of data.
• The data motion systems presented in this book got their start handling
data for website analytics and online advertising at places like LinkedIn,
Yahoo!, and Facebook.
• The processing systems were designed to meet the challenges of
processing social media data from Twitter and social networks like
LinkedIn.
• Google, whose business is largely related to online advertising, makes
heavy use of the advanced algorithmic approaches. Google seems to be
especially interested in a technique called deep learning, which makes use
of very large-scale neural networks to learn complicated patterns.
• These systems are even enabling entirely new areas of data collection and
analysis by making the Internet of Things and other highly distributed data
collection efforts economically feasible.
• It is hoped that outlining some of the previous application areas provides
some inspiration for as-of-yet unforeseen applications of these
technologies.
6. Operational Monitoring
• Operational monitoring of physical systems was the original
application of streaming data. Originally, this would have been
implemented using specialized hardware and software (or even
analog and mechanical systems in the pre-computer era).
• The most common use case today of operational monitoring is
tracking the performance of the physical systems that power the
Internet.
• These datacenters house thousands—possibly even tens of
thousands—of discrete computer systems. All of these systems
continuously record data about their physical state, from the
temperature of the processor to the speed of the fans and the voltage
draw of their power supplies. They also record information about the
state of their disk drives and fundamental metrics of their operation,
such as processor load, network activity, and storage access times.
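The readings described above are typically wrapped into small timestamped event records before being shipped to a collector. The following is a minimal sketch of that shape; the host name, metric names, and newline-delimited JSON wire format are illustrative assumptions, not a specific monitoring product's API.

```python
import json
import time

def make_metric_event(host, name, value, unit):
    """Package one sensor reading as a timestamped event record."""
    return {
        "host": host,       # which machine the reading came from
        "metric": name,     # e.g. a CPU load or fan speed metric
        "value": value,
        "unit": unit,
        "ts": time.time(),  # epoch seconds at collection time
    }

# Each host emits many such events continuously; a central collector
# aggregates them across thousands of machines in real time.
event = make_metric_event("web-042", "cpu.load.1min", 0.37, "ratio")
line = json.dumps(event)  # one event per line is a common wire format
```

In practice an agent on each machine would emit one such line per sample interval, and the collection tier would fan these streams into the aggregation system.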
7. Operational Monitoring
• To make the monitoring of all of these systems possible and to
identify problems, this data is collected and aggregated in real time
through a variety of mechanisms.
• The first systems tended to be specialized ad hoc mechanisms, but as
these sorts of techniques were applied to other areas, they came to share
the same collection systems as other data collection mechanisms.
8. Web Analytics
• The introduction of the commercial web, through e-commerce and online
advertising, led to the need to track activity on a website.
• Like the circulation numbers of a newspaper, the number of unique
visitors who see a website in a day is important information.
• For e-commerce sites, the data is less about the number of visitors than about
the various products they browse and the correlations between them.
• To analyze this data, a number of specialized log-processing tools were
introduced and marketed. With the rise of Big Data and tools like
Hadoop, much of the web analytics infrastructure shifted to these large
batch-based systems.
• They were used to implement recommendation systems and other
analysis. It also became clear that it was possible to conduct experiments
on the structure of websites to see how they affected various metrics of
interest. This is called A/B testing because—in the same way an
optometrist tries to determine the best prescription—two choices are
pitted against each other to determine which is best.
• These tests were mostly conducted sequentially, but this has a number of
problems, not the least of which is the amount of time needed to conduct
the study.
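The parallel A/B testing described above amounts to tallying each variant's outcomes over the same time window and comparing rates. A minimal sketch, assuming a stream of (variant, converted) pairs as input; the variant names and event shape are hypothetical:

```python
from collections import defaultdict

def ab_summary(events):
    """Tally impressions and conversions per variant from a stream of
    (variant, converted) pairs and report conversion rates."""
    counts = defaultdict(lambda: [0, 0])  # variant -> [impressions, conversions]
    for variant, converted in events:
        counts[variant][0] += 1
        counts[variant][1] += int(converted)
    return {v: conv / imps for v, (imps, conv) in counts.items()}

# Both variants are measured over the same window, so the comparison
# runs in parallel rather than as two sequential studies.
stream = [("A", True), ("A", False), ("B", True),
          ("B", True), ("A", False), ("B", False)]
rates = ab_summary(stream)  # A: 1/3, B: 2/3
```

A real system would add a significance test before declaring a winner, but the streaming part is just this running tally.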
9. Web Analytics
• As more and more organizations mined their website data, the need to
reduce the time in the feedback loop and to collect data on a more continual
basis became more important.
• Using the tools of the system-monitoring community, it became possible to
also collect this data in real time and perform things like A/B tests in
parallel rather than in sequence.
• As the number of dimensions being measured and the need for appropriate
auditing (due to the metrics being used for billing) increased, the analytics
community developed much of the streaming infrastructure to safely move
data from their web servers spread around the world to processing and
billing systems.
• This sort of data still accounts for a vast source of information from a
variety of sources, although it is usually contained within an organization
rather than made publicly available. Applications range from simple
aggregation for billing to the real-time optimization of product
recommendations based on current browsing history (or viewing history, in
the case of a company like Netflix).
10. Online Advertising
• A major user and generator of real-time data is online advertising.
• The original forms of online advertising were similar to their print counterparts
with “buys” set up months in advance.
• With the rise of the advertising exchange and real-time bidding infrastructure, the
advertising market has become much more fluid for a large and growing portion
of traffic.
• For these applications, the money being spent in different environments and on
different sites is being managed on a per-minute basis in much the same way as
the stock market.
• In addition, these buys are often being managed to some sort of metric, such as
the number of purchases (called a conversion) or even the simpler metric of the
number of clicks on an ad.
• When a visitor arrives at a website via a modern advertising exchange, a call is
made to a number of bidding agencies (perhaps 30 or 40 at a time), who place
bids on the page view in real time.
11. Online Advertising
• An auction is run, and the advertisement from the winning party is displayed. This
usually happens while the rest of the page is loading; the elapsed time is less than
about 100 milliseconds.
• If the page contains several advertisements, as many of them do, an auction is
often being run for all of them, sometimes on several different exchanges.
• All the parties in this process—the exchange, the bidding agent, the advertiser,
and the publisher—are collecting this data in real time for various purposes.
• For the exchange, this data is a key part of the billing process as well as
important for real-time feedback mechanisms that are used for a variety of
purposes.
• Examples include monitoring the exchange for fraudulent traffic and other risk
management activities, such as throttling the access to impressions to various
parties. Advertisers, publishers, and bidding agents on both sides of the exchange
are also collecting the data in real time.
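The auction step described above can be sketched in a few lines. This assumes a second-price rule with a publisher reserve, which is a common exchange convention but not something the slides specify; the agent names and amounts are made up:

```python
def run_auction(bids, reserve):
    """Pick a winner by highest bid; the clearing price is the
    second-highest bid, or the reserve if there is no runner-up."""
    eligible = {agent: bid for agent, bid in bids.items() if bid >= reserve}
    if not eligible:
        return None, None  # no bid met the publisher's reserve price
    ranked = sorted(eligible.items(), key=lambda kv: kv[1], reverse=True)
    winner = ranked[0][0]
    price = ranked[1][1] if len(ranked) > 1 else reserve
    return winner, price

# 30-40 bidding agents respond per page view; the whole exchange
# round trip must fit inside roughly 100 milliseconds.
winner, price = run_auction(
    {"agent-1": 2.10, "agent-2": 1.75, "agent-3": 0.40}, reserve=0.50
)
# winner: "agent-1", clearing price: 1.75
```

Every party in this exchange logs the bid, the outcome, and the price, which is the raw material for the real-time feeds discussed next.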
12. Online Advertising
• Their goal is the management and optimization of the campaigns they are
currently running. From selecting the bid (in the case of the advertiser) or the
“reserve” price (in the case of the publisher), to deciding which exchange offers
the best prices for a particular type of traffic, the data is all being managed on a
moment-to-moment basis.
• A good-sized advertising campaign or a large website could easily see page views
in the tens or hundreds of millions. Including other events such as clicks and
conversions could easily double that.
• A bidding agent is usually acting on the part of many different advertisers or
publishers at once and will often be collecting on the order of hundreds of
millions to billions of events per day. Even a medium-sized exchange, sitting as a
central party, can have billions of events per day. All of this data is being
collected, moved, analyzed, and stored as it happens.
13. Social Media
• Another newer source of massive collections of data is social media,
especially public sources such as Twitter.
• As of the middle of 2013, Twitter reported that it collected around 500 million
tweets per day with spikes up to around 150,000 tweets per second. That number
has surely grown since then. This data is collected and disseminated in real time,
making it an important source of information for news outlets and other
consumers around the world.
• In 2011, Twitter users in New York City received information about an earthquake
outside of Washington, D.C. about 30 seconds before the tremors struck New
York itself. Combined with other sources like Facebook, Foursquare, and
upcoming communications platforms, this data is extremely large and varied.
• The data from applications like web analytics and online advertising, although
highly dimensional, are usually fairly well structured. The dimensions, such as
money spent or physical location, are fairly well understood quantitative values.
• In social media, however, the data is usually highly unstructured, at least as data
analysts understand the term. It is usually some form of “natural language” data
that must be parsed, processed, and somehow understood by automated systems.
This makes social media data incredibly rich, but incredibly challenging for the
real-time data sources to process.
14. Mobile Data and the Internet of Things
• One of the most exciting new sources of data was introduced to the world in 2007 in
the form of Apple's iPhone.
• Cellular data-enabled computers had been available for at least a decade, and devices
like Blackberries had put data in the hands of business users, but these devices were
still essentially specialist tools and were managed as such.
• The iPhone, Android phones, and other smartphones that followed made cellular data
a consumer technology, with the accompanying economies of scale that go hand in
hand with a massive increase in user base.
• It also put a general-purpose computer in the pocket of a large population.
Smartphones have the ability not only to report back to an online service, but also to
communicate with other nearby objects using technologies like Bluetooth LE.
• Technologies like so-called “wearables,” which make it possible to measure the
physical world the same way the virtual world has been measured for the last two
decades, have taken advantage of this new infrastructure.
• The applications range from the silly to the useful to the creepy. For example, a
wristband that measures sleep activity could trigger an automated coffee maker when
the user gets a poor night's sleep and needs to be alert the next day. The smell of
coffee brewing could even be used as an alarm. The communication between these
systems no longer needs to be direct or specialized, as envisioned in the various
“smart home” demonstration projects during the past 50 years. These tools are
possible today using tools like If This Then That (IFTTT) and other publicly
available systems built on similar infrastructure.
15. Mobile Data and the Internet of Things
• On a more serious note, important biometric data can be measured in real time
by remote facilities in a way that has previously been available only when using
highly specialized and expensive equipment, which has limited its application to
high-stress environments like space exploration. Now this data can be collected
for an individual over a long period of time (this is known in statistics as
longitudinal data) and pooled with other users' data to provide a more complete
picture of human biometric norms.
• Instead of taking a blood pressure test once a year in a cold room wearing a
paper dress, a person's blood pressure might be tracked over time with the goal
of “heading off problems at the pass.” Outside of health, there has long been the
idea of “smart dust”—large collections of inexpensive sensors that can be
distributed into an area of the world and remotely queried to collect interesting
data. The limitation of these devices has largely been the expense required to
manufacture relatively specialized pieces of equipment. This has been solved by
the commodification of data collection hardware and software (such as the
smartphone) and is now known as the Internet of Things.
16. Why Streaming Data Is Different
There are a number of aspects to streaming data that set it apart from
other kinds of data.
The three most important, covered in this section, are the “always-on”
nature of the data, the loose and changing data structure, and the
challenges presented by high-cardinality dimensions.
All three play a major role in decisions made in the design and
implementation of the various streaming frameworks.
Always On, Always Flowing –
The first is somewhat obvious: streaming data streams. The data is
always available and new data is always being generated. This has a
few effects on the design of any collection and analysis system. First,
the collection itself needs to be very robust. Downtime for the primary
collection system means that data is permanently lost.
17. Why Streaming Data Is Different
Second, the fact that the data is always flowing means that the system
needs to be able to keep up with the data.
If 2 minutes are required to process 1 minute of data, the system will
not be real time for very long. Eventually, the problem will be so bad
that some data will have to be dropped to allow the system to catch
up.
In practice it is not enough to have a system that can merely “keep
up” with data in real time. It needs to be able to process data far more
quickly than real time.
Whether for intentional reasons, such as planned downtime, or because of
catastrophic failures, such as network outages, the system, in whole or
in part, will eventually go down.
18. Why Streaming Data Is Different
Failing to plan for this inevitability and having a system that can only
process at the same speed as events happen means that the system is now
delayed by the amount of data stored at the collectors while the system was
down.
A system that can process 1 hour of data in 1 minute, on the other hand, can
catch up fairly quickly with little need for intervention. A mature
environment with good horizontal scalability can even implement
auto-scaling.
In this setting, as the delay increases, more processing power is temporarily
added to bring the delay back into acceptable limits. On the algorithmic side,
this always-flowing feature of streaming data is a bit of a double-edged
sword.
On the positive side, there is rarely a situation where there is not enough
data. If more data is required for an analysis, simply wait for enough data to
become available. It may require a long wait, but other analyses can be
conducted in the meantime that can provide early indicators of how the later
analysis might proceed.
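The catch-up arithmetic above can be sketched directly. This is an illustrative calculation, not from the slides: the function name and numbers are assumptions, and the model simply treats new data as arriving at one minute of data per minute while the backlog drains.

```python
# Sketch: how quickly a stream processor drains a backlog after downtime.
# All numbers are illustrative assumptions.

def catch_up_time(backlog_minutes, speedup):
    """Minutes needed to drain a backlog when the processor runs
    `speedup` times faster than events arrive (speedup > 1).

    While catching up, new data keeps arriving at 1 minute of data
    per minute, so the backlog shrinks by (speedup - 1) per minute.
    """
    if speedup <= 1:
        raise ValueError("a processor at or below real-time speed never catches up")
    return backlog_minutes / (speedup - 1)

# A system that processes 1 hour of data in 1 minute (speedup = 60)
# recovers from a 2-hour outage in about 2 minutes:
print(catch_up_time(120, 60))  # ≈ 2.03 minutes
```

This also shows why a processor that merely keeps up (speedup close to 1) never recovers: the denominator goes to zero.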
19. Loosely Structured
• Streaming data is often loosely structured compared to many other
datasets. There are several reasons this happens, and although this
loose structure is not unique to streaming data, it seems to be more
common in the streaming setting than in other situations. Part of
the reason seems to be the type of data that is interesting in the
streaming setting.
• Streaming data comes from a variety of sources. Although some of
these sources are rigidly structured, many of them are carrying an
arbitrary data payload. Social media streams, in particular, will be
carrying data about everything from world events to the best slice
of pizza to be found in Brooklyn on a Sunday night. To make
things more difficult, the data is encoded as human language.
20. Loosely Structured
• Another reason is that there is a “kitchen sink” mentality to
streaming data projects. Most of the projects are fairly young and
exploring unknown territory, so it makes sense to toss as many
different dimensions into the data as possible. This is likely to
change over time, so the decision is also made to use a format that
can be easily modified, such as JavaScript Object Notation (JSON).
The general paradigm is to collect as much data as possible in the
event that it is actually interesting.
• Finally, the real-time nature of the data collection also means that
various dimensions may or may not be available at any given time.
For example, a service that converts IP addresses to a geographical
location may be temporarily unavailable. For a batch system this
does not present a problem; the analysis can always be redone later
when the service is once more available. The streaming system, on
the other hand, must be able to deal with changes in the available
dimensions and do the best it can.
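The IP-to-geography example above can be sketched as follows. This is a minimal illustration, not the slides' implementation: the field names and the stubbed lookup service are assumptions. The point is that a streaming system records the event with the dimension marked unknown rather than dropping it or waiting to redo the analysis.

```python
import json

def geo_lookup(ip):
    """Stand-in for an external IP-to-location service that may be down."""
    return None  # pretend the service is temporarily unavailable

def enrich(raw_event):
    """Parse a loosely structured JSON event and fill in what we can."""
    event = json.loads(raw_event)
    ip = event.get("ip")  # the dimension may simply be absent
    event["geo"] = geo_lookup(ip) if ip else None
    if event["geo"] is None:
        # Do the best we can: keep the event, mark the dimension unknown.
        event["geo"] = "unknown"
    return event

print(enrich('{"user": "u1", "ip": "203.0.113.7"}'))
# {'user': 'u1', 'ip': '203.0.113.7', 'geo': 'unknown'}
```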
21. High-Cardinality Storage
• Cardinality refers to the number of unique values a piece of data can
take on. Formally, cardinality refers to the size of a set and can be
applied to the various dimensions of a dataset as well as the entire
dataset itself. This high cardinality often manifests itself in a “long
tail” feature of the data. For a given dimension (or combination of
dimensions) there is a small set of different states that are quite
common, usually accounting for the majority of the observed data,
and then a “long tail” of other data states that comprise a fairly small
fraction.
• This feature is common to both streaming and batch systems, but it is
much harder to deal with high cardinality in the streaming setting. In
the batch setting it is usually possible to perform multiple passes over
the dataset. A first pass over the data is often used to identify
dimensions with high cardinality and compute the states that make up
most of the data. These common states can be treated individually,
and the remaining state is combined into a single “other” state that
can usually be ignored.
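The batch first pass described above can be sketched in a few lines. The sample data and the choice to keep the top two states are illustrative assumptions:

```python
from collections import Counter

# First pass over the (batch) dataset: count every state.
events = ["US", "US", "UK", "US", "DE", "UK", "FR", "US", "JP", "US"]
counts = Counter(events)

# Identify the common states; here we keep the top 2.
common = {state for state, _ in counts.most_common(2)}

# Second pass: fold the long tail into a single "other" state.
collapsed = Counter(e if e in common else "other" for e in events)
print(collapsed)  # Counter({'US': 5, 'other': 3, 'UK': 2})
```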
22. High-Cardinality Storage
• In the streaming setting, the data can usually be processed a single time. If
the common cases are known ahead of time, this can be included in the
processing step. The long tail can also be combined into the “other” state,
and the analysis can proceed as it does in the batch case. If a batch study
has already been performed on an earlier dataset, it can be used to inform
the streaming analysis. However, it is often not known if the common states
for the current data will be the common states for future data. In fact,
changes in the mix of states might actually be the metric of interest. More
commonly, there is no previous data to perform an analysis upon. In this
case, the streaming system must attempt to deal with the data at its natural
cardinality.
• This is difficult both in terms of processing and in terms of storage.
Anything that involves a large number of different states necessarily
takes time to process. It also requires a linear amount of space to store
information about each state, and that space is much more restricted than
in the batch setting because it must generally be very fast main memory
rather than the much slower tertiary storage of hard drives. This has been
relaxed somewhat with the introduction of high-performance Solid State
Drives (SSDs), but they are still orders of magnitude slower than memory
access.
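One standard way a streaming system can track common states in a single pass with bounded memory is a "heavy hitters" summary. The slides do not name a specific algorithm; the Misra-Gries sketch below is one well-known choice, shown here with illustrative data:

```python
# Misra-Gries heavy-hitters sketch: one pass, at most k-1 counters.

def misra_gries(stream, k):
    """Track up to k-1 candidate heavy hitters over a single pass.

    Any state occurring more than len(stream)/k times is guaranteed
    to survive among the candidates (possibly with an undercount).
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Decrement every counter; evict those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = ["a"] * 6 + ["b"] * 3 + ["c", "d", "e"]
print(misra_gries(stream, 3))  # {'a': 3}
```

Note that the surviving count is a lower bound, not an exact frequency; the sketch trades accuracy on the long tail for constant memory.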
23. High-Cardinality Storage
• As a result, a major topic of research in streaming data is how to deal
with high-cardinality data. This course discusses some of the
approaches to dealing with the problem. As an active area of
research, more solutions are being developed and improved every
day.
24. Real-time Processing Architecture
• Real time processing deals with streams of data that are
captured in real-time and processed with minimal latency to
generate real-time (or near-real-time) reports or automated
responses.
• For example, a real-time traffic monitoring solution might
use sensor data to detect high traffic volumes.
• This data could be used to dynamically update a map to show
congestion, or automatically initiate high-occupancy lanes or
other traffic management systems.
25. Real-time Processing Architecture
• Real-time processing is defined as the processing of an
unbounded stream of input data, with very short latency
requirements for processing — measured in milliseconds or
seconds.
• This incoming data typically arrives in an unstructured or
semi-structured format, such as JSON, and has the same
processing requirements as batch processing, but with shorter
turnaround times to support real-time consumption.
• Processed data is often written to an analytical data store,
which is optimized for analytics and visualization.
27. Real-time Processing Architecture
• Real-time ingestion/collection:
• The architecture must include a way to capture and store real-time
messages to be consumed by a stream processing consumer.
• Stream processing:
• After capturing real-time messages, the solution must process
them by filtering, aggregating, and otherwise preparing the data
for analysis.
• Analytical data store/storage:
• Many big data solutions are designed to prepare data for
analysis and then serve the processed data in a structured
format that can be queried using analytical tools.
28. Real-time Processing Architecture
• Analysis and reporting:
• The goal of most big data solutions is to provide insights
into the data through analysis and reporting.
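The four stages above can be sketched as a minimal in-process pipeline, using the traffic-monitoring example from earlier slides. Everything here is a stand-in: a real system would use a message broker for ingestion and a dedicated analytical store, and the event shape and threshold are invented for illustration.

```python
from collections import defaultdict

# 1. Ingestion/collection: raw messages captured from a real-time source.
incoming = [
    {"sensor": "s1", "vehicles": 12},
    {"sensor": "s2", "vehicles": 40},
    {"sensor": "s1", "vehicles": 7},
]

# 3. Analytical data store (stand-in for a real store).
store = defaultdict(int)

# 2. Stream processing: filter malformed readings, aggregate per sensor.
for msg in incoming:
    if msg["vehicles"] >= 0:
        store[msg["sensor"]] += msg["vehicles"]

# 4. Analysis and reporting: query the store for congested sensors.
congested = [s for s, total in store.items() if total > 30]
print(congested)  # ['s2']
```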
29. Features of a Real-Time Architecture
• Three key features allow real-time architectures to operate in a
streaming environment: high availability, low latency, and horizontal
scalability.
• High Availability:
• High availability is a key difference between a real-time streaming
application and a business-intelligence application.
• It is typically achieved through distribution and replication.
• Both the service and the data must remain available.
• Replication protects against data becoming unavailable when part
of the system fails.
30. Features of a Real-Time Architecture
• Low Latency:
• There are only 86,400 seconds in a day, so the less time it takes
to handle a single request, the more requests can be handled.
• In a real-time streaming application, low latency means something a
bit different. Rather than referring to the return time for a single
request, it refers to the amount of time between an event occurring
somewhere at the “edge” of the system and it being made available
to the processing and delivery frameworks.
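This streaming notion of latency can be sketched as a simple edge-to-availability measurement. The timestamps and the 150 ms delay below are illustrative assumptions:

```python
import time

def ingest_latency(event_time, available_time):
    """Seconds between an event occurring at the edge and it becoming
    available to the processing framework."""
    return available_time - event_time

event_time = time.time()            # stamped at the edge when the event occurs
# ... event travels through collectors, queues, etc. ...
available_time = event_time + 0.150  # pretend it became available 150 ms later

print(f"{ingest_latency(event_time, available_time) * 1000:.0f} ms")  # 150 ms
```

In practice this requires the edge and the processing framework to share reasonably synchronized clocks, which is itself a design consideration.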
31. Features of a Real-Time Architecture
• Horizontal Scalability:
• Horizontal scaling refers to the practice of adding more
physically separate servers to a cluster of servers handling a
specific problem to improve performance.
• The opposite is vertical scaling, which is adding more resources
to a single server to improve performance.
• Horizontal scaling works best when the servers require limited
coordination, which is often achieved through partitioning
techniques.
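The simplest partitioning technique is to hash a key to pick one of N servers, so each request is routed deterministically with no coordination between nodes. The server names below are illustrative:

```python
import hashlib

SERVERS = ["node-0", "node-1", "node-2"]

def partition(key, servers):
    """Route a key deterministically to one server in the cluster."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]

# The same key always lands on the same node, so no coordination is needed:
assert partition("user-42", SERVERS) == partition("user-42", SERVERS)
print(partition("user-42", SERVERS))
```

A production system would more likely use consistent hashing, so that adding or removing a node reshuffles only a fraction of the keys instead of nearly all of them.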
32. Languages for Real-Time Programming
• Java
• “Write Once, Run Anywhere”
• Scala
• Runs on the Java Virtual Machine.
• One of the most popular languages used in real-time application
development.
• JavaScript
• It is built into every web browser.
• It has many features of a functional language.
33. Understanding MapReduce Failure for Streaming Data
MapReduce is good at processing large amounts of data in parallel.
Real-time processing, however, demands very low latency.
Data collected from multiple sources may not have
all arrived at the point of aggregation.
In the standard model of Map/Reduce, the reduce
phase cannot start until the map phase is completed,
and all the intermediate data is persisted to disk
before being downloaded by the reducer. All of this adds
significant latency to the processing.
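The contrast above can be sketched in a few lines. Instead of a map phase that must finish before a reduce phase begins, a streaming aggregate is updated incrementally as each event arrives, so a current answer is always available. The event names are illustrative:

```python
from collections import Counter

counts = Counter()

def on_event(word):
    """Update the running aggregate immediately; there is no reduce
    barrier and no intermediate data persisted to disk."""
    counts[word] += 1
    return counts[word]  # the current answer is available right away

for w in ["error", "ok", "error", "ok", "error"]:
    on_event(w)

print(counts["error"])  # 3
```

The trade-off is that the streaming aggregate reflects only the events seen so far, whereas the batch reduce sees a complete, settled dataset.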