HIA: Autonomic and General Collecting of Real-time Data
Streams
Changfeng Chen, Xiang Wang1.
Abstract. Power manufacturing enterprises process events generated by hundreds of devices interacting with
their plants. Rich statistical data distilled from combining such interactions in real-time can generate
much business value. In this paper, we describe the architecture of HIA, a system for ingesting multiple
geographically distributed data sources in real-time with high scalability and low latency, where the data
streams may be real-time or batch. The system self-manages infrastructure degradation
without any manual intervention. HIA guarantees that no data is lost in the ingested output
at any point in time, and that most ingested events are present in the output in soft real-time. Our
product deployment ingests millions of events per second at peak with an average end-to-end latency of
less than 1 second. We also present challenges and solutions in maintaining large persistent ingesting
state across geographically distant locations, and highlight the design principles that emerged from our
experience.
Keywords: Event Streams Collecting, Autonomic Computing, I/O Driver
1 Introduction
Collecting real-time streams has received considerable attention in the past three decades due to its
importance in numerous applications (e.g. device monitoring[16]), where information is generated in the
form of very high-speed streams that must be processed online to enable real-time[6] response.
With the explosive growth of power manufacturing enterprises in the last several years, the
need for such technologies has grown many times over, as energy enterprises must process events
generated by millions of devices interacting with their plants. Devices all over the enterprise measure
the manufacturing process on a daily basis, reporting power production, gas emissions, and material costs.
Distilling rich statistical data from plant device interactions in real-time has a huge impact on
business processes. The data enables decision makers to fine-tune production, budgets, campaigns, and
supervision, and to vary parameters in real-time to suit a changing market environment. It provides
immediate feedback to decision makers on the effectiveness of their changes, and also allows the
enterprise to optimize the budget spent on each power plant on a continuous basis.
To provide real-time statistical data, we built a system, called the HighSoon Interface Adapter or HIA,
that can ingest events[6,16] generated by distributed plant devices. HIA ingests multiple real-time
continuous streams of events.
Let us illustrate the operational steps in HIA with an example of ingesting a CO gas event:
1 Xiang Wang
China Real-time Database Co. LTD, Software Avenue 180, 210000 Nanjing, China
e-mail: wangxiang2@sgepri.sgcc.com.cn
• When a plant device records a measurement, it forms an event. The Pull driver collects the event
and forwards it to the HIA server.
• The HIA server routes the event to the destination Push driver.
• The Push driver pushes the event to the backing store or destination.
HIA ingests input event streams from plant devices and writes collected events to backing stores.
Collected events are used to generate rich statistical data.
Fig. 1.1 Collecting an event in HIA: a plant device emits events of the form {time_stamp, value, quality_code} along a timeline t1 … tn, which HIA ingests.
1.1 Problem Statement
Formally, we express the data ingestion task in the PullPush model. PullPush is a programming model
similar to MapReduce[4], with an associated implementation, for ingesting large numbers of data events.
Programmers specify a pull function that polls events from a data device source to generate a set of
intermediate events grouped by channel, and a push function that merges all intermediate events
associated with the same channel key and writes the collected events to backing stores.
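The sketch below illustrates what the two PullPush functions might look like as a C++ interface. The type and method names are our own illustrative assumptions; the paper does not show the framework's actual declarations.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// An ingested event as described in Section 2: a timestamp, a value, and a quality code.
struct Event {
    int64_t  time_stamp;
    double   value;
    uint32_t quality_code;
};

// Intermediate events are grouped by channel key before they are pushed.
struct ChannelBatch {
    std::string        channel_key;
    std::vector<Event> events;
};

// The two functions a programmer specifies in the PullPush model.
class PullPushDriver {
public:
    virtual ~PullPushDriver() = default;
    // pull: poll events from a plant data device and emit intermediate events
    // grouped by channel.
    virtual std::vector<ChannelBatch> pull() = 0;
    // push: merge intermediate events sharing the same channel key and write
    // the collected events to a backing store.
    virtual void push(const ChannelBatch& batch) = 0;
};
```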
The main goal of HIA is to ingest continuous streams in real-time. We use this framework to ingest
many different event streams in our products.
1.2 System Challenges
While building HIA to ingest continuous data streams, we face these challenges:
• Consistent semantics: The output produced by HIA is used for operating production, reporting
revenue to senior officers in the enterprise, and so on. HIA must guarantee that no input event is
ever lost, since lost events would misrepresent the real world.
• Automatic system: Service interruptions can last from a couple of minutes to a few days or even
weeks. There are hundreds of plant devices, and operating the whole system manually imposes significant
overhead. The volume and complexity might be managed by highly skilled humans, but the demand for skilled IT
personnel already outstrips supply. The system should make decisions on its own, using
high-level policies; it should constantly check and optimize its status and automatically adapt itself to
changing conditions. An autonomic computing framework is the solution.
• Multiple sources and different formats: How do we ingest data from multiple sources and in different
formats in an efficient manner?
• Multiple destinations: Should all the raw data be ingested into only a single data store?
• High scalability: Not only does HIA need to handle millions of events per second today, it must also
be able to handle the ever-increasing number of events in the future.
• High extensibility: How do we support owners in developing custom drivers for specific data sources?
• Low latency: The output of HIA is used to compute statistics for decision makers on how their
manufacturing processes are performing, detecting overhead costs, optimizing budgets, etc. Having
HIA perform the collection within a fraction of a second significantly improves the effectiveness of these
business processes.
• Just-in-Time transformation: Should preprocessing of the data, for example checking / cleansing /
validation, always be done before ingesting it into the data store?
• Hybrid collecting mode: What are the essential tools and frameworks required in the ingestion layer to
handle data in both real-time and batch mode?
1.3 Our Contributions
The key contributions from this paper are:
• To the best of our knowledge, this is the first paper to formulate and solve the problem of ingesting
multiple streams[1] continuously under these system constraints: consistent semantics, autonomic
system, high scalability, low latency, multiple sources and different formats, multiple destinations,
Just-in-Time transformation, and hybrid collecting mode.
• The solutions detailed in this paper have been fully implemented and deployed in a live production
environment for several months. Based on this real-world experience, we present detailed performance results,
design trade-offs, and key design lessons from our deployment.
The rest of this paper is organized as follows. Section 2 describes the data model in more detail.
Section 3 presents the detailed design and architecture of the HIA system. Section 4 describes our
product deployment settings and measurements collected from the running system. Sections 5 and 6
describe related research, and summarize future work and conclusions.
2 Data Model
A HIA system is a set of processes. Each system serves a set of channels. A channel[12,14] in HIA is a
sparse map and an abstraction, analogous to a process in an operating system, for transferring data between
backing stores. Channels are dynamic entities that usually have a limited life span within the HIA
system.
A HIA relation is an information container structured as a sequence of events, together with a mapping
between a measure point in the plant device and a destination measure point in the backing store. A
relation is analogous to a file in a UNIX-style operating system[2][10]. Data is organized along two
dimensions: relations and timestamps.
(relation: string, time: int64) --> value
We ingest the raw data identified by a particular relation key and timestamp. Rows (relations) are
grouped together to form the unit of load balancing, called frames, and columns (timestamps) are
grouped together to form the unit of transfer, called batches.
{t_1, value_11, quality_11}  {t_2, value_12, quality_12}  …  {t_m, value_1m, quality_1m}
{t_1, value_21, quality_21}  {t_2, value_22, quality_22}  …  {t_m, value_2m, quality_2m}
…
{t_1, value_n1, quality_n1}  {t_2, value_n2, quality_n2}  …  {t_m, value_nm, quality_nm}
Fig. 2.1 A slice of an example channel representing the collected data. Each row name is associated with a measure point;
each column name is a sampling timestamp.
Rows. HIA processes data in ascending order of row key. The row keys in a channel are arbitrary strings
(currently up to 256 bytes in size).
Timestamps. Each relation in a channel can contain a sequence of events, where the events are
indexed by timestamp. HIA timestamps are pairs of 32-bit integers. They are assigned explicitly by the plant
device and represent real time in microseconds combined with the Julian day. Plant devices that
need to avoid collisions must generate unique timestamps themselves. The events of a relation are
stored in decreasing timestamp order, so that the most recent events can be read first.
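As a concrete illustration, the two-integer timestamp described above might be represented as follows. The field layout is an assumption on our part, since the paper only states that a timestamp consists of two 32-bit integers.

```cpp
#include <cstdint>

// Hypothetical layout of an HIA timestamp: a Julian day and microseconds within
// that day, each a 32-bit integer assigned by the plant device.
struct HiaTimestamp {
    uint32_t julian_day;
    uint32_t microseconds;
};

// Events of a relation are kept in decreasing timestamp order, so a comparison
// like this would place the most recent event first.
inline bool newer_first(const HiaTimestamp& a, const HiaTimestamp& b) {
    return (a.julian_day != b.julian_day) ? a.julian_day > b.julian_day
                                          : a.microseconds > b.microseconds;
}
```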
3 Design Overview
We now describe the architecture of a pipeline at a single instance, as shown in Figure 3.1. In the
following examples, we use current events as inputs, but the architecture applies to any similar event
streams.
There are three major components in a single instance: the I/O driver programs, which read and write
current events continuously and feed them to the HIA server; the kernel service, which orchestrates the
global operation of the system; and the streaming processor services, which transfer event data.
3.1 Assumptions
In designing a collecting system for the market's needs, we have been guided by assumptions that offer
both challenges and opportunities. We alluded to some key observations earlier and now lay out our
assumptions in more detail [5].
• The system is built on a technical environment of limited reliability whose components often fail. It must
constantly monitor itself and detect, tolerate, and promptly recover from component failures on a routine basis.
• The system ingests on the order of a million data points. We expect a few million points, each sampled
with a period of typically 3 seconds.
• The workloads primarily consist of two kinds of ingestion: real-time streaming collection and batch
collection.
3.2 Interface
HIA provides a simple collection interface. We support the usual operations to get_snapshot and
get_archive_value events.
Moreover, HIA has device-connection maintenance operations. A plant device connection creates a link
between an I/O driver and a plant data device.
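A minimal sketch of what this collection interface could look like in C++ is shown below. The signatures and parameter names are assumptions, as the paper only names the operations.

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct Event { int64_t time_stamp; double value; uint32_t quality_code; };

// Hypothetical C++ form of the collection interface named in the text.
class CollectionInterface {
public:
    virtual ~CollectionInterface() = default;
    // Create a link between an I/O driver and a plant data device.
    virtual bool connect(const std::string& device_address) = 0;
    // Latest value of a single measure point (relation).
    virtual Event get_snapshot(const std::string& relation) = 0;
    // Historical values of a measure point over a time range.
    virtual std::vector<Event> get_archive_value(const std::string& relation,
                                                 int64_t start_time,
                                                 int64_t end_time) = 0;
};
```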
3.3 Architecture
A HIA system instance consists of a single server with a kernel service and multiple streaming
processor services, and is accessed by multiple I/O driver programs, as shown in Figure 3.1. Both the HIA
server and the HIA I/O drivers are user-level processes on a commodity Windows NT machine.
The kernel maintains all relation system metadata. This includes the namespace, the mapping from a
source measure point to a destination measure point, and the channel ownership of relations. It also controls
system-wide activities such as loading I/O driver programs[15], linking channels, keeping track of the
channel list[8,13], keeping track of the I/O driver programs[3], watching the production
environment[3], and scanning the directory of drivers[2]. The HIA kernel periodically communicates with
each driver via heartbeat messages[5] to give it instructions and collect its state.
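As an illustration, a heartbeat exchange could carry payloads of roughly the following shape. The fields are hypothetical, since the paper only says that heartbeats deliver instructions and collect driver state.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical instruction set the kernel could send to a driver.
enum class DriverInstruction { None, StartChannel, StopChannel, ReloadConfig };

// Heartbeat from the kernel to a driver: an instruction and its target channel.
struct HeartbeatRequest {
    DriverInstruction instruction;
    std::string       channel_id;
};

// Heartbeat reply from the driver: its identity and current state.
struct HeartbeatResponse {
    std::string              driver_id;
    std::vector<std::string> active_channels;   // channels currently served
    uint64_t                 events_collected;  // running counter reported to the kernel
};
```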
The programmer writes code that implements the collecting interfaces of the PullPush specification,
customized for the actual data source, and produces a dynamic library. The user's code is linked together
with the PullPush library (the HIA driver framework, implemented in C++). The HIA client, as a C++
framework, then invokes the PullPush functions in a user-level process. The HIA driver program
communicates with the HIA kernel and the HIA streaming processor to pull or push data on behalf of the
data device. Drivers interact with the kernel for metadata operations, but all real-time and batch data
communication goes directly to the Streaming Processor.
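The following sketch shows how such a custom driver might be packaged as a dynamic library. The exported factory name and the fabricated CO gas reading are purely illustrative assumptions, as the paper does not specify the framework's entry points.

```cpp
#include <string>
#include <vector>

// Types repeated from the PullPush sketch in Section 1.1 so this unit is self-contained.
struct Event { long long time_stamp; double value; unsigned quality_code; };
struct ChannelBatch { std::string channel_key; std::vector<Event> events; };
class PullPushDriver {
public:
    virtual ~PullPushDriver() = default;
    virtual std::vector<ChannelBatch> pull() = 0;
    virtual void push(const ChannelBatch& batch) = 0;
};

// A hypothetical driver for the CO gas example from Section 1.
class CoGasDriver : public PullPushDriver {
public:
    std::vector<ChannelBatch> pull() override {
        // A real driver would poll the device; one fabricated reading is returned
        // here purely for illustration.
        Event e{0, 0.0, 0u};
        return { ChannelBatch{ "co_gas_channel", { e } } };
    }
    void push(const ChannelBatch& batch) override {
        // A real driver would write batch.events to the backing store here.
        (void)batch;
    }
};

// Exported entry point a loader could resolve after LoadLibrary/dlopen.
extern "C" PullPushDriver* create_driver() { return new CoGasDriver(); }
```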
The driver does not cache event data: a driver cache would offer little benefit while increasing the
complexity of the driver. The Streaming Processor, however, must cache event data because the backing store
may be offline.
Fig. 3.1 HIA Architecture
3.4 Single Kernel
Having a single kernel, or server, vastly simplifies our design and enables the kernel to make
sophisticated decisions about running channels using global knowledge. The HIA system provides an
execution environment in which channels run. The task of creating and eliminating channels is
delegated to a group of routines in the kernel. The kernel itself is not a channel but a channel manager.
To let the kernel manage channels, each channel is represented by a channel descriptor that includes
information about the current state of the channel. When the kernel stops the execution of a channel, it
saves the channel's current state in the channel descriptor. When the kernel decides to resume executing a
channel, it uses the channel descriptor fields to load the Streaming Processor's registers.
Because the stored value of the program counter points to the message following the last message
collected, the channel resumes execution at the point where it was stopped.
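A channel descriptor of this kind might be sketched as follows. The field names are assumptions, since the paper only states that the descriptor records the channel's current state, including a program counter pointing past the last collected message.

```cpp
#include <cstdint>
#include <string>

enum class ChannelState { Created, Running, Stopped };

// Hypothetical channel descriptor: the state the kernel saves when it stops a
// channel and reloads when it resumes one.
struct ChannelDescriptor {
    std::string  channel_id;       // identifies the channel
    ChannelState state;            // current execution state
    uint64_t     program_counter;  // position just past the last message collected
};
```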
The kernel interacts with plant’s I/O devices (e.g. PI system[11]) by means of device drivers.
3.5 Channel Size
Channel size is one of the key design parameters. We have chosen 40,000 relations per channel, which
is larger than a typical plant needs. The number of channels can be up to 256 in our system. In other
words, our ingesting system can collect roughly 10 million measure points in real-time (40,000 relations
per channel × 256 channels ≈ 10.24 million).
A suitable channel size offers several important advantages. First, it reduces the drivers' need to interact
with the kernel, because pulls and pushes on the same channel require only an initial request to the kernel
for channel information. Second, it reduces the administrator's need to interact with the HIA system,
because a suitable channel can cover all measure points in a plant device. Third, because a driver is likely
to perform many operations on a large channel, it can reduce network overhead by keeping a persistent
TCP connection to the Streaming Processor over an extended period of time.
3.6 Metadata
The kernel stores three major types of metadata [1, 5, 16]: the lists of relations, channels, and drivers. All
metadata is maintained by the kernel's tracker service.
3.7 Consistency Model
HIA has a relaxed consistency model that supports our high-performance ingesting well but remains
relatively simple and efficient to implement.
Every channel acquires resources in isolation; there are no dependencies between channels at runtime.
Metadata updates are marked with versions, which avoids conflicts [6,15].
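One simple way to realize version-marked metadata updates is optimistic compare-and-set, as sketched below. This is our own illustration of the idea rather than the mechanism the paper implements.

```cpp
#include <cstdint>
#include <map>
#include <string>

struct MetadataEntry {
    uint64_t    version = 0;
    std::string value;
};

// Apply an update only if the caller saw the current version; otherwise report
// a conflict so the caller can re-read and retry.
bool update_metadata(std::map<std::string, MetadataEntry>& store,
                     const std::string& key,
                     uint64_t expected_version,
                     const std::string& new_value) {
    MetadataEntry& entry = store[key];
    if (entry.version != expected_version) {
        return false;  // conflicting concurrent update
    }
    entry.value = new_value;
    entry.version += 1;
    return true;
}
```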
4 Performance Results
In this section we present a few micro-benchmarks to illustrate the bottlenecks inherent in the HIA
architecture and implementation, as well as some numbers from real clusters in use at State Power
Investment Corporation.
We measured performance on a HIA instance consisting of one HIA server and 16 plant devices.
Note that this configuration was set up for ease of testing; typical instances have hundreds of plant data
devices.
All the machines are configured with dual 2.0 GHz Intel Core i5 processors, 4 GB of memory, two
320 GB 7200 rpm disks, and a 1 Gbps full-duplex Ethernet connection.
Fig. 4.1 Event throughput with 64 channels: per-channel retrieval throughput (y-axis 0–350,000) and the average throughput.
Fig. 4.2 Event throughput with 128 channels: per-channel retrieval throughput (y-axis 0–350,000) and the average throughput.
Fig. 4.3 Event throughput with 256 channels: per-channel retrieval throughput (y-axis 0–500,000) and the average throughput.
We can see that peak event throughput occurs with a single channel, and that the average per-channel
throughput decreases as the number of channels increases. Based on the average throughput, the total
event throughput can exceed 20 million events per second.
5 Related Work
Like other ingest systems such as Kafka[7][9] and the PI ICU[11], HIA provides an abstraction that enables
data to be moved between backing stores.
Unlike Kafka, HIA spreads collected data across backing-store servers in an internet-style distributed
fashion, provides a more precise operating model for controlling data flow, and achieves high efficiency
because the HIA system is implemented entirely in C++. In contrast to systems like Kafka, HIA does not
provide a large-cluster architecture, since HIA's technical environment is resource-constrained.
Compared to systems like PI, HIA provides a more flexible ingesting pattern through the PullPush model.
6 Conclusions
The HIA system demonstrates the qualities essential for supporting large-scale data ingesting workloads in
a loosely coupled environment. While some design decisions are specific to our unique setting, many may
apply to data ingesting tasks of a similar magnitude and cost consciousness. We started by reexamining
popular ingesting-system assumptions in light of our current and anticipated application workloads
and technological environment. Our observations have led to radically different points in the design
space. We treat component failures as the norm rather than the exception, using the autonomic computing
paradigm. Our system provides fault tolerance through constant monitoring and fast, automatic recovery.
Our design delivers high aggregate throughput to many concurrent channels performing a variety of
collecting tasks. We achieve this by separating channel control, which passes through the master (the
kernel), from data transfer, which passes directly between Streaming Processors and drivers. Kernel
involvement in common operations is minimized by a suitable channel size. This makes possible a simple,
centralized kernel that does not become a bottleneck. HIA has successfully met our ingestion needs and is widely
used as the ingesting platform for production data processing. It is an important tool that enables us to
continue integrating with other high-level applications.
References
1. Ananthanarayanan R, Basker V, Das S et al (2013) Photon: Fault-tolerant and Scalable Joining of Continuous Data
Streams. SIGMOD’13, New York
2. Bovet DP, Cesati M (2005) Understanding the Linux Kernel. 3rd edn. O’Reilly, New York
3. Curry E, Grace P (2008) Flexible Self-Management Using the Model-View-Controller Pattern. IEEE Software, vol. 25,
no.3, pp. 84-90
4. Dean J, Ghemawat S (2005) MapReduce: Simplified Data Processing on Large Clusters. http://labs.google.com/papers/
mapreduce.html
5. Ghemawat S, Gobioff H, Leung ST (2003) The Google file system. In 19th Symposium on Operating Systems
Principles, pages 29–43, Lake George, New York
6. Gill CD (2002) Flexible Scheduling in Middleware for Distributed Rate-Based Real-Time Applications. Doctoral
Dissertation
7. Goodhope K, Koshy, Kreps J et al (2012) Building LinkedIn’s Real-time Activity Data Pipeline. Bulletin of the IEEE
Computer Society Technical Committee on Data Engineering
8. Kephart JO, Chess DM (2003) The vision of autonomic computing. Computer, 36(1):41–52
9. Kreps J, Narkhede N, Rao J (2011) Kafka: a Distributed Messaging System for Log Processing. http://kafka.apache.org
/documentation.html
10. Mauerer W (2008) Professional Linux Kernel Architecture. Wrox, New York
11. PI System. http://www.osisoft.com/
12. Schmidt DC (1996) Applying a Pattern Language to Develop Application-level Gateways. Theory and Practice of
Object Systems, Wiley & Sons, Vol. 2, No. 1
13. Schmidt DC (2012) Reactor: An Object Behavioral Pattern for Demultiplexing and Dispatching Handles for
Synchronous Events. http://www.yeolar.com/note/2012/12/09/reactor/
14. Schmidt DC, Pyarali I (2010) The Design and Implementation of the Reactor: An Object-Oriented Framework for
Event Demultiplexing. http://www.dre.vanderbilt.edu/~schmidt/PDF/Reactor2-93.pdf
15. Schmidt DC, Suda T (1994) The Service Configurator Framework: An Extensible Architecture for Dynamically
Configuring Concurrent, Multi-Service Network Daemons. Proceedings of the IEEE Second International Workshop
on Configurable Distributed Systems, Pittsburgh, PA
16. Schmidt DC, Suda T (1996) The Performance of Alternative Threading Architectures for Parallel Communication
Subsystems. Journal of Parallel and Distributed Computing